Friday, April 27, 2012

EE 547B HW2


3/14/12 (Solution)
Problem 1: Hyper-Cube Interconnection Network (6 points)

Answer the following questions based on the Hyper-Cube interconnection network

1)      What is the percentage of links in an 8-cube used to implement an equivalent 2-D mesh (16x16) interconnection network? (3 points)

There are totally 8*256/2 links.
There are 15*16*2 links of a 16X16 mesh.
Percentage =(15*16*2)/(4*256)=15/32=46.875%




2)      If the links are duplex (i.e., 1->3, and 3->1 can be done concurrently), and each node cannot receive more than 1 message in a step (i.e., 1->3 and 2->3 cannot be done concurrently). Show how to use 3 steps to complete the following communication permutation. (3 points)

Sender -> Receiver
Step 1
Step 2
Step 3
0->1
0->1


1->2
1->0
0->2

2->3
2->3


3->4
3->2
2->0
0->4
4->5
4->5


5->6
5->4
4->6

6->7
6->7


7->0
7->6
6->4
4->0


Problem 2: Omega Network (4 points)

Given a multiprocessor with 4096 nodes, an Omega Network is used for the interconnection of the multiprocessor. 16X16 switches are used.

1)      How many switches are required for this Omega network? (2 points)


Log16(4096)=3 (stages)
            4096/16=256
           
            Totally 256*3=768


2)      Show how Node 1073 can send a message to Node 2055. (Hint: You may use Hexadecimal representation of the Node IDs.) (2 points)

1073= 0100 0011 0001two= 411 Hex, 2055= 1000 0000 0111 two= 807 Hex

                        EX                   SH                   EX                   SH                   EX
             431     ->         438      ->         384      ->         380      ->         803      ->         807




Problem 3: Synchronization (5 points)

The following C program can be parallelized.

C programming

int A[5];
int index=0;
while(index != 4) {
index=index+1;
A[index-1]= A[index];
      }

Assume that the initial values of the shared variable array are A[5]={1, 2, 3, 4, 5}. With a single processor, the resulting values of the array after executing the program will become A[5]= {2, 3, 4, 5, 5}.


1)      Give an example to show that the above code results in a racing if two processors are involved. (2 points)
See the following events:

The key problem is that the index increments and the A matrix updates could be performed in any order.
For instance, if P0 reads index=0, then writes index=1. Immediately following that, P1 reads index=1 and writes index=2. Then, if P1 performs A[1]=A[2] before P0 performs A[0]=A[1], then, the resulting A[0]=3 which is not the same as it from sequential execution. Thus, there is a racing problem.


                                   
2)      The implementation of the program using assembly is shown below:

Assume that the base address of array A is in register R1, so, A[0], A[1], …, A[4] can be accessed by 0(R1),  4(R1), … 16(R1) respectively since each integer number is 4 bytes long. Also, assume that the variable index is stored at 0(R2). The initial value of 0(R2) is 0 since index is initialized to 0.

                  li          R0, #4             # load immediate R0=4
Loop:   lw        R3, 0(R2)        # R3=index
            beq      R3, R0, Exit    # if(index= =4) goto Exit
            addi     R3, R3, 1         # R3=index+1
            sw        R3, 0(R2)        # index= index+1;
            mult     R4, R3, 4         # R4=4*R3. Note that each integer number is
#4 bytes long!!
            add      R5, R1, R4      # R5=R1+R4, i.e., R5 is the address of A[index]
            lw        R3, 0(R5)        # R3=A[index]
            sw        R3, -4 (R5)      # A[index-1] =A[index];
            j           Loop                # goto Loop   
Exit:

How do you resolve the problem using the atomic exchange, ll & sc so the resulting values of the array will be the same as the execution on a single processor? (3 points)

A possible solution is.
li R6, #4                       # load immediate R6=4
Loop:   ll R3, 0(R2)                  # R3=index
beq R3, R6, Exit         # if(index= =4) goto Exit
addi R3, R3, 1             # R3=index+ 1
mult R4, R3, 4             # R4=4*R3. Note that each integer number is 4 bytes long!!
add R5, R1, R4           # R5=R1 +R4, i.e., R5 is the address of A[index]
lw R4, 0(R5)                # R4=A[index] R3 buffers the value of A[index] before it
                                    # could be probably updated by the other processor.
sc R3, 0 (R2)              # A[index-1] =A[index]; Only one successful write will occur.
beqz R3, Loop             # if R3 is 0, index has been modified by the other processor.
# Then, abort array update operation
sw R4,-4(R5)              # A[index-1] = A[index];
j Loop                          # next iteration

            The trick here is that the shared variable, index, is not updated until A[index] is loaded. Thus, given the previous example, if P0 reads index=0, then neither of the processors will be able to read index=1 before A[1] is buffered by R4.
Thus, the right value of A[1] is preserved by P0 even if A[1]=A[2] is performed before A[0]=A[1]! (Note that the access order of these two instructions is not guaranteed.



No comments:

Post a Comment