### Problem 1. (12 points): Consider the source code below, where M and N are constants declared with #define. ``` int mat1[M][N]; int mat2[N][M]; int sum_element(int i, int j) { return mat1[i][j] + mat2[i][j]; } ``` A. Suppose the above code generates the following assembly code: ``` sum_element: pushl %ebp mov1 %esp,%ebp mov1 8(%ebp),%eax mov1 12(%ebp),%ecx sall $2,%ecx leal 0(,%eax,8),%edx subl %eax,%edx leal (%eax,%eax,4),%eax mov1 mat2(%ecx,%eax,4),%eax addl mat1(%ecx,%edx,4),%eax mov1 %ebp,%esp popl %ebp ret ``` What are the values of M and N? $$M = 5$$ $$N = 7$$ #### Problem 2. (15 points): Consider the following C declaration, assuming a 32 bit machine (1 byte size for char, 4 byte size for int, 8 bytes for double): ``` struct Node{ char c; double d; struct Node* n; int i; struct Node* l; struct Node* r; }; typedef struct Node* pNode; /* NodeTree is an array of N pointers to Node structs */ pNode NodeTree[N]; ``` A. Using the template below (allowing a maximum of 32 bytes), indicate the allocation of data for a Node struct. Mark off and label the areas for each individual element (there are 6 of them) using the letter of the variable to indicate the byte positions spanned by the element (for example, for element "int i" you would have iiii across four spaces). Indicate with an "x" the parts that are allocated to the struct, but not used for storing meaningful data (i.e. space within the struct allocated to satisfy alignment). Assume the Linux alignment rules discussed in class (1 byte alignment for char, 4 byte alignment for int, double, and pointers). Clearly indicate the right hand boundary of the data structure with a vertical line. B. For each of the four C references below, please indicate which assembly code section (labeled A-F) places the value of that C reference into register eax. If no match is found, please write "NONE" next to the C reference. The initial register-to-variable mapping for each assembly code section is: %eax = starting address of the NodeTree array %edx = i # Problem 3. (12 points): Assume the following sizes for this problem: double: 8 bytes int/unsigned: 4 bytes short: 2 bytes char: 1 byte ## Consider the following C declaration: union Uni { char c[20]; double d[2]; short s[3]; unsigned u; int i; } unil; A. How much memory space in bytes would need to be allocated for uni1? since cherc (20): 20 bytes double of [2] 16 bytes short s[3] objects. U: abytes, i 4 bytes. B. Assume uni1.d[0] is initialized to -5.5, and then the value of uni1.s[0] is updated from 0xC0B0 to 0x40B0. What is the value of uni1.d[0] after the update, assuming a big-endian representation? sine d co] nos -55 = Mx = (1) E=2. M= 11/8 for \$ S[0]. OXCOBO = OXAUBO. There fore, the first byte charge from co > 40. Thus the only differen is Page 5 of 16 the first bit, 00 Thus d [0] = 5.5. C. Which of the following functions (assert1, assert2, assert3) always returns true (i.e. never returns 0)? Write the name of the function that satisfies this condition in the blank below, or write "none" if you believe that none of these functions always returns true. assert 1 ``` short assert1(Uni uni, int num) { if (uni.i == num) { return ((int) uni.u) == num; } } short assert2(Uni uni, short sh) { if (uni.s[0] == sh) { return ((short) uni.i) == sh; } } short assert3(Uni uni, double dbl) { if (uni.d[0] == dbl) { return ((double) uni.i) == dbl; } } ``` ## Problem 4. (13 points): Consider the following C code involving function pointers: ``` #include <stdio.h> int (*fp_a)(int); int (*fp_b)(int); int rand(void); int execute_funcs(int arg) { int temp; ful 2 . temp = for (arg); return fp_b(temp); int func1(int i) { return i + 1; int func2(int i) { return i * 2; void create_func_pointers(int test) { if (test >= 0 ) fp_a = func1; else fp_a = func2; fp_b = func1; void main() { int a, b; a = rand(); b = 2; create_func_pointers(a); execute_funcs(b); ``` temp=[2+] vetum temp+1 fix2)+1 vetumm temp+ Sime fp-a and fpb many be identical, therefore, in create tune pointer memony aliasing can occur. () B. What value would be returned from execute\_funcs if no memory aliasing has occurred? If no menery alasky occured, the points to fune 2, and to-a points to fune or fune 2 depeads on the & value of a, so compiler will return (it)+1 for a >0 C. What value would be returned from execute\_funcs if memory aliasing has occurred? and (1x2)+1 for aco If memory alicering occurred, fp-b and flo will alway point to fance 1 Verturn {+2 os always, Even though the correct consider the following re-implementation of execute funcs: may be (1x2) +1 int execute\_funcs(int arg) { int temp; temp = funcl(arg); return func2 (temp); A. Explain how memory aliasing can occur in execute\_funcs(). D. Would this new implementation of execute\_funcs allow a greater degree of compiler optimization, a lesser degree of compiler optimization, or would this update have no effect on the compiler's ability to optimize the code? The update will lead to a greater degree of compiler uptimization since the function calls are fixed, and compiler can simply return a (erg +1) ×2. ## **Performance Optimization** # Problem 5. (14 points): The following problem concerns optimizing a procedure for maximum performance on an Intel Pentium III. The following are the performance characteristics of the functional units for this machine across various operations: | Operation | Latency | Issue Time | | | | |---------------------------|---------|------------|--|--|--| | Shifts | 1 | 1 | | | | | Integer Add | 1 | 1 | | | | | Integer Multiply | 4 | 1 | | | | | Integer Divide | 36 | 36 | | | | | Floating Point Add | 3 | 1 | | | | | Floating Point Multiply | 5 | 2 | | | | | Floating Point Divide | 38 | 38 | | | | | Load or Store (Cache Hit) | 1 | 1 | | | | You've just joined a programming team that is trying to develop the world's fastest factorial routine. Starting with recursive factorial, they've converted the code to use iteration: ``` int fact(int n) { int i; int result = 1; for (i = n; i > 0; i--) result = result * i; return result; } ``` By doing so, they have reduced the number of cycles per element (CPE) for the function from around 63 to around 4 (really!). Still, they would like to do better. A. Explain why the iterative version of fact would perform so much better than a recursive version. int fact (int n) I if (n < 1) return 1 2 return 1 return 2 return 2 return 2 return 2 return 3 return 3 return 4 return 3 return 4 return 4 return 4 return 4 return 6 return 2 return 4 return 6 return 1 retu One of the programmers heard about loop unrolling. He generated the following code: ``` int fact_u2(int n) { int i; int result = 1; for (i = n; i > 0; i-=2) { result = (result * i) * (i-1); } return result; } ``` Unfortunately, the team has discovered that this code returns 0 for some values of argument n. B. For what values of n will fact\_u2 and fact return different values? When n is odd, since it will be decrement by 2 Every fine, So when n is reduced to 1, it will still be passed C. Show how to fix fact\_u2 so that its behavior is identical to fact. Your update must only change a single character of the original code. ford(i = n; i7); i-=2) A change the dior (int D. Suppose it could be assumed that the value of the parameter n would always be greater than I, and that the function would be extremely heavily used to the point where even saving a couple of loop iterations and/or cycles of latency would help overall performance. A new version of the "fact" function below, fact\_ng1(int n), attempts to take advantage of this knowledge by taking the code for n=2 outside of the loop. Fill in the blanks in the return statement to incorporate the same effect as the n=2 loop iteration had in the original function, incorporating a reduction in strength optimization. Each blank should contain either an operator, a variable name, or a constant value. Page 10 of 16 The following problem concerns optimizing a procedure for maximum performance on an Intel Pentium III. Using the same performance characteristics for operations as in Problem 5, consider the following two procedures: | Loop 1 | Loop 2 | | | | | | | |--------------------------------------------|--------------------------------------|--|--|--|--|--|--| | <pre>int loop1(int *a, int x, int n)</pre> | int loop2(int *a, int x, int n) | | | | | | | | <pre>{ int y = x*x;</pre> | <pre>int y = x*x;</pre> | | | | | | | | int i;<br>for (i = 0; i < n; i++) | int i;<br>for (i = 0; i < n; i++) | | | | | | | | <pre>x = y * a[i]; return x*y;</pre> | <pre>x = x * a[i]; return x*y;</pre> | | | | | | | | 1 | } | | | | | | | When compiled with GCC, we obtain the following assembly code for the inner loop: | Loop 1 | Loop 2 | |---------------------------------------------------------------------------------------------|--------------------------------| | .L21: movl %ecx,%eax { imull (%esi,%edx,4),%eax incl %edx { cmpl %ebx,%edx il .L21 | .L27: imull (%esi,%edx,4),%eax | Problem 6. (9 points): Running on a Pentium III Xeon server, we find that Loop 1 requires 3.0 clock cycles per iteration, while Loop 2 requires 4.0. A. Explain how it is that Loop 1 is faster than Loop 2, even though it has one more instruction. (Hint: consider how the latency of the multiplication operation might affect one of the loops more than another, given that pipelined processors have multiple instructions in execution at once whenever possible and that all operands must be known for an instruction to enter the pipeline). since the int multiplication laterum is 4 and issue time is 1. and sine in lop 1, Y is predetermined outside the lap, so that pipeline processors can carry out multiplication every club cycle, but for lup 2, x is updated every time B. By using the compiler flag -funroll-loops, we can compile the code to use 4-way loop unrolling. This speeds up Loop 1. Explain why. before starts the next. The loop 1 is optimized become every multiplication to finish & the dues not depend on each other. Therefore lapunrolling whole multiplicate can make pressur to operater more instruction at the same time, which speeds my Loop 1. C. Even with loop unrolling, we find the performance of Loop 2 remains the same. Explain why. The lop 2 is unchange & since in X= XX a C V A is depend on previous X, so that the processor has to know the previous x before execute a now multiplication. Thro even with wordling it will not holy to speed my lap 2. ### Problem 7. (12 points): The following problem concerns basic cache lookups. - The memory is byte addressable. - Memory accesses are to 1-byte words (not 4-byte words). - Physical addresses are 12 bits wide. - The cache is 4-way set associative, with a 2-byte block size and 32 total lines. In the following tables, all numbers are given in hexadecimal. The contents of the cache are as follows: | 4-way Set Associative Cache | | | | | | | | | | | | | | | | | |-----------------------------|-----|-------|--------|--------|-----|-------|--------|--------|-----|-------|--------|--------|-----|-------|--------|--------| | Index | Tag | Valid | Byte 0 | Byte 1 | Tag | Valid | Byte 0 | Byte 1 | Tag | Valid | Byte 0 | Byte 1 | Tag | Valid | Byte 0 | Byte 1 | | 0 | 29 | 0 | 34 | 29 | 87 | 0 | 39 | AE | 7D | 1 | 68 | F2 | 8B | 1 | 64 | 38 | | 1 | F3 | 1 | 0D | 8F | 3D | 1 | 0C | 3A | 4A | 1 | A4 | DB | D9 | 1 | A5 | 3C | | 2 | A7 | 1 | E2 | 04 | AB | 1 | D2 | 04 | E3 | 0 | 3C | A4 | 01 | 0 | EE | 05 | | 3 | 3B | 0 | AC | _1F | E0 | 0 | B5 | 70 | 3B | 1 | 66 | 95 | 37 | 1 | 49 | F3 | | 4 | 80 | 1 | 60 | 35 | 2B | 0 | 19 | 57 | 49 | 1 | 8D | 0E | 00 | 0 | 70 | AB | | 5 | EA | 1 | B4 | 17 | CC | 1 | 67 | DB | 8A | 0 | DE | AA | 18 | 1 | 2C | D3 | | 6 | 1C | 0 | 3F | A4 | 01 | 0 | 3A | C1 | F0 | 0 | 20 | 13 | 7F | 1 | DF | 05 | | 7 | 0F | 0 | 00 | FF | AF | 1 | B1 | 5F | 99 | 0 | AC | 96 | 3A | 1 | 22 | 79 | #### Part I The box below shows the format of a physical address. Indicate (by labeling the diagram) the fields that would be used to determine the following: - CO The block offset within the cache line - CI The cache index - CT The cache tag 7 \$ ### Part II For the given physical address, indicate the cache entry accessed and the cache byte value returned in hex. Indicate whether a cache miss occurs. Parameter Value Cache Offset (CO) 0x 🗸 Cache Index (CI) 0x | | Cache Tag (CT) Cache Hit? (Y/N) Cache Byte returned 4 (8) Wx16= 2+6×4= \$6 10.24 Problem 8. (12 points): A bitmap image is composed of pixels. Each pixel in the image is represented as four values: three for the primary colors(red, green and blue - RGB) and one for the transparency information defined as an alpha channel. In this problem, you will compare the performance of direct mapped and 4-way associative caches for a square bitmap image initialization. Both caches have a size of 128 bytes. The direct mapped cache has 8-byte blocks while the 4-way associative cache has 4-byte blocks. You are given the following definitions typedef struct{ unsigned char r; unsigned char g; unsigned char b; unsigned char a; }pixel\_t; pixel\_t pixel[16][16]; register int i, j; Also assume that - sizeof(unsigned char) = 1 - pixel begins at memory address 0 - Both caches are initially empty - The array is stored in row-major order - Variables i, j are stored in registers and any access to these variables does not cause a cache miss D 128 by b=7 $128 = 1 \times 3 \times 3$ $4 = 1 \times 3 \times 3$ $5 5 = A. What fraction of the writes in the following code will result in a miss in the direct mapped cache? for (i = 0; i < 16; i ++) { for (j = 0; j < 16; j ++) { pixel[i][j].r = 0; pixel[i][j].g = 0; pixel[i][j].b = 0; pixel[i][j].a = 0; }</pre> Miss rate for writes to pixel: 2 9% Page 15 of 16 C. What fraction of the writes in the following code will result in a miss in the direct mapped cache? for (i = 0; i < 16; i ++) { for (j = 0; j < 16; j ++) { pixel[j][i].r = 0; pixel[j][i].g = 0; pixel[j][i].b = 0; pixel[j][i].a = 0; } }</pre> Miss rate for writes to pixel: 25 %