ERKELEY • DAVIS • IRVINE • LOS ANGELES • RIVERSIDE • SAN DIEGO • SAN FRANCISCO ## CS M151B/EE M116C Midterm Exam Before you start, make sure you have all 13 pages attached to this cover sheet. All work and answers should be written directly on these pages, use the backs of pages if needed. This is an open book, open notes quiz - but you cannot share books or notes. We will follow the departmental guidelines on reporting incidents of academic dishonesty – do not make us enforce the rules. Keep your eyes on your own exam! | | 4 | |---------|---| | NAME: _ | | | ID: | | Do not write anything in the area below on this page: (out of 100) | Problem 1: | 10 | _(12) | | |--------------|------|-------|--| | Problem 2: | 4 | _(8) | | | Problem 3: _ | 17.5 | _(20) | | | Problem 4: | 20 | _(20) | | | Problem 5: | 10 | _(20) | | | Droblem 6: | 17 | (20) | | 10 Flipping Through Your Notes? (12 points): For the following implementation choices, list ONE benefit and ONE drawback to the given choice. | use | instead of | comments | | |------------------------------------------------------------------------------|---------------------|--------------------------------------|------------| | EXAMPLE<br>adds and shifts<br>benefit: lower CPI<br>drawback: higher instruc | multiplies | - ) and point it & | 1 column w | | CICC / | Proc | | tino m | | CISC | RISC | la III na | | | benefit: More instruction | is more para | , fisc | | | drawback: Less pipelin | now fever instruc | trons | | | Fixed Length ISA | Variable Length | ISA | | | benefit: easy fetch a<br>decode<br>drawback: instruction | ind more flexible | and compared | | | drawback: instruction | , vequire multi | -step feten | | | bits are scare | and dec | e//_ | | | Booth's algorithm | version of mu | ltiply algorithm | | | Booth's algorithm benefit: Speech | lose less comple | x hardwart | | | drawback: required to find strings of 13 | I stoner if | of 185. | | | Multicycle Datapath | Single Cycle Data | path | | | henefit faster rastruction | simpler o | lesign | | | drawback: more | all porrock | line to execute as | | | Ripple Carry Adder | / Partial Carry Loa | okahead Adder | | | benefit: Sirph horder | ere better so | 2001 | | | benefit: Siry W | 1 | | | | drawback: Long laten | more compl | es hardware | | | 2-Address Code | 3-Address Code | (Assume your ISA or the other, but n | | | benefit: Shorter Instruction Use fess memory. drawback: Loss flexibility & | S Imore flexibil | store<br>esults | ot both) | | drawback: | 1 | 1 1005 | | | Loss flexibility & | leasibly longer ins | tructur avaliable | | | 1 | bits for | 1 | | 2. What's Happening Now (8 points): Consider the following code sequence: 1w \$t4, 8 (\$s1) 5 add \$t4, \$s2, \$t4 4 sw \$t4, 8 (\$s1) 4 a. If we executed this on the single cycle datapath, what would be happening in the 3<sup>rd</sup> cycle of execution? (Sw) The data at RESTUD is stored in mERESSISTS) b. If we executed this on the multicycle datapath, in what cycle would the sw actually write to memory? - 3. Got MAD? (20 points): This problem will make use of the multicycle datapath, using the design we covered in class. Suppose that you are targeting a specific application where the instructions are broken down as follows: 20% loads, 15% beq, 15% stores, 50% R-type. a. First, calculate the CPI for this application running on the multicycle datapath: $$(P1 = .7(5) + .15(3) + .15(4) + .5(4) = 4.05$$ $$(P1 = .7(5) + .15(3) + .15(4) + .5(4) = 4.05$$ Now let's add a new instruction to this architecture. This instruction, the memory add, will be an R-type instruction that works as follows: $$t1 = t2 + M[t3]$$ NOTE - we are using register indirect addressing here for the second operand We will replace every instance of in the application with where a, b, c, and x are registers. So we are putting two instructions in place of one instruction whenever possible. Assume that the MAD instruction takes 5 cycles. If 50% of loads can make use of this optimization, calculate the new CPI for this application workload and machine: Of course, we cannot really make a complete comparison between these two approaches without using execution time. Assume that the complexity of this optimization increases the cycle time by 10%. c. Calculate the percent speedup in execution time from this optimization: (if the optimization results in a slowdown, express this as a negative speedup). By way of comparison, calculate the CPI for the single cycle datapath with and without the MAD instruction: d. Single cycle datapath CPI<sub>original</sub>: \_\_\_\_\_ Single cycle datapath CPI<sub>mad</sub>: \_\_\_\_\_ Don't Get MAD, Get Even (20 points): Implement the memory add instruction on the single cycle datapath. Use the R-type instruction format. Implement your solution on the following two pages. For the example instruction: mad \$t1, \$t2, \$t3 70 which has functionality: $$t1 = t2 + M[t3]$$ the R-type fields would map as follows: \$11 would be in rd, \$12 would be in rs, and \$13 would be in rt. All other instructions must still work correctly after your modifications. You should **not** add any new ALUs, register file ports, or ports to memory. \$41 = \$72+m[]+3] p ( rack pers) 6 Main Controller | Input or Output | Signal Name | R-format | lw | sw | Beq | |-----------------|-------------|----------|----|----|-----| | | Op5 | 0 | 1 | 1 | 0 | | | Op4 | 0 | 0 | 0 | 0 | | Inputs | Op3 | 0 | 0 | 1 | 0 | | | Op2 | 0 | 0 | 0 | 1 | | | Opl | 0 | 1 | 1 | 0 | | | Op0 | 0 | 1 | 1 | 0 | | Outputs | RegDst | 1 | 0 | X | X | | | ALUSrc | 0 | 1 | 1 | 0 | | | MemtoReg | 0 | 1 | X | X | | | RegWrite | 1 | 1 | 0 | 0 | | | MemRead | 0 | 1 | 0 | 0 | | | MemWrite | 0 | 0 | 1 | 0 | | | Branch | 0 | 0 | 0 | 1 | | | ALUOp1 | 1 | 0 | 0 | 0 | | | ALUOp0 | 0 | 0 | 0 | 1 | ALU Controller | opcode | ALUOp | instruction | function | ALU Action | ALUCtrl | MADO | |--------|-------|--------------|----------|------------|---------|------| | lw | 00 | load word | XXXXXX | add | 010 | 0 | | sw | 00 | store word | XXXXXX | add | 010 | 0 | | beq ` | 01 | branch equal | XXXXXX | subtract | 110 | 0 | | R-type | 10 | add | 100000 | add | 010 | 6 | | R-type | 10 | subtract | 100010 | subtract | 110 | 6 | | R-type | 10 | AND | 100100 | AND | 000 | 0 | | R-type | 10 | OR | 100101 | OR | 001 | 0 | | R-type | 10 | SLT | 101010 | SLT | 111 | 0 | | Rtype | 1 10 | MAD | 1111++ | ladd | 010 | ) ( | Branching Out (20 points): Use the multicycle datapath to implement the memory jump and link (mjl) instruction. This instruction uses the I-type instruction format, and has the following behavior: will do the following: $$\begin{cases} \$s3 = PC+4 \\ PC = M[\$s4+20] \\ P(SS] \end{cases}$$ Note that this instruction uses base+displacement addressing (i.e. we use the value in register \$s4 plus the immediate value 20). Further note that we are storing PC+4 into \$s3 before we modify the PC in the next line. For the I-type format, the rs field will give the base address (i.e. \$s4 above), the immediate field will give the displacement (i.e. 20 above), and the rt field will give the register to store PC+4 (i.e. \$s3 above). Implement this instruction on the following two pages. All other instructions must still work correctly after your modifications. You should **not** add any new ALUs, register file ports, or ports to memory. PC M(RS PC My PC = M[\$541 x 2] - Your Problems Are Multiplying (20 points): Consider the 2-bit Booth's Algorithm that you did on your homework. Assume that shifts take S seconds and adds/subtracts take A seconds. - a. Assume that we are using 16-bit numbers. What is the worst case latency from this version of Booth's algorithm? Express in terms of A and S, and assume that shifting by two is the same cost as shifting by one – still S seconds. Further assume that you already have the multiplicand and the multiplicand<<1 stored in temporary registers before the start of the algorithm.</p> Now, we will try and quantify A and S a little further. We will look at a 16-bit ALU and a 16-bit right shifter. Assume that a multiplexor has delay 2T. However, your design template has trouble with AND/OR/XOR gates that use more than two inputs. Use the following table to get the delay for AND/OR/XOR gates wih different # of inputs: | # of inputs | Delay | | |-------------|-------|--| | 2 | T | | | 3 | 2T | | | 4 | 3T | | | 5 | 4T | | | 6 | 5T | | So the delay of a 2 input OR gate would be T, a 3 input OR gate would be 3T, a 4 input OR gate would be 3T, and a 5 input OR gate would be 4T. You must use these delay values for all parts of this question. Express all answers in terms of T. SHOW YOUR WORK ON THE DIAGRAMS! b. What is the delay of this 16-bit shifter? c. On the next page we have the same 16-bit hierarchical carry lookahead adder (CLA) from class. Using the same delay values as above, what is the delay to get the sum bits for this adder?