1. Flipping Through Your Notes? (12 points): For the following implementation choices, list ONE benefit and ONE drawback to the given choice. instead of use comments **EXAMPLE** adds and shifts multiplies benefit: lower CPI drawback: higher instruction count CISC RISC lower instruction count benefit: Complex instructions can have layer CPI drawback: Fixed Length ISA Variable Length ISA Simplifies decoding and fetch benefit: limited instruction Gits drawback: 3<sup>rd</sup> version of multiply algorithm Booth's algorithm hadles Signed multiplication benefit: longer worst age dely drawback: Single Cycle Datapath Multicycle Datapath better dock rate benefit: lage CPI drawback: Partial Carry Lookahead Adder Ripple Carry Adder simple inglemention, less hodore benefit: drawback: longer latency 3-Address Code (Assume your ISA supports one 2-Address Code or the other, but not both) more 6 its for immediates and opcodes benefit: One source operand must be the Lestimbon -> less reuse of drawback:

Source spends

2. What's Happening Now (8 points): Consider the following code sequence:

lw \$t4, 8 (\$s1) add \$t4, \$s2, \$t4 sw \$t4, 8 (\$s1)

a. If we executed this on the single cycle datapath, what would be happening in the 3<sup>rd</sup> cycle of execution?

Sw \$+4, 8 (\$s1)

b. If we executed this on the multicycle datapath, in what cycle would the sw actually write to memory?

13#

- 3. Got MAD? (20 points): This problem will make use of the multicycle datapath, using the design we covered in class. Suppose that you are targeting a specific application where the instructions are broken down as follows: 20% loads, 15% beq, 15% stores, 50% R-type.
  - a. First, calculate the CPI for this application running on the multicycle datapath:

Now let's add a new instruction to this architecture. This instruction, the memory add, will be an R-type instruction that works as follows:

$$t1 = t2 + M[t3]$$

NOTE – we are using register indirect addressing here for the second operand

We will replace every instance of

in the application with

where a, b, c, and x are registers. So we are putting two instructions in place of one instruction whenever possible. Assume that the MAD instruction takes 5 cycles.

b. If 50% of loads can make use of this optimization, calculate the new CPI for this application workload and machine:

$$\frac{100 + 220 + 45}{90} = 4.06$$

Of course, we cannot really make a complete comparison between these two approaches without using execution time. Assume that the complexity of this optimization increases the cycle time by 10%.

c. Calculate the percent speedup in execution time from this optimization: (if the optimization results in a slowdown, express this as a negative speedup).

By way of comparison, calculate the CPI for the single cycle datapath with and without the MAD instruction:

d. Single cycle datapath CPI<sub>original</sub>: 1,0 Single cycle datapath CPI<sub>mad</sub>: 1,0



mad: R[m] - R[rs] + M[R[rt]]

#4

Main Controller

| Input or Output | Signal Name | R-format | lw | sw | Beq |
|-----------------|-------------|----------|----|----|-----|
|                 | Op5         | 0        | 1  | 1  | 0   |
| Inputs          | Op4         | 0        | 0  | 0  | 0   |
|                 | Op3         | 0        | 0  | 1  | 0   |
|                 | Op2         | 0        | 0  | 0  | 1   |
|                 | Op1         | 0        | 1  | 1  | 0   |
|                 | Op0         | 0        | 1  | 1  | 0   |
| Outputs         | RegDst      | 1        | 0  | X  | X   |
|                 | ALUSrc      | 0        | 1  | 1  | 0   |
|                 | MemtoReg    | 0        | 1  | X  | X   |
|                 | RegWrite    | 1        | 1  | 0  | 0   |
|                 | MemRead     | 0        | 1  | 0  | 0   |
|                 | MemWrite    | 0        | 0  | 1  | 0   |
|                 | Branch      | 0        | 0  | 0  | 1   |
|                 | ALUOp1      | 1        | 0  | 0  | 0   |
|                 | ALUOp0      | 0        | 0  | 0  | 1   |

ALU Controller

| opcode | ALUOp | instruction  | function | ALU Action | ALUCtrl | mad |
|--------|-------|--------------|----------|------------|---------|-----|
| lw     | 00    | load word    | XXXXXX   | add        | 010     | 0   |
| sw     | 00    | store word   | XXXXXX   | add        | 010     | 0   |
| beq    | 01    | branch equal | XXXXXX   | subtract   | 110     | 0   |
| R-type | 10    | add          | 100000   | add        | 010     | 0   |
| R-type | 10    | subtract     | 100010   | subtract   | 110     | 0   |
| R-type | 10    | AND          | 100100   | AND        | 000     | 0   |
| R-type | 10    | OR           | 100101   | OR         | 001     | 0   |
| R-type | 10    | SLT          | 101010   | SLT        | 111     | 0   |
| R-type | 10    | mad          | 110000   | add        | 010     |     |





- 6. Your Problems Are Multiplying (20 points): Consider the 2-bit Booth's Algorithm that you did on your homework. Assume that shifts take S seconds and adds/subtracts take A seconds.
  - a. Assume that we are using 16-bit numbers. What is the worst case latency from this version of Booth's algorithm? Express in terms of A and S, and assume that shifting by two is the same cost as shifting by one still S seconds. Further assume that you already have the multiplicand and the multiplicand<<1 stored in temporary registers before the start of the algorithm.

There are 8 iterations
each iteration in the
worst case consists of
an add and a shift by two

.. worst case delay is 
$$\mathcal{E}(A+S)$$

Now, we will try and quantify A and S a little further. We will look at a 16-bit ALU and a 16-bit right shifter. Assume that a multiplexor has delay 2T. However, your design template has trouble with AND/OR/XOR gates that use more than two inputs.

Use the following table to get the delay for AND/OR/XOR gates wih different # of inputs:

| # of inputs | Delay |  |  |
|-------------|-------|--|--|
| 2           | T     |  |  |
| 3           | 2T    |  |  |
| 4           | 3T    |  |  |
| 5           | 4T    |  |  |
| 6           | 5T    |  |  |

So the delay of a 2 input OR gate would be T, a 3 input OR gate would be 3T, a 4 input OR gate would be 3T, and a 5 input OR gate would be 4T.

You must use these delay values for all parts of this question. Express all answers in terms of T. SHOW YOUR WORK ON THE DIAGRAMS!

b. What is the delay of this 16-bit shifter?



c. On the next page we have the same 16-bit hierarchical carry lookahead adder (CLA) from class. Using the same delay values as above, what is the delay to get the sum bits for this adder?

6c) (out of 10 points)

Delay in CLAZ

Ga= G3 + G2P3 + G1P2P3 + G0P, P2P3

Pa = Po Pi Pz Pz

delay G = T + 3T + 3T = 7Tdelay R = T + 3T = 4T

At heirarchical level: 1 there are two possible answers Option B is correct though option A was not penalized

Option A

Cq = Gz + CoPz (adds 2T additional delay)

delay Cq = 2T + 7T = 9T

C8 = Gp + GxPp + CoPxPp (adds 47 delay)

delay C8 = 4T + 7T = 11T

C12 = G8 + G8P8 + G8P8P8 + C6P2P8P8 (adds 6Tdelay) delay C12 = GT + 77 = 13T

C16 = G8 + G8P8 + G8P8P8 + G2P8P8P8 + GP2P8P8P8 (adds 87 delay) delay C16 = 8T + 7T = 15T

Delay in CLAE block

CIS = GI4 + GI3PI4 + GIZPI3PI4 + CIZPIZPI3PI4

 $dolay C_{15} = 3T + dolay C_{12} + 3T$  = 19T

calculation of Sis bit

if assume 3 input xor gate delay Sis= 2T+19T = 21T if assume two 2 input xor gate delay Sis= T+19T = 20T Option B Since delay God > delay Pod we ignore the contribution of the propagation chain delay Cy = T+7T = 8T

delay C8 = 3T+7T = 10T

delay C12 = 5T+7T = 12T

delay C16 = 7T + 7T = 14T

delay Cis = 6T + 12T = 18T

delay S15 = 2T + 18T = 20T)
delay S15 = T + 18T = 19T

These are the three possible unpenalized answers

It is interesting to note that if you limit the implementation to just two input

logic gates then delay of 14T can be obtained (see section 2 midterm answer)