BERKELEY DAVIS LOS RIVERSID SAN DIEGO · SAN FRANC ## CS M151B/EE M116C Midterm Exam #1 All work and answers should be written directly on these pages, use the backs of pages if needed. This is an open book, open notes quiz – but you cannot share books or notes. We will follow the departmental guidelines on reporting incidents of academic dishonesty – do not make us enforce the rules. Keep your eyes on your own exam! | NAME: | | |--------------------------|------------------------------| | ID: | | | Do not write anything in | the area below on this page: | | Problem 1: | (20) | | Problem 2: | (40) | | Problem 3: | (40) | | Total: | (out of 100) | - 1. Hardware Design ISA-bout Tradeoffs (20 points): Consider the following three properties: - Instruction Count - Cycle Time (assume that the CPI of any particular instruction will stay the same) - *Hardware Complexity* (i.e. the number of logic gates required to implement the hardware or the cost to verify the design the hardware is more difficult to design/verify) We assume that the source code of the application we are going to evaluate does not change – but that it may be compiled differently based on the changes made below. For the following design decisions, indicate how these three properties could be impacted on the *single cycle datapath* from class. Your answer for each should be that it either *could increase*, *could decrease*, *or will stay the same*. You may only circle one answer per property. The first one is done for you as an example. | | Example: Design Decis | ion: Use a carry lookah | ead adder instead of a | ripple carry adder. | |---|-----------------------|-------------------------|------------------------|---------------------| | • | Instruction Count | could increase | could decrease | will stay the same | | • | Cycle Time | could increase | could decrease | will stay the same | | • | Hardware Complexity | could increase | could decrease | will stay the same | a. *Design Decision*: Eliminate a complex addressing mode from our ISA – all other addressing modes will remain. Instructions using the removed addressing mode will either be removed from the ISA or will use a simpler addressing mode. | • | Instruction Count | could increase | could decrease | will stay the same | |---|---------------------|----------------|----------------|--------------------| | • | Cycle Time | could increase | could decrease | will stay the same | | • | Hardware Complexity | could increase | could decrease | will stay the same | b. *Design Decision*: Extend our ISA to include multiply-add instructions which can do a multiply followed by an add in a single instruction. | • | Instruction Count | could increase | could decrease | will stay the same | |---|---------------------|----------------|----------------|--------------------| | • | Cycle Time | could increase | could decrease | will stay the same | | • | Hardware Complexity | could increase | could decrease | will stay the same | c. *Design Decision*: Hardware based on a register-memory machine instead of a load-store machine. | • | Instruction Count | could increase | could decrease | will stay the same | |---|---------------------|----------------|----------------|--------------------| | • | Cycle Time | could increase | could decrease | will stay the same | | • | Hardware Complexity | could increase | could decrease | will stay the same | - 2. *Performance Anxiety (40 points)*: Suppose on a particular load-store machine-based processor there are four types of operations loads, stores, branches, and ALU operations (i.e. adds, subtracts, multiplies). This will be implemented by some other datapath **not** the single-cycle datapath we covered in class. Loads take 6 cycles, stores take 4 cycles, branches take 3 cycles, and ALU operations take 4 cycles. The architecture is not pipelined so no instruction latency is overlapped instructions are executed one at a time, in order. Answer the questions below show your work. - a. What is the CPI for an application with one million instructions that is 25% loads, 5% stores, 50% ALU operations, and 20% branches? CPI: 4.3 CPI = 6(0.25) + 4(0.05) + 3(0.2) + 4(0.5) = 4.3 b. The hardware running the application from part (a) has a cycle time of 312.5 ps (that is picoseconds). What is the execution time for that hardware to run the application? Execution Time = 10 6x 4.3x312.5x10-12 ET: 1.344 msec c. Now suppose that we can eliminate 50% of all ALU operations through an algorithmic change (i.e. we save intermediate values in lookup tables). Unfortunately, this requires more loads – in fact, for every two ALU operations we eliminate, we need to add one load instruction. What is the new CPI? 50,000 ALV ints +125000 250,000 ALV inst Loud Insts CPI: 4.63 d. What is the execution time for the same hardware from part (b) to run the new application from part (c)? Exe $Time = (4.63) \times (875000) \times (312.5 \times 10^{-12})$ ET: 1.266 msec 3. *Putting the CLA into UCLA (40 points)*: Assume for the rest of this problem that all logic gates have the following delays: | Fan In | Delay | |-----------|-------------| | 1 | Т | | 2 | 3T | | 3 | 4T | | 4 | 7T | | 5 | 9T | | 6 or more | 2T x fan-in | So a 2-input AND gate would have delay 3T and a 4-input OR gate would have delay 7T. For simplicity, assume that mux's have delay 4T regardless of fan-in. We will create a 32-bit adder out of some building blocks we've covered in class. We will use the 4-bit carry lookahead (4-bit CLA) that we covered in class as one basic building block of this design. And we will use it (as we did in class) to make a 16-bit hierarchical CLA (16-bit HCLA). And we will also use it to make a 12-bit hierarchical CLA (12-bit HCLA). But instead of connecting these in series to make a 32-bit adder, we will use carry select to speed up the 32-bit adder. The design will look as follows (be sure to note where we are using the 4-bit CLA and where we are using the 12-bit and 16-bit HCLAs): Note that the 4-bit CLA is computing the sum of inputs $A_0$ - $A_3$ and $B_0$ - $B_3$ (i.e. producing sums $S_0$ - $S_3$ ), the 12-bit HCLA is computing the sum of inputs $A_4$ - $A_{15}$ and $B_4$ - $B_{15}$ (i.e. producing sums $S_4$ - $S_{15}$ ), the 16-bit HCLA is computing the sum of inputs $A_{16}$ - $A_{31}$ and $B_{16}$ - $B_{31}$ (i.e. producing sums $S_{16}$ - $S_{31}$ ). The carry out of the 4-bit CLA selects the sums and carry out of the 12-bit HCLA. The carry out of the 12-bit HCLA selects the sums and carry out of the 16-bit HCLA. The 0 or 1 at the top of each HCLA is the hardwired $C_{in}$ . Multiplexers are shown as ovals. Your task is to find the maximal delay of this design – i.e. determine the delays of $S_{0-31}$ and $C_{32}$ – the maximal delay of these outputs will be the maximal delay of the entire design. To do this (and to help with possible partial credit) please use the diagrams on the following pages and fill in the tables in every page. ## 4-bit CLA: ## 12-bit HCLA: | 1 | 2 | -bi | t/ | HC | LA | |---------|----------|-----------|--------|-------|-----------| | Enteron | Concrete | earns, do | Alle B | D 483 | Controlle | | Output | Delay (in terms of T) | | |---------------------------|-----------------------|----------| | Gα | 17 | (2 point | | Ρα | 10 | (2 point | | C12 | 24 | (2 point | | C16 (just in this figure) | 28 | (2 point | | S15 | 42 | (2 point | $$P_{x} = P_{0}P_{1}P_{2}P_{3}$$ $$G_{x} = G_{3} + G_{2}P_{3} + G_{1}P_{2}P_{3} + G_{1}P_{2}P_{3}$$ $$S_{15} = a_{5} \oplus b_{15} \oplus c_{15}$$ ## 16-bit HCLA: Par = P28 P29 P30 P31 | 16-bit HC | LA | |-----------|----| |-----------|----| + Gz PyPw + | Output | Delay (in terms of T) | |---------------------------|-----------------------| | Gω | 17 | | Ρω | 10 | | C28 | 28 | | C32 (just in this figure) | 33 | | S31 | 46 | (2 points) (2 points) (2 points) (2 points) (2 points) Ga Px Py Pw +Cin Pa R Pp h 531 = A31 + B31 + C31 Entire Design: | Output | Delay (in terms of T) | 1 | |--------|-----------------------|------------| | C16 | 32 | (2 points) | | C32 | 37 | (2 points) | Find the maximum delay in terms of T of the 32-bit adder – take the maximum of all output bits – including the sum bits $(S_0-S_{31})$ and the final carry out $(C_{32})$ . Maximal Delay: \_\_\_\_\_ (2 points)