University of Southern California

Spring 2014 EE457 Instructor: Gandhi Puvvada Final Exam (30%) Date: 5/12/2014, Monday Closed Book, Closed Notes, no cheat sheet, no calculator Time: 04:30-07:20 PM in THH102/THH301

Esperan Verilog Guide is allowed but not required Total points: 261 Name: Perfect score: 250 / 261

1 ( 90 points) 65 min.

Topic: pipeline design (refer to the block diagrams on pages 5 and 6 and the two OoO designs below)

1.1 HDU (in the ID stage) is overridden and told not to stall in some cases to avoid conflict with the operation of the branch instruction. If this applies, write YES otherwise write NO in the cells below.

1.1.1 _____________ (HDU_Br / HDU /both HDU and HDU_Br) ______ (is / are) called the guardian angel in the case of _________________________ (an early branch / a medium-delay branch / a late branch / multiple of these).

1.1.2 FU_Br in the early branch design checks to see if the senior instruction in the MEM stage is (a) a register-writing instruction (b) a register-writing R-Type instruction but not a lw instruction (c) neither of the above.

1.2 Compared to the late branch, an early branch has __________ (less / more) branch penalty due to flushing and this difference is seen in ________________________________ (successful branches / unsuccessful branches / both / neither). Compared to the late branch, an early branch causes __________ (less / more) RAW dependency stalls and this difference is seen in _____________________________ (successful branches / unsuccessful branches / both / neither).

[Figures: block diagrams of the IoI-OoE-OoC design and the IoI-OoE-IoC design. Margin point values on this page: 3+2 pts, 2+1 pts, 4+2 pts.]

1.3 Assembly language code can be written deliberately
(i) to make a late branch look better than an early branch. True / False
(ii) to make a late branch look better than a medium-delay branch. True / False
(iii) to make a medium-delay branch look better than an early branch. True / False

1.4 There are four 5-bit comparison units in the ID stage of the lab 6 Part 5 design as shown below. This number 4 __________________ (includes / does not include) the two comparators needed for internal forwarding in the register file. What would this number be if it were the 7-stage pipeline of lab 6 Part 4? _________ . This Part 5 design provides a timing advantage in _________ (ID stage / EX stage / neither) and a timing disadvantage in _________ (ID stage / EX stage / neither).

1.5 Assuming that both early branch and late branch designs are working at the same frequency, the control unit of the early branch has to finish decoding the code a little more quickly compared to the control unit of the late branch. True / False

1.5.1 We can be more confident to move the 5-bit wide multiplexer on the side from the current EX stage to the ID stage in the case of the ________ (early / late) branch design. This move saves 1 bit in the ID/EX stage register for the RegDst control signal. Besides this RegDst signal, this move saves _____ (0 / 4 / 5 / 8 / 10) more bits in the current lab 6 design. Besides this RegDst signal, this move saves _____ (0 / 4 / 5 / 8 / 10) more bits in the lab 6 Part 5 design. This 5-bit multiplexer is in a time-critical path. True / False

1.6 A late branch cannot be too late. The WB stage in the 5-stage pipeline is too late for the late branch execution. True / False
Branch outcome is announced from the CDB in our IoI-OoE-OoC design on page 1. Mr. Bruin says that the CDB is like the WB stage, hence branch execution is too late and our design is wrong. Please explain. ________________________________________________________________________________________________________________________________________________

1.7 The stage register IF/ID of the in-order 5-stage pipeline is replaced by what in the IoI-OoE-OoC design on page 1? _____________________________________________________________________________________________________________________________________________

1.8 In the code on the side, add is stalled for _____ (1 / 2) clock(s) in an in-order 5-stage pipeline and _____ (1 / 2) clock(s) in an in-order 7-stage pipeline. If lw incurs a cache miss, ______ (more / less) clocks are lost. This loss impacts CPI ________ (more / less / the same) in the IoI-OoE-OoC design on page 1.

[Figure: ID-stage detail; recoverable labels: Reg, Instr., BRANCH, FU_Br, HDU_Br. Margin point values on this page: 3+2 pts, 4+2 pts, 4+2 pts.]

lw $2, 2000($0); add $3, $2, $1;

1.9 If we use 64 tokens or TAGs in the IoI-OoE-OoC design, the TAG FIFO will have ____ locations each of ____ bits per location. Since we compared this FIFO to ___________________ ________________________________ (paper tokens forming a virtual queue / pile of tokens on the cashier’s table in State Bank of India) it _____________________ (matters / does not matter) in which order the tokens 0 to 63 are placed initially in the token FIFO. The TAG FIFO should have a valid bit to indicate whether the location has a valid token or if it is empty. True / False

1.9.1 The RST (Register Status Table) is a look-up table to allow the dispatch unit to look up ________ ___________________________________ (entries for each of the source registers / entry for the destination register) of the instruction being dispatched (say add $3, $2, $1). This look-up is ____________________ (a search / an indexing) operation. The destination register gets renamed to a token drawn from the token FIFO. This register renaming causes _____________ (reading from / writing to) the Token FIFO and _____________ (reading from / writing to) the RST. The RST should have a valid bit besides the TOKEN field in each entry. True / False

1.9.2 Suppose a register-writing instruction comes on the CDB and says, "it is LION calling (or some 6-bit token in place of LION), and I am going to write the value 2000". Let us try to understand what the dispatch unit should do now. Circle all correct statements:
(a) the dispatch unit does not do anything
(b) it indexes RST to find LION in RST
(c) it performs a parallel search to find LION in RST
(d) if it finds LION across $2 in RST, it takes 2000 and deposits it in the register file in $2
(e) if it finds LION across $2 in RST, and if it is currently dispatching an instruction with source register $2, it gives the value 2000 for its source register value. This is like the internal forwarding in the register file.
(f) if it finds LION across $2 in RST, it erases LION's name across $2 and invalidates that entry
(g) it reclaims the token LION and deposits it in the TOKEN FIFO at the location pointed to by the RP
(h) it reclaims the token LION and deposits it in the TOKEN FIFO at the location pointed to by the WP

1.10 Consider the four designs, (A) in-order 5-stage late branch, (B) in-order 5-stage early branch, (C) IoI-OoE-OoC, and (D) IoI-OoE-IoC.

1.10.1 A RAW problem for a memory location (i.e. lw dependent on a senior sw instruction with matching address) is naturally taken care of in _____________ (A/B/C/D/if multiple, then state them). Among the ones in which the problem needs to be taken care of by explicit logic, it takes less logic in ________ (A/B/C/D) compared to _________ (A/B/C/D).

1.10.2 WAW and WAR problems for memory locations do not exist in _______________ (A/B/C/D/if multiple, state them). Two store word instructions can leave LSQ in any order even if their addresses match in __________________________________ (C/D/neither C nor D/either C or D).

1.11 Every register writing instruction such as add $1, $2, $3 will end up writing into the destination register in the register file in __________________________ (IoI-OoE-OoC / IoI-OoE-IoC / both / neither) designs.

1.12 IFQ (Instruction PreFetch Queue) gets flushed ________________ (less often / more often / equally often) in the IoI-OoE-IoC design with branch prediction as compared to the IoI-OoE-OoC design with no branch prediction. In the IoI-OoE-IoC design, you end up flushing IFQ for every mis-prediction. T / F You flush IFQ when you predict a branch as ___________ (Taken/Not Taken/either/neither). Which is true in the IoI-OoE-OoC? ____ [(a) / (b)] (a) we just opted not to have branch prediction (b) Since it is OoC, branch prediction is not possible.

[Margin point values on this page: 5+2 pts, 8+2 pts, 6+2 pts.]

[Block diagram (5-stage pipeline). Recoverable labels: Instruction memory; opcode, rs, rt, rd, shift, funct; Registers; Control; (rs), (rt); Sign ext.; Data memory; ALU_result; MemRead; MemWrite; Store_data; RegWrite; Branch; IF.Flush; MEM_data; REG_data.]

[Block diagram (5-stage pipeline with forwarding). Recoverable labels: Instruction memory; opcode, rs, rt, rd, shift, funct; Registers; Control; (rs), (rt); ALU ctrl; RegDst; Data memory; ALU_result; MemRead; MemWrite; Store_data; RegWrite; MEM_data; REG_data; WriteRegister_EX; FU; FW_RS_WB; FW_RS_MEM; FW_RT_WB; FW_RT_MEM; WriteRegister_MEM; Branch.]

2 ( 77 points) 45 min. Advanced topics Miscellaneous

2.1 Exceptions are taken in _____________________________ (program order / temporal order). Illegal instruction exception is often _____________________________ (a precise exception / an exception leading to abortion of the program) so as to support software emulation of unimplemented instructions.

2.2 In the two CMP (Chip Multi Processors) organizations shown below, the shared L2 cache is shown "banked" on the left but not on the right. Explain. ______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

2.2.1 If there are 8 banks of L2 cache in the left side organization, is it true that there can be 8 copies of a block in the 8 banks besides 8 copies in the 8 L1 caches? True / Not trueExplain: ____________________________________________________________________________________________________________________________________________________

2.3 To avoid an RWM race, you should be able to perform an atomic operation. What does RWM stand for? ____________________________________________________________________________

2.4 Compared to MSI protocol, MOESI protocol _______________ (reduces / increases) L2 to L1 transactions mainly because of _________ (O-state / E-state). Explain: _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

2.4.1 We proposed to split the O-state into two states: O-Dirty and O-Clean. You arrive in O-Dirty state from _________ (M/O-Clean/E/S/I) and arrive in O-Clean state from _________ (M/O-Dirty/E/S/I).

2.5 Branch Prediction from ID stage: In the diagram on the side, the (30-K) bit field of the PC is ______________________ (ignored / used as a TAG). Is there any connection between the frequency of aliasing and the BPB's (Branch Prediction Buffer's) depth (= 2^K) or width (1-bit predictor vs. 2-bit predictor)? ________________________________________________________________________________________________________________________________________

2.6 RAS stands for ________________________. It is usually ____________________ (4 or 8 locations / 1K to 2K locations) deep. The contents of RAS change during the execution of ______ _____ (jal / jr $31/both). RAS helps in the execution of ____________ (jal / jr $31/both) and such help __________ (is / isn’t) considered as a prediction that can go wrong.

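[Figure for 2.2: two CMP organizations, each with a Memory Interconnection Network; one has a shared (banked) L2 cache and the other a shared L2 cache (no banks).]

[Figure for 2.5: the PC divided into a (30-K)-bit field and a K-bit field, beside a BPB of depth 2^K; sample contents shown: 01010011.]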

2.7 Our USC multi-threaded multi-core processor currently has 4 cores each with 4 threads and the core resources are well utilized by the 4 threads. Due to process improvements we have more silicon in the next generation processors. We are considering two choices: (i) 8 cores each with 4 threads, (ii) 4 cores each with 8 threads. You recommend to go for ______ (i / ii). Explain: _______________________________________________________________________________________________________________________________________________________________________
Choice ______ (i / ii) is expected to require _________ (more / less) silicon compared to choice ______ (i / ii). Explain: _________________________________________________________________________________________________________________________________________

2.8 Two challenges in compiler design for the current super-scalar super-pipelined processors: (i) avoid pairs of dependent instructions, (ii) schedule instructions into longer delay slots for load-word and branch instructions. The first is important because of the ____________ (super-scalar / super-pipelined) aspect of our processor and we said "pairs" because we assumed that ______________________________________________________________________________________

2.9 MPI (Miss Penalty per Instruction) is used to estimate the impact of cache misses on CPI.Assume that the CPI without cache misses is 1.35. In a system with L1 and L2 caches, the overall CPI was calculated as 1.35 + (0.04 * 25) + (0.01 * 200) = 4.35. What are the numbers 0.04, 25, 0.01, and 200? _____________________________________________________________________________________________________________________________________________________________________________________________________________________________
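The arithmetic in the expression above can be checked exactly with rational numbers. This is only a sketch: the function and variable names are placeholders that are not part of the exam, and they deliberately do not interpret the four numbers.

```python
from fractions import Fraction

def overall_cpi(base, terms):
    """Add a list of (factor_a * factor_b) products to a base CPI.

    The names are hypothetical; the question asks what the factors mean.
    """
    return base + sum(a * b for a, b in terms)

base = Fraction("1.35")
terms = [(Fraction("0.04"), 25), (Fraction("0.01"), 200)]
total = overall_cpi(base, terms)
print(float(total))
```

Using Fraction avoids floating-point rounding, so the comparison with 4.35 is exact.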

2.10 Intel HTT stands for ___________________________ and it is the same as ___________________________________ (fine-grain / coarse-grain / simultaneous) multi-threading. It uses the ________ (ILP/TLP) advantage of its out-of-order execution engine together with ________ (ILP/TLP). _______ (Like / Unlike) other multi-threading techniques, here they do not need to roll back a thread on a cache miss because ________________________________________________________________________________________________________________________________________________________________________________________________________________

2.11 Mr. Bruin joined USC and was appointed as EE457 lab grader. He was grading Lab 7 Part 3 Subpart 4 (RTL coding of the pipeline with EX1 and EX2 merged into EX12). Several students made errors in the stall logic coding. He was puzzled because their _________________ (Reg. file contents / TimeSpace.txt) came out correct but their ________________ (Reg. file contents / TimeSpace.txt) came out wrong. He expected both to be wrong. Explain: _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

2.11.1 In our RTL coding, we produced two STALL signals ( declared on the side. Which one of the two produces a waveform easier to understand and why? Is it necessary to declare STALL_combinational as a wire? __________________________________________________________________________________________________________________________________________________________________________

2.11.2 In RTL coding, forwarding muxes output in an "if" statement in a clocked always block is usually assigned using the _______________ (blocking / non-blocking) procedural assignment operator because _____________________________________________________________________________________________________________________________________________________________

reg STALL;
wire STALL_combinational;

[Figure labels: X1_mux, SUB]

3 ( 14 + 18 = 32 points) 20 min. Topic: Lab 7 Part 3 Subpart 2

3.1 Given on the side is the incomplete stall logic. Label the three points as Stalled, Add1, and Stall as appropriate. The Stall output is a __________ (Mealy / Moore) output. This is a state machine implementation with an encoded state assignment and has _____ (1/2/3/4) states. Complete the state diagram below. Write state transition conditions and write, "if ADD 1, then STALL" in one of the two states. And then try to implement a 1-hot coded state machine for the same state diagram. Produce the STALL output.

3.2 A similar state diagram is shown below in nearly completed form to stall for 1 clock for M (MULT) instruction and to stall for 2 clocks for D (DIVIDE) instruction and no stall for other instructions. The state transition conditions are completed. Conditional STALL output generation statements need to be added in one or two or three state circles as appropriate. Some suggestions: (i) if (M | D), then STALL (ii) if (M), then STALL and (iii) if (D), then STALL

Implement the one-hot state machine and generate the STALL signal (i.e., do the OFL). Let us use A (ADD) and S (SUB) for other instructions which do not need stalls. If A is in the multi-functional EX stage in the current clock and the above state machine is in the IDLE state, write the sequence of states the above machine goes through for D M S M A instructions.

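[Figure for 3.1 (14 pts): incomplete stall logic built from D flip-flops (D, Q, CLK, CLR, PRESET, RESET_B), a state diagram with states IDLE and STALLED entered on RESET_B, and flip-flop outlines labeled IDLE and STALLED for the one-hot implementation.]

[Figure for 3.2 (18 pts): state diagram with states IDLE, STALLED_1, and STALLED_2 (transition label M | D visible), entered on RESET_B, and three D flip-flop outlines labeled IDLE, STALLED_1, and STALLED_2 for the one-hot implementation.]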

IDLE, STALLED_1, STALLED_2,

4 ( 27 points) 20 min. adder, subtracter, incrementer, decrementer

4.1 We know we do (A - B) by doing (A + B' + 1). To perform (X - 2Y), we do ______ (i / ii / iii / iv / v)
(i) [X + (Y' || 1'b0) + 1]
(ii) [X + (Y' || 1'b0) + 2]
(iii) [X + (Y' || 1'b1) + 1]
(iv) [X + (Y' || 1'b1) + 2]
(v) none of these, we need to do ______________________________ .
Note: || means concatenate
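The identity stated in the first sentence, (A - B) = (A + B' + 1), can be checked exhaustively for a fixed width. The sketch below assumes 4-bit operands and a result truncated to 4 bits; the function name sub_via_add is hypothetical and the check covers only A - B, not the X - 2Y variants in the options.

```python
def sub_via_add(a, b, width=4):
    """Compute (a - b) modulo 2**width as a + (bitwise NOT of b) + 1."""
    mask = (1 << width) - 1
    b_inverted = ~b & mask          # B' restricted to `width` bits
    return (a + b_inverted + 1) & mask  # drop the carry out of the top bit

# Exhaustive check over all 4-bit operand pairs
for a in range(16):
    for b in range(16):
        assert sub_via_add(a, b) == (a - b) & 0xF
```

The final mask models an adder whose carry-out is discarded, which is why the result agrees with (a - b) modulo 16 for every pair.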

4.1.1 Say the numbers are all 4-bit numbers and we have a 4-bit adder/subtracter to perform (X - 2Y) by simply discarding (dropping) Y3'. In which cases does discarding Y3' not change the value of (-2Y)? Answer separately for unsigned numbers and for signed numbers.
Unsigned numbers: ____________________________________________________________________________________________________________________________________________
Signed numbers: ______________________________________________________________________________________________________________________________________________

4.1.2 Let us play safe and use a 5-bit adder and convert it to a subtracter to perform (X-2Y). Complete the two designs by using inverters, XOR gates, etc. as needed. Produce USO (unsigned subtraction overflow) and SSO (signed subtraction overflow). You do not have to simplify the FULL-ADDER building blocks.

4.2 On the left, we have a chart for delays in gates for the incrementer/decrementer design based on CLA design assuming 4 as the blocking factor (4 CLL carry look-ahead logic boxes require one next-level CLL). Complete the table on the right for a new blocking factor of 3.

For the left-side design, how many CLL boxes are needed for a 1024-bit incrementer? ________ Think of a formula to arrive at this number rather than using a brute-force method!

Are the CLL boxes identical in the case of incrementer as well as decrementer? Yes / No

Do they both contain 1-level logic or 2-level logic? _______________________

In ___________________ (an incrementer / a decrementer), C3 = g2 + g1 + g0 + C0 because every cell agrees to _________ (generate / propagate).

In ___________________ (an incrementer / a decrementer), C3 = p2.p1.p0.C0 because no cell _________ (generates / propagates) a carry.

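[Figure for 4.1.2: two 5-bit adder outlines built from full-adder blocks (inputs a, b, cin; outputs s, cout), each with C0 and result bits R4 R3 R2 R1 R0, labeled UNSIGNED SUBTRACTER and SIGNED SUBTRACTER. Do not forget to produce USO.]

13 pts

[Tables for 4.2: gate-delay chart for blocking factor = 4 (given) and blocking factor = 3 (to be completed).]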

5 ( 35 points) 25 min. Virtual memory

5.1 Given ________ (VPN / VA), we get ________ (PPFN / PA) from _____________________ (the TLB / the PT / either the TLB or the PT if TLB does not have) if the page is ___________ (present / absent) in the MM.

5.2 A TLB of 17 entries (17 is a prime number and does not have any factors and is certainly not a power of 2) is possible if the TLB uses a _______________ (fully-associative / set-associative / direct) mapping. Usually either a fully-associative mapping or a set-associative mapping with liberal associativity is used for the TLB because (talk about both cost and performance) ______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________The same thing ______ (can / can not) be said about cache because cache is much ___________ (bigger/smaller) and cost will be much __________ (higher / smaller) and the penalty of a miss is relatively _____ ________ (smaller / larger).

5.3 In a way both TLB and page Table are look-up table but so far as indexing or searching is concerned, ________________________________________________________________________________________________________________________________________________________________________________________________________________________________Direct mapped cache is ____________________________ (indexed / searched in parallel), a fully associative cache is ____________________________ (indexed / searched in parallel).

5.4 Two processes, one with process id 2 and another with process id 3, can both use the location with 32-bit address 00002000 as in lw $2, 2000($0). Is this 32-bit address 00002000 a virtual address (VA) or a physical address (PA)? ______ (VA / PA). Assuming a fully associative TLB, is the page number 00002 (upper 20 bits of the 32-bit address 00002000) in the VPN field or the PPFN field of an entry in the TLB? __________ (VPN / PPFN) field.
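The split of the 32-bit address into a 20-bit page number and the remaining 12-bit offset, as described in the question, can be sketched as follows. The names split_address and PAGE_OFFSET_BITS are hypothetical, and the sketch says nothing about whether the address is virtual or physical.

```python
PAGE_OFFSET_BITS = 12  # 32-bit address minus the upper 20 page-number bits

def split_address(addr):
    """Return (page_number, offset) for a 32-bit address."""
    page_number = addr >> PAGE_OFFSET_BITS
    offset = addr & ((1 << PAGE_OFFSET_BITS) - 1)
    return page_number, offset

# The address from the question, written in hexadecimal
page_number, offset = split_address(0x00002000)
print(f"{page_number:05X} {offset:03X}")
```

Five hexadecimal digits hold the 20-bit page number and three hold the 12-bit offset, which matches the way the question writes the page number as 00002.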

5.5 In a multi-level page table, the _______ (VPN / PPFN) is present in the entries of ____________ (every level / the first A level / the last D level / other ..). In the example on the side the ______ (VA / VPN / PA / PPFN) is shown divided into 4 fields: A, B, C, D. It takes ___________ (minimum / maximum / always) ____ (state a number like 2) accesses to the example table on the side before you declare page table hit. You may be able to declare page fault after 1 or 2 or 3 or 4 accesses to the PT. ______ (T / F). "A" table is only one. ______ (T / F). Number of "B" tables is less than or equal to number of "C" tables. ______ (T / F). _________ (Page Table / TLB / Both / Neither) is flushed on context switch. Explain: ______________________________________________________________________________________________________________________________________________________

[Figure for 5.5: the bit string 1010111100011 divided into fields A, B, C, D, which index the A table (16 entries), B tables (16 entries), C tables (8 entries), and D tables (4 entries).]

13 pts
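How a bit string is carved into per-level table indices can be sketched as below. It assumes, since the figure itself is not recoverable here, that field A occupies the most significant bits and that each field is as wide as log2 of its table's entry count (16, 16, 8, 4). The function name split_fields is hypothetical.

```python
def split_fields(value, entry_counts):
    """Split `value` into one index per table level, most significant first.

    Assumes every entry count is a power of two, so a level with N entries
    uses log2(N) index bits.
    """
    fields = []
    for count in reversed(entry_counts):   # peel fields off the low end
        width = count.bit_length() - 1     # log2(count) for a power of two
        fields.append(value & (count - 1))
        value >>= width
    return list(reversed(fields))

a, b, c, d = split_fields(0b1010111100011, [16, 16, 8, 4])
print(a, b, c, d)
```

Under these assumptions the field widths are 4, 4, 3, and 2 bits, which add up to the 13 bits of the string in the figure.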

We enjoyed teaching this course! Hope you liked it! Hope to see some of you in EE560. Grades will be out in a week. Enjoy your semester break! Happy Summer Holidays!!! - Gandhi and TA Yue, Mentors: Manasa, Vibha, Lai, Binal, Graders: Atit, Guan (Crystal), Guan (Wade), Mukhdeep, Ruozhi, and Zhe Happy Semester Break!
