University of Southern California, EE457, Spring 2014, Final Exam
ee457_Final_Sp2014.fm May 9, 2014 12:15 am EE457 Final Exam - Spring 2014 1 / 10C Copyright 2014 Gandhi Puvvada
Spring 2014 EE457 Instructor: Gandhi Puvvada
Final Exam (30%) Date: 5/12/2014, Monday
Closed Book, Closed Notes, no Chit sheet, no Calculator
Time: 04:30-07:20 PM in THH102/THH301
Esperan Verilog Guide is allowed but not required
Total points: 261 Perfect score: 250 / 261
Name:
1 ( 90 points) 65 min.
Topic: pipeline design (refer to the block diagrams on pages 5 and 6 and the two OoO designs below)
1.1 HDU (in the ID stage) is overridden and told not to stall in some cases to avoid conflict with the operation of the branch instruction. If this applies, write YES otherwise write NO in the cells below.
1.1.1 _____________ (HDU_Br / HDU /both HDU and HDU_Br) ______ (is / are) called the guardian angel in the case of _________________________ (an early branch / a medium-delay branch / a late branch / multiple of these).
1.1.2 FU_Br in the early branch design checks to see if the senior instruction in the MEM stage is (a) a register-writing instruction (b) a register-writing R-Type instruction but not a lw instruction (c) neither of the above.
1.2 Compared to the late branch, an early branch has __________ (less / more) branch penalty due to flushing and this difference is seen in ________________________________ (successful branches / unsuccessful branches / both / neither). Compared to the late branch, an early branch causes __________ (less / more) RAW dependency stalls and this difference is seen in _____________________________ (successful branches / unsuccessful branches / both / neither).
[Figure: the two OoO designs, captioned "IoI - OoE - OoC design" and "IoI - OoE - IoC design"; the label ROB appears in the figure.]
3+2 pts
2+1 pts
3 pts
4+2 pts
1.3 Assembly language code can be written deliberately
(i) to make a late branch look better than an early branch True / False
(ii) to make a late branch look better than a medium-delay branch True / False
(iii) to make a medium-delay branch look better than an early branch True / False
1.4 There are four 5-bit comparison units in the ID stage of the lab 6 Part 5 design as shown below. This number 4 __________________ (includes / does not include) the two comparators needed for internal forwarding in the register file. What would this number be if it were the 7-stage pipeline of lab 6 part 4? _________ . This part 5 design provides a timing advantage in _________ (ID stage / EX stage / neither) and a timing disadvantage in _________ (ID stage / EX stage / neither).
1.5 Assuming that both early branch and late branch designs are working at the same frequency, the control unit of the early branch has to finish decoding the code a little more quickly compared to the control unit of the late branch. True / False
1.5.1 We can be more confident to move the 5-bit wide multiplexer on the side from the current EX stage to the ID stage in the case of the ________ (early / late) branch design. This move saves in the ID/EX stage register 1 bit for the RegDst control signal. Besides this RegDst signal, this move saves _____ (0 / 4 / 5 / 8 / 10) more bits in the current lab 6 design. Besides this RegDst signal, this move saves _____ (0 / 4 / 5 / 8 / 10) more bits in the lab 6 Part 5 design. This 5-bit multiplexer is in a time-critical path. True / False
1.6 Late branch can not be too late. WB stage in the 5-stage pipeline for the late branch execution is too late. True / FalseBranch outcome is announced from the CDB in our IoI-OoE-OoC design on page 1. Mr, Bruin says that CDB is like the WB stage hence branch execution is too late and our design is wrong. Please explain. ___________________________________________________________________________________________________________________________________________________________________________________________________________________________________
1.7 The stage register IF/ID of the in-order 5-stage pipeline is replaced by what in the IoI-OoE-OoC design on page 1? _____________________________________________________________________________________________________________________________________________
1.8 add is stalled for _____ (1 / 2) clock(s) in an in-order 5-stage pipeline and _____ (1 / 2) clock(s) in an in-order 7-stage pipeline. If lw incurs a cache miss, ______ (more / less) clocks are lost. This loss impacts CPI ________ (more / less / the same) in the IoI-OoE-OoC design on page 1.
3+2 pts
4+2 pts
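[Figure for 1.4: ID stage of the lab 6 Part 5 design, with blocks labeled Instr., Reg, Data, Control, PC, HDU, HDU_Br, FU, FU_Br, BRANCH, and Zero, and several EQ comparators, each with two 5-bit inputs.]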
2 pts
4+2 pts
5 pts
3 pts
Code for 1.8: lw $2, 2000($0); add $3, $2, $1;
4 pts
1.9 If we use 64 tokens or TAGs in the IoI-OoE-OoC design, the TAG FIFO will have ____ locations each of ____ bits per location. Since we compared this FIFO to ___________________ ________________________________ (paper tokens forming a virtual queue / pile of tokens on the cashier’s table in State Bank of India) it _____________________ (matters / does not matter) in which order the tokens 0 to 63 are placed initially in the token FIFO. The TAG FIFO should have a valid bit to indicate whether the location has a valid token or if it is empty. True / False
1.9.1 The RST (Register Status Table) is a look-up table to allow the dispatch unit to look up ________ ___________________________________ (entries for each of the source registers / entry for the destination register) of the instruction being dispatched (say add $3, $2, $1). This look-up is ____________________ (a search / an indexing) operation. The destination register gets renamed to a token drawn from the token FIFO. This register renaming causes _____________ (reading from / writing to) the Token FIFO and _____________ (reading from / writing to) the RST. The RST should have a valid bit besides the TOKEN field in each entry. True / False
1.9.2 Suppose a register-writing instruction comes on the CDB and says, "it is LION calling (or some 6-bit token in place of LION), and I am going to write the value 2000". Let us try to understand what the dispatch unit should do now. Circle all correct statements:
(a) the Dispatch unit does not do anything
(b) it indexes RST to find LION in RST
(c) it performs a parallel search to find LION in RST.
(d) if it finds LION across $2 in RST, it takes 2000 and deposits it in the register file in $2.
(e) if it finds LION across $2 in RST, and if it is currently dispatching an instruction with source register $2, it gives the value 2000 for its source register value. This is like the internal forwarding in the register file.
(f) if it finds LION across $2 in RST, it erases LION’s name across $2 and invalidates that entry.
(g) it reclaims the token LION and deposits it in the TOKEN FIFO at the location pointed to by the RP
(h) it reclaims the token LION and deposits it in the TOKEN FIFO at the location pointed to by the WP
1.10 Consider the four designs, (A) in-order 5-stage late branch, (B) in-order 5-stage early branch, (C) IoI-OoE-OoC, and (D) IoI-OoE-IoC.
1.10.1 A RAW problem for a memory location (i.e. lw dependent on a senior sw instruction with matching address) is naturally taken care of in _____________ (A/B/C/D/if multiple, then state them). Among the ones in which the problem needs to be taken care of by explicit logic, it takes less logic in ________ (A/B/C/D) compared to _________ (A/B/C/D).
1.10.2 WAW and WAR problems for memory locations do not exist in _______________ (A/B/C/D/if multiple, state them). Two store word instructions can leave LSQ in any order even if their addresses match in __________________________________ (C/D/neither C nor D/either C or D).
1.11 Every register writing instruction such as add $1, $2, $3 will end up writing into the destination register in the register file in __________________________ (IoI-OoE-OoC / IoI-OoE-IoC / both / neither) designs.
1.12 IFQ (Instruction PreFetch Queue) gets flushed ________________ (less often / more often / equally often) in the IoI-OoE-IoC design with branch prediction as compared to the IoI-OoE-OoC design with no branch prediction. In the IoI-OoE-IoC design, you end up flushing IFQ for every mis-prediction. T / F You flush IFQ when you predict a branch as ___________ (Taken/Not Taken/either/neither). Which is true in the IoI-OoE-OoC? ____ [(a) / (b)] (a) we just opted not to have branch prediction (b) Since it is OoC, branch prediction is not possible.
5+2 pts
5 pts
8+2 pts
4 pts
4 pts
4 pts
6+2 pts
[Figure: Pipelined CPU (Late Branch, from 1st Ed.) for the EE457 class Lab #6. Original drawing provided by Prof. Dubois, 3/26/2000. Stages: IF-Stage, ID-Stage, EX-Stage, MEM-Stage, WB-Stage, separated by the stage registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks: PC, Instruction memory, Registers, Control, Sign ext., Shift Left 2, ALU, ALU ctrl, Forwarding unit, Hazard detection unit, Data memory. Signals: ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg, RegWrite, Branch, Store_data, ALU_result, MEM_data, REG_data, IF.Flush, ID.Flush, EX.Flush.]
[Figure: Detailed implementation of Early Branch suggested in 3rd Ed. Designed by: Gandhi Puvvada, 10/18/06. Drawn by: Wei-jen Hsu. Stages: IF-Stage, ID-Stage, EX-Stage, MEM-Stage, WB-Stage, separated by the stage registers IF/ID, ID/EX, EX/MEM, MEM/WB. Blocks: PC, Instruction memory, Registers, Control, Sign ext., Shift Left 2, comparator (=) producing Zero, ALU, ALU ctrl, Forwarding Unit, FU_Br, Hazard detection unit, HDU_Br, Data memory. Signals: ALUSrc, ALUOp, RegDst, RegWrite_EX, MemRead_EX, MemRead_MEM, MemWrite, MemtoReg, RegWrite, WriteRegister_EX, WriteRegister_MEM, FW_RS, FW_RT, FW_RS_MEM, FW_RS_WB, FW_RT_MEM, FW_RT_WB, STALL, STALL_LW, STALL_BEQ, Branch, IF.Flush, fowarding_mux_control, Store_data, ALU_result, MEM_data, REG_data.]
2 ( 77 points) 45 min. Advanced topics Miscellaneous
2.1 Exceptions are taken in _____________________________ (program order / temporal order). Illegal instruction exception is often _____________________________ (a precise exception / an exception leading to abortion of the program) so as to support software emulation of unimplemented instructions.
2.2 In the two CMP (Chip Multi Processors) organizations shown below, the shared L2 cache is shown "banked" on the left but not on the right. Explain. ______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
2.2.1 If there are 8 banks of L2 cache in the left side organization, is it true that there can be 8 copies of a block in the 8 banks besides 8 copies in the 8 L1 caches? True / Not trueExplain: ____________________________________________________________________________________________________________________________________________________
2.3 To avoid an RWM race, you should be able to perform an atomic operation. What does RWM stand for? ____________________________________________________________________________
2.4 Compared to MSI protocol, MOESI protocol _______________ (reduces / increases) L2 to L1 transactions mainly because of _________ (O-state / E-state). Explain: _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
2.4.1 We proposed to split the O-state into two states: O-Dirty and O-Clean. You arrive in O-Dirty state from _________ (M/O-Clean/E/S/I) and arrive in O-Clean state from _________ (M/O-Dirty/E/S/I).
2.5 Branch Prediction from ID stage: In the diagram on the side, the (30-K) bit field of the PC is ______________________ (ignored / used as a TAG). Is there any connection between the frequency of aliasing and the BPB’s (Branch Prediction Buffer’s) depth (= 2^K) or width (1-bit predictor vs. 2-bit predictor)? ________________________________________________________________________________________________________________________________________
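As a rough illustration of the aliasing mentioned in 2.5, the sketch below assumes the BPB is indexed by the K bits just above the two always-zero low bits of a word-aligned PC, as the side diagram suggests. The function name bpb_index, the choice K = 8, and the two sample PC values are hypothetical and chosen only for illustration.

```python
def bpb_index(pc: int, k: int) -> int:
    """Return the K-bit BPB index taken from PC bits [k+1 : 2] (assumed layout)."""
    return (pc >> 2) & ((1 << k) - 1)

K = 8                      # hypothetical depth: 2**8 = 256 BPB entries
pc_a = 0x0000014C          # sample branch address (illustrative)
pc_b = 0x0001014C          # differs from pc_a only in the upper (30-K) bits

# Both branches select the same BPB entry, so they alias.
print(format(bpb_index(pc_a, K), "08b"), format(bpb_index(pc_b, K), "08b"))
```

Two branches whose addresses differ only above the index field land on the same entry; how often that happens depends on how many distinct entries the index can select.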
2.6 RAS stands for ________________________. It is usually ____________________ (4 or 8 locations / 1K to 2K locations) deep. The contents of RAS change during the execution of ______ _____ (jal / jr $31/both). RAS helps in the execution of ____________ (jal / jr $31/both) and such help __________ (is / isn’t) considered as a prediction that can go wrong.
4 pts
4 pts
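[Figure for 2.2: two CMP organizations, each with processors P0, P1, ..., P7 and one L1$ per processor. Left: a Memory Interconnection Network connecting the L1 caches to a shared (banked) L2 cache. Right: a shared L2 cache (no banks).]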
4 pts
2 pts
4 pts
4 pts
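[Figure for 2.5: the PC shown as a (30-K)-bit field, a K-bit field, and the low bits 00, with the example bit pattern 01010011; the K bits select an entry in a BPB of 2^K locations.]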
5 pts
6 pts
2.7 Our USC multi-threaded multi-core processor currently has 4 cores each with 4 threads, and the core resources are well utilized by the 4 threads. Due to process improvements we have more silicon in the next generation processors. We are considering two choices: (i) 8 cores each with 4 threads, (ii) 4 cores each with 8 threads. You recommend going for ______ (i / ii). Explain: _______________________________________________________________________________________________________________________________________________________________________
Choice ______ (i / ii) is expected to require _________ (more / less) silicon compared to choice ______ (i / ii). Explain: ___________________________________________________________ ______________________________________________________________________________
2.8 Two challenges in compiler design for the current super-scalar super-pipelined processors: (i) avoid pairs of dependent instructions, (ii) schedule instructions into longer delay slots for load-word and branch instructions. The first is important because of the ____________ (super-scalar / super-pipelined) aspect of our processor and we said "pairs" because we assumed that ______________________________________________________________________________________
2.9 MPI (Miss Penalty per Instruction) is used to estimate the impact of cache misses on CPI.Assume that the CPI without cache misses is 1.35. In a system with L1 and L2 caches, the overall CPI was calculated as 1.35 + (0.04 * 25) + (0.01 * 200) = 4.35. What are the numbers 0.04, 25, 0.01, and 200? _____________________________________________________________________________________________________________________________________________________________________________________________________________________________
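The arithmetic in 2.9 can be checked with a short sketch. It only reproduces the sum of products given in the question; the helper name overall_cpi and the neutral parameter names factor_a and factor_b are hypothetical and deliberately do not interpret the four numbers.

```python
def overall_cpi(base_cpi, terms):
    """Add each (factor_a * factor_b) product to the base CPI."""
    return base_cpi + sum(factor_a * factor_b for factor_a, factor_b in terms)

# The two product terms exactly as written in the question.
cpi = overall_cpi(1.35, [(0.04, 25), (0.01, 200)])
print(round(cpi, 2))
```

The two products contribute 1.0 and 2.0 cycles per instruction respectively, so the second term dominates the overall CPI.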
2.10 Intel HTT stands for ___________________________ and it is the same as ___________________________________ (fine-grain / coarse-grain / simultaneous) multi-threading. It uses the ________ (ILP/TLP) advantage of its out-of-order execution engine together with ________ (ILP/TLP). _______ (Like / Unlike) other multi-threading techniques, here they do not need to roll back a thread on a cache miss because ________________________________________________________________________________________________________________________________________________________________________________________________________________
2.11 Mr. Bruin joined USC and was appointed as EE457 lab grader. He was grading Lab 7 Part 3 Subpart 4 (RTL coding of the pipeline with EX1 and EX2 merged into EX12). Several students made errors in the stall logic coding. He was puzzled because their _________________ (Reg. file contents / TimeSpace.txt) came out correct but their ________________ (Reg. file contents / TimeSpace.txt). He expected both to be wrong. Explain: _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
2.11.1 In our RTL coding, we produced two STALL signals ( declared on the side. Which one of the two produces a waveform easier to understand and why? Is it necessary to declare STALL_combinational as a wire? __________________________________________________________________________________________________________________________________________________________________________
2.11.2 In RTL coding, forwarding muxes output in an "if" statement in a clocked always block is usually assigned using the _______________ (blocking / non-blocking) procedural assignment operator because _____________________________________________________________________________________________________________________________________________________________
8 pts
3 pts
6 pts
8 pts
8 pts
reg STALL;
wire STALL_combinational;
6 pts
[Figure: blocks labeled X1_mux and SUB 3, with signals A and A-3]
5 pts
3 ( 14 + 18 = 32 points) 20 min. Topic: Lab 7 Part 3 Subpart 2
3.1 Given on the side is the incomplete stall logic. Label the three points as Stalled, Add1, and Stall as appropriate. The Stall output is a __________ (Mealy / Moore) output. This is a state machine implementation with an encoded state assignment and has _____ (1/2/3/4) states. Complete the state diagram below. Write state transition conditions and write, "if ADD 1, then STALL" in one of the two states. And then try to implement a 1-hot coded state machine for the same state diagram. Produce the STALL output.
3.2 A similar state diagram is shown below in nearly completed form to stall for 1 clock for an M (MULT) instruction, to stall for 2 clocks for a D (DIVIDE) instruction, and not to stall for other instructions. The state transition conditions are completed. Conditional STALL output generation statements need to be added in one or two or three state circles as appropriate. Some suggestions: (i) If (M | D), then STALL (ii) If (M) then STALL and (iii) If (D) then STALL
Implement the one-hot state machine and generate the STALL signal (i.e., do the OFL). Let us use A (ADD) and S (SUB) for other instructions which do not need stalls. If A is in the multi-functional EX stage in the current clock and the above state machine is in the IDLE state, write the sequence of states the above machine goes through for D M S M A instructions.
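[Figure for 3.1: the incomplete stall logic, built around a D flip-flop (D, Q, CLK, CLR) clocked by CLK and cleared by RESET_B.]
14 pts
[Figure for 3.1: state diagram with states IDLE and STALLED, entering IDLE on RESET_B; one-hot implementation with two D flip-flops clocked by CLK: QIDLE (with PRESET driven by RESET_B) and QSTALLED (with CLR driven by RESET_B).]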
18 pts
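[Figure for 3.2: state diagram with states IDLE, STALLED_1, and STALLED_2, entering IDLE on RESET_B; transition labels as extracted: "M | D", "M & D", "D", "D", "1" (complement bars, if any, are not recoverable from the extracted text); one-hot implementation with three D flip-flops clocked by CLK: QIDLE (with PRESET driven by RESET_B), QSTALLED_1 and QSTALLED_2 (each with CLR driven by RESET_B).]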
IDLE, STALLED_1, STALLED_2,
4 ( 27 points) 20 min. adder, subtracter, incrementer, decrementer
4.1 We know we do (A - B) by doing (A + B’ + 1). To perform (X - 2Y), we do ______ (i / ii / iii / iv / v)
(i) [X + (Y’ || 1’b0) + 1]
(ii) [X + (Y’ || 1’b0) + 2]
(iii) [X + (Y’ || 1’b1) + 1]
(iv) [X + (Y’ || 1’b1) + 2]
(v) none of these, we need to do ______________________________ .
Note: || means concatenate
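The identity quoted at the start of 4.1, (A - B) = (A + B’ + 1), can be checked with a minimal sketch. It assumes n-bit operands with the result truncated to n bits; the function name sub_via_add and the default width of 4 bits are hypothetical choices for illustration, and the sketch covers only the plain (A - B) case, not (X - 2Y).

```python
def sub_via_add(a: int, b: int, n: int = 4) -> int:
    """Compute (a - b) mod 2**n as a + b' + 1 using only n-bit addition."""
    mask = (1 << n) - 1
    b_inv = ~b & mask              # B': bitwise complement kept to n bits
    return (a + b_inv + 1) & mask  # drop the carry out of bit n-1

# Exhaustive check over all pairs of 4-bit operands.
ok = all(sub_via_add(a, b) == (a - b) % 16 for a in range(16) for b in range(16))
print(ok, sub_via_add(9, 3))
```

The check is done modulo 2^n, so it says nothing about overflow; detecting unsigned or signed overflow needs the carry and sign information that the mask discards here.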
4.1.1 Say the numbers are all 4-bit numbers and we have a 4-bit adder/subtracter to perform (X - 2Y) by simply discarding (dropping) Y3’. In which cases does discarding Y3’ not change the value of (-2Y)? Answer separately for unsigned numbers and for signed numbers.
Unsigned numbers: ____________________________________________________________________________________________________________________________________________
Signed numbers: ______________________________________________________________________________________________________________________________________________
4.1.2 Let us play safe and use a 5-bit adder and convert it to a subtracter to perform (X-2Y). Complete the two designs by using inverters, XOR gates, etc. as needed. Produce USO (unsigned subtraction overflow) and SSO (signed subtraction overflow). You do not have to simplify the FULL-ADDER building blocks.
4.2 On the left, we have a chart for delays in gates for the incrementer/decrementer design based on CLA design assuming 4 as the blocking factor (4 CLL carry look-ahead logic boxes require one next-level CLL). Complete the table on the right for a new blocking factor of 3.
For the left-side design, how many CLL boxes are needed for a 1024-bit incrementer? ________ Think of a formula to arrive at this number rather than using a brute-force method!
Are the CLL boxes identical in the case of incrementer as well as decrementer? Yes / No
Do they both contain 1-level logic or 2-level logic? _______________________
In ___________________ (an incrementer / a decrementer), C3 = g2 + g1 + g0 + C0 because every cell agrees to _________ (generate / propagate).
In ___________________ (an incrementer / a decrementer), C3 = p2.p1.p0.C0 because no cell _________ (generates / propagates) a carry.
4 pts
4 pts
6 pts
[Figure for 4.1.2: two rows of five full-adder cells (inputs a, b, cin; outputs s, cout) producing R4 R3 R2 R1 R0, with C0 at the least significant cell. The first row is labeled UNSIGNED SUBTRACTOR and the second SIGNED SUBTRACTER.]
Do not forget to produce USO.
13 pts
[Tables for 4.2: delay chart for blocking factor = 4 (left); table to complete for blocking factor = 3 (right)]
5 ( 35 points) 25 min. Virtual memory
5.1 Given ________ (VPN / VA), we get ________ (PPFN / PA) from _____________________ (the TLB / the PT / either the TLB or the PT if TLB does not have) if the page is ___________ (present / absent) in the MM.
5.2 A TLB of 17 entries (17 is a prime number and does not have any factors and is certainly not a power of 2) is possible if the TLB uses a _______________ (fully-associative / set-associative / direct) mapping. Usually either a fully-associative mapping or a set-associative mapping with liberal associativity is used for the TLB because (talk about both cost and performance) ______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________The same thing ______ (can / can not) be said about cache because cache is much ___________ (bigger/smaller) and cost will be much __________ (higher / smaller) and the penalty of a miss is relatively _____ ________ (smaller / larger).
5.3 In a way both TLB and page Table are look-up table but so far as indexing or searching is concerned, ________________________________________________________________________________________________________________________________________________________________________________________________________________________________Direct mapped cache is ____________________________ (indexed / searched in parallel), a fully associative cache is ____________________________ (indexed / searched in parallel).
5.4 Two processes, one with process id 2 and another with process id 3, can both use the location with 32-bit address 00002000 as in lw $2, 2000($0). Is this 32-bit address 00002000 a virtual address (VA) or a physical address (PA)? ______ (VA / PA). Assuming a fully associative TLB, is the page number 00002 (upper 20 bits of the 32-bit address 00002000) in the VPN field or the PPFN field of an entry in the TLB? __________ (VPN / PPFN) field.
5.5 In a multi-level page table, the _______ (VPN/PPFN) is present in the entries of ____________ (every level / the first A level / the last D level / other ..). In the example on the side the ______ (VA/VPN/PA/PPFN) is shown divided into 4 fields: A, B, C, D. It takes ___________ (minimum/maximum/always) ____ (state a number like 2) accesses to the example table on the side before you declare page table hit. You may be able to declare page fault after 1 or 2 or 3 or 4 accesses to the PT. ______ (T / F). "A" table is only one. ______ (T / F). Number of "B" tables is less than or equal to number of "C" tables. ______ (T / F). _________ (Page Table / TLB / Both / Neither) is flushed on context switch. Explain: ______________________________________________________________________________________________________________________________________________________
4 pts
9 pts
6 pts
3 pts
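[Figure for 5.5: a 13-bit field (example pattern 1010111100011) divided into A, B, C, D; PTBR points to the A table (16 entries), followed by B tables (16 entries), C tables (8 entries), and D tables (4 entries); the final level yields the PPFN.]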
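As a small worked example of how the pattern in the figure for 5.5 could be cut into its four index fields, the sketch below assumes that each field is just wide enough to index its table (16, 16, 8, and 4 entries give 4, 4, 3, and 2 bits, 13 bits in total) and that A is the most significant field. The names ENTRIES and split_fields are hypothetical.

```python
# Entries per table at each level, as labeled in the figure.
ENTRIES = {"A": 16, "B": 16, "C": 8, "D": 4}

def split_fields(bits: str) -> dict:
    """Split a bit string into A, B, C, D index fields, most significant first."""
    # A table with 2**w entries needs a w-bit index (assumed: sizes are powers of 2).
    widths = {name: n.bit_length() - 1 for name, n in ENTRIES.items()}
    assert len(bits) == sum(widths.values()), "bit string must match total width"
    fields, pos = {}, 0
    for name, w in widths.items():
        fields[name] = int(bits[pos:pos + w], 2)
        pos += w
    return fields

print(split_fields("1010111100011"))
```

Each field value would serve as the index into the table at its level, with the entry found there pointing to the table used at the next level.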
13 pts
We enjoyed teaching this course! Hope you liked it! Hope to see some of you in EE560. Grades will be out in a week. Enjoy your semester break! Happy Summer Holidays!!! - Gandhi and TA Yue, Mentors: Manasa, Vibha, Lai, Binal, Graders: Atit, Guan (Crystal), Guan (Wade), Mukhdeep, Ruozhi, and Zhe Happy Semester Break!