Trends in High-Performance Computer Architecture
David J. Lilja
Department of Electrical Engineering
Center for Parallel Computing
University of Minnesota
Minneapolis
E-mail: [email protected]
Phone: 625-5007
FAX: 625-4583
1 -- Lilja University of Minnesota April 1996
1: direction of movement: FLOW
2 a: a prevailing tendency or inclination: DRIFT
2 b: a general movement: SWING
2 c: a current style or preference: VOGUE
2 d: a line of development: APPROACH
Webster’s Dictionary
It is very difficult to make an accurate prediction, especially about the future.
Niels Bohr
Historical Trends and Perspective
pre-WW II: Mechanical calculating machines
WW II - 50’s: Technology improvement
- relays → vacuum tubes
- high-level languages
70’s: Semantic gap
- complex instruction sets
- language support in hardware
- microcoding
80’s: Keep It Simple, Stupid
- RISC vs CISC debate
- shift complexity to software
90’s: What to do with all of these transistors?
- large on-chip caches
- prefetching hardware
- speculative execution
- special-purpose instructions
- multiple processors on-a-chip
What is Computer Architecture?
It has nothing to do with buildings.
Goals of a computer designer
- control complexity
- maximize performance
- minimize cost?
Use levels of abstraction
silicon and metal → transistors → gates → flip-flops → registers → functional units → processors → systems
Architecture
- defines interface between higher levels and software
- requires close interaction between
  * HW designer
  * SW designer
Performance Metrics
System throughput
- work per unit time → rate
- used by system managers
Execution time
- how long to execute your application
- used by system designers and users
Texec = n instrs * (# cycles / # instrs) * (seconds / cycle)
      = n * CPI * Tclock
Example
Texec = 900 M instrs * (1.8 cycles / instr) * (10 ns / cycle) = 16.2 sec
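The arithmetic in the example can be checked with a short Python sketch. The function name execution_time and its parameter names are chosen here for illustration and do not come from the slides.

```python
def execution_time(n_instrs, cpi, t_clock_s):
    """Return execution time in seconds: Texec = n * CPI * Tclock."""
    return n_instrs * cpi * t_clock_s


# Values from the example: 900 M instructions, 1.8 cycles/instr, 10 ns/cycle.
t_exec = execution_time(900e6, 1.8, 10e-9)
print(f"Texec = {t_exec:.1f} sec")
```

Because the three factors multiply, cutting any one of them by some fraction cuts Texec by the same fraction, as long as the other two stay fixed.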
Improving Performance
Texec = Tclock * n * CPI
Improve clock rate, i.e. reduce the clock cycle time, Tclock
Reduce total number of instructions executed, n
Reduce average number of cycles per instruction, CPI
1) Improving the Clock Rate
Use faster technology
- BiCMOS, ECL, etc
- smaller features to reduce propagation delay
Pipelining
- reduce the amount of work per clock cycle
instr fetch → instr decode → generate effective op addr → operand fetch → execute → operand write
Performance improvement
- reduces Tclock
- overlaps execution of instructions → parallelism
Maximum speedup ≤ pipeline depth
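A minimal sketch of why the speedup is bounded by the pipeline depth, under idealized assumptions that are not stated on the slide: every stage takes exactly one cycle, the unpipelined machine spends depth cycles on each instruction, there are no hazards or stalls, and the inter-stage registers add no delay. The function name pipeline_speedup is illustrative.

```python
def pipeline_speedup(n_instrs, depth):
    """Speedup of an ideal depth-stage pipeline over unpipelined execution.

    Assumes one cycle per stage, depth cycles per instruction without
    pipelining, and no hazards, stalls, or register overhead.
    """
    unpipelined_cycles = n_instrs * depth
    # The first instruction needs depth cycles to fill the pipe; after that
    # one instruction completes every cycle.
    pipelined_cycles = depth + (n_instrs - 1)
    return unpipelined_cycles / pipelined_cycles


for n in (1, 10, 1000, 1_000_000):
    print(n, round(pipeline_speedup(n, depth=6), 3))
```

The ratio approaches the depth as the instruction count grows but never exceeds it, because the start-up cycles are never recovered.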
Cost of Pipelining
More hardware
- need registers between each pipe segment
Data hazards
- data needed by instr i+x from instr i has not been calculated
Branch hazards
- began executing instrs from wrong branch path
Branch Penalty
Instruction i+2 branches to instr j
- branch resolved in stage 5
cycle   pipeline segment #
        1     2     3     4     5     6
  1     i     -     -     -     -     -     (start up latency)
  2     i+1   i     -     -     -     -     (start up latency)
  3     i+2   i+1   i     -     -     -     (start up latency)
  4     i+3   i+2   i+1   i     -     -     (start up latency)
  5     i+4   i+3   i+2   i+1   i     -     (start up latency)
  6     i+5   i+4   i+3   i+2   i+1   i     instruction i finished
  7     X     X     X     X     i+2   i+1   instruction i+1 finished
  8     j     X     X     X     X     i+2   instruction i+2 finished
  9     j+1   j     X     X     X     X     (branch penalty)
 10     j+2   j+1   j     X     X     X     (branch penalty)
 11     j+3   j+2   j+1   j     X     X     (branch penalty)
 12     j+4   j+3   j+2   j+1   j     X     (branch penalty)
 13     j+5   j+4   j+3   j+2   j+1   j     instruction j finished
 14     j+6   j+5   j+4   j+3   j+2   j+1   instruction j+1 finished
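The pipeline contents in the table can be regenerated with a small simulation. This is a sketch under the table's own assumptions (six stages, instruction i+2 is a taken branch resolved in stage 5, wrong-path instructions are squashed in the cycle the branch resolves). The function name simulate_taken_branch and its parameters are hypothetical.

```python
def simulate_taken_branch(depth=6, branch_index=2, resolve_stage=5, cycles=14):
    """Return one list per cycle giving the contents of stages 1..depth."""
    pipe = ["-"] * depth      # "-" = empty during start-up, "X" = squashed
    branch = "i" if branch_index == 0 else f"i+{branch_index}"
    on_target_path = False    # becomes True once the branch has resolved
    next_offset = 0           # offset from i, or from j after the branch
    rows = []
    for _ in range(cycles):
        base = "j" if on_target_path else "i"
        fetched = base if next_offset == 0 else f"{base}+{next_offset}"
        next_offset += 1
        pipe = [fetched] + pipe[:-1]   # every instruction advances one stage
        if not on_target_path and pipe[resolve_stage - 1] == branch:
            # Branch resolves as taken: squash everything fetched after it
            # and restart fetching at the branch target j.
            for stage in range(resolve_stage - 1):
                pipe[stage] = "X"
            on_target_path = True
            next_offset = 0
        rows.append(list(pipe))
    return rows


for cycle, row in enumerate(simulate_taken_branch(), start=1):
    print(f"{cycle:>2}  " + " ".join(f"{slot:<4}" for slot in row))
```

Counting the rows in which stage 6 holds an X gives the number of cycles in which no instruction finishes: four here, one for each stage ahead of the stage where the branch is resolved.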
Data hazards produce similar pipeline bubbles
- i+3 needs data generated by i+2
- i+3 stalled until i+2 in stage 5
Solutions to hazards
- data bypassing
- instruction reordering
- branch prediction
- delayed branch
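One way to see what branch prediction buys is a rough average-CPI estimate. The sketch below uses a common textbook approximation that is not given on the slides: each branch that pays the penalty adds a fixed number of stall cycles to a base CPI. The branch fraction and misprediction rates are made-up illustrative values; only the 4-cycle penalty corresponds to the branch penalty example. The function name effective_cpi is hypothetical.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Approximate average CPI including branch stall cycles.

    Assumes only branch stalls add to the base CPI and that every branch
    paying the penalty costs the same number of cycles.
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Illustrative assumptions: 20% of instructions are branches, 4-cycle penalty.
no_prediction = effective_cpi(1.0, 0.20, 1.00, 4)    # every branch pays
with_prediction = effective_cpi(1.0, 0.20, 0.10, 4)  # assume 10% mispredicted
print(f"CPI without prediction: {no_prediction:.2f}")
print(f"CPI with prediction:    {with_prediction:.2f}")
```

Under this model the stall term shrinks in proportion to the misprediction rate, which is why prediction accuracy matters more as pipelines get deeper and the penalty grows.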
2) Reduce Number of Instructions Executed
Texec = Tclock * n * CPI
CISC -- Complex Instruction Set Computer
- powerful instrs to reduce instr count