EECS 252 Graduate Computer Architecture Lecture 2 0 Review of Instruction Sets, Pipelines, and Caches January 26 th , 2009 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252
EECS 252 Graduate Computer Architecture
Lecture 2
0 Review of Instruction Sets, Pipelines, and Caches
January 26th, 2009
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
1/26/2009 CS252-S09, Lecture 02 2
Review: Moore’s Law
• “Cramming More Components onto Integrated Circuits”
– Gordon Moore, Electronics, 1965
• # of transistors on a cost-effective integrated circuit doubles every 18 months
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October, 2006
“Bell’s Law” – new class per decade
[Figure: log(people per computer) vs. year; successive computer classes labeled “Number Crunching, Data Storage”, “productivity, interactive”, and “streaming information to/from physical world”]
• Enabled by technological opportunities
• Smaller, more numerous and more intimately connected
• Brings in a new kind of application
• Used in many ways not previously imagined
Today: Quick review of everything you should have learned
ℵ0 (A countably-infinite set of computer architecture concepts)
Metrics used to Compare Designs
• Cost
– Die cost and system cost
• Execution Time
– average and worst-case
– Latency vs. Throughput
• Energy and Power
– Also peak power and peak switching current
• Reliability
– Resiliency to electrical noise, part failure
– Robustness to bad software, operator error
• Maintainability
– System administration costs
• Compatibility
– Software costs dominate
Cost of Processor
• Design cost (Non-recurring Engineering Costs, NRE)
– dominated by engineer-years (~$200K per engineer year)
– also mask costs (exceeding $1M per spin)
• Cost of die
– die area
– die yield (maturity of manufacturing process, redundancy features)
– cost/size of wafers
– die cost ~= f(die area^4) with no redundancy
• Cost of packaging
– number of pins (signal + power/ground pins)
– power dissipation
• Cost of testing
– built-in test features?
– logical complexity of design
– choice of circuits (minimum clock rates, leakage currents, I/O drivers)
Architect affects all of these
What is Performance?
• Latency (or response time or execution time)
– time to complete one task
• Bandwidth (or throughput)
– tasks completed per unit time
Definition: Performance
• Performance is in units of things per sec
– bigger is better
• If we are primarily concerned with response time
$$\text{performance}(x) = \frac{1}{\text{execution\_time}(x)}$$
• "X is n times faster than Y" means
$$n = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = \frac{\text{Execution\_time}(Y)}{\text{Execution\_time}(X)}$$
Performance: What to measure
• Usually rely on benchmarks vs. real workloads
• To increase predictability, collections of benchmark applications (benchmark suites) are popular
• SPECCPU: popular desktop benchmark suite
– CPU only, split between integer and floating point programs
– SPECint2000 has 12 integer, SPECfp2000 has 14 floating-point pgms
– SPECCPU2006 to be announced Spring 2006
– SPECSFS (NFS file server) and SPECWeb (WebServer) added as server benchmarks
• Transaction Processing Performance Council (TPC) measures server performance and cost-performance for databases
– TPC-C Complex query for Online Transaction Processing
– TPC-H models ad hoc decision support
– TPC-W a transactional web benchmark
– TPC-App application server and web services benchmark
Summarizing Performance
Which system is faster?
System Rate (Task 1) Rate (Task 2)
A 10 20
B 20 10
… depends who’s selling
Average throughput
System   Rate (Task 1)   Rate (Task 2)   Average
A        10              20              15
B        20              10              15
Throughput relative to B
System   Rate (Task 1)   Rate (Task 2)   Average
A        0.50            2.00            1.25
B        1.00            1.00            1.00
Throughput relative to A
System   Rate (Task 1)   Rate (Task 2)   Average
A        1.00            1.00            1.00
B        2.00            0.50            1.25
Summarizing Performance over Set of Benchmark Programs
Arithmetic mean of execution times t_i (in seconds)
$$\frac{1}{n} \sum_{i} t_i$$
Harmonic mean of execution rates r_i (MIPS/MFLOPS)
$$\frac{n}{\sum_{i} (1/r_i)}$$
• Both equivalent to workload where each program is run the same number of times
• Can add weighting factors to model other workload distributions
Normalized Execution Time and Geometric Mean
• Measure speedup relative to reference machine
$$\text{ratio} = t_{\text{Ref}} / t_A$$
• Average time ratios using geometric mean
$$\sqrt[n]{\prod_{i} \text{ratio}_i}$$
• Insensitive to machine chosen as reference
• Insensitive to run time of individual benchmarks
• Used by SPEC89, SPEC92, SPEC95, …, SPEC2006
… But beware that the choice of reference machine can suggest what a “normal” performance profile is.
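A minimal sketch of the "insensitive to reference machine" claim, reusing the two-system, two-task rates from the "depends who's selling" tables. The helper names (`normalize`, `arithmetic_mean`, `geometric_mean`) are illustrative assumptions, not part of the lecture.

```python
from math import prod

# Rates for two tasks on systems A and B (values from the example tables).
rates = {"A": [10.0, 20.0], "B": [20.0, 10.0]}

def normalize(system, reference):
    """Per-task rates of `system` divided by those of `reference`."""
    return [s / r for s, r in zip(rates[system], rates[reference])]

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return prod(xs) ** (1.0 / len(xs))

for ref in ("A", "B"):
    for system in ("A", "B"):
        ratios = normalize(system, ref)
        print(f"ref={ref} system={system} "
              f"arith={arithmetic_mean(ratios):.2f} "
              f"geo={geometric_mean(ratios):.2f}")
```

With the arithmetic mean, whichever system is not the reference looks 1.25X better; with the geometric mean both systems come out at 1.00 under either reference.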
Vector/Superscalar Speedup
• 100 MHz Cray J90 vector machine versus 300 MHz Alpha 21164
• [LANL Computational Physics Codes, Wasserman, ICS’96]
• Vector machine peaks on a few codes????
Superscalar/Vector Speedup
• 100 MHz Cray J90 vector machine versus 300 MHz Alpha 21164
• [LANL Computational Physics Codes, Wasserman, ICS’96]
• Scalar machine peaks on one code???
How to Mislead with Performance Reports
• Select pieces of workload that work well on your design, ignore others
• Use unrealistic data set sizes for application (too big or too small)
• Report throughput numbers for a latency benchmark
• Report latency numbers for a throughput benchmark
• Report performance on a kernel and claim it represents an entire application
• Use 16-bit fixed-point arithmetic (because it’s fastest on your system) even though application requires 64-bit floating-point arithmetic
• Use a less efficient algorithm on the competing machine
• Report speedup for an inefficient algorithm (bubblesort)
• Compare hand-optimized assembly code with unoptimized C code
• Compare your design using next year’s technology against competitor’s year old design (1% performance improvement per week)
• Ignore the relative cost of the systems being compared
• Report averages and not individual results
• Report speedup over unspecified base system, not absolute times
• Report efficiency not absolute times
• Report MFLOPS not absolute times (use inefficient algorithm)
[ David Bailey “Twelve ways to fool the masses when giving performance results for parallel supercomputers” ]
Amdahl’s Law
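$$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$
$$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$
Best you could ever hope to do:
$$\text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}}$$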
Amdahl’s Law example
• New CPU 10X faster
• I/O bound server, so 60% time waiting for I/O
$$\text{Speedup}_{\text{overall}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} = \frac{1}{(1 - 0.4) + \frac{0.4}{10}} = \frac{1}{0.64} = 1.56$$
• Apparently, it’s human nature to be attracted by 10X faster, vs. keeping in perspective it’s just 1.6X faster
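A short sketch of the Amdahl's Law arithmetic for the I/O-bound server example. The function names are illustrative assumptions; the inputs (40% of time enhanced, 10X faster CPU) come from the example.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

def amdahl_limit(fraction_enhanced):
    """Upper bound on speedup as the enhancement becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)

# 60% of the time is I/O wait, so only 40% benefits from the 10X CPU.
print(round(amdahl_speedup(0.4, 10), 2))  # 1.56
print(round(amdahl_limit(0.4), 2))        # 1.67
```

Even an infinitely fast CPU would give at most about 1.67X here, because the 60% spent waiting on I/O is untouched.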
Computer Performance
$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
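The CPU time equation can be evaluated directly. This is a sketch with made-up inputs: the instruction count, CPI, and clock rate below are hypothetical values chosen for illustration, not figures from the lecture.

```python
def cpu_time_seconds(instructions, cycles_per_instruction, clock_hz):
    """CPU time = instructions x CPI x seconds per cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cycles_per_instruction * seconds_per_cycle

# Hypothetical program: 1e9 instructions, CPI of 1.5, 1.6 GHz clock.
print(cpu_time_seconds(1e9, 1.5, 1.6e9))  # 0.9375
```

Each factor is influenced by a different layer: instruction count by the program, compiler, and ISA; CPI by the ISA and organization; cycle time by the organization and technology.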
Problems with Pipelining
• Exception: An unusual event happens to an instruction during its execution
– Examples: divide by zero, undefined opcode
• Interrupt: Hardware signal to switch the processor to a new instruction stream
– Example: a sound card interrupts when it needs more audio output samples (an audio “click” happens if it is left waiting)
• Problem: The exception or interrupt must appear to occur between 2 instructions (I_i and I_{i+1})
– The effect of all instructions up to and including I_i is totally complete
– No effect of any instruction after I_i can take place
• The interrupt (exception) handler either aborts the program or restarts at instruction I_{i+1}
Precise Exceptions in Static Pipelines
Key observation: architected state only changes in the memory and register write stages.
Memory Hierarchy Review
Since 1980, CPU has outpaced DRAM ...
[Figure: Performance (1/latency), log scale from 10 to 1000, vs. year, 1980 to 2000, with one curve for CPU and one for DRAM]
• CPU: 60% per yr (2X in 1.5 yrs)
• DRAM: 9% per yr (2X in 10 yrs)
• Gap grew 50% per year
• How do architects address this gap?
– Put small, fast “cache” memories between CPU and DRAM.
– Create a “memory hierarchy”
1977: DRAM faster than microprocessors
• Apple ][ (1977): Steve Wozniak, Steve Jobs
• CPU: 1000 ns
• DRAM: 400 ns
Memory Hierarchy of a Modern Computer
• Take advantage of the principle of locality to:
– Present as much memory as is available in the cheapest technology
– Provide access at the speed offered by the fastest technology
[Figure: Processor (Control, Datapath, Registers, On-Chip Cache), then Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Tape)]
Level                                       Speed (ns)                   Size (bytes)
Registers                                   1s                           100s
On-chip cache, second-level cache (SRAM)    10s-100s                     Ks-Ms
Main memory (DRAM)                          100s                         Ms
Secondary storage (disk)                    10,000,000s (10s ms)         Gs
Tertiary storage (tape)                     10,000,000,000s (10s sec)    Ts
The Principle of Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straightline code, array access)
• Last 15 years, HW relied on locality for speed
Programs with locality cache well ...
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
[Figure: Memory Address (one dot per access) vs. Time, with regions labeled “Spatial Locality”, “Temporal Locality”, and “Bad locality behavior”]
Memory Hierarchy: Apple iMac G5
iMac G5, 1.6 GHz
                    Reg      L1 Inst   L1 Data   L2       DRAM     Disk
Size                1K       64K       32K       512K     256M     80G
Latency (cycles)    1        3         3         11       88       10^7
Latency (time)      0.6 ns   1.9 ns    1.9 ns    6.9 ns   55 ns    12 ms
• Management, from the fastest levels to the slowest: managed by compiler; managed by hardware; managed by OS, hardware, application
• Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access
Goal: Illusion of large, fast, cheap memory
iMac’s PowerPC 970: All caches on-chip
[Die photo labels: Registers (1K), L1 (64K Instruction), L1 (32K Data), 512K L2]
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: Time to access the upper level, which consists of RAM access time + Time to determine hit/miss
• Miss: data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor
• Hit Time << Miss Penalty (500 instructions on 21264!)
[Figure: Processor exchanges data with the Upper Level Memory (holding Blk X), which exchanges blocks with the Lower Level Memory (holding Blk Y)]
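The three terms above combine in the standard average memory access time (AMAT) formula from Hennessy and Patterson, which this slide does not state explicitly: AMAT = Hit Time + Miss Rate x Miss Penalty. A minimal sketch, with hypothetical numbers chosen only for illustration:

```python
def average_memory_access_time(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all in the same time unit)."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical cache: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
print(average_memory_access_time(1, 0.05, 100))  # 6.0
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average: here 5% of accesses contribute five of the six cycles.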
1/26/2009 CS252-S09, Lecture 02 68
4 Questions for Memory Hierarchy
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)
1/26/2009 CS252-S09, Lecture 02 69
Q1: Where can a block be placed in the upper level?
• Block 12 placed in 8 block cache:
– Fully associative, direct mapped, 2-way set associative
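A sketch of the placement rule for the block 12 example, assuming the usual mapping (set = block address mod number of sets) and numbering cache frames so that each set occupies consecutive frames. `candidate_frames` is an illustrative name, not from the lecture.

```python
def candidate_frames(block_addr, num_frames, ways):
    """Frames in which a memory block may be placed.

    ways == 1 is direct mapped; ways == num_frames is fully associative.
    """
    num_sets = num_frames // ways
    set_index = block_addr % num_sets
    first_frame = set_index * ways
    return list(range(first_frame, first_frame + ways))

print(candidate_frames(12, 8, 1))  # direct mapped: [4]
print(candidate_frames(12, 8, 2))  # 2-way set associative: [0, 1]
print(candidate_frames(12, 8, 8))  # fully associative: [0, 1, 2, 3, 4, 5, 6, 7]
```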
A Summary on Sources of Cache Misses
• Coherence (Invalidation): other process (e.g., I/O) updates memory
Q2: How is a block found if it is in the upper level?
• Index used to look up candidates in cache
– Index identifies the set
• Tag used to identify actual copy
– If no candidates match, then declare cache miss
• Block is minimum quantum of caching
– Data select field used to select data within block
– Many caching applications don’t have data select field
[Figure: address = Block Address (Tag, Index) followed by Block offset; the Index is the Set Select and the Block offset is the Data Select]
Direct Mapped Cache
• Direct Mapped 2^N byte cache:
– The uppermost (32 - N) bits are always the Cache Tag
– The lowest M bits are the Byte Select (Block Size = 2^M)
• Example: 1 KB Direct Mapped Cache with 32 B Blocks
– Index chooses potential block
– Tag checked to verify block
– Byte select chooses byte within block
[Figure: address = Cache Tag (bits 31 to 10, Ex: 0x50), Cache Index (bits 9 to 5, Ex: 0x01), Byte Select (bits 4 to 0, Ex: 0x00); the cache array has rows 0 to 31, each with a Valid Bit, a Cache Tag (0x50 in the example row), and 32 bytes of Cache Data (Byte 0 … Byte 31, Byte 32 … Byte 63, …, Byte 992 … Byte 1023)]
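The field split for the 1 KB, 32 B block example can be sketched as bit operations. The constant and function names are illustrative assumptions; the example address is built from the slide's field values (tag 0x50, index 0x01, byte select 0x00).

```python
CACHE_BYTES = 1024   # 1 KB direct-mapped cache
BLOCK_BYTES = 32     # 32 B blocks

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                   # 5 byte-select bits
INDEX_BITS = (CACHE_BYTES // BLOCK_BYTES).bit_length() - 1   # 5 index bits

def split_address(addr):
    """Split a 32-bit address into (tag, index, byte_select)."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select

# Address assembled from the example fields: tag 0x50, index 0x01, byte 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print(hex(addr))                              # 0x14020
print([hex(f) for f in split_address(addr)])  # ['0x50', '0x1', '0x0']
```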
Set Associative Cache
• N-way set associative: N entries per Cache Index
– N direct mapped caches operate in parallel
• Example: Two-way set associative cache
– Cache Index selects a “set” from the cache
– Two tags in the set are compared to input in parallel
– Data is selected based on the tag result
[Figure: address = Cache Tag, Cache Index, Byte Select (bit positions 31, 8, 4, 0 marked); two ways, each with Valid, Cache Tag, and Cache Data (Cache Block 0, …); two Compare units drive Sel1/Sel0 of a Mux that selects the Cache Block, and their outputs are ORed to produce Hit]
Fully Associative Cache
• Fully Associative: Every block can hold any line
– Address does not include a cache index
– Compare Cache Tags of all Cache Entries in Parallel
• Example: Block Size = 32 B
– We need N 27-bit comparators
– Still have byte select to choose from within block
[Figure: address = Cache Tag (27 bits long, bits 31 to 5), Byte Select (bits 4 to 0, Ex: 0x01); every entry has a Valid Bit, a Cache Tag, and 32 bytes of Cache Data (Byte 0 … Byte 31, Byte 32 … Byte 63, …), with one comparator (=) per entry]
Q3: Which block should be replaced on a miss?
• Easy for Direct Mapped
• Set Associative or Fully Associative:
– LRU (Least Recently Used): Appealing, but hard to implement for high associativity
– Random: Easy, but how well does it work?
Miss rates, LRU vs. Random (Ran) replacement:
         2-way             4-way             8-way
Size     LRU      Ran      LRU      Ran      LRU      Ran
16K      5.2%     5.7%     4.7%     5.3%     4.4%     5.0%
64K      1.9%     2.0%     1.5%     1.7%     1.4%     1.5%
256K     1.15%    1.17%    1.13%    1.13%    1.12%    1.12%
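A toy sketch of LRU replacement in a set-associative cache, operating on block addresses only. The class, the trace, and the cache sizes are illustrative assumptions and do not reproduce the measured miss rates in the table.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Tag-only cache model with LRU replacement within each set."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set, ordered from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_addr):
        """Return True on a hit; on a miss, fill the block and return False."""
        index = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)       # mark as most recently used
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)    # evict the least recently used tag
        lines[tag] = True
        return False

def miss_rate(cache, trace):
    misses = sum(not cache.access(block) for block in trace)
    return misses / len(trace)

# Blocks 0 and 8 collide in an 8-block direct-mapped cache.
trace = [0, 8, 0, 8, 0, 8, 0, 8]
print(miss_rate(SetAssociativeCache(num_sets=8, ways=1), trace))  # 1.0
print(miss_rate(SetAssociativeCache(num_sets=4, ways=2), trace))  # 0.25
```

Both caches hold 8 blocks; the direct-mapped one misses on every access because the two blocks keep evicting each other, while the 2-way cache misses only on the first touch of each block.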
1/26/2009 CS252-S09, Lecture 02 76
Q4: What happens on a write?
                                             Write-Through                     Write-Back
Policy                                       Data written to cache block       Write data only to the cache;
                                             also written to lower-level       update lower level when a block
                                             memory                            falls out of the cache
Debug                                        Easy                              Hard
Do read misses produce writes?               No                                Yes
Do repeated writes make it to lower level?   Yes                               No
Additional option: let writes to an un-cached address allocate a new cache line (“write-allocate”).
1/26/2009 CS252-S09, Lecture 02 77
Write Buffers for Write-Through Caches
[Figure: Processor, Cache, Write Buffer, Lower Level Memory; the write buffer holds data awaiting write-through to lower level memory]
Q. Why a write buffer?
A. So CPU doesn’t stall
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for write buffer?
A. Yes! Drain buffer before next read, or check write buffers for match on reads
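A sketch of the second RAW remedy (check the write buffer for a match on reads), with the lower level modeled as a plain dictionary. The class and method names are hypothetical; real hardware does the match with parallel comparators rather than a loop.

```python
from collections import deque

class WriteBuffer:
    """FIFO of pending writes sitting in front of a lower-level memory."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value (lower level)
        self.pending = deque()    # (address, value), oldest first

    def write(self, addr, value):
        # The processor continues immediately; memory is updated later.
        self.pending.append((addr, value))

    def read(self, addr):
        # Search newest-first so a read sees the most recent pending write.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value
        return self.memory.get(addr)

    def drain(self):
        # Retire pending writes to memory in program order.
        while self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value

memory = {0x10: 1}
buf = WriteBuffer(memory)
buf.write(0x10, 2)
print(buf.read(0x10), memory[0x10])  # 2 1
buf.drain()
print(memory[0x10])                  # 2
```

Without the buffer check, the read of 0x10 would return the stale value 1 from memory while the newer write was still queued.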
1/26/2009 CS252-S09, Lecture 02 78
5 Basic Cache Optimizations• Reducing Miss Rate
1. Larger Block size (compulsory misses)
2. Larger Cache size (capacity misses)
3. Higher Associativity (conflict misses)
• Reducing Miss Penalty
4. Multilevel Caches
• Reducing hit time
5. Giving Reads Priority over Writes • E.g., Read complete before earlier writes in write buffer
1/26/2009 CS252-S09, Lecture 02 79
Virtual Memory
1/26/2009 CS252-S09, Lecture 02 80
• Virtual memory => treat memory as a cache for the disk• Terminology: blocks in this cache are called “Pages”
• As described, TLB lookup is in serial with cache lookup:
• Machines with TLBs go one step further: they overlap TLB lookup with cache access.
– Works because offset available early
Reducing translation time further
Virtual Address
TLB Lookup
V AccessRights PA
V page no. offset10
P page no. offset10
Physical Address
1/26/2009 CS252-S09, Lecture 02 88
• Here is how this might work with a 4K cache:
• What if cache size is increased to 8KB?– Overlap not complete– Need to do something else. See CS152/252
• Another option: Virtual Caches– Tags in cache are virtual addresses– Translation only happens on cache misses
TLB 4K Cache
10 2
004 bytes
index 1 K
page # disp20
assoclookup
32
Hit/Miss
FN Data Hit/Miss
=FN
Overlapping TLB & Cache Access
1/26/2009 CS252-S09, Lecture 02 89
Problems With Overlapped TLB Access
11 2
00
virt page # disp20 12
cache index
This bit is changedby VA translation, butis needed for cachelookup
Solutions: go to 8K byte page sizes; go to 2 way set associative cache; or SW guarantee VA[13]=PA[13]
1K
4 410
2 way set assoc cache
• Overlapped access requires address bits used to index into cache do not change as result translation
– This usually limits things to small caches, large page sizes, or high– n-way set associative caches if you want a large cache
• Example: suppose everything the same except that the cache is increased to 8 K bytes instead of 4 K:
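The constraint can be written as a one-line check: the cache index plus block offset must fit inside the page offset, i.e. the bytes per way must not exceed the page size. This is a sketch under that stated assumption; the function name is hypothetical.

```python
def overlap_is_complete(cache_bytes, ways, page_bytes):
    """True if all cache index bits come from the untranslated page offset."""
    bytes_per_way = cache_bytes // ways
    return bytes_per_way <= page_bytes

KB = 1024
print(overlap_is_complete(4 * KB, 1, 4 * KB))  # True:  4K cache, 4K pages
print(overlap_is_complete(8 * KB, 1, 4 * KB))  # False: one index bit is translated
print(overlap_is_complete(8 * KB, 1, 8 * KB))  # True:  go to 8K pages
print(overlap_is_complete(8 * KB, 2, 4 * KB))  # True:  go to 2-way set associative
```

The third listed solution (software guaranteeing the extra bit is the same in the virtual and physical address) is a policy fix that this size check does not model.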
Summary: Control and Pipelining
• Next time: Read Appendix A
• Control VIA State Machines and Microprogramming
• Just overlap tasks; easy if tasks are independent
• Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$
• Hazards limit performance on computers:
– Structural: need more HW resources
– Data (RAW,WAR,WAW): need forwarding, compiler scheduling
– Control: delayed branch, prediction
• Exceptions, Interrupts add complexity
• Next time: Read Appendix C, record bugs online!
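A quick numeric sketch of the pipeline speedup formula. The depth and stall values are hypothetical, and the cycle times default to equal so the last factor is 1.

```python
def pipeline_speedup(depth, stall_cpi,
                     cycle_unpipelined=1.0, cycle_pipelined=1.0):
    """Speedup over an unpipelined machine, assuming an ideal CPI of 1."""
    return (depth / (1.0 + stall_cpi)) * (cycle_unpipelined / cycle_pipelined)

print(pipeline_speedup(depth=5, stall_cpi=0.0))   # 5.0 (no hazards: speedup = depth)
print(pipeline_speedup(depth=5, stall_cpi=0.25))  # 4.0
```

With no stalls the speedup equals the pipeline depth; every stall cycle per instruction added by hazards pulls it below that bound.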
Summary #1/3: The Cache Design Space
• Several interacting dimensions
– cache size
– block size
– associativity
– replacement policy
– write-through vs write-back
– write allocation
• The optimal choice is a compromise
– depends on access characteristics
» workload
» use (I-cache, D-cache, TLB)
– depends on technology / cost
• Simplicity often wins
[Figure: design space with axes Cache Size, Associativity, and Block Size; a Good-to-Bad vs. Less-to-More plot with opposing curves for Factor A and Factor B]
Summary #2/3: Caches
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
» Temporal Locality: Locality in Time
» Spatial Locality: Locality in Space
• Three Major Categories of Cache Misses:
– Compulsory Misses: sad facts of life. Example: cold start misses.
– Capacity Misses: increase cache size
– Conflict Misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect!
• Write Policy: Write Through vs. Write Back
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): affects Compilers, Data structures, and Algorithms
• TLB misses are significant in processor performance
– funny times, as most systems can’t access all of 2nd level cache without TLB misses!
• Caches, TLBs, Virtual Memory all understood by examining how they deal with 4 questions:
1) Where can block be placed?
2) How is block found?
3) What block is replaced on miss?
4) How are writes handled?
• Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy benefits, but computers are insecure