CS252 Graduate Computer Architecture
Lecture 14
3+1 Cs of Caching and many ways Cache Optimizations
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
Review: VLIW: Very Large Instruction Word
• Each “instruction” has explicit coding for multiple operations
– In IA-64, grouping called a “bundle”
– In Transmeta, grouping called a “molecule” (with “atoms” as ops)
• Tradeoff instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
» 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
– Need compiling technique that schedules across several branches
Problems with 1st Generation VLIW
• Increase in code size
– Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
– Whenever VLIW instructions are not full, unused functional units translate to wasted bits in the instruction encoding
• Operated in lock-step; no hazard detection HW
– A stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
– The compiler might predict functional-unit latencies, but cache behavior is hard to predict
• Binary code compatibility
– Pure VLIW => different numbers of functional units and unit latencies require different versions of the code
Discussion of two papers for today
• “DAISY: Dynamic Compilation for 100% Architectural Compatibility,” Kemal Ebcioglu and Erik R. Altman. Appeared in the International Symposium on Computer Architecture (ISCA), 1997
• “The Transmeta Code Morphing Software: Using Speculation, Recovery, and Adaptive Retranslation to Address Real-Life Challenges,” James C. Dehnert, Brian K. Grant, John P. Banning, Richard Johnson, Thomas Kistler, Alexander Klaiber, Jim Mattson. Appeared in the Proceedings of the First Annual IEEE/ACM International Symposium on Code Generation and Optimization, March 2003
Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
• IA-64: instruction set architecture
– 128 64-bit integer regs + 128 82-bit floating-point regs
» Not separate register files per functional unit as in old VLIW
– Hardware checks dependencies (interlocks => binary compatibility over time)
• 3 instructions in 128-bit “bundles”; a template field determines whether instructions are dependent or independent
– Smaller code size than old VLIW, larger than x86/RISC
– Groups can be linked to show independence of more than 3 instructions
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
• Speculation support:
– Deferred exception handling with “poison bits”
– Speculative movement of loads above stores + a check to see if the speculation was incorrect
• Itanium™ was the first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
• Itanium 2™ is the name of the 2nd implementation (2005)
– 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
– Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3
Itanium™ EPIC Design Maximizes SW-HW Synergy (Copyright: Intel at Hotchips ’00)
Architecture features programmed by the compiler:
– Branch hints, memory hints
– Explicit parallelism
– Register stack & rotation
– Predication
– Data & control speculation
Micro-architecture features in hardware:
– Fetch: instruction cache & branch predictors
– Issue control: fast, simple 6-issue
– Register handling: 128 GR & 128 FR, register remap & stack engine
– Parallel resources: 4 integer + 4 MMX units, 2 FMACs (4 for SSE), 2 LD/ST units, 32-entry ALAT, bypasses & dependencies, speculation deferral management
– Memory subsystem: three levels of cache (L1, L2, L3)
10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)
• Front end: pre-fetch/fetch of up to 6 instructions/cycle, hierarchy of branch predictors, decoupling buffer
• Instruction delivery: dispersal of up to 6 instructions onto 9 ports, register remapping, register stack engine
• Operand delivery: register read + bypasses, register scoreboard, predicated dependencies
• Execution: 4 single-cycle ALUs, 2 ld/st units, advanced load control, predicate delivery & branch, NaT/exception/retirement
• Stages: IPG (instruction pointer generation), FET (fetch), ROT (rotate), EXP (expand), REN (rename), WLD (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)
Why More on Memory Hierarchy?
[Chart: processor vs. memory performance, 1980 to 2010 on a log scale; the processor-memory performance gap keeps growing]
A Typical Memory Hierarchy c.2007
[Diagram: CPU with a multiported register file (part of the CPU); split L1 instruction and data caches (on-chip SRAM); a large unified L2 cache (on-chip SRAM); and multiple interleaved memory banks (off-chip DRAM)]
Itanium-2 On-Chip Caches (Intel/HP, 2002)
• Level 1: 16 KB, 4-way set associative, 64 B lines, quad-port (2 load + 2 store), single-cycle latency
• Level 2: 256 KB, 4-way set associative, 128 B lines, quad-port (4 load or 4 store), five-cycle latency
• Level 3: 3 MB, 12-way set associative, 128 B lines, single 32 B port, twelve-cycle latency
Review: Cache performance
• Miss-oriented approach to memory access:
  CPUtime = IC x (CPI_Execution + MemAccess/Inst x MissRate x MissPenalty) x CycleTime
• Separating out the memory component entirely (AMAT = Average Memory Access Time):
  CPUtime = IC x (CPI_AluOps + MemAccess/Inst x AMAT) x CycleTime
  AMAT = HitTime + MissRate x MissPenalty
       = %inst x (HitTime_Inst + MissRate_Inst x MissPenalty_Inst)
       + %data x (HitTime_Data + MissRate_Data x MissPenalty_Data)
What is Cache Impact on Performance?
• Suppose a processor executes at
– Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
– 50% arith/logic, 30% ld/st, 20% control
• Miss behavior:
– 10% of memory operations get a 50-cycle miss penalty
– 1% of instructions get the same miss penalty
• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
      + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
      = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
• About 65% of the time (2.0 of 3.1 cycles) the processor is stalled waiting for memory!
• AMAT = (1/1.3)x[1 + 0.01x50] + (0.3/1.3)x[1 + 0.1x50] = 2.54
What is the impact of a Harvard Architecture?
• Unified vs. separate I & D caches (Harvard)
• Statistics (given in H&P):
– 16 KB I & D: inst miss rate = 0.64%, data miss rate = 6.47%
– 32 KB unified: aggregate miss rate = 1.99%
• Which is better (ignore the L2 cache)?
– Assume 33% data ops => 75% of accesses are instruction fetches (1.0/1.33)
– hit time = 1, miss time = 50
– Note that a data hit has 1 extra stall in the unified cache (only one port)
  AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
  AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
[Diagram: processor with split I-cache and D-cache backed by a unified L2, vs. processor with a single unified L1 backed by a unified L2]
Recall: Reducing Misses
• Classifying Misses: the 3 Cs
– Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses.(Misses in even an Infinite Cache)
– Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.(Misses in Fully Associative Size X Cache)
– Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.(Misses in N-way Associative, Size X Cache)
• More recent, a 4th “C”:
– Coherence: misses caused by cache coherence.
Review: 6 Basic Cache Optimizations
• Reducing hit time
1. Avoiding address translation during cache indexing
   • E.g., overlap TLB and cache access, virtually addressed caches
• Reducing miss penalty
2. Giving reads priority over writes
   • E.g., read completes before earlier writes still sitting in the write buffer
3. Multilevel caches
• Reducing miss rate
4. Larger block size (compulsory misses)
5. Larger cache size (capacity misses)
6. Higher associativity (conflict misses)
1. Two options for avoiding translation:
• Conventional organization: the CPU sends a virtual address to the TLB, and the resulting physical address indexes a physically addressed cache before going to memory.
• Option A: virtually addressed cache. The cache is indexed and tagged with virtual addresses, so translation happens only on a miss; this raises the synonym (alias) problem.
• Option B: overlap cache access with VA translation. The cache is indexed with virtual-address bits while the TLB translates in parallel, and the tags are physical; this requires the cache index to remain invariant across translation. The L2 cache is physically addressed.
Virtually Addressed Caches (Details)
• Send the virtual address to the cache? Called a virtually addressed cache or just virtual cache, vs. a physical cache
– Every time the process is switched, the cache logically must be flushed; otherwise it can return false hits
» Cost is the time to flush + “compulsory” misses from an empty cache
– Must deal with aliases (sometimes called synonyms): two different virtual addresses that map to the same physical address
– I/O uses physical addresses, so it must be mapped into virtual addresses to interact with a virtual cache
• Solutions to aliases
– HW: guarantee that every cache block has a unique physical address
– SW: require that the low-order bits of aliases agree; as long as those bits cover the index field of a direct-mapped cache, aliases must land in the same block; called page coloring (see the sketch below)
• Solution to cache flushes
– Add a process-identifier tag that identifies the process as well as the address within the process: can’t get a hit if the wrong process
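As a concrete check of this constraint, here is a small C sketch of my own (not from the lecture): a cache can be virtually indexed but physically tagged without aliasing tricks only if the index and block-offset bits fit inside the page offset, i.e., capacity/associativity <= page size. The function name vipt_ok and the example parameters are illustrative only.

#include <stdio.h>

/* Hypothetical parameters for illustration only. */
static unsigned log2u(unsigned x) {          /* x assumed to be a power of two */
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* A virtually indexed, physically tagged cache needs no page coloring
 * only if index+offset bits <= page-offset bits, i.e.
 * (capacity / associativity) <= page size.                          */
int vipt_ok(unsigned capacity, unsigned assoc, unsigned block, unsigned page) {
    unsigned index_bits  = log2u(capacity / (assoc * block));
    unsigned offset_bits = log2u(block);
    unsigned page_bits   = log2u(page);
    return index_bits + offset_bits <= page_bits;
}

int main(void) {
    /* 32 KB, 4-way, 64 B blocks, 4 KB pages: 32K/4 = 8K > 4K -> needs page coloring */
    printf("32KB 4-way: %s\n", vipt_ok(32*1024, 4, 64, 4096) ?
           "index unaffected by translation" : "needs page coloring / higher associativity");
    /* 16 KB, 4-way, 64 B blocks, 4 KB pages: 16K/4 = 4K <= 4K -> OK */
    printf("16KB 4-way: %s\n", vipt_ok(16*1024, 4, 64, 4096) ?
           "index unaffected by translation" : "needs page coloring / higher associativity");
    return 0;
}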
2. Read Priority over Write on Miss
• A write buffer is needed between the cache and memory
– Processor: writes data into the cache and the write buffer
– Memory controller: writes the contents of the buffer to memory
• The write buffer is just a FIFO:
– Typical number of entries: 4
– Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
– Must handle burst behavior as well!
[Diagram: Processor and Cache feeding a Write Buffer that drains to DRAM]
RAW Hazards from the Write Buffer!
• Write-buffer issues: could introduce a RAW hazard with memory
– The write buffer may contain the only copy of valid data => reads to memory may get the wrong result if we ignore the write buffer
• Solutions:
– Simply wait for the write buffer to empty before servicing reads:
» Might increase the read miss penalty (old MIPS 1000: by 50%)
– Check write-buffer contents before the read (“fully associative” check; see the sketch below):
» If no conflicts, let the memory access continue
» Else grab the data from the buffer
• Can the write buffer help with write back?
– Read miss replacing a dirty block:
» Copy the dirty block to the write buffer while starting the read to memory
[Timing diagram: with no buffering, a DRAM write (RAS/CAS + data) must complete before the following read; with a write buffer, the read’s RAS/CAS can overlap the buffered write]
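A minimal sketch, of my own devising, of the “check write buffer contents before read” policy: a read miss searches the small FIFO of pending stores associatively and forwards buffered data on an address match. The entry count and field names are assumptions, not a particular machine’s design.

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4                 /* typical small FIFO of pending stores */

typedef struct { uint32_t addr; uint32_t data; bool valid; } WBEntry;
typedef struct { WBEntry e[WB_ENTRIES]; int head, tail, count; } WriteBuffer;

/* Enqueue a store; the caller must stall if the buffer is full. */
bool wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return false;        /* full: processor stalls */
    wb->e[wb->tail] = (WBEntry){ addr, data, true };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* On a read miss, scan all valid entries ("fully associative" check).
 * The newest match wins, since it holds the most recent value (RAW forwarding). */
bool wb_forward(const WriteBuffer *wb, uint32_t addr, uint32_t *data_out) {
    bool hit = false;
    for (int i = 0, idx = wb->head; i < wb->count; i++, idx = (idx + 1) % WB_ENTRIES)
        if (wb->e[idx].valid && wb->e[idx].addr == addr) {
            *data_out = wb->e[idx].data;               /* keep scanning: a later entry is newer */
            hit = true;
        }
    return hit;        /* if false, the read must go to the next level / DRAM */
}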
12 Advanced Cache Optimizations
• Reducing hit time
1.Small and simple caches
2.Way prediction
3.Trace caches
• Increasing cache bandwidth
4.Pipelined caches
5.Multibanked caches
6.Nonblocking caches
• Reducing Miss Penalty
7. Critical word first
8. Merging write buffers
• Reducing Miss Rate
9. Victim Cache
10. Hardware prefetching
11. Compiler prefetching
12. Compiler Optimizations
1. Fast Hit times via Small and Simple Caches
• Indexing tag memory and then comparing takes time => a small cache can help hit time, since a smaller memory takes less time to index
– E.g., L1 caches stayed the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
– Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
• Simple => direct mapping
– Can overlap the tag check with data transmission, since there is no choice of block
• Access time estimate for 90 nm using the CACTI 4.0 model
– Median ratios of access time relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches
[Chart: CACTI 4.0 estimated access time (ns) vs. cache size from 16 KB to 1 MB, for 1-way, 2-way, 4-way, and 8-way associativity]
Recall: Set Associative Cache
• N-way set associative: N entries for each cache index
– N direct-mapped caches operating in parallel
• Example: two-way set associative cache
– The cache index selects a “set” from the cache
– The two tags in the set are compared to the input tag in parallel
– Data is selected based on the tag comparison result
• Disadvantage: time to set the output mux
[Diagram: two banks of cache data/tag/valid arrays indexed by the cache index; both tags are compared against the address tag, the compare results are ORed to produce Hit, and the select signals steer a mux to pick the hit block]
2. Fast Hit times via Way Prediction
• How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way set-associative cache?
• Way prediction: keep extra bits in the cache to predict the “way,” or block within the set, of the next cache access
– The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
– Miss => first check the other blocks for matches in the next clock cycle
• Accuracy is around 85%
• Drawback: CPU pipeline design is hard if a hit can take 1 or 2 cycles
– Used for instruction caches rather than data caches
– Also used on the MIPS R10K for its off-chip L2 unified cache, with the way-prediction table on chip
• Timing: hit time (correct prediction) < way-miss hit time < miss penalty
Way Predicting Caches (MIPS R10000 L2 cache)
• Use the processor address to index into the way-prediction table
• Look in the predicted way at the given index, then:
– Predicted way hits => return the copy of the data from the cache (fast hit)
– Predicted way misses => look in the other way: a hit there is a slow hit (and the prediction-table entry is changed); a miss in both ways reads the block of data from the next level of cache
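The flow above can be written as a small simulator fragment. This is a hedged sketch (the structure, sizes, and field names are mine, not the R10000’s actual logic): the prediction table picks one way to probe first, a wrong guess costs a second probe, and the table is retrained.

#include <stdint.h>
#include <stdbool.h>

#define SETS 256
#define WAYS 2

typedef struct { uint32_t tag; bool valid; } Line;

static Line    cache[SETS][WAYS];
static uint8_t predicted_way[SETS];          /* 1 bit per set: which way to try first */

/* Returns 0 = fast hit, 1 = slow hit (mispredicted way), 2 = miss. */
int wp_access(uint32_t addr) {
    uint32_t set = (addr >> 6) & (SETS - 1); /* 64-byte blocks assumed */
    uint32_t tag = addr >> 14;               /* remaining upper bits    */
    int p = predicted_way[set];

    if (cache[set][p].valid && cache[set][p].tag == tag)
        return 0;                            /* fast hit: mux was already set */

    int other = p ^ 1;                       /* second probe, next cycle      */
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        predicted_way[set] = other;          /* retrain the predictor         */
        return 1;                            /* slow hit                      */
    }

    /* Miss: fetch from the next level and fill the predicted way (simplistic). */
    cache[set][p] = (Line){ tag, true };
    return 2;
}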
Way Predicting Instruction Cache (Alpha 21264-like)
[Diagram: the PC indexes the primary instruction cache; each fetched line supplies the instructions plus a predicted next way, either the sequential way (PC + 4) or the branch-target way, and jump-control logic selects which one steers the next fetch]
3. Fast (Instruction Cache) Hit times via Trace Cache
• Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line
– A single fetch brings in multiple basic blocks (spanning several branches)
– The trace cache is indexed by the start address and the next n branch predictions
3. Fast Hit times via Trace Cache (Pentium 4 only; and last time?)
• Find more instruction-level parallelism? How to avoid translating from x86 to micro-ops on every fetch?
• Trace cache in the Pentium 4:
1. Caches dynamic traces of the executed instructions, rather than static sequences of instructions as determined by layout in memory
– Built-in branch predictor
2. Caches the micro-ops rather than x86 instructions
– Decode/translate from x86 to micro-ops only on a trace-cache miss
+ Better utilizes long blocks (don’t exit in the middle of a block, don’t enter at a label in the middle of a block)
– Complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
– Instructions may appear multiple times in multiple dynamic traces due to different branch outcomes
4: Increasing Cache Bandwidth by Pipelining
• Pipeline cache access to maintain bandwidth, but higher latency
• Instruction cache access pipeline stages:
1: Pentium
2: Pentium Pro through Pentium III
4: Pentium 4
• => greater penalty on mispredicted branches and more clock cycles between the issue of a load and the use of its data
5. Increasing Cache Bandwidth: Non-Blocking Caches
• A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
– requires F/E bits on registers or out-of-order execution
– requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by working during miss vs. ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
– Significantly increases the complexity of the cache controller as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise multiple outstanding misses cannot be serviced)
– The Pentium Pro allows 4 outstanding memory misses
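To make “hit under multiple miss” concrete, here is a toy MSHR (miss status holding register) model under my own assumptions (it is not the Pentium Pro design): up to four outstanding misses are tracked, and misses to the same block are merged instead of blocking the cache.

#include <stdint.h>
#include <stdbool.h>

#define MSHRS 4                      /* e.g., up to 4 outstanding misses */

typedef struct { uint32_t block_addr; bool busy; } MSHR;
static MSHR mshr[MSHRS];

/* Called on a cache miss.  Returns:
 *  -1 if no MSHR is free (the cache must finally block),
 *  otherwise the index of the MSHR tracking this miss.        */
int mshr_allocate(uint32_t block_addr) {
    for (int i = 0; i < MSHRS; i++)          /* merge with an existing miss */
        if (mshr[i].busy && mshr[i].block_addr == block_addr)
            return i;
    for (int i = 0; i < MSHRS; i++)          /* otherwise grab a free entry */
        if (!mshr[i].busy) {
            mshr[i] = (MSHR){ block_addr, true };
            return i;
        }
    return -1;                               /* structural stall */
}

/* Called when the memory system returns the block. */
void mshr_release(int i) { mshr[i].busy = false; }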
Value of Hit Under Miss for SPEC (old data)
• FP programs on average: Miss Penalty = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: Miss Penalty = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss, SPEC 92
[Chart: for the 8 KB direct-mapped data cache above, the memory stall time with “hit under n misses” (n = 1, 2, 64) relative to a blocking cache, per SPEC92 benchmark; integer: eqntott, espresso, xlisp, compress; floating point: mdljsp2, ear, fpppp, tomcatv, swm256, doduc, su2cor, wave5, mdljdp2, hydro2d, alvinn, nasa7, spice2g6, ora]
6: Increasing Cache Bandwidth via Multiple Banks
• Rather than treat the cache as a single monolithic block, divide into independent banks that can support simultaneous accesses
– E.g.,T1 (“Niagara”) L2 has 4 banks
• Banking works best when the accesses naturally spread themselves across the banks => the mapping of addresses to banks affects the behavior of the memory system
• Simple mapping that works well is “sequential interleaving”
– Spread block addresses sequentially across banks
– E.g., if there are 4 banks, bank 0 has all blocks whose block address modulo 4 is 0; bank 1 has all blocks whose block address modulo 4 is 1; …
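Sequential interleaving is just a modulo on the block address. A short sketch (mine, with an assumed block size and bank count):

#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS   4        /* e.g., the T1 ("Niagara") L2 has 4 banks */

/* Sequential interleaving: consecutive block addresses go to consecutive banks. */
static inline unsigned bank_of(uint64_t byte_addr) {
    uint64_t block_addr = byte_addr / BLOCK_BYTES;
    return (unsigned)(block_addr % NUM_BANKS);
}

With this mapping, a unit-stride stream of blocks rotates across all four banks, which is why sequential interleaving spreads the load well for common access patterns.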
7. Reduce Miss Penalty: Early Restart and Critical Word First
• Don’t wait for full block before restarting CPU
• Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
– Spatial locality => the CPU tends to want the next sequential word anyway, so the benefit of early restart alone is unclear
• Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
– Long blocks are more popular today => critical word first is widely used
8. Merging Write Buffer to Reduce Miss Penalty
• A write buffer allows the processor to continue while waiting for the write to reach memory
• If the buffer contains modified blocks, the addresses can be checked to see whether the address of the new data matches the address of a valid write-buffer entry
• If so, the new data are combined with that entry (see the sketch below)
• For a write-through cache, merging increases the effective size of each write to memory when stores are to sequential words or bytes, since multiword writes use memory more efficiently
• The Sun T1 (Niagara) processor, among many others, uses write merging
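A sketch of the merge check (my own structure, not a specific machine’s): each buffer entry covers one block-aligned region with per-word valid bits, and a new store that falls in an existing entry’s block is absorbed rather than taking a new slot.

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES    4
#define WORDS_PER_BLK 8                      /* 32-byte blocks of 4-byte words */

typedef struct {
    uint32_t block_addr;                     /* block number of the 32-byte block */
    uint32_t data[WORDS_PER_BLK];
    uint8_t  word_valid;                     /* one bit per word in the block     */
    bool     in_use;
} MergeEntry;

static MergeEntry wb[WB_ENTRIES];

/* Returns true if the store was buffered (merged or newly allocated). */
bool wb_store(uint32_t addr, uint32_t value) {
    uint32_t blk = addr / 32, word = (addr / 4) % WORDS_PER_BLK;

    for (int i = 0; i < WB_ENTRIES; i++)             /* try to merge first */
        if (wb[i].in_use && wb[i].block_addr == blk) {
            wb[i].data[word] = value;
            wb[i].word_valid |= (uint8_t)(1u << word);
            return true;
        }
    for (int i = 0; i < WB_ENTRIES; i++)             /* else take a free entry */
        if (!wb[i].in_use) {
            wb[i] = (MergeEntry){ .block_addr = blk, .in_use = true };
            wb[i].data[word] = value;
            wb[i].word_valid = (uint8_t)(1u << word);
            return true;
        }
    return false;                                    /* buffer full: stall */
}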
9. Reducing Misses: a “Victim Cache”
• How to combine fast hit time of direct mapped yet still avoid conflict misses?
• Add buffer to place data discarded from cache
• Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha and HP machines
[Diagram: a small fully associative victim cache, a few lines each with its own tag and comparator, sits between the direct-mapped cache and the next lower level of the hierarchy, holding recently discarded blocks]
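A sketch of the lookup path under assumed parameters (a direct-mapped L1 plus a 4-entry fully associative victim cache, in the spirit of Jouppi’s configuration; the code is mine): the swap on a victim hit is what turns a conflict miss into a short delay.

#include <stdint.h>
#include <stdbool.h>

#define L1_SETS     64                  /* direct-mapped L1, 64-byte blocks */
#define VICTIM_WAYS 4                   /* small fully associative buffer   */

typedef struct { uint32_t tag; bool valid; } Line;
static Line l1[L1_SETS];
static Line victim[VICTIM_WAYS];

/* Returns 0 = L1 hit, 1 = victim hit (swap), 2 = true miss. */
int access_with_victim(uint32_t addr) {
    uint32_t set = (addr >> 6) % L1_SETS;
    uint32_t tag = addr >> 6;           /* full block address used as tag, for simplicity */

    if (l1[set].valid && l1[set].tag == tag) return 0;

    for (int i = 0; i < VICTIM_WAYS; i++)
        if (victim[i].valid && victim[i].tag == tag) {
            Line evicted = l1[set];     /* swap: the victim line moves into L1,  */
            l1[set]   = victim[i];      /* and the displaced L1 line becomes the */
            victim[i] = evicted;        /* new victim entry                      */
            return 1;
        }

    victim[(set + tag) % VICTIM_WAYS] = l1[set];   /* crude replacement choice */
    l1[set] = (Line){ tag, true };                 /* refill from the next level */
    return 2;
}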
10. Reducing Misses by Hardware Prefetching of Instructions & Data
• Prefetching relies on having extra memory bandwidth that can be used without penalty
• Instruction Prefetching– Typically, CPU fetches 2 blocks on a miss: the requested block and the next
consecutive block.
– Requested block is placed in instruction cache when it returns, and prefetched block is placed into instruction stream buffer
• Data Prefetching– Pentium 4 can prefetch data into L2 cache from up to 8 streams from 8
different 4 KB pages
– Prefetching invoked if 2 successive L2 cache misses to a page, if distance between those cache blocks is < 256 bytes
[Chart: performance improvement from hardware prefetching on a Pentium 4, ranging from 1.16 to 1.45 on two SPECint2000 benchmarks and from 1.18 to 1.97 on SPECfp2000 benchmarks]
Issues in Prefetching
• Usefulness – should produce hits
• Timeliness – not late and not too early
• Cache and bandwidth pollution
[Diagram: CPU and register file with L1 instruction and data caches backed by a unified L2 cache; prefetched data is installed in the caches ahead of demand accesses]
Hardware Instruction Prefetching
Instruction prefetch in Alpha AXP 21064– Fetch two blocks on a miss; the requested block (i) and the next
consecutive block (i+1)
– Requested block placed in cache, and next block in instruction stream buffer
– If miss in cache but hit in stream buffer, move stream buffer block into cache and prefetch next block (i+2)
[Diagram: on an I-cache miss, the requested block goes from the unified L2 into the L1 instruction cache while the prefetched next block goes into a stream buffer beside it]
Hardware Data Prefetching
• Prefetch-on-miss:
– Prefetch block b + 1 upon a miss on block b
• One-Block Lookahead (OBL) scheme:
– Initiate a prefetch for block b + 1 when block b is accessed
– Why is this different from doubling the block size?
– Can extend to N-block lookahead
• Strided prefetch (see the sketch below):
– If a sequence of accesses to blocks b, b+N, b+2N is observed, then prefetch b+3N, etc.
• Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
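A sketch of a stride detector in the spirit of the mechanism above, with made-up table sizes and thresholds (the Power 5’s actual parameters are not given here): each load PC gets a table entry that remembers the last address and stride, and a confirmed stride triggers a prefetch a few strides ahead.

#include <stdint.h>

#define RPT_ENTRIES    64               /* reference prediction table (assumed size) */
#define PREFETCH_AHEAD 3                /* prefetch 3 strides ahead (assumed)        */

typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;                /* how many times the stride has repeated */
} RPTEntry;

static RPTEntry rpt[RPT_ENTRIES];

/* Stub standing in for the memory system's prefetch port. */
static void issue_prefetch(uint64_t addr) { (void)addr; }

/* Call on every load with its PC and effective address. */
void rpt_observe(uint64_t pc, uint64_t addr) {
    RPTEntry *e = &rpt[(pc >> 2) % RPT_ENTRIES];
    int64_t stride = (int64_t)(addr - e->last_addr);

    if (stride == e->stride && stride != 0) {
        if (e->confidence < 3) e->confidence++;
        if (e->confidence >= 2)                       /* stride confirmed twice */
            issue_prefetch(addr + (uint64_t)(PREFETCH_AHEAD * stride));
    } else {
        e->stride = stride;                           /* retrain on a new stride */
        e->confidence = 0;
    }
    e->last_addr = addr;
}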
11. Reducing Misses by Software Prefetching Data
• Data prefetch
– Load data into a register (HP PA-RISC loads)
– Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
– Special prefetching instructions cannot cause faults;a form of speculative execution
• Issuing prefetch instructions takes time
– Is the cost of issuing prefetches < the savings from reduced misses?
– Wider superscalar machines reduce the difficulty of finding issue bandwidth for them
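As a concrete example of a cache-prefetch instruction, GCC and Clang expose one through the __builtin_prefetch intrinsic (a real builtin); the prefetch distance of 16 below is an assumption to be tuned per machine, and, like the prefetches described above, the generated instruction cannot cause a fault.

#include <stddef.h>

#define PF_DIST 16    /* how many iterations ahead to prefetch; tune per machine */

/* Sum an array, prefetching future elements so the later loads hit in the cache.
 * The prefetch instruction itself does not fault on addresses past the array. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* high temporal locality */);
        s += a[i];
    }
    return s;
}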
12. Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
• Instructions
– Reorder procedures in memory so as to reduce conflict misses
– Use profiling to look at conflicts (using tools they developed)
• Data– Merging Arrays: improve spatial locality by single array of compound elements
vs. 2 arrays
– Loop Interchange: change nesting of loops to access data in order stored in memory
– Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap
– Blocking: Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: 1 array of structures */
struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];
Reducing conflicts between val & key; improve spatial locality
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{ a[i][j] = 1/b[i][j] * c[i][j];
d[i][j] = a[i][j] + c[i][j];}
2 misses per access to a & c vs. one miss per access; improves temporal locality
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
• Two inner loops:
– Read all NxN elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity misses are a function of N and cache size:
– 2N³ + N² words accessed (assuming no conflict misses; otherwise even more)
• Idea: compute on a BxB submatrix that fits in the cache
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)   /* min(a,b) = the smaller of a and b */
{r = 0;
for (k = kk; k < min(kk+B,N); k = k+1) {
r = r + y[i][k]*z[k][j];};
x[i][j] = x[i][j] + r;
};
• B called Blocking Factor
• Capacity misses drop from 2N³ + N² to 2N³/B + N²
• Conflict misses too?
Reducing Conflict Misses by Blocking
• Conflict misses in caches that are not fully associative vs. blocking size
– Lam et al. [1991]: a blocking factor of 24 had one-fifth the misses of a blocking factor of 48, even though both fit in the cache
[Chart: miss rate vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache]
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Chart: performance improvement (roughly 1x to 3x) from merging arrays, loop interchange, loop fusion, and blocking, applied by hand to compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7)]
Impact of Hierarchy on Algorithms
• Today CPU time is a function of (ops, cache misses)
• What does this mean to compilers, data structures, algorithms?
– Quicksort: the fastest comparison-based sorting algorithm when keys fit in memory
– Radix sort: also called “linear time” sort; for keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys
• “The Influence of Caches on the Performance of Sorting” by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January, 1997, 370-379.
– For Alphastation 250, 32 byte blocks, direct mapped L2 2MB cache, 8 byte keys, from 4000 to 4000000
Quicksort vs. Radix: Instructions
[Chart: instructions per key vs. job size in keys]
Quicksort vs. Radix: Instructions & Time
[Chart: instructions and time per key vs. job size in keys]
Quicksort vs. Radix: Cache Misses
[Chart: cache misses per key vs. job size in keys]
Experimental Study (Membench)
• Microbenchmark for memory system performance
• For each array A of length L from 4 KB to 8 MB (by 2x), and for each stride s from 4 bytes (1 word) to L/2 (by 2x), time the following loop (repeat many times and average):
    for i from 0 to L by s
      load A[i] from memory (4 bytes)
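A hedged C rendering of the loop above (the timer choice, sizes, and repetition count are mine; the real Membench code differs in its details):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_BYTES (8u << 20)            /* 8 MB maximum array */
#define REPEAT    100                   /* repetitions to average out timer noise */

int main(void) {
    volatile int *A = malloc(MAX_BYTES);
    if (!A) return 1;
    for (size_t i = 0; i < MAX_BYTES / sizeof(int); i++) A[i] = (int)i;

    for (size_t len = 4u << 10; len <= MAX_BYTES; len *= 2)            /* 4 KB .. 8 MB */
        for (size_t stride = sizeof(int); stride <= len / 2; stride *= 2) {
            size_t accesses = 0;
            clock_t t0 = clock();
            for (int r = 0; r < REPEAT; r++)
                for (size_t i = 0; i < len / sizeof(int); i += stride / sizeof(int)) {
                    (void)A[i];                                        /* one 4-byte load */
                    accesses++;
                }
            double ns = 1e9 * (double)(clock() - t0) / CLOCKS_PER_SEC / (double)accesses;
            printf("len=%zu stride=%zu avg=%.2f ns/load\n", len, stride, ns);
        }
    free((void *)A);
    return 0;
}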
Membench: What to Expect
• Consider the average cost per load– Plot one line for each array length, time vs. stride
– Small stride is best: if cache line holds 4 words, at most ¼ miss
– If array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for large enough runs)
– Picture assumes only one level of cache
– Values have gotten more difficult to measure on modern procs
[Sketch: average cost per access vs. stride, one line per array size; flat at the L1 hit time when the total size fits in L1, rising toward memory time when the size exceeds L1]
Memory Hierarchy on a Sun Ultra-2i
• Sun Ultra-2i, 333 MHz
• L1: 16 KB, 16-byte lines, 2 cycles (6 ns)
• L2: 2 MB, 64-byte lines, 12 cycles (36 ns)
• Memory: 396 ns (132 cycles)
• 8 KB pages, 32 TLB entries
• See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details
[Plot: average access time vs. stride, one curve per array length, with plateaus corresponding to each level of the hierarchy]
Memory Hierarchy on a Power3
• Power3, 375 MHz
• L1: 32 KB, 128-byte lines, 0.5-2 cycles
• L2: 8 MB, 128-byte lines, 9 cycles
• Memory: 396 ns (132 cycles)
[Plot: average access time vs. stride, one curve per array size]
Compiler Optimization vs. Memory Hierarchy Search
• Compiler tries to figure out memory hierarchy optimizations
• New approach: “Auto-tuners” first run variations of the program on the computer to find the best combination of optimizations (blocking, padding, …) and algorithms, then produce C code to be compiled for that computer
• “Auto-tuners” targeted to numerical methods
– E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (sparse linear algebra), Spiral (DSP), FFTW
Sparse Matrix: Search for Blocking for a finite element problem [Im, Yelick, Vuduc, 2005]
[Heatmaps: Mflop/s for every register block size on this matrix; the reference (unblocked) vs. the best block size found (4x2)]
Best Sparse Blocking for 8 Computers
• All possible column block sizes selected for 8 computers; How could compiler know?
[Chart: best (row r x column c) register block size, with r and c ranging over 1, 2, 4, 8, chosen per machine: Intel Pentium M; Sun Ultra 2; Sun Ultra 3; AMD Opteron; IBM Power 4; Intel/HP Itanium; Intel/HP Itanium 2; IBM Power 3]
Technique (effect on hit time, bandwidth, miss penalty, miss rate; + improves the factor, – hurts it; HW cost/complexity) and comment:
• Small and simple caches: hit time +, miss rate –; cost 0; trivial, widely used
• Way-predicting caches: hit time +; cost 1; used in Pentium 4
• Trace caches: hit time +; cost 3; used in Pentium 4
• Pipelined cache access: hit time –, bandwidth +; cost 1; widely used
• Nonblocking caches: bandwidth +, miss penalty +; cost 3; widely used
• Banked caches: bandwidth +; cost 1; used in the L2 of Opteron and Niagara
• Critical word first and early restart: miss penalty +; cost 2; widely used
• Merging write buffer: miss penalty +; cost 1; widely used with write through
• Victim caches: hit time –, miss rate +; cost 1; fairly simple and common
• Compiler techniques to reduce cache misses: miss rate +; cost 0; software is a challenge, though some computers have a compiler option
• Hardware prefetching of instructions and data: miss penalty +, miss rate +; cost 2 (instr.), 3 (data); many prefetch instructions, AMD Opteron prefetches data
• Compiler-controlled prefetching: miss penalty +, miss rate +; cost 3; needs a nonblocking cache, in many CPUs
Conclusion
• The memory wall inspires optimizations since so much performance is lost there
– Reducing hit time: small and simple caches, way prediction, trace caches
– Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches
– Reducing miss penalty: critical word first, merging write buffers
– Reducing miss rate: compiler optimizations
– Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching
• Actual performance of a simple program can be a complicated function of the architecture
– To write fast programs, you need to consider the architecture
» True on sequential or parallel processors
– We would like simple models to help us design efficient algorithms
• “Auto-tuners”: search replacing static compilation as a way to explore the optimization space?