Page 1: 7. Large and Fast: Exploiting Memory Hierarchy

Soonchunhyang University, School of Information Technology Engineering, Sang-Jung Lee

Page 2: The Big Picture: Where are We Now?

The Five Classic Components of a Computer

[Figure: the five classic components - control and datapath (together the processor), memory, input, and output.]

Page 3: Technology Trends

           Capacity             Speed (latency)
Logic:     2x in 3 years        2x in 3 years
DRAM:      4x in 3 years        2x in 10 years
Disk:      4x in 3 years        2x in 10 years

DRAM generations:

Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns

Over this period DRAM capacity improved about 1000:1, while cycle time improved only about 2:1!

Page 4: Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Figure: performance (log scale, 1 to 1000) vs. year (1980 to 2000). Processor (CPU) performance grows at ~60%/yr (2x per 1.5 years) while DRAM performance grows at ~9%/yr (2x per 10 years), so the processor-memory performance gap grows about 50% per year.]

Page 5: The Goal: illusion of large, fast, cheap memory

Fact: Large memories are slow; fast memories are small.

How do we create a memory that is large, cheap, and fast (most of the time)?
• Hierarchy

Page 6: Exploiting Memory Hierarchy

Users want large and fast memories! As of 2004:
• SRAM access times are 0.5-5 ns, at a cost of $4,000 to $10,000 per GB.
• DRAM access times are 50-70 ns, at a cost of $100 to $200 per GB.
• Disk access times are 5 to 20 million ns, at a cost of $0.50 to $2 per GB.

Try to give it to them anyway:
• build a memory hierarchy

[Figure: memory hierarchy pyramid - CPU at the top, then Level 1, Level 2, ..., Level n. Distance from the CPU in access time increases going down the hierarchy, as does the size of the memory at each level.]

Page 7: Memory Hierarchy of a Modern Computer System

By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the cheapest technology.
• Provide access at the speed offered by the fastest technology.

[Figure: processor (registers, datapath, control) with an on-chip cache, a second-level cache (SRAM), main memory (DRAM), secondary storage (disk), and tertiary storage (tape). Speeds range from ~1 ns at the registers to ~10,000,000 ns (10 ms) for disk and ~10,000,000,000 ns (10 s) for tape; sizes grow from KB-scale caches through MB and GB of memory up to TB of tertiary storage.]

Page 8: Memory Hierarchy: Why Does it Work? Locality!

• Spatial Locality (locality in space):
  => Move blocks consisting of contiguous words to the upper levels
• Temporal Locality (locality in time):
  => Keep most recently accessed data items closer to the processor

[Figure: an upper-level memory exchanging blocks (Blk X, Blk Y) with a lower-level memory, with data flowing to and from the processor; and a plot of probability of reference over the address space 0 to 2^n - 1.]

Page 9: Memory Hierarchy: Terminology

Hit: data appears in some block in the upper level (example: Block X)
• Hit Rate: the fraction of memory accesses found in the upper level
• Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss

Miss: data needs to be retrieved from a block in the lower level (Block Y)
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor

Hit Time << Miss Penalty

[Figure: same upper-level/lower-level memory diagram as the previous slide, with blocks Blk X and Blk Y.]

Page 10: Memory Hierarchy Technology

Random Access:
• "Random" is good: access time is the same for all locations
• Volatile memory
  • DRAM: Dynamic Random Access Memory
    - high density, low power, cheap, slow
    - dynamic: needs to be "refreshed" regularly
  • SRAM: Static Random Access Memory
    - low density, high power, expensive, fast
    - static: content lasts "forever" (until power is lost)
• Non-volatile memory
  • ROM (Mask ROM, PROM, EPROM, EEPROM)
  • Flash memory, FRAM, MRAM

"Not-so-random" Access Technology:
• access time varies from location to location and from time to time
• examples: disk, CD-ROM

Sequential Access Technology:
• access time linear in location (e.g., tape)

Page 11: Main Memory Background

Main memory is DRAM: Dynamic Random Access Memory
• 1 transistor and 1 capacitor (~2 transistors) per bit
• dynamic, since it needs to be refreshed periodically (every ~8 ms)
• addresses divided into 2 halves (memory as a 2D matrix):
  • row address first, then column address
  • number of address pins cut in half
  • called "address multiplexing"

Cache uses SRAM: Static Random Access Memory
• no refresh
• 6 transistors per bit
• no address multiplexing
• SRAM is faster, and more expensive, than DRAM
  • density (DRAM/SRAM): 4-8x
  • cost (SRAM/DRAM): 20-25x (1997)
  • access time (DRAM/SRAM): 5-12x

Page 12: Cache

Motivation
• the slow speed of DRAM main memory limits processor performance
• a smaller SRAM memory matches processor speed

Make the average access time near that of SRAM
• if the large majority of memory references hit the cache

Reduce the bandwidth required of the large memory

[Figure: Processor <-> Cache <-> DRAM; the cache and DRAM together form the memory system.]

Page 13: Cache Organization

Cache duplicates part of main memory
• we specify an address in main memory to search whether a copy of that memory location resides in the cache
• need a mapping between a main memory location and a cache location

Direct-Mapped Cache
• each memory address maps to a UNIQUE cache location, determined by a simple modulo function
• simplest implementation, because there is only one cache location to search

Page 14: Memory Reference Sequence in Direct-Mapped Cache

[Figure-only slide: a sequence of memory references mapped into a direct-mapped cache.]

Page 15: Direct-Mapped Cache Lookup

For a cache with block size 4 bytes and total capacity 4KB (1024 blocks), the address divides as follows (a code sketch of this bit slicing follows the figure):
• the 2 lowest address bits specify the byte within a block
• the next 10 address bits specify the block's index within the cache
• the 20 highest address bits are the unique tag for this memory block
• the valid bit specifies whether the block is an accurate copy of memory

[Figure: 32-bit address (bits 31-12: tag, bits 11-2: index, bits 1-0: byte offset) indexing a 1024-entry array of (valid, 20-bit tag, 32-bit data) entries; the stored tag is compared with the address tag, and the comparison ANDed with the valid bit produces the hit signal alongside the 32-bit data.]
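To make the bit slicing concrete, here is a minimal C sketch (mine, not from the slides) of the address arithmetic for this 4 KB direct-mapped cache with 4-byte blocks; the field widths follow the slide, and the example address is arbitrary.

```c
#include <stdint.h>
#include <stdio.h>

/* 4 KB direct-mapped cache, 4-byte blocks: 2 offset bits, 10 index bits, 20 tag bits. */
#define OFFSET_BITS 2
#define INDEX_BITS  10

static uint32_t byte_offset(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }
static uint32_t block_index(uint32_t addr) { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }
static uint32_t tag_of(uint32_t addr)      { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void) {
    uint32_t addr = 0xB6595ABCu;   /* arbitrary example address */
    printf("tag=0x%05x index=%u offset=%u\n",
           (unsigned)tag_of(addr), (unsigned)block_index(addr), (unsigned)byte_offset(addr));
    return 0;
}
```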

Page 16: Cache Entry Example

1 word = 4 bytes = 32 bits; 1 byte = 8 bits.

Cache contents:

Index (10 bits)   Valid   Tag (20 bits)              Data (32 bits)
0000 0000 00      x       xxx                        xxx
0000 0000 01      x       xxx                        xxx
0000 0000 10      x       xxx                        xxx
...               x       xxx                        xxx
1010 1111 10      1       1011 0110 0101 1001 0101   1111 0000 1100 1100 1010 1010 0101 0101
1010 1111 11      x       xxx                        xxx
...               x       xxx                        xxx
1111 1111 11      x       xxx                        xxx

Memory addresses, split as Tag (20 bits) | Index (10 bits) | byte offset (2 bits):

1011 0110 0101 1001 0101 | 1010 1111 01 | 00
1011 0110 0101 1001 0101 | 1010 1111 01 | 01
1011 0110 0101 1001 0101 | 1010 1111 01 | 10
1011 0110 0101 1001 0101 | 1010 1111 01 | 11
1011 0110 0101 1001 0101 | 1010 1111 10 | 00
1011 0110 0101 1001 0101 | 1010 1111 10 | 01
1011 0110 0101 1001 0101 | 1010 1111 10 | 10
1011 0110 0101 1001 0101 | 1010 1111 10 | 11
1011 0110 0101 1001 0101 | 1010 1111 11 | 00
...

In cache: the word at tag 1011 0110 0101 1001 0101, index 1010 1111 10 (data 1111 0000 1100 1100 1010 1010 0101 0101).

Page 17: Bits in a Cache

Total bits required for a direct-mapped cache with 4KB of data and 1-word blocks, assuming a 32-bit address (see the sketch below):
• block size = 1 word = 4 bytes
• # of blocks = 4KB / 4 bytes = 1K blocks
• each block has 4 bytes of data + a tag + a valid bit
• tag size = 32 bits (address) - 10 bits (block index) - 2 bits (byte within block) = 20 bits
• total bits in the cache = 1K x (32 bits + 20 bits + 1 bit) = 53 Kbits (= 6.625 KB)
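The same accounting as a minimal runnable sketch (not from the slides); it reproduces the 53 Kbit / 6.625 KB result and can be re-run with other parameters.

```c
#include <stdio.h>

/* Storage for a direct-mapped cache with the slide's parameters:
   32-bit addresses, 4 KB of data, 1-word (4-byte) blocks. */
int main(void) {
    const int addr_bits   = 32;
    const int block_bytes = 4;                          /* 1 word */
    const int data_bytes  = 4 * 1024;                   /* 4 KB */
    const int num_blocks  = data_bytes / block_bytes;   /* 1024 */

    int offset_bits = 2;                                /* log2(block_bytes) */
    int index_bits  = 10;                               /* log2(num_blocks) */
    int tag_bits    = addr_bits - index_bits - offset_bits;   /* 20 */

    /* data + tag + valid, per block */
    long total_bits = (long)num_blocks * (8L * block_bytes + tag_bits + 1);
    printf("tag=%d bits, total=%ld Kbits (%.3f KB)\n",
           tag_bits, total_bits / 1024, total_bits / 8.0 / 1024.0);
    return 0;
}
```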

Page 18: Cache Blocks

Cache Block (sometimes called a cache line)
• a cache entry that has its own cache tag
• the previous example uses 4-byte blocks

Larger cache blocks take advantage of spatial locality
• example: a 64KB cache using 4-word (16-byte) blocks

[Figure: 32-bit address (bits 31-16: tag, bits 15-4: index, bits 3-2: block offset, bits 1-0: byte offset) indexing a 4K-entry cache; each entry holds a valid bit, a 16-bit tag, and 128 bits of data, and a 4-to-1 multiplexor driven by the 2-bit block offset selects one 32-bit word, with the tag compare producing the hit signal.]

Page 19: Block Size Tradeoff

In general, larger block sizes take advantage of spatial locality, BUT:
• larger block size means larger miss penalty:
  • it takes longer to fill up the block
• if the block size is too big relative to the cache size, the miss rate will go up
  • too few cache blocks

In general, Average Access Time (see the sketch below)
• = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate

[Figure: three curves vs. block size - miss penalty rises with block size; miss rate first falls (exploits spatial locality) then rises (fewer blocks compromises temporal locality); average access time therefore has a minimum, rising at large block sizes due to the increased miss penalty and miss rate.]
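A runnable sketch of the slide's average-access-time formula; the hit time, miss rate, and miss penalty below are made-up illustrative numbers, not from the slides.

```c
#include <stdio.h>

/* The slide's formula: AMAT = hit_time*(1 - miss_rate) + miss_penalty*miss_rate. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
}

int main(void) {
    /* Hypothetical: 1-cycle hit, 5% miss rate, 40-cycle miss penalty. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 40.0));   /* 0.95 + 2.0 = 2.95 */
    return 0;
}
```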

Page 20: Block Size Tradeoff (cont.)

Data from simulating a direct-mapped cache. Note the miss rate trends as:
• capacity increases for a fixed block size
• block size increases for a fixed capacity

[Figure: miss rate (0%-40%) vs. block size (16 to 256 bytes) for cache capacities of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB; miss rate falls with capacity, and for the small caches it rises again at large block sizes.]

Page 21: Hits vs. Misses

Read hits
• this is what we want!

Read misses
• stall the CPU, fetch the block from memory, deliver it to the cache, restart

Write hits:
• can replace data in cache and memory (write-through)
• write the data only into the cache (write the cache back to memory later)

Write misses:
• read the entire block into the cache, then write the word

Page 22: Hardware Issues

Make reading multiple words easier by using banks of memory

It can get a lot more complicated...

[Figure: three memory organizations between CPU, cache, bus, and memory -
a. one-word-wide memory organization;
b. wide memory organization (a wide memory and bus, with a multiplexor between cache and CPU);
c. interleaved memory organization (memory banks 0-3 on a one-word-wide bus).]

Page 23: Synchronous DRAM (SDRAM) Timing

[Figure-only slide: an SDRAM timing diagram.]

Page 24: Increasing Bandwidth - Interleaving

Access pattern without interleaving:
• the CPU starts the access for D1, waits for D1 to become available, and only then starts the access for D2; each access occupies the memory for a full cycle time

Access pattern with 4-way interleaving (see the sketch below):
• consecutive words reside in consecutive banks, so the CPU can access Bank 0, Bank 1, Bank 2, and Bank 3 in overlapped fashion, and can access Bank 0 again once its cycle time has elapsed

[Figure: CPU connected to Memory Banks 0-3; the bank cycle time spans several access times, so sequential accesses overlap across banks.]
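A minimal sketch (not from the slides) of word interleaving across four banks: consecutive word addresses map to consecutive banks, which is what lets the accesses above overlap.

```c
#include <stdint.h>
#include <stdio.h>

/* Word-interleaved addressing: word address modulo the bank count picks the bank. */
#define NUM_BANKS  4
#define WORD_BYTES 4

static unsigned bank_of(uint32_t addr)     { return (addr / WORD_BYTES) % NUM_BANKS; }
static unsigned row_in_bank(uint32_t addr) { return (addr / WORD_BYTES) / NUM_BANKS; }

int main(void) {
    for (uint32_t addr = 0; addr < 32; addr += WORD_BYTES)
        printf("addr %2u -> bank %u, row %u\n",
               (unsigned)addr, bank_of(addr), row_in_bank(addr));
    return 0;
}
```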

Page 25: Split Cache

Use split caches, because there is more spatial locality in code:
• two independent caches operating in parallel
  • an instruction cache and a data cache
• used to increase cache bandwidth
  • i.e., the data rate between cache and processor
• miss rate slightly higher than that of a combined cache
  • e.g., for a 32KB total cache size:
    - split cache effective miss rate: 3.24%
    - combined cache miss rate: 3.18%
• the increased cache bandwidth easily overcomes the disadvantage of the slightly increased miss rate
• free from cache contention in instruction pipelining

Page 26: More about Cache Write

Cache reads are much easier to handle than cache writes
• a read does not change the value of the data

Cache write
• need to keep the data in the cache and memory consistent

Two options
• Write-Through: write to both cache and memory
  • control is simple
  • isn't memory too slow for this?
• Write-Back: write to cache only
  • write the cache block to memory when that cache block is being replaced on a cache miss
  • reduces the memory bandwidth required
  • keep a bit (called the dirty bit) per cache block to track whether the block has been modified (see the sketch below)
    - only modified blocks need to be written back
  • control can be complex
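A minimal sketch (mine, not the slides' design) of the write-back bookkeeping: writes set the dirty bit, and only dirty blocks are written back on replacement. The line type and helper names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[16]; } line_t;

static void write_word(line_t *line, int offset, uint32_t value) {
    memcpy(&line->data[offset], &value, sizeof value);
    line->dirty = true;                    /* block modified; memory is now stale */
}

static void evict(line_t *line, void (*writeback)(const line_t *)) {
    if (line->valid && line->dirty)
        writeback(line);                   /* only modified blocks go back to memory */
    line->valid = line->dirty = false;
}

static void to_memory(const line_t *l) { (void)l; /* stand-in for the DRAM write */ }

int main(void) {
    line_t line = { .valid = true };
    write_word(&line, 0, 0xF0CCAA55u);
    evict(&line, to_memory);               /* dirty, so it is written back */
    return 0;
}
```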

Page 27: Write Buffer for Write Through

A write buffer is needed between the cache and memory
• processor: writes data into the cache and the write buffer
• memory controller: writes the contents of the buffer to memory

The write buffer is just a FIFO (First-In First-Out) queue:
• typical number of entries: 4-8
• works fine if: "store" frequency << 1 / (DRAM write cycle)
• on write buffer saturation, stall the processor to allow memory to catch up

[Figure: Processor -> Cache, with a Write Buffer between the processor/cache and DRAM.]

Page 28: Cache Performance

We can safely assume cache access time (hit time) is a single clock cycle:
• CPU time with perfect cache = CPU cycles x clock cycle time
• CPU time with real-world cache = (CPU cycles + memory stall cycles) x clock cycle time

The memory system affects
• memory stall cycles
  • cache miss stalls + write buffer stalls (in the case of a write-through cache)
• clock cycle time
  • since cache access often determines the clock speed of a processor

Memory stall cycles = read stall cycles + write stall cycles
• Read stall cycles = read miss rate x #reads x read miss penalty
• For a write-back cache
  • Write stall cycles = write miss rate x #writes x write miss penalty
  • can combine the read and write components (see the sketch below):
    - memory stall cycles = miss rate x #memory accesses x miss penalty
• For write-through caches
  • add write buffer stalls
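The combined formula as a runnable sketch; the instruction count, accesses per instruction, miss rate, and miss penalty are assumed numbers, not from the slides.

```c
#include <stdio.h>

int main(void) {
    double insts        = 1e6;     /* hypothetical instruction count */
    double base_cpi     = 1.0;     /* CPI with a perfect cache (assumed) */
    double accesses_pi  = 1.3;     /* memory accesses per instruction (assumed) */
    double miss_rate    = 0.03;
    double miss_penalty = 50.0;    /* cycles */

    /* memory stall cycles = miss rate x #memory accesses x miss penalty */
    double stalls = miss_rate * (insts * accesses_pi) * miss_penalty;
    double cycles = insts * base_cpi + stalls;
    printf("CPI with stalls = %.3f\n", cycles / insts);   /* 1.0 + 1.95 = 2.95 */
    return 0;
}
```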

Page 29: Cache Performance Example

Assume:
• miss rate for instructions = 5%
• miss rate for data = 8%
• data references per instruction = 0.4
• CPI with perfect cache = 1.4
• miss penalty = 20 cycles

Find the performance relative to a perfect cache with no misses (same clock rate):
• Misses/instruction = 0.05 (instruction misses) + 0.4 x 0.08 (data misses) = 0.082
• Miss stall CPI = 0.082 x 20 = 1.64
• Performance is the ratio of CPIs (instruction count and clock rate are the same):

  Performance(no misses) / Performance(with misses)
    = CPI(with misses) / CPI(no misses)
    = (1.4 + 1.64) / 1.4 ≈ 2.17
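The same example as code (a sketch, not from the slides; the exact ratio is 3.04/1.4 ≈ 2.17):

```c
#include <stdio.h>

int main(void) {
    double i_miss_rate = 0.05, d_miss_rate = 0.08;
    double d_refs_per_inst = 0.4, perfect_cpi = 1.4, penalty = 20.0;

    double misses_per_inst = i_miss_rate + d_refs_per_inst * d_miss_rate; /* 0.082 */
    double stall_cpi = misses_per_inst * penalty;                         /* 1.64 */
    printf("slowdown = %.2f\n", (perfect_cpi + stall_cpi) / perfect_cpi); /* 2.17 */
    return 0;
}
```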

Page 30: Set-Associative Caches

Improve the cache hit ratio by allowing a memory location to be placed in more than one cache block
• an N-way associative cache allows placement in any block of a set with N elements
  • N is the set size
  • number of blocks = N x number of sets
• the set number is selected by a simple modulo function of the address bits (the set number is also called the index; see the sketch below)
• fully-associative cache
  • when there is a single set, allowing a memory location to be placed in any cache block
• a direct-mapped organization can be considered a degenerate set-associative cache with set size = 1

For fixed cache capacity, a larger set size leads to higher hit rates
• because more combinations of cache blocks can be present in the cache at the same time
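A minimal sketch (not from the slides) of the modulo set-selection just described; the capacity, block size, and associativity chosen here are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative parameters: 4 KB capacity, 16-byte blocks, 4-way set-associative. */
#define CAPACITY   4096
#define BLOCK_SIZE 16
#define WAYS       4
#define NUM_SETS   (CAPACITY / BLOCK_SIZE / WAYS)   /* 64 sets */

static uint32_t set_index(uint32_t addr) {
    return (addr / BLOCK_SIZE) % NUM_SETS;   /* the slide's modulo function */
}
static uint32_t tag_of(uint32_t addr) {
    return addr / BLOCK_SIZE / NUM_SETS;     /* remaining high bits form the tag */
}

int main(void) {
    uint32_t addr = 0x12345678u;
    printf("set=%u tag=0x%x\n", (unsigned)set_index(addr), (unsigned)tag_of(addr));
    return 0;
}
```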

Page 31: Set-Associative Cache Examples

[Figure: four configurations of an 8-block cache -
one-way set associative (direct mapped): 8 sets (blocks 0-7), one (tag, data) pair per set;
two-way set associative: 4 sets, two (tag, data) pairs per set;
four-way set associative: 2 sets, four (tag, data) pairs per set;
eight-way set associative (fully associative): 1 set of eight (tag, data) pairs.]

Page 32: Implementation of 4-Way Set-Associative Cache

[Figure: 32-bit address (22-bit tag, 8-bit index, bits 1-0: byte offset) indexing 256 sets; each set holds four (valid, tag, data) entries whose tags are compared in parallel with the address tag, and a 4-to-1 multiplexor selects the data from the matching way, producing the hit signal and the data.]

Page 33: Miss Rate vs. Set Size

Data is for gcc (the GNU C compiler) and spice on a DECStation 3100 with separate 64KB instruction/data caches using 16-byte blocks.

In general, increasing associativity beyond 2-4 ways has minimal impact on the miss ratio.

Program   Associativity   Instruction Miss Rate   Data Miss Rate
gcc       1               2.0%                    1.7%
gcc       2               1.6%                    1.4%
gcc       4               1.6%                    1.4%
spice     1               0.3%                    0.6%
spice     2               0.3%                    0.6%
spice     4               0.3%                    0.6%

Page 34: Miss Rate vs. Set Size

Data for SPEC92 on a combined instruction/data cache with 32-byte blocks.

[Figure: miss rate (0%-15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB; miss rate falls as capacity and associativity increase.]

Page 35: Disadvantage of Set Associative Cache

N-way set-associative cache versus direct-mapped cache:
• N comparators vs. 1
• extra MUX delay for the data
• data comes AFTER the hit/miss decision and set selection

In a direct-mapped cache, the cache block is available BEFORE hit/miss:
• possible to assume a hit and continue; recover later if it was a miss

Example:
• 2-way set-associative cache

[Figure: a 2-way set-associative cache - the cache index selects one (valid, cache tag, cache data) entry from each of the two ways; both tags are compared against the address tag in parallel, the comparison results are ORed to form the hit signal, and Sel1/Sel0 drive a mux that picks the matching way's cache block.]

Page 36: Cache Block Replacement Policies

Direct-Mapped Cache
• each memory location is mapped to a single cache location
• no replacement policy is necessary
  • a new item replaces the previous item in that cache location

Set-Associative Caches
• N-way set-associative cache
  • each memory location has a choice of N cache locations
• cache miss handling for set-associative caches
  • bring in the new block from memory
  • if the selected set is full, identify a block in it to replace
  • need to decide which block to replace

Page 37: Cache Block Replacement Policies (cont.)

Random Replacement
• hardware randomly selects a cache block to replace

Optimal Replacement
• replace the block that will be used farthest in the future

Least Recently Used (LRU)
• hardware keeps track of access history
  • replace the entry that has not been used for the longest time
• simple for 2-way associative (see the sketch below)
  • a single bit in each set indicates which block was more recently used
• implementing LRU gets harder for higher degrees of associativity

In practice, the replacement policy has a minor impact on the miss rate
• especially for high associativity
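A minimal sketch (not from the slides) of the single-bit LRU for a 2-way set: the bit records the way to evict, i.e., the way that was not most recently used.

```c
#include <stdio.h>

enum { NUM_SETS = 64 };                       /* illustrative set count */
static unsigned char lru_bit[NUM_SETS];       /* per set: the LRU way (0 or 1) */

static void touch(int set, int way) { lru_bit[set] = (unsigned char)(1 - way); }
static int  victim(int set)         { return lru_bit[set]; }  /* way to replace on a miss */

int main(void) {
    touch(5, 0);                              /* way 0 of set 5 was just used */
    printf("evict way %d\n", victim(5));      /* -> way 1 */
    return 0;
}
```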

Page 38: Decreasing Miss Penalty with Multilevel Caches

Add a second-level cache:
• often the primary cache is on the same chip as the processor
  • primary cache = L1 cache = on-chip cache
• use SRAMs to add another cache above primary memory (DRAM)
  • L2 cache
• the miss penalty goes down if the data is in the 2nd-level cache
• on-die L2 caches
  • started to be integrated onto the same die in late 1998, and this has since become the general trend

Example (see the sketch below):
• CPI of 1.0 on a 5 GHz machine with a 2% miss rate and 100ns DRAM access
• adding a 2nd-level cache with 5ns access time decreases the miss rate to main memory to 0.5%
• performance gain is about 2.8
• refer to the textbook (pp. 505-506)
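A sketch (not from the slides) reproducing the arithmetic behind the ~2.8 figure, assuming the usual accounting in which the 100 ns and 5 ns access times are converted to cycles at 5 GHz.

```c
#include <stdio.h>

int main(void) {
    double cycle_ns = 1.0 / 5.0;                   /* 5 GHz -> 0.2 ns per cycle */
    double main_pen = 100.0 / cycle_ns;            /* 500 cycles to DRAM */
    double l2_pen   = 5.0 / cycle_ns;              /* 25 cycles to L2 */

    double cpi_l1   = 1.0 + 0.02 * main_pen;       /* 1 + 10 = 11.0 */
    double cpi_l2   = 1.0 + 0.02  * l2_pen         /* L1 misses served by L2 */
                          + 0.005 * main_pen;      /* plus misses that go to DRAM */
    printf("speedup = %.2f\n", cpi_l1 / cpi_l2);   /* 11 / 4 = 2.75, i.e. ~2.8 */
    return 0;
}
```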

Using multilevel caches:
• try to optimize the hit time on the 1st-level cache
• try to optimize the miss rate on the 2nd-level cache

Page 39: Cache Complexities

It is not always easy to understand the implications of caches:

[Figure: two plots of Radix sort vs. Quicksort over size (K items to sort, 4 to 4096) - left: theoretical behavior (y-axis 0 to 1200); right: observed behavior (y-axis 0 to 2000).]

Page 40: Cache Complexities (cont.)

Here is why:

Memory system performance is often a critical factor
• multilevel caches and pipelined processors make it harder to predict outcomes
• compiler optimizations that increase locality sometimes hurt ILP

It is difficult to predict the best algorithm: you need experimental data.

[Figure: a third Radix sort vs. Quicksort plot over size (K items to sort, 4 to 4096), y-axis 0 to 5 - in the textbook's version of this figure, cache misses per item.]

Page 41: Summary: Improving Cache Performance

Cache performance is determined by
• Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty

Use better technology
• use faster RAMs
• cost and availability are limitations

Decrease hit time
• make the cache smaller, but the miss rate increases
• use direct mapped instead of set-associative, but the miss rate increases

Decrease miss rate
• make the cache larger, but this can increase hit time
• add associativity, but this can increase hit time
• increase block size, but this increases miss penalty

Decrease miss penalty
• reduce the transfer-time component of the miss penalty
• add another level of cache (L2 cache)

Page 42: Another View of the Memory Hierarchy

[Figure: the hierarchy from the upper level (faster) to the lower level (larger) - Regs, Cache, L2 Cache, Memory, Disk, Tape - with the unit of transfer at each boundary: instruction operands between registers and cache, cache blocks between cache and L2 cache, blocks between L2 cache and memory, pages between memory and disk, and files between disk and tape. Covered thus far: the levels down to main memory. Next: virtual memory.]

Page 43: Memory Hierarchy Requirements

If the principle of locality allows caches to offer (close to) the speed of cache memory with the size of DRAM memory, then recursively, why not use the same idea at the next level to give the speed of DRAM memory with the size of disk memory?

Share memory between multiple processes but still provide protection - don't let one program read/write memory belonging to another.

Address space - give each program the illusion that it has its own private memory
• the compiler, linker, and loader are simplified because they see only the virtual address space, abstracted from physical memory allocation

Page 44: Virtual Memory

Called "Virtual Memory"
Also allows the OS to share memory and protect programs from each other
Today, it is more important for protection than as just another level of the memory hierarchy
Each process thinks it has all the memory to itself
Historically, it predates caches

Page 45: Virtual Memory (cont.)

Addressable memory space vs. physical memory
• example:
  • a 32-bit memory address can specify 4GB of memory
  • physical main memory = 16MB ~ 512MB

Distinguish between virtual and physical addresses
• a virtual address is used by the programmer to address memory within a process's address space
• a physical address is used by the hardware to access a physical memory location

"Virtual memory" provides the appearance of a very large memory
• total memory of all jobs >> physical memory
• address space of each job > physical memory

Simplifies memory management for a multi-processing system
• each program operates in its own virtual address space as if it were the only program running in the system

Uses 2 storage levels
• primary (DRAM) and secondary (hard disk)

Exploits the hierarchy to reduce average access time, as in a cache

Page 46: Virtual to Physical Address Translation

Each program operates in its own virtual address space
• as if it were the only program running in the system

Each program is protected from the others

The OS can decide where each program goes in memory

Hardware (HW) provides the virtual-to-physical mapping

[Figure: a program operates in its virtual address space; virtual addresses (instruction fetch, load, store) go through the HW mapping to physical addresses (instruction fetch, load, store) that access physical memory, including the caches.]

Page 47: Paged Virtual Memory

The most common form of address translation
• virtual and physical address spaces are partitioned into blocks of equal size
• virtual address space blocks are called pages
• physical address space blocks are called frames (or page frames)

Placement
• any page can be placed in any frame (fully associative)

Pages are fetched on demand

[Figure: memory hierarchy - registers, cache, memory, disk; a page moves between a frame in memory and the disk.]

Page 48: Paging Organization

Paging can map any virtual page to any physical frame.

Data missing from main memory must be transferred from secondary memory (disk)
• misses (page faults) are handled by the operating system
• the miss time is very large, so the OS manages the hierarchy and schedules another process instead of stalling (context switching)

[Figure: address translation maps virtual addresses either to physical addresses in main memory or to disk addresses.]

Page 49: Paging/Virtual Memory for Multiple Processes

[Figure: User A and User B each have their own virtual memory (code, static, heap, and stack segments, each address space starting at 0); A's page table and B's page table map their pages into a shared 64 MB physical memory.]

Page 50: Address Translation

A program uses virtual addresses
• Relocation: a program can be loaded anywhere in physical memory without recompiling or re-linking

Memory is accessed with physical addresses

Hardware (HW) provides the virtual-to-physical mapping
• need a translation table for each process

When a virtual address is missing from main memory, the OS handles the miss
• read the missing data, create the translation, and return to re-execute the instruction that caused the miss

[Figure: a virtual address enters address translation; if the mapping is valid, the resulting physical address accesses main memory. A page fault invokes the page fault handler (in the OS), which performs the transfer from secondary memory (disk) into main memory.]

Page 51: Address Mapping

[Figure: virtual memory address mapping with 1 KB pages. The virtual address (V.A.) splits into a page number and a 10-bit displacement; the page number indexes the page table (located in physical memory, found via the page table base register), whose entries hold a valid bit (V), access rights, and a frame number. The frame number combined with the displacement (added - though, actually, concatenation is more likely) forms the physical memory address. Example: pages 0-31 of virtual memory (V.A. 0, 1024, ..., 31744) map onto frames 0-7 of physical memory (P.A. 0, 1024, ..., 7168); the 1 KB page is both the unit of mapping and the unit of transfer from virtual to physical memory.]

Page 52: Address Translation Algorithm

If V = 1, the mapping is valid (see the sketch below)
• the CPU checks the permissions (R, R/W, X) against the access type
  • if the access is permitted, it generates the physical address and proceeds
  • if the access is not permitted, it generates a protection fault

If V != 1, the mapping is invalid
• the wanted page does not reside in main memory
• the CPU generates a page fault

Faults are exceptions handled by the OS
• page faults
  • the OS fetches the missing page, creates a map entry, and restarts the process
  • another user process is switched in to execute while the page is brought in from disk (context switching)
• protection faults
  • the OS checks whether it is a programming error or whether the permissions need to be changed
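A minimal C sketch of the translation algorithm above (types and names are mine; a real MMU does this in hardware). It uses the 1 KB pages of the earlier Address Mapping slide and concatenates the frame number with the displacement.

```c
#include <stdint.h>

enum { PAGE_BITS = 10 };                               /* 1 KB pages */
enum access_t { ACC_READ, ACC_WRITE, ACC_EXEC };
typedef enum { XLATE_OK, PAGE_FAULT, PROTECTION_FAULT } result_t;

typedef struct { unsigned valid : 1; unsigned rights : 3; uint32_t frame; } pte_t;

static result_t translate(const pte_t *page_table, uint32_t va,
                          enum access_t acc, uint32_t *pa) {
    pte_t e = page_table[va >> PAGE_BITS];             /* index by virtual page number */
    if (!e.valid)
        return PAGE_FAULT;                             /* page not in main memory */
    if (!(e.rights & (1u << acc)))
        return PROTECTION_FAULT;                       /* access type not permitted */
    /* concatenate the frame number with the in-page displacement */
    *pa = (e.frame << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
    return XLATE_OK;
}

int main(void) {
    pte_t table[32] = {0};
    table[1] = (pte_t){ .valid = 1, .rights = 0x7, .frame = 5 };
    uint32_t pa;
    return translate(table, (1u << PAGE_BITS) | 12u, ACC_READ, &pa) != XLATE_OK;
}
```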

Page 53: Making VM Fast: TLB

If the page table is kept in memory
• every memory reference requires two accesses
  • one for the page table entry and one to get the actual data

Translation Lookaside Buffer (TLB)
• an additional cache for the page table only
  • hardware maintains a cache of recently used page table translations
  • all accesses are looked up in the TLB
    • a hit in the TLB gives the physical page number
    • a miss in the TLB => get the translation from the page table and reload the TLB
• the TLB is usually smaller than the cache (each entry maps a full page)
  • more associativity is possible and common
  • similar speed to a cache access
• contains all the bits needed to translate the address and implement VM
• typical TLB entry (see the sketch below):

  Valid | Virtual Address | Physical Address | Dirty | Access Rights
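A minimal sketch (not from the slides) of a small fully-associative TLB holding the entry fields just listed; the entry count, the 4 KB page size, and all names are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum { TLB_ENTRIES = 16, PAGE_BITS = 12 };

typedef struct {
    bool     valid, dirty;
    uint32_t vpn, pfn;        /* virtual page number -> physical frame number */
    unsigned rights;          /* access rights */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* On a hit, fill *pa and return true; on a miss, the page table supplies the
   translation, an entry is reloaded, and the access is retried. */
static bool tlb_lookup(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].pfn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
            return true;
        }
    return false;
}

int main(void) {
    tlb[0] = (tlb_entry_t){ .valid = true, .vpn = 0x12345, .pfn = 0x00042 };
    uint32_t pa;
    if (tlb_lookup(0x12345ABCu, &pa))
        printf("pa = 0x%08x\n", (unsigned)pa);   /* 0x00042abc */
    return 0;
}
```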

Page 54: Virtual Memory and Cache

The OS manages the memory hierarchy between secondary storage and main memory
• it allocates physical memory to virtual memory and specifies the mapping to the hardware through page tables
• the hardware caches recently used page table entries in the TLB

[Figure: the CPU issues a virtual address to the address unit; the TLB performs the address translation to a physical address, which accesses the cache and, on cache misses, physical memory (main memory/primary storage). TLB misses go to the page tables; page faults go to secondary storage.]

Page 55: TLBs and Caches

[Flowchart, reconstructed as steps:]
1. Take the virtual address and access the TLB.
2. TLB miss? => TLB miss exception (get the translation from the page table).
3. TLB hit => physical address. If the access is a write and the write-access bit is off => write protection exception.
4. Read: try to read the data from the cache. Cache hit => deliver the data to the CPU; cache miss => stall while the block is read.
5. Write (write-access bit on): try to write the data to the cache. Cache hit => write the data into the cache, update the dirty bit, and put the data and the address into the write buffer; cache miss => stall while the block is read.

Page 56: Page Replacement and Write Policies

When a page fault occurs, choose a page to replace
• fully associative, so any frame/page is a candidate
• choose an empty one if it exists
• otherwise choose with either (just as we did for the cache):
  • LRU
  • Random

Write policy: always write-back
• keep a dirty bit
  • set to 1 if the page is modified
  • when a modified page is replaced, the OS writes it back to disk

Page 57: Modern Systems

[Slide content (a figure/table of modern systems) was not captured in the transcript.]

Page 58: Modern Systems (cont.)

Things are getting complicated!