Page 1: PowerPoint

CME212 – Introduction to Large-Scale Computing in Engineering
High Performance Computing and Programming

CME212 Lecture 15

Caches and Memory

Page 2: PowerPoint

Memory

• Writable?
  – Read-Only (ROM)
  – Read-Write

• Accessing
  – Random Access (RAM)
  – Sequential Access (Tapes)

• Lifetime
  – Volatile (needs power)
  – Non-Volatile (can be powered off)

Page 3: PowerPoint

Conventional RAM

• Dynamic RAM (DRAM)
  – Works in refresh cycles
  – Few transistors means low cost

• Static RAM (SRAM)
  – More transistors than DRAM
  – More expensive
  – No refresh means much faster

Page 4: PowerPoint

Flash Memory

• Non-volatile memory
  – Charged electrons in fields, quantum tunneling

• Cheap NAND Flash has only sequential access

• Finite number of "flashes" (write cycles)

• Problems with writes
  – Can only be written in blocks

• Used in cameras, MP3 players

Page 5: PowerPoint

Disk Operation (single-platter view)

• The disk surface spins at a fixed rotational rate around the spindle.

• By moving radially, the arm can position the read/write head over any track.

• The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.

Page 6: PowerPoint

Disk Operation (multi-platter view)

• The read/write heads move in unison from cylinder to cylinder around a common spindle.

Page 7: PowerPoint

CPU-Memory Gap

Image from Sun Microsystems

Page 8: PowerPoint

The CPU-Memory Gap

• Cheap memory must be built out of few transistors

• The most common main memory type is DRAM (dynamic RAM), which saves transistors by operating in refresh cycles

• The other type, SRAM (static RAM), uses a different, more expensive design without refreshing

• The clock frequency of CPUs increases at a much higher rate than that of DRAM

• Conclusion: CPU must wait for data to pass through the memory system

Page 9: PowerPoint

Implications for Pipelines

• Waiting for data stalls the pipeline

• Common DRAM latency is about 150 cycles – unacceptable!

• We would need a lot of registers to keep this latency hidden

• Solution: cache memories

• A cache memory is a smaller, faster SRAM memory which acts as temporary storage to hide the DRAM latencies

Page 10: PowerPoint

Webster Definition of "cache"

cache \'kash\ n [F, fr. cacher to press, hide, fr. (assumed) VL coacticare to press together, fr. L coactare to compel, fr. coactus, pp. of cogere to compel – more at COGENT]

1a: a hiding place esp. for concealing and preserving provisions or implements
1b: a secure place of storage
2: something hidden or stored in a cache

Page 11: PowerPoint

Cache Memory

CPU ↔ Cache ↔ Memory (DRAM)

• Cache: small but fast, close to the CPU
• Memory (DRAM): large and slow (cheap), far away from the CPU

Page 12: PowerPoint

Basics of Caches

• Caches hold copies of the memory
  – Need to be synchronized with memory
  – This is handled transparently to the CPU

• Caches have a limited capacity
  – Cannot fit the entire memory at one time

• Caches work because of the principle of locality

Page 13: PowerPoint

General Principles of Computer Programs

• Principle of locality:
  – Programs tend to reuse data and instructions they have used recently. A widely held rule of thumb is that a program spends 90% of its execution time in only 10% of its code.

• We can predict what instructions and data a program will use based on its history

• Temporal locality: recently accessed items are likely to be accessed in the near future

• Spatial locality: items whose addresses are near one another tend to be referenced close together in time

Page 14: PowerPoint

Cache Knowledge Useful When...

• Designing a new computer

• Writing an optimized program
  – or compiler
  – or operating system ...

• Implementing software caching
  – Web caches
  – Proxies
  – File systems

Page 15: PowerPoint

Cache Concepts

• Requests for data are sent to the memory subsystem
  – They either hit or miss in a cache
  – On a miss we need to get a copy from memory

• Caches have finite capacity
  – Data needs to be replaced
  – How do we find our victim?

• Caches need to be fast
  – How do we verify if data is in the cache or not?

Page 16: PowerPoint


Details of Caching

• Every piece of data is identified using an address

• We can store the address in a “phone book” to find a piece of data

• When the CPU sends out a request for data, we need a fast mechanism to find out if we have a hit or miss


Page 17: PowerPoint

Mapping Strategies

• In a direct mapped cache each piece of data has a given location

• In a fully associative cache any piece of data can go anywhere (parallel search)

• In a set associative cache any piece of data can go anywhere within a subset
  – Data is directly mapped to sets
  – Each set is associative (must be searched)

Page 18: PowerPoint

Set Associativity

• The address space is divided among the sets, modulo the number of sets in the cache

• The exact mapping is given by some bits of the address

• Example:
  – 4-way set associative, each set holds 256 bytes
  – Address space is 0x800 bytes (2048 bytes decimal)
  – Bits 9 and 10 identify the set

set0: 000-0FF, 400-4FF
set1: 100-1FF, 500-5FF
set2: 200-2FF, 600-6FF
set3: 300-3FF, 700-7FF

Potential conflict: addresses in the same set differ only in their highest bits, which specify a tag.
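
A rough sketch of the mapping in this example (the constants and function names below are illustrative, not course code): the set index and tag can be computed from an address like this.

#include <stdio.h>
#include <stdint.h>

/* Illustrative sketch of the slide's example: 4 sets of 256 bytes each,
 * so the set index comes from address bits 8-9 ("bits 9 and 10" counted
 * from 1) and the higher bits form the tag. */
#define SET_BYTES 256u    /* bytes mapped to one set     */
#define NUM_SETS  4u      /* number of sets in the cache */

static unsigned set_index(uint32_t addr)
{
    return (addr / SET_BYTES) % NUM_SETS;    /* == (addr >> 8) & 0x3 */
}

static unsigned tag_of(uint32_t addr)
{
    return addr / (SET_BYTES * NUM_SETS);    /* highest bits specify the tag */
}

int main(void)
{
    /* 0x000 and 0x400 land in the same set: a potential conflict. */
    printf("0x000 -> set %u, tag %u\n", set_index(0x000), tag_of(0x000));
    printf("0x400 -> set %u, tag %u\n", set_index(0x400), tag_of(0x400));
    return 0;
}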

Page 19: PowerPoint

Address Book Cache – Looking for Tommy's Telephone Number

• An address book with one tab per letter: the first letter is the indexing function
• "TOMMY" is the address tag, the telephone number 12345 is the data
• One entry per page => a direct-mapped cache with 28 entries

Page 20: PowerPoint

Address Book Cache – Looking for Tommy's Number

• Index on the first letter of TOMMY, then compare (EQ?) the stored tag "OMMY" with the rest of the name
• The tags match: a hit, and the data 12345 is returned

Page 21: PowerPoint

Address Book Cache – Looking for Tomas' Number

• Index on the first letter of TOMAS, then compare (EQ?) the stored tag "OMMY" with "OMAS"
• Miss! Look up Tomas' number in the telephone directory

Page 22: PowerPoint

Address Book Cache – Looking for Tomas' Number

• Replace TOMMY's data with TOMAS' data ("OMAS 23457" is stored in the entry)
• There is no other choice (direct mapped)
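
A minimal sketch of this analogy in C (the names, numbers, and the slow_lookup callback are invented for illustration): the first letter is the index, the rest of the name is the tag, and the phone number is the data.

#include <stdio.h>
#include <string.h>

#define NUM_ENTRIES 28          /* one entry per letter tab */

struct entry {
    char tag[32];               /* rest of the name ("OMMY") */
    int  number;                /* the cached data (12345)   */
    int  valid;
};

static struct entry book[NUM_ENTRIES];

static int lookup(const char *name, int (*slow_lookup)(const char *))
{
    int idx = (name[0] - 'A') % NUM_ENTRIES;         /* indexing function (plain ASCII) */
    struct entry *e = &book[idx];
    if (e->valid && strcmp(e->tag, name + 1) == 0)   /* EQ? on the tag */
        return e->number;                            /* hit            */
    /* Miss: fetch from the "telephone directory" and replace the entry
     * (no other choice in a direct-mapped cache). */
    e->number = slow_lookup(name);
    strncpy(e->tag, name + 1, sizeof e->tag - 1);
    e->valid = 1;
    return e->number;
}

static int directory(const char *name)               /* stands in for slow memory */
{
    return strcmp(name, "TOMMY") == 0 ? 12345 : 23457;
}

int main(void)
{
    printf("TOMMY -> %d\n", lookup("TOMMY", directory));  /* miss, then cached          */
    printf("TOMAS -> %d\n", lookup("TOMAS", directory));  /* conflict miss, evicts TOMMY */
    return 0;
}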

Page 23: PowerPoint

Cache Blocks

• To speed up the lookup process, data is allocated in cache blocks consisting of several consecutively stored words

• When you access a word, you will always bring several neighboring words into the cache

• Works well due to the principle of locality

Page 24: PowerPoint

Cache Blocks and Miss Ratios

• Consider a C array of 1024 doubles (double *array)
  – A pointer to the start address of a contiguous region in memory
  – Block size is 32 bytes, which equals 4 array elements
  – Loop through the array with an index increment of one (stride-1)

• Every 4th element (i = 0, 4, 8, ...) is a cache miss: 256 misses in total, a miss ratio of 25%
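
A small sketch of this count: the loop below mirrors the stride-1 traversal and tallies one miss per 32-byte block; it does not measure a real cache.

#include <stdio.h>

#define N            1024
#define BLOCK_BYTES  32
#define PER_BLOCK    (BLOCK_BYTES / sizeof(double))   /* 4 elements */

int main(void)
{
    static double array[N];
    double sum = 0.0;
    int misses = 0;

    for (int i = 0; i < N; i++) {
        if (i % PER_BLOCK == 0)   /* first touch of a new cache block */
            misses++;
        sum += array[i];
    }
    (void)sum;

    printf("accesses: %d, misses: %d, miss ratio: %.0f%%\n",
           N, misses, 100.0 * misses / N);   /* 1024, 256, 25% */
    return 0;
}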

Page 25: PowerPoint

Consequences of Cache Blocks

• Works well because of the principle of locality
  – Codes with a high degree of spatial locality reuse data within blocks

• We should aim for a stride-1 access pattern

• Structs should be packed and aligned to cache blocks
  – The compiler can help
  – Fill out structs using dummy data (see the sketch below)
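
A hedged sketch of padding and aligning a struct to a cache block; the 64-byte block size and the field names are assumptions for illustration, and the real block size differs between CPUs.

#include <stdio.h>

#define CACHE_BLOCK 64

struct particle {                      /* 48 bytes of payload...           */
    double x, y, z, vx, vy, vz;
    char pad[CACHE_BLOCK - 6 * sizeof(double)];  /* ...padded to one block */
};

struct particle2 {                     /* C11: align the first member,     */
    _Alignas(CACHE_BLOCK) double x;    /* which pads the whole struct to   */
    double y, z, vx, vy, vz;           /* a multiple of CACHE_BLOCK        */
};

int main(void)
{
    printf("sizeof(struct particle)  = %zu\n", sizeof(struct particle));
    printf("sizeof(struct particle2) = %zu\n", sizeof(struct particle2));
    return 0;
}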

Page 26: PowerPoint

Who to Replace? Picking a "Victim"

• Least-recently used (LRU)
  – Considered the "best" algorithm (which is not always true...)
  – Only practical up to ~4-way associativity

• Pseudo-LRU
  – Based on coarse time stamps

• Random replacement
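
An illustrative sketch of LRU bookkeeping for a single 4-way set (not how hardware implements it): each way records when it was last used, and the victim is the oldest.

#include <stdio.h>

#define WAYS 4

static unsigned long last_used[WAYS];   /* "time" of last access per way */
static unsigned long now = 0;

static void touch(int way)              /* called on every access/hit */
{
    last_used[way] = ++now;
}

static int lru_victim(void)             /* called on a miss */
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    return victim;
}

int main(void)
{
    touch(0); touch(1); touch(3); touch(1);
    printf("victim: way %d\n", lru_victim());   /* way 2: never used */
    return 0;
}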

Page 27: PowerPoint

The Memory Hierarchy

• Extend the caching idea and create a hierarchy of caches

• Arranged into levels
  – L1 – level 1 cache
  – L2 – level 2 cache

• Caches are often of increasing size

• Hide the latency of cheaper memory

Page 28: PowerPoint

Memory/Storage

Registers & Caches (SRAM) | Main Memory (DRAM) | Disk and Virtual Memory

2000 latencies:   1 ns, 1 ns, 3 ns, 10 ns (registers/SRAM caches), 150 ns (DRAM), 5 000 000 ns (disk)
2000 capacities:  1 kB, 64 k, 4 MB (registers/caches), 1 GB (DRAM), 1 TB (disk)
(1982 latencies:  200 ns, 200 ns, 200 ns, 10 000 000 ns)

Page 29: PowerPoint

An Example Memory Hierarchy

Smaller, faster, and costlier (per byte) storage devices at the top; larger, slower, and cheaper (per byte) storage devices at the bottom:

L0: registers – CPU registers hold words retrieved from the L1 cache
L1: on-chip L1 cache (SRAM) – holds cache lines retrieved from the L2 cache
L2: off-chip L2 cache (SRAM) – holds cache lines retrieved from main memory
L3: main memory (DRAM) – holds disk blocks retrieved from local disks
L4: local secondary storage (local disks) – holds files retrieved from disks on remote network servers
L5: remote secondary storage (distributed file systems, Web servers)

Page 30: PowerPoint

Caching in a Memory Hierarchy

• Level k: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (e.g., blocks 8, 9, 14, 3)

• Level k+1: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks (0-15)

• Data is copied between levels in block-sized transfer units

Page 31: PowerPoint

General Caching Concepts

• Program needs object d, which is stored in some block b

• Cache hit
  – Program finds b in the cache at level k, e.g., block 14

• Cache miss
  – b is not at level k, so the level k cache must fetch it from level k+1, e.g., block 12
  – If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
    • Placement policy: where can the new block go?
    • Replacement policy: which block should be evicted? E.g., LRU

Page 32: PowerPoint

Block Sizes in a Typical Memory Hierarchy

            Capacity   Block size   # of lines   # of 32-bit integers per block
Register    32 bits    4 bytes      1            1
L1 Cache    64 kB      32 bytes     2048         8
L2 Cache    2 MB       64 bytes     32768        16

Page 33: PowerPoint

Address Translation

• Translation is expensive since we need to keep track of many pages on a multi-tasking, multi-user system
  – Need to search or index the page table that maintains this information

• Introduce the Translation Lookaside Buffer (TLB) to remember the most recent translations
  – The TLB is a small on-chip cache
  – If we have an entry in the TLB, the page is probably in physical memory
  – Translation is much quicker (faster access time)

Page 34: PowerPoint

Page Sizes and TLB Reach

• Typical page sizes are 4 kB or 8 kB

• TLBs typically hold 256 or 512 entries

• The TLB reach is the amount of memory whose translations fit in the TLB
  – Multiply the page size by the number of entries
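
For example, with one of the configurations quoted above (assuming 4 kB pages and 512 entries), the reach works out as:

#include <stdio.h>

int main(void)
{
    const long page_size = 4 * 1024;   /* bytes per page */
    const long entries   = 512;        /* TLB entries    */
    printf("TLB reach = %ld kB\n", page_size * entries / 1024);  /* 2048 kB = 2 MB */
    return 0;
}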

Page 35: PowerPoint

General Caching Concepts

• Types of cache misses:
  – Cold (compulsory) miss
    • Cold misses occur because the cache is empty
  – Capacity miss
    • Occurs when the set of active cache blocks (working set) is larger than the cache
  – Conflict miss
    • Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block
    • E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time

Page 36: PowerPoint

Caches in Hierarchies

• To synchronize data in hierarchies, caches can either be:

1. Write-through
  – Reflect changes immediately
  – L1 is often write-through

2. Write-back
  – Synchronize all data at a given signal
  – Less traffic

Page 37: PowerPoint

Cache Performance Metrics

• Miss Rate
  – Fraction of memory references not found in cache (misses/references)
  – Typical numbers:
    • 3-10% for L1
    • can be quite small (e.g., < 1%) for L2, depending on size, etc.

• Hit Time
  – Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  – Typical numbers:
    • 1 clock cycle for L1
    • 3-8 clock cycles for L2

• Miss Penalty
  – Additional time required because of a miss
    • Typically 25-100 cycles for main memory
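
One way to combine these metrics is an average memory access time, hit time + miss rate x miss penalty; the sketch below just plugs in example values picked from the typical ranges above (the numbers are illustrative, not course-provided).

#include <stdio.h>

int main(void)
{
    const double hit_time     = 1.0;    /* L1 hit, cycles        */
    const double miss_rate    = 0.05;   /* 5% of references miss */
    const double miss_penalty = 100.0;  /* cycles to main memory */

    printf("average access time = %.1f cycles\n",
           hit_time + miss_rate * miss_penalty);   /* 6.0 */
    return 0;
}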

Page 38: PowerPoint

Caches and Performance

• Caches are extremely important for performance
  – Level 1 latency is usually 1 or 2 cycles

• Caches only work well for programs with nice locality properties

• Caching can be used in other areas as well, for example: web caching (proxies)

• Modern CPUs have two or three levels of caches
  – The largest caches are tens of megabytes

• Most of the chip area is used for caches

Page 39: PowerPoint

Nested Multi-dim Arrays

• Dimensions are stacked consecutively using an index mapping

• Consider a square two-dimensional array of size N x N

Page 40: PowerPoint

Row or Column-wise Order

• If you allocate a static multi-dimensional array in C, the rows of your array will be stored consecutively

• This is called row-wise ordering

• Row-wise or row-major ordering means the column index should vary fastest (i,j)

• Column-wise or column-major ordering means that the row index should vary fastest
  – Used in Fortran

Page 41: PowerPoint


Row-Major Ordering

• (i,j) loop will give stride-1 access

• (j,i) loop will give stride-N access

Array(i,j) -> (i*N+j)

Page 42: PowerPoint

Column-Major Ordering

• (i,j) loop will give stride-N access

• (j,i) loop will give stride-1 access

Array(i,j) -> (i + j*N)
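
A minimal sketch of the two index mappings for an N x N array stored in a flat buffer (the function names are illustrative, not from the course):

#include <stdio.h>
#include <stddef.h>

#define N 4

static size_t row_major(size_t i, size_t j) { return i * N + j; }  /* C       */
static size_t col_major(size_t i, size_t j) { return i + j * N; }  /* Fortran */

int main(void)
{
    double a[N * N];
    for (size_t k = 0; k < N * N; k++)
        a[k] = (double)k;       /* a[k] stores its own flat index */

    /* Moving along a row, (i,j) -> (i,j+1), is stride 1 in row-major
     * storage and stride N in column-major storage. */
    printf("row-major: (1,2) at %zu, (1,3) at %zu\n", row_major(1, 2), row_major(1, 3));
    printf("col-major: (1,2) at %zu, (1,3) at %zu\n", col_major(1, 2), col_major(1, 3));
    printf("a[row_major(1,2)] = %g\n", a[row_major(1, 2)]);
    return 0;
}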

Page 43: PowerPoint

Dynamically Allocated Arrays

• If you use a nested array, you can choose row-major or column-major with your indexing function: (i + N*j) or (i*N + j)

• For multi-level arrays there is no guarantee that the rows (the second indirection) will be stored consecutively

• You can still achieve this using some pointer arithmetic (page 92 in Oliveira)
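
A hedged sketch of that pointer-arithmetic trick (not the book's exact code): allocate one contiguous block for the data and point each row into it.

#include <stdlib.h>

/* Allocates an m x n multi-level array whose rows are guaranteed to be
 * contiguous: one block holds all the data, and the row pointers are set
 * up with pointer arithmetic into that block. */
double **alloc_matrix(size_t m, size_t n)
{
    double **rows = malloc(m * sizeof *rows);
    double *data  = malloc(m * n * sizeof *data);   /* one contiguous block */
    if (!rows || !data) { free(rows); free(data); return NULL; }

    for (size_t i = 0; i < m; i++)
        rows[i] = data + i * n;       /* row i starts at offset i*n */
    return rows;                      /* use as a[i][j]             */
}

void free_matrix(double **a)
{
    if (a) { free(a[0]); free(a); }   /* a[0] is the data block */
}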

Page 44: PowerPoint

Data caches, example

double x[m][n];
register double sum = 0.0;

for( i = 0; i < m; i++ ) {
  for( j = 0; j < n; j++ ) {
    sum = sum + x[i][j];
  }
}

• Assumptions:
  1. Only one data cache
  2. A cache block contains 4 double elements
  3. The i, j, sum variables stay in registers

Page 45: PowerPoint

Storage visualization, (i,j)-loop

for( i = 0; i < m; i++ ) {
  for( j = 0; j < n; j++ ) {
    sum = sum + x[i][j];
  }
}

The i index selects a row and the j index sweeps along it, so accesses walk through memory with stride 1: only the first element of each cache block is a MISS (every 4th access).

Page 46: PowerPoint

Storage visualization, (j,i)-loop

for( j = 0; j < n; j++ ) {
  for( i = 0; i < m; i++ ) {
    sum = sum + x[i][j];
  }
}

The inner loop walks down a column, so consecutive accesses are a whole row apart (stride-n): every access touches a new cache block and is a MISS.
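
A self-contained sketch comparing the two loop orders (the array size and the use of clock() are illustrative choices; actual timings depend on the machine):

#include <stdio.h>
#include <time.h>

#define M 4096
#define N 4096

static double x[M][N];

int main(void)
{
    double sum1 = 0.0, sum2 = 0.0;
    clock_t t0, t1;

    for (int i = 0; i < M; i++)          /* fill with some data */
        for (int j = 0; j < N; j++)
            x[i][j] = i + j;

    t0 = clock();
    for (int i = 0; i < M; i++)          /* (i,j): stride-1, one miss per block */
        for (int j = 0; j < N; j++)
            sum1 += x[i][j];
    t1 = clock();
    printf("(i,j) loop: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)          /* (j,i): stride-N, a miss on every access */
        for (int i = 0; i < M; i++)
            sum2 += x[i][j];
    t1 = clock();
    printf("(j,i) loop: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    printf("sums: %g %g\n", sum1, sum2); /* keep the loops from being optimized away */
    return 0;
}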

Page 47: PowerPoint

Cache Thrashing

• The start addresses of x and y might map to the same set

• Accesses to y will conflict with x
  – No data will be mapped to the other sets
  – Only one set will be used (a small part of the cache)
  – The index bits are the same for x and y

• Solution: array padding
  – Make one array larger
  – The distance between the arrays will then not be a power of 2
  – The same thing can happen in set associative caches

float dotprod(float x[256], float y[256])
{
  float sum = 0.0;
  int i;

  for( i = 0; i < 256; i++ )
    sum += x[i] * y[i];
  return sum;
}

Beware of array sizes that are powers of two!

Page 48: PowerPoint

Array Padding

• Used to reduce thrashing
  – Especially important for multi-dimensional arrays

• Allocate more space
  – Which isn't used in computations
  – Will shift subsequent arrays to addresses that are not powers of two

• Typical padding
  – Use an odd number, often a prime, like 13, 21, or 31
  – Verify the effect experimentally