Review: The Memory Hierarchy
[Figure: memory hierarchy pyramid – Processor, L1$, L2$, Main Memory, Secondary Memory; access time and (relative) memory size increase with distance from the processor]
- Inclusive – what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM
- Typical transfer sizes grow down the hierarchy: 4-8 bytes (word), 8-32 bytes (block), 1 to 4 blocks, 1,024+ bytes (disk sector = page)
Review: Principle of Locality
- Temporal locality: keep most recently accessed data items closer to the processor
- Spatial locality: move blocks consisting of contiguous words to the upper levels
- Hit Time << Miss Penalty
- Hit: data appears in some block in the upper level (Blk X)
  - Hit Rate: the fraction of accesses found in the upper level
  - Hit Time: RAM access time + time to determine hit/miss
- Miss: data must be retrieved from a block in the lower level (Blk Y)
  - Miss Rate = 1 - Hit Rate
  - Miss Penalty: time to replace a block in the upper level with a block from the lower level + time to deliver this block's word to the processor
- Miss types: compulsory, conflict, capacity
[Figure: upper level memory (Blk X) and lower level memory (Blk Y), with data moving to and from the processor]
Measuring Cache Performance
- Assuming cache hit costs are included as part of the normal CPU execution cycle, then
  CPU time = IC × CPI_stall × CC
           = IC × (CPI_ideal + Memory-stall cycles) × CC
- Memory-stall cycles come from cache misses (a sum of read-stalls and write-stalls)
  Read-stall cycles  = reads/program × read miss rate × read miss penalty
  Write-stall cycles = (writes/program × write miss rate × write miss penalty) + write buffer stalls
- For write-through caches, we can simplify this to
  Memory-stall cycles = accesses/program × miss rate × miss penalty
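The stall formulas above can be sketched numerically. All parameter values below (instruction count, miss rates, penalties, clock) are invented for illustration, not taken from the slides:

```python
# Assumed workload parameters (illustrative only)
IC = 1_000_000            # instruction count
CC = 0.5e-9               # clock cycle time in seconds (assumed 2 GHz)
CPI_ideal = 1.0           # CPI with a perfect cache

reads, writes = 600_000, 200_000
read_miss_rate, write_miss_rate = 0.04, 0.02
read_miss_penalty = write_miss_penalty = 100   # cycles
write_buffer_stalls = 0   # assume an adequate write buffer

# Read-stall and write-stall cycles, per the slide formulas
read_stalls = reads * read_miss_rate * read_miss_penalty
write_stalls = (writes * write_miss_rate * write_miss_penalty
                + write_buffer_stalls)
memory_stall_cycles = read_stalls + write_stalls

# CPU time = IC × (CPI_ideal + memory-stall cycles per instruction) × CC
CPI_stall = CPI_ideal + memory_stall_cycles / IC
cpu_time = IC * CPI_stall * CC
```

With these numbers the memory stalls dominate: CPI_stall is 3.8 versus an ideal CPI of 1.0, which is why reducing miss rate and miss penalty matters so much.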
Set Associative Cache Example
[Figure: 2-way set associative cache (2 sets × 2 ways, each entry with V, Tag, Data) next to a 16-block main memory with 6-bit byte addresses 0000xx through 1111xx]
- Q1: Is it there? Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache
- Q2: How do we find it? Use the next 1 low order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)
- Two low order bits define the byte in the word (32-bit words)
- One-word blocks
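The address breakdown in this example can be sketched as below. The helper name `split_address` is invented; the field widths follow the slide (2 byte-offset bits, 1 set-index bit, 3 tag bits for a 6-bit address):

```python
def split_address(addr, offset_bits=2, index_bits=1):
    """Split a byte address into (tag, set_index, byte_offset)."""
    byte_offset = addr & ((1 << offset_bits) - 1)          # low order bits
    set_index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)               # high order bits
    return tag, set_index, byte_offset

# Byte address 0b010100 (block address 0b0101): tag 0b010, set 1, offset 0
tag, set_index, offset = split_address(0b010100)
```

On a lookup, the hardware indexes the chosen set and compares the stored tags of both ways against `tag` in parallel.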
Benefits of Set Associative Caches
- The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation
[Figure: miss rate (%) versus associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB; data from Hennessy & Patterson, Computer Architecture, 2003]
- Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)
Q1&Q2: Where can a block be placed/found?

  Scheme             # of sets                              Blocks per set
  Direct mapped      # of blocks in cache                   1
  Set associative    (# of blocks in cache)/associativity   Associativity (typically 2 to 16)
  Fully associative  1                                      # of blocks in cache

  Scheme             Location method                        # of comparisons
  Direct mapped      Index                                  1
  Set associative    Index the set; compare set's tags      Degree of associativity
  Fully associative  Compare all blocks' tags               # of blocks
Q3: Which block should be replaced on a miss?
- Easy for direct mapped – only one choice
- Set associative or fully associative:
  - Random
  - LRU (Least Recently Used)
- For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU
- LRU is too costly to implement for high levels of associativity (> 4-way), since tracking the usage information for every way is expensive
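LRU replacement within one set can be sketched as follows. This is an illustrative model, not hardware: each set keeps its tags ordered from most- to least-recently used, and a miss evicts the last entry:

```python
def access(set_tags, tag, ways=2):
    """Access `tag` in one cache set (MRU-first list); True on hit."""
    if tag in set_tags:
        set_tags.remove(tag)        # hit: move tag to the MRU position
        set_tags.insert(0, tag)
        return True
    if len(set_tags) == ways:       # miss in a full set: evict the LRU way
        set_tags.pop()
    set_tags.insert(0, tag)         # install the new block as MRU
    return False

s = []
hits = [access(s, t) for t in [1, 2, 1, 3, 2]]
# 1 and 2 are compulsory misses; 1 hits; 3 evicts LRU block 2; 2 misses again
```

Tracking this recency order needs only one bit per set at 2-way, which is why LRU is cheap there and costly at higher associativity.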
Improving Cache Performance
0. Reduce the time to hit in the cache
   - smaller cache
   - direct mapped cache
   - smaller blocks
   - for writes:
     - no write allocate – no "hit" on cache, just write to write buffer
     - write allocate – to avoid two cycles (first check for hit, then write), pipeline writes via a delayed write buffer to cache
1. Reduce the miss rate
   - bigger cache
   - more flexible placement (increase associativity)
   - larger blocks (16 to 64 bytes typical)
   - victim cache – small buffer holding most recently discarded blocks
Improving Cache Performance
2. Reduce the miss penalty
   - smaller blocks
   - use a write buffer to hold dirty blocks being replaced, so reads don't have to wait for the write to complete
   - check the write buffer (and/or victim cache) on a read miss – may get lucky
   - for large blocks, fetch the critical word first
   - use multiple cache levels – the L2 cache is not tied to the CPU clock rate
   - faster backing store / improved memory bandwidth
     - wider buses
     - memory interleaving, page mode DRAMs
Summary: The Cache Design Space
- Several interacting dimensions
  - cache size
  - block size
  - associativity
  - replacement policy
  - write-through vs write-back
  - write allocation
- The optimal choice is a compromise
  - depends on access characteristics
    - workload
    - use (I-cache, D-cache, TLB)
  - depends on technology / cost
- Simplicity often wins
[Figure: design-space sketch – for each factor (associativity, cache size, block size), quality runs from Good to Bad as the factor varies from Less to More]
Next Lecture and Reminders
- Next lecture
  - Reading assignment – PH 7.4
- Reminders
  - HW4 due November 8th
  - Check grade posting on-line (by your midterm exam number) for correctness
  - Final exam (tentative) schedule
    - Tuesday, December 13th, 2:30-4:20, Location TBD
What is RAID?
- Redundant Array of Independent (Inexpensive) Disks
- A set of physical disks treated as one logical disk
- Data are distributed over the disks
- Redundant capacity is used for parity, allowing for data repair
- Reliability is characterized by the Mean Time Between Failures (MTBF)
Levels of RAID
- 6 levels of RAID (0-5) have been accepted by industry
- Other kinds have been proposed in the literature
- Levels 2 and 4 are not commercially available; they are included for clarity
RAID 0
- All data (user and system) are distributed over the disks so that there is a reasonable chance for parallelism
- Each disk is logically a set of strips (blocks, sectors, …); strips are numbered and assigned consecutively to the disks (see picture)

RAID 0
a. Splits data among two or more disks
b. Provides good performance
c. Lack of data redundancy means there is no failover support with this configuration
d. In the diagram to the right, the odd blocks are written to disk 0 and the even blocks to disk 1, such that A1, A2, A3, A4, … would be the order of blocks read sequentially from the beginning
e. Used in read-only NFS systems and gaming systems

RAID 0 (no redundancy) – data mapping for Level 0
[Figure: four disks holding strips 0-15 round robin – disk 0: strips 0, 4, 8, 12; disk 1: strips 1, 5, 9, 13; disk 2: strips 2, 6, 10, 14; disk 3: strips 3, 7, 11, 15]
RAID 1
- RAID 1 is "data mirroring"
- Two copies of the data are held on two physical disks, and the data is always identical
- Twice as many disks are required to store the same data when compared to RAID 0
- The array continues to operate so long as at least one drive is functioning

RAID 1 (mirrored)
[Figure: eight disks – four data disks holding strips 0-15 round robin, as in RAID 0, plus four mirror disks holding identical copies of the same strips]
RAID 2 (redundancy through Hamming code)
[Figure: data bits b0-b2 striped across disks, with check disks holding the Hamming code bits f0(b), f1(b), f2(b)]
- Small strips, one byte or one word
- Synchronized disks; each I/O operation is performed in parallel across all of them
- An error correcting code (Hamming code) allows correction of a single-bit error
- The controller can correct errors without additional delay
- Still expensive; only used where many frequent errors can be expected
Hamming code example
- Stored sequence (positions 7 … 1): 1 0 1 0 1 0 1
  - Data bits 1011 in positions 7, 6, 5, 3; parity bits in positions 4, 2, 1
  - All three parity checks pass (syndrome 000)
- Received sequence with bit 6 flipped: 1 1 1 0 1 0 1
  - Check 4 (positions 4, 5, 6, 7) fails: 1
  - Check 2 (positions 2, 3, 6, 7) fails: 1
  - Check 1 (positions 1, 3, 5, 7) passes: 0
  - Syndrome 110 = 6, so the single error is in position 6 and can be repaired
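The single-error correction above can be sketched directly. This is an illustrative Hamming(7,4) decoder, with bits indexed 1-7 and parity bits at positions 1, 2, 4 as in the slide:

```python
def syndrome(bits):
    """bits[1..7]; returns the position of a single-bit error (0 = none)."""
    s = 0
    for check in (1, 2, 4):
        # Each parity check covers the positions whose index has that bit set,
        # e.g. check 4 covers positions 4, 5, 6, 7.
        parity = 0
        for pos in range(1, 8):
            if pos & check:
                parity ^= bits[pos]
        if parity:
            s += check
    return s

# Stored word, positions 7..1 = 1 0 1 0 1 0 1 (index 0 unused)
stored = [None, 1, 0, 1, 0, 1, 0, 1]

received = stored[:]
received[6] ^= 1              # flip bit 6, as in the example
err = syndrome(received)      # syndrome 110 = 6 points at the flipped bit
received[err] ^= 1            # repair the word
```

The syndrome works because each position's index is exactly the sum of the checks that cover it, so a single flipped bit makes precisely those checks fail.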
RAID 3 (bit-interleaved parity)
[Figure: data bits b0-b2 striped across disks, plus one parity disk P(b)]
- Level 2 needs log2(number of disks) parity disks; Level 3 needs only one, for a single parity bit
- If one disk crashes, the data can still be reconstructed, even on line ("reduced mode"), and written back (X1-X4 data, P parity; ⊕ denotes XOR):
  P  = X1 ⊕ X2 ⊕ X3 ⊕ X4
  X1 = P ⊕ X2 ⊕ X3 ⊕ X4
- RAID 2-3 achieve high data transfer rates, but perform only one I/O at a time, so response times in transaction-oriented environments are not so good
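The reconstruction identity above can be sketched byte-wise over whole strips. The strip contents below are invented single-byte values for illustration:

```python
from functools import reduce

def xor_strips(*strips):
    """XOR corresponding bytes of several equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, bs) for bs in zip(*strips))

# Four data strips (invented values) and their parity strip
x1, x2, x3, x4 = b"\x0f", b"\x33", b"\x55", b"\xf0"
p = xor_strips(x1, x2, x3, x4)        # P = X1 ⊕ X2 ⊕ X3 ⊕ X4

# Disk 1 fails: rebuild its strip from parity and the survivors
rebuilt = xor_strips(p, x2, x3, x4)   # X1 = P ⊕ X2 ⊕ X3 ⊕ X4
```

Reconstruction works because XOR is its own inverse: XORing P with the surviving strips cancels every term except the lost one.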
RAID 4 (block-level parity)
[Figure: five disks – disk 0: blocks 0, 4, 8, 12; disk 1: blocks 1, 5, 9, 13; disk 2: blocks 2, 6, 10, 14; disk 3: blocks 3, 7, 11, 15; dedicated parity disk: P(0-3), P(4-7), P(8-11), P(12-15)]
RAID 4
- Larger strips and one parity disk
- Blocks are kept on one disk, allowing for parallel access by multiple I/O requests
- Writing penalty: when a block is written, the parity disk must be adjusted (e.g., writing X1', with ⊕ as XOR):
  P  = X4 ⊕ X3 ⊕ X2 ⊕ X1
  P' = X4 ⊕ X3 ⊕ X2 ⊕ X1'
     = X4 ⊕ X3 ⊕ X2 ⊕ X1' ⊕ X1 ⊕ X1
     = P ⊕ X1 ⊕ X1'
- The parity disk may be a bottleneck
- Good response times, less good transfer rates
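The small-write parity update P' = P ⊕ X1 ⊕ X1' derived above can be sketched as below; the block values are invented for illustration:

```python
def update_parity(old_parity, old_block, new_block):
    """New parity after overwriting old_block with new_block (byte-wise)."""
    return bytes(p ^ o ^ n
                 for p, o, n in zip(old_parity, old_block, new_block))

# Four data blocks (invented values) and their full parity
x1, x2, x3, x4 = b"\x01", b"\x02", b"\x04", b"\x08"
p = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(x1, x2, x3, x4))

# Overwrite X1 with X1': only X1, X1' and P are read/written,
# not the other data disks
new_x1 = b"\x10"
p_new = update_parity(p, x1, new_x1)
```

This read-modify-write needs 2 reads and 2 writes per small write, and since every write touches the one parity disk, that disk becomes the bottleneck the slide mentions.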
RAID 5 (block-level distributed parity)
[Figure: five disks with parity rotated round robin – disk 0: blocks 0, 4, 8, 12, P(16-19); disk 1: blocks 1, 5, 9, P(12-15), 16; disk 2: blocks 2, 6, P(8-11), 13, 17; disk 3: blocks 3, P(4-7), 10, 14, 18; disk 4: P(0-3), 7, 11, 15, 19]
RAID 5
- Distributes the parity strips over all disks to avoid the parity-disk bottleneck
- Can use round robin placement; for stripe s (blocks 4s … 4s+3) on 5 disks:
  parity disk = (−(s + 1)) mod 5
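The round robin placement can be checked against the figure with a short sketch (the function name is invented; 5 disks and 4 data blocks per stripe as in the diagram):

```python
def parity_disk(block, disks=5, per_stripe=4):
    """Disk holding the parity for the stripe containing `block`."""
    stripe = block // per_stripe
    return (-(stripe + 1)) % disks   # parity rotates backwards, one disk per stripe

# Stripes 0..4 place parity on disks 4, 3, 2, 1, 0 - matching
# P(0-3), P(4-7), P(8-11), P(12-15), P(16-19) in the figure
placement = [parity_disk(4 * s) for s in range(5)]
```

Rotating the parity one disk per stripe spreads parity writes evenly, so no single disk serializes the small-write traffic.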
Overview: RAID 0-2
[Figure: side-by-side data mapping diagrams for RAID 0, RAID 1, and RAID 2]
Overview: RAID 3-5
[Figure: side-by-side data mapping diagrams for RAID 3, RAID 4, and RAID 5]