Chapter 13: Direct Memory Access

• “…DMA…provides direct access to the memory while the microprocessor is temporarily disabled.”

• Typical uses of DMA
  – Video displays for refreshing the screen
  – Hard disk reads and writes
  – High-speed memory-to-memory transfers

• Timing behavior
  – Shown in Fig. 13-1
  – HOLD and HLDA signals

• Microprocessor suspends execution of its program and places its address, data, and control buses into high-impedance (Z) states

Basic DMA Definitions

• DMA normally occurs between an I/O device and memory without the use of the CPU

• DMA read
  – Transfers data from the memory to the I/O device

• DMA write
  – Transfers data from an I/O device to memory

• DMAC controls both memory and I/O device simultaneously

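[Figure: block diagram of the CPU, DMAC, I/O device, and memory on shared ADDR and DATA buses. Hold (Bus Request) and HoldA (Bus Grant) connect the DMAC and the CPU; DMA Request and DMA Grant connect the I/O device and the DMAC.]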

1. CPU sends information for data transfer to DMAC chip (initialization)

2. DMA request from I/O

3. Bus request from DMAC

4. Bus grant from CPU

5. DMA grant from DMAC

6. Data transfer

7. DMAC sends INTR to the CPU to signal completion of the DMA transfer.

Bus control signals: MWTC, MRDC, IOWC, IORC

The same steps, with the signals involved at each one:

1. CPU sends information for data transfer to DMAC chip
2. DMA request from I/O (via DREQs, e.g., 4 channels in 8237)
3. Bus request from DMAC (HRQ, hold request in 8237)
4. Bus grant from CPU (setting all the bus outputs of processor to Z)
5. DMA grant from DMAC (via DACKs)
6. Data transfer
   – If DMA read: MRDC & IOWC signals are controlled
   – If DMA write: MWTC & IORC signals are controlled
7. DMAC sends INTR to CPU to inform completion of DMA
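The seven steps can be traced with a small simulation. This is a minimal sketch, not a model of any real DMAC: the function `dma_transfer` and the list-based memory and device are illustrative assumptions; only the signal names come from the steps above.

```python
# Illustrative trace of the seven-step DMA handshake; all names are hypothetical.

# Control strobes driven during step 6, keyed by transfer direction.
STROBES = {
    "read": ("MRDC", "IOWC"),   # DMA read: memory -> I/O device
    "write": ("MWTC", "IORC"),  # DMA write: I/O device -> memory
}


def dma_transfer(direction, memory, device, address, count):
    """Move `count` bytes between `memory` and `device`; return an event log."""
    if direction not in STROBES:
        raise ValueError("direction must be 'read' or 'write'")
    log = ["1. CPU programs DMAC (address=%d, count=%d)" % (address, count)]
    log.append("2. I/O asserts DREQ")
    log.append("3. DMAC asserts HRQ (HOLD to CPU)")
    log.append("4. CPU floats its buses and asserts HLDA")
    log.append("5. DMAC asserts DACK")
    for offset in range(count):
        if direction == "read":
            device.append(memory[address + offset])
        else:
            memory[address + offset] = device.pop(0)
    log.append("6. Data transfer using %s and %s" % STROBES[direction])
    log.append("7. DMAC raises INTR to report completion")
    return log


memory = [10, 20, 30, 40, 0, 0]
device = []
# DMA read: copy three bytes starting at address 1 out to the device.
read_log = dma_transfer("read", memory, device, address=1, count=3)
# DMA write: store the first two device bytes at addresses 3 and 4.
write_log = dma_transfer("write", memory, device, address=3, count=2)
for line in read_log:
    print(line)
```

The CPU appears only in steps 1, 4, and 7; the data itself never passes through it.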

DMA Operation Initiation

• CPU sends information about the required data transfer operation to the DMAC chip
  – Source device/address, destination device/address, data block size, type of data transfer (demand, single, block), etc.
  – Uses OUT assembly instructions to send the DMAC chip this information

• DMAC chip requests the bus from the CPU by asserting the HOLD line (via its HRQ)

• CPU acknowledges the request by asserting HLDA

• Request priority in the microprocessor
  – Reset > Hold > Interrupt
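The initialization step can be sketched as a sequence of port writes. Everything here is an assumption for illustration: `out` is a stand-in that only records what an OUT instruction would send, `program_channel` is a hypothetical helper, and the port offsets and mode bits follow the commonly documented 8237 register layout, which should be checked against the datasheet before use.

```python
# Sketch of the OUT sequence a CPU might issue to set up one DMA channel.
# Port offsets and mode bits are assumed from the usual 8237 register map.

MODE_BITS = {"demand": 0b00, "single": 0b01, "block": 0b10}
TYPE_BITS = {"write": 0b01, "read": 0b10}  # write: I/O -> memory, read: memory -> I/O


def program_channel(out, base, channel, address, count, mode, direction):
    """Issue the port writes that describe one transfer of `count` bytes."""
    if not 0 <= channel <= 3:
        raise ValueError("channel must be 0 to 3")
    if not 1 <= count <= 0x10000:
        raise ValueError("count must fit the 16-bit count register")
    terminal = count - 1                       # register holds count minus one
    out(base + 0x0A, 0x04 | channel)           # mask the channel while programming
    out(base + 0x0C, 0x00)                     # reset the low/high byte flip-flop
    out(base + 0x0B, (MODE_BITS[mode] << 6) | (TYPE_BITS[direction] << 2) | channel)
    out(base + 2 * channel, address & 0xFF)           # start address, low byte
    out(base + 2 * channel, (address >> 8) & 0xFF)    # start address, high byte
    out(base + 2 * channel + 1, terminal & 0xFF)          # count, low byte
    out(base + 2 * channel + 1, (terminal >> 8) & 0xFF)   # count, high byte
    out(base + 0x0A, channel)                  # unmask: channel may respond to DREQ


writes = []
program_channel(lambda port, value: writes.append((port, value)),
                base=0x00, channel=2, address=0x1234, count=512,
                mode="single", direction="write")
for port, value in writes:
    print("OUT 0x%02X, 0x%02X" % (port, value))
```

Masking the channel first keeps a stray DREQ from starting a transfer with half-written registers.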

Three Types of DMA Mode

• Demand mode
  – Transfers data until DREQ becomes inactive

• Single mode
  – Releases HOLD after each byte of data is transferred
  – If DREQ is still active, DMAC again requests a DMA transfer from the microprocessor

• Block mode
  – Automatically transfers the number of bytes indicated by the count register for the channel
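The difference between the three modes shows up in how often the DMAC has to acquire the bus. The sketch below is a deliberately simplified model under stated assumptions: one byte moves per time step, `dreq` is a made-up list of DREQ levels, and `run_mode` is a hypothetical helper, not controller behavior taken from a datasheet.

```python
def run_mode(mode, count, dreq):
    """Return (bytes_moved, bus_tenures) for one channel.

    `dreq` lists the DREQ level seen at each successive time step.
    A bus tenure is one HOLD/HLDA handshake with the CPU.
    """
    if mode not in ("demand", "single", "block"):
        raise ValueError("unknown mode: %r" % mode)
    moved = 0
    tenures = 0
    holding = False              # True while the DMAC owns the bus
    for level in dreq:
        if moved == count:
            break                # count register exhausted
        if mode == "block":
            # Once started, keep the bus until the whole block is moved.
            if holding or level:
                if not holding:
                    tenures += 1
                    holding = True
                moved += 1
        elif mode == "demand":
            # Keep the bus only while DREQ stays active.
            if level:
                if not holding:
                    tenures += 1
                    holding = True
                moved += 1
            else:
                holding = False
        else:
            # Single: one byte per tenure, then HOLD is released.
            if level:
                tenures += 1
                moved += 1
    return moved, tenures


pattern = [True, True, False, True, True, False, False, True]
for mode in ("block", "demand", "single"):
    print(mode, run_mode(mode, 4, pattern))
```

For the same four bytes, block mode needs one bus tenure, demand mode two, and single mode four, which is why single mode leaves the CPU the most opportunities to run in between.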

Advanced Topics

• Lecture
  – Cache (5/31)
  – DRAM (already touched in Chapter 10)
  – Flash memory-based storage (6/14)

• Practice
  – Introduction to RTL design in Verilog (6/2, LG105)
  – Two practices (6/7 and 6/9, LG114)
    • Note: the two practices are run in the same manner as the normal practices, in 1st and 2nd sessions (3:20pm~4:00pm and 4:00pm~4:40pm)

Processor-DRAM Gap (latency)

[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000). CPU performance (“Moore’s Law”) grows 60%/year and DRAM performance 7%/year, so the processor-memory performance gap grows 50%/year.]

A four-issue 2GHz superscalar accessing 100ns DRAM could execute 800 instructions during the time for one memory access!

[Source: K. Asanovic, 2008]
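The figure of 800 instructions follows directly from the three numbers quoted. A short check, using integer arithmetic to avoid rounding (the variable names are illustrative):

```python
issue_width = 4                  # instructions issued per cycle
clock_hz = 2_000_000_000         # 2 GHz
dram_latency_ns = 100            # one DRAM access

# Cycles that elapse during one DRAM access: 2 cycles per ns times 100 ns.
cycles_per_access = clock_hz * dram_latency_ns // 1_000_000_000
# Issue slots that pass while the processor waits.
lost_issue_slots = issue_width * cycles_per_access
print(cycles_per_access, lost_issue_slots)
```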

What is a cache?

• Small, fast storage used to improve average access time to slow memory.
• Exploits spatial and temporal locality
• In computer architecture, almost everything is a cache!
  – Registers: a cache on variables
  – First-level cache: a cache on second-level cache
  – Second-level cache: a cache on memory
  – Memory: a cache on disk (virtual memory)
  – TLB: a cache on page table
  – Branch prediction: a cache on prediction information?

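[Figure: memory hierarchy from Proc/Regs through L1-Cache, L2-Cache, and Memory down to Disk, Tape, etc. Levels get bigger toward the bottom and faster toward the top.]

[Source: J. Kubiatowicz, 2000]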
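How much a cache improves average access time can be estimated with the standard textbook relation: average access time = hit time + miss rate × miss penalty. The relation is not stated on the slide, and every number below is made up for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as the inputs."""
    return hit_time + miss_rate * miss_penalty


def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory_time):
    """An L1 miss pays the average access time of the L2 level behind it."""
    return amat(l1_hit, l1_miss_rate, amat(l2_hit, l2_miss_rate, memory_time))


memory_cycles = 200                           # assumed main-memory latency
one_level = amat(1, 0.05, memory_cycles)      # assumed 1-cycle L1, 5% misses
two_level = amat_two_level(1, 0.05, 10, 0.2, memory_cycles)
print(memory_cycles, one_level, two_level)
```

With these assumed values the average drops from 200 cycles to roughly 11 with one cache level and to roughly 3.5 with two, which is the sense in which each level acts as a cache on the one below it.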

Typical Memory Reference Patterns

[Figure: address versus time, in three bands. Instruction fetches: n loop iterations, subroutine call, subroutine return. Stack accesses: argument access. Data accesses: vector access, scalar accesses. Regions are annotated as showing temporal locality, spatial locality, or both.]

Memory Reference Patterns

Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)

[Figure: memory address (one dot per access) versus time.]


A Typical Memory Hierarchy c.2008

[Figure: CPU with register file (RF), L1 instruction cache and L1 data cache, unified L2 cache, and four memory banks.]

• Multiported register file (part of CPU)
• Split instruction & data primary caches (on-chip SRAM)
• Large unified secondary cache (on-chip SRAM)
• Multiple interleaved memory banks (off-chip DRAM)


Itanium-2 On-Chip Caches (Intel/HP, 2002)

Level 1, 16KB, 4-way s.a., 64B line, quad-port (2 load+2 store), single cycle latency

Level 2, 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five cycle latency

Level 3, 3MB, 12-way s.a., 128B line, single 32B port, twelve cycle latency


L3 and L2 caches occupy more than 2/3 of total area!

Workstation Memory System (Apple PowerMac G5, 2003)

Dual 2GHz processors, each has:
• 64KB I-cache, direct mapped
• 32KB D-cache, 2-way
• 512KB L2 unified cache, 8-way
• All 128B lines

[Figure: system block diagram centered on the North Bridge Chip, with these links]
• 1GHz, 2x32-bit bus, 16GB/s
• Up to 8GB DDR SDRAM, 400MHz, 128-bit bus, 6.4GB/s
• AGP Graphics Card, 533MHz, 32-bit bus, 2.1GB/s
• PCI-X Expansion, 133MHz, 64-bit bus, 1 GB/s


Cache Policies

• Inclusion

• Placement

• Replacement

Inclusion Policy

• Inclusive multilevel cache:
  – Inner cache holds copies of data in outer cache
  – External access need only check outer cache
  – Most common case

• Exclusive multilevel caches:
  – Inner cache may hold data not in outer cache
  – Swap lines between inner/outer caches on miss
  – Used in AMD Athlon with 64KB primary and 256KB secondary cache

Why choose one type or the other?
  – Cache size matters.
  – In general, if L2 size >> L1 size, then the inclusion policy is the usual choice, since the capacity duplicated in L2 is comparatively small
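The role of cache size can be made concrete by counting how much distinct data each policy can hold. This is a rough sketch: `unique_capacity_kb` and `exclusive_gain` are hypothetical helpers, the 64KB/256KB figures are the Athlon sizes quoted above, and the 8MB L2 is an assumed value for contrast.

```python
def unique_capacity_kb(l1_kb, l2_kb, policy):
    """Distinct data (KB) that the L1+L2 pair can hold under a policy."""
    if policy == "inclusive":
        return l2_kb             # every L1 line is duplicated in L2
    if policy == "exclusive":
        return l1_kb + l2_kb     # no line is held in both levels
    raise ValueError("policy must be 'inclusive' or 'exclusive'")


def exclusive_gain(l1_kb, l2_kb):
    """Extra capacity from exclusion, as a fraction of the inclusive capacity."""
    return l1_kb / l2_kb


athlon_exclusive = unique_capacity_kb(64, 256, "exclusive")
athlon_inclusive = unique_capacity_kb(64, 256, "inclusive")
print(athlon_exclusive, athlon_inclusive, exclusive_gain(64, 256))
print(exclusive_gain(64, 8192))  # assumed 8MB L2: the gain is under 1%
```

With a 256KB L2, exclusion buys 25% more capacity, which is worth the line swapping; with an L2 far larger than L1 the gain becomes negligible.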


Types of Cache Miss

• “Three Cs”

• 1st C: Compulsory Misses
  – Happen when warming up the cache

• 2nd C: Conflict Misses
  – E.g., two addresses are mapped to the same cache line
  – Solution: increase associativity

• 3rd C: Capacity Misses
  – E.g., sequential access of 40KB data via 32KB data cache

[Source: Garcia, 2008]
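One common way to separate the three Cs in a simulator is to run a fully associative LRU cache of the same size next to the cache under study: a first reference is compulsory, a miss that the fully associative cache would also take is capacity, and the rest are conflict misses. The sketch below applies that convention to a direct-mapped cache; `classify_misses` is a hypothetical helper and the 64B line size in the example is an assumption.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_lines):
    """Count hits and the three miss types for a direct-mapped cache."""
    direct = [None] * num_lines      # line index = block number mod num_lines
    fully = OrderedDict()            # reference: fully associative LRU, same size
    seen = set()                     # blocks referenced at least once
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
    for block in block_trace:
        # Update the fully associative reference cache.
        fully_hit = block in fully
        if fully_hit:
            fully.move_to_end(block)
        else:
            if len(fully) == num_lines:
                fully.popitem(last=False)    # evict the least recently used
            fully[block] = True
        # Classify the access in the direct-mapped cache.
        index = block % num_lines
        if direct[index] == block:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif fully_hit:
            counts["conflict"] += 1          # only the mapping caused this miss
        else:
            counts["capacity"] += 1
        direct[index] = block
        seen.add(block)
    return counts


# Two blocks that map to the same line of an 8-line cache.
ping_pong = classify_misses([0, 8, 0, 8], num_lines=8)
# 40KB swept twice through a 32KB cache, assuming 64B lines (640 vs 512 blocks).
sweep = classify_misses(list(range(640)) * 2, num_lines=512)
print(ping_pong)
print(sweep)
```

The first trace shows only compulsory and conflict misses; the second shows compulsory misses on the first sweep and capacity misses on the second, with no conflict misses at all.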

Placement Policy

Where can memory block 12 be placed in an 8-block cache?

[Figure: memory with block numbers 0 to 31; cache with block numbers 0 to 7, grouped into set numbers 0 to 3 in the 2-way case.]

• Fully associative: anywhere
• (2-way) set associative: anywhere in set 0 (12 mod 4)
• Direct mapped: only into block 4 (12 mod 8)

Conflict miss!
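The three placements of block 12 are the same modulo calculation with a different number of ways. A minimal sketch, assuming the blocks of a set sit next to each other in the cache (`candidate_blocks` is a hypothetical helper):

```python
def candidate_blocks(block_number, num_cache_blocks, ways):
    """Return the cache block indices where `block_number` may be placed."""
    if num_cache_blocks % ways != 0:
        raise ValueError("ways must divide the number of cache blocks")
    num_sets = num_cache_blocks // ways
    set_index = block_number % num_sets
    # Assumed layout: the blocks of one set are adjacent in the cache.
    return [set_index * ways + way for way in range(ways)]


direct_mapped = candidate_blocks(12, 8, ways=1)      # 12 mod 8 = block 4
two_way = candidate_blocks(12, 8, ways=2)            # 12 mod 4 = set 0
fully_associative = candidate_blocks(12, 8, ways=8)  # one set: anywhere
print(direct_mapped, two_way, fully_associative)
```

Fewer candidate locations mean cheaper lookup hardware but more chances that two blocks in use compete for the same place.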

Direct-Mapped Cache

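[Figure: direct-mapped cache. The address is split into Tag (t bits), Index (k bits), and Block Offset (b bits). The index selects one of 2^k lines, each holding a valid bit (V), a tag, and a data block. The stored tag is compared (=) with the address tag to produce HIT, and the block offset selects the data word or byte.]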
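The tag, index, and block offset fields are plain bit slices of the address. The sketch below assumes k = 3 index bits (8 lines) and b = 2 offset bits (4-byte blocks) purely for illustration; `split_address` and `is_hit` are hypothetical helpers.

```python
def split_address(address, k, b):
    """Split an address into (tag, index, block_offset) for 2**k lines."""
    offset = address & ((1 << b) - 1)          # lowest b bits
    index = (address >> b) & ((1 << k) - 1)    # next k bits select the line
    tag = address >> (b + k)                   # remaining upper bits
    return tag, index, offset


def is_hit(lines, address, k, b):
    """`lines` is a list of (valid, tag) pairs, one per cache line."""
    tag, index, _ = split_address(address, k, b)
    valid, stored_tag = lines[index]
    return valid and stored_tag == tag


K, B = 3, 2
address = 12 * 4 + 1                  # byte 1 of memory block 12
fields = split_address(address, K, B)
lines = [(False, 0)] * (1 << K)
miss_before_fill = is_hit(lines, address, K, B)
lines[fields[1]] = (True, fields[0])  # fill the selected line with this block
hit_after_fill = is_hit(lines, address, K, B)
print(fields, miss_before_fill, hit_after_fill)
```

Block 12 lands in line 4 with tag 1, matching the "12 mod 8" placement for a direct-mapped cache of 8 blocks.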



2-Way Set-Associative Cache

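[Figure: 2-way set-associative cache. The address is split into Tag (t bits), Index (k bits), and Block Offset (b bits). The index selects one set; each of the two ways holds a valid bit (V), a tag, and a data block. Both stored tags are compared (=) with the address tag, a match in either way produces HIT, and the block offset selects the data word or byte.]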
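A 2-way lookup checks both ways of the selected set. The class below is a small sketch with no replacement policy; `TwoWayCache`, its methods, and the sizes (4 sets, 4-byte blocks) are assumptions chosen so that memory blocks 4 and 12, which would collide in an 8-line direct-mapped cache, can reside together.

```python
class TwoWayCache:
    """Tag store of a 2-way set-associative cache with 2**k sets."""

    def __init__(self, k, b):
        self.k = k
        self.b = b
        # Each set has two ways; each way is a [valid, tag] pair.
        self.sets = [[[False, 0], [False, 0]] for _ in range(1 << k)]

    def _fields(self, address):
        index = (address >> self.b) & ((1 << self.k) - 1)
        tag = address >> (self.b + self.k)
        return tag, index

    def lookup(self, address):
        """Return the hitting way (0 or 1), or None on a miss."""
        tag, index = self._fields(address)
        for way, (valid, stored_tag) in enumerate(self.sets[index]):
            if valid and stored_tag == tag:
                return way
        return None

    def fill(self, address, way):
        """Place the block containing `address` into the given way."""
        tag, index = self._fields(address)
        self.sets[index][way] = [True, tag]


cache = TwoWayCache(k=2, b=2)
cache.fill(12 * 4, way=0)            # memory block 12 -> set 0, way 0
cache.fill(4 * 4, way=1)             # memory block 4  -> set 0, way 1
print(cache.lookup(12 * 4), cache.lookup(4 * 4), cache.lookup(20 * 4))
```

Hardware performs the two tag comparisons in parallel; the loop here only models the outcome.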

4-Way Set Associative Cache Circuit

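[Figure: 4-way set-associative cache circuit; the address supplies the tag and index fields, and a multiplexer selects the data from the matching way.]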

Mux is time consuming!


Fully Associative Cache

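[Figure: fully associative cache. The address is split into Tag (t bits) and Block Offset (b bits), with no index. Every entry holds a valid bit (V), a tag, and a data block; all stored tags are compared (=) with the address tag to produce HIT, and the block offset selects the data word or byte.]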


Fully Associative Cache

• Benefit of Fully Assoc Cache
  – No Conflict Misses (since data can go anywhere)

• Drawbacks of Fully Assoc Cache
  – Need hardware comparator for every single entry
    • If we have 64KB of data in cache with 4B entries, we need 16K comparators and a 16K-input MUX
    • Infeasible for large caches
  – However, used for small (e.g., 128-entry) caches, e.g., TLB
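The comparator count is a one-line calculation. `tag_comparators` below is a hypothetical helper that also covers the set-associative case, where only the ways of one set are compared:

```python
def tag_comparators(cache_bytes, entry_bytes, ways=None):
    """Tag comparators needed; ways=None means fully associative."""
    entries = cache_bytes // entry_bytes
    # Fully associative: every entry is a candidate, so one comparator each.
    return entries if ways is None else ways


fully_associative = tag_comparators(64 * 1024, 4)       # the 64KB example above
four_way = tag_comparators(64 * 1024, 4, ways=4)
print(fully_associative, fully_associative // 1024, four_way)
```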


Replacement Policy

In an associative cache, which block from a set should be evicted when the set becomes full?

• Random
  – Used in highly (fully) associative caches, e.g., TLB

• Least Recently Used (LRU)
  – LRU cache state must be updated on every access
  – True implementation only feasible for small sets (2-way)
  – Pseudo-LRU binary tree often used for 4-8 way

• First In, First Out (FIFO) a.k.a. Round-Robin
  – Used in highly associative caches

• Other options, e.g., recent frequently used, etc.

This is a second-order effect. Why?

Replacement only happens on misses
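The difference between LRU and FIFO can be seen on a single set. In this sketch `count_hits` is a hypothetical helper and the trace is made up; the only difference between the policies is whether a hit refreshes the block's position in the eviction order.

```python
from collections import OrderedDict


def count_hits(trace, ways, policy):
    """Count hits for one set holding `ways` blocks under 'lru' or 'fifo'."""
    if policy not in ("lru", "fifo"):
        raise ValueError("policy must be 'lru' or 'fifo'")
    resident = OrderedDict()     # first entry is the next eviction victim
    hits = 0
    for block in trace:
        if block in resident:
            hits += 1
            if policy == "lru":
                resident.move_to_end(block)   # LRU state changes on every access
        else:
            if len(resident) == ways:
                resident.popitem(last=False)  # evict the victim
            resident[block] = True
    return hits


trace = [1, 2, 1, 3, 1, 4, 1, 5]     # block 1 is reused between other blocks
lru_hits = count_hits(trace, ways=2, policy="lru")
fifo_hits = count_hits(trace, ways=2, policy="fifo")
print(lru_hits, fifo_hits)
```

FIFO needs no update on hits, which makes it cheaper to build, at the cost of occasionally evicting a block that is still in active use.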


3Cs Absolute Miss Rate (SPEC92)

[Figure: miss rate per type (0 to 0.14) versus cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, with the miss rate divided into conflict, capacity, and compulsory components. Compulsory misses are vanishingly small.]


2:1 Cache Rule

[Figure: miss rate per type (0 to 0.14) versus cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity, divided into conflict, capacity, and compulsory components.]

Miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2


√2 Rule

If the workload is large, the cache miss rate is observed to decrease as a power law of the cache size. If the cache size is doubled, the miss rate drops by a factor of √2.

[Source: A. Hartstein, 2006]
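Written as a formula, the rule says the miss rate scales with cache size to the power of -1/2. The sketch below applies it to an assumed starting point of an 8% miss rate at 32KB; `scaled_miss_rate` is a hypothetical helper, and the rule itself is an empirical observation, not an exact law.

```python
import math


def scaled_miss_rate(base_miss_rate, base_size_kb, new_size_kb):
    """Estimate the miss rate at a new size using the power-law rule."""
    return base_miss_rate * (new_size_kb / base_size_kb) ** -0.5


base = 0.08                                   # assumed miss rate at 32KB
doubled = scaled_miss_rate(base, 32, 64)      # base divided by sqrt(2)
quadrupled = scaled_miss_rate(base, 32, 128)  # base divided by 2
print(doubled, quadrupled)
```

Halving the miss rate therefore takes four times the capacity, which is one reason large caches show diminishing returns.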
