Page 1: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Chia-Lin Yang

Department of Computer Science and Information Engineering

National Taiwan University

Alvin R. Lebeck

Department of Computer Science

Duke University

Page 2: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Memory Wall

• Processor-memory gap grows over time

• Prefetching
– What? Future Address Prediction
– When? Prefetch Schedule

[Chart: Processor-Memory Gap, 1980–2000 — CPU performance improves ~60% per year, DRAM performance ~10% per year]

Page 3: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Prefetch Linked Data Structures (LDS)

• Linked data structures
– No regularity in the address stream
  • Adjacent elements are not necessarily contiguous in memory
– Pointer-chasing problem

p = head;
while (p) {
    work(p->data);
    p = p->next;
}

while (p) {
    prefetch(p->next->next->next);
    work(p->data);
    p = p->next;
}

[Diagram: a linked list; p marks the node currently being visited, and the node three links ahead is the one we would like to prefetch]
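To make the pull-style loop above concrete, here is a minimal compilable sketch, assuming GCC/Clang's __builtin_prefetch and a placeholder work() routine (both assumptions, not from the slides). Note that forming the prefetch address already chases three pointers serially, which is exactly the pointer-chasing problem.

typedef struct node {
    int data;
    struct node *next;
} node_t;

static void work(int data) { (void)data; }   /* placeholder for the per-node computation */

void traverse(node_t *p)
{
    while (p) {
        /* Computing the prefetch target chases three pointers one after
           another -- the very latency the prefetch is trying to hide. */
        if (p->next && p->next->next && p->next->next->next)
            __builtin_prefetch(p->next->next->next, 0, 1);   /* read, low temporal locality */
        work(p->data);
        p = p->next;
    }
}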

Page 4: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

The Push Architecture

• An LDS prefetching framework built on a novel data-movement model: push (Yang’2000)

[Diagram: in the traditional pull model, requests travel from L1 through L2 to main memory and data returns on demand; in the new push model, data is pushed up the hierarchy from the lower levels toward L1]

Page 5: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Outline

• Background & Motivation
• What is the Push Architecture?
• Design of the Push Architecture
• Variations of the Push Architecture
• Experimental Results
• Related Research
• Conclusion

Page 6: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Block Diagram of the Push Architecture

[Block diagram: a prefetch engine is attached at each level of the memory hierarchy — next to the L1/prefetch buffer, the L2, and main memory — and each engine issues prefetch requests over the L2 bus and memory bus]

Page 7: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Predict Future Addresses?

• LDS traversal kernels

• Load instructions in LDS traversal kernels are a compact representation of LDS accesses [Roth’98]

• PFEs execute LDS traversal kernels independently of the CPU

• The amount of computation between node accesses affects how far the PFE can run ahead of the CPU

while (list != NULL) {
    p = list->x;
    process(p->data);
    list = list->next;    /* recurrent load */
}
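As a hedged sketch of what a traversal kernel handed to a PFE might look like once the computation is stripped away, leaving only the loads: push_block() below is an invented stand-in for "push this cache block toward L1" and is not an interface from the paper.

struct record;                        /* opaque payload type */

struct list_node {
    struct record    *x;              /* non-recurrent load target */
    struct list_node *next;           /* recurrent load target */
};

/* Invented stand-in for the PFE pushing a block up the hierarchy. */
static void push_block(const void *addr) { (void)addr; }

void list_kernel(struct list_node *list)
{
    while (list != 0) {
        push_block(list);             /* the node itself */
        push_block(list->x);          /* the record it points to */
        list = list->next;            /* recurrent load drives the traversal */
    }
}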

Page 8: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

The Pointer-Chasing Problem: how does the push model help?

• Push model: pipelined process

[Animation: across main memory, L2, and L1, successive recurrent loads (r1, r2) overlap with the pushing of already-located blocks (a1, a2, x1, x2) up the hierarchy, so node accesses are pipelined rather than serialized]

Page 9: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Push Architecture Design Issues

[Diagram: CPU, L1, L2, and main memory, each with a cache/memory controller and an attached PFE]

1. PFE Architecture Design
2. Interaction Scheme
3. Synchronization between the CPU and PFE execution
4. Redundant Prefetch
5. Demands on the cache/memory controller

Page 10: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #1: PFE Architecture

• Programmable PFE
– General-purpose processor core
– 5-stage pipeline, in-order processor
– Integer ALU units for address calculation & control flow
– TLB for address translation
– Root register to store the root address of the LDS being traversed
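A rough sketch of the architectural state this slide implies for a programmable PFE; the field names and sizes below are assumptions for illustration, not figures from the paper.

#include <stdint.h>

/* Assumed TLB entry used by the PFE for address translation. */
struct pfe_tlb_entry {
    uint32_t vpn;       /* virtual page number  */
    uint32_t ppn;       /* physical page number */
    int      valid;
};

/* Assumed per-PFE architectural state implied by the slide. */
struct pfe_state {
    uint32_t pc;                     /* program counter of the traversal kernel    */
    uint32_t regs[32];               /* integer register file of the in-order core */
    uint32_t root;                   /* root register: root address of the LDS     */
    struct pfe_tlb_entry tlb[32];    /* small TLB (entry count is an assumption)   */
};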

Page 11: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #2: Interaction among PFEs

Tree(root);
    :
    :

Tree(node) {
    if (node) {
        Tree(node->left);
        Tree(node->right);
    }
}

[Diagram: the L1, L2, and memory PFEs each have a root register. When the L1 PFE's recurrent load for node x misses, the root address x is stored into the next level's root register ("store root address x", "issue x"), the L1 PFE stops, and the lower-level PFE resumes the traversal at x (and later at y).]
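The handoff in the diagram can be summarized in a hedged pseudocode sketch; the structure and function names below are invented for illustration and describe only the control flow, not the actual hardware.

#include <stdint.h>

struct pfe {
    struct pfe *lower;      /* PFE at the next level down (L1 -> L2 -> memory) */
    uintptr_t   root;       /* root register */
    int         running;
};

/* Stub standing in for executing the traversal kernel starting at addr. */
static void run_kernel_from(struct pfe *p, uintptr_t addr) { (void)p; (void)addr; }

/* Called when this level's PFE issues a recurrent load that misses locally. */
void on_recurrent_miss(struct pfe *p, uintptr_t miss_addr)
{
    p->running = 0;                            /* "stop L1 PFE"                          */
    if (p->lower) {
        p->lower->root    = miss_addr;         /* "store root address x" one level down  */
        p->lower->running = 1;
        run_kernel_from(p->lower, miss_addr);  /* lower-level PFE resumes the traversal  */
    }
}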

Page 12: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #3: Synchronization between CPU and PFEs

• When do we need to synchronize the CPU and PFE execution?
– Early prefetches
  • the PFEs are running too far ahead of the CPU
– Useless prefetches
  • the PFEs are traversing down the wrong path
  • the PFEs are running behind the CPU

• Throttle mechanism

[Diagram: the prefetch buffer acts as a producer/consumer queue — the PFE produces prefetched cache blocks, the CPU consumes them, and a free bit on each entry records whether its block has been consumed]
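A hedged sketch of the free-bit throttle on the prefetch buffer. The policy shown (suspend the PFE when no consumed entry is available to overwrite) is an illustrative reading of the producer/consumer picture above; entry layout and names are assumptions.

#include <stdint.h>

#define PBUF_ENTRIES 32        /* 32-entry prefetch buffer, as in the setup slide */

struct pbuf_entry {
    int       free;            /* 1: consumed by the CPU (or empty), safe to reuse */
    uintptr_t tag;
    uint8_t   block[32];       /* one 32B L1 cache block */
};

static struct pbuf_entry pbuf[PBUF_ENTRIES];

/* CPU side: consuming a prefetched block marks its entry free again. */
void cpu_consume(int idx) { pbuf[idx].free = 1; }

/* PFE side: no free entry means the PFE has run too far ahead -> suspend. */
int pfe_find_free_entry(void)
{
    for (int i = 0; i < PBUF_ENTRIES; i++)
        if (pbuf[i].free)
            return i;          /* install the next pushed block here */
    return -1;                 /* caller suspends PFE execution until the CPU consumes */
}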

Page 13: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Variations of the Push Architecture

• 2_PFE should perform comparably to 3_PFE

[Diagram: the three configurations — 3_PFE with a PFE at each of L1, L2, and main memory (all push); 2_PFE with two PFEs combining push and pull; 1_PFE with a single PFE at main memory (push)]

• 1_PFE performs well if most of the LDS exists only in main memory

Page 14: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Outline

• Background & Motivation
• What is the Push Architecture?
• Design of the Push Architecture
• Variations of the Push Architecture
• Experimental Results
• Related Research
• Conclusion

Page 15: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Experimental Setup

• SimpleScalar: out-of-order processor
• Benchmarks: Olden benchmark suite & rayshade
• Baseline processor:
– 4-way issue, 64 RUU, 16 LSQ
– lockup-free caches with 8 outstanding misses
– 32KB, 32B-line, 2-way L1 & 512KB, 64B-line, 4-way L2
– 84-cycle round-trip memory latency & 48-cycle DRAM access time
• Prefetch model:
– Push model: 3 levels of PFEs, 32-entry fully-associative prefetch buffer
– Pull model: L1-level PFE, 32-entry fully-associative prefetch buffer

Page 16: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Performance Comparison: Push vs. Pull

• health, mst, perimeter and treeadd
– Push: 4% to 25% speedup; Pull: 0% to 4% speedup
• em3d, rayshade
– Push: 31% to 57% speedup; Pull: 25% to 39% speedup
• bh
– Push: 33% speedup; Pull: 33% speedup
• Dynamically changing structures: bisort and tsp

[Chart: normalized execution time, split into memory latency and computation time, for health, em3d, mst, rayshade, perimeter, bh, bisort, treeadd, tsp, and voronoi]

Page 17: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Variations of the Push Architecture

[Chart: normalized execution time for Base, 3_PFE, 2_PFE, and 1_PFE]

• 2_PFE performs comparably to 3_PFE
• 1_PFE performs comparably to 3_PFE except for em3d

Page 18: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Related Work

• Prefetching for irregular applications:
– Correlation-based prefetch (Joseph’97 and Alexander’96)
– Compiler-based prefetch (Luk’96)
– Dependence-based prefetch (Roth’98)
– Jump-pointer prefetch (Roth’99)

• Decoupled architecture:
– Decoupled Access/Execute (Smith’82)
– Pre-execution (Annavaram’2001, Collins’2001, Roth’2001, Zilles’2001, Luk’2001)

• Processor-in-memory:
– Berkeley IRAM group (Patterson’97)
– Active Pages (Oskin’98)
– FlexRAM (Kang’99)
– Impulse (Carter’99)
– Memory-side prefetching (Hughes’2000)

Page 19: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Conclusion

• Built a general architectural solution for the push model

• The push model is effective in reducing the impact of the pointer-chasing problem on prefetching performance
– applications with tight traversal loops
  • Push: 4% to 25% speedup; Pull: 0% to 4%
– applications with longer computation between node accesses
  • Push: 31% to 57% speedup; Pull: 25% to 39%

• 2_PFE performs comparably to 3_PFE.

Page 20: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Traversal Kernel

void *HashLookup(int key, hash hash) {
    j = (hash->mapfunc)(key);
    for (ent = hash->array[j]; ent && ent->key != key; ent = ent->next);
    if (ent) return ent->entry;
    return NULL;
}

void kernel(HashEntry ent, int key) {
    for (ent; ent && ent->key != key; ent = ent->next);
}

CPU → PFE, via a memory-mapped interface:
1. traversal kernel identifier
2. hash->array[j]
3. key
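A hedged sketch of how the CPU might pass these three values to the PFE through the memory-mapped interface; the register addresses and macro names are invented for illustration only.

#include <stdint.h>

/* Assumed memory-mapped PFE registers; the addresses are illustrative only. */
#define PFE_KERNEL_ID_REG ((volatile uint32_t *)0xF0000000u)
#define PFE_ROOT_REG      ((volatile uint32_t *)0xF0000004u)
#define PFE_ARG_REG       ((volatile uint32_t *)0xF0000008u)

/* Hand one hash-bucket lookup to the PFE:
 *   1. traversal kernel identifier
 *   2. hash->array[j]  -- start of the chain, written to the root register
 *   3. key             -- so the kernel knows when to stop                  */
static inline void start_hash_prefetch(uint32_t kernel_id, void *bucket, int key)
{
    *PFE_KERNEL_ID_REG = kernel_id;
    *PFE_ROOT_REG      = (uint32_t)(uintptr_t)bucket;
    *PFE_ARG_REG       = (uint32_t)key;
}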

Page 21: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Block Diagram of Specialized PFE

[Block diagram components: Recurrent Load Table, Non-Recurrent Load Table, Root Register, Kernel Id Register, Instruction Buffer, Traversal-Info Table, Ready Queue (pc, base, offset), address adders, Result Buffer (pc), TLB, and an interface to the Cache/Memory Controller]

Page 22: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Block Diagram of Programmable PFE

[Block diagram components: an in-order processor core with instruction cache, register file, stack, and root register; a Kernel Id Register, Kernel Index Table, Instruction Buffer, and Result Buffer; a TLB; and an interface to the Cache/Memory Controller. The legend marks which structures are memory-mapped and distinguishes local from global accesses.]

Page 23: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #4: Redundant Prefetches

• Redundant prefetches:
• Tree traversals:

[Diagram: a 7-node binary tree whose nodes are spread across L1, L2, and main memory]

Page 24: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #4: Redundant Prefetches

• Performance impact
– Waste bus bandwidth
– Memory accesses are satisfied more slowly in the lower levels of the memory hierarchy

• Add a small data cache to the L2/Memory PFEs

[Diagram: the PFE processor sends each request to its small data cache; a hit returns the result directly, and only a miss request is forwarded to the cache/memory controller]
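A hedged sketch of the small PFE data cache acting as a filter for redundant prefetches; the direct-mapped organization, sizes, and function names are assumptions for illustration, not the paper's parameters.

#include <stdint.h>

#define PFE_DC_SETS  64        /* assumed: small direct-mapped PFE data cache  */
#define BLOCK_SHIFT   6        /* assumed 64B blocks, matching the L2 line size */

struct pfe_dc_line { uintptr_t tag; int valid; };
static struct pfe_dc_line pfe_dcache[PFE_DC_SETS];

/* Stub for handing a miss to the cache/memory controller. */
static void forward_to_controller(uintptr_t block_addr) { (void)block_addr; }

/* Probe the PFE data cache first; only misses go on to the controller, so a
 * block that was already fetched and pushed is not pushed again. */
void pfe_request(uintptr_t addr)
{
    uintptr_t block = addr >> BLOCK_SHIFT;
    unsigned  set   = (unsigned)(block % PFE_DC_SETS);

    if (pfe_dcache[set].valid && pfe_dcache[set].tag == block)
        return;                             /* hit: redundant prefetch filtered */

    pfe_dcache[set].tag   = block;
    pfe_dcache[set].valid = 1;
    forward_to_controller(block << BLOCK_SHIFT);
}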

Page 25: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #5: Modifications to the Cache/Memory Controller

[Diagram: the controllers at L1, L2, and main memory each gain a request buffer for the PFE alongside their MSHRs; demand requests merge in the MSHRs, and demand/prefetch requests merge in the request buffer]

Page 26: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Avoid Early Prefetches?

[Diagram: snapshots at times t1, t2, and t3 of a depth-first traversal over a 15-node tree, showing the prefetch buffer contents (e.g., nodes 2, 3, 4) as the PFE runs ahead of the CPU]

Page 27: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Avoid Early Prefetches?

[Diagram: at t1 the prefetch buffer entries holding nodes 2, 3, and 4 all have their free bits cleared (unconsumed), so the PFE suspends execution; by t3 the CPU has consumed an entry and set its free bit, so the PFE continues execution]

Page 28: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Avoid Useless Prefetches?

[Diagram: at t1 the memory PFE has pushed nodes 1–5 and suspends execution; L1 hits for those nodes never reach it, while an L1/L2 miss that reaches memory (node 6) triggers its execution]

Page 29: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Avoid Useless Prefetches?

[Diagram: at t2 the L1/L2 miss for node 6 triggers the memory PFE, which continues the traversal from node 7; the free bits show which buffered nodes the CPU has consumed]
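A hedged sketch of the trigger logic these two slides suggest, with invented names: the memory-level PFE suspends after running ahead, and a demand miss that reaches memory (evidence the CPU is on this path) restarts the kernel from the missed node. Restarting at the miss address is an assumption drawn from the interaction scheme on the earlier slide.

#include <stdint.h>

struct mem_pfe {
    int       suspended;    /* set when the engine throttles itself */
    uintptr_t root;         /* root register: where the kernel (re)starts */
};

/* Stub standing in for actually executing the traversal kernel. */
static void run_kernel(struct mem_pfe *p) { (void)p; }

/* Called when a demand request misses in L1 and L2 and arrives at memory.
 * L1 hits never reach this point, so a PFE suspended on a wrong or stale
 * path simply stays idle. */
void on_demand_miss(struct mem_pfe *p, uintptr_t miss_addr)
{
    if (p->suspended) {
        p->root      = miss_addr;   /* redirect the traversal to the missed node */
        p->suspended = 0;
        run_kernel(p);
    }
}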

Page 30: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Performance Prediction of the Push Architecture for Future Processors

[Charts: normalized execution time, split into memory latency and computation time, for health, em3d, mst, rayshade, perimeter, bh, treeadd, and tsp at processor clock rates of 0.8, 1.2, 1.6, and 2.0 GHz]

Page 31: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Prefetch Coverage

[Chart: percentage of cache misses covered, split into total hidden misses and partial hidden misses, for health, em3d, mst, rayshade, perimeter, bh, treeadd, and tsp]

Page 32: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Prefetch Distribution

[Chart: distribution (0–100%) of prefetches issued by the L1, L2, and memory PFEs]

Page 33: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Cumulative Distance between Recurrent Loads

[Chart: cumulative distribution of the distance between recurrent loads, bucketed as <8, <16, <32, <64, <128, and >128]

Page 34: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Bandwidth Requirement

[Chart: normalized execution time with limited vs. non-limited bandwidth]

Page 35: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Effect of the PFE Data Cache & Throttle Mechanism

[Chart: normalized execution time for base, push_base, push_buffer, push_throttle, and push_buffer_throttle across health, em3d, mst, rayshade, perimeter, bh, treeadd, and tsp]

• The throttle mechanism has an impact on bh.
• The PFE data cache has an impact on em3d, perimeter and treeadd.

Page 36: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Effect of the PFE Data Cache

[Chart: Redundant Prefetch Distribution — percentage of prefetches that are redundant, broken down by L2 and memory PFEs, for em3d, perimeter, bh and treeadd]

• em3d, perimeter, bh and treeadd: 30% to 50% of prefetches are redundant

[Chart: percentage of redundant prefetches captured in the PFE data cache]

• 70% to 100% of redundant prefetches are captured in the PFE data cache

Page 37: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

PFE Architecture: Effect of Wider-Issue PFEs

• Increasing issue width further improves performance, particularly for em3d and treeadd

[Chart: normalized execution time for base, single-issue, 2-issue, and 4-issue PFEs]

Page 38: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

TLB Miss Effect

• Hardware TLB miss handler, 30-cycle TLB miss penalty

[Chart: normalized execution time per benchmark with a perfect PFE TLB (perf) and TLB sizes of 32, 64, 128, and 256 entries]

Page 39: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

PFE Architecture: Specialized vs. Programmable PFE

• A programmable PFE can achieve performance comparable to a specialized PFE

[Chart: normalized execution time of the specialized vs. programmable PFE for health, mst, and rayshade]

Page 40: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Breadth-First Tree Traversal

[Diagram: a 15-node binary tree and the traversal work list, with Head and Tail pointers; the list holds the frontier nodes (8, 9, 10, 13, 14, 15) still to be visited]

Traversal Kernel:

list = head;
while (list) {
    node = list->ptr;
    left = node->left;
    right = node->right;
    list = list->next;
}
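For reference, here is the breadth-first kernel above written out as a hedged, self-contained C sketch with its types spelled out; touch() is an invented stand-in for whatever the PFE does with each loaded node (e.g., pushing its cache block toward L1).

struct tree_node {
    struct tree_node *left, *right;
};

struct work_item {
    struct tree_node *ptr;      /* tree node at this breadth-first position */
    struct work_item *next;     /* next item in the Head..Tail work list */
};

static void touch(const void *p) { (void)p; }

void bfs_kernel(struct work_item *head)
{
    struct work_item *list = head;
    while (list) {
        struct tree_node *node = list->ptr;
        touch(node->left);      /* non-recurrent loads: the two children */
        touch(node->right);
        list = list->next;      /* recurrent load through the work list */
    }
}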

Page 41: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Push Architecture Design Issues

[Diagram: CPU, L1, L2, and main memory, each with a cache/memory controller and an attached PFE]

1. PFE Architecture Design
2. Interaction Scheme
3. Synchronization between the CPU and PFE execution
4. Redundant Prefetch
5. Demands on the cache/memory controller

Page 42: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Restore PFE State

00400950      addiu $sp[29],$sp[29],-56
00400958      sw    $ra[31],48($sp[29])
00400960      sw    $s8[30],44($sp[29])
00400968      sw    $s0[16],40($sp[29])
00400970      addu  $s8[30],$zero[0],$sp[29]
00400978      addu  $s0[16],$zero[0],$a0[4]
00400980      beq   $s0[16],$zero[0],004009a8
(x) 00400988  lw    $a0[4],4($s0[16])        <- miss
00400990      jal   00400950 <K_TreeAdd>
(y) 00400998  lw    $a0[4],8($s0[16])
004009a0      jal   00400950 <K_TreeAdd>
004009a8      addu  $sp[29],$zero[0],$s8[30]
004009b0      lw    $ra[31],48($sp[29])
004009b8      lw    $s8[30],44($sp[29])
004009c0      lw    $s0[16],40($sp[29])
:

[Diagram: a 7-node tree; x is issued at PC 400988, the miss on x is detected while executing 400990 and 400950–400978, and y is issued at 400998. The PFE's register file and PC are saved to the stack on a miss and later restored from the stack.]

Page 43: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Restore PFE State

• Correct resume PC
– Statically construct the resume PC table

Recurrent Load PC    Resume PC
400988               400998
400998               4009a8
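A hedged sketch of the statically constructed resume-PC table for the K_TreeAdd kernel, using the two PC pairs shown above; the lookup helper and its semantics comment are illustrative assumptions, not the paper's exact mechanism.

#include <stdint.h>
#include <stddef.h>

struct resume_entry {
    uint32_t recurrent_load_pc;
    uint32_t resume_pc;
};

/* Built statically from the kernel's code, as on the slide. */
static const struct resume_entry resume_table[] = {
    { 0x400988, 0x400998 },
    { 0x400998, 0x4009a8 },
};

/* When the recurrent load at `pc` misses, kernel execution continues at the
 * mapped resume PC instead of descending into the missed subtree. */
uint32_t resume_pc_for(uint32_t pc)
{
    for (size_t i = 0; i < sizeof resume_table / sizeof resume_table[0]; i++)
        if (resume_table[i].recurrent_load_pc == pc)
            return resume_table[i].resume_pc;
    return 0;   /* not a recurrent load */
}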