Page 1: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Chia-Lin Yang

Department of Computer Science and Information Engineering

National Taiwan University

Alvin R. Lebeck

Department of Computer Science

Duke University

Page 2: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Memory Wall

• Processor-memory gap grows over time

• Prefetching
– What? Future Address Prediction
– When? Prefetch Schedule

[Chart: Processor-Memory Gap, 1980–2000 — CPU performance improves ~60% per year, DRAM performance ~10% per year]

Page 3: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Prefetch Linked Data Structures (LDS)

• Linked data structures
– No regularity in the address stream
  • Adjacent elements are not necessarily contiguous in memory
– Pointer-chasing problem

p = head;
while (p) {
    work(p->data);
    p = p->next;
}

while (p) {
    prefetch(p->next->next->next);
    work(p->data);
    p = p->next;
}

[Diagram: a linked list; p marks the node currently being visited, and the node three links ahead is the one we would like to prefetch]
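To make the pull-style loop above concrete, here is a minimal compilable sketch, assuming GCC/Clang's __builtin_prefetch and a placeholder work() routine (both assumptions, not from the slides). Note that forming the prefetch address already chases three pointers serially, which is exactly the pointer-chasing problem.

typedef struct node {
    int data;
    struct node *next;
} node_t;

static void work(int data) { (void)data; }   /* placeholder for the per-node computation */

void traverse(node_t *p)
{
    while (p) {
        /* Computing the prefetch target chases three pointers one after
           another -- the very latency the prefetch is trying to hide. */
        if (p->next && p->next->next && p->next->next->next)
            __builtin_prefetch(p->next->next->next, 0, 1);   /* read, low temporal locality */
        work(p->data);
        p = p->next;
    }
}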

Page 4: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

The Push Architecture

• An LDS prefetching framework built on a novel data-movement model: push (Yang’2000)

[Diagram: in the traditional pull model, requests travel from L1 through L2 to main memory and data returns on demand; in the new push model, data is pushed up the hierarchy from the lower levels toward L1]

Page 5: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Outline

• Background & Motivation
• What is the Push Architecture?
• Design of the Push Architecture
• Variations of the Push Architecture
• Experimental Results
• Related Research
• Conclusion

Page 6: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Block Diagram of the Push Architecture

[Block diagram: a prefetch engine is attached at each level of the memory hierarchy — next to the L1/prefetch buffer, the L2, and main memory — and each engine issues prefetch requests over the L2 bus and memory bus]

Page 7: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Predict Future Addresses?

• LDS traversal kernels

• Load instructions in LDS traversal kernels are a compact representation of LDS accesses [Roth’98]

• PFEs execute LDS traversal kernels independently of the CPU

• The amount of computation between node accesses affects how far the PFE can run ahead of the CPU

while (list != NULL) {
    p = list->x;
    process(p->data);
    list = list->next;    /* recurrent load */
}
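As a hedged sketch of what a traversal kernel handed to a PFE might look like once the computation is stripped away, leaving only the loads: push_block() below is an invented stand-in for "push this cache block toward L1" and is not an interface from the paper.

struct record;                        /* opaque payload type */

struct list_node {
    struct record    *x;              /* non-recurrent load target */
    struct list_node *next;           /* recurrent load target */
};

/* Invented stand-in for the PFE pushing a block up the hierarchy. */
static void push_block(const void *addr) { (void)addr; }

void list_kernel(struct list_node *list)
{
    while (list != 0) {
        push_block(list);             /* the node itself */
        push_block(list->x);          /* the record it points to */
        list = list->next;            /* recurrent load drives the traversal */
    }
}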

Page 8: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

The Pointer-Chasing Problem: how does the push model help?

• Push model: pipelined process

[Animation: across main memory, L2, and L1, successive recurrent loads (r1, r2) overlap with the pushing of already-located blocks (a1, a2, x1, x2) up the hierarchy, so node accesses are pipelined rather than serialized]

Page 9: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Push Architecture Design Issues

[Diagram: CPU, L1, L2, and main memory, each with a cache/memory controller and an attached PFE]

1. PFE Architecture Design
2. Interaction Scheme
3. Synchronization between the CPU and PFE execution
4. Redundant Prefetch
5. Demands on the cache/memory controller

Page 10: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #1: PFE Architecture

• Programmable PFE
– General-purpose processor core
– 5-stage pipeline, in-order processor
– Integer ALU units for address calculation & control flow
– TLB for address translation
– Root register to store the root address of the LDS being traversed
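A rough sketch of the architectural state this slide implies for a programmable PFE; the field names and sizes below are assumptions for illustration, not figures from the paper.

#include <stdint.h>

/* Assumed TLB entry used by the PFE for address translation. */
struct pfe_tlb_entry {
    uint32_t vpn;       /* virtual page number  */
    uint32_t ppn;       /* physical page number */
    int      valid;
};

/* Assumed per-PFE architectural state implied by the slide. */
struct pfe_state {
    uint32_t pc;                     /* program counter of the traversal kernel    */
    uint32_t regs[32];               /* integer register file of the in-order core */
    uint32_t root;                   /* root register: root address of the LDS     */
    struct pfe_tlb_entry tlb[32];    /* small TLB (entry count is an assumption)   */
};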

Page 11: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #2: Interaction among PFEs

Tree(root);
    :
    :

Tree(node) {
    if (node) {
        Tree(node->left);
        Tree(node->right);
    }
}

[Diagram: the L1, L2, and memory PFEs each have a root register. When the L1 PFE's recurrent load for node x misses, the root address x is stored into the next level's root register ("store root address x", "issue x"), the L1 PFE stops, and the lower-level PFE resumes the traversal at x (and later at y).]
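The handoff in the diagram can be summarized in a hedged pseudocode sketch; the structure and function names below are invented for illustration and describe only the control flow, not the actual hardware.

#include <stdint.h>

struct pfe {
    struct pfe *lower;      /* PFE at the next level down (L1 -> L2 -> memory) */
    uintptr_t   root;       /* root register */
    int         running;
};

/* Stub standing in for executing the traversal kernel starting at addr. */
static void run_kernel_from(struct pfe *p, uintptr_t addr) { (void)p; (void)addr; }

/* Called when this level's PFE issues a recurrent load that misses locally. */
void on_recurrent_miss(struct pfe *p, uintptr_t miss_addr)
{
    p->running = 0;                            /* "stop L1 PFE"                          */
    if (p->lower) {
        p->lower->root    = miss_addr;         /* "store root address x" one level down  */
        p->lower->running = 1;
        run_kernel_from(p->lower, miss_addr);  /* lower-level PFE resumes the traversal  */
    }
}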

Page 12: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #3: Synchronization between CPU and PFEs

• When do we need to synchronize the CPU and PFE execution?
– Early prefetches
  • the PFEs are running too far ahead of the CPU
– Useless prefetches
  • the PFEs are traversing down the wrong path
  • the PFEs are running behind the CPU

• Throttle mechanism

[Diagram: the prefetch buffer acts as a producer/consumer queue — the PFE produces prefetched cache blocks, the CPU consumes them, and a free bit on each entry records whether its block has been consumed]
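A hedged sketch of the free-bit throttle on the prefetch buffer. The policy shown (suspend the PFE when no consumed entry is available to overwrite) is an illustrative reading of the producer/consumer picture above; entry layout and names are assumptions.

#include <stdint.h>

#define PBUF_ENTRIES 32        /* 32-entry prefetch buffer, as in the setup slide */

struct pbuf_entry {
    int       free;            /* 1: consumed by the CPU (or empty), safe to reuse */
    uintptr_t tag;
    uint8_t   block[32];       /* one 32B L1 cache block */
};

static struct pbuf_entry pbuf[PBUF_ENTRIES];

/* CPU side: consuming a prefetched block marks its entry free again. */
void cpu_consume(int idx) { pbuf[idx].free = 1; }

/* PFE side: no free entry means the PFE has run too far ahead -> suspend. */
int pfe_find_free_entry(void)
{
    for (int i = 0; i < PBUF_ENTRIES; i++)
        if (pbuf[i].free)
            return i;          /* install the next pushed block here */
    return -1;                 /* caller suspends PFE execution until the CPU consumes */
}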

Page 13: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Variations of the Push Architecture

• 2_PFE should perform comparably to 3_PFE

[Diagram: the three configurations — 3_PFE with a PFE at each of L1, L2, and main memory (all push); 2_PFE with two PFEs combining push and pull; 1_PFE with a single PFE at main memory (push)]

• 1_PFE performs well if most of the LDS exists only in main memory

Page 14: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Outline

• Background & Motivation
• What is the Push Architecture?
• Design of the Push Architecture
• Variations of the Push Architecture
• Experimental Results
• Related Research
• Conclusion

Page 15: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Experimental Setup

• SimpleScalar: out-of-order processor
• Benchmarks: Olden benchmark suite & rayshade
• Baseline processor:
– 4-way issue, 64 RUU, 16 LSQ
– lockup-free caches with 8 outstanding misses
– 32KB, 32B-line, 2-way L1 & 512KB, 64B-line, 4-way L2
– 84-cycle round-trip memory latency & 48-cycle DRAM access time
• Prefetch model:
– Push model: 3 levels of PFEs, 32-entry fully-associative prefetch buffer
– Pull model: L1-level PFE, 32-entry fully-associative prefetch buffer

Page 16: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Performance Comparison: Push vs. Pull

• health, mst, perimeter and treeadd
– Push: 4% to 25% speedup; Pull: 0% to 4% speedup
• em3d, rayshade
– Push: 31% to 57% speedup; Pull: 25% to 39% speedup
• bh
– Push: 33% speedup; Pull: 33% speedup
• Dynamically changing structures: bisort and tsp

[Chart: normalized execution time, split into memory latency and computation time, for health, em3d, mst, rayshade, perimeter, bh, bisort, treeadd, tsp, and voronoi]

Page 17: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Variations of the Push Architecture

[Chart: normalized execution time for Base, 3_PFE, 2_PFE, and 1_PFE]

• 2_PFE performs comparably to 3_PFE
• 1_PFE performs comparably to 3_PFE except for em3d

Page 18: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Related Work

• Prefetching for irregular applications:
– Correlation-based prefetch (Joseph’97 and Alexander’96)
– Compiler-based prefetch (Luk’96)
– Dependence-based prefetch (Roth’98)
– Jump-pointer prefetch (Roth’99)

• Decoupled architecture:
– Decoupled Access/Execute (Smith’82)
– Pre-execution (Annavaram’2001, Collins’2001, Roth’2001, Zilles’2001, Luk’2001)

• Processor-in-memory:
– Berkeley IRAM group (Patterson’97)
– Active Pages (Oskin’98)
– FlexRAM (Kang’99)
– Impulse (Carter’99)
– Memory-side prefetching (Hughes’2000)

Page 19: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Conclusion

• Built a general architectural solution for the push model

• The push model is effective in reducing the impact of the pointer-chasing problem on prefetching performance
– applications with tight traversal loops
  • Push: 4% to 25% speedup; Pull: 0% to 4%
– applications with longer computation between node accesses
  • Push: 31% to 57% speedup; Pull: 25% to 39%

• 2_PFE performs comparably to 3_PFE.

Page 20: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Traversal Kernel

void *HashLookup(int key, hash hash) {
    j = (hash->mapfunc)(key);
    for (ent = hash->array[j]; ent && ent->key != key; ent = ent->next);
    if (ent) return ent->entry;
    return NULL;
}

void kernel(HashEntry ent, int key) {
    for (ent; ent && ent->key != key; ent = ent->next);
}

CPU → PFE, via a memory-mapped interface:
1. traversal kernel identifier
2. hash->array[j]
3. key
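A hedged sketch of how the CPU might pass these three values to the PFE through the memory-mapped interface; the register addresses and macro names are invented for illustration only.

#include <stdint.h>

/* Assumed memory-mapped PFE registers; the addresses are illustrative only. */
#define PFE_KERNEL_ID_REG ((volatile uint32_t *)0xF0000000u)
#define PFE_ROOT_REG      ((volatile uint32_t *)0xF0000004u)
#define PFE_ARG_REG       ((volatile uint32_t *)0xF0000008u)

/* Hand one hash-bucket lookup to the PFE:
 *   1. traversal kernel identifier
 *   2. hash->array[j]  -- start of the chain, written to the root register
 *   3. key             -- so the kernel knows when to stop                  */
static inline void start_hash_prefetch(uint32_t kernel_id, void *bucket, int key)
{
    *PFE_KERNEL_ID_REG = kernel_id;
    *PFE_ROOT_REG      = (uint32_t)(uintptr_t)bucket;
    *PFE_ARG_REG       = (uint32_t)key;
}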

Page 21: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Block Diagram of Specialized PFE

[Block diagram components: Recurrent Load Table, Non-Recurrent Load Table, Root Register, Kernel Id Register, Instruction Buffer, Traversal-Info Table, Ready Queue (pc, base, offset), address adders, Result Buffer (pc), TLB, and an interface to the Cache/Memory Controller]

Page 22: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Block Diagram of Programmable PFE

[Block diagram components: an in-order processor core with instruction cache, register file, stack, and root register; a Kernel Id Register, Kernel Index Table, Instruction Buffer, and Result Buffer; a TLB; and an interface to the Cache/Memory Controller. The legend marks which structures are memory-mapped and distinguishes local from global accesses.]

Page 23: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #4: Redundant Prefetches

• Redundant prefetches:
• Tree traversals:

[Diagram: a 7-node binary tree whose nodes are spread across L1, L2, and main memory]

Page 24: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #4: Redundant Prefetches

• Performance impact
– Waste bus bandwidth
– Memory accesses are satisfied more slowly in the lower levels of the memory hierarchy

• Add a small data cache to the L2/Memory PFEs

[Diagram: the PFE processor sends each request to its small data cache; a hit returns the result directly, and only a miss request is forwarded to the cache/memory controller]
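A hedged sketch of the small PFE data cache acting as a filter for redundant prefetches; the direct-mapped organization, sizes, and function names are assumptions for illustration, not the paper's parameters.

#include <stdint.h>

#define PFE_DC_SETS  64        /* assumed: small direct-mapped PFE data cache  */
#define BLOCK_SHIFT   6        /* assumed 64B blocks, matching the L2 line size */

struct pfe_dc_line { uintptr_t tag; int valid; };
static struct pfe_dc_line pfe_dcache[PFE_DC_SETS];

/* Stub for handing a miss to the cache/memory controller. */
static void forward_to_controller(uintptr_t block_addr) { (void)block_addr; }

/* Probe the PFE data cache first; only misses go on to the controller, so a
 * block that was already fetched and pushed is not pushed again. */
void pfe_request(uintptr_t addr)
{
    uintptr_t block = addr >> BLOCK_SHIFT;
    unsigned  set   = (unsigned)(block % PFE_DC_SETS);

    if (pfe_dcache[set].valid && pfe_dcache[set].tag == block)
        return;                             /* hit: redundant prefetch filtered */

    pfe_dcache[set].tag   = block;
    pfe_dcache[set].valid = 1;
    forward_to_controller(block << BLOCK_SHIFT);
}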

Page 25: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Issue #5: Modifications to the Cache/Memory Controller

[Diagram: the controllers at L1, L2, and main memory each gain a request buffer for the PFE alongside their MSHRs; demand requests merge in the MSHRs, and demand/prefetch requests merge in the request buffer]

Page 26: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Avoid Early Prefetches?

[Diagram: snapshots at times t1, t2, and t3 of a depth-first traversal over a 15-node tree, showing the prefetch buffer contents (e.g., nodes 2, 3, 4) as the PFE runs ahead of the CPU]

Page 27: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Avoid Early Prefetches?

[Diagram: at t1 the prefetch buffer entries holding nodes 2, 3, and 4 all have their free bits cleared (unconsumed), so the PFE suspends execution; by t3 the CPU has consumed an entry and set its free bit, so the PFE continues execution]

Page 28: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Avoid Useless Prefetches?

[Diagram: at t1 the memory PFE has pushed nodes 1–5 and suspends execution; L1 hits for those nodes never reach it, while an L1/L2 miss that reaches memory (node 6) triggers its execution]

Page 29: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

How to Avoid Useless Prefetches?

[Diagram: at t2 the L1/L2 miss for node 6 triggers the memory PFE, which continues the traversal from node 7; the free bits show which buffered nodes the CPU has consumed]
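A hedged sketch of the trigger logic these two slides suggest, with invented names: the memory-level PFE suspends after running ahead, and a demand miss that reaches memory (evidence the CPU is on this path) restarts the kernel from the missed node. Restarting at the miss address is an assumption drawn from the interaction scheme on the earlier slide.

#include <stdint.h>

struct mem_pfe {
    int       suspended;    /* set when the engine throttles itself */
    uintptr_t root;         /* root register: where the kernel (re)starts */
};

/* Stub standing in for actually executing the traversal kernel. */
static void run_kernel(struct mem_pfe *p) { (void)p; }

/* Called when a demand request misses in L1 and L2 and arrives at memory.
 * L1 hits never reach this point, so a PFE suspended on a wrong or stale
 * path simply stays idle. */
void on_demand_miss(struct mem_pfe *p, uintptr_t miss_addr)
{
    if (p->suspended) {
        p->root      = miss_addr;   /* redirect the traversal to the missed node */
        p->suspended = 0;
        run_kernel(p);
    }
}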

Page 30: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Performance Prediction of the Push Architecture for Future Processors

[Charts: normalized execution time, split into memory latency and computation time, for health, em3d, mst, rayshade, perimeter, bh, treeadd, and tsp at processor clock rates of 0.8, 1.2, 1.6, and 2.0 GHz]

Page 31: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Prefetch Coverage

[Chart: percentage of cache misses covered, split into total hidden misses and partial hidden misses, for health, em3d, mst, rayshade, perimeter, bh, treeadd, and tsp]

Page 32: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Prefetch Distribution

[Chart: distribution (0–100%) of prefetches issued by the L1, L2, and memory PFEs]

Page 33: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Cumulative Distance between Recurrent Loads

[Chart: cumulative distribution of the distance between recurrent loads, bucketed as <8, <16, <32, <64, <128, and >128]

Page 34: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Bandwidth Requirement

[Chart: normalized execution time with limited vs. non-limited bandwidth]

Page 35: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Effect of the PFE Data Cache & Throttle Mechanism

[Chart: normalized execution time for base, push_base, push_buffer, push_throttle, and push_buffer_throttle across health, em3d, mst, rayshade, perimeter, bh, treeadd, and tsp]

• The throttle mechanism has an impact on bh.
• The PFE data cache has an impact on em3d, perimeter and treeadd.

Page 36: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Effect of the PFE Data Cache

[Chart: Redundant Prefetch Distribution — percentage of prefetches that are redundant, broken down by L2 and memory PFEs, for em3d, perimeter, bh and treeadd]

• em3d, perimeter, bh and treeadd: 30% to 50% of prefetches are redundant

[Chart: percentage of redundant prefetches captured in the PFE data cache]

• 70% to 100% of redundant prefetches are captured in the PFE data cache

Page 37: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

PFE Architecture: Effect of Wider-Issue PFEs

• Increasing issue width further improves performance, particularly for em3d and treeadd

[Chart: normalized execution time for base, single-issue, 2-issue, and 4-issue PFEs]

Page 38: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

TLB Miss Effect

• Hardware TLB miss handler, 30-cycle TLB miss penalty

[Chart: normalized execution time per benchmark with a perfect PFE TLB (perf) and TLB sizes of 32, 64, 128, and 256 entries]

Page 39: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

PFE Architecture: Specialized vs. Programmable PFE

• A programmable PFE can achieve performance comparable to a specialized PFE

[Chart: normalized execution time of the specialized vs. programmable PFE for health, mst, and rayshade]

Page 40: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Breadth-First Tree Traversal

[Diagram: a 15-node binary tree and the traversal work list, with Head and Tail pointers; the list holds the frontier nodes (8, 9, 10, 13, 14, 15) still to be visited]

Traversal Kernel:

list = head;
while (list) {
    node = list->ptr;
    left = node->left;
    right = node->right;
    list = list->next;
}
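For reference, here is the breadth-first kernel above written out as a hedged, self-contained C sketch with its types spelled out; touch() is an invented stand-in for whatever the PFE does with each loaded node (e.g., pushing its cache block toward L1).

struct tree_node {
    struct tree_node *left, *right;
};

struct work_item {
    struct tree_node *ptr;      /* tree node at this breadth-first position */
    struct work_item *next;     /* next item in the Head..Tail work list */
};

static void touch(const void *p) { (void)p; }

void bfs_kernel(struct work_item *head)
{
    struct work_item *list = head;
    while (list) {
        struct tree_node *node = list->ptr;
        touch(node->left);      /* non-recurrent loads: the two children */
        touch(node->right);
        list = list->next;      /* recurrent load through the work list */
    }
}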

Page 41: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Push Architecture Design Issues

[Diagram: CPU, L1, L2, and main memory, each with a cache/memory controller and an attached PFE]

1. PFE Architecture Design
2. Interaction Scheme
3. Synchronization between the CPU and PFE execution
4. Redundant Prefetch
5. Demands on the cache/memory controller

Page 42: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Restore PFE State

00400950      addiu $sp[29],$sp[29],-56
00400958      sw    $ra[31],48($sp[29])
00400960      sw    $s8[30],44($sp[29])
00400968      sw    $s0[16],40($sp[29])
00400970      addu  $s8[30],$zero[0],$sp[29]
00400978      addu  $s0[16],$zero[0],$a0[4]
00400980      beq   $s0[16],$zero[0],004009a8
(x) 00400988  lw    $a0[4],4($s0[16])        <- miss
00400990      jal   00400950 <K_TreeAdd>
(y) 00400998  lw    $a0[4],8($s0[16])
004009a0      jal   00400950 <K_TreeAdd>
004009a8      addu  $sp[29],$zero[0],$s8[30]
004009b0      lw    $ra[31],48($sp[29])
004009b8      lw    $s8[30],44($sp[29])
004009c0      lw    $s0[16],40($sp[29])
:

[Diagram: a 7-node tree; x is issued at PC 400988, the miss on x is detected while executing 400990 and 400950–400978, and y is issued at 400998. The PFE's register file and PC are saved to the stack on a miss and later restored from the stack.]

Page 43: A Programmable Memory Hierarchy for Prefetching Linked Data Structures

Restore PFE State

• Correct resume PC
– Statically construct the resume PC table

Recurrent Load PC    Resume PC
400988               400998
400998               4009a8
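A hedged sketch of the statically constructed resume-PC table for the K_TreeAdd kernel, using the two PC pairs shown above; the lookup helper and its semantics comment are illustrative assumptions, not the paper's exact mechanism.

#include <stdint.h>
#include <stddef.h>

struct resume_entry {
    uint32_t recurrent_load_pc;
    uint32_t resume_pc;
};

/* Built statically from the kernel's code, as on the slide. */
static const struct resume_entry resume_table[] = {
    { 0x400988, 0x400998 },
    { 0x400998, 0x4009a8 },
};

/* When the recurrent load at `pc` misses, kernel execution continues at the
 * mapped resume PC instead of descending into the missed subtree. */
uint32_t resume_pc_for(uint32_t pc)
{
    for (size_t i = 0; i < sizeof resume_table / sizeof resume_table[0]; i++)
        if (resume_table[i].recurrent_load_pc == pc)
            return resume_table[i].resume_pc;
    return 0;   /* not a recurrent load */
}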