THE DYNAMIC GRANULARITY MEMORY SYSTEM

Post on 03-Oct-2021

Doe Hyun Yoon

IIL, HP Labs

Michael Sullivan

Min Kyu Jeong

Mattan Erez

ECE, UT Austin

MEMORY ACCESS GRANULARITY

• The size of the block used to access main memory
– Often equal to the last-level cache line size
• Modern systems use coarse-grained (CG) memory access
– 64B or larger
– Amortizes control & ECC overhead
– Prefetching

CG ACCESS MAY WASTE BW

• Waste BW on unused data

for (i = 0; i < N; i++) {
    a[b[i]] += x;
}

[Figure: GUPS microbenchmark; labels "Buffer a" and "Initialized with random numbers"]
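The waste in the GUPS loop can be put into numbers with a short sketch. It assumes 8-byte elements in `a` and a 64B line (the slides give only "64B or larger"), and it assumes every update misses in the cache and lands on a different line. The function names are illustrative, not from the deck.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of fetched bytes the program actually uses when every
 * access pulls in a whole line. */
double useful_fraction(size_t used_bytes_per_line, size_t line_bytes)
{
    assert(line_bytes > 0 && used_bytes_per_line <= line_bytes);
    return (double)used_bytes_per_line / (double)line_bytes;
}

/* Bytes moved off-chip for n_updates random updates, assuming each
 * update misses in the cache and touches a distinct line. */
size_t bytes_fetched(size_t n_updates, size_t line_bytes)
{
    return n_updates * line_bytes;
}
```

Under these assumptions, `useful_fraction(8, 64)` is 0.125: seven eighths of the fetched data is never referenced.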

CAN WE WASTE BW?

• CG access often improves performance
– Large cache lines reduce miss rate due to prefetching
• Off-chip BW doesn’t scale with # cores
• Power is the limiting factor
• We shouldn’t waste the finite off-chip BW

HOW TO EFFICIENTLY UTILIZE OFF-CHIP BW?

• Prior work: AGMS [ISCA’11]

– Combine CG and FG accesses

– Need SW help for ECC support

• Source code, compiler, OS, virtual memory, …

• DGMS

– HW-only variant of AGMS

– Truly dynamic granularity adaptation

ADAPTIVE GRANULARITY MEMORY SYSTEM [ISCA’11]

AGMS [ISCA’11]

• Combine coarse-grained (CG) and fine-grained (FG) accesses
• CG for high spatial locality regions
• FG for low spatial locality regions
• Higher throughput
• Lower DRAM power

SUB-RANKED DRAM MODULE

• Independently control individual DRAM chips
• Access granularity = 8 bits x burst 8 = 8B

[Figure: sub-ranked DRAM module with nine x8 DRAM chips (SR0 to SR8), labeled DBUS, ABUS, and Reg/Demux]
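The granularity arithmetic on this slide generalizes to any chip width and burst length. The sketch below works it through; the function names are illustrative.

```c
#include <assert.h>

/* Bytes delivered by one DRAM chip in one burst:
 * (chip data width in bits * burst length) / 8. */
unsigned chip_burst_bytes(unsigned chip_width_bits, unsigned burst_length)
{
    assert((chip_width_bits * burst_length) % 8 == 0);
    return chip_width_bits * burst_length / 8;
}

/* Bytes delivered when n_chips are accessed together in lockstep. */
unsigned rank_burst_bytes(unsigned n_chips, unsigned chip_width_bits,
                          unsigned burst_length)
{
    return n_chips * chip_burst_bytes(chip_width_bits, burst_length);
}
```

One x8 chip at burst 8 gives the 8B fine-grained access. Eight data chips in lockstep give 64B, and all nine chips (data plus ECC) give 72B.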

[Figure: data/ECC layout for CG access on the sub-ranked module (x8 chips SR0 to SR8, ABUS, Reg/Demux), burst 8, with 64-bit data + 8-bit ECC (SEC-DED). Data bytes B0 to B63 are interleaved across the eight data chips (B0, B8, ..., B56 in the first; B1, B9, ..., B57 in the second; through B7, B15, ..., B63 in the eighth); the ninth chip holds ECC bytes E 0-7 through E 56-63.]

[Figure: data/ECC layout for FG access on the sub-ranked module (x8 chips SR0 to SR8, ABUS, Reg/Demux), burst 8, with 8-bit data + 5-bit SEC-DED or 8-bit DEC. Each group of eight data bytes is followed by its eight ECC bytes: B0 to B7 with E0 to E7, B8 to B15 with E8 to E15, B16 to B23 with E16 to E23, B24 to B31 with E24 to E31.]

SOFTWARE SUPPORT IN AGMS

• Different data/ECC layouts for CG & FG
• Requires software help
– Extend virtual memory interface
– OS/runtime manages CG & FG pages
– Programmer/compiler annotates preferred granularity
• Need to change every level of system hierarchy!

DYNAMIC GRANULARITY MEMORY SYSTEM

DGMS

• Unified data/ECC layout for CG & FG
– No SW support
• HW-only variant of AGMS
– Comparable or better performance
– Easier to implement
• Challenge:
– How to predict access granularity dynamically?

UNIFIED DATA/ECC LAYOUT

[Figure: unified data/ECC layout, burst 8, with 64-bit data and 8-bit ECC (SEC-DED). Data bytes are grouped in runs of eight consecutive bytes (B0 to B7, B8 to B15, ..., B56 to B63), and ECC bytes E 0-7 through E 56-63 form a ninth group.]
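One way to read the unified layout is as a simple address mapping, sketched below. It assumes the eight byte groups map to sub-ranks 0 to 7 in order and that the ECC group sits in sub-rank 8; the slide text does not preserve the physical positions, so the constants and names here are hypothetical.

```c
#include <assert.h>

enum { WORD_BYTES = 8, DATA_SUBRANKS = 8, ECC_SUBRANK = 8 };

struct location { unsigned subrank; unsigned beat; };

/* Data byte i of a 64B line: word i/8 lives in sub-rank i/8 (assumed),
 * and byte i%8 of that word arrives on beat i%8 of the burst. */
struct location data_byte_location(unsigned byte_index)
{
    assert(byte_index < WORD_BYTES * DATA_SUBRANKS);
    struct location loc = { byte_index / WORD_BYTES, byte_index % WORD_BYTES };
    return loc;
}

/* ECC byte for word w (E 0-7 for word 0, E 8-15 for word 1, ...):
 * assumed to sit on the ECC sub-rank at beat w. */
struct location ecc_byte_location(unsigned word_index)
{
    assert(word_index < DATA_SUBRANKS);
    struct location loc = { ECC_SUBRANK, word_index };
    return loc;
}
```

Because each 8B word stays inside a single sub-rank, the same stored line can be read either as a whole or one word at a time, which is what lets a single layout serve both CG and FG.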

CG ACCESS

• Access the whole 72B

[Figure: the unified data/ECC layout (B0 to B63 plus E 0-7 through E 56-63, burst 8), repeated to illustrate CG access]

FG ACCESS

• Access 8B data and 8B ECC

[Figure: the unified data/ECC layout (B0 to B63 plus E 0-7 through E 56-63, burst 8), repeated to illustrate FG access]
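The CG and FG transfer sizes (72B versus 8B data plus 8B ECC) can be folded into one small calculation. This sketch assumes a request is described by an 8-bit word mask and that a single 8B burst from the ECC chip covers every requested word of the line; both the mask encoding and the names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

enum { BURST_BYTES = 8 };

/* Count the 8B words requested in an 8-bit mask (bit w = word w). */
unsigned words_requested(uint8_t mask)
{
    unsigned n = 0;
    for (; mask != 0; mask &= (uint8_t)(mask - 1))
        n++;                      /* clear the lowest set bit each pass */
    return n;
}

/* Bytes transferred for one line request: one 8B burst per requested
 * word plus one 8B ECC burst (assumed to be shared by all words). */
unsigned transfer_bytes(uint8_t mask)
{
    assert(mask != 0);
    return (words_requested(mask) + 1) * BURST_BYTES;
}
```

A full mask (`0xFF`) gives the 72B CG access and a single-word mask gives the 16B FG access from the slides.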

AVOIDING CONTENTION ON ECC DRAM

[Figure: grid of 8 B blocks across sub-ranks SR 0 to SR 8; each row holds eight 8 B data blocks and one 8 B ECC block]
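The extracted text does not show where the ECC block sits in each row, so the following is only one plausible scheme: a round-robin rotation in which the ECC block moves to the next sub-rank on every row, so that FG accesses do not all queue on one ECC chip. The rotation rule and the function names are assumptions, not taken from the slide.

```c
#include <assert.h>

enum { NUM_SUBRANKS = 9 };

/* Hypothetical rotation: the ECC block of row r sits in sub-rank r mod 9. */
unsigned ecc_subrank_for_row(unsigned long row)
{
    return (unsigned)(row % NUM_SUBRANKS);
}

/* Sub-rank holding data word w (0..7) of row r: words fill the
 * remaining sub-ranks in order, skipping over the ECC slot. */
unsigned data_subrank_for_word(unsigned long row, unsigned word)
{
    assert(word < NUM_SUBRANKS - 1);
    unsigned ecc = ecc_subrank_for_row(row);
    return word < ecc ? word : word + 1;
}
```

With this rule, nine consecutive rows place their ECC blocks on nine different sub-ranks.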

DGMS DESIGN

[Figure: DGMS block diagram. Core 0 through Core N-1, each with $I, $D, and L2; a last-level cache; a memory controller; DRAM. Annotations: "Sector cache"; "Sub-ranked memory w/ unified data/ECC layout".]

GRANULARITY PREDICTION

Spatial Pattern Predictor (SPP) [Chen; HPCA’04]: which words within a cache line will be used?

[Figure: the DGMS block diagram with an SPP in each core (Core 0 through Core N-1), together with the last-level cache, memory controller, and sub-ranked memory]

SPATIAL PATTERN PREDICTOR [CHEN; HPCA’04]

[Figure: SPP datapath. Blocks: L1 Data Cache (Tag, Status, Data); CPT (fields Used, Idx) holding 8-bit bitmaps such as 00101101; PHT (fields Idx, Status, Pattern) holding 8-bit patterns such as 00001000. Flow labels: "Load/Store", "Update CPT", "Evicted or Subsector miss", "PC DA +", "PHT hit" (T/F), "Default", "Request To L2".]
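A minimal sketch of the predictor's two operations may help in reading the figure. It keeps only a PHT indexed from the PC and the data address (the "PC DA +" label), returns a default pattern on a PHT miss, and trains the PHT with the used-word bitmap when a line is evicted. The table size, the index hash, the all-words default, and every identifier are assumptions; the slide gives none of them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PHT_ENTRIES = 256, WORDS_PER_LINE = 8, DEFAULT_PATTERN = 0xFF };

struct pht_entry { uint8_t valid; uint8_t pattern; };
struct spp { struct pht_entry pht[PHT_ENTRIES]; };

/* Hypothetical index: mix the PC with the word offset of the data address. */
static unsigned pht_index(uint64_t pc, uint64_t data_addr)
{
    uint64_t word_offset = (data_addr / 8) % WORDS_PER_LINE;
    return (unsigned)((pc ^ (pc >> 8) ^ word_offset) % PHT_ENTRIES);
}

/* On a miss: return the predicted used-word bitmap, or the default
 * (all words) when the PHT has no entry for this index. */
uint8_t spp_predict(const struct spp *p, uint64_t pc, uint64_t data_addr)
{
    assert(p != NULL);
    const struct pht_entry *e = &p->pht[pht_index(pc, data_addr)];
    uint8_t demanded = (uint8_t)(1u << ((data_addr / 8) % WORDS_PER_LINE));
    uint8_t pattern = e->valid ? e->pattern : DEFAULT_PATTERN;
    return (uint8_t)(pattern | demanded);  /* always include the demanded word */
}

/* On eviction: store the bitmap of words that were actually used,
 * under the same PC/address that allocated the line. */
void spp_train(struct spp *p, uint64_t pc, uint64_t data_addr,
               uint8_t used_bitmap)
{
    assert(p != NULL);
    struct pht_entry *e = &p->pht[pht_index(pc, data_addr)];
    e->valid = 1;
    e->pattern = used_bitmap;
}
```

A zero-initialized `struct spp` starts with an empty PHT, so the first miss at any index falls back to the default pattern.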

SPP ACCURACY

[Figure: SPP accuracy per benchmark (y-axis 0.0 to 1.0); legend: "Predicted & Referenced" and "Not predicted, but Referenced"; benchmarks: SSCA2, canneal, em3d, mst, gups, mcf, omnetpp, lbm, OCEAN, streamcluster, stream]

SPP LIMITATIONS

• Case 1)
– Application accesses 5~7 words per cache line
• Case 2)
– App1 has low spatial locality, MPKI is 1
– App2 has high spatial locality, MPKI is 20
• Minimizing traffic doesn’t always improve performance

PREDICTION CONTROLLER

[Figure: the DGMS block diagram with an SPP and an LPC in each core (Core 0 through Core N-1) and a GPC, together with the last-level cache, memory controller, and sub-ranked memory]

• Ignore SPP
– if AvgRefWord > 3.75, or
– if row-buffer hit rate > 0.8
• Treat all requests as CG
– if CG requests are dominant (more than 80% of MC queue)
• LPC & GPC prevent performance degradation in some CG-friendly apps
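The two override rules can be sketched as predicates. The thresholds (3.75 words, 0.8 hit rate, 80% of the queue) come from the slide. The extracted text does not say which block applies which rule, so treating the first as a per-core check (LPC) and the second as a memory-controller check (GPC) is an assumption, as are the function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CG_PATTERN 0xFFu   /* all eight words of the line */

/* First rule: ignore the SPP when the average number of referenced
 * words per line or the row-buffer hit rate is high. */
bool ignore_spp(double avg_ref_words, double row_buffer_hit_rate)
{
    return avg_ref_words > 3.75 || row_buffer_hit_rate > 0.8;
}

/* Second rule: treat every request as CG when CG requests make up
 * more than 80% of the memory-controller queue (integer form of
 * cg / queued > 0.8). */
bool force_all_cg(unsigned cg_requests, unsigned queued_requests)
{
    assert(cg_requests <= queued_requests);
    return queued_requests > 0 && 5u * cg_requests > 4u * queued_requests;
}

/* Final word mask sent to memory for one request. */
uint8_t final_pattern(uint8_t spp_pattern, bool ignore, bool force_cg)
{
    return (ignore || force_cg) ? (uint8_t)CG_PATTERN : spp_pattern;
}
```

An application that touches 5 to 7 words per line trips the first rule (average above 3.75), so it falls back to CG instead of issuing many FG requests.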

EVALUATION

• Zesto simulator
– 8 out-of-order x86 cores
– Private caches: 32kB I/D L1, 256kB unified L2
– Shared last-level cache: 8MB
• DrSim: detailed DDR3 DRAM model
• Memory intensive multiprogrammed workloads
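The result charts report weighted speedup, which the slides do not define. The sketch below assumes the commonly used definition for multiprogrammed workloads: the sum, over all programs, of each program's IPC when sharing the system divided by its IPC when running alone. The function and parameter names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted speedup = sum over programs of ipc_shared[i] / ipc_alone[i]
 * (assumed definition; the slides do not state one). */
double weighted_speedup(const double *ipc_shared, const double *ipc_alone,
                        size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(ipc_alone[i] > 0.0);
        sum += ipc_shared[i] / ipc_alone[i];
    }
    return sum;
}
```

Under this definition, an 8-core workload in which no program slows down at all would score 8.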

SYSTEM THROUGHPUT

[Figure: weighted speedup (y-axis 0 to 6) of CG, AGMS, and DGMS for SSCA2, canneal, em3d, mst, gups, mcf, omnetpp, lbm, OCEAN, s-cluster, stream, and MIX1 to MIX5]

LOW SPATIAL LOCALITY APPS

[Figure: weighted speedup (y-axis 0 to 6) of CG, AGMS, and DGMS for SSCA2, canneal, em3d, mst, gups, mcf, omnetpp]

HIGH SPATIAL LOCALITY APPS

[Figure: weighted speedup (y-axis 0 to 6) of CG, AGMS, and DGMS; visible x-axis labels: mcf, omnetpp, lbm, OCEAN, s-cluster, stream, MIX1, MIX2, MIX3]

MIXED CASES

[Figure: weighted speedup of CG, AGMS, and DGMS; visible x-axis labels: s-cluster, stream, MIX1 to MIX5]

• MIX1: SSCA2 x2, mst x2, em3d x2, canneal x2
• MIX2: SSCA2 x2, canneal x2, mcf x2, OCEAN x2
• MIX3: canneal x2, mcf x2, bzip2 x2, hmmer x2
• MIX4: mcf x4, omnetpp x4
• MIX5: SSCA2 x2, canneal x2, mcf x2, streamcluster x2

POWER EFFICIENCY

[Figure: normalized throughput/power (y-axis 0.0 to 2.0, with two bars annotated 3.2 and 2.8) of CG, AGMS, and DGMS for SSCA2, canneal, em3d, mst, gups, mcf, omnetpp, lbm, OCEAN, s-cluster, stream, and MIX1 to MIX5]

SYSTEM THROUGHPUT (NO ECC)

[Figure: weighted speedup (y-axis 0 to 7) of CG, AGMS, and DGMS without ECC for SSCA2, canneal, em3d, mst, gups, mcf, omnetpp, lbm, OCEAN, s-cluster, stream, and MIX1 to MIX5]

CONCLUSIONS

• Dynamic Granularity Memory System
– HW-only variant of AGMS
– Truly dynamic granularity adaptation
– Higher performance [31% vs. CG]
– Lower DRAM power [13% vs. CG]
• More in the paper
– Reg/demux and address/command bus bandwidth
– LPC & GPC details
– DGMS with chipkill-correct support
