Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry Vivek Seshadri
May 01, 2018

Page 1

Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu,

P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Vivek Seshadri

Page 2

Bulk data copy and initialization

• Unnecessarily move data on the memory channel

• Degrade system performance and energy efficiency

RowClone – perform copy in DRAM with low cost

• Uses row buffer to copy large quantity of data

• Source row → row buffer → destination row

• 11X lower latency and 74X lower energy for a bulk copy

Accelerate Copy-on-Write and Bulk Zeroing

• Forking, checkpointing, zeroing (security), VM cloning

Improves performance and energy efficiency at low cost

• 27% higher performance and 17% lower memory energy per instruction for 8-core systems (0.01% DRAM chip area overhead)

Page 3

Figure: Core, Core, Cache, MC, Memory, Channel. Channel annotations: Limited Bandwidth, High Energy

Page 4

Figure: Core, Core, Cache, MC, Memory, Channel

Reduce unnecessary data movement

Page 5

Bulk Data Copy: src → dst

Bulk Data Initialization: val → dst

Page 7

• Forking

• Zero initialization (e.g., security)

• VM Cloning

• Deduplication

• Checkpointing

• Page Migration

• Many more

Page 8

Figure: Core, Core, Cache, MC, Channel, src, dst

• High latency (1046ns to copy 4KB)

• Interference

• High energy (3600nJ to copy 4KB)

Page 9

Figure: Core, Core, Cache, MC, Channel, src, dst. High latency, interference, and high energy are each marked with an X; the path from src to dst is marked with a "?"

Page 10

Introduction

DRAM Background

RowClone

• Fast Parallel Mode

• Pipelined Serial Mode

End-to-end Design

Evaluation

Page 11

Figure: DRAM organization. Labels: Memory Channel, Chip I/O, Bank, Bank I/O, Subarray, Row Buffer, Row of DRAM Cells

Page 12

Figure: Memory Channel, Chip I/O, Bank I/O

ACTIVATE: Copy data from row to row buffer

READ: Transfer data to channel using the shared bus

Page 13

Figure: DRAM cell connected to a sense amplifier (row buffer). Voltage labels: VDD/2, VDD/2, 0, VDD

Page 14

Figure: DRAM cell and sense amplifier (row buffer). Voltage labels: VDD/2, VDD/2 + δ, 0, VDD. Command labels: ACTIVATE, READ/WRITE

• Cell loses charge

• Amplify the difference

• Restore cell data

In the stable state, the sense amplifier drives the cell
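The small deviation δ from VDD/2 in the ACTIVATE sequence above comes from the cell sharing its charge with the precharged bitline. The sketch below works through that arithmetic using charge conservation. The supply voltage and both capacitance values are assumptions picked for illustration; none of them appear on the slides.

```python
# All three constants are hypothetical, chosen only for illustration.
VDD = 1.5            # supply voltage in volts (assumed)
C_CELL = 24e-15      # cell capacitance in farads (assumed)
C_BITLINE = 144e-15  # bitline capacitance in farads (assumed)


def bitline_after_activate(v_cell, vdd=VDD, c_cell=C_CELL, c_bl=C_BITLINE):
    """Bitline voltage after the cell shares charge with it.

    The bitline starts precharged to VDD/2. Charge conservation gives the
    shared voltage, which equals VDD/2 + delta (delta is negative for a 0).
    """
    return (c_cell * v_cell + c_bl * vdd / 2) / (c_cell + c_bl)


def sense(v_bitline, vdd=VDD):
    """Amplify the deviation from VDD/2 to a full rail (VDD or 0)."""
    return vdd if v_bitline > vdd / 2 else 0.0


delta_one = bitline_after_activate(VDD) - VDD / 2   # cell stored a 1
delta_zero = bitline_after_activate(0.0) - VDD / 2  # cell stored a 0
```

Once `sense` has driven the bitline to a full rail, the same voltage is what restores the cell, which is the stable state the slide refers to.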

Page 15

Introduction

DRAM Background

RowClone

• Fast Parallel Mode

• Pipelined Serial Mode

End-to-end Design

Evaluation

Page 16

Figure: src row, dst row, Row Buffer (with a "?" annotation)

1. Source row to row buffer

2. Row buffer to destination row

Page 17

Figure: src cell, dst cell, and sense amplifier (row buffer). Voltage labels: VDD/2, VDD/2 + δ, 0, VDD

• Amplify the difference

• Data gets copied

Page 18

Figure: src row, dst row, Row Buffer

1. Activate src row (copy data from src to row buffer)

2. Activate dst row (disconnect src from row buffer, connect dst – copy data from row buffer to dst)
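The two activations above can be sketched as a toy software model. This is an illustration only, not the DRAM circuit: the `Subarray` class, its methods, and the use of `None` for the precharged state are all hypothetical names and choices made for this sketch.

```python
class Subarray:
    """Toy model: rows of cells sharing one row buffer (sense amplifiers)."""

    def __init__(self, num_rows, row_size):
        self.rows = [bytearray(row_size) for _ in range(num_rows)]
        self.row_buffer = None  # None models the precharged state

    def activate(self, row):
        if self.row_buffer is None:
            # Precharged bitlines: the cells perturb the bitlines and the
            # sense amplifiers latch the row's data.
            self.row_buffer = bytearray(self.rows[row])
        else:
            # Row buffer already stable: the sense amplifiers drive the
            # newly connected cells and overwrite them.
            self.rows[row][:] = self.row_buffer

    def precharge(self):
        self.row_buffer = None


def rowclone_fpm(subarray, src, dst):
    """Copy a whole row with two back-to-back activations."""
    subarray.precharge()
    subarray.activate(src)  # step 1: src row -> row buffer
    subarray.activate(dst)  # step 2: row buffer -> dst row
    subarray.precharge()
```

No data leaves the subarray in this model, which mirrors why the copy consumes no channel bandwidth.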

Page 19

Bulk Data Copy

• Latency: 11x lower (1046ns to 90ns)

• Energy: 74x lower (3600nJ to 40nJ)

No bandwidth consumption

Very few changes to the DRAM chip

Page 20

Location of source/destination

• Both should be in the same subarray

Size of the copy

• Copies all the data from source row to destination

Page 21

Figure: Memory Channel, Chip I/O, Bank, Shared internal bus

• Overlap the latency of the read and the write

• 1.9X latency reduction, 3.2X energy reduction

Page 22

Figure: Memory Channel, Chip I/O, Bank, Bank I/O, Subarray

• Intra-subarray: use FPM

• Inter-bank: use PSM

• Inter-subarray: use PSM twice
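The three cases above amount to a small decision rule on where the source and destination rows sit. A minimal sketch follows. The `RowAddr` tuple and `choose_copy_plan` function are hypothetical names, and handling identical rows as "nothing to do" is an assumption added for completeness.

```python
from collections import namedtuple

# Hypothetical row address: bank, subarray within the bank, row within it.
RowAddr = namedtuple("RowAddr", ["bank", "subarray", "row"])


def choose_copy_plan(src, dst):
    """Return the sequence of RowClone operations for one row copy."""
    if src == dst:
        return []                # same row: nothing to copy (assumed case)
    if src.bank != dst.bank:
        return ["PSM"]           # inter-bank: one pipelined serial copy
    if src.subarray != dst.subarray:
        return ["PSM", "PSM"]    # same bank, different subarray: PSM twice
    return ["FPM"]               # same subarray: fast parallel mode
```

For example, `choose_copy_plan(RowAddr(0, 3, 10), RowAddr(0, 3, 42))` selects FPM, while moving the destination to subarray 4 of the same bank selects two PSM operations.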

Page 23

Initialization with arbitrary data

• Initialize one row

• Copy the data to other rows

Zero initialization (most common)

• Reserve a row in each subarray (always zero)

• Copy data from reserved row (FPM mode)

• 6.0X lower latency, 41.5X lower DRAM energy

• 0.2% loss in capacity
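The capacity loss follows from reserving one row per subarray. The arithmetic below assumes 512 rows per subarray, a figure that is not stated on the slide, so treat it as an illustrative assumption.

```python
ROWS_PER_SUBARRAY = 512  # assumed; not stated on the slide


def reserved_row_overhead(rows_per_subarray=ROWS_PER_SUBARRAY, reserved_rows=1):
    """Fraction of capacity lost to reserved (always-zero) rows."""
    return reserved_rows / rows_per_subarray


overhead_percent = 100 * reserved_row_overhead()
```

Under that assumption `overhead_percent` rounds to 0.2, in line with the figure quoted above.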

Page 24

Figure: Latency Reduction and Energy Reduction over baseline, for Copy (Intra-Subarray, Inter-Bank, Inter-Subarray) and Zero (Intra-Subarray)

• Copy, Intra-Subarray: 11.6x latency reduction, 74.4x energy reduction

• Copy, Inter-Bank: 1.9x latency reduction, 3.2x energy reduction

• Copy, Inter-Subarray: 1.0x latency reduction, 1.5x energy reduction

• Zero, Intra-Subarray: 6.0x latency reduction, 41.5x energy reduction

Very low cost: 0.01% increase in die area

Page 25

Introduction

DRAM Background

RowClone

• Fast Parallel Mode

• Pipelined Serial Mode

End-to-end Design

Evaluation

Page 26

System stack: Application, Operating System, ISA, Microarchitecture, DRAM (RowClone)

How does the software communicate occurrences of bulk copy/initialization to hardware?

How to maximize use of the Fast Parallel Mode?

How to ensure cache coherence?

Handling data reuse after zero initialization?

Page 27

Two new instructions

• memcopy and meminit

• Similar instructions present in existing ISAs

Microarchitecture Implementation

• Checks if instructions can be sped up by RowClone

• Export instructions to the memory controller
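One way to picture the check for whether a memcopy can be sped up by RowClone is to split the request into row-granular pieces and leftovers. The sketch below is an assumption about how such a check could look, not the design from the talk: `plan_memcopy`, the 4096-byte `ROW_SIZE`, and the rule that source and destination must share the same offset within a row are all hypothetical.

```python
ROW_SIZE = 4096  # bytes per DRAM row as seen by software (assumed)


def plan_memcopy(src, dst, size, row_size=ROW_SIZE):
    """Split a copy into (engine, src, dst, length) pieces.

    Whole rows go to "rowclone"; partial rows fall back to "cpu".
    """
    if size <= 0:
        return []
    # A row-granular copy needs src and dst at the same offset in a row.
    if src % row_size != dst % row_size:
        return [("cpu", src, dst, size)]
    head = min((-src) % row_size, size)  # bytes before the next row boundary
    body = (size - head) // row_size * row_size
    tail = size - head - body
    plan = []
    if head:
        plan.append(("cpu", src, dst, head))
    if body:
        plan.append(("rowclone", src + head, dst + head, body))
    if tail:
        plan.append(("cpu", src + head + body, dst + head + body, tail))
    return plan
```

A row-aligned, row-sized request maps to a single RowClone piece, while a misaligned one is left entirely to the processor.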

Page 28

RowClone modifies data in memory

• Need to maintain coherence of cached data

Similar to DMA

• Source and destination in memory

• Can leverage hardware support for DMA

Additional optimizations

Page 29

Make operating system subarray-aware

Primitives amenable to use of FPM

• Copy-on-Write: allocate destination in same subarray as source, then use FPM to copy

• Bulk Zeroing: use FPM to copy data from reserved zero row
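The Copy-on-Write case can be sketched as a page allocator that keeps one free list per subarray and prefers the source page's subarray. The class and method names, and the page-to-subarray mapping passed in by the caller, are hypothetical; a real operating system would need the actual address mapping from the memory controller.

```python
from collections import defaultdict


class SubarrayAwareAllocator:
    """Toy free-page allocator with one free list per subarray."""

    def __init__(self, free_pages, subarray_of):
        self.subarray_of = subarray_of  # maps a page number to a subarray id
        self.free = defaultdict(list)
        for page in free_pages:
            self.free[subarray_of(page)].append(page)

    def alloc_for_copy(self, src_page):
        """Return (dst_page, mode), preferring the source's subarray."""
        same = self.free[self.subarray_of(src_page)]
        if same:
            return same.pop(), "FPM"
        # Fall back to any free page; the copy then needs the slower mode.
        for pages in self.free.values():
            if pages:
                return pages.pop(), "PSM"
        raise MemoryError("no free pages")
```

With a hypothetical mapping such as `lambda page: page // 512`, a copy-on-write fault on page 5 is served from subarray 0 while that subarray still has free pages.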

Page 30

Data reuse after zero initialization

• Phase 1: OS zeroes out the page

• Phase 2: Application uses cachelines of the page

RowClone

• Avoids misses in phase 1

• But incurs misses in phase 2

RowClone-Zero-Insert (RowClone-ZI)

• Insert clean zero cachelines
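The effect on phase 2 can be illustrated with a toy cache model that counts misses with and without zero insertion. The page and cacheline sizes, the unbounded cache, and the choice to insert every line of the page are assumptions made for this sketch; the slide does not specify the insertion policy.

```python
PAGE_SIZE = 4096  # bytes (assumed)
LINE_SIZE = 64    # bytes per cacheline (assumed)


def phase2_misses(zero_insert, lines_touched, page_base=0):
    """Count phase-2 cache misses after the page was zeroed in DRAM."""
    cache = set()  # unbounded cache of line addresses (simplification)
    if zero_insert:
        # RowClone-ZI: clean zero cachelines are inserted during phase 1.
        for addr in range(page_base, page_base + PAGE_SIZE, LINE_SIZE):
            cache.add(addr)
    misses = 0
    for line in lines_touched:
        addr = page_base + line * LINE_SIZE
        if addr not in cache:
            misses += 1
            cache.add(addr)
    return misses
```

In this model, touching all 64 lines of a freshly zeroed page misses on every line without zero insertion and on none with it.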

Page 31

Introduction

DRAM Background

RowClone

• Fast Parallel Mode

• Pipelined Serial Mode

End-to-end Design

Evaluation

Page 32

Out-of-order multi-core simulator

1MB/core last-level cache

Cycle-accurate DDR3 DRAM simulator

6 Copy/Initialization intensive applications

+SPEC CPU2006 for multi-core

Performance

• Instruction throughput for single-core

• Weighted Speedup for multi-core
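For reference, the two performance metrics can be computed as follows, using the commonly used definitions: instruction throughput as the sum of per-core IPCs, and weighted speedup as the sum of each application's IPC when sharing the system divided by its IPC when running alone. The function names and the sample numbers in the usage note are hypothetical, and the talk may define the metrics with minor differences.

```python
def instruction_throughput(ipc_shared):
    """Sum of per-application IPCs when running together."""
    return sum(ipc_shared)


def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC(shared) / IPC(alone)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC per application")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))
```

As a usage example with made-up numbers, two applications with alone IPCs of 2.0 and 1.0 that each run at half speed when sharing give `weighted_speedup([1.0, 0.5], [2.0, 1.0])`, where each term contributes 0.5.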

Page 33

System bootup (Booting the Debian OS)

Compile (GNU C compiler – executing cc1)

Forkbench (A fork microbenchmark)

Memcached (Inserting a large number of objects)

MySQL (Loading a database)

Shell script (find with ls on each subdirectory)

Page 34

Figure: Fraction of Memory Traffic (0 to 1) by type (Zero, Copy, Write, Read) for bootup, compile, forkbench, mcached, mysql, shell

Page 35

Figure: IPC Improvement and Memory Energy Reduction Compared to Baseline (0% to 70%) for bootup, compile, forkbench, mcached, mysql, shell

Improvements correlate with fraction of memory traffic due to copy/initialization

Page 36

Reduced bandwidth consumption benefits all applications.

Run copy/initialization intensive applications with memory intensive SPEC applications.

Half the cores run copy/initialization intensive applications. Remaining half run SPEC applications.

Page 37

Figure: Improvement over Baseline (0% to 30%) in System Performance and Memory Energy Efficiency for 2-Core, 4-Core, and 8-Core systems

• Performance improvement increases with increasing core count

• Consistent improvement in energy/instruction

Page 38

Discussion on interleaving and copy granularity

Detailed analysis of the fork benchmark

Detailed multi-core results and analysis

Results with the PSM mode

Analysis of RowClone-ZI

Comparison to memory-controller-based DMA

Page 39

Bulk data copy and initialization

• Unnecessarily move data on the memory channel

• Degrade system performance and energy efficiency

RowClone – perform copy in DRAM with low cost

• Uses row buffer to copy large quantity of data

• Source row → row buffer → destination row

• 11X lower latency and 74X lower energy for a bulk copy

Accelerate Copy-on-Write and Bulk Zeroing

• Forking, checkpointing, zeroing (security), VM cloning

Improves performance and energy efficiency at low cost

• 27% higher performance and 17% lower memory energy per instruction for 8-core systems (0.01% chip area overhead)

Page 42

Metric                             2-core   4-core   8-core
# Workloads                        138      50       40
Weighted Speedup                   15%      20%      27%
Instruction Throughput             14%      15%      25%
Harmonic Speedup                   13%      16%      29%
Max Slowdown Reduction             6%       12%      23%
Bandwidth/Instruction Reduction    29%      27%      28%
Energy/Instruction Reduction       19%      17%      17%

Page 43

Figure: Instructions per Cycle (0 to 2.5) for Baseline, RowClone, and RowClone-ZI on bootup, compile, forkbench, mcached, mysql, shell

Page 44

Figure: Normalized Weighted Speedup (0.9 to 1.4) for Baseline, RowClone, and RowClone-ZI

Page 45

Figure: unlabeled y-axis (0 to 0.7) vs. Number of Pages Updated (2 to 16k); series: 64MB, 128MB

Page 46

Figure: Normalized IPC (0 to 2.5) vs. Number of Pages Updated (2 to 16k); series: 64MB, 128MB

Page 47

Figure: Normalized Energy (0 to 1.2) vs. Number of Pages Updated (2 to 16k); series: Baseline, RowClone-PSM, RowClone-FPM

Page 48

Copy engines (Zhao et al. 2005, Jiang et al. 2009)

• Addresses cache pollution, pipeline stalls due to copy

• But requires data transfer over the memory channel

IRAM (Patterson et al. 1997)

• Compute + memory using same technology

• Exploit high DRAM bandwidth

• Goal: Wider range of SIMD operations

• High cost

Page 49

Copy/Initialization is important

• But not well known

Opportunity to perform in DRAM

• Not well known

This paper: Proof of concept

• More challenges to be addressed