Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch

Feb 24, 2016

Page 1: Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch

Page 2: Executive Summary

• Main memory is a limited shared resource
• Observation: Significant data redundancy
• Idea: Compress data in main memory
• Problem: How to avoid inefficiency in address computation?
• Solution: Linearly Compressed Pages (LCP): fixed-size cache line granularity compression
  1. Increases memory capacity (62% on average)
  2. Decreases memory bandwidth consumption (24%)
  3. Decreases memory energy consumption (4.9%)
  4. Improves overall performance (13.9%)

Page 3: Potential for Data Compression

Significant redundancy in in-memory data:

0x00000000
0x0000000B 0x00000003 0x00000004 …

How can we exploit this redundancy?
• Main memory compression helps
• Provides effect of a larger memory without making it physically larger

Page 4: Challenges in Main Memory Compression

1. Address Computation
2. Mapping and Fragmentation

Page 5: Challenge 1: Address Computation

Uncompressed Page: cache lines L0, L1, L2, . . ., LN-1 (64B each) sit at address offsets 0, 64, 128, . . ., (N-1)*64.

Compressed Page: cache lines L0, L1, L2, . . ., LN-1 have varying sizes, so only the offset of L0 (0) is known; the offsets of the other lines (?) are not.

Page 6: Challenge 2: Mapping & Fragmentation

[Figure: a Virtual Page (4KB) at a Virtual Address maps to a Physical Page (? KB) at a Physical Address; the leftover space is marked Fragmentation]

Page 7: Outline

• Motivation & Challenges
• Shortcomings of Prior Work
• LCP: Key Idea
• LCP: Implementation
• Evaluation
• Conclusion and Future Work

Page 8: Key Parameters in Memory Compression

• Compression Ratio
• Address Comp. Latency
• Decompression Latency
• Complexity and Cost

Page 9: Shortcomings of Prior Work

Table columns: Compression Mechanisms | Compression Ratio | Address Comp. Latency | Decompression Latency | Complexity and Cost

• IBM MXT [IBM J.R.D. ’01]: 2X, 64 cycles

Page 10: Shortcomings of Prior Work (2)

Table columns: Compression Mechanisms | Compression Ratio | Address Comp. Latency | Decompression Latency | Complexity and Cost

• IBM MXT [IBM J.R.D. ’01]
• Robust Main Memory Compression [ISCA’05]

Page 11: Shortcomings of Prior Work (3)

Table columns: Compression Mechanisms | Compression Ratio | Address Comp. Latency | Decompression Latency | Complexity and Cost

• IBM MXT [IBM J.R.D. ’01]
• Robust Main Memory Compression [ISCA’05]
• LCP: Our Proposal

Page 12: Linearly Compressed Pages (LCP): Key Idea

Uncompressed Page (4KB: 64*64B): 64B 64B 64B 64B . . . 64B

4:1 Compression

Compressed Data (1KB): . . .

Address offset 128 in the uncompressed page becomes offset 32 in the compressed data.

LCP effectively solves challenge 1: address computation

Fixed compressed size
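The contrast between the two address computations can be sketched in a few lines. This is only an illustration of the arithmetic on the slide, not the hardware logic; the function names and the variable-size example are assumptions made for the sketch.

```python
LINE_SIZE = 64  # uncompressed cache line size in bytes (from the slides)


def offset_uncompressed(line_idx):
    """Offset of a cache line inside an uncompressed page."""
    return line_idx * LINE_SIZE


def offset_variable(line_idx, compressed_sizes):
    """Offset when every line has its own compressed size.

    All preceding sizes must be summed, which is the costly step
    that Challenge 1 refers to.
    """
    return sum(compressed_sizes[:line_idx])


def offset_fixed(line_idx, compressed_line_size):
    """Offset when all lines in a page share one compressed size (LCP idea)."""
    return line_idx * compressed_line_size


# 4:1 compression stores each 64B line in 16B, so the line at
# uncompressed offset 128 lands at compressed offset 32.
print(offset_uncompressed(2), offset_fixed(2, 16))
```

With a fixed compressed size the offset is a single multiplication (a shift when the size is a power of two), whereas the variable-size case needs the sizes of all earlier lines.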

Page 13: LCP: Key Idea (2)

Uncompressed Page (4KB: 64*64B): 64B 64B 64B 64B . . . 64B

4:1 Compression

Compressed page: Compressed Data (1KB) | M: Metadata (64B) | E: Exception Storage

An index (idx) in the metadata points to slot E0 in the exception storage.
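A lookup that takes the exception storage into account might look like the sketch below. The layout order (compressed data, then metadata, then exception storage) follows the slide; the `LineMeta` record, the 64B exception slot size and the function name are assumptions for illustration and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LINE_SIZE = 64       # uncompressed cache line size in bytes
LINES_PER_PAGE = 64  # 4KB page = 64 * 64B
METADATA_SIZE = 64   # metadata region size shown on the slide


@dataclass
class LineMeta:
    """Hypothetical per-line metadata entry."""
    is_exception: bool
    exc_idx: Optional[int] = None  # slot in exception storage, if any


def locate_line(line_idx: int, meta: List[LineMeta],
                compressed_line_size: int) -> Tuple[int, int]:
    """Return (offset within the compressed page, stored size in bytes)."""
    entry = meta[line_idx]
    if not entry.is_exception:
        # Common case: linear address computation.
        return line_idx * compressed_line_size, compressed_line_size
    # Exception: the line is assumed to be stored uncompressed in a slot
    # that follows the compressed data and the metadata.
    exc_base = LINES_PER_PAGE * compressed_line_size + METADATA_SIZE
    return exc_base + entry.exc_idx * LINE_SIZE, LINE_SIZE


meta = [LineMeta(False) for _ in range(LINES_PER_PAGE)]
meta[5] = LineMeta(True, 0)  # line 5 did not fit into 16B
print(locate_line(2, meta, 16), locate_line(5, meta, 16))
```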

Page 14: But, wait …

Uncompressed Page (4KB: 64*64B): 64B 64B 64B 64B . . . 64B

4:1 Compression

Compressed page: Compressed Data (1KB) | M | E

How to avoid 2 accesses?

Metadata (MD) cache

Page 15: Key Ideas: Summary

• Fixed compressed size per cache line
• Metadata (MD) cache

Page 16: Outline

• Motivation & Challenges
• Shortcomings of Prior Work
• LCP: Key Idea
• LCP: Implementation
• Evaluation
• Conclusion and Future Work

Page 17: LCP Overview

• Page Table entry extension
  • compression type and size (fixed encoding)
• OS support for multiple page sizes
  • 4 memory pools (512B, 1KB, 2KB, 4KB)
• Handling uncompressible data
• Hardware support
  • memory controller logic
  • metadata (MD) cache
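One way to picture the four memory pools is a function that maps a compressed page size to a pool. The slide only lists the pool sizes; the smallest-fit policy and the name `pick_pool` are assumptions made for this sketch.

```python
POOL_SIZES = (512, 1024, 2048, 4096)  # pool sizes in bytes, from the slide


def pick_pool(compressed_page_bytes):
    """Return the smallest pool size that can hold the compressed page.

    Smallest-fit is an assumed policy for illustration only.
    """
    if compressed_page_bytes <= 0:
        raise ValueError("page size must be positive")
    for size in POOL_SIZES:
        if compressed_page_bytes <= size:
            return size
    raise ValueError("page does not fit into a 4KB physical page")


print(pick_pool(900), pick_pool(1025))
```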

Page 18: Page Table Entry Extension

Page Table Entry fields: c-bit (1b), c-type (3b), c-size (2b), c-base (3b)

• c-bit (1b) – compressed or uncompressed page
• c-type (3b) – compression encoding used
• c-size (2b) – LCP size (e.g., 1KB)
• c-base (3b) – offset within a page
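The four fields add up to 9 extra bits per page table entry. The sketch below packs and unpacks them as one small integer. The field widths come from the slide; the bit positions, the field order and the function names are assumptions, since the slide does not give the actual PTE layout.

```python
# Field widths in bits, as listed on the slide.
C_BIT_W, C_TYPE_W, C_SIZE_W, C_BASE_W = 1, 3, 2, 3


def pack_lcp_fields(c_bit, c_type, c_size, c_base):
    """Pack the LCP fields into a 9-bit value (assumed layout, MSB first:
    c-bit | c-type | c-size | c-base)."""
    fields = ((c_bit, C_BIT_W), (c_type, C_TYPE_W),
              (c_size, C_SIZE_W), (c_base, C_BASE_W))
    packed = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        packed = (packed << width) | value
    return packed


def unpack_lcp_fields(packed):
    """Inverse of pack_lcp_fields: returns (c_bit, c_type, c_size, c_base)."""
    c_base = packed & 0x7
    c_size = (packed >> 3) & 0x3
    c_type = (packed >> 5) & 0x7
    c_bit = (packed >> 8) & 0x1
    return c_bit, c_type, c_size, c_base


bits = pack_lcp_fields(1, 2, 1, 4)
print(bin(bits), unpack_lcp_fields(bits))
```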

Page 19: Physical Memory Layout

[Figure: Page Table entries PA0, PA1 and PA2, with offsets 4 and 1, point into physical memory divided into 4KB, 2KB, 1KB and 512B chunks; the pages are located at PA0, PA1 + 512*4 and PA2 + 512*1]
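The addresses in the figure (PA1 + 512*4, PA2 + 512*1) suggest that the offset is counted in 512B units. A minimal sketch of that arithmetic follows; the function name, the bounds check against a 4KB frame and the example base address are assumptions for illustration.

```python
CHUNK = 512        # offset granularity in bytes, as in "PA1 + 512*4"
FRAME_SIZE = 4096  # size of the enclosing physical frame in bytes


def compressed_page_address(frame_base, c_base, page_size):
    """Start address of a compressed page inside a 4KB frame.

    frame_base: physical address of the frame (PA0, PA1, ... in the figure)
    c_base:     offset in 512B units (3 bits, so 0..7)
    page_size:  compressed page size in bytes (512, 1024, 2048 or 4096)
    """
    if not 0 <= c_base <= 7:
        raise ValueError("c_base must fit in 3 bits")
    offset = CHUNK * c_base
    if offset + page_size > FRAME_SIZE:
        raise ValueError("compressed page would cross the frame boundary")
    return frame_base + offset


# Hypothetical frame base address used only for this example.
print(hex(compressed_page_address(0x10000, 4, 2048)))
```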

Page 20: Memory Request Flow

1. Initial Page Compression
2. Cache Line Read
3. Cache Line Writeback

Page 21: Memory Request Flow (2)

[Figure: Processor (Core, TLB, Last-Level Cache, Memory Controller, Compress/Decompress, MD Cache), Disk and DRAM]

1. Initial Page Compression (1/3): 4KB, 1KB
2. Cache Line Read (2/3): LD, $Line, 1KB
3. Cache Line Writeback (3/3): $Line, 2KB

Page 22: Handling Page Overflows

• Happens after writebacks, when all slots in the exception storage are already taken
• Two possible scenarios:
  • Type-1 overflow: requires larger physical page size (e.g., 2KB instead of 1KB)
  • Type-2 overflow: requires decompression and full uncompressed physical page (e.g., 4KB)

[Figure: a $ line written back to a page laid out as Compressed Data | M | Exception Storage (E0, E1, E2)]

Happens infrequently: once per ~2M instructions
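The two overflow scenarios can be told apart with a small decision function. This is a sketch under stated assumptions: the rule that anything not fitting into a compressed pool smaller than 4KB becomes a Type-2 overflow, and the name `classify_overflow`, are illustrative choices rather than the mechanism described in the talk.

```python
POOL_SIZES = (512, 1024, 2048, 4096)  # memory pools from the LCP Overview slide
UNCOMPRESSED_PAGE = 4096              # full uncompressed physical page


def classify_overflow(current_size, needed_bytes):
    """Classify a writeback that needs `needed_bytes` in a page that
    currently occupies `current_size` bytes.

    Returns (kind, new_size) with kind in {"none", "type-1", "type-2"}.
    """
    if needed_bytes <= current_size:
        return "none", current_size
    for size in POOL_SIZES:
        # Type-1: a larger, still compressed, physical page is enough.
        if current_size < size < UNCOMPRESSED_PAGE and needed_bytes <= size:
            return "type-1", size
    # Type-2: fall back to a full uncompressed 4KB page.
    return "type-2", UNCOMPRESSED_PAGE


print(classify_overflow(1024, 1500))  # needs a 2KB page instead of 1KB
print(classify_overflow(1024, 3000))  # needs a full 4KB page
```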

Page 23: Compression Algorithms

• Key requirements:
  • Low hardware complexity
  • Low decompression latency
  • High effective compression ratio
• Frequent Pattern Compression [ISCA’04]
  • Uses simplified dictionary-based compression
• Base-Delta-Immediate Compression [PACT’12]
  • Uses low dynamic range in the data

Page 24: Base-Delta Encoding [PACT’12]

32-byte Uncompressed Cache Line: 0xC04039C0 0xC04039C8 0xC04039D0 … 0xC04039F8

12-byte Compressed Cache Line: Base 0xC04039C0, followed by 1-byte deltas 0x00 0x08 0x10 … 0x38

20 bytes saved

• Fast Decompression: vector addition
• Simple Hardware: arithmetic and comparison
• Effective: good compression ratio

BDI [PACT’12] has two bases:
1. zero base (for narrow values)
2. arbitrary base (first non-zero value in the cache line)
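The slide's example can be reproduced with a small base-delta sketch. It is a simplification, not the BDI design: it uses a single base (the first word), unsigned deltas and little-endian byte order, all of which are assumptions made to keep the example short.

```python
def base_delta_compress(line, word_size=4, delta_size=1):
    """Compress `line` as one base word followed by per-word deltas.

    Returns the compressed bytes, or None if some delta does not fit.
    """
    words = [int.from_bytes(line[i:i + word_size], "little")
             for i in range(0, len(line), word_size)]
    base = words[0]  # simplification: the first word is the base
    limit = 1 << (8 * delta_size)
    deltas = [w - base for w in words]
    if any(d < 0 or d >= limit for d in deltas):
        return None  # not compressible with this base/delta size
    return base.to_bytes(word_size, "little") + b"".join(
        d.to_bytes(delta_size, "little") for d in deltas)


def base_delta_decompress(blob, word_size=4, delta_size=1):
    """Rebuild the original line: every word is base + delta."""
    base = int.from_bytes(blob[:word_size], "little")
    out = b""
    for i in range(word_size, len(blob), delta_size):
        delta = int.from_bytes(blob[i:i + delta_size], "little")
        out += (base + delta).to_bytes(word_size, "little")
    return out


# The cache line from the slide: 0xC04039C0, 0xC04039C8, ..., 0xC04039F8.
line = b"".join((0xC04039C0 + 8 * i).to_bytes(4, "little") for i in range(8))
packed = base_delta_compress(line)
print(len(line), len(packed))  # 32 12
```

Decompression is one addition per word, which is why the slide describes it as a vector addition.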

Page 25: LCP-Enabled Optimizations

• Memory bandwidth reduction:
  • 64B 64B 64B 64B: 1 transfer instead of 4
• Zero pages and zero cache lines
  • Handled separately in TLB (1-bit) and in metadata (1-bit per cache line)

Page 26: Outline

• Motivation & Challenges
• Shortcomings of Prior Work
• LCP: Key Idea
• LCP: Implementation
• Evaluation
• Conclusion and Future Work

Page 27: Methodology

• Simulator: x86 event-driven, based on Simics
• Workloads (32 applications)
  • SPEC2006 benchmarks, TPC, Apache web server
• System Parameters
  • L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]
  • 512KB - 16MB L2 caches
  • DDR3-1066, 1 memory channel
• Metrics
  • Performance: Instructions per cycle, weighted speedup
  • Capacity: Effective compression ratio
  • Bandwidth: Bytes per kilo-instruction (BPKI)
  • Energy: Memory subsystem energy
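Two of the metrics reduce to short formulas. The sketch below uses the commonly used definition of weighted speedup (sum of each application's shared-mode IPC divided by its stand-alone IPC); the slides do not spell the definition out, so treat it as an assumption. The input numbers are made up for illustration.

```python
def bpki(bytes_transferred, instructions):
    """Bytes per kilo-instruction: memory traffic per 1000 instructions."""
    return bytes_transferred / (instructions / 1000)


def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC when sharing the system divided by
    IPC when running alone (commonly used definition, assumed here)."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one stand-alone IPC per application")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


# Illustrative inputs only, not results from the talk.
print(bpki(64_000, 1_000_000))
print(weighted_speedup([1.0, 0.5], [2.0, 1.0]))
```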

Page 28: Evaluated Designs

Design: Description
• Baseline: Baseline (no compression)
• RMC: Robust main memory compression [ISCA’05] (RMC) and FPC [ISCA’04]
• LCP-FPC: LCP framework with FPC
• LCP-BDI: LCP framework with BDI [PACT’12]
• LZ: Lempel-Ziv compression (per page)

Page 29: Effect on Memory Capacity

32 SPEC2006, databases, web workloads, 2MB L2 cache

[Figure: Compression Ratio per design: Baseline 1.00, RMC 1.59, LCP-FPC 1.52, LCP-BDI 1.62, LZ 2.60]

LCP-based designs achieve average compression ratios competitive with prior work

Page 30: Effect on Bus Bandwidth

32 SPEC2006, databases, web workloads, 2MB L2 cache

[Figure: Normalized BPKI per design (lower is better): Baseline 1.00, RMC 0.79, LCP-FPC 0.80, LCP-BDI 0.76]

LCP-based designs significantly reduce bandwidth (24%) (due to data compression)

Page 31: Effect on Performance

[Figure: Performance Improvement (axis 0% to 16%) of RMC, LCP-FPC and LCP-BDI for 1-core, 2-core and 4-core systems]

LCP-based designs significantly improve performance over RMC

Page 32: Effect on Memory Subsystem Energy

32 SPEC2006, databases, web workloads, 2MB L2 cache

[Figure: Normalized Energy per design (lower is better): Baseline 1.00, RMC 1.06, LCP-FPC 0.97, LCP-BDI 0.95]

LCP framework is more energy efficient than RMC

Page 33: Effect on Page Faults

32 SPEC2006, databases, web workloads, 2MB L2 cache

[Figure: Normalized # of Page Faults for Baseline and LCP-BDI at DRAM sizes 256MB, 512MB, 768MB and 1GB; bar labels: 8%, 14%, 23%, 21%]

LCP framework significantly decreases the number of page faults (up to 23% on average for 768MB)

Page 34: Other Results and Analyses in the Paper

• Analysis of page overflows
• Compressed page size distribution
• Compression ratio over time
• Number of exceptions (per page)
• Detailed single-/multicore evaluation
• Comparison with stride prefetching
  • performance and bandwidth

Page 35: Conclusion

• Old Idea: Compress data in main memory
• Problem: How to avoid inefficiency in address computation?
• Solution: A new main memory compression framework called LCP (Linearly Compressed Pages)
  • Key idea: fixed size for compressed cache lines within a page
• Evaluation:
  1. Increases memory capacity (62% on average)
  2. Decreases bandwidth consumption (24%)
  3. Decreases memory energy consumption (4.9%)
  4. Improves overall performance (13.9%)

Page 36: Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons, Michael A. Kozuch

Page 37: Backup Slides

Page 38: Large Pages (e.g., 2MB or 1GB)

• Splitting large pages into smaller 4KB sub-pages (compressed individually)
• 64-byte metadata chunks for every sub-page

[Figure: 2KB blocks with metadata (M)]

Page 39: Physically Tagged Caches

[Figure: the Core sends a Virtual Address to the TLB; the resulting Physical Address is matched against the tags of the L2 Cache Lines (tag, data). Address Translation is on the Critical Path.]

Page 40: Changes to Cache Tagging Logic

[Figure: Cache Lines, each with a tag and data]

Before: p-base
After: p-base, c-idx

• p-base – physical page base address
• c-idx – cache line index within the page

Page 41: Analysis of Page Overflows

[Figure: Type-1 Overflows per instr. (log-scale, 1E-08 to 1E-03) for apache, bzip2, gcc, gromacs, lbm, libquantum, omnetpp, sjeng, sphinx3, tpch6, zeusmp and GeoMean]

Page 42: Frequent Pattern Compression

Idea: encode cache lines based on frequently occurring patterns, e.g., first half of a word is zero

Uncompressed: 0x00000001 0x00000000 0xFFFFFFFF 0xABCDEFFF

Frequent Patterns:
000 – All zeros
001 – First half zeros
010 – Second half zeros
011 – Repeated bytes
100 – All ones
…
111 – Not a frequent pattern

Word: pattern, stored data
0x00000001: 001, 0x0001
0x00000000: 000
0xFFFFFFFF: 011, 0xFF
0xABCDEFFF: 111, 0xABCDEFFF

Compressed: 001 0x0001 000 011 0xFF 111 0xABCDEFFF
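The per-word encoding in this example can be sketched as a classifier over 32-bit words. Only the patterns used in the example (000, 001, 010, 011 and 111) are implemented; pattern 100 and the patterns elided on the slide are left out, and the check order and function name are assumptions for illustration.

```python
def fpc_encode_word(word):
    """Return (3-bit prefix, payload) for one 32-bit word.

    The payload is the part that still has to be stored, or None.
    """
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("expected a 32-bit unsigned word")
    if word == 0:
        return "000", None           # all zeros: prefix only
    if word >> 16 == 0:
        return "001", word & 0xFFFF  # first half zeros: keep low 16 bits
    if word & 0xFFFF == 0:
        return "010", word >> 16     # second half zeros: keep high 16 bits
    byte = word & 0xFF
    if word == byte * 0x01010101:
        return "011", byte           # repeated bytes: keep one byte
    return "111", word               # not a frequent pattern: keep the word


line = [0x00000001, 0x00000000, 0xFFFFFFFF, 0xABCDEFFF]
for w in line:
    prefix, payload = fpc_encode_word(w)
    print(f"0x{w:08X} -> {prefix}", "" if payload is None else hex(payload))
```

As on the slide, 0xFFFFFFFF is matched by the repeated-bytes pattern (011) and stored as the single byte 0xFF.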

Page 43: GPGPU Evaluation

• GPGPU-Sim v3.x
• Card: NVIDIA GeForce GTX 480 (Fermi)
• Caches:
  – DL1: 16 KB with 128B lines
  – L2: 768 KB with 128B lines
• Memory: GDDR5

Page 44: Effect on Bandwidth Consumption

[Figure: Normalized BPKI (axis 0.0 to 3.0) of BDI and LCP-BDI for BFS, MUM, JPEG, NN, LPS, STO, CONS, SCP, spmv, sad, backprop, hotspot, streamcluster, PVC, PVR, InvIdx, SS, bfs, bh, dmr, mst, sp, sssp and GeoMean; benchmarks are grouped into the CUDA, Parboil, Rodinia, Mars and Lonestar suites]

Page 45: Effect on Throughput

[Figure: Normalized Performance (axis 0.8 to 1.8) of Baseline and BDI for BFS, MUM, JPEG, NN, LPS, STO, CONS, SCP, spmv, sad, backprop, hotspot, streamcluster, PVC, PVR, InvIdx, SS, bfs, bh, dmr, mst, sp, sssp and GeoMean; benchmarks are grouped into the CUDA, Parboil, Rodinia, Mars and Lonestar suites]

Page 46: Physical Memory Layout

[Figure: Page Table entries PA0, PA1 and PA2, with c-base values 4 and 1, point into physical memory divided into 4KB, 2KB, 1KB and 512B chunks; the pages are located at PA0, PA1 + 512*4 and PA2 + 512*1]

Page 47: Page Size Distribution

Page 48: Compression Ratio Over Time

Page 49: IPC (1-core)

Page 50: Weighted Speedup

Page 51: Bandwidth Consumption

Page 52: Page Overflows

Page 53: Stride Prefetching - IPC

Page 54: Stride Prefetching - Bandwidth