RapidMRC: Approximating L2 Miss Rate Curves on Commodity Systems for Online Optimizations David Tam, Reza Azimi, Livio Soares, Michael Stumm University of Toronto {tamda, azimi, livio, stumm}@eecg.toronto.edu March 10, 2009 ASPLOS
Page 1: RapidMRC

RapidMRC: Approximating L2 Miss Rate Curves on Commodity Systems for Online Optimizations

David Tam, Reza Azimi, Livio Soares, Michael Stumm

University of Toronto

{tamda, azimi, livio, stumm}@eecg.toronto.edu

ASPLOS, March 10, 2009

Page 2: RapidMRC

Motivation

● In a shared cache (L2/L3):
    ● Applications compete for space
    ● LRU replacement policy is used
    ● Interference: applications can evict each other's content

[Figure: two cores and a shared cache holding content from App 1 and App 2. Image © Ian Junor, Creative Commons license]

Page 3: RapidMRC

Cache Partitioning

● Eliminates cache space interference
● More flexible than private caches
● Up to 50% improvement in IPC

[Figure: two cores and a shared cache split into an App 1 region and an App 2 region]

Page 4: RapidMRC

OS-Based Partitioning

● Apply page-coloring technique to implement set-based partitioning
    ● Guide physical page allocation to control cache line usage
    ● Works on existing processors [WIOSCA'07]

[Figure: the application's virtual pages are mapped (OS managed) to physical pages of Color A, which map (static mapping, hardware) to the Color A region (N sets) of the L2 cache]

Page 5: RapidMRC

OS-Based Partitioning

[Figure: same diagram with two applications. Application A's virtual pages map to Color A physical pages and the Color A region (N sets) of the L2 cache; Application B's virtual pages map to Color B physical pages and the Color B region (N sets). Virtual-to-physical mapping is OS managed; physical-to-cache mapping is static (hardware)]
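The page-coloring idea above can be sketched in a few lines. This is a minimal illustration, not the mechanism from the talk: the cache geometry, the page size, and the helper names (`num_colors`, `page_color`, `pick_page`) are assumptions, and the sketch assumes the cache index bits sit directly above the page offset so that consecutive physical pages cycle through the colors.

```python
def num_colors(cache_bytes: int, ways: int, page_bytes: int) -> int:
    """Number of page colors: how many pages fit in one cache way."""
    way_bytes = cache_bytes // ways
    return way_bytes // page_bytes


def page_color(phys_addr: int, page_bytes: int, colors: int) -> int:
    """Color of the physical page that contains phys_addr."""
    return (phys_addr // page_bytes) % colors


def pick_page(free_pages, allowed_colors, page_bytes, colors):
    """Return the first free physical page whose color the application may use."""
    for addr in free_pages:
        if page_color(addr, page_bytes, colors) in allowed_colors:
            return addr
    return None  # no free page of an allowed color


# Hypothetical geometry (not the POWER5 from the talk):
# 2 MiB, 8-way set-associative cache with 4 KiB pages.
CACHE, WAYS, PAGE = 2 * 1024 * 1024, 8, 4096
COLORS = num_colors(CACHE, WAYS, PAGE)  # 64 colors for this geometry
```

An OS allocator built this way would give each application a disjoint set of colors, so their pages can never land in the same cache sets.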

Page 6: RapidMRC

Determining L2 Size

● Use Miss Rate Curve (MRC)
    ● Shows trade-off spectrum
    ● Impact of allocating more/less cache space
● Also used for online sizing of main memory
    ● e.g. [Patterson95], [Zhou04], [Soundararajan08], etc.

[Figure: miss rate curve for Application X; x-axis: Allocated Cache Size (%), 0 to 100; y-axis: Miss Rate (%), 0 to 100]

Page 7: RapidMRC

Our Approach: RapidMRC

● Online approximation of L2 MRC
● Software-based + hardware performance counters
    ● Requires no change to application binary or source
    ● Runs on commodity hardware
● Rapid: 230 ms latency

Page 8: RapidMRC

RapidMRC Steps

(1) Track memory accesses
(2) Feed accesses into Mattson's stack algorithm (1970)
    ● Emulates LRU aging of cache lines
    ● Maintains histogram of stack distances
    ● Generates MRC using histogram

Page 9: RapidMRC

Tracking Accesses Online

● Hardware technique
    ● e.g. [Berg04], [Suh04], [Qureshi06], etc.
    ● Targets future processors
● Software technique
    ● Track accesses by instrumenting program code
    ● High tracking cost: large volume of data
● Hybrid technique
    ● Track accesses with hardware performance counters
    ● Lower tracking cost: smaller volume of data

Page 10: RapidMRC

Tracking Accesses in IBM POWER5

Hardware Performance Counter Configuration
● Upon every L2 access:
    (1) Update sampling register with data address
    (2) Trigger interrupt to copy register to trace log in main memory
● May miss some L2 cache accesses
    ● Caused by multiple in-flight L2 accesses
    ● Results show negligible impact

[Figure: memory hierarchy with labels Register File, L1 Cache, L2 Cache, Main Memory, and L2 Accesses]

Page 11: RapidMRC

Mattson's Stack Algorithm

● For each L2 access in trace log:
    ● Find element, record stack distance, move element to top
    ● Update histogram with stack distance
● Stack size
    ● One element per L2 cache line
● Optimizations
    ● Hashing: eliminates linear traversal
    ● Coarse-grained stack distance: reduces update operations

[Figure: LRU stack from stack top to stack bottom; the stack distance is the depth at which the accessed element is found]
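The per-access steps above can be sketched as follows. This is a simplified illustration rather than the RapidMRC implementation: it models a fully associative LRU stack, uses a plain linear search (so it omits the hashing and coarse-grained distance optimizations), and reports a miss ratio per access instead of MPKI. The function names and the toy trace are assumptions.

```python
from collections import Counter


def stack_distances(trace, stack_size):
    """Run Mattson's stack algorithm over a trace of cache-line addresses.

    Returns (hist, cold): hist maps stack distance (1 = top of stack) to a
    hit count; cold counts accesses that were not found in the stack.
    """
    stack = []            # index 0 is the most recently used line
    hist = Counter()
    cold = 0
    for line in trace:
        if line in stack:
            depth = stack.index(line)   # linear search, no hashing here
            hist[depth + 1] += 1        # record the stack distance
            stack.pop(depth)
        else:
            cold += 1
            if len(stack) == stack_size:
                stack.pop()             # drop the least recently used line
        stack.insert(0, line)           # move (or add) the line to the top
    return hist, cold


def miss_rate_curve(hist, cold, max_lines):
    """Miss ratio for each LRU cache size from 1 to max_lines lines."""
    total = sum(hist.values()) + cold
    if total == 0:
        return [0.0] * max_lines
    curve, hits = [], 0
    for size in range(1, max_lines + 1):
        hits += hist.get(size, 0)       # accesses that hit once size >= distance
        curve.append((total - hits) / total)
    return curve


# Toy trace of line addresses (made up for illustration).
hist, cold = stack_distances(["a", "b", "c", "a", "b", "d", "a"], stack_size=4)
curve = miss_rate_curve(hist, cold, max_lines=4)
```

In the toy trace every reuse happens at stack distance 3, so the curve stays at 1.0 for caches of one or two lines and drops to 4/7 from three lines onward.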

Page 12: RapidMRC

Experimental Setup

● 1.5 GHz dual-core POWER5
    ● 1.875 MB shared L2 cache
    ● 128-byte line size
    ● 10-way set-associative
    ● 16 partitions (colors) possible
● Linux 2.6.x
    ● Added RapidMRC mechanism
    ● Added cache partitioning mechanism
● Trace log length
    ● 10 x number of cache lines
● 30 applications
    ● SPECjbb2000, SPECcpu2000, SPECcpu2006
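The sizes implied by this setup can be worked out directly. The short calculation below is only arithmetic on the slide's numbers; it assumes that "MB" means 2**20 bytes, and the variable names are illustrative.

```python
MIB = 2 ** 20                    # assumption: "MB" on the slide means 2**20 bytes

cache_bytes = 15 * MIB // 8      # 1.875 MB shared L2
line_bytes = 128
ways = 10

cache_lines = cache_bytes // line_bytes   # also the stack size: one element per line
sets = cache_lines // ways
trace_entries = 10 * cache_lines          # trace log length: 10 x number of cache lines

print(cache_lines, sets, trace_entries)   # 15360 1536 153600
```

Under that assumption the LRU stack holds 15,360 elements and one trace log holds 153,600 L2 accesses.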

Page 13: RapidMRC

Obtaining Real L2 MRCs

● Offline method: run application 16 times
    ● Once for each cache partition size
    ● Measure L2 cache miss rate

[Figure: plot for mcf]

Page 14: RapidMRC

Obtaining Real L2 MRCs

[Figure: mcf; panels labeled Slice and Real L2 MRC; x-axis: L2 Cache Size; y-axis: L2 Miss Rate (MPKI)]

Page 30: RapidMRC

RapidMRC vs Real MRC

● Accuracy: execution slice at 10 billion instructions

[Figure: six panels (jbb, mcf 2k, xalancbmk, gzip, mgrid, ammp); x-axis: Cache Size (# colors); y-axis: Miss Rate (MPKI)]

Page 31: RapidMRC

Latency of RapidMRC

[Figure: timeline. Invoke RapidMRC, then Trace (147 ms, 4X app slowdown), then Mattson's Alg (83 ms, app is paused), then RapidMRC obtained; 230 ms in total]

Page 34: RapidMRC

Latency of RapidMRC

[Figure: the same timeline (147 ms Trace at 4X app slowdown, 83 ms Mattson's Alg with the app paused, 230 ms in total), extended to the next Phase Change, where RapidMRC is invoked again]

● Amortize costs
    ● Phase length: 5 mins median
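To give a feel for what amortizing the cost over a phase means, the sketch below estimates the share of one phase lost to a single RapidMRC invocation. It is a back-of-the-envelope calculation, not a result from the talk: it assumes one invocation per phase, that the 4X slowdown applies to the 147 ms trace stage, and that the application is paused for the 83 ms stack-algorithm stage. The function name is illustrative.

```python
def overhead_fraction(trace_ms, slowdown, paused_ms, phase_ms):
    """Fraction of a phase lost to one RapidMRC invocation.

    While tracing, the application runs `slowdown` times slower, so it loses
    trace_ms * (1 - 1/slowdown) of useful time. While the stack algorithm
    runs, the application is paused and loses all of paused_ms.
    """
    lost_ms = trace_ms * (1 - 1 / slowdown) + paused_ms
    return lost_ms / phase_ms


# Slide figures: 147 ms trace at 4X slowdown, 83 ms paused, 5 min median phase.
frac = overhead_fraction(trace_ms=147, slowdown=4, paused_ms=83,
                         phase_ms=5 * 60 * 1000)
print(f"{frac:.4%}")   # 0.0644%
```

Under these assumptions one invocation costs about 193 ms of useful time, well under 0.1% of a median-length phase.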

Page 35: RapidMRC

Latency of RapidMRC

Phase change detection
● Abrupt change in IPC, miss rate
● Detectable online with low cost using hardware performance counters

[Figure: astar; x-axis: Instructions Completed (Billions), 0 to 1400; y-axis: L2 Miss Rate (MPKI), 0 to 30; one curve per cache size (size = 2, 4, 6, 8, 10, 12, 14, 16)]
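One simple way to flag an abrupt change in a periodically sampled counter (IPC or miss rate) is a threshold on the deviation from a recent average. The sketch below is a hypothetical detector, not the one used in the talk; the window length, the threshold, and the sample values are all made up for illustration.

```python
def detect_phase_changes(samples, window=4, threshold=0.5):
    """Return indices where a sample deviates abruptly from the recent mean.

    samples: per-interval readings (for example MPKI or IPC).
    A change is flagged when the relative deviation from the mean of up to
    `window` preceding samples of the current phase exceeds `threshold`.
    """
    changes = []
    start = 0                                # first sample of the current phase
    for i, value in enumerate(samples):
        recent = samples[max(start, i - window):i]
        if not recent:
            continue                         # nothing to compare against yet
        mean = sum(recent) / len(recent)
        if mean > 0 and abs(value - mean) / mean > threshold:
            changes.append(i)
            start = i                        # restart the baseline in the new phase
    return changes


# Made-up miss-rate samples with two abrupt shifts.
mpki = [10, 10, 11, 10, 25, 26, 25, 24, 10, 10]
print(detect_phase_changes(mpki))            # [4, 8]
```

Each flagged index would be a point at which RapidMRC could be invoked again to refresh the curve.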

Page 37: RapidMRC

RapidMRC for Sizing Partitions

● For equake + twolf
    ● Has one long stable phase
● Feed MRCs into utility function
    ● e.g. Minimize total miss rate

[Figure: two panels, equake and twolf; x-axis: Cache Size (# colors); y-axis: Miss Rate (MPKI)]

Page 40: RapidMRC

RapidMRC for Sizing Partitions

[Figure: four panels for equake and twolf; x-axis: Cache Size (# colors); y-axis: Miss Rate (MPKI); legend: RapidMRC, Real MRC]
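The "minimize total miss rate" utility function can be sketched as an exhaustive search over the possible splits. This is one plausible reading of the slide, not the talk's code: the function name is illustrative, and the two short curves are invented values used only to exercise it.

```python
def best_split(mrc_a, mrc_b, total_colors):
    """Choose the split of colors that minimizes the summed miss rate.

    mrc_x[k] is the miss rate of application X when given k colors.
    Each application gets at least one color. Returns (colors_a, colors_b).
    """
    best = None
    for a in range(1, total_colors):
        b = total_colors - a
        total = mrc_a[a] + mrc_b[b]
        if best is None or total < best[0]:
            best = (total, a, b)
    return best[1], best[2]


# Invented 4-color curves (index 0 unused): A is flat, B benefits from space.
mrc_a = [None, 9.0, 8.5, 8.4, 8.4]
mrc_b = [None, 20.0, 12.0, 6.0, 5.0]
print(best_split(mrc_a, mrc_b, total_colors=4))   # (1, 3)
```

With 16 colors the search covers only 15 candidate splits, so an exhaustive scan is cheap for two applications.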

Page 41: RapidMRC

Performance Impact of Sizes

● twolf: 27% IPC improvement
● equake: unaffected

[Figure: performance of twolf and equake across L2 cache size splits; x-axes: L2 Cache Sizes (# of colors), one running from 16 down to 0 and the other from 0 up to 16; marked points: performance of uncontrolled sharing, RapidMRC, Real MRC]

Page 42: RapidMRC

Conclusion

● RapidMRC
    ● A tracing mechanism can be built with hardware performance counters
    ● Accurately approximates L2 MRCs online in software
    ● 230 ms latency, invocable upon phase change
● Application of RapidMRC
    ● Enables online sizing of L2 cache partitions
        ● Up to 27% performance improvement

Page 43: RapidMRC

Future Work

● Explore online optimizations made possible by:
    ● RapidMRC
        ● Reducing energy
        ● Guiding co-scheduling
    ● Tracing mechanism
● Extend model
    ● Account for non-uniform miss penalties