1
Ubiquitous Memory Introspection (UMI)
Qin Zhao, NUS
Rodric Rabbah, IBM
Saman Amarasinghe, MIT
Larry Rudolph, MIT
Weng-Fai Wong, NUS
CGO 2007, March 14, 2007, San Jose, CA
2
The complexity crunch
hardware complexity ↑ + software complexity ↑
⇓ too many things happening concurrently, complex interactions and side-effects
⇓ less understanding of program execution behavior
3
The importance of program behavior characterization

• S/W developer
  – Computation vs. communication ratio
  – Function/path frequency
  – Test coverage
• H/W designer
  – Cache behavior
  – Branch prediction
• System administrator
  – Interaction between processes
Better understanding of program characteristics can lead to more robust software, and more efficient hardware and systems
4
Common approaches to program understanding

• Overhead – space and time overhead
• Level of detail – coarse-grained summaries vs. instruction-level and contextual information
• Versatility – portability and ability to adapt to different uses
5
Common approaches to program understanding

                  Profiling    Simulation
Overhead          high         very high
Level of detail   very high    very high
Versatility       high         very high
6
Common approaches to program understanding

                  Profiling    Simulation   HW Counters
Overhead          high         very high    very low
Level of detail   very high    very high    very low
Versatility       high         very high    very low
7
Slowdown due to HW counters (counting L1 misses for 181.mcf on P4)

[Chart: time in seconds vs. HW counter sample size (native, 10, 100, 1K, 10K, 100K, 1M). Slowdown relative to native ranges from more than 2000% at the smallest sample sizes, through 325% and 35%, down to roughly 10% and 1% at the largest.]
8
Common approaches to program understanding

                  Profiling    Simulation   HW Counters   Desirable Approach
Overhead          high         very high    very low      low
Level of detail   very high    very high    very low      high
Versatility       high         very high    very low      high
9
Common approaches to program understanding

The desirable approach is UMI: low overhead, high level of detail, high versatility.
10
Key components

• Dynamic binary instrumentation
  – Complete coverage, transparent, language independent, versatile, …
• Bursty simulation
  – Sampling and fast-forwarding techniques
  – Detailed context information
  – Reasonable extrapolation and prediction
11
Ubiquitous Memory Introspection

Online mini-simulations analyze short memory access profiles recorded from frequently executed code regions.

• Key concepts
  – Focus on hot code regions
  – Selectively instrument instructions
  – Fast online mini-simulations
  – Actionable profiling results for online memory optimizations
12
Working prototype

• Implemented in DynamoRIO
  – Runs on Linux
  – Used on Intel P4 and AMD K7
• Benchmarks
  – SPEC 2000, SPEC 2006, Olden
  – Server apps: MySQL, Apache
  – Desktop apps: Acrobat Reader, MEncoder
13
UMI is cheap and non-intrusive (SPEC2K reference workloads on P4)

[Chart: % slowdown compared to native execution, for DynamoRIO alone and for UMI, across SPECfp (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi), SPECint (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf), and Olden (em3d, health, mst, treeadd, tsp, ft), plus the average.]

• Average overhead is 14%
• 1% more than DynamoRIO
14
What can UMI do for you?

• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetching, learning, adaptation
15
Coarse grained memory analysis

• Experiment: measure cache misses in three ways
  – HW counters
  – Full cache simulator (Cachegrind)
  – UMI
• Report correlation between measurements
  – Linear relationship between two sets of data
  – 1 = strong positive correlation, 0 = no correlation, -1 = strong negative correlation
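The correlation reported on this slide is the standard Pearson coefficient. A minimal sketch (the function name and sample data are illustrative, not taken from the UMI implementation):

```python
# Sketch: Pearson correlation for comparing two sets of cache-miss
# measurements (e.g., HW counters vs. a simulator).
from math import sqrt

def pearson(xs, ys):
    """Linear correlation of two equal-length samples:
    1 = strong positive, 0 = none, -1 = strong negative."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly proportional measurements correlate at 1.
print(round(pearson([1, 2, 3, 4], [10, 20, 30, 40]), 6))  # → 1.0
```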
16
Cache miss correlation results

• HW counter vs. Cachegrind
  – Correlation 0.99
  – 20x to 100x slowdown
• HW counter vs. UMI
  – Correlation 0.88
  – Less than 2x slowdown in worst case
17
What can UMI do for you?

• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetching, learning, adaptation
18
Fine grained memory analysis

• Experiment: predict delinquent loads using UMI
  – Individual loads with cache miss rate greater than threshold
• Delinquent load set determined according to full cache simulator (Cachegrind)
  – Loads that contribute 90% of total cache misses
• Measure and report two metrics
  – Recall: fraction of the delinquent loads identified by Cachegrind that UMI also predicts
  – False positive: fraction of the loads UMI predicts that Cachegrind does not identify as delinquent
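Both metrics reduce to set operations over the two sets of loads. A hedged sketch, with made-up load identifiers:

```python
# Sketch of the two accuracy metrics, using Python sets.
# The load identifiers below are invented for illustration.
def recall(predicted, actual):
    """Fraction of the true delinquent loads that were also predicted."""
    return len(predicted & actual) / len(actual)

def false_positive_rate(predicted, actual):
    """Fraction of the predictions that are not truly delinquent."""
    return len(predicted - actual) / len(predicted)

cachegrind = {"ld_a", "ld_b", "ld_c", "ld_d"}   # ground truth
umi        = {"ld_a", "ld_b", "ld_x"}           # prediction

print(recall(umi, cachegrind))               # 2 of 4 → 0.5
```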
19
UMI delinquent load prediction accuracy

                                  Recall (higher is better)   False positive (lower is better)
Benchmarks with ≥ 1% miss rate    88%                         55%
Benchmarks with < 1% miss rate    26%                         59%
20
What can UMI do for you?

• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetcher, learning, adaptation
21
Experiment: online stride prefetching
• Use results of delinquent load prediction
• Discover stride patterns for delinquent loads
• Insert instructions to prefetch data to L2
• Compare runtime for UMI and P4 with HW stride prefetcher
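Stride discovery amounts to finding a dominant delta between consecutive addresses in a delinquent load's profile. A hypothetical stand-in for UMI's analysis (the threshold and names are assumptions, not its actual code):

```python
# Sketch: detect a dominant stride in a load's address profile so a
# prefetch of addr + k*stride can be inserted ahead of the load.
from collections import Counter

def dominant_stride(addrs, min_ratio=0.6):
    """Return the most common delta between consecutive addresses,
    or None if no single stride covers min_ratio of the deltas."""
    deltas = [b - a for a, b in zip(addrs, addrs[1:])]
    if not deltas:
        return None
    stride, count = Counter(deltas).most_common(1)[0]
    return stride if count / len(deltas) >= min_ratio else None

# A load walking an array of 8-byte elements shows a stride of 8.
trace = [0x1000, 0x1008, 0x1010, 0x1018, 0x1020]
print(dominant_stride(trace))  # → 8
```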
22
Data prefetching results summary

[Chart: running time normalized to native execution (lower is better) for 181.mcf, 171.swim, 172.mgrid, 179.art, 183.equake, 188.ammp, 191.fma3d, 301.apsi, em3d, mst, ft, and the average, comparing three configurations: SW prefetching + DynamoRIO + UMI, HW prefetching + native binary, and combined prefetching + DynamoRIO + UMI.]

More than offsets slowdowns from binary instrumentation!
23
The Gory Details
24
UMI components
• Region selector
• Instrumentor
• Profile analyzer
25
Region selector

• Identify representative code regions
  – Focus on traces, loops
  – Frequently executed code
  – Piggyback on binary instrumentor tricks
• Reinforce with sampling
  – Time based, or leverage HW counters
  – Naturally adapt to program phases
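Hot-region selection piggybacks on the entry counters a trace-building instrumentor already maintains. A sketch under assumed names and an assumed threshold (neither is from the UMI implementation):

```python
# Sketch: threshold-based hot-region selection, in the spirit of the
# trace-building counters a binary instrumentor keeps per region.
class RegionSelector:
    def __init__(self, threshold=50):  # threshold is an assumption
        self.threshold = threshold
        self.counts = {}
        self.hot = set()

    def on_entry(self, region_pc):
        """Called on each region entry; marks the region hot once its
        execution count crosses the threshold."""
        c = self.counts.get(region_pc, 0) + 1
        self.counts[region_pc] = c
        if c >= self.threshold:
            self.hot.add(region_pc)

sel = RegionSelector(threshold=3)
for _ in range(3):
    sel.on_entry(0x400100)   # executed often → hot
sel.on_entry(0x400200)       # executed once → cold
print(0x400100 in sel.hot, 0x400200 in sel.hot)  # → True False
```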
26
Instrumentor

• Record address references
  – Insert instructions to record the address referenced by each memory operation
• Manage profiling overhead
  – Clone code trace (akin to the Arnold-Ryder scheme)
  – Selective instrumentation of memory operations
    • E.g., ignore stack and static data
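The selectivity can be pictured as a predicate over references: only memory operations likely to touch the heap pay the recording cost. In practice this would be a static check on the operand (e.g., stack-pointer-relative addressing); the runtime address ranges below are purely illustrative:

```python
# Sketch: skip references into the stack or static data, so only
# heap-like references are recorded. Ranges are hypothetical.
STATIC_LO, STATIC_HI = 0x0040_0000, 0x0060_0000  # assumed .data/.bss
STACK_LO,  STACK_HI  = 0x7ff0_0000, 0x8000_0000  # assumed stack region

def should_instrument(addr):
    """Record only references outside the stack and static segments."""
    if STATIC_LO <= addr < STATIC_HI:
        return False
    if STACK_LO <= addr < STACK_HI:
        return False
    return True

print(should_instrument(0x0804_9f00))  # heap-like → True
print(should_instrument(0x7ff0_1234))  # stack → False
```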
27
Recording profiles

[Diagram: each code trace (T1, T2) has an address profile with a counter and one slot per memory operation (op1, op2, …); each execution of a trace appends the addresses it references (e.g., 0x011, 0x024, 0x100, …) to its profile. Early trace exits are recorded, and page protection is used to detect profile overflow.]
28
Mini-simulator

• Triggered when code or address profile is full
• Simple cache simulator
  – Currently simulates the L2 cache of the host
  – LRU replacement
  – Improve approximations with techniques similar to offline fast-forwarding simulators
    • Warm-up and periodic flushing
• Other possible analyzers
  – Reference affinity model
  – Data reuse and locality analysis
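The core of such a mini-simulator fits in a few dozen lines. A minimal sketch of a set-associative LRU cache driven by a recorded address profile; the geometry (256 KiB, 8-way, 64 B lines) is an assumption, not the host L2 UMI models:

```python
# Sketch: tiny set-associative cache with LRU replacement, of the kind
# UMI runs over short address profiles.
from collections import OrderedDict

class MiniCache:
    def __init__(self, size=256 * 1024, assoc=8, line=64):
        self.line = line
        self.assoc = assoc
        self.nsets = size // (line * assoc)
        self.sets = [OrderedDict() for _ in range(self.nsets)]
        self.misses = 0

    def access(self, addr):
        """Simulate one reference; returns True on a hit."""
        block = addr // self.line
        s = self.sets[block % self.nsets]
        if block in s:
            s.move_to_end(block)      # refresh LRU position
            return True
        self.misses += 1
        if len(s) >= self.assoc:
            s.popitem(last=False)     # evict least-recently-used block
        s[block] = True
        return False

c = MiniCache()
for a in [0x1000, 0x1040, 0x1000, 0x2000]:
    c.access(a)
print(c.misses)  # → 3 (0x1000 hits the second time)
```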
29
Mini-simulations and parameter sensitivity

[Chart for 181.mcf: normalized performance, recall ratio, and false positive ratio vs. sampling frequency threshold (1 to 1024).]

• Regular data structures
• If the sampling threshold is too high, it starts to exceed loop bounds: miss out on profiling important loops
• Adaptive threshold is best
30
Mini-simulations and parameter sensitivity

[Chart for 197.parser: normalized performance, recall ratio, and false positive ratio vs. address profile length (64 to 32K).]

• Irregular data structures
• Need longer profiles to reach useful conclusions
31
Summary

• UMI is lightweight and has a low overhead
  – 1% more than DynamoRIO
  – Can be done with Pin, Valgrind, etc.
  – No added hardware necessary
  – No synchronization or syscall headaches
  – Other cores can do real work!
• Practical for extracting detailed information
  – Online and workload specific
  – Instruction-level memory reference profiles
  – Versatile and user-programmable analysis
• Facilitates migration of offline memory optimizations to an online setting
32
Future work

• More types of online analysis
  – Include global information
  – Incremental (leverage previous execution info)
  – Combine analysis across multiple threads
• More runtime optimizations
  – E.g., advanced prefetch optimization
    • Hot data stream prefetching
    • Markov prefetcher
  – Locality-enhancing data reorganization
    • Pool allocation
    • Cooperative allocation between different threads
• Your ideas here…