1
Ubiquitous Memory Introspection (UMI)
Qin Zhao, NUS
Rodric Rabbah, IBM
Saman Amarasinghe, MIT
Larry Rudolph, MIT
Weng-Fai Wong, NUS
CGO 2007, March 14, 2007, San Jose, CA
2
The complexity crunch
hardware complexity ↑ + software complexity ↑
⇓ too many things happening concurrently, complex interactions and side-effects
⇓ less understanding of program execution behavior
3
The importance of program behavior characterization

• S/W developer
  – Computation vs. communication ratio
  – Function/path frequency
  – Test coverage
• H/W designer
  – Cache behavior
  – Branch prediction
• System administrator
  – Interaction between processes
Better understanding of program characteristics can lead to more robust software, and more efficient hardware and systems
4
Common approaches to program understanding

• Overhead – space and time overhead
• Level of detail – coarse-grained summaries vs. instruction-level and contextual information
• Versatility – portability and ability to adapt to different uses
5
Common approaches to program understanding

                  Profiling    Simulation
Overhead          high         very high
Level of detail   very high    very high
Versatility       high         very high
6
Common approaches to program understanding

                  Profiling    Simulation   HW Counters
Overhead          high         very high    very low
Level of detail   very high    very high    very low
Versatility       high         very high    very low
7
Slowdown due to HW counters (counting L1 misses for 181.mcf on P4)

[Chart: time in seconds vs. HW counter sample size (native, 10, 100, 1K, 10K, 100K, 1M). Slowdown relative to native ranges from more than 2000% at the smallest sample sizes, through 325% and 35%, down to roughly 10% and 1% at the largest.]
8
Common approaches to program understanding

                  Profiling    Simulation   HW Counters   Desirable Approach
Overhead          high         very high    very low      low
Level of detail   very high    very high    very low      high
Versatility       high         very high    very low      high
9
Common approaches to program understanding

The desirable approach is UMI: low overhead, high level of detail, high versatility.
10
Key components

• Dynamic binary instrumentation
  – Complete coverage, transparent, language independent, versatile, …
• Bursty simulation
  – Sampling and fast-forwarding techniques
  – Detailed context information
  – Reasonable extrapolation and prediction
11
Ubiquitous Memory Introspection

Online mini-simulations analyze short memory access profiles recorded from frequently executed code regions.

• Key concepts
  – Focus on hot code regions
  – Selectively instrument instructions
  – Fast online mini-simulations
  – Actionable profiling results for online memory optimizations
12
Working prototype

• Implemented in DynamoRIO
  – Runs on Linux
  – Used on Intel P4 and AMD K7
• Benchmarks
  – SPEC 2000, SPEC 2006, Olden
  – Server apps: MySQL, Apache
  – Desktop apps: Acrobat Reader, MEncoder
13
UMI is cheap and non-intrusive (SPEC2K reference workloads on P4)

[Chart: % slowdown compared to native execution, for DynamoRIO alone and for UMI, across SPECfp (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi), SPECint (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf), and Olden (em3d, health, mst, treeadd, tsp, ft), plus the average.]

• Average overhead is 14%
• 1% more than DynamoRIO
14
What can UMI do for you?

• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetching, learning, adaptation
15
Coarse grained memory analysis

• Experiment: measure cache misses in three ways
  – HW counters
  – Full cache simulator (Cachegrind)
  – UMI
• Report correlation between measurements
  – Linear relationship between two sets of data
  – 1 = strong positive correlation, 0 = no correlation, -1 = strong negative correlation
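The correlation reported on this slide is the standard Pearson coefficient. A minimal sketch (the function name and sample data are illustrative, not taken from the UMI implementation):

```python
# Sketch: Pearson correlation for comparing two sets of cache-miss
# measurements (e.g., HW counters vs. a simulator).
from math import sqrt

def pearson(xs, ys):
    """Linear correlation of two equal-length samples:
    1 = strong positive, 0 = none, -1 = strong negative."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly proportional measurements correlate at 1.
print(round(pearson([1, 2, 3, 4], [10, 20, 30, 40]), 6))  # → 1.0
```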
16
Cache miss correlation results

• HW counter vs. Cachegrind
  – Correlation 0.99
  – 20x to 100x slowdown
• HW counter vs. UMI
  – Correlation 0.88
  – Less than 2x slowdown in worst case
17
What can UMI do for you?

• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetching, learning, adaptation
18
Fine grained memory analysis

• Experiment: predict delinquent loads using UMI
  – Individual loads with cache miss rate greater than threshold
• Delinquent load set determined according to full cache simulator (Cachegrind)
  – Loads that contribute 90% of total cache misses
• Measure and report two metrics
  – Recall: fraction of the delinquent loads identified by Cachegrind that UMI also predicts
  – False positive: fraction of the loads UMI predicts that Cachegrind does not identify as delinquent
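Both metrics reduce to set operations over the two sets of loads. A hedged sketch, with made-up load identifiers:

```python
# Sketch of the two accuracy metrics, using Python sets.
# The load identifiers below are invented for illustration.
def recall(predicted, actual):
    """Fraction of the true delinquent loads that were also predicted."""
    return len(predicted & actual) / len(actual)

def false_positive_rate(predicted, actual):
    """Fraction of the predictions that are not truly delinquent."""
    return len(predicted - actual) / len(predicted)

cachegrind = {"ld_a", "ld_b", "ld_c", "ld_d"}   # ground truth
umi        = {"ld_a", "ld_b", "ld_x"}           # prediction

print(recall(umi, cachegrind))               # 2 of 4 → 0.5
```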
19
UMI delinquent load prediction accuracy

                                  Recall (higher is better)   False positive (lower is better)
Benchmarks with ≥ 1% miss rate    88%                         55%
Benchmarks with < 1% miss rate    26%                         59%
20
What can UMI do for you?

• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetcher, learning, adaptation
21
Experiment: online stride prefetching
• Use results of delinquent load prediction
• Discover stride patterns for delinquent loads
• Insert instructions to prefetch data to L2
• Compare runtime for UMI and P4 with HW stride prefetcher
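Stride discovery amounts to finding a dominant delta between consecutive addresses in a delinquent load's profile. A hypothetical stand-in for UMI's analysis (the threshold and names are assumptions, not its actual code):

```python
# Sketch: detect a dominant stride in a load's address profile so a
# prefetch of addr + k*stride can be inserted ahead of the load.
from collections import Counter

def dominant_stride(addrs, min_ratio=0.6):
    """Return the most common delta between consecutive addresses,
    or None if no single stride covers min_ratio of the deltas."""
    deltas = [b - a for a, b in zip(addrs, addrs[1:])]
    if not deltas:
        return None
    stride, count = Counter(deltas).most_common(1)[0]
    return stride if count / len(deltas) >= min_ratio else None

# A load walking an array of 8-byte elements shows a stride of 8.
trace = [0x1000, 0x1008, 0x1010, 0x1018, 0x1020]
print(dominant_stride(trace))  # → 8
```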
22
Data prefetching results summary

[Chart: running time normalized to native execution (lower is better) for 181.mcf, 171.swim, 172.mgrid, 179.art, 183.equake, 188.ammp, 191.fma3d, 301.apsi, em3d, mst, ft, and the average, comparing three configurations: SW prefetching + DynamoRIO + UMI, HW prefetching + native binary, and combined prefetching + DynamoRIO + UMI.]

More than offsets slowdowns from binary instrumentation!
23
The Gory Details
24
UMI components
• Region selector
• Instrumentor
• Profile analyzer
25
Region selector

• Identify representative code regions
  – Focus on traces, loops
  – Frequently executed code
  – Piggyback on binary instrumentor tricks
• Reinforce with sampling
  – Time based, or leverage HW counters
  – Naturally adapt to program phases
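Hot-region selection piggybacks on the entry counters a trace-building instrumentor already maintains. A sketch under assumed names and an assumed threshold (neither is from the UMI implementation):

```python
# Sketch: threshold-based hot-region selection, in the spirit of the
# trace-building counters a binary instrumentor keeps per region.
class RegionSelector:
    def __init__(self, threshold=50):  # threshold is an assumption
        self.threshold = threshold
        self.counts = {}
        self.hot = set()

    def on_entry(self, region_pc):
        """Called on each region entry; marks the region hot once its
        execution count crosses the threshold."""
        c = self.counts.get(region_pc, 0) + 1
        self.counts[region_pc] = c
        if c >= self.threshold:
            self.hot.add(region_pc)

sel = RegionSelector(threshold=3)
for _ in range(3):
    sel.on_entry(0x400100)   # executed often → hot
sel.on_entry(0x400200)       # executed once → cold
print(0x400100 in sel.hot, 0x400200 in sel.hot)  # → True False
```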
26
Instrumentor

• Record address references
  – Insert instructions to record the address referenced by each memory operation
• Manage profiling overhead
  – Clone code trace (akin to the Arnold-Ryder scheme)
  – Selective instrumentation of memory operations
    • E.g., ignore stack and static data
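The selectivity can be pictured as a predicate over references: only memory operations likely to touch the heap pay the recording cost. In practice this would be a static check on the operand (e.g., stack-pointer-relative addressing); the runtime address ranges below are purely illustrative:

```python
# Sketch: skip references into the stack or static data, so only
# heap-like references are recorded. Ranges are hypothetical.
STATIC_LO, STATIC_HI = 0x0040_0000, 0x0060_0000  # assumed .data/.bss
STACK_LO,  STACK_HI  = 0x7ff0_0000, 0x8000_0000  # assumed stack region

def should_instrument(addr):
    """Record only references outside the stack and static segments."""
    if STATIC_LO <= addr < STATIC_HI:
        return False
    if STACK_LO <= addr < STACK_HI:
        return False
    return True

print(should_instrument(0x0804_9f00))  # heap-like → True
print(should_instrument(0x7ff0_1234))  # stack → False
```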
27
Recording profiles

[Diagram: each code trace (T1, T2) has an address profile with a counter and one slot per memory operation (op1, op2, …); each execution of a trace appends the addresses it references (e.g., 0x011, 0x024, 0x100, …) to its profile. Early trace exits are recorded, and page protection is used to detect profile overflow.]
28
Mini-simulator

• Triggered when code or address profile is full
• Simple cache simulator
  – Currently simulates the L2 cache of the host
  – LRU replacement
  – Improve approximations with techniques similar to offline fast-forwarding simulators
    • Warm-up and periodic flushing
• Other possible analyzers
  – Reference affinity model
  – Data reuse and locality analysis
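The core of such a mini-simulator fits in a few dozen lines. A minimal sketch of a set-associative LRU cache driven by a recorded address profile; the geometry (256 KiB, 8-way, 64 B lines) is an assumption, not the host L2 UMI models:

```python
# Sketch: tiny set-associative cache with LRU replacement, of the kind
# UMI runs over short address profiles.
from collections import OrderedDict

class MiniCache:
    def __init__(self, size=256 * 1024, assoc=8, line=64):
        self.line = line
        self.assoc = assoc
        self.nsets = size // (line * assoc)
        self.sets = [OrderedDict() for _ in range(self.nsets)]
        self.misses = 0

    def access(self, addr):
        """Simulate one reference; returns True on a hit."""
        block = addr // self.line
        s = self.sets[block % self.nsets]
        if block in s:
            s.move_to_end(block)      # refresh LRU position
            return True
        self.misses += 1
        if len(s) >= self.assoc:
            s.popitem(last=False)     # evict least-recently-used block
        s[block] = True
        return False

c = MiniCache()
for a in [0x1000, 0x1040, 0x1000, 0x2000]:
    c.access(a)
print(c.misses)  # → 3 (0x1000 hits the second time)
```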
29
Mini-simulations and parameter sensitivity

[Chart for 181.mcf: normalized performance, recall ratio, and false positive ratio vs. sampling frequency threshold (1 to 1024).]

• Regular data structures
• If the sampling threshold is too high, it starts to exceed loop bounds: miss out on profiling important loops
• Adaptive threshold is best
30
Mini-simulations and parameter sensitivity

[Chart for 197.parser: normalized performance, recall ratio, and false positive ratio vs. address profile length (64 to 32K).]

• Irregular data structures
• Need longer profiles to reach useful conclusions
31
Summary

• UMI is lightweight and has a low overhead
  – 1% more than DynamoRIO
  – Can be done with Pin, Valgrind, etc.
  – No added hardware necessary
  – No synchronization or syscall headaches
  – Other cores can do real work!
• Practical for extracting detailed information
  – Online and workload specific
  – Instruction-level memory reference profiles
  – Versatile and user-programmable analysis
• Facilitates migration of offline memory optimizations to an online setting
32
Future work

• More types of online analysis
  – Include global information
  – Incremental (leverage previous execution info)
  – Combine analysis across multiple threads
• More runtime optimizations
  – E.g., advanced prefetch optimization
    • Hot data stream prefetching
    • Markov prefetcher
  – Locality-enhancing data reorganization
    • Pool allocation
    • Cooperative allocation between different threads
• Your ideas here…