Page 1
© 2000 Universität Karlsruhe, System Architecture
Group
Efficient and Flexible Value Sampling
Michael Burrows (Compaq SRC)Ulfar Erlingson (deCODE Genetics)
Shun-Tak Leung (Compaq SRC)Mark Vandevoorde (AltaVista)
Carl Waldspurger (VMWare)Kip Walker (CMU)
Bill Weihl (Akamai)
ASPLOS-IX
Ninth International Conference onArchitectural Support forProgramming Languages andOperating Systems
November 12-15 2000, Cambridge, MA
Page 2
© 2000 Universität Karlsruhe, System Architecture Group
Goal: Value profiling
Record values during program execution
Find “semi-invariant” values for prefetching, specialization, speculation
For example: A load reads from 0x3C8 95% of the
time A function is always evaluated to zero
Page 3
© 2000 Universität Karlsruhe, System Architecture Group
Possible techniques
Instrument program with binary editor Interpret, and record values generated Sample values using periodic timer
interrupts
Last approach is explored.It is potentially far less intrusive.
Page 4
© 2000 Universität Karlsruhe, System Architecture Group
DCPI overview/review
DCPI profiles all address spaces Generates periodic interrupts Interrupt routine records process ID and
PC User-space daemon maps to
offset/executable, and aggregates samples in files
Tools report data for executables, procedures, and instructions
Page 5
© 2000 Universität Karlsruhe, System Architecture Group
Value sampling with DCPI
On each interrupt, collect values Somehow associate values with
PC, PID Summarize values, and aggregate
in files Tools analyze summaries to find
optimization opportunities
Page 6
© 2000 Universität Karlsruhe, System Architecture Group
Inherited properties of DCPI
It’s a sampling technique It has modest overhead
(Low enough for production use) It’s transparent to user It can be used on the whole system
(Including operating system kernel)
Page 7
© 2000 Universität Karlsruhe, System Architecture Group
Associating values with instructions
Which instructions generated which values?
On interrupt, we don’t know: Which instruction was last executed Which register was last written
Page 8
© 2000 Universität Karlsruhe, System Architecture Group
First try — “bounce back” interrupt
Alpha 21164 can interrupt after k user-mode cycles
On a periodic interrupt:1. Record PC2. Set k to be small3. Return
On “bounce-back” interrupt, match executed instructions against register content
Page 9
© 2000 Universität Karlsruhe, System Architecture Group
Bounce-back is tricky
Possible remedies: Evict i-cache line to improve predictability Start by setting k small, and increase
Works only in user-mode on the 21164 Hard to set k because timing is
unpredictable(E.g., Chip interrupts 6 cycles after event)
Occasionally confused by tight loops
Page 10
© 2000 Universität Karlsruhe, System Architecture Group
Collecting values with an interpreter
On each interrupt, interpret a few instructions
Save values associated with each instruction
Page 11
© 2000 Universität Karlsruhe, System Architecture Group
Should interpreter have side-effects?
No side effects interpreter correctness is less critical
But we want to profile the kernel, and loads have side-effects in device drivers
So interpreter must affect process state
Fortunately, testing is merely tedious
Page 12
© 2000 Universität Karlsruhe, System Architecture Group
Interpreter side-effects
ldt ...add ...and ...eqv ...stt ...mult ...and ...ldq ...ldq ...add ...
eqv ...stt ...mult ...and ...ldq ...
Interpreted Executed
Without side-effect
ldt ...add ...and ...eqv ...stt ...mult ...and ...ldq ...ldq ...add ...
eqv ...stt ...mult ...and ...ldq ...
Interpreted Executed
With side-effects
Page 13
© 2000 Universität Karlsruhe, System Architecture Group
Should interpreter have side-effects?
No side effects interpreter correctness is less critical
But we want to profile the kernel, and loads have side-effects in device drivers
So interpreter must affect process state
Fortunately, testing is merely tedious
Page 14
© 2000 Universität Karlsruhe, System Architecture Group
Interpreter advantages
It’s easy to associate values with instructions
We can gather other values, e.g., load latency, PC of calling procedure
User can configure what data to gather
We can interpret in user-mode via an up-call
Page 15
© 2000 Universität Karlsruhe, System Architecture Group
Interpreter limitations
Can’t interpret when interrupts are disabled
Can’t interpret through an OS trap Can’t interpret for too long
Page 16
© 2000 Universität Karlsruhe, System Architecture Group
Hotlists: Gibbons & Matias’ algorithm
One algorithm instance per PC Counts each value with probability p p is decreased so counts fit in
constant space Probabilistically yields most common
values and most frequent estimates It’s a great simplification over ad hoc
schemes
Page 17
© 2000 Universität Karlsruhe, System Architecture Group
Gibbons & Matias’ algorithm
0 0 0 0
a
a
1Hotlist:
b
b
1
a
2
Produced values: c a a b d b a d d
5 3 1 3
c d
p = 1.0
Page 18
© 2000 Universität Karlsruhe, System Architecture Group
Gibbons & Matias’ algorithm
0 0 0 0
a
a
1Hotlist:
b
b
1
a
2
Produced values: c a a b d b a d d
5 3 1 3
c d
p = 1.0
e
Need to replaceone value
Page 19
© 2000 Universität Karlsruhe, System Architecture Group
Gibbons & Matias’ algorithm
0 0 0 0
a
a
1Hotlist:
b
b
1
a
2
Produced values: c a a b d b a d d
5 3 1 3
c d
p = 0.75
e
Multiply p with someamount f
(f = 0.75)
Throw biasedcoins with
probability f
4 3 0 2
Replace countsby number
of seen heads
Page 20
© 2000 Universität Karlsruhe, System Architecture Group
Gibbons & Matias’ algorithm
0 0 0 0
a
a
1Hotlist:
b
b
1
a
2
Produced values: c a a b d b a d d
5 3 1 3
e d
p = 0.75
e
4 3 1 2
Replace a value whichhas zero count
Count/p is an estimate of number oftimes a value has been seen. E.g., the
value ‘a’ has been seen 4/p = 5.33 times
Page 21
© 2000 Universität Karlsruhe, System Architecture Group
A value profile
Cycles Instruction Hotlist 39 ldq ra, -16(t12) ra:(98.94% 0xff...ff)
(0.53% ... 0 and a1, s1, v0 v0:(4.76% 0x55...00)
(3.17% ... 0 and a1, s3, a1 a1:(100.00% 0x0) 0 eqv v0, s2, v0 v0:(4.23% 0x55...00)
(2.65% ... 0 xor a1, s4, a1 a1:(100.00% 0x0)9748 bic ra, v0, v0 v0:(100.00% 0x55...1c)
Page 22
© 2000 Universität Karlsruhe, System Architecture Group
Load latencies
Measured using CPU’s cycle counter
Cycles Instruction Latencies 0.0 ldt $f17, 8(t6) (94.3% D) (3.6% M) (2.1% B) ... 0.0 ldt $f11, 0(t2) (84.9% M) (15.1% D) (0.0% B) ...102.3 mult $f11,$f17,$f17
Page 23
© 2000 Universität Karlsruhe, System Architecture Group
Are latency values meaningful?
Usually, yes We displace a few percent of d-
cache lines
Can’t get i-cache fill latencies with interpreter
Nor mispredict penalties
Page 24
© 2000 Universität Karlsruhe, System Architecture Group
21264 replay traps
Reordering can violate memory sematics(E.g., a load of L reorded before a store to L)
A replay trap replays the offending instruction
Expensive: All later instructions are replayed
Hardware counters saw where the trap occurred, but not why
Page 25
© 2000 Universität Karlsruhe, System Architecture Group
Identifying replay trap cause: vreplay
Interpret >100 instructions at a time
Interpreter compares load/store addresses
Records which instructions could conflict
Later, combine results and hardware counts
Page 26
© 2000 Universität Karlsruhe, System Architecture Group
Vreplay output
Replays Count 0 0x...2a0 stt $f8, 104(sp) 5 (100.0% 0x...4f8) 0 0x...2a4 bis a0, a0, s5 0 0 0x...2a8 bis a1, a1, s6 0... 0 0x...2c8 bis v0, v0, s2 043 0x...2cc ldq at, 0(a0) 25 (100.0% 0x...0d0) 0 0x...2d0 bsr ra, 0x20027a50 0
Page 27
© 2000 Universität Karlsruhe, System Architecture Group
OverheadCPU2000 integer benchmarks
vprof interprets 4 instructions per 124K vreplay interprets 128 instructions per 8M
DCPI optionAverage
Slowdown %no vprof 3.9vprof 10.7vreplay 5.9
Page 28
© 2000 Universität Karlsruhe, System Architecture Group
Summary
Periodic interpretation: flexible, about 10% overhead
Provides value profiles and data not provided by hardware counters
Gibbons and Matias’ algorithm is useful
Download from:http://www.tru64unix.compaq.com/dcpi/