Pipelined Profiling and Analysis on Multi-core Systems (PiPA)
Dec 30, 2015
Qin Zhao
Ioana Cutcutache
Weng-Fai Wong
CGO 2008
Why PiPA?
Code profiling and analysis
– very useful for understanding program behavior
– implemented using dynamic instrumentation systems
– several challenges: coverage, accuracy, overhead
  • overhead due to the instrumentation engine
  • overhead due to the profiling code
The performance problem!
– Cachegrind: 100x slowdown
– Pin dcache: 32x slowdown
Need faster tools!
Our Goals
Improve the performance
– reduce the overall profiling and analysis overhead
– but maintain the accuracy
How?
– parallelize!
– optimize
Keep it simple
– easy to understand
– easy to build new analysis tools
Previous Approach
Parallelized slice profiling
– SuperPin, Shadow Profiling
Suitable for simple, independent tasks
[Figure: timeline comparing the original application, the uninstrumented application, the instrumented application, and the SuperPinned application; bars show instrumentation overhead, profiling overhead, and the instrumented slices]
PiPA Key Idea
Pipelining!
– Stage 0: instrumented application
– Stage 1: profile processing
– Stage 2: parallel analysis – analyses on profiles 1, 2, 3, 4 running in threads or processes
[Figure: pipeline timeline over time – the original application plus instrumentation and profiling overhead at stage 0, profile information flowing to stage 1, and the parallel analyses of stage 2]
PiPA Challenges
Minimize the profiling overhead
– Runtime Execution Profile (REP)
Minimize the communication between stages
– double buffering
Design efficient parallel analysis algorithms
– we focus on cache simulation
PiPA Prototype
Cache Simulation
Our Prototype
Implemented in DynamoRIO
Three stages
– Stage 0: instrumented application – collects the REP
– Stage 1: parallel profile recovery and splitting
– Stage 2: parallel cache simulation
Experiments
– SPEC2000 & SPEC2006 benchmarks
– 3 systems: dual-core, quad-core, eight-core
Communication
Keys to minimizing the overhead
– double buffering
– shared buffers
– large buffers
Example – communication between stage 0 and stage 1
[Figure: the profiling thread at stage 0 fills shared buffers that are drained by the processing threads at stage 1]
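The stage-0/stage-1 hand-off can be sketched as below – a minimal Python stand-in, where thread-safe queues replace PiPA's shared-memory buffers and the buffer size and record format are purely illustrative:

```python
import queue
import threading

# Double buffering: the profiling thread fills one buffer while stage 1
# drains the other. Two buffers circulate between a "full" and a "free" queue.
BUF_SLOTS = 4               # real PiPA buffers are large (e.g. 16MB)
full_bufs = queue.Queue()   # buffers ready for stage 1
free_bufs = queue.Queue()   # empty buffers returned to stage 0
for _ in range(2):
    free_bufs.put([])

results = []

def stage1_recovery():
    """Consume full buffers, process them, and recycle them."""
    while True:
        buf = full_bufs.get()
        if buf is None:                 # sentinel: profiling finished
            break
        results.extend(rec * 2 for rec in buf)   # stand-in "analysis"
        buf.clear()
        free_bufs.put(buf)              # hand the empty buffer back

worker = threading.Thread(target=stage1_recovery)
worker.start()

# Stage 0: append profile records, swapping buffers whenever one fills up.
buf = free_bufs.get()
for record in range(10):
    buf.append(record)
    if len(buf) == BUF_SLOTS:
        full_bufs.put(buf)
        buf = free_bufs.get()           # blocks only if stage 1 falls behind
if buf:
    full_bufs.put(buf)                  # flush the final partial buffer
full_bufs.put(None)
worker.join()

print(sorted(results))                  # every record processed exactly once
```

Because stage 0 only blocks when both buffers are full, profiling and processing overlap almost completely when stage 1 keeps up.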
Stage 0: Profiling
compact profile
minimal overhead
Stage 0: Profiling
Runtime Execution Profile (REP)
– fast profiling
– small profile size
– easy information extraction
Hierarchical structure
– profile buffers
  • data units
    – slots
Can be customized for different analyses
– in our prototype we consider cache simulation
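A minimal sketch of this hierarchy, with field names modeled on the slide's REP example rather than PiPA's actual structs: a static side (REPS) describes each basic block's references once, and a dynamic side (REPD) holds one small unit of recorded register values per block execution.

```python
from dataclasses import dataclass, field

@dataclass
class Ref:                 # static descriptor for one memory access
    pc: int
    type: str              # "read" or "write"
    size: int
    offset: int            # constant added to the recorded base value
    value_slot: int        # 1-based slot holding the base register value
    size_slot: int = -1    # -1: the size is the static constant above

@dataclass
class RepUnitStatic:       # one REPS entry per basic block
    tag: int               # e.g. the block's start pc
    num_slots: int
    refs: list = field(default_factory=list)

@dataclass
class RepUnitDynamic:      # one REPD unit per block execution (12 bytes
    tag: int               # in the slide: 4-byte tag + 2 x 4-byte slots)
    slots: list = field(default_factory=list)

# The REPS entry from the slide: bb1 at 0x080483d7, 2 slots, 3 references.
bb1 = RepUnitStatic(
    tag=0x080483D7,
    num_slots=2,
    refs=[
        Ref(pc=0x080483D7, type="read", size=4, offset=12, value_slot=1),
        Ref(pc=0x080483DC, type="read", size=4, offset=0,  value_slot=2),
        Ref(pc=0x080483DD, type="read", size=4, offset=4,  value_slot=2),
    ],
)

# One dynamic unit: slot 1 recorded eax, slot 2 recorded esp.
unit = RepUnitDynamic(tag=bb1.tag, slots=[0x2304, 0x141A])

# The address of the first reference is slot value + offset.
print(hex(unit.slots[bb1.refs[0].value_slot - 1] + bb1.refs[0].offset))  # → 0x2310
```

The key saving is that three memory references cost only two recorded register values at run time; everything constant lives in REPS.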
REP Example
[Figure: REP layout for two basic blocks.
bb1: mov [eax + 0x0c] eax; mov ebp esp; pop ebp; return
bb2: pop ebx; pop ecx; cmp eax, 0; jz label_bb3
The static side (REPS) holds one unit per block; for bb1: tag: 0x080483d7, num_slots: 2, num_refs: 3, with one descriptor per reference: <pc: 0x080483d7, type: read, size: 4, offset: 12, value_slot: 1, size_slot: -1>, <pc: 0x080483dc, type: read, size: 4, offset: 0, value_slot: 2, size_slot: -1>, <pc: 0x080483dd, type: read, size: 4, offset: 4, value_slot: 2, size_slot: -1>.
The dynamic side (REPD) is a chain of profile buffers of 12-byte units, each recording the base register values (here eax and esp) in its slots; a profile base pointer tracks the current position, and each buffer ends with a canary zone and a pointer to the next buffer.]
Profiling Optimization
Store register values in the REP
– avoid computing the memory address at profiling time
Register liveness analysis
– avoid register stealing if possible
Record a single register value for multiple references
– a single stack pointer value for a sequence of push/pop
– the base address for multiple accesses to the same structure
More in the paper
Profiling Overhead
[Figure: slowdown relative to native execution on SPECint2000, SPECfp2000, and SPEC2000 overall, for the 2-core, 4-core, and 8-core systems, comparing instrumentation without optimization against optimized instrumentation]
Avg slowdown: ~3x
Stage 1: Profile Recovery
fast recovery
Stage 1: Profile Recovery
Need to reconstruct the full memory reference information
– <pc, address, type, size>
[Figure: recovery combines each REPS unit (tag: 0x080483d7, num_slots: 2, num_refs: 3, per-reference descriptors) with the register values recorded in the REPD buffers (0x2304, 0x141a, 0x1423, ...) to emit full trace records, e.g. <0x080483d7, 0x2310, read, 4> and <0x080483dc, 0x141a, read, 4>]
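The recovery step can be sketched as below – a hypothetical Python model using the slide's example values (the real stage 1 walks raw REP buffers produced by DynamoRIO; the dict layout here is illustrative). Each full record is rebuilt by adding the static offset to the base register value recorded in the unit's slot:

```python
# Static side (REPS): per-block reference descriptors, keyed by block tag.
reps = {
    0x080483D7: [
        # (pc, type, size, offset, value_slot) -- value_slot is 1-based
        (0x080483D7, "read", 4, 12, 1),
        (0x080483DC, "read", 4, 0,  2),
        (0x080483DD, "read", 4, 4,  2),
    ],
}

# Dynamic side (REPD): one unit per block execution, holding slot values.
repd = [
    {"tag": 0x080483D7, "slots": [0x2304, 0x141A]},  # eax, esp from the slide
]

def recover(reps, repd):
    """Rebuild the full <pc, address, type, size> trace."""
    trace = []
    for unit in repd:
        for pc, type_, size, offset, value_slot in reps[unit["tag"]]:
            address = unit["slots"][value_slot - 1] + offset
            trace.append((pc, address, type_, size))
    return trace

for pc, addr, type_, size in recover(reps, repd):
    print(f"{pc:#010x} {addr:#06x} {type_} {size}")
```

Note that the recovery work (one add per reference) has been moved off the application's critical path into the stage-1 threads.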
Profile Recovery Overhead
Factor 1: buffer size
Experiments done on the 8-core system, using 8 recovery threads
[Figure: slowdown relative to native execution on SPECint2000, SPECfp2000, and SPEC2000, for small (64K), medium (1M), and large (16M) buffers]
Profile Recovery Overhead
Factor 2: the number of recovery threads
Experiments done on the 8-core system, using 16MB buffers
[Figure: slowdown relative to native execution on SPECint2000, SPECfp2000, and SPEC2000, for 0, 2, 4, 6, and 8 recovery threads]
Profile Recovery Overhead
Factor 3: the number of available cores
Experiments done using 16MB buffers and 8 recovery threads
[Figure: slowdown relative to profiling alone on SPECint2000, SPECfp2000, and SPEC2000, for 2, 4, and 8 cores]
Profile Recovery Overhead
Factor 4: the impact of using REP
– experiments done on the 8-core system with 16MB buffers and 8 threads
[Figure: per-benchmark slowdown relative to native execution across the SPEC2000 suite, comparing PiPA using REP against PiPA using the standard <pc, address, type, size> profile format]
PiPA-REP: 4.5x average slowdown
PiPA-standard: 20.7x average slowdown
Stage 2: Cache Simulation
parallel analysis
independent simulators
Stage 2: Parallel Cache Simulation
How to parallelize?
– split the address trace into independent groups
– two memory references that access different sets are independent
Set-associative caches
– partition the cache sets and simulate them using several independent simulators
– merge the results (number of hits and misses) at the end of the simulation
Example:
– 32K cache, 32-byte line, 4-way associative => 256 sets
– 4 independent simulators, each one simulating 64 sets (round-robin distribution)
[Figure: a trace of reads and writes (addresses 0xbf9c4614, 0xbf9c4705, 0xbf9c4a34, 0xbf9c4a60, 0xbf9c4a5c, 0xbf9c460d, ...) being split by set index into the per-simulator queues]
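The splitting step for this example can be sketched as follows. This assumes the round-robin distribution means set s goes to simulator s mod 4 – one plausible reading of the slide; the cache geometry matches the example above:

```python
# Slide's example cache: 32KB, 32-byte lines, 4-way => 256 sets,
# handled by 4 independent simulators.
LINE_BITS = 5        # 32-byte line -> drop the low 5 bits
NUM_SETS = 256       # 32KB / (32B line * 4 ways)
NUM_SIMS = 4

def set_index(addr):
    """Which cache set an address maps to."""
    return (addr >> LINE_BITS) % NUM_SETS

def simulator_for(addr):
    """Round-robin assignment: set s goes to simulator s mod NUM_SIMS."""
    return set_index(addr) % NUM_SIMS

trace = [0xBF9C4614, 0xBF9C4705, 0xBF9C4A34, 0xBF9C4A60, 0xBF9C4A5C]
queues = [[] for _ in range(NUM_SIMS)]
for addr in trace:
    queues[simulator_for(addr)].append(addr)

# References in different sets never interact, so each queue can be
# simulated independently; per-set hit/miss counts are summed at the end.
for i, q in enumerate(queues):
    print(i, [hex(a) for a in q])
```

With this split, the two references 0xbf9c4614 and 0xbf9c4705 (sets 48 and 56) land on the same simulator, while 0xbf9c4a34 (set 81) goes to another, so the simulators never need to synchronize until the final merge.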
Cache Simulation Overhead
Experiments done on the 8-core system
– 8 recovery threads and 8 cache simulators
[Figure: per-benchmark slowdown relative to native execution across the SPEC2000 suite – PiPA: 10.5x average, Pin dcache: 32x average]
PiPA speedup over dcache: 3x
SPEC 2006 Results
Experiments done using the 8-core system
[Figure: per-benchmark slowdown relative to native execution across the SPEC2006 suite, broken down into profiling (3.27x average), profiling + recovery (3.7x average), and full cache simulation (10.2x average)]
Average speedup over dcache: 3x
Summary
PiPA is an effective technique for parallel profiling and analysis
– based on pipelining
– drastically reduces both
  • profiling time
  • analysis time
– full cache simulation incurs only a 10.5x slowdown
Runtime Execution Profile
– requires minimal instrumentation code
– compact enough to ensure optimal buffer usage
– makes it easy for the next stages to recover the full trace
Parallel cache simulation
– the cache sets are partitioned across several independent simulators
Future Work
Design APIs
– hide the communication between the pipeline stages
– let tool writers focus only on the instrumentation and analysis tasks
Further improve the efficiency
– parallel profiling
– workload monitoring
More analysis algorithms
– branch prediction simulation
– memory dependence analysis
– ...
Pin Prototype
Second implementation, in Pin
Preliminary results
– 2.6x speedup over Pin dcache
Plan to release PiPA
www.comp.nus.edu.sg/~ioana