Western Research Laboratory
Design and Evaluation of Architectures for Commercial Applications
Luiz André Barroso
Part II: tools & methods
UPC, February 1999
Overview

Evaluation methods/tools
  Introduction
  Software instrumentation (ATOM)
  Hardware measurement & profiling
    – IPROBE
    – DCPI
    – ProfileMe
  Tracing & trace-driven simulation
  User-level simulators
  Complete machine simulators (SimOS)
Studying commercial applications: challenges

Size of the data sets and programs
Complex control flow
Complex interactions with the Operating System
Difficult tuning process
Lack of access to source code (?)
Vendor restrictions on publications

=> important to have a rich set of tools
Tools are useful in many phases

Understanding behavior of workloads
Tuning
Performance measurements in existing systems
Performance estimation for future systems
Using ordinary system tools

Measuring CPU utilization and balance
Determining user/system breakdown
Detecting I/O bottlenecks
  Disks
  Networks
Monitoring memory utilization and swap activity
Gathering symbol table information

Most database programs are large, statically linked, stripped binaries
Most tools will require symbol table information
However, distributions typically consist of object files with symbolic data
Simple trick: replace the system linker with a wrapper that removes the "strip" flag, then calls the real linker
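A minimal sketch of that wrapper in C. The real-linker path (`/usr/bin/ld.real`) and the `-s` strip flag are assumptions for illustration; the actual names depend on the system and linker.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Build the argument list for the real linker, dropping any "-s" (strip)
 * flags. Returns the number of forwarded arguments (excluding the NULL). */
int filter_strip(int argc, char **argv, char **fwd, const char *real_ld)
{
    int n = 0;
    fwd[n++] = (char *)real_ld;
    for (int i = 1; i < argc; i++)
        if (strcmp(argv[i], "-s") != 0)   /* assumed strip flag */
            fwd[n++] = argv[i];
    fwd[n] = NULL;
    return n;
}

/* Installed in place of the system linker: forward everything except the
 * strip flag, so binaries come out with their symbol tables intact. */
int wrapper_main(int argc, char **argv)
{
    char **fwd = malloc((argc + 1) * sizeof(char *));
    filter_strip(argc, argv, fwd, "/usr/bin/ld.real");  /* assumed path */
    execv(fwd[0], fwd);
    return 1;   /* only reached if exec failed */
}
```

The database build then proceeds unchanged; only the wrapper knows the binaries are no longer being stripped.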
ATOM: A Tool-Building System

Developed at WRL by Alan Eustace & Amitabh Srivastava
Easy to build new tools
Flexible enough to build interesting tools
Fast enough to run on real applications
Compiler independent: works on existing binaries
Code Instrumentation

Application appears unchanged
ATOM adds code and data to the application
Information collected as a side effect of execution

[Diagram: the analysis tool rides inside the application binary, Trojan-horse style]
ATOM Programming Interface

Given an application program:
  Navigation: move around
  Interrogation: ask questions
  Definition: define interface to analysis procedures
  Instrumentation: add calls to analysis procedures

Pass ANYTHING as arguments!
  PC, effective addresses, constants, register values, arrays, function arguments, line numbers, procedure names, file names, etc.
Navigation Primitives

Get{First,Last,Next,Prev}Obj
Get{First,Last,Next,Prev}ObjProc
Get{First,Last,Next,Prev}Block
Get{First,Last,Next,Prev}Inst
GetInstBlock - find enclosing block
GetBlockProc - find enclosing procedure
GetProcObj - find enclosing object
GetInstBranchTarget - find branch target
ResolveTargetProc - find subroutine destination
Interrogation

GetProgramInfo(PInfo)
  number of procedures, blocks, and instructions
  text and data addresses
GetProcInfo(Proc *, BlockInfo)
  number of blocks or instructions
  procedure frame size, integer and floating point save masks
GetBlockInfo(Inst *, InstInfo)
  number of instructions
  any piece of the instruction (opcode, ra, rb, displacement)
Interrogation(2)

ProcFileName
  returns the file name for this procedure
InstLineNo
  returns the line number of this instruction
GetInstRegEnum
  returns a unique register specifier
GetInstRegUsage
  computes source and destination masks
Interrogation(3)

GetInstRegUsage
  computes instruction source and destination masks

GetInstRegUsage(instFirst, &usageFirst);
GetInstRegUsage(instSecond, &usageSecond);
if (usageFirst.dreg_bitvec[0] & usageSecond.ureg_bitvec[0]) {
    /* set followed by a use */
}

Exactly what you need to find static pipeline stalls!
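Outside of ATOM, the same idiom can be sketched with plain bit vectors. The struct and field names below are illustrative, not ATOM's actual definitions:

```c
#include <stdint.h>

/* Per-instruction register-usage masks: bit r set in ureg_bitvec means the
 * instruction reads register r; set in dreg_bitvec, that it writes it. */
typedef struct {
    uint64_t ureg_bitvec;   /* source registers */
    uint64_t dreg_bitvec;   /* destination registers */
} RegUsage;

/* Nonzero if 'second' uses a register that 'first' just set: the
 * set-followed-by-use pattern that signals a potential static stall. */
int set_then_use(const RegUsage *first, const RegUsage *second)
{
    return (first->dreg_bitvec & second->ureg_bitvec) != 0;
}
```

For example, a load that writes r4 followed immediately by an instruction reading r4 triggers the check, while the reverse order does not.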
Definition

AddCallProto("function(argument list)")
  Constants
  Character strings
  Program counter
  Register contents
  Cycle counter
  Constant arrays
  Effective addresses
  Branch condition values
Instrumentation

AddCallProgram(Program{Before,After}, "name", args)
AddCallProc(p, Proc{Before,After}, "name", args)
AddCallBlock(b, Block{Before,After}, "name", args)
AddCallInst(i, Inst{Before,After}, "name", args)
ReplaceProc(p, "new")
Example #1: Procedure Tracing

What procedures are executed by the following mystery program?

#include <stdio.h>
main() {
    printf("Hello world!\n");
}

Hint: main => printf => ???
Procedure Tracing Example

> cc hello.c -non_shared -g1 -o hello
> atom hello ptrace.inst.c ptrace.anal.c -o hello.ptrace
> hello.ptrace
=> __start
=> main
=> printf
=> _doprnt
=> __getmbcurmax
<= __getmbcurmax
=> memcpy
<= memcpy
=> fwrite
Procedure Trace (2)

=> _wrtchk
=> _findbuf
=> __geterrno
<= __geterrno
=> __isatty
=> __ioctl
<= __ioctl
<= __isatty
=> __seterrno
<= __seterrno
<= _findbuf
<= _wrtchk
=> memcpy
<= memcpy
=> memchr
<= memchr
=> _xflsbuf
=> __write
Hello world!
<= __write
<= _xflsbuf
<= fwrite
<= _doprnt
<= printf
<= main
=> exit
=> __ldr_atexit
=> __ldr_context_atexit
<= __ldr_context_atexit
<= __ldr_atexit
=> _cleanup
=> fclose
=> fflush
<= fflush
=> __close
<= __close
<= fclose
=> fclose
=> fflush
<= fflush
=> __close
<= __close
<= fclose
=> fclose
=> __close
Example #2: Cache Simulator

Write a tool that computes the miss rate of the application running in a 64KB, direct-mapped data cache with 32-byte lines.

> atom spice cache.inst.o cache.anal.o -o spice.cache
> spice.cache < ref.in > ref.out
> more cache.out
5,387,822,402 620,855,884 11.523%

Great use for 64-bit integers!
Cache Tool Implementation

Application, with the instrumentation calls ATOM inserts shown inline:

main:
    move  v0,zero
    li    a0,20
loop:
    Reference(-32592(gp));    <- inserted before the load
    lw    v1,-32592(gp)
    addiu v1,v1,4
    addiu v0,v0,4
    Reference(-32592(gp));    <- inserted before the store
    sw    v1,-32592(gp)
    bne   v0,a0,loop
    jr    ra
    PrintResults();           <- inserted at program end

Note: passes addresses as if uninstrumented!
Cache Instrumentation File

#include <stdio.h>
#include <cmplrs/atom.inst.h>
unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p; Block *b; Inst *i;
    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        if (o == GetFirstObj()) AddCallObj(o, ObjAfter, "Print");
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        WriteObj(o);
    }
    return (0);
}
Cache Analysis File

#include <stdio.h>
#define CACHE_SIZE 65536
#define BLOCK_SHIFT 5
long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;
Reference(long address)
{
    /* parentheses matter: mask first, then shift off the block offset */
    int index = (address & (CACHE_SIZE-1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) { misses++; cache[index] = tag; }
    refs++;
}
Print()
{
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}
Example #3: TPC-B runtime information

Statistics per transaction:
  Instructions         180,398
  Loads (% shared)      47,643 (24%)
  Stores (% shared)     21,380 (22%)
  Lock/Unlock              118
  MBs                      241

Footprints/CPU:
  Instr.         300 KB   (1.6 MB in pages)
  Private data   470 KB   (4 MB in pages)
  Shared data      7 MB   (26 MB in pages)
    – 50% of the shared data footprint is touched by at least one other process
TPC-B (2)

[Figure: memory footprint in bytes (0 to 8x10^6) vs. transactions (72 to 1800), with curves for the instruction, private data, and shared data footprints]
TPC-B (3)

[Figure: memory footprint in bytes (0 to 8x10^6) vs. number of server processes (1 to 6), with curves for the shared data, private data, and instruction footprints]
Oracle SGA activity in TPC-B

[Figure]
ATOM wrap-up

Very flexible "hack-it-yourself" tool
Discovers detailed information on the dynamic behavior of programs
Especially good when you don't have source code
Shipped with Digital Unix
Can be used for tracing (later)
Hardware measurement tools

IPROBE
  interface to CPU event counters
DCPI
  hardware-assisted profiling
ProfileMe
  hardware-assisted profiling for complex CPU cores
IPROBE

Developed by Digital's Performance Group
Uses the event counters provided by Alphas
Operation:
  set counter to monitor a particular event (e.g., icache_miss)
  start counter
  on every counter overflow, an interrupt wakes up a handler and events are accumulated
  stop counter and read the total
User can select:
  which processes to count
  user level, kernel level, or both
IPROBE: 21164 event types

issues, cycles, branches, cond_branches, jsr_ret, integer_ops, float_ops, loads, stores, icache_access, dcache_access, scache_access, scache_read, scache_write, bcache_hit, bcache_victim, sys_req, scache_victim

single_issue_cycles, dual_issue_cycles, triple_issue_cycles, quad_issue_cycles, split_issue_cycles, pipe_dry, pipe_frozen, replay_trap

long_stalls, branch_mispr, pc_mispr, icache_miss, dcache_miss, dtb_miss, loads_merged, ldu_replays, scache_miss, scache_read_miss, scache_sh_write, scache_write_miss, bcache_miss, sys_inv, itb_miss, wb_maf_full_replays, sys_read_req, external, mem_barrier_cycles, load_locked
IPROBE: what you can do

Directly measure relevant events (e.g., cache performance)
Overall CPU cycle breakdown diagnosis:
  microbenchmark the machine to estimate latencies
  combine latencies with event counts
Main source of inaccuracy:
  load/store overlap in the memory system
IPROBE example: 4-CPU SMP

CPI = 7.4

Breakdown of CPU cycles:
  issuing            10%
  data stall         46%
  instruction stall  44%

Estimated breakdown of stall cycles:
  Bcache hit      27%
  Bcache miss     42%
  Scache hit      16%
  Replay trap      6%
  Mem. barrier     5%
  Branch mispr.    2%
  TLB              2%
Why did it run so badly?!?

Nominal memory latencies were good: 80 cycles
Micro-benchmarks determined that:
  latency under load is over 120 cycles on 4 processors
  base dirty miss latency was over 130 cycles
  off-chip cache latency was high
IPROBE data uncovered significant sharing:
  for P=2, 15% of bcache misses are to dirty blocks
  for P=4, 20% of bcache misses are to dirty blocks
Dirty miss latency on RISC SMPs

SPEC benchmarks have no significant sharing
Current processors/systems optimize local cache access
All RISC SMPs have high dirty miss penalties

[Figure: distribution of bus stall latencies for dirty misses; % of dirty misses (0 to 60) vs. bus cycles (1 to 16)]
DCPI: continuous profiling infrastructure

Developed by SRC and WRL researchers
Based on periodic sampling:
  hardware generates periodic interrupts
  OS handles the interrupts and stores data: program counter (PC) and any extra info
  analysis tools convert the data for users and for compilers
Other examples: SGI Speedshop, Unix's prof(), VTune
Sampling vs. Instrumentation

Much lower overhead than instrumentation:
  DCPI: program 1%-3% slower
  Pixie: program 2-3 times slower
Applicable to large workloads:
  100,000 TPS on Alpha
  AltaVista
Easier to apply to whole systems (kernel, device drivers, shared libraries, ...):
  instrumenting kernels is very tricky
No source code needed
Information from Profiles

DCPI estimates:
  where CPU cycles went, broken down by image, procedure, instruction
  how often code was executed
    – basic blocks and CFG edges
  where peak performance was lost and why
Example: Getting the Big Picture

Total samples for event type cycles = 6095201

  cycles      %     cum%  load file
 2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
 1658462  27.21%  64.24%  /vmunix
  928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
  650299  10.67%  90.14%  /usr/shlib/X11/libos.so

  cycles      %     cum%  procedure              load file
 2064143  33.87%  33.87%  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
  517464   8.49%  42.35%  ReadRequestFromClient  /usr/shlib/X11/libos.so
  305072   5.01%  47.36%  miCreateETandAET       /usr/shlib/X11/libmi.so
  271158   4.45%  51.81%  miZeroArcSetup         /usr/shlib/X11/libmi.so
  245450   4.03%  55.84%  bcopy                  /vmunix
  209835   3.44%  59.28%  Dispatch               /usr/shlib/X11/libdix.so
  186413   3.06%  62.34%  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
  170723   2.80%  65.14%  in_checksum            /vmunix
  161326   2.65%  67.78%  miInsertEdgeInET       /usr/shlib/X11/libmi.so
  133768   2.19%  69.98%  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
Example: Using the Microscope

   Address  Instruction       Samples  Culprits   CPI
   9618     addq s0,t6,t6         643             1.0 cycles
D  961c     ldl  t4,0(t6)        2111     9618    3.5 cycles
di 9620     xor  t4,t12,t5      14152     961c   21.0 cycles
   9624     beq  0x963c             0             0.0 cycles

Annotations: a = data dep on 1st operand, b = data dep on 2nd operand,
D = DTLB miss, d = d-cache miss, i = i-cache miss

Where peak performance is lost and why
Example: Summarizing Stalls

I-cache (not ITB)     0.0% to  0.3%
ITB/I-cache miss      0.0% to  0.0%
D-cache miss         27.9% to 27.9%
DTB miss              9.2% to 18.3%
Write buffer          0.0% to  6.3%
Synchronization       0.0% to  0.0%
Branch mispredict     0.0% to  2.6%
IMUL busy             0.0% to  0.0%
FDIV busy             0.0% to  0.0%
Other                 0.0% to  0.0%
Unexplained stall     2.3% to  2.3%
Unexplained gain     -4.3% to -4.3%
-------------------------------------------------------------
Subtotal dynamic     44.1%

Slotting              1.8%
Ra dependency         2.0%
Rb dependency         1.0%
Rc dependency         0.0%
FU dependency         0.0%
-------------------------------------------------------------
Subtotal static       4.8%
-------------------------------------------------------------
Total stall          48.9%
Execution            51.2%
Net sampling error   -0.1%
-------------------------------------------------------------
Total tallied       100.0%  (35171, 93.1% of all samples)
Example: Sorting Stalls

    %   cum%  cycles  cnt   cpi  blame   PC    file:line
10.0%  10.0%  109885  4998  22.0 dcache  957c  comp.c:484
 9.9%  19.8%  108776  5513  19.7 dcache  9530  comp.c:477
 7.8%  27.6%   85668  3836  22.3 dcache  959c  comp.c:488
Typical Hardware Support

Timers
  clock interrupt after N units of time
Performance counters
  interrupt after N cycles, issues, loads, L1 Dcache misses, branch mispredicts, uops retired, ...
  Alpha 21064, 21164; PPro, PII; ...
  easy to measure total cycles, issues, CPI, etc.
Only extra information is the restart PC
Problem: Inaccurate Attribution

Experiment:
  count data loads
  loop: single load + hundreds of nops
In-order processor (Alpha 21164): skew, one large peak
Out-of-order processor (Intel Pentium Pro): skew, smear

[Figure: histograms of restart PCs over the loop; the 21164 concentrates essentially all samples (782) on one instruction at a fixed skew from the load, while the Pentium Pro smears samples across many instructions]
Ramification of Misattribution

No skew or smear:
  instruction-level analysis is easy!
Skew is a constant number of cycles:
  instruction-level analysis is possible
  adjust sampling period by the amount of skew
  infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program
Smear:
  instruction-level analysis seems hopeless
  examples: PII, StrongARM
Desired Hardware Support

Sample fetched instructions
Save PC of sampled instruction
  e.g., interrupt handler reads an Internal Processor Register
Makes skew and smear irrelevant
Gather more information
ProfileMe: Instruction-Centric Profiling

[Diagram: pipeline stages fetch -> map -> issue -> exec -> retire, with the icache, branch predictor, dcache, and arith units. On fetch-counter overflow, one fetched instruction is randomly selected and tagged ("ProfileMe tag!"); as the tagged instruction flows down the pipe, internal processor registers capture its pc, effective address, retired? and miss? flags, branch history and mispredict (mp?), and per-stage latencies; an interrupt delivers the record once the instruction is done]
Instruction-Level Statistics

PC + Retire Status -> execution frequency
PC + Cache Miss Flag -> cache miss rates
PC + Branch Mispredict -> mispredict rates
PC + Event Flag -> event rates
PC + Branch Direction -> edge frequencies
PC + Branch History -> path execution rates
PC + Latency -> instruction stalls
  "100-cycle dcache miss" vs. "dcache miss"
Data Analysis

[Diagram: compiled code + samples -> analysis -> frequency, cycles per instruction, stall explanations]

Cycle samples are proportional to total time at the head of the issue queue (at least on in-order Alphas)
Frequency indicates frequent paths
CPI indicates stalls
Estimating Frequency from Samples

[Example: 1,000,000 cycle samples is ambiguous: 1,000,000 executions at 1 CPI, or 10,000 executions at 100 CPI?]

Problem: given cycle samples, compute frequency and CPI
Approach:
  let F = Frequency / Sampling Period
  E(Cycle Samples) = F x CPI
  so ... F = E(Cycle Samples) / CPI
Estimating Frequency (cont.)

F = E(Cycle Samples) / CPI
Idea:
  if there is no dynamic stall, then CPI is known, so F can be estimated
  so ... assume some instructions have no dynamic stalls
Consider a group of instructions with the same frequency (e.g., a basic block)
Identify instructions w/o dynamic stalls; then average their sample counts for better accuracy
Key insight: instructions without stalls have smaller sample counts
Estimating Frequency (Example)

Address  Instruction          Samples  MinCPI  Samples/MinCPI
9600     subl s6, a1, s6          792       1      792
9604     lda a3, 16411(s6)        611       1      611
9608     cmovlt s6, a3, s6        649       1      649
960c     bis zero, zero, s3         0       0              Estimate 630
9610     sll s6, 0x5, t6         1389       2      695     (Actual 615)
9614     addl zero, t6, t6        616       1      616
9618     addq s0, t6, t6          643       1      643
961c     ldl t4, 0(t6)           2111       1     2111
9620     xor t4, t12, t5        13152       2     6576
9624     beq t5, 963c               0       0

Compute MinCPI from the code
Compute Samples/MinCPI
Select data to average

Does badly when:
  few issue points
  all issue points stall
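The select-and-average step can be sketched as follows. The "average the smaller half of the Samples/MinCPI ratios" rule is our simplification; DCPI's actual selection of stall-free issue points is more refined:

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Estimate a basic block's execution frequency (in samples) from per-
 * instruction cycle samples and statically computed minimum CPIs. */
double estimate_frequency(const long *samples, const int *min_cpi, int n)
{
    double ratio[64];
    int m = 0;
    for (int i = 0; i < n; i++)
        if (min_cpi[i] > 0)                 /* skip zero-MinCPI slots */
            ratio[m++] = (double)samples[i] / min_cpi[i];
    qsort(ratio, m, sizeof(double), cmp_double);
    /* assume the instructions with the smallest ratios issued stall-free */
    int k = m / 2;
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += ratio[i];
    return k ? sum / k : 0.0;
}
```

On the table's numbers this averages 611, 616, 643, and 649, giving roughly 630, close to the slide's estimate; it inherits the same failure modes (few issue points, or all issue points stalling).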
Frequency Estimate Accuracy

Compare frequency estimates for blocks to measured values obtained with a pixie-like tool
Explaining Stalls

Static stalls:
  schedule instructions in each basic block optimistically using a detailed pipeline model for the processor
Dynamic stalls:
  start with all possible explanations
    – I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
  rule out unlikely explanations
  list the remaining possibilities
Ruling Out D-cache Misses

Is the previous occurrence of an operand register the destination of a load instruction?
  search backward across basic block boundaries
  prune by block and edge execution frequencies

    ldq  t0,0(s1)            addq t3,t0,t4
        ...           OR         ...
    subq t0,t1,t2            subq t0,t1,t2
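The backward search can be sketched over a toy instruction encoding (ours, not DCPI's), simplified to look for the previous writer of the register; a real implementation also walks across basic-block boundaries and prunes low-frequency paths:

```c
/* Toy instruction record: whether it is a load, and which register it
 * writes (-1 if none). */
typedef struct {
    int is_load;
    int dst;
} Inst;

/* Walk backward from instruction i looking for the previous writer of
 * 'reg'. A d-cache miss remains a possible stall explanation only if that
 * producer is a load; if the producer lies outside the window searched,
 * the miss cannot be ruled out. */
int dcache_miss_possible(const Inst *code, int i, int reg)
{
    for (int j = i - 1; j >= 0; j--)
        if (code[j].dst == reg)
            return code[j].is_load;
    return 1;
}
```

This mirrors the two cases above: if the stalled subq's t0 was produced by ldq t0,0(s1), keep d-cache miss as a candidate; if it came from an arithmetic instruction, rule the miss out.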
DCPI wrap-up

Very precise, non-intrusive profiling tool
Gathers both user-level and kernel profiles
Relates architectural events back to the original code
Used for profile-based code optimizations
Simulation of commercial workloads

Requires scaling down
Options:
  trace-driven simulation
  user-level execution-driven simulation
  complete machine simulation
Trace-driven simulation

Methodology:
  create an ATOM instrumentation tool that logs a complete trace per Oracle server process
    – instruction path
    – data accesses
    – synchronization accesses
    – system calls
  run the "atomized" version to derive the trace
  feed the traces to the simulator
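A sketch of what one per-process trace record might look like; the record layout and field names are our assumptions, not the actual WRL tool's format. The analysis procedure called from the instrumented binary just appends fixed-size binary records:

```c
#include <stdio.h>
#include <stdint.h>

typedef enum { TR_INST, TR_LOAD, TR_STORE, TR_LOCK, TR_UNLOCK, TR_SYSCALL } TraceKind;

/* One trace event: what happened, where in the code, and on what address
 * (effective address for loads/stores, lock address, or syscall number). */
typedef struct {
    uint8_t  kind;
    uint64_t pc;
    uint64_t addr;
} TraceRec;

/* Analysis procedure called from the instrumented binary: keep it cheap,
 * one raw append per event, into that process's trace file. */
void log_event(FILE *f, TraceKind kind, uint64_t pc, uint64_t addr)
{
    TraceRec r = { (uint8_t)kind, pc, addr };
    fwrite(&r, sizeof r, 1, f);
}
```

With ATOM, `log_event` would be registered via AddCallProto and attached with AddCallInst to loads, stores, lock/unlock routines, and syscall stubs, one trace file per server process.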
Trace-driven studies: limitations

No OS activity (in OLTP the OS takes 10-15% of the time)
Traces only selected processes (e.g., server processes)
Time dilation alters system behavior:
  I/O looks faster
  many places with hardwired timeout values have to be patched
Capturing synchronization correctly is difficult:
  need to reproduce correct concurrency for shared data structures
  DB has complex synchronization structure, many levels of procedures
59 UPC, February 1999
Trace-driven studies: limitations (2)
Scheduling traces onto simulated processors
– need enough information in the trace to reproduce OS scheduling
– need to suspend processes for I/O & other blocking operations
– need to model the activity of background processes that are not traced (e.g. the log writer)
Re-create the OS virtual-physical mapping and page coloring scheme
Very difficult to simulate wrong-path execution
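The scheduling problem can be made concrete with a toy replayer that interleaves per-process traces onto a fixed number of simulated CPUs and suspends a process across a blocking operation. Everything here (the event format, the `schedule` function, the uniform one-cycle quantum) is a simplifying assumption for illustration, not how any production trace-driven simulator works.

```python
# Toy scheduler for per-process traces: processes run one event per quantum,
# block for a fixed simulated latency on "io" events, and wake in time order.
import heapq

def schedule(traces, num_cpus, io_latency=1000):
    """traces: {pid: list of 'op'/'io' events}. Returns per-pid finish times."""
    ready = list(traces.keys())            # runnable processes
    blocked = []                           # min-heap of (wakeup_time, pid)
    pos = {pid: 0 for pid in traces}
    clock, finish = 0, {}
    while ready or blocked:
        # Wake processes whose simulated I/O has completed (or jump the clock
        # forward when every process is blocked).
        while blocked and (not ready or blocked[0][0] <= clock):
            t, pid = heapq.heappop(blocked)
            clock = max(clock, t)
            ready.append(pid)
        running, ready = ready[:num_cpus], ready[num_cpus:]
        clock += 1                         # one simulated quantum
        for pid in running:
            ev = traces[pid][pos[pid]]
            pos[pid] += 1
            if pos[pid] == len(traces[pid]):
                finish[pid] = clock
            elif ev == "io":
                heapq.heappush(blocked, (clock + io_latency, pid))
            else:
                ready.append(pid)
    return finish

# Usage: one CPU, two processes; pid 2 blocks on I/O after its first event.
print(schedule({1: ["op", "op"], 2: ["io", "op"]}, num_cpus=1, io_latency=5))  # {1: 3, 2: 8}
```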
User-level execution-driven simulator
Our approach was to modify AINT (MINT for Alpha)
Problems:
– no OS activity measured
– Oracle/OS interactions are very complex
– the OS system call interface has to be virtualized, and that’s a hard one to crack…
Our status:
– Oracle/TPC-B ran with 1 server process only
– we gave up...
Complete machine simulator
Bite the bullet: model the machine at the hardware level
The good news is:
– the hardware interface is cleaner & better documented than any software interface (including the OS)
– all software JUST RUNS!! Including the OS
– applications don’t have to be ported to the simulator
We ported SimOS (from Stanford) to Alpha
SimOS
A complete machine simulator
– speed-detail tradeoff for maximum flexibility
– flexible data collection and classification
Originally developed at Stanford University (MIPS ISA)
SimOS-Alpha effort started at WRL in Fall 1996
– Ed Bugnion, Luiz Barroso, Kourosh Gharachorloo, Ben Verghese, Basem Nayfeh, and Jamey Hicks (CRL)
SimOS - Complete Machine Simulation
Models CPUs, caches, buses, memory, disks, network, …
Complete enough to run the OS and any applications
[Diagram: workloads (Pmake, Oracle, VCS) run on the operating system of the simulated machine, which runs on the SimOS hardware models (CPU/MMU, caches, memory system, disks, Ethernet, TTY) hosted on the host machine]
Multiple Levels of Detail
Tradeoff between the speed of simulation and the amount of detail that is simulated
Multiple modes of CPU simulation
Fast “on-the-fly compilation”: 10X slowdown!
– workload placement
Simple pipeline emulator, no caches: 50-100X slowdown
– rough characterization
Simple pipeline emulator, full cache simulation: 100-200X slowdown
– more accurate characterization of workloads
Multiple Models for each Component
Multiple models for CPU, cache, memory, and disk
CPU
– simple pipeline emulator: 100-200X slowdown (EV5)
– dynamically-scheduled processor: 1000-10000X slowdown (e.g. 21264)
Caches
– two-level set-associative caches
– shared caches
Memory
– perfect (0-latency), bus-based (Tlaser), NUMA (Wildfire)
Disk
– fixed latency or a more complex HP disk model
Modular: add your own flavors
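A minimal sketch of the “modular, add your own flavors” idea, assuming a single `request_latency` interface: two interchangeable disk models sitting at different points on the speed/detail curve. The class names and latency numbers are illustrative, not the SimOS models.

```python
# Two disk models behind one interface; the rest of the simulator only
# sees request_latency(), so models can be swapped freely. Illustrative only.

class FixedLatencyDisk:
    """Cheapest model: every request costs the same."""
    def __init__(self, latency_us=10000):
        self.latency_us = latency_us
    def request_latency(self, block):
        return self.latency_us

class SeekAwareDisk:
    """Slightly more detailed: latency grows with seek distance."""
    def __init__(self, base_us=2000, per_block_us=2):
        self.base_us, self.per_block_us = base_us, per_block_us
        self.head = 0
    def request_latency(self, block):
        seek = abs(block - self.head)
        self.head = block
        return self.base_us + seek * self.per_block_us

def run_trace(disk, blocks):
    # Total simulated I/O time for a sequence of block requests.
    return sum(disk.request_latency(b) for b in blocks)

blocks = [0, 5000, 5001]
print(run_trace(FixedLatencyDisk(), blocks))   # 30000
print(run_trace(SeekAwareDisk(), blocks))      # 16002: big seek dominates
```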
Checkpoint and Sampling
Checkpoint capability for the entire machine state
– CPU state, main memory, and disk changes
Important for positioning the workload for detailed simulation
Switching detail level in a “sampling” study
– run in faster modes, sample in more detailed modes
Repeatability
Change parameters for studies
– cache size
– memory type and latencies
– disk models and latencies
– many others
Debugging race conditions
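The sampling idea can be sketched as a driver that alternates between a cheap mode and a detailed mode, paying the detailed slowdown on only a fraction of the instructions. The `simulate` callback and its 10X/200X cost model are assumptions chosen to match the slowdown figures quoted earlier, not SimOS internals.

```python
# Sampling driver: run mostly in a fast mode, periodically drop into a
# detailed mode, and collect detailed statistics only from those samples.

def sampled_run(total_insts, fast_chunk, detail_chunk, simulate):
    """simulate(mode, n) -> (sim_seconds, stats); alternate the two modes."""
    executed, wall, detail_stats = 0, 0.0, 0
    while executed < total_insts:
        for mode, chunk in (("fast", fast_chunk), ("detailed", detail_chunk)):
            n = min(chunk, total_insts - executed)
            secs, stats = simulate(mode, n)
            wall += secs
            executed += n
            if mode == "detailed":
                detail_stats += stats
            if executed >= total_insts:
                break
    return wall, detail_stats

# Toy cost model: fast mode is a 10X slowdown, detailed 200X,
# over a hypothetical 1e9 inst/s native machine.
def simulate(mode, n):
    slowdown = 10 if mode == "fast" else 200
    return n / 1e9 * slowdown, (n // 1000 if mode == "detailed" else 0)

wall, stats = sampled_run(1_000_000, 90_000, 10_000, simulate)
print(round(wall, 6), stats)   # far cheaper than the 0.2 s an all-detailed run would cost
```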
Data Collection and Classification
Exploits the visibility and non-intrusiveness offered by simulation
– can observe low-level events such as cache misses, references, and TLB misses
Tcl-based configuration and control provides ease of use
Powerful annotation mechanism for triggering on events
– hardware, OS, or application events
Apps and mechanisms to organize and classify data
– some already provided (cache miss counts and classification)
– mechanisms to do more (timing trees and detail tables)
Easy configuration
Tcl-based configuration of the machine parameters
Example:

set PARAM(CPU.Model) DELTA
set detailLevel 1
set PARAM(CPU.Clock) 1000
set PARAM(CPU.Count) 4
set PARAM(CACHE.2Level.L2Size) 1024
set PARAM(CACHE.2Level.L2Line) 64
set PARAM(CACHE.2Level.L2HitTime) 15
set PARAM(MEMSYS.MemSize) 1024
set PARAM(MEMSYS.Numa.NumMemories) $PARAM(CPU.Count)
set PARAM(MEMSYS.Model) Numa
set PARAM(DISK.Fixed.Latency) 10
Annotations - The building block
Small procedures to be run on encountering certain events
– PC, hardware events (cache miss, TLB, …), simulator events

annotation set pc vmunix::idle_thread:START {
    set PROCESS($CPU) idle
    annotation exec osEvent startIdle
}
annotation set osEvent switchIn {
    log "$CYCLES ContextSwitch $CPU,$PID($CPU),$PROCESS($CPU)\n"
}
annotation set pc 0x12004ba90 {
    incr tpcbTOGO -1
    console "TRANSACTION $CYCLES togo=$tpcbTOGO \n"
    if {$tpcbTOGO == 0} {simosExit}
}
Example: Kernel Detail (TPC-B)
[Pie chart: breakdown of kernel time among SYS_read, SYS_write, SYS_pid_block, SYS_pid_unblock, lock, Int_clock, Int_IPI, Int_IO, DTLB, ITLB, 2XTLB, MM_FOW, and Other; the three largest shares are roughly 30%, 21%, and 17%]
SimOS Methodology
Configure and tune the workload on an existing machine
– build the database schema, create indexes, load data, optimize queries
– more difficult if the simulated system is much different from the existing platform
Create file(s) with a disk image (dd) of the database disk(s)
– write-protect the “dd” files to prevent permanent modification (i.e. use copy-on-write)
– optionally, umount the disks and let SimOS use them as raw devices
Configure SimOS to see the “dd” files as raw disks
“Boot” a SimOS configuration and mount the disks
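The copy-on-write step can be illustrated with a tiny block-device wrapper: reads fall through to the write-protected image, while writes land in a private overlay, so the “dd” file of the database disk is never modified. `CowDisk` and its interface are hypothetical, purely for illustration.

```python
# Copy-on-write over a write-protected disk image: the base image is read-only;
# modified blocks live in an in-memory overlay keyed by block number.

class CowDisk:
    def __init__(self, image_bytes, block_size=512):
        self.image = image_bytes          # stands in for the read-only dd file
        self.bs = block_size
        self.overlay = {}                 # block number -> modified data

    def read(self, block):
        if block in self.overlay:
            return self.overlay[block]
        off = block * self.bs
        return self.image[off:off + self.bs]

    def write(self, block, data):
        assert len(data) == self.bs
        self.overlay[block] = data        # the base image stays pristine

# Usage: write one block; only the overlay changes.
image = bytes(4096)                       # 8 zero-filled 512-byte blocks
disk = CowDisk(image)
disk.write(3, b"\xff" * 512)
```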
SimOS Methodology (2)
Boot and start up the database engine in “fast mode”
Start up the workload
When in steady state: create a checkpoint and exit
Resume from the checkpoint with a more complex (slower) simulator
Sample NUMA TPC-B Profile
Running from a Checkpoint
What can be changed:
– processor model
– disk model
– cache sizes, hierarchy, organization, replacement
– how long to run the simulation
What cannot be changed:
– number of processors
– size of physical memory
Tools wrap-up
No single tool will get the job done
Monitoring application execution on a real system is invaluable
Complete machine simulation advantages:
– see the whole thing
– portability of software is a non-issue
– speed/detail trade-off is essential for detailed studies