Western Research Laboratory
Design and Evaluation of Architectures for Commercial Applications
Luiz André Barroso
Part II: tools & methods
UPC, February 1999
Overview

Evaluation methods/tools
  Introduction
  Software instrumentation (ATOM)
  Hardware measurement & profiling
    – IPROBE
    – DCPI
    – ProfileMe
  Tracing & trace-driven simulation
  User-level simulators
  Complete machine simulators (SimOS)
Studying commercial applications: challenges

Size of the data sets and programs
Complex control flow
Complex interactions with the Operating System
Difficult tuning process
Lack of access to source code (?)
Vendor restrictions on publications

=> important to have a rich set of tools
Tools are useful in many phases

Understanding behavior of workloads
Tuning
Performance measurements in existing systems
Performance estimation for future systems
Using ordinary system tools

Measuring CPU utilization and balance
Determining user/system breakdown
Detecting I/O bottlenecks
  Disks
  Networks
Monitoring memory utilization and swap activity
Gathering symbol table information

Most database programs are large, statically linked, stripped binaries
Most tools will require symbol table information
However, distributions typically consist of object files with symbolic data
Simple trick: replace the system linker with a wrapper that removes the "strip" flag, then calls the real linker
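A minimal sketch of that wrapper in C. The real-linker path (`/usr/bin/ld.real`) and the `-s` strip flag are assumptions for illustration; the actual names depend on the system and linker.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Build the argument list for the real linker, dropping any "-s" (strip)
 * flags. Returns the number of forwarded arguments (excluding the NULL). */
int filter_strip(int argc, char **argv, char **fwd, const char *real_ld)
{
    int n = 0;
    fwd[n++] = (char *)real_ld;
    for (int i = 1; i < argc; i++)
        if (strcmp(argv[i], "-s") != 0)   /* assumed strip flag */
            fwd[n++] = argv[i];
    fwd[n] = NULL;
    return n;
}

/* Installed in place of the system linker: forward everything except the
 * strip flag, so binaries come out with their symbol tables intact. */
int wrapper_main(int argc, char **argv)
{
    char **fwd = malloc((argc + 1) * sizeof(char *));
    filter_strip(argc, argv, fwd, "/usr/bin/ld.real");  /* assumed path */
    execv(fwd[0], fwd);
    return 1;   /* only reached if exec failed */
}
```

The database build then proceeds unchanged; only the wrapper knows the binaries are no longer being stripped.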
ATOM: A Tool-Building System

Developed at WRL by Alan Eustace & Amitabh Srivastava
Easy to build new tools
Flexible enough to build interesting tools
Fast enough to run on real applications
Compiler independent: works on existing binaries
Code Instrumentation

Application appears unchanged
ATOM adds code and data to the application
Information collected as a side effect of execution

[Diagram: the analysis tool rides inside the application binary, Trojan-horse style]
ATOM Programming Interface

Given an application program:
  Navigation: move around
  Interrogation: ask questions
  Definition: define interface to analysis procedures
  Instrumentation: add calls to analysis procedures

Pass ANYTHING as arguments!
  PC, effective addresses, constants, register values, arrays, function arguments, line numbers, procedure names, file names, etc.
Navigation Primitives

Get{First,Last,Next,Prev}Obj
Get{First,Last,Next,Prev}ObjProc
Get{First,Last,Next,Prev}Block
Get{First,Last,Next,Prev}Inst
GetInstBlock - find enclosing block
GetBlockProc - find enclosing procedure
GetProcObj - find enclosing object
GetInstBranchTarget - find branch target
ResolveTargetProc - find subroutine destination
Interrogation

GetProgramInfo(PInfo)
  number of procedures, blocks, and instructions
  text and data addresses
GetProcInfo(Proc *, BlockInfo)
  number of blocks or instructions
  procedure frame size, integer and floating point save masks
GetBlockInfo(Inst *, InstInfo)
  number of instructions
  any piece of the instruction (opcode, ra, rb, displacement)
Interrogation(2)

ProcFileName
  returns the file name for this procedure
InstLineNo
  returns the line number of this instruction
GetInstRegEnum
  returns a unique register specifier
GetInstRegUsage
  computes source and destination masks
Interrogation(3)

GetInstRegUsage
  computes instruction source and destination masks

GetInstRegUsage(instFirst, &usageFirst);
GetInstRegUsage(instSecond, &usageSecond);
if (usageFirst.dreg_bitvec[0] & usageSecond.ureg_bitvec[0]) {
    /* set followed by a use */
}

Exactly what you need to find static pipeline stalls!
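Outside of ATOM, the same idiom can be sketched with plain bit vectors. The struct and field names below are illustrative, not ATOM's actual definitions:

```c
#include <stdint.h>

/* Per-instruction register-usage masks: bit r set in ureg_bitvec means the
 * instruction reads register r; set in dreg_bitvec, that it writes it. */
typedef struct {
    uint64_t ureg_bitvec;   /* source registers */
    uint64_t dreg_bitvec;   /* destination registers */
} RegUsage;

/* Nonzero if 'second' uses a register that 'first' just set: the
 * set-followed-by-use pattern that signals a potential static stall. */
int set_then_use(const RegUsage *first, const RegUsage *second)
{
    return (first->dreg_bitvec & second->ureg_bitvec) != 0;
}
```

For example, a load that writes r4 followed immediately by an instruction reading r4 triggers the check, while the reverse order does not.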
Definition

AddCallProto("function(argument list)")
  Constants
  Character strings
  Program counter
  Register contents
  Cycle counter
  Constant arrays
  Effective addresses
  Branch condition values
Instrumentation

AddCallProgram(Program{Before,After}, "name", args)
AddCallProc(p, Proc{Before,After}, "name", args)
AddCallBlock(b, Block{Before,After}, "name", args)
AddCallInst(i, Inst{Before,After}, "name", args)
ReplaceProc(p, "new")
Example #1: Procedure Tracing

What procedures are executed by the following mystery program?

#include <stdio.h>
main() {
    printf("Hello world!\n");
}

Hint: main => printf => ???
Procedure Tracing Example

> cc hello.c -non_shared -g1 -o hello
> atom hello ptrace.inst.c ptrace.anal.c -o hello.ptrace
> hello.ptrace
=> __start
=> main
=> printf
=> _doprnt
=> __getmbcurmax
<= __getmbcurmax
=> memcpy
<= memcpy
=> fwrite
Procedure Trace (2)

=> _wrtchk
=> _findbuf
=> __geterrno
<= __geterrno
=> __isatty
=> __ioctl
<= __ioctl
<= __isatty
=> __seterrno
<= __seterrno
<= _findbuf
<= _wrtchk
=> memcpy
<= memcpy
=> memchr
<= memchr
=> _xflsbuf
=> __write
Hello world!
<= __write
<= _xflsbuf
<= fwrite
<= _doprnt
<= printf
<= main
=> exit
=> __ldr_atexit
=> __ldr_context_atexit
<= __ldr_context_atexit
<= __ldr_atexit
=> _cleanup
=> fclose
=> fflush
<= fflush
=> __close
<= __close
<= fclose
=> fclose
=> fflush
<= fflush
=> __close
<= __close
<= fclose
=> fclose
=> __close
Example #2: Cache Simulator

Write a tool that computes the miss rate of the application running in a 64KB, direct-mapped data cache with 32-byte lines.

> atom spice cache.inst.o cache.anal.o -o spice.cache
> spice.cache < ref.in > ref.out
> more cache.out
5,387,822,402 620,855,884 11.523%

Great use for 64-bit integers!
Cache Tool Implementation

Application, with the instrumentation calls ATOM inserts shown inline:

main:
    move  v0,zero
    li    a0,20
loop:
    Reference(-32592(gp));    <- inserted before the load
    lw    v1,-32592(gp)
    addiu v1,v1,4
    addiu v0,v0,4
    Reference(-32592(gp));    <- inserted before the store
    sw    v1,-32592(gp)
    bne   v0,a0,loop
    jr    ra
    PrintResults();           <- inserted at program end

Note: passes addresses as if uninstrumented!
Cache Instrumentation File

#include <stdio.h>
#include <cmplrs/atom.inst.h>
unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p; Block *b; Inst *i;
    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        if (o == GetFirstObj()) AddCallObj(o, ObjAfter, "Print");
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        WriteObj(o);
    }
    return (0);
}
Cache Analysis File

#include <stdio.h>
#define CACHE_SIZE 65536
#define BLOCK_SHIFT 5
long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;
Reference(long address)
{
    /* parentheses matter: mask first, then shift off the block offset */
    int index = (address & (CACHE_SIZE-1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) { misses++; cache[index] = tag; }
    refs++;
}
Print()
{
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}
Example #3: TPC-B runtime information

Statistics per transaction:
  Instructions         180,398
  Loads (% shared)      47,643 (24%)
  Stores (% shared)     21,380 (22%)
  Lock/Unlock              118
  MBs                      241

Footprints/CPU:
  Instr.         300 KB   (1.6 MB in pages)
  Private data   470 KB   (4 MB in pages)
  Shared data      7 MB   (26 MB in pages)
    – 50% of the shared data footprint is touched by at least one other process
TPC-B (2)

[Figure: memory footprint in bytes (0 to 8x10^6) vs. transactions (72 to 1800), with curves for the instruction, private data, and shared data footprints]
TPC-B (3)

[Figure: memory footprint in bytes (0 to 8x10^6) vs. number of server processes (1 to 6), with curves for the shared data, private data, and instruction footprints]
Oracle SGA activity in TPC-B

[Figure]
ATOM wrap-up

Very flexible "hack-it-yourself" tool
Discovers detailed information on the dynamic behavior of programs
Especially good when you don't have source code
Shipped with Digital Unix
Can be used for tracing (later)
Hardware measurement tools

IPROBE
  interface to CPU event counters
DCPI
  hardware-assisted profiling
ProfileMe
  hardware-assisted profiling for complex CPU cores
IPROBE

Developed by Digital's Performance Group
Uses the event counters provided by Alphas
Operation:
  set counter to monitor a particular event (e.g., icache_miss)
  start counter
  on every counter overflow, an interrupt wakes up a handler and events are accumulated
  stop counter and read the total
User can select:
  which processes to count
  user level, kernel level, or both
IPROBE: 21164 event types

issues, cycles, branches, cond_branches, jsr_ret, integer_ops, float_ops, loads, stores, icache_access, dcache_access, scache_access, scache_read, scache_write, bcache_hit, bcache_victim, sys_req, scache_victim

single_issue_cycles, dual_issue_cycles, triple_issue_cycles, quad_issue_cycles, split_issue_cycles, pipe_dry, pipe_frozen, replay_trap

long_stalls, branch_mispr, pc_mispr, icache_miss, dcache_miss, dtb_miss, loads_merged, ldu_replays, scache_miss, scache_read_miss, scache_sh_write, scache_write_miss, bcache_miss, sys_inv, itb_miss, wb_maf_full_replays, sys_read_req, external, mem_barrier_cycles, load_locked
IPROBE: what you can do

Directly measure relevant events (e.g., cache performance)
Overall CPU cycle breakdown diagnosis:
  microbenchmark the machine to estimate latencies
  combine latencies with event counts
Main source of inaccuracy:
  load/store overlap in the memory system
IPROBE example: 4-CPU SMP

CPI = 7.4

Breakdown of CPU cycles:
  issuing            10%
  data stall         46%
  instruction stall  44%

Estimated breakdown of stall cycles:
  Bcache hit      27%
  Bcache miss     42%
  Scache hit      16%
  Replay trap      6%
  Mem. barrier     5%
  Branch mispr.    2%
  TLB              2%
Why did it run so badly?!?

Nominal memory latencies were good: 80 cycles
Micro-benchmarks determined that:
  latency under load is over 120 cycles on 4 processors
  base dirty miss latency was over 130 cycles
  off-chip cache latency was high
IPROBE data uncovered significant sharing:
  for P=2, 15% of bcache misses are to dirty blocks
  for P=4, 20% of bcache misses are to dirty blocks
Dirty miss latency on RISC SMPs

SPEC benchmarks have no significant sharing
Current processors/systems optimize local cache access
All RISC SMPs have high dirty miss penalties

[Figure: distribution of bus stall latencies for dirty misses; % of dirty misses (0 to 60) vs. bus cycles (1 to 16)]
DCPI: continuous profiling infrastructure

Developed by SRC and WRL researchers
Based on periodic sampling:
  hardware generates periodic interrupts
  OS handles the interrupts and stores data: program counter (PC) and any extra info
  analysis tools convert the data for users and for compilers
Other examples: SGI Speedshop, Unix's prof(), VTune
Sampling vs. Instrumentation

Much lower overhead than instrumentation:
  DCPI: program 1%-3% slower
  Pixie: program 2-3 times slower
Applicable to large workloads:
  100,000 TPS on Alpha
  AltaVista
Easier to apply to whole systems (kernel, device drivers, shared libraries, ...):
  instrumenting kernels is very tricky
No source code needed
Information from Profiles

DCPI estimates:
  where CPU cycles went, broken down by image, procedure, instruction
  how often code was executed
    – basic blocks and CFG edges
  where peak performance was lost and why
Example: Getting the Big Picture

Total samples for event type cycles = 6095201

  cycles      %     cum%  load file
 2257103  37.03%  37.03%  /usr/shlib/X11/lib_dec_ffb_ev5.so
 1658462  27.21%  64.24%  /vmunix
  928318  15.23%  79.47%  /usr/shlib/X11/libmi.so
  650299  10.67%  90.14%  /usr/shlib/X11/libos.so

  cycles      %     cum%  procedure              load file
 2064143  33.87%  33.87%  ffb8ZeroPolyArc        /usr/shlib/X11/lib_dec_ffb_ev5.so
  517464   8.49%  42.35%  ReadRequestFromClient  /usr/shlib/X11/libos.so
  305072   5.01%  47.36%  miCreateETandAET       /usr/shlib/X11/libmi.so
  271158   4.45%  51.81%  miZeroArcSetup         /usr/shlib/X11/libmi.so
  245450   4.03%  55.84%  bcopy                  /vmunix
  209835   3.44%  59.28%  Dispatch               /usr/shlib/X11/libdix.so
  186413   3.06%  62.34%  ffb8FillPolygon        /usr/shlib/X11/lib_dec_ffb_ev5.so
  170723   2.80%  65.14%  in_checksum            /vmunix
  161326   2.65%  67.78%  miInsertEdgeInET       /usr/shlib/X11/libmi.so
  133768   2.19%  69.98%  miX1Y1X2Y2InRegion     /usr/shlib/X11/libmi.so
Example: Using the Microscope

   Address  Instruction       Samples  Culprits   CPI
   9618     addq s0,t6,t6         643             1.0 cycles
D  961c     ldl  t4,0(t6)        2111     9618    3.5 cycles
di 9620     xor  t4,t12,t5      14152     961c   21.0 cycles
   9624     beq  0x963c             0             0.0 cycles

Annotations: a = data dep on 1st operand, b = data dep on 2nd operand,
D = DTLB miss, d = d-cache miss, i = i-cache miss

Where peak performance is lost and why
Example: Summarizing Stalls

I-cache (not ITB)     0.0% to  0.3%
ITB/I-cache miss      0.0% to  0.0%
D-cache miss         27.9% to 27.9%
DTB miss              9.2% to 18.3%
Write buffer          0.0% to  6.3%
Synchronization       0.0% to  0.0%
Branch mispredict     0.0% to  2.6%
IMUL busy             0.0% to  0.0%
FDIV busy             0.0% to  0.0%
Other                 0.0% to  0.0%
Unexplained stall     2.3% to  2.3%
Unexplained gain     -4.3% to -4.3%
-------------------------------------------------------------
Subtotal dynamic     44.1%

Slotting              1.8%
Ra dependency         2.0%
Rb dependency         1.0%
Rc dependency         0.0%
FU dependency         0.0%
-------------------------------------------------------------
Subtotal static       4.8%
-------------------------------------------------------------
Total stall          48.9%
Execution            51.2%
Net sampling error   -0.1%
-------------------------------------------------------------
Total tallied       100.0%  (35171, 93.1% of all samples)
Example: Sorting Stalls

    %   cum%  cycles  cnt   cpi  blame   PC    file:line
10.0%  10.0%  109885  4998  22.0 dcache  957c  comp.c:484
 9.9%  19.8%  108776  5513  19.7 dcache  9530  comp.c:477
 7.8%  27.6%   85668  3836  22.3 dcache  959c  comp.c:488
Typical Hardware Support

Timers
  clock interrupt after N units of time
Performance counters
  interrupt after N cycles, issues, loads, L1 Dcache misses, branch mispredicts, uops retired, ...
  Alpha 21064, 21164; PPro, PII; ...
  easy to measure total cycles, issues, CPI, etc.
Only extra information is the restart PC
Problem: Inaccurate Attribution

Experiment:
  count data loads
  loop: single load + hundreds of nops
In-order processor (Alpha 21164): skew, one large peak
Out-of-order processor (Intel Pentium Pro): skew, smear

[Figure: histograms of restart PCs over the loop; the 21164 concentrates essentially all samples (782) on one instruction at a fixed skew from the load, while the Pentium Pro smears samples across many instructions]
Ramification of Misattribution

No skew or smear:
  instruction-level analysis is easy!
Skew is a constant number of cycles:
  instruction-level analysis is possible
  adjust sampling period by the amount of skew
  infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program
Smear:
  instruction-level analysis seems hopeless
  examples: PII, StrongARM
Desired Hardware Support

Sample fetched instructions
Save PC of sampled instruction
  e.g., interrupt handler reads an Internal Processor Register
Makes skew and smear irrelevant
Gather more information
ProfileMe: Instruction-Centric Profiling

[Diagram: pipeline stages fetch -> map -> issue -> exec -> retire, with the icache, branch predictor, dcache, and arith units. On fetch-counter overflow, one fetched instruction is randomly selected and tagged ("ProfileMe tag!"); as the tagged instruction flows down the pipe, internal processor registers capture its pc, effective address, retired? and miss? flags, branch history and mispredict (mp?), and per-stage latencies; an interrupt delivers the record once the instruction is done]
Instruction-Level Statistics

PC + Retire Status -> execution frequency
PC + Cache Miss Flag -> cache miss rates
PC + Branch Mispredict -> mispredict rates
PC + Event Flag -> event rates
PC + Branch Direction -> edge frequencies
PC + Branch History -> path execution rates
PC + Latency -> instruction stalls
  "100-cycle dcache miss" vs. "dcache miss"
Data Analysis

[Diagram: compiled code + samples -> analysis -> frequency, cycles per instruction, stall explanations]

Cycle samples are proportional to total time at the head of the issue queue (at least on in-order Alphas)
Frequency indicates frequent paths
CPI indicates stalls
Estimating Frequency from Samples

[Example: 1,000,000 cycle samples is ambiguous: 1,000,000 executions at 1 CPI, or 10,000 executions at 100 CPI?]

Problem: given cycle samples, compute frequency and CPI
Approach:
  let F = Frequency / Sampling Period
  E(Cycle Samples) = F x CPI
  so ... F = E(Cycle Samples) / CPI
Estimating Frequency (cont.)

F = E(Cycle Samples) / CPI
Idea:
  if there is no dynamic stall, then CPI is known, so F can be estimated
  so ... assume some instructions have no dynamic stalls
Consider a group of instructions with the same frequency (e.g., a basic block)
Identify instructions w/o dynamic stalls; then average their sample counts for better accuracy
Key insight: instructions without stalls have smaller sample counts
Estimating Frequency (Example)

Address  Instruction          Samples  MinCPI  Samples/MinCPI
9600     subl s6, a1, s6          792       1      792
9604     lda a3, 16411(s6)        611       1      611
9608     cmovlt s6, a3, s6        649       1      649
960c     bis zero, zero, s3         0       0              Estimate 630
9610     sll s6, 0x5, t6         1389       2      695     (Actual 615)
9614     addl zero, t6, t6        616       1      616
9618     addq s0, t6, t6          643       1      643
961c     ldl t4, 0(t6)           2111       1     2111
9620     xor t4, t12, t5        13152       2     6576
9624     beq t5, 963c               0       0

Compute MinCPI from the code
Compute Samples/MinCPI
Select data to average

Does badly when:
  few issue points
  all issue points stall
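The select-and-average step can be sketched as follows. The "average the smaller half of the Samples/MinCPI ratios" rule is our simplification; DCPI's actual selection of stall-free issue points is more refined:

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Estimate a basic block's execution frequency (in samples) from per-
 * instruction cycle samples and statically computed minimum CPIs. */
double estimate_frequency(const long *samples, const int *min_cpi, int n)
{
    double ratio[64];
    int m = 0;
    for (int i = 0; i < n; i++)
        if (min_cpi[i] > 0)                 /* skip zero-MinCPI slots */
            ratio[m++] = (double)samples[i] / min_cpi[i];
    qsort(ratio, m, sizeof(double), cmp_double);
    /* assume the instructions with the smallest ratios issued stall-free */
    int k = m / 2;
    double sum = 0.0;
    for (int i = 0; i < k; i++)
        sum += ratio[i];
    return k ? sum / k : 0.0;
}
```

On the table's numbers this averages 611, 616, 643, and 649, giving roughly 630, close to the slide's estimate; it inherits the same failure modes (few issue points, or all issue points stalling).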
Frequency Estimate Accuracy

Compare frequency estimates for blocks to measured values obtained with a pixie-like tool
Explaining Stalls

Static stalls:
  schedule instructions in each basic block optimistically using a detailed pipeline model for the processor
Dynamic stalls:
  start with all possible explanations
    – I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
  rule out unlikely explanations
  list the remaining possibilities
Ruling Out D-cache Misses

Is the previous occurrence of an operand register the destination of a load instruction?
  search backward across basic block boundaries
  prune by block and edge execution frequencies

    ldq  t0,0(s1)            addq t3,t0,t4
        ...           OR         ...
    subq t0,t1,t2            subq t0,t1,t2
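The backward search can be sketched over a toy instruction encoding (ours, not DCPI's), simplified to look for the previous writer of the register; a real implementation also walks across basic-block boundaries and prunes low-frequency paths:

```c
/* Toy instruction record: whether it is a load, and which register it
 * writes (-1 if none). */
typedef struct {
    int is_load;
    int dst;
} Inst;

/* Walk backward from instruction i looking for the previous writer of
 * 'reg'. A d-cache miss remains a possible stall explanation only if that
 * producer is a load; if the producer lies outside the window searched,
 * the miss cannot be ruled out. */
int dcache_miss_possible(const Inst *code, int i, int reg)
{
    for (int j = i - 1; j >= 0; j--)
        if (code[j].dst == reg)
            return code[j].is_load;
    return 1;
}
```

This mirrors the two cases above: if the stalled subq's t0 was produced by ldq t0,0(s1), keep d-cache miss as a candidate; if it came from an arithmetic instruction, rule the miss out.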
DCPI wrap-up

Very precise, non-intrusive profiling tool
Gathers both user-level and kernel profiles
Relates architectural events back to the original code
Used for profile-based code optimizations
Simulation of commercial workloads

Requires scaling down
Options:
  trace-driven simulation
  user-level execution-driven simulation
  complete machine simulation
Trace-driven simulation

Methodology:
  create an ATOM instrumentation tool that logs a complete trace per Oracle server process
    – instruction path
    – data accesses
    – synchronization accesses
    – system calls
  run the "atomized" version to derive the trace
  feed the traces to the simulator
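A sketch of what one per-process trace record might look like; the record layout and field names are our assumptions, not the actual WRL tool's format. The analysis procedure called from the instrumented binary just appends fixed-size binary records:

```c
#include <stdio.h>
#include <stdint.h>

typedef enum { TR_INST, TR_LOAD, TR_STORE, TR_LOCK, TR_UNLOCK, TR_SYSCALL } TraceKind;

/* One trace event: what happened, where in the code, and on what address
 * (effective address for loads/stores, lock address, or syscall number). */
typedef struct {
    uint8_t  kind;
    uint64_t pc;
    uint64_t addr;
} TraceRec;

/* Analysis procedure called from the instrumented binary: keep it cheap,
 * one raw append per event, into that process's trace file. */
void log_event(FILE *f, TraceKind kind, uint64_t pc, uint64_t addr)
{
    TraceRec r = { (uint8_t)kind, pc, addr };
    fwrite(&r, sizeof r, 1, f);
}
```

With ATOM, `log_event` would be registered via AddCallProto and attached with AddCallInst to loads, stores, lock/unlock routines, and syscall stubs, one trace file per server process.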
Trace-driven studies: limitations

No OS activity (in OLTP the OS takes 10-15% of the time)
Traces only selected processes (e.g., server processes)
Time dilation alters system behavior:
  I/O looks faster
  many places with hardwired timeout values have to be patched
Capturing synchronization correctly is difficult:
  need to reproduce correct concurrency for shared data structures
  DB has complex synchronization structure, many levels of procedures
59 UPC, February 1999
Trace-driven studies: limitations (2)
Scheduling traces onto simulated processors
– need enough information in the trace to reproduce OS scheduling
– need to suspend processes for I/O & other blocking operations
– need to model the activity of background processes that are not traced (e.g. the log writer)
Re-create the OS virtual-physical mapping and page coloring scheme
Very difficult to simulate wrong-path execution
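The scheduling problem can be made concrete with a toy replayer that interleaves per-process traces onto a fixed number of simulated CPUs and suspends a process across a blocking operation. Everything here (the event format, the `schedule` function, the uniform one-cycle quantum) is a simplifying assumption for illustration, not how any production trace-driven simulator works.

```python
# Toy scheduler for per-process traces: processes run one event per quantum,
# block for a fixed simulated latency on "io" events, and wake in time order.
import heapq

def schedule(traces, num_cpus, io_latency=1000):
    """traces: {pid: list of 'op'/'io' events}. Returns per-pid finish times."""
    ready = list(traces.keys())            # runnable processes
    blocked = []                           # min-heap of (wakeup_time, pid)
    pos = {pid: 0 for pid in traces}
    clock, finish = 0, {}
    while ready or blocked:
        # Wake processes whose simulated I/O has completed (or jump the clock
        # forward when every process is blocked).
        while blocked and (not ready or blocked[0][0] <= clock):
            t, pid = heapq.heappop(blocked)
            clock = max(clock, t)
            ready.append(pid)
        running, ready = ready[:num_cpus], ready[num_cpus:]
        clock += 1                         # one simulated quantum
        for pid in running:
            ev = traces[pid][pos[pid]]
            pos[pid] += 1
            if pos[pid] == len(traces[pid]):
                finish[pid] = clock
            elif ev == "io":
                heapq.heappush(blocked, (clock + io_latency, pid))
            else:
                ready.append(pid)
    return finish

# Usage: one CPU, two processes; pid 2 blocks on I/O after its first event.
print(schedule({1: ["op", "op"], 2: ["io", "op"]}, num_cpus=1, io_latency=5))  # {1: 3, 2: 8}
```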
User-level execution-driven simulator
Our approach was to modify AINT (MINT for Alpha)
Problems:
– no OS activity measured
– Oracle/OS interactions are very complex
– the OS system call interface has to be virtualized, and that’s a hard one to crack…
Our status:
– Oracle/TPC-B ran with 1 server process only
– we gave up...
Complete machine simulator
Bite the bullet: model the machine at the hardware level
The good news is:
– the hardware interface is cleaner & better documented than any software interface (including the OS)
– all software JUST RUNS!! Including the OS
– applications don’t have to be ported to the simulator
We ported SimOS (from Stanford) to Alpha
SimOS
A complete machine simulator
– speed-detail tradeoff for maximum flexibility
– flexible data collection and classification
Originally developed at Stanford University (MIPS ISA)
SimOS-Alpha effort started at WRL in Fall 1996
– Ed Bugnion, Luiz Barroso, Kourosh Gharachorloo, Ben Verghese, Basem Nayfeh, and Jamey Hicks (CRL)
SimOS - Complete Machine Simulation
Models CPUs, caches, buses, memory, disks, network, …
Complete enough to run the OS and any applications
[Diagram: workloads (Pmake, Oracle, VCS) run on the operating system of the simulated machine, which runs on the SimOS hardware models (CPU/MMU, caches, memory system, disks, Ethernet, TTY) hosted on the host machine]
Multiple Levels of Detail
Tradeoff between the speed of simulation and the amount of detail that is simulated
Multiple modes of CPU simulation
Fast “on-the-fly compilation”: 10X slowdown!
– workload placement
Simple pipeline emulator, no caches: 50-100X slowdown
– rough characterization
Simple pipeline emulator, full cache simulation: 100-200X slowdown
– more accurate characterization of workloads
Multiple Models for each Component
Multiple models for CPU, cache, memory, and disk
CPU
– simple pipeline emulator: 100-200X slowdown (EV5)
– dynamically-scheduled processor: 1000-10000X slowdown (e.g. 21264)
Caches
– two-level set-associative caches
– shared caches
Memory
– perfect (0-latency), bus-based (Tlaser), NUMA (Wildfire)
Disk
– fixed latency or a more complex HP disk model
Modular: add your own flavors
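A minimal sketch of the “modular, add your own flavors” idea, assuming a single `request_latency` interface: two interchangeable disk models sitting at different points on the speed/detail curve. The class names and latency numbers are illustrative, not the SimOS models.

```python
# Two disk models behind one interface; the rest of the simulator only
# sees request_latency(), so models can be swapped freely. Illustrative only.

class FixedLatencyDisk:
    """Cheapest model: every request costs the same."""
    def __init__(self, latency_us=10000):
        self.latency_us = latency_us
    def request_latency(self, block):
        return self.latency_us

class SeekAwareDisk:
    """Slightly more detailed: latency grows with seek distance."""
    def __init__(self, base_us=2000, per_block_us=2):
        self.base_us, self.per_block_us = base_us, per_block_us
        self.head = 0
    def request_latency(self, block):
        seek = abs(block - self.head)
        self.head = block
        return self.base_us + seek * self.per_block_us

def run_trace(disk, blocks):
    # Total simulated I/O time for a sequence of block requests.
    return sum(disk.request_latency(b) for b in blocks)

blocks = [0, 5000, 5001]
print(run_trace(FixedLatencyDisk(), blocks))   # 30000
print(run_trace(SeekAwareDisk(), blocks))      # 16002: big seek dominates
```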
Checkpoint and Sampling
Checkpoint capability for the entire machine state
– CPU state, main memory, and disk changes
Important for positioning the workload for detailed simulation
Switching detail level in a “sampling” study
– run in faster modes, sample in more detailed modes
Repeatability
Change parameters for studies
– cache size
– memory type and latencies
– disk models and latencies
– many others
Debugging race conditions
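The sampling idea can be sketched as a driver that alternates between a cheap mode and a detailed mode, paying the detailed slowdown on only a fraction of the instructions. The `simulate` callback and its 10X/200X cost model are assumptions chosen to match the slowdown figures quoted earlier, not SimOS internals.

```python
# Sampling driver: run mostly in a fast mode, periodically drop into a
# detailed mode, and collect detailed statistics only from those samples.

def sampled_run(total_insts, fast_chunk, detail_chunk, simulate):
    """simulate(mode, n) -> (sim_seconds, stats); alternate the two modes."""
    executed, wall, detail_stats = 0, 0.0, 0
    while executed < total_insts:
        for mode, chunk in (("fast", fast_chunk), ("detailed", detail_chunk)):
            n = min(chunk, total_insts - executed)
            secs, stats = simulate(mode, n)
            wall += secs
            executed += n
            if mode == "detailed":
                detail_stats += stats
            if executed >= total_insts:
                break
    return wall, detail_stats

# Toy cost model: fast mode is a 10X slowdown, detailed 200X,
# over a hypothetical 1e9 inst/s native machine.
def simulate(mode, n):
    slowdown = 10 if mode == "fast" else 200
    return n / 1e9 * slowdown, (n // 1000 if mode == "detailed" else 0)

wall, stats = sampled_run(1_000_000, 90_000, 10_000, simulate)
print(round(wall, 6), stats)   # far cheaper than the 0.2 s an all-detailed run would cost
```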
Data Collection and Classification
Exploits the visibility and non-intrusiveness offered by simulation
– can observe low-level events such as cache misses, references, and TLB misses
Tcl-based configuration and control provides ease of use
Powerful annotation mechanism for triggering on events
– hardware, OS, or application events
Apps and mechanisms to organize and classify data
– some already provided (cache miss counts and classification)
– mechanisms to do more (timing trees and detail tables)
Easy configuration
Tcl-based configuration of the machine parameters
Example:

set PARAM(CPU.Model) DELTA
set detailLevel 1
set PARAM(CPU.Clock) 1000
set PARAM(CPU.Count) 4
set PARAM(CACHE.2Level.L2Size) 1024
set PARAM(CACHE.2Level.L2Line) 64
set PARAM(CACHE.2Level.L2HitTime) 15
set PARAM(MEMSYS.MemSize) 1024
set PARAM(MEMSYS.Numa.NumMemories) $PARAM(CPU.Count)
set PARAM(MEMSYS.Model) Numa
set PARAM(DISK.Fixed.Latency) 10
Annotations - The building block
Small procedures to be run on encountering certain events
– PC, hardware events (cache miss, TLB, …), simulator events

annotation set pc vmunix::idle_thread:START {
    set PROCESS($CPU) idle
    annotation exec osEvent startIdle
}
annotation set osEvent switchIn {
    log "$CYCLES ContextSwitch $CPU,$PID($CPU),$PROCESS($CPU)\n"
}
annotation set pc 0x12004ba90 {
    incr tpcbTOGO -1
    console "TRANSACTION $CYCLES togo=$tpcbTOGO \n"
    if {$tpcbTOGO == 0} {simosExit}
}
Example: Kernel Detail (TPC-B)
[Pie chart: breakdown of kernel time among SYS_read, SYS_write, SYS_pid_block, SYS_pid_unblock, lock, Int_clock, Int_IPI, Int_IO, DTLB, ITLB, 2XTLB, MM_FOW, and Other; the three largest shares are roughly 30%, 21%, and 17%]
SimOS Methodology
Configure and tune the workload on an existing machine
– build the database schema, create indexes, load data, optimize queries
– more difficult if the simulated system is much different from the existing platform
Create file(s) with a disk image (dd) of the database disk(s)
– write-protect the “dd” files to prevent permanent modification (i.e. use copy-on-write)
– optionally, umount the disks and let SimOS use them as raw devices
Configure SimOS to see the “dd” files as raw disks
“Boot” a SimOS configuration and mount the disks
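The copy-on-write step can be illustrated with a tiny block-device wrapper: reads fall through to the write-protected image, while writes land in a private overlay, so the “dd” file of the database disk is never modified. `CowDisk` and its interface are hypothetical, purely for illustration.

```python
# Copy-on-write over a write-protected disk image: the base image is read-only;
# modified blocks live in an in-memory overlay keyed by block number.

class CowDisk:
    def __init__(self, image_bytes, block_size=512):
        self.image = image_bytes          # stands in for the read-only dd file
        self.bs = block_size
        self.overlay = {}                 # block number -> modified data

    def read(self, block):
        if block in self.overlay:
            return self.overlay[block]
        off = block * self.bs
        return self.image[off:off + self.bs]

    def write(self, block, data):
        assert len(data) == self.bs
        self.overlay[block] = data        # the base image stays pristine

# Usage: write one block; only the overlay changes.
image = bytes(4096)                       # 8 zero-filled 512-byte blocks
disk = CowDisk(image)
disk.write(3, b"\xff" * 512)
```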
SimOS Methodology (2)
Boot and start up the database engine in “fast mode”
Start up the workload
When in steady state: create a checkpoint and exit
Resume from the checkpoint with a more complex (slower) simulator
Sample NUMA TPC-B Profile
Running from a Checkpoint
What can be changed:
– processor model
– disk model
– cache sizes, hierarchy, organization, replacement
– how long to run the simulation
What cannot be changed:
– number of processors
– size of physical memory
Tools wrap-up
No single tool will get the job done
Monitoring application execution on a real system is invaluable
Complete machine simulation advantages:
– see the whole thing
– portability of software is a non-issue
– speed/detail trade-off is essential for detailed studies