Accuracy of Performance Monitoring Hardware
Michael E. Maxwell, Patricia J. Teller, and Leonardo Salayandia
University of Texas-El Paso
and
Shirley Moore
University of Tennessee-Knoxville
PCAT - The University of Texas at El Paso
PCAT Team
Dr. Patricia Teller
Alonso Bayona - Undergraduate
Alexander Sainz - Undergraduate
Trevor Morgan - Undergraduate
Leonardo Salayandia - M.S. Student
Michael Maxwell - Ph.D. Student
Credits (Financial)
DoD PET Program
NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
UTEP Dodson Endowment
Motivation
Facilitate performance-tuning efforts that employ aggregate event counts
When possible provide calibration data
Identify unexpected results, errors
Clarify misunderstandings of processor functionality
Road Map
Scope of Research
Methodology
Results
Future Work and Conclusions
Processors Under Study
MIPS R10K and R12K: 2 counters, 32 events
IBM Power3: 8 counters, 100+ events

Events Studied So Far
Number of load and store instructions executed
Number of floating-point instructions executed
Total number of instructions executed (issued/committed)
Number of L1 I-cache and L1 D-cache misses
Number of L2 cache misses
Number of TLB misses
Number of branch mispredictions
PAPI Overhead
Extra instructions
Read counter before and after workload
Processing of counter overflow
via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
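The repeated-run summary described above (100 instrumented runs, then mean and standard deviation) might be computed roughly as in the following sketch. It assumes the per-run deltas (counter value after the workload minus counter value before it) have already been collected, for example through PAPI; the names `count_stats`, `summarize_counts`, and `newton_sqrt` are hypothetical, and the choice of the sample (n - 1) standard deviation is an assumption, since the slides do not say which form was used.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of repeated event-count measurements (hypothetical type). */
typedef struct {
    double mean;
    double stddev;   /* sample standard deviation, n - 1 denominator (assumed) */
} count_stats;

/* Newton iteration for the square root. It is used only so that this sketch
   has no libm link dependency; sqrt() from <math.h> is the usual choice. */
static double newton_sqrt(double x)
{
    if (x <= 0.0) return 0.0;
    double g = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 100; i++) g = 0.5 * (g + x / g);
    return g;
}

/* deltas[i] = counter value after run i minus counter value before run i. */
count_stats summarize_counts(const long long *deltas, size_t n)
{
    assert(deltas != NULL && n >= 2);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += (double)deltas[i];
    double mean = sum / (double)n;

    double squares = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)deltas[i] - mean;
        squares += d * d;
    }

    count_stats s = { mean, newton_sqrt(squares / (double)(n - 1)) };
    return s;
}
```

A small standard deviation relative to the mean suggests that the reported count is repeatable, which is a separate question from whether it matches the predicted count.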
Validation Micro-benchmark
Simple, usually small program
Stresses a portion of the microarchitecture or memory hierarchy
Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
Scalable w.r.t. granularity, i.e., number of generated events
Example – Loop Validation Micro-benchmark
for (i = 0; i < number_of_loops; i++) {
    sequence of 100 instructions with data dependencies that prevent compiler reordering or optimization
}
Used to stress a particular functional unit, e.g., the load/store unit
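A minimal C sketch of such a loop, targeting the load/store unit, could look like the following. It is an illustration under stated assumptions rather than the benchmark used in the study: it issues 10 dependent loads per iteration instead of 100 to stay short, and `OPS_PER_ITERATION`, `DEP_LOAD`, `predicted_loads`, and `run_load_loop` are hypothetical names. The prediction covers only the array loads; compiled code may add further loads and stores for loop bookkeeping, which is one reason the generated assembly is worth inspecting.

```c
#include <assert.h>

enum { OPS_PER_ITERATION = 10 };  /* the slide uses 100; 10 keeps the sketch short */

/* One dependent load: the next index is read from the array itself, so each
   load needs the result of the previous one. */
#define DEP_LOAD(idx, arr) ((idx) = (arr)[(idx)])

/* Number of array loads the loop below is expected to perform. */
long long predicted_loads(long long number_of_loops)
{
    assert(number_of_loops >= 0);
    return number_of_loops * OPS_PER_ITERATION;
}

/* Runs the loop and returns the final index so the result is observable. */
int run_load_loop(long long number_of_loops)
{
    /* Element i holds the index of the next element (a ring of 8 entries).
       volatile forces the compiler to perform every load. */
    static volatile int ring[8] = {1, 2, 3, 4, 5, 6, 7, 0};
    int idx = 0;

    for (long long i = 0; i < number_of_loops; i++) {
        DEP_LOAD(idx, ring); DEP_LOAD(idx, ring);
        DEP_LOAD(idx, ring); DEP_LOAD(idx, ring);
        DEP_LOAD(idx, ring); DEP_LOAD(idx, ring);
        DEP_LOAD(idx, ring); DEP_LOAD(idx, ring);
        DEP_LOAD(idx, ring); DEP_LOAD(idx, ring);
    }
    return idx;
}
```

Scaling `number_of_loops` changes the predicted count proportionally, which is what makes the benchmark usable at different granularities.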
Program designed to provide insight into microarchitecture organization and/or the algorithms that control it
Examples
Page size used – for TLB miss counts
Cache prefetch algorithm
Branch prediction buffer
But are they useful without knowing more about the algorithm used by the vendor?
Example 1: Total Data TLB Misses
Replacement policy can (unpredictably) affect event counts
PAPI may (unpredictably) affect event counts
Other processes may (unpredictably) affect event counts

Example 2: L1 D-Cache Misses
# misses relatively constant as # of array references increases
[Figure: L1 D cache misses using sequential access]

Example 3: L1 D-Cache Misses with Random Access
(Foil Prefetch Scheme used by Stream Buffers)
[Figure: L1 D cache misses as a function of % filled]
[Axes: % of cache filled (x, 0 to 300) vs. % error (y, -150 to 400); series: Power3, R12k, Pentium]
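One way to produce a random access pattern of the kind this example relies on is to visit each cache line of the array exactly once, in shuffled order, so that a sequential prefetcher has no regular stride to follow. The sketch below is an assumption about how such a benchmark could be written, not the code used in the study; `make_random_order` and `touch_lines` are hypothetical names, and the line size is passed in by the caller.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill order[0..n-1] with a random permutation of 0..n-1 (Fisher-Yates). */
void make_random_order(size_t *order, size_t n, unsigned seed)
{
    assert(order != NULL);
    for (size_t i = 0; i < n; i++) order[i] = i;

    srand(seed);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;   /* slight modulo bias is acceptable here */
        size_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}

/* Read one byte from each cache line of buf, in the given order. The sum is
   returned and buf is volatile so that the reads are not optimized away. */
unsigned long touch_lines(const volatile unsigned char *buf,
                          const size_t *order, size_t n_lines,
                          size_t line_bytes)
{
    assert(buf != NULL && order != NULL && line_bytes > 0);
    unsigned long sum = 0;
    for (size_t k = 0; k < n_lines; k++) sum += buf[order[k] * line_bytes];
    return sum;
}
```

If the array is not in the cache beforehand and nothing is prefetched, touching `n_lines` distinct lines would be expected to produce `n_lines` misses; choosing `n_lines * line_bytes` as a fraction of the cache size gives the "% of cache filled" axis.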
Example 4: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses
Reported Event Counts: Unexpected but Consistent Results
Predicted counts and reported counts differ significantly but in a consistent manner
Is this an error?
Are we missing something?
Example: Compulsory Data TLB Misses
% difference per no. of references
Reported counts are consistent

Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling
Conclusions
Performance counters provide informative data that can be used for performance tuning
Expected frequency of event may determine usefulness of event counts
Calibration data can make event counts more useful to
The usefulness of some event counts, as well as our research, could be enhanced with vendor collaboration
The usefulness of some event counts is questionable without documentation of the related behavior
Should we attach the following warning to some event counts on some platforms?
CAUTION: The values in the performance counters may be greater than you think.
And should we attach the PCAT Seal of Approval on others?
Invitation to Vendors
Help us understand what’s going on, when to attach the “warning”, and when to attach the “seal of approval.” Application programmers will appreciate your efforts and so will we!
Question to You
On-board Performance Counters: What do they really tell you?
With all the caveats, are they useful nonetheless?
Example 1: Total Compulsory Data TLB Misses for R10K
% difference per no. of references
Predicted values consistently lower than reported
Small standard deviations
Greater predictability with increased no. of references
[Figure: % difference (y, 3% to 15%) vs. no. of references (x, 1 to 10000)]
Example 1: Compulsory Data TLB Misses for Itanium
% difference per no. of references
Reported counts consistently ~5 times greater than predicted
[Figure: % difference (y, 399% to 404%) vs. no. of references (x, 1 to 10000)]
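The two figures on this slide agree with each other if "% difference" is taken relative to the predicted count: a reported count 5 times the prediction is a 400% difference. The sketch below spells out that reading, together with one simple prediction for compulsory data TLB misses (one miss per page touched). Both the formula and the per-page prediction are assumptions made for illustration, and the function names are hypothetical.

```c
#include <assert.h>

/* Predicted compulsory data TLB misses when every page of a contiguous,
   page-aligned array is touched once: one miss per page (assumed model). */
long long predicted_compulsory_tlb_misses(long long array_bytes,
                                          long long page_bytes)
{
    assert(array_bytes >= 0 && page_bytes > 0);
    return (array_bytes + page_bytes - 1) / page_bytes;   /* ceiling division */
}

/* Difference between reported and predicted counts, as a percentage of the
   predicted count (assumed definition of "% difference"). */
double percent_difference(long long reported, long long predicted)
{
    assert(predicted > 0);
    return 100.0 * (double)(reported - predicted) / (double)predicted;
}
```

For example, `percent_difference(500, 100)` evaluates to 400.0, matching "~5 times greater than predicted". The prediction depends on the page size actually in use, which is why the page size has to be known before TLB miss counts can be checked.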
Example 3: Compulsory Data TLB Misses for Power3
% difference per no. of references
Reported counts consistently ~5/~2 times greater than predicted for small/large counts
[Figure: Total TLB misses (Power3), % discrepancy (y, 150% to 550%) vs. no. of references (x, 1 to 10000)]
Example 3: L1 D-Cache Misses with Random Access – Itanium
only when at array size = 10x cache size
[Figure: L1 D cache misses as a function of % filled. Axes: % of cache filled (x, 0 to 300) vs. % error (y, -200 to 1600); series: Itanium, Power3, R12k, Pentium]
Example 2: L1 D-Cache Misses
On some of the processors studied, as the number of accesses increased, the miss rate approached 0
Accessing the array in strides of size two cache-size units plus one cache-line resulted in approximately the same event count as accessing the array in strides of one word
What’s going on?
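To see why the second observation is surprising, it can help to count misses in a very simple cache model. The sketch below simulates an initially empty direct-mapped cache with no prefetching; the geometry (32 KB cache, 32-byte lines, 4-byte words) and the name `count_misses` are hypothetical, chosen for illustration and not taken from any of the processors studied.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cache geometry, chosen only for illustration. */
enum {
    LINE_BYTES  = 32,
    CACHE_BYTES = 32 * 1024,
    NUM_LINES   = CACHE_BYTES / LINE_BYTES
};

/* Count the misses of an initially empty direct-mapped cache for n_accesses
   reads that start at address 0 and advance by stride_bytes each time. */
long count_misses(long n_accesses, long stride_bytes)
{
    assert(n_accesses >= 0 && stride_bytes > 0);

    int64_t resident[NUM_LINES];          /* memory line held by each cache line */
    for (int i = 0; i < NUM_LINES; i++) resident[i] = -1;   /* -1 marks empty */

    long misses = 0;
    for (long i = 0; i < n_accesses; i++) {
        int64_t addr = (int64_t)i * stride_bytes;
        int64_t line = addr / LINE_BYTES;          /* memory line number */
        int slot = (int)(line % NUM_LINES);        /* direct-mapped placement */
        if (resident[slot] != line) {
            resident[slot] = line;
            misses++;
        }
    }
    return misses;
}
```

In this model, 1024 accesses with a one-word stride touch 128 lines and miss 128 times, while 1024 accesses with a stride of two cache sizes plus one line land on a new line every time and miss 1024 times. A reported count that is about the same for both patterns therefore points to behavior outside this simple model, such as prefetching or a different counting rule.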
Example 2: R10K Floating-Point Division Instructions
a = init_value;
b = init_value;
c = init_value;
a = b / init_value;
b = a / init_value;
c = b / init_value;

a = init_value;
b = init_value;
c = init_value;
a = a / init_value;
b = b / init_value;
c = c / init_value;

1 FP Instruction Counted
3 FP Instructions Counted
Example 2: Assembler Code Analysis
No optimization
Same instructions
Different (expected) operands
Three division instructions in both
No reason for