Benchmarking ECE 561
Sudeep Pasricha
Daniel Dauwe
1/9/2014
Presentation Outline
• Project Goals
• Tools for Benchmarking:
• Performance counters, PAPI,
• HPC Toolkit, Phoronix Test Suite,
• Power Measurement
• How testing was accomplished
• List of additional data points for application-to-processor affinity
• A simple continuation of Ryan’s test work
• Results from Memory/Cache Interference Testing for multiple applications run simultaneously, pinned to specific cores
Project Goals
• Benchmarking Processors
– Monitor both performance counters and the system's power usage
– Gather more data on application affinity for performance on a particular processor architecture
  • Memory Intensive Applications
  • CPU Intensive Applications
– Analyze the interaction/interference of multiple applications run simultaneously on different cores of the same processor
• This data collection is intermediate work for future, as yet unspecified, projects
Performance Counters and PAPI
• Performance counters
– Counters built into processor hardware that record the number of occurrences of user-specified events in hardware
• PAPI – Performance Application Programming Interface
– PAPI was developed in the hope of identifying bottlenecks in current architectural development of high performance computing
– A standardized list of performance counters available for most processors
– PAPI makes it easier to have consistent tests across multiple processor architectures
What do the Performance Counter Measurements mean?
• Can mean different things based on which counters are being monitored. Ex:
– PAPI_L1_DCA – Level 1 data cache accesses
– PAPI_FAD_INS – Floating point add instructions
– PAPI_L2_DCM – Level 2 data cache misses
• The raw count data provided by the Performance Counter will need to be meaningfully interpreted by the user
Matching Performance counters to Processor Architectures
• Performance Counters used for these tests:
– PAPI_TOT_INS – Total Instructions Executed
– PAPI_L2_TCM – Data and Instruction Level 2 Cache Misses
• These should be pretty universally available across different processor architectures
• Future inclusion of other tests may require other Performance Counters, but available Performance Counters vary greatly between processor architectures…
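Raw counts from the two counters above are easier to compare across runs once they are normalized. A minimal sketch, assuming the counts have already been collected and that misses per thousand instructions (MPKI) is an acceptable normalization; the function name and the sample numbers are illustrative, not values from these tests:

```python
def misses_per_kilo_instruction(l2_misses, total_instructions):
    """Return L2 cache misses per 1000 executed instructions (MPKI)."""
    if total_instructions <= 0:
        raise ValueError("total_instructions must be positive")
    return 1000.0 * l2_misses / total_instructions


# Illustrative counts only, keyed by the PAPI preset event names.
sample = {"PAPI_TOT_INS": 2_000_000, "PAPI_L2_TCM": 5_000}

mpki = misses_per_kilo_instruction(sample["PAPI_L2_TCM"], sample["PAPI_TOT_INS"])
print(f"L2 MPKI: {mpki:.2f}")
```

Normalizing by instruction count lets a short run and a long run of the same application be compared directly.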
HPC Toolkit
• “An Integrated suite of tools for measurement and analysis of program performance”
• Essentially:
– HPC Toolkit makes it easier to interface with the local machine's performance counters
– Makes collecting program performance data easier
Phoronix Test Suite
• Phoronix provides many test applications capable of testing many aspects of processor performance
– Phoronix tests are responsible for all of the benchmarking data gathered for this presentation
• However, many other groups write application suites useful for benchmarking
– SPEC CPU2000 / 2006
– PARSEC
• Several resources such as “OpenBenchmarking.org” provide a substantial amount of results from tests run from these suites on many processor architectures
– This could prove to be a useful resource; however, they do not include information about power usage
Applications used for testing Cross-Core cache interference
• C-Ray
– A Ray Tracing Program
– CPU Intensive
– Many Floating Point Calculation Operations
– Relatively Little Memory Access
• Ramspeed
– Integer and Floating Point Writes and Reads to memory
– Memory Intensive
– More interaction with the caches
Monitoring Power Usage
• “Watts Up? PRO” power meter
– Measures power consumption from a single standard power outlet
– Has a USB port to interface with a computer and dump recorded power measurements
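The later charts report energy, which has to be derived from the meter's power readings. One plausible way to do that is to integrate the logged samples over time. This is a sketch under assumptions: the log is taken to be a list of (time in seconds, power in watts) pairs, and the trapezoidal rule is chosen here for illustration rather than taken from the original scripts.

```python
def energy_joules(samples):
    """Integrate (time_s, watts) samples with the trapezoidal rule."""
    if len(samples) < 2:
        return 0.0
    total = 0.0
    # Walk over consecutive pairs of samples.
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by time")
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Hypothetical log: one reading per second, not measured data.
log = [(0.0, 20.0), (1.0, 30.0), (2.0, 30.0), (3.0, 20.0)]
print(f"Energy: {energy_joules(log):.1f} J")
```

Because the meter sits at the wall outlet, a figure computed this way covers the whole machine, not only the processor.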
How tests were run
• Minimalist Ubuntu Operating System allows the processor's attention to be dedicated to the test applications
– Terminal Based User Interface
– Unnecessary background processes not included in the operating system
• Power usage and selected performance counters are recorded and saved while the various test applications are run.
• For Testing Interference between programs:
– “taskset” was used to pin the applications to specific processor cores
– The applications were run concurrently, while performance counter results were measured
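The pinning step can be sketched as a small helper that builds a `taskset -c <core> <command>` invocation for each application. The helper name and the `./c-ray` and `./ramspeed` command strings are placeholders for illustration; the actual binaries and arguments used in these tests are not given in the slides.

```python
import shlex


def pinned_command(core, command):
    """Build an argv list that runs `command` pinned to one core via taskset."""
    if core < 0:
        raise ValueError("core index must be non-negative")
    return ["taskset", "-c", str(core)] + shlex.split(command)


# Hypothetical command strings; one application per core.
jobs = [pinned_command(0, "./c-ray"), pinned_command(1, "./ramspeed")]
for argv in jobs:
    print(" ".join(argv))
```

Passing each list to `subprocess.Popen` would start the jobs without waiting for one another, which gives the concurrent run described above.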
Measuring Memory Interference between Applications
• How this is tested:
– Simultaneously pin different types of applications to run only on specific cores in the processor
– Then use performance counters and the power meter to measure the interference
• Interference could be defined as:
– An increase in the number of cache misses
– An increase in application execution time
– Possibly an increase in power consumption
• Test plan:
– Tests were run:
  • First on an AMD Turion II Dual-Core M520 Processor (2 cores, 5 P-states)
  • Later also on an Intel Pentium Dual Core CPU (2 cores, 4 P-states)
– Run control tests for each application alone (pinned to a single core)
– Run the tests together and analyze the differences
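Each of the three candidate definitions of interference comes down to comparing a co-run measurement against its control. A minimal sketch, assuming interference is reported as a percent increase over the control run; the metric names and numbers below are illustrative and are not results from these tests:

```python
def percent_increase(control, interfered):
    """Percent change of the co-run measurement relative to the control run."""
    if control <= 0:
        raise ValueError("control measurement must be positive")
    return 100.0 * (interfered - control) / control


# Illustrative (control, co-run) pairs for one application.
measurements = {
    "l2_cache_misses": (400.0, 500.0),
    "execution_time": (200.0, 210.0),
    "energy": (8000.0, 8000.0),
}

interference = {
    name: percent_increase(control, co_run)
    for name, (control, co_run) in measurements.items()
}
for name, pct in interference.items():
    print(f"{name}: {pct:+.1f}%")
```

A positive value means the metric got worse when a second application ran on the other core; zero means no measurable interference.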
Control Results: Intel Pentium dual CPU T2330
[Figures: six control-test charts for the Intel Pentium Dual Core, x-axis 0 to 3 (apparently the P-state):
• C-Ray L2 Cache Miss Control Results (CPU Control Test)
• C-Ray Execution Time Control Results (CPU Control Test)
• Ramspeed Execution Time Control Results (Memory Control Test)
• Ramspeed L2 Cache Miss Control Results (Memory Control Test)
• C-Ray Power Usage Control Results (CPU Control Energy)
• Ramspeed Power Usage Control Results (Memory Control Energy)]
Control Results: AMD Turion II Dual Core Mobile M520
[Figures: six control-test charts for the AMD Turion II Dual-Core, x-axis 0 to 4 (apparently the P-state):
• C-ray Execution Time Control Results (CPU Control Test)
• C-ray L2 Cache Miss Control Results (CPU Control Test)
• Ramspeed Execution Time Control Results (Memory Control Test)
• Ramspeed L2 Cache Miss Control Results (Memory Control Test)
• C-ray Power Usage Control Results (CPU Control Energy)
• Ramspeed Power Usage Control Results (Memory Control Energy)]
Taking a Closer Look at the AMD Control Results from the previous slide:
• It seems suspect that the control test should produce the same execution time across all P-states. This result for the C-Ray execution control test was consistent over multiple runs on the AMD Turion II processor, but a test execution on a secondary Intel Pentium Dual Core processor produced results that were closer to what seems realistic:
[Figures: three C-Ray execution time charts:
• C-Ray Execution Time (AMD First Run), x-axis 0 to 4; series: Control Test, Interference Test
• C-ray Execution Time (AMD Second Run), x-axis 0 to 4; series: CPU Control Test, CPU Interference Test
• C-ray Execution Time (Intel Run), x-axis 0 to 3; series: CPU Control Test, CPU Interference Test]
• The third column of data represents Adjusted interference results
Interference Results (Joint Pinning Results on C-Ray):
Intel Pentium dual CPU T2330
[Figures: three charts, x-axis 0 to 3 (apparently the P-state):
• C-ray Execution Time Interference (Ramspeed test on second core); series: CPU Control Test, Original CPU Interference Test, Adjusted CPU Interference Test
• C-ray L2 Cache Misses Interference (Ramspeed test on second core); series: CPU Control Test, Original CPU Interference Test, Adjusted CPU Interference Test
• Power usage for C-ray and Ramspeed tests run together; series: CPU Control Energy, 1 CPU and 1 Memory Interference Test Energy]
Interference Results (Joint Pinning Results on Ramspeed):
Intel Pentium dual CPU T2330
[Figures: two charts, x-axis 0 to 3 (apparently the P-state):
• Ramspeed Execution Time Interference (C-ray test on second core); series: Memory Control Test, Memory Interference Test
• Ramspeed L2 Cache Misses Interference (C-ray test on second core); series: Memory Control Test, Memory Interference Test]
Interference Results (2 CPU Intensive Application Pinning Results):
Intel Pentium dual CPU T2330
[Figures: three charts, x-axis 0 to 3 (apparently the P-state):
• C-ray Execution Time Interference (C-ray test on second core); series: CPU Control Test and two CPU Interference Test series
• C-ray L2 Cache Misses Interference (C-ray test on second core); series: CPU Control Test and two CPU Interference Test series
• Power usage for 2 C-ray tests running on separate cores; series: CPU Control Energy, 2 CPU Interference Test Energy]
Interference Results (2 Memory Intensive Application Pinning Results):
Intel Pentium dual CPU T2330
[Figures: three charts, x-axis 0 to 3 (apparently the P-state):
• Ramspeed Execution Time Interference (Ramspeed test on second core); series: Memory Control Test and two Memory Interference Test series
• Ramspeed L2 Cache Misses Interference (Ramspeed test on second core); series: Memory Control Test and two Memory Interference Test series
• Power usage for 2 Ramspeed tests running on separate cores; series: Memory Control Energy, 2 Memory Interference Test Energy]
Interference between simultaneous applications: Future Tests
The foundation scripts have been written, so in the future it will be very easy to add support for testing:
– Interference of 1 type of application pinned to N cores for a processor with a substantial number of cores (i.e., N > 2)
– Interference from 2 CPU intensive or 2 Memory intensive test applications
– Measure memory interference with M applications mapped to N cores (N > 2)
– Testing a larger sample size might produce more interesting results
– Find which application-to-core mappings can provide the best performance for specific architectures/cache sizes
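Searching for the best application-to-core mapping starts with listing the candidate mappings. A minimal sketch, assuming each application is pinned to its own distinct core; the function name and the application labels are hypothetical, and the original foundation scripts may enumerate mappings differently:

```python
from itertools import permutations


def core_mappings(apps, n_cores):
    """Yield every assignment of apps to distinct cores as {app: core}."""
    if len(apps) > n_cores:
        raise ValueError("need at least one core per application")
    # Each permutation picks len(apps) distinct cores, in order.
    for cores in permutations(range(n_cores), len(apps)):
        yield dict(zip(apps, cores))


# Two applications on a hypothetical 4-core processor.
mappings = list(core_mappings(["c-ray", "ramspeed"], 4))
print(len(mappings), "candidate mappings")
print(mappings[0])
```

The number of candidates grows quickly with M and N, so on processors where some cores share a cache it may be enough to test one mapping per distinct sharing pattern.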
Thank You For Your Attention