BENCHMARK
INSTRUMENTATION
Umit Cavus BUYUKSAHIN Measurements Tools & Techniques, Spring ‘12
4/17/2012
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
NAS Benchmark Suite
• NAS Benchmark Suite ... is a set of benchmarks
... that evaluates the performance of highly parallel supercomputers
... developed and maintained by the NASA Advanced Supercomputing (NAS) Division.
NAS Benchmark Suite
• NAS Kernel Applications
• IS - Integer Sort
• EP - Embarrassingly Parallel
• CG - Conjugate Gradient
• MG - Multi-Grid
• FT - discrete 3D fast Fourier Transform
• Problem Sizes
• S : small size
• W : workstation size
• A, B, C : standard test sizes; ~4X larger in increasing order
• D, E, F : large test sizes; ~16X larger in increasing order
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Experiments
• NAS Parallel Benchmark version 3.2.1
• IS Kernel Application: ... sorts N integer keys in parallel (a minimal sorting sketch follows this list).
... tests
• integer computation speed
• communication performance
• S Problem Size: ... small, for quick test purposes
... has 2^16 keys
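A minimal sketch of where the integer computation comes from; this is not the NPB IS source, only a serial counting sort over a bounded key range. NUM_KEYS follows the class-S size quoted above (2^16), while MAX_KEY is an illustrative assumption. In the real kernel the keys are partitioned into buckets across MPI processes and exchanged between ranks, which is what exercises communication performance.

#include <stdio.h>
#include <stdlib.h>

/* Serial counting-sort sketch of the integer work IS performs.
 * NUM_KEYS follows the class-S size; MAX_KEY is an assumption. */
#define NUM_KEYS (1 << 16)
#define MAX_KEY  (1 << 11)

int main(void) {
    int *key   = malloc(NUM_KEYS * sizeof *key);
    int *count = calloc(MAX_KEY, sizeof *count);

    for (int i = 0; i < NUM_KEYS; i++)   /* generate pseudo-random keys */
        key[i] = rand() % MAX_KEY;

    for (int i = 0; i < NUM_KEYS; i++)   /* histogram the key values    */
        count[key[i]]++;

    for (int k = 1; k < MAX_KEY; k++)    /* prefix sum gives key ranks  */
        count[k] += count[k - 1];

    printf("keys equal to 0: %d\n", count[0]);
    free(key);
    free(count);
    return 0;
}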
Experiments
• IS Benchmarking Procedure (generally; a minimal timing sketch follows this list)
1. Generating a sequence of N keys
2. Loading the N keys into the memory system
3. Time begins
4. Loop
Sorting & partial verification
5. Time ends
6. Full verification
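A minimal sketch of how the timed region (steps 3–5) can be bracketed in an MPI code; the real NPB harness uses its own timer routines, and the sorting/verification body is only a stub here.

#include <stdio.h>
#include <mpi.h>

/* Stub standing in for one iteration of sorting + partial verification. */
static void sort_and_partial_verify(int iteration) { (void)iteration; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);              /* step 3: time begins */
    double t0 = MPI_Wtime();

    for (int it = 0; it < 10; it++)           /* step 4: timed loop  */
        sort_and_partial_verify(it);

    double local = MPI_Wtime() - t0;          /* step 5: time ends   */
    double maxt;
    MPI_Reduce(&local, &maxt, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("benchmarking time: %.6f s\n", maxt);

    MPI_Finalize();
    return 0;
}

Taking the maximum over ranks is one reasonable way to report the parallel time; the full verification (step 6) stays outside the timed region.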
Experiments
Machines:
• My Computer
i686 GNU/Linux
3 GB RAM
2 CPUs at 800 MHz
• Boada
x86_64 GNU/Linux
24 GB RAM
24 CPUs at 1596 MHz
Experiments
Procedure:
• The benchmark is not manually instrumented.
• Paraver traces are generated automatically.
• LD_PRELOAD is exported to inject the tracing library (see the sketch after this list).
• Benchmarks are executed with 2, 4, 8, 16, 32, and 64 processors.
• Benchmark results are analyzed.
• Generated traces are examined with the Paraver tools.
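A sketch of the mechanism behind the automatic tracing: an LD_PRELOAD-ed shared library can interpose on MPI calls and record events. This is not the code of the tracing library that actually produced the Paraver traces; it only illustrates why exporting LD_PRELOAD is enough to instrument the unmodified benchmark.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <mpi.h>

/* Wrapper with the same signature as MPI_Send: records a timestamped
 * event, then forwards to the real MPI_Send found via RTLD_NEXT. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    static int (*real_send)(const void *, int, MPI_Datatype,
                            int, int, MPI_Comm) = NULL;
    if (!real_send)
        real_send = (int (*)(const void *, int, MPI_Datatype,
                             int, int, MPI_Comm))dlsym(RTLD_NEXT, "MPI_Send");

    double t = MPI_Wtime();
    int rc = real_send(buf, count, type, dest, tag, comm);
    fprintf(stderr, "MPI_Send to rank %d took %.6f s\n",
            dest, MPI_Wtime() - t);
    return rc;
}

Built as a shared object (e.g. with -shared -fPIC) and exported through LD_PRELOAD, such a library wraps every matching call the benchmark makes, without recompiling it.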
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Paraver Visualization – Code View
• My Computer
• Boada
Paraver Visualization – Communication
• My Computer
• Boada
Paraver Visualization – Disk I/O
• My Computer
• Boada
Paraver Visualization – Load Balance
• My Computer
Paraver Visualization – Load Balance
• Boada
Paraver Visualization – L1D Cache Miss
• My Computer
Paraver Visualization – L1D Cache Miss
• Boada
Paraver Visualization – CPI
• My Computer
Paraver Visualization – CPI
• Boada
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Execution Time
[Figure: Execution time (ms) vs. number of processors (2, 4, 8, 16, 32, 64) for MyComputer and Boada]
Execution Time
• Relative Speedup = Execution Time of MyComputer / Execution Time of Boada
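Written out, the relative speedup compares the two machines at the same processor count; the per-machine speedup curves shown later presumably use the conventional definition against the single-processor run:

\[
  S_{\mathrm{rel}}(p) = \frac{T_{\mathrm{MyComputer}}(p)}{T_{\mathrm{Boada}}(p)},
  \qquad
  S(p) = \frac{T(1)}{T(p)},
\]

where \(T(p)\) is the measured time on \(p\) processors.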
[Figure: Relative speedup of execution time (MyComputer / Boada) vs. number of processors (1–64)]
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Benchmarking Time - reminder
• IS Benchmarking Procedure (generally)
1. Generating a sequence of N keys
2. Loading the N keys into the memory system
3. Time begins
4. Loop
Sorting & partial verification
5. Time ends
6. Full verification
• Benchmarking Time = execution time of the parallel
algorithm
Benchmarking Time
[Figure: Benchmarking time (sec) vs. number of processors (1–64) for MyComputer and Boada]
Benchmarking Time
• Relative Speedup = Benchmarking Time of MyComputer / Benchmarking Time of Boada
[Figure: Relative speedup of benchmarking time (MyComputer / Boada) vs. number of processors (1–64)]
Benchmarking Time
• SpeedUp of My Computer
[Figure: Benchmarking-time speedup of MyComputer vs. number of processors (1–64)]
Benchmarking Time
• SpeedUp of Boada
[Figure: Benchmarking-time speedup of Boada vs. number of processors (1–64)]
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Conclusion
• IS application
• ... does not involve much communication.
• ... is based on computation and memory loading.
• ... has low cache-miss and high CPI values in the computation phase.
• NAS is designed for highly parallel supercomputers.
• MyComputer is inadequate to meet the requirements of NAS.
• MyComputer cannot speed up on this application.
• Boada speeds up until the number of processes reaches the number of processors it has.
• MyComputer saves less time on disk I/O operations.
• CPI values in Boada's computation phase are lower.
BENCHMARK
INSTRUMENTATION
Umit Cavus BUYUKSAHIN Measurements Tools & Techniques, Spring ‘12
4/17/2012