BENCHMARK
INSTRUMENTATION
Umit Cavus BUYUKSAHIN Measurements Tools & Techniques, Spring ‘12
4/17/2012
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
NAS Benchmark Suite
• NAS Benchmark Suite ... is a set of benchmarks
... that evaluates the performance of highly parallel supercomputers
... developed and maintained by the NASA Advanced Supercomputing (NAS) Division.
NAS Benchmark Suite
• NAS Kernel Applications
• IS - Integer Sort
• EP - Embarrassingly Parallel
• CG - Conjugate Gradient
• MG - Multi-Grid
• FT - discrete 3D fast Fourier Transform
• Problem Sizes
• S : small size
• W : workstation size
• A, B, C : standard test sizes; ~4X larger in increasing order
• D, E, F : large test sizes; ~16X larger in increasing order
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Experiments
• NAS Parallel Benchmark version 3.2.1
• IS Kernel Application: ... sorts N integer keys in parallel (a minimal sorting sketch follows this list).
... tests
• integer computation speed
• communication performance
• S Problem Size: ... small, for quick test purposes
... has 2^16 keys
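A minimal sketch of where the integer computation comes from; this is not the NPB IS source, only a serial counting sort over a bounded key range. NUM_KEYS follows the class-S size quoted above (2^16), while MAX_KEY is an illustrative assumption. In the real kernel the keys are partitioned into buckets across MPI processes and exchanged between ranks, which is what exercises communication performance.

#include <stdio.h>
#include <stdlib.h>

/* Serial counting-sort sketch of the integer work IS performs.
 * NUM_KEYS follows the class-S size; MAX_KEY is an assumption. */
#define NUM_KEYS (1 << 16)
#define MAX_KEY  (1 << 11)

int main(void) {
    int *key   = malloc(NUM_KEYS * sizeof *key);
    int *count = calloc(MAX_KEY, sizeof *count);

    for (int i = 0; i < NUM_KEYS; i++)   /* generate pseudo-random keys */
        key[i] = rand() % MAX_KEY;

    for (int i = 0; i < NUM_KEYS; i++)   /* histogram the key values    */
        count[key[i]]++;

    for (int k = 1; k < MAX_KEY; k++)    /* prefix sum gives key ranks  */
        count[k] += count[k - 1];

    printf("keys equal to 0: %d\n", count[0]);
    free(key);
    free(count);
    return 0;
}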
Experiments
• IS Benchmarking Procedure (generally; a minimal timing sketch follows this list)
1. Generating a sequence of N keys
2. Loading the N keys into the memory system
3. Time begins
4. Loop
Sorting & partial verification
5. Time ends
6. Full verification
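A minimal sketch of how the timed region (steps 3–5) can be bracketed in an MPI code; the real NPB harness uses its own timer routines, and the sorting/verification body is only a stub here.

#include <stdio.h>
#include <mpi.h>

/* Stub standing in for one iteration of sorting + partial verification. */
static void sort_and_partial_verify(int iteration) { (void)iteration; }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);              /* step 3: time begins */
    double t0 = MPI_Wtime();

    for (int it = 0; it < 10; it++)           /* step 4: timed loop  */
        sort_and_partial_verify(it);

    double local = MPI_Wtime() - t0;          /* step 5: time ends   */
    double maxt;
    MPI_Reduce(&local, &maxt, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("benchmarking time: %.6f s\n", maxt);

    MPI_Finalize();
    return 0;
}

Taking the maximum over ranks is one reasonable way to report the parallel time; the full verification (step 6) stays outside the timed region.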
Experiments
Machines:
• My Computer
i686 GNU/Linux
3 GB RAM
2 CPUs at 800 MHz
• Boada
x86_64 GNU/Linux
24 GB RAM
24 CPUs at 1596 MHz
Experiments
Procedure:
• The benchmark is not manually instrumented.
• Paraver traces are generated automatically.
• LD_PRELOAD is exported to inject the tracing library (see the sketch after this list).
• Benchmarks are executed with 2, 4, 8, 16, 32, and 64 processors.
• Benchmark results are analyzed.
• Generated traces are examined with the Paraver tools.
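A sketch of the mechanism behind the automatic tracing: an LD_PRELOAD-ed shared library can interpose on MPI calls and record events. This is not the code of the tracing library that actually produced the Paraver traces; it only illustrates why exporting LD_PRELOAD is enough to instrument the unmodified benchmark.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <mpi.h>

/* Wrapper with the same signature as MPI_Send: records a timestamped
 * event, then forwards to the real MPI_Send found via RTLD_NEXT. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    static int (*real_send)(const void *, int, MPI_Datatype,
                            int, int, MPI_Comm) = NULL;
    if (!real_send)
        real_send = (int (*)(const void *, int, MPI_Datatype,
                             int, int, MPI_Comm))dlsym(RTLD_NEXT, "MPI_Send");

    double t = MPI_Wtime();
    int rc = real_send(buf, count, type, dest, tag, comm);
    fprintf(stderr, "MPI_Send to rank %d took %.6f s\n",
            dest, MPI_Wtime() - t);
    return rc;
}

Built as a shared object (e.g. with -shared -fPIC) and exported through LD_PRELOAD, such a library wraps every matching call the benchmark makes, without recompiling it.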
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Paraver Visualization – Code View
• My Computer
• Boada
Paraver Visualization – Communication
• My Computer
• Boada
Paraver Visualization – Disk I/O
• My Computer
• Boada
Paraver Visualization – Load Balance
• My Computer
Paraver Visualization – Load Balance
• Boada
Paraver Visualization – L1D Cache Miss
• My Computer
Paraver Visualization – L1D Cache Miss
• Boada
Paraver Visualization – CPI
• My Computer
Paraver Visualization – CPI
• Boada
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Execution Time
[Figure: Execution time (ms) vs. number of processors (2, 4, 8, 16, 32, 64) for MyComputer and Boada]
Execution Time
• Relative Speedup = Execution Time of MyComputer / Execution Time of Boada
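Written out, the relative speedup compares the two machines at the same processor count; the per-machine speedup curves shown later presumably use the conventional definition against the single-processor run:

\[
  S_{\mathrm{rel}}(p) = \frac{T_{\mathrm{MyComputer}}(p)}{T_{\mathrm{Boada}}(p)},
  \qquad
  S(p) = \frac{T(1)}{T(p)},
\]

where \(T(p)\) is the measured time on \(p\) processors.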
[Figure: Relative speedup of execution time (MyComputer / Boada) vs. number of processors (1–64)]
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Benchmarking Time - reminder
• IS Benchmarking Procedure (generally)
1. Generating a sequence of N keys
2. Loading the N keys into the memory system
3. Time begins
4. Loop
Sorting & partial verification
5. Time ends
6. Full verification
• Benchmarking Time = execution time of the parallel
algorithm
Benchmarking Time
[Figure: Benchmarking time (sec) vs. number of processors (1–64) for MyComputer and Boada]
Benchmarking Time
• Relative Speedup = Benchmarking Time of MyComputer / Benchmarking Time of Boada
[Figure: Relative speedup of benchmarking time (MyComputer / Boada) vs. number of processors (1–64)]
Benchmarking Time
• SpeedUp of My Computer
[Figure: Benchmarking-time speedup of MyComputer vs. number of processors (1–64)]
Benchmarking Time
• SpeedUp of Boada
[Figure: Benchmarking-time speedup of Boada vs. number of processors (1–64)]
OUTLINE
• NAS Benchmark Suite
• Experiments
• Paraver Visualization
• Code View
• Communication
• Disk I/O
• Load Balancing
• L1D Cache Miss
• Cycles per Instruction (CPI)
• Execution Time
• Benchmarking Time
• Conclusion
Conclusion
• IS application
• ... does not involve much communication.
• ... is based on computation and memory loading.
• ... has low cache-miss and high CPI values in the computation phase.
• NAS is designed for highly parallel supercomputers.
• MyComputer is inadequate to meet the requirements of NAS.
• MyComputer cannot speed up on this application.
• Boada speeds up until the number of processes reaches the number of processors it has.
• MyComputer saves less time on disk I/O operations.
• CPI values in Boada's computation phase are lower.
BENCHMARK
INSTRUMENTATION
Umit Cavus BUYUKSAHIN Measurements Tools & Techniques, Spring ‘12
4/17/2012