“Understanding Bulldozer architecture through Linpack benchmark”
By [email protected]
Abstract: AMD has recently introduced a new core architecture named Bulldozer in its newly released multicore processors, such as the Interlagos processor. We cover experimentally, through the Linpack benchmark, some of the key features of this new processor: the shared floating point unit on the Bulldozer core, the fused multiply-add instructions (FMA4), and the power management that allows cores to boost. We also discuss the appropriate software ecosystem, such as optimized libraries (ACML), compiler flags (Open64) and the operating system, so that you can fully exploit the new generation of AMD processors.
HPC Advisory Council, ISC 2012, Hamburg
Example: an Interlagos processor with 16 cores at 2.3 GHz in a G34 socket. It has 4+4 Bulldozer modules on 2 NUMA nodes connected through coherent HyperTransport. Each NUMA node has 2 memory channels. It delivers 18.5 GB/s x 2 and 60 DP GF/s x 2 under 130 W (115 W TDP).
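The per-node figures quoted above can be added up for the whole socket. The following is a minimal Python sketch that uses only the numbers from the slide; the function and constant names are illustrative, and the bytes-per-FLOP ratio is a derived quantity that the slide itself does not state.

```python
# Figures quoted on the slide for one Interlagos socket (2 NUMA nodes).
NUMA_NODES = 2
BANDWIDTH_GBS_PER_NODE = 18.5   # GB/s per NUMA node
DP_GFLOPS_PER_NODE = 60.0       # DP GFLOP/s per NUMA node


def socket_totals(nodes=NUMA_NODES,
                  bandwidth_gbs=BANDWIDTH_GBS_PER_NODE,
                  dp_gflops=DP_GFLOPS_PER_NODE):
    """Return (total GB/s, total DP GFLOP/s, bytes per FLOP) for the socket."""
    total_bandwidth = nodes * bandwidth_gbs
    total_gflops = nodes * dp_gflops
    return total_bandwidth, total_gflops, total_bandwidth / total_gflops


if __name__ == "__main__":
    bw, gf, balance = socket_totals()
    print(f"{bw:.1f} GB/s, {gf:.1f} DP GFLOP/s, {balance:.3f} bytes/FLOP")
```

With these figures, only about 0.3 bytes of memory bandwidth are available per DP FLOP, which is one reason a compute-heavy benchmark such as Linpack is a natural way to exercise the FPU.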
Individual and shared resources
• HPC workloads use all the cores for the same kind of computation, mostly synchronized.
• The design offers high workload flexibility under a power budget, for example in the cloud.
• Example: in a cloud workload, one core can do integer work while the other uses the whole FPU for number crunching.
Cache hierarchy, how data flows
Focusing on the FPU of Bulldozer (BD)
• Tens of “operations in flight” through the FP scheduler
• SSE2 and FMA4 instructions are executed in pipes 0 and 1
• Ex. SSE2: ADDPD, MULPD
• Ex. FMA4: VFMADDPD
FMA instruction latencies on FMAC 0/1 pipes
From the Software Optimization Guide (SWOG) for AMD Family 15h processors
SSE2 instruction execution on 2 x 128bit FMAC units (4 DP F/clk/BD)
At each clock, each pipeline can have many adds or multiplies in flight, each with a latency of 6 clocks (the figure shows only 2, plus 1 clock of overhead). With SSE2 it takes 13 clocks to complete a multiply followed by an add (e.g. a×b + c×d). Each pipeline can crunch only 2 DP FLOP per clock, which gives 4 DP F/clk per BD module.
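The SSE2 numbers above can be reproduced with a few lines of arithmetic. This is a minimal Python sketch, not a pipeline simulator: the constants come from the slide, while the split of the 13 clocks into 6 + 1 + 6 (multiply, overhead, dependent add) is an assumption that happens to match the quoted total. All names are illustrative.

```python
# Values quoted on the slide.
OP_LATENCY_CLK = 6      # latency of one ADDPD or MULPD
OVERHEAD_CLK = 1        # overhead clock shown in the figure
DP_PER_128BIT = 2       # two doubles fit in one 128-bit register
PIPES_PER_MODULE = 2    # FMAC pipes 0 and 1


def sse2_mul_then_add_latency():
    """Clocks for a multiply followed by a dependent add (assumed 6 + 1 + 6)."""
    return OP_LATENCY_CLK + OVERHEAD_CLK + OP_LATENCY_CLK


def sse2_flops_per_clock_per_pipe():
    """One 2-wide add or multiply issued per clock: 1 FLOP per double."""
    return DP_PER_128BIT * 1


def sse2_flops_per_clock_per_module():
    """Both FMAC pipes kept full."""
    return sse2_flops_per_clock_per_pipe() * PIPES_PER_MODULE


if __name__ == "__main__":
    print(sse2_mul_then_add_latency(), "clocks for multiply + add")
    print(sse2_flops_per_clock_per_module(), "DP FLOP per clock per module")
```

The latency figure applies to a single dependent chain; the throughput figure assumes enough independent operations are in flight to keep both pipes busy every clock.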
FMA4 instruction execution on 2 x 128bit FMAC units (8 DP F/clk/BD)
At each clock, each pipeline can have many fused multiply-adds in flight, each with a latency of 6 clocks (the figure shows only 2, with 0 clocks of overhead). With FMA4 it takes 6 clocks to complete 2 fused multiply-adds (e.g. d = c×b + a). Each pipeline can crunch 4 DP FLOP per clock, which gives 8 DP F/clk per BD module.
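The FMA4 throughput above translates directly into a theoretical peak for the example processor. The Python sketch below uses the per-module rate from this slide and the 16-core, 2.3 GHz Interlagos configuration quoted earlier (8 Bulldozer modules). The names are illustrative, and the peak assumes every pipe issues one FMA4 instruction per clock at the base frequency, with no boost and no throttling.

```python
# Values quoted on the slides.
DP_PER_128BIT = 2        # two doubles per 128-bit register
FLOPS_PER_FMA = 2        # one multiply and one add per double
PIPES_PER_MODULE = 2     # FMAC pipes 0 and 1


def fma4_flops_per_clock_per_module():
    """DP FLOP per clock for one Bulldozer module with both pipes full."""
    return DP_PER_128BIT * FLOPS_PER_FMA * PIPES_PER_MODULE


def peak_dp_gflops(modules, frequency_ghz):
    """Theoretical peak in DP GFLOP/s at a fixed clock frequency."""
    return modules * fma4_flops_per_clock_per_module() * frequency_ghz


if __name__ == "__main__":
    # 16-core Interlagos: 4 + 4 modules at 2.3 GHz.
    print(f"{peak_dp_gflops(8, 2.3):.1f} DP GFLOP/s theoretical peak")
```

For the example processor this gives 147.2 DP GFLOP/s, twice what the same arithmetic yields at the SSE2 rate of 4 DP F/clk per module.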
Example:
[root@bdnode]# ./runtt.sh
OK TT
Here the second processor is thermal throttling (TT). A processor operating at a lower frequency lowers the HPL score. If TT: fix the cooling for that processor and rerun HPL in order to achieve good scores.
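The contents of runtt.sh are not shown on the slide, so the following is not that script. It is a hedged Python sketch of one possible check, assuming a Linux system that exposes the cpufreq sysfs file scaling_cur_freq (in kHz) for each CPU. The function names, the 5% tolerance and the 2.3 GHz nominal frequency are illustrative assumptions.

```python
from pathlib import Path


def read_cur_freqs_khz(root="/sys/devices/system/cpu"):
    """Read scaling_cur_freq (kHz) for every CPU that exposes cpufreq."""
    freqs = {}
    for path in Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        try:
            cpu = int(path.parts[-3][3:])        # "cpu12" -> 12
            freqs[cpu] = int(path.read_text().strip())
        except (OSError, ValueError):
            continue                             # skip unreadable entries
    return freqs


def slow_cpus(freqs_khz, nominal_khz, tolerance=0.05):
    """Return the CPUs running more than `tolerance` below nominal."""
    floor = nominal_khz * (1.0 - tolerance)
    return sorted(cpu for cpu, khz in freqs_khz.items() if khz < floor)


if __name__ == "__main__":
    # Assumed nominal frequency: 2.3 GHz, as in the Interlagos example.
    print(slow_cpus(read_cur_freqs_khz(), nominal_khz=2_300_000))
```

A low reading is only a hint. Power management also lowers the clock of idle cores, so a check like this is meaningful only while HPL is loading every core.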
Thanks, Q&A
• Useful free resources on the AMD website, developer.amd.com:
– Tools > Open64 compiler
– Tools > CodeAnalyst
– Libraries > ACML (AMD Core Math Library), AMDlibM