
Dec 26, 2015

Transcript
Page 1:

What hardware accelerators are you using/evaluating?

• Cells in a Roadrunner configuration
◦ 8-way SPE threads w/ local memory, DMA & vector unit programming issues, but tremendous flexibility
◦ Fast (25.6 GB/s) & large memory (4 GB or larger)
◦ Augmented C language; also C++ & now Fortran; GNU & XL variants; OpenMP is new; OpenCL is being prototyped
◦ Opterons can run the bulk of code not needing acceleration; Cell-only clusters possible

Page 2:

What hardware accelerators are you using/evaluating? Several years ago…

◦ GPUs (pre-CUDA & Tesla)
  - Brook & Scout (LANL data-parallel language)
  - No 32-bit at the time; limited memory; everything is a data-parallel problem
  - No ECC memory; insufficient parity/ECC protection of data paths and logic
  - Others at LANL still working in this area (including Tesla & CUDA)
◦ ClearSpeed (several years ago)
  - Earliest ClearSpeeds, before the Advance families
  - Augmented C language; 96 SIMD PEs
  - Everything is done as long SIMD data parallel and in sync
  - Low power
◦ FPGAs (HDL, several years ago)
  - Programming is hard -- very hard
  - Logic space limited the number of 64-bit ops
  - Fast SRAM but small; external DRAM modest size but no faster than CPUs
  - One algorithm at a time, so significant impact to use for multi-physics
  - Low power

Page 3:

Describe the applications that you are porting to accelerators?

◦ MD (materials), laser-plasma PIC, IMC X-ray (particle) transport, GROMACS, n-body universe & galaxies, DNS turbulence & supernovae, HIV genealogy, nanowire long-time-scale MD
◦ Ocean circulation, wildfires, discrete social simulations, clouds & rain, influenza spread, plasma turbulence, plasma sheaths, fluid instabilities

My personal observations:
◦ Particle methods are generally easiest
◦ Codes with good characteristics:
  - A few computationally intense "algorithms"
  - Pre-existing or obvious "fine-grain" parallel work units
  - C language versus Fortran or highly OO C++

Page 4:

Describe the kinds of speed-ups you are seeing (provide the basis for the comparison)?

◦ 5x to 10x over a single Opteron core for memory-BW-intensive code running at 5%-10% of peak
◦ 10x to 25x on particle methods, searches, etc.

How does it compare to scaling out (i.e., just using more x86 processors)? What are the bottlenecks to further performance improvements?

◦ Scale-out via more sockets is better -- BUT!
  - Scaling efficiencies are already a problem for several LANL applications running at 4,000 to 10,000 cores; scale-out of LANL-sized machines means $$$ for HW, space, & power
  - Scaling out by multi-core is not a clear winner
◦ Memory BW and cache architectures often limit performance, which Cells mostly get around
◦ Memory BW per core is decreasing at an "inverse Moore's law" rate!

Page 5:

Describe the programming effort required to make use of the accelerator.

◦ ½ to 1 man-year to "convert" a code, mostly dealing with data structures and threaded parallelism designs
◦ The lack of debugging & similar tools is like the earliest days of parallel computing (LANL was a leader then as well -- remember the early PVM Ethernet workstation "carpet" clusters in the mid-80's, before MPPs)
◦ We like to see 1-2 programming experts (PhD-level or equiv.) assigned to forefront-science code projects which have 1 to 4+ physics experts (PhD-level)

Amortization
◦ Ready for the future -- codes and skilled programmers. We expect our dual-level (MPI+threads) & SIMD-vectorization techniques used for Roadrunner to pay off on future multi-core and many-core chips as well.
◦ It's not just about running codes this year. Others will have to work through new forms of parallelism soon.
◦ We can do science now that isn't possible with most other machines

Page 6:

Compare accelerator cost to scaling-out cost

◦ Commodity-processor-only machines would have cost 2x what Roadrunner did in 2006-2007 (~$80M more)
◦ ...and used 2x or more power (~$1M per MW)
◦ Significantly larger node counts cause scaling & reliability issues
◦ Accelerators or heterogeneous chips should be Greener

Ease-of-use issues

◦ Newer Cell programming techniques (ALF, OpenMP) could make this easier
◦ A Cell cluster would be easier, but the PPE is really, really slow for non-SPU-accelerated code segments
◦ Not for the faint of heart, but Top20 machines never are

Page 7:

What is the future direction of hardware-based accelerators?

◦ Domain-specific libraries can make them far more useful in those specific areas
◦ Some may appear on Intel QPI or AMD HT
◦ Specialized cores will show up within commodity microprocessors -- ignore them or use them
◦ GPU-based systems will have to adopt ECC & parity protection
◦ Convey appears to have the most viable FPGA approach (FPGA as compiler-managed co-processor)

Software futures?

◦ OpenCL looks promising but doesn't address programming the specialized accelerator devices themselves
◦ The uber-auto-wizard-compiler will never come
◦ Heterogeneous compilers may come
◦ Debuggers & tools may come

What are your thoughts on what the vendors need to do to ensure wider acceptance of accelerators?

◦ Create next-generation versions and sell them as mainstream products

Page 8:

• Compile & run on PowerPC PPE
• Identify & isolate algorithm & data to run parallel on 8 "remote" SPEs
• Compile scalar version of algorithm on SPE
◦ Add SPE thread process control
◦ Add DMAs
  - Use "blocking" DMAs at this stage, just for functionality
  - Worry about data alignments
◦ First on a single SPE, then on 8 SPEs
• Optimize SPE code
◦ SIMD, branches & merges
◦ Add asynchronous double/triple buffering of DMAs
• For Roadrunner, connect to the rest of the code on the Opteron via DaCS and "message relay"

Page 9:

Roadrunner is more than a petascale supercomputer for today's use
◦ It provides a balanced platform to explore new algorithm design and programming models, and to refresh developer skills

LANL has been an early adopter of transformational technology*:
◦ 1970s: HPC is scalar -- LANL adopts vector (Cray 1 w/ no OS)
◦ 1980s: HPC is vector -- LANL adopts data parallel (big CM-2)
◦ 2000s: HPC is multi-core clusters -- LANL adopts hybrid (Roadrunner)

*Credit to Scott Pakin, CCS-1, for this list idea

Page 10:

[Figure: Roadrunner accelerated-code flow across the three layers of a node -- the Node (Opteron) with Node Memory, the serial PPC processor with Cell Memory, and the parallel SPE processors (x8) with their Local Memories. DaCS carries data over the PCIe link between Opteron and Cell; DMA moves data between Cell memory and SPE local stores; MPI connects the node to the cluster.]

(1) Host launches Cell code; host data is pushed/pulled to the Cell (DaCS over the PCIe link).
(2) Cell spawns parallel threads on the SPEs (8-way parallel).
(3) Each SPE DMA multi-buffers Cell data into local memory.
(4) Each SPE computes within its local memory buffers and, simultaneously, DMA multi-buffers data back to Cell memory, until done.
(5a) Parallel threads completed; (5b) updated data is pushed/pulled back to the Host (MPI moves data to/from the cluster).
(6) Cell code completed. The node may need to push/pull more data to/from the Cell & to/from the cluster, or could be available for concurrent work during this time. Non-accelerated code runs on the Opteron before and after the offloaded section.

How much can be automated in compilers or languages?