Page 1: Introduction to Research 2011

Introduction to Research 2011

Ashok Srinivasan

Florida State University

www.cs.fsu.edu/~asriniva

Part of the machine room at ORNL

The Cell processor powers the Roadrunner at LANL

NVIDIA GPUs power Tianhe-1A in China

Images from ORNL, IBM, NVIDIA

Page 2: Introduction to Research 2011

Outline

Research

High Performance Computing Applications and Software: multicore processors, massively parallel processors, computational nanotechnology, simulation-based policy making

Potential Research Topics

Page 3: Introduction to Research 2011

Research Areas

High Performance Computing, Applications in Computational Sciences, Scalable Algorithms, Mathematical Software

Current topics: Computational Nanotechnology, HPC on Multicore Processors, Massively Parallel Applications

New Topics: Simulation-based policy analysis

Old Topics: Computational Finance, Parallel Random Number Generation, Monte Carlo Linear Algebra, Computational Fluid Dynamics, Image Compression

Page 4: Introduction to Research 2011

Importance of Supercomputing

Fundamental scientific understanding: nano-materials, drug design

Solution of bigger problems: climate modeling

More accurate solutions: automobile crash tests

Solutions with time constraints: disaster mitigation

Study of complex interactions for policy decisions: urban planning

Page 5: Introduction to Research 2011

Some Applications

Increasing relevance to industry: in 1993, fewer than 30% of the top 500 supercomputers were commercial; now, 57% are commercial

A variety of application areas

Commercial: finance and insurance, medicine, aerospace and automobiles, telecom, oil exploration, shoes (Nike!), potato chips, toys

Scientific: weather prediction, earthquake modeling, epidemic modeling, materials, energy, computational biology, astrophysics

Page 6: Introduction to Research 2011

Supercomputing Power

The amount of parallelism is also increasing, with the high end having over 200,000 cores

Page 7: Introduction to Research 2011

Geographic Distribution

North America has over half of the top 500 systems

However, Europe and East Asia also have a significant share

China is determined to be a supercomputing superpower: two of its national supercomputing centers have top-five supercomputers

Japan has the top machine and two in the top five, and is planning a $1.3 billion exascale supercomputer in 2020

Page 8: Introduction to Research 2011

Asian Supercomputing Trends

Page 9: Introduction to Research 2011

Challenges in Supercomputing

Hardware can be obtained with enough money, but obtaining good performance on large systems is difficult

Some DOE applications ran at 1% efficiency on 10,000 cores; they will have to deal with a million threads soon, and with a billion at the exascale

Don't think of supercomputing as a means of solving current problems faster, but as a means of solving problems we earlier thought we could not solve

Development of software tools to make the machines easier to use

Page 10: Introduction to Research 2011

Architectural Trends

Massive parallelism: 10K-processor systems will be commonplace; the large end already has over 500K processors

Single-chip multiprocessing: all processors will be multicore

Heterogeneous multicore processors: the Cell used in the PS3, GPGPUs, the 80-core processor from Intel; processors with hundreds of cores are already commercially available

Distributed environments, such as the Grid

But it is hard to get good performance on these systems

Page 11: Introduction to Research 2011

Accelerating Applications with GPUs

Over a hundred cores per GPU

Hide memory latency with thousands of threads

Can accelerate a traditional computer to a teraflop

GPU cluster at FSU: Quantum Monte Carlo applications; algorithms (linear algebra, FFT, compression, etc.)

Page 12: Introduction to Research 2011

Small Discrete Fourier Transforms (DFT) on GPUs

GPUs are effective for large DFTs, but not for small DFTs; however, they can be effective for a large number of small DFTs

Useful for AFQMC

We use the asymptotically slow matrix-multiplication based DFT for very small sizes (a sketch of this idea follows below)

We combine it with a mixed-radix approach for larger sizes

We use asynchronous memory transfer to hide the host-device data transfer overhead
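
To make the matrix-multiplication idea concrete, here is a minimal NumPy sketch (CPU-only, not the GPU implementation described above; the function names are illustrative). The whole batch of small transforms reduces to a single matrix-matrix product, which is exactly the kind of operation that maps well onto GPUs.

```python
import numpy as np

def dft_matrix(n):
    """Return the n x n DFT matrix F with F[j, k] = exp(-2*pi*i*j*k / n)."""
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def batched_small_dft(batch):
    """batch: complex array of shape (num_transforms, n), one small signal per row.
    All the transforms are computed with a single matrix-matrix product."""
    n = batch.shape[1]
    return batch @ dft_matrix(n).T

# Example: 512 simultaneous length-8 DFTs, checked against NumPy's FFT.
x = np.random.rand(512, 8) + 1j * np.random.rand(512, 8)
assert np.allclose(batched_small_dft(x), np.fft.fft(x, axis=1))
```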

Page 13: Introduction to Research 2011

Comparison of DFT Performance

[Figures: 512 simultaneous DFTs without host-device data transfer; one plot for 2-D DFTs and one for 3-D DFTs]

Page 14: Introduction to Research 2011

Petascale Quantum Monte Carlo

Originally a DOE-funded project involving collaboration between ORNL, UIUC, Cornell, UTK, CWM, and NCSU

Now funded by ORAU/ORNL

Goal: scale Quantum Monte Carlo applications to petascale (one million gigaflops) machines through load balancing, fault tolerance, and other optimizations

Page 15: Introduction to Research 2011

Load Balancing

In current implementations, such as QWalk and QMCPack, cores send their excess walkers to cores with fewer walkers

In the new algorithm (based on the alias method), cores may send more than their excess, and may receive walkers even if they originally had an excess

Load can be balanced with each core receiving from at most one other core (see the sketch below)

Also optimal in the maximum number of walkers received

The total number of walkers sent may be up to twice the optimal
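
As a rough illustration of this idea, here is a simplified sketch (not the QWalk or QMCPack code; the function name and tie-breaking are assumptions) of building a redistribution plan in the spirit of Walker's alias method, so that every under-loaded core receives from exactly one sender.

```python
def alias_balance_plan(walkers):
    """walkers: list with the walker count on each core.
    Returns a list of (sender, receiver, count) transfers that brings every
    core to the average load (within one walker), with each core receiving
    from at most one other core."""
    n = len(walkers)
    total = sum(walkers)
    base, extra = divmod(total, n)  # the first `extra` cores keep one more walker

    # Surplus of each core relative to its target load.
    surplus = [w - (base + (1 if i < extra else 0)) for i, w in enumerate(walkers)]
    donors = [i for i in range(n) if surplus[i] > 0]
    receivers = [i for i in range(n) if surplus[i] < 0]

    plan = []
    while receivers:
        r = receivers.pop()
        d = donors.pop()
        need = -surplus[r]
        plan.append((d, r, need))   # d sends `need` walkers to r; r is now done
        surplus[d] -= need
        surplus[r] = 0
        if surplus[d] > 0:
            donors.append(d)        # d still has excess and may send again
        elif surplus[d] < 0:
            receivers.append(d)     # d sent more than its excess; it will receive later
    return plan

# Example: six cores with uneven walker counts (target is 8 walkers each).
print(alias_balance_plan([12, 3, 9, 4, 7, 13]))
# [(5, 4, 1), (5, 3, 4), (2, 1, 5), (0, 2, 4)]
```

In the example, core 2 sends more than its excess and later receives walkers itself, yet no core receives from more than one other core, matching the properties listed above.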

Page 16: Introduction to Research 2011

Performance Comparison

[Figures: comparisons with QWalk; mean number of walkers migrated and maximum number of receives]

Page 17: Introduction to Research 2011

Process-Node Affinity

The node allocation is not necessarily ideal for minimizing communication

Process-node affinity can, therefore, be important

[Figure: allocated nodes for a 12,000-core run on Jaguar]

Page 18: Introduction to Research 2011

Load Balancing with Affinity

Renumbering the nodes improves both load balancing and AllGather time (a toy renumbering sketch follows below)

[Figures: results on Jaguar; basic load balancing versus load balancing after renumbering]
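
As a toy illustration only (an assumed setup, not the scheme actually used in the Jaguar runs), one simple way to renumber ranks is to sort the allocated nodes by their torus coordinates, so that ranks with nearby numbers sit on physically nearby nodes.

```python
def renumber_by_coordinates(node_coords):
    """node_coords: dict mapping each rank to the (x, y, z) torus coordinates
    of the node it runs on.  Returns a dict old_rank -> new_rank that orders
    the ranks lexicographically by node coordinates."""
    ordered = sorted(node_coords, key=lambda r: node_coords[r])
    return {old: new for new, old in enumerate(ordered)}

# Example: four ranks scattered over a small torus.
coords = {0: (3, 1, 0), 1: (0, 0, 0), 2: (3, 0, 0), 3: (0, 1, 0)}
print(renumber_by_coordinates(coords))  # {1: 0, 3: 1, 2: 2, 0: 3}
```

Since many communication steps, including collectives such as AllGather, tend to exchange data between ranks with related numbers, keeping consecutive ranks physically close can shorten those messages; the actual renumbering used on Jaguar may well differ.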

Page 19: Introduction to Research 2011

Potential Research Topics

High Performance Computing on Multicore Processors: algorithms, applications, and libraries on GPUs

Applications on Massively Parallel Processors: Quantum Monte Carlo applications; load balancing and communication optimizations

Simulation-based policy decisions: combine scientific computing with models of social interactions to help make policy decisions