Page 1
Parallelizing Spacetime Discontinuous Galerkin Methods
Jonathan Booth, University of Illinois at Urbana-Champaign
In conjunction with: L. Kale, R. Haber, S. Thite, J. Palaniappan
This research was made possible by NSF grant DMR 01-21695
http://charm.cs.uiuc.edu
Page 2
Parallel Programming Lab
• Led by Professor Laxmikant Kale
• Application-oriented
– Research is driven by real applications and their needs
• NAMD
• CSAR Rocket Simulation (Roc*)
• Spacetime Discontinuous Galerkin
• Petaflops Performance Prediction (Blue Gene)
– Focus on scalable performance for real applications
Page 3
Charm++ Overview
• In development for roughly ten years
• Based on C++
• Runs on many platforms
– Desktops
– Clusters
– Supercomputers
• Built on top of a C layer called Converse
– Allows multiple languages to work together
Page 4
Charm++: Programmer View
• A system of objects
• Asynchronous communication via method invocation
• An object identifier is used to refer to an object
• The user sees each object execute its methods atomically
– As if each object were on its own processor
[Diagram: the user's view of objects/tasks communicating; legend: Processor, Object/Task]
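For readers unfamiliar with Charm++, here is a minimal sketch of this programming model; the module and names are illustrative, not from the talk. Entry methods are declared in a .ci interface file (processed by the charmc translator), and invoking one through a proxy sends an asynchronous message rather than making a blocking call:

```cpp
// hello.ci (interface file, shown here as a comment):
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main { entry Main(CkArgMsg *m); entry void done(); };
//     chare Worker   { entry Worker(); entry void work(int n); };
//   };

#include "hello.decl.h"          // generated from hello.ci
CProxy_Main mainProxy;           // readonly, set once at startup

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    mainProxy = thisProxy;
    CProxy_Worker w = CProxy_Worker::ckNew();  // create a chare somewhere
    w.work(42);        // asynchronous method invocation: returns immediately
  }
  void done() { CkExit(); }      // invoked later by the worker's message
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  void work(int n) {
    CkPrintf("Worker processing %d\n", n);
    mainProxy.done();            // the reply is also just an async message
  }
};
#include "hello.def.h"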
Page 5
Charm++: System View
• A set of objects invoked by messages
• A set of processors of the physical machine
• The runtime keeps track of the object-to-processor mapping
• The runtime routes messages between objects
[Diagram: the system's view of objects/tasks mapped onto physical processors; legend: Processor, Object/Task]
Page 6
Charm++ Benefits
• The program is not tied to a fixed number of processors
– No problem if the program needs 128 processors and only 45 are available
– Called processor virtualization
• Load balancing is accomplished automatically
– The user writes a short routine to transfer an object between processors (see the sketch below)
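As a rough illustration, that "short routine" is Charm++'s pup() method, which serializes an object's state so the runtime can pack it on one processor and unpack it on another. The Patch class below is hypothetical; the PUP::er interface is real:

```cpp
#include <vector>
#include "pup_stl.h"   // PUP operators for STL containers

// Hypothetical migratable object (in practice usually a chare array
// element). The runtime calls pup() to pack the object before
// migration and again to unpack it at the destination.
class Patch : public CBase_Patch {
  int id;
  std::vector<double> solution;
public:
  Patch() : id(0) {}
  Patch(CkMigrateMessage *m) {}      // required migration constructor
  void pup(PUP::er &p) {
    p | id;                          // one line per member to transfer
    p | solution;
  }
};
```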
Page 7
Load Balancing - Green Process Starts Heavy Computation
[Diagram: three processors A, B, and C; the green object on one processor begins heavy computation]
Page 8
Yellow Processes Migrate Away – System Handles Message Routing
[Diagram: before/after views of processors A, B, and C; the yellow objects migrate off the loaded processor while the system reroutes their messages]
Page 9
Load Balancing
• Load balancing isn't solely dependent on CPU usage
• Balancers consider network usage as well
– Objects can be moved to lessen network bandwidth usage
• Migrating an object to disk instead of to another processor gives checkpoint/restart and out-of-core execution
Page 10
Parallel Spacetime Discontinuous Galerkin
• Mesh generation is an advancing-front algorithm
– It adds independent sets of elements, called patches, to the mesh
• Spacetime methods are set up in such a way that they are easy to parallelize
– Each patch depends only on its inflow elements
• The cone constraint ensures there are no other dependencies (see the sketch below)
– The amount of data per patch is small
• It is inexpensive to send a patch and its inflow elements to another processor
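Schematically, the dependency structure looks like the sketch below; all types are hypothetical. The point is that a patch plus its inflow elements form a small, self-contained unit of work that can be shipped anywhere:

```cpp
#include <vector>

struct Element {
  bool solved = false;
  // geometry and solution data would live here
};

struct Patch {
  std::vector<const Element*> inflow;  // the only elements it depends on

  // A patch is solvable as soon as all inflow elements are solved;
  // the cone constraint guarantees there are no other dependencies.
  bool ready() const {
    for (const Element *e : inflow)
      if (!e->solved) return false;
    return true;
  }
};
```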
Page 11
Mesh Generation
[Figure: advancing front; legend: Unsolved Patches]
Page 12
Mesh Generation
[Figure: advancing front; legend: Solved Patches, Unsolved Patches]
Page 13
Mesh Generation
[Figure: advancing front; legend: Solved Patches, Unsolved Patches, Refinement]
Page 14
Parallelization Method (1D)
• Master-slave method
– Centralized mesh generation
– Distributed physics solver code
– Simplistic implementation
• But fast to get running
• Provides a sanity check for object migration
• No "time-step"
– As soon as a solved patch returns, the master generates any new patches it can and sends them off to be solved (see the sketch below)
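A hedged sketch of the master's logic as just described; all names are hypothetical. There is no global time-step: each solved patch that comes back may unlock new patches along the advancing front, which are dispatched immediately:

```cpp
// Called (asynchronously) whenever a slave returns a solved patch.
void Master::patchSolved(const SolvedPatch &result) {
  mesh.accept(result);                      // advance the front
  while (Patch *p = mesh.nextReadyPatch()) {
    int slave = nextIdleSlave();            // e.g. simple round-robin
    slaves[slave].solve(*p);                // asynchronous send; no
  }                                         // waiting on a time-step
}
```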
Page 15
Results - Patches / Second
[Chart: Patches/Second vs. Processors (x-axis 0–40 processors, y-axis 0–250 patches/second)]
Page 16
Scaling Problems
• Speedup is ideal up to 4 slave processors
• After 4 slaves, diminishing speedup occurs
• Possible sources:
– Network bandwidth overload
– Charm++ system overhead (grainsize control)
– Mesh generator overload
• The problem doesn't scale down
– Adding more processors doesn't slow the computation down
Page 17
Network Bandwidth
• The size of a patch sent both ways is 2048 bytes (a very conservative estimate)
• Each CPU can compute 36 patches/second
• So each CPU needs 72 kbytes/second
• 100 Mbit Ethernet provides about 10 Mbytes/sec
• The network can support ~130 CPUs
– So the bottleneck must not be a lack of network bandwidth (see the arithmetic below)
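For reference, the arithmetic behind these bullets:

$$36\ \text{patches/s} \times 2048\ \text{bytes/patch} \approx 72\ \text{kbytes/s per CPU},\qquad \frac{10\ \text{Mbytes/s}}{72\ \text{kbytes/s}} \approx 135\ \text{CPUs},$$

consistent with the ~130 CPUs quoted above.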
Page 18
Charm++ System Overhead (Grainsize Control)
• Grainsize is a measure of the smallest unit of work
• If it is too small, overhead dominates
– Network latency overhead
– Object creation overhead
• Each patch takes 1.7 ms of connection-setup time to send (both ways)
• So we can send ~550 patches/sec to remote processors
– Again, higher than the observed patches/second rate
• Overhead can be reduced by sending multiple patches at once (a larger grainsize)
– This speeds up the computation, but speedup still flattens out after 8 processors
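A sanity check on that figure, assuming the 1.7 ms setup cost dominates per-patch messaging:

$$\frac{1}{1.7\ \text{ms/patch}} \approx 588\ \text{patches/s},$$

which is consistent in magnitude with the quoted ~550 patches/sec once remaining per-message overhead is accounted for.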
Page 19
Mesh Generation
• With 0 slave processors: 31 ms/patch
• With 1 slave processor: 27 ms/patch
• The geometry code takes 4 ms to generate a patch
– The mesh generator needs a bit more time than that due to Charm++ message-sending overhead
• This caps throughput at fewer than 250 patches/second (see the arithmetic below)
• This can't trivially be sped up
– We would have to parallelize mesh generation
– Parallel mesh generation would also lighten the network load if the mesh were fully distributed to the slave nodes
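The sub-250 ceiling follows directly from the 4 ms geometry cost, since the centralized generator serializes all patch creation:

$$\frac{1}{4\ \text{ms/patch}} = 250\ \text{patches/s}.$$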
Page 20
Testing the Mesh Generator Bottleneck
• Does speeding up the mesh generator give better results?
• This leaves the question of how to speed up the mesh generator
– The cluster used is built from 500 MHz P3 Xeons
– So run the mesh generator on something faster (a 2.8 GHz P4)
– Everything stays on the 100 Mbit network
Page 21
Fast Mesh Generator Results
[Chart: Patches/Sec vs. Processors with the fast mesh generator (x-axis 0–35 processors, y-axis 0–900 patches/sec)]
Page 22
Future Directions
• Parallelize geometry/mesh generation
– Easy to do in theory
– More complex in practice with refinement and coarsening
– Lessens network bandwidth consumption
• Only the border elements of each submesh would need to be sent
• Compared to all elements being sent right now
– Better cache performance
Page 23
More Future Directions
• Send only the necessary data
– Currently we send everything, needed or not
• Use migration to balance load rather than slaves
– This means we also get checkpoint/restart and out-of-core execution for free
– It also means we can load-balance away some of the network communication
• Integrate the 2D mesh generation/physics code
– Nothing in the parallel code knows the dimensionality