Page 1
Parallelizing Spacetime Discontinuous Galerkin Methods
Jonathan Booth, University of Illinois at Urbana-Champaign
In conjunction with: L. Kale, R. Haber, S. Thite, J. Palaniappan
This research was made possible by NSF grant DMR 01-21695
http://charm.cs.uiuc.edu
Page 2
Parallel Programming Lab
• Led by Professor Laxmikant Kale
• Application-oriented
– Research is driven by real applications and their needs
• NAMD
• CSAR Rocket Simulation (Roc*)
• Spacetime Discontinuous Galerkin
• Petaflops Performance Prediction (Blue Gene)
– Focus on scalable performance for real applications
Page 3
Charm++ Overview
• In development for roughly ten years
• Based on C++
• Runs on many platforms
– Desktops
– Clusters
– Supercomputers
• Built on top of a C layer called Converse
– Allows multiple languages to work together
Page 4
Charm++: Programmer View
• A system of objects
• Asynchronous communication via method invocation
• An object identifier is used to refer to an object
• The user sees each object execute its methods atomically
– As if each object were on its own processor
[Diagram: the user's view of objects/tasks communicating; legend: Processor, Object/Task]
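For readers unfamiliar with Charm++, here is a minimal sketch of this programming model; the module and names are illustrative, not from the talk. Entry methods are declared in a .ci interface file (processed by the charmc translator), and invoking one through a proxy sends an asynchronous message rather than making a blocking call:

```cpp
// hello.ci (interface file, shown here as a comment):
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main { entry Main(CkArgMsg *m); entry void done(); };
//     chare Worker   { entry Worker(); entry void work(int n); };
//   };

#include "hello.decl.h"          // generated from hello.ci
CProxy_Main mainProxy;           // readonly, set once at startup

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    mainProxy = thisProxy;
    CProxy_Worker w = CProxy_Worker::ckNew();  // create a chare somewhere
    w.work(42);        // asynchronous method invocation: returns immediately
  }
  void done() { CkExit(); }      // invoked later by the worker's message
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  void work(int n) {
    CkPrintf("Worker processing %d\n", n);
    mainProxy.done();            // the reply is also just an async message
  }
};
#include "hello.def.h"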
Page 5
Charm++: System View
• A set of objects invoked by messages
• A set of processors of the physical machine
• The runtime keeps track of the object-to-processor mapping
• The runtime routes messages between objects
[Diagram: the system's view of objects/tasks mapped onto physical processors; legend: Processor, Object/Task]
Page 6
Charm++ Benefits
• The program is not tied to a fixed number of processors
– No problem if the program needs 128 processors and only 45 are available
– Called processor virtualization
• Load balancing is accomplished automatically
– The user writes a short routine to transfer an object between processors (see the sketch below)
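As a rough illustration, that "short routine" is Charm++'s pup() method, which serializes an object's state so the runtime can pack it on one processor and unpack it on another. The Patch class below is hypothetical; the PUP::er interface is real:

```cpp
#include <vector>
#include "pup_stl.h"   // PUP operators for STL containers

// Hypothetical migratable object (in practice usually a chare array
// element). The runtime calls pup() to pack the object before
// migration and again to unpack it at the destination.
class Patch : public CBase_Patch {
  int id;
  std::vector<double> solution;
public:
  Patch() : id(0) {}
  Patch(CkMigrateMessage *m) {}      // required migration constructor
  void pup(PUP::er &p) {
    p | id;                          // one line per member to transfer
    p | solution;
  }
};
```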
Page 7
Load Balancing - Green Process Starts Heavy Computation
[Diagram: three processors A, B, and C; the green object on one processor begins heavy computation]
Page 8
Yellow Processes Migrate Away – System Handles Message Routing
[Diagram: before/after views of processors A, B, and C; the yellow objects migrate off the loaded processor while the system reroutes their messages]
Page 9
Load Balancing
• Load balancing isn't solely dependent on CPU usage
• Balancers consider network usage as well
– Objects can be moved to lessen network bandwidth usage
• Migrating an object to disk instead of to another processor gives checkpoint/restart and out-of-core execution
Page 10
Parallel Spacetime Discontinuous Galerkin
• Mesh generation is an advancing-front algorithm
– It adds independent sets of elements, called patches, to the mesh
• Spacetime methods are set up in such a way that they are easy to parallelize
– Each patch depends only on its inflow elements
• The cone constraint ensures there are no other dependencies (see the sketch below)
– The amount of data per patch is small
• It is inexpensive to send a patch and its inflow elements to another processor
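Schematically, the dependency structure looks like the sketch below; all types are hypothetical. The point is that a patch plus its inflow elements form a small, self-contained unit of work that can be shipped anywhere:

```cpp
#include <vector>

struct Element {
  bool solved = false;
  // geometry and solution data would live here
};

struct Patch {
  std::vector<const Element*> inflow;  // the only elements it depends on

  // A patch is solvable as soon as all inflow elements are solved;
  // the cone constraint guarantees there are no other dependencies.
  bool ready() const {
    for (const Element *e : inflow)
      if (!e->solved) return false;
    return true;
  }
};
```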
Page 11
Mesh Generation
[Figure: advancing front; legend: Unsolved Patches]
Page 12
Mesh Generation
[Figure: advancing front; legend: Solved Patches, Unsolved Patches]
Page 13
Mesh Generation
[Figure: advancing front; legend: Solved Patches, Unsolved Patches, Refinement]
Page 14
Parallelization Method (1D)
• Master-slave method
– Centralized mesh generation
– Distributed physics solver code
– Simplistic implementation
• But fast to get running
• Provides a sanity check for object migration
• No "time-step"
– As soon as a solved patch returns, the master generates any new patches it can and sends them off to be solved (see the sketch below)
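A hedged sketch of the master's logic as just described; all names are hypothetical. There is no global time-step: each solved patch that comes back may unlock new patches along the advancing front, which are dispatched immediately:

```cpp
// Called (asynchronously) whenever a slave returns a solved patch.
void Master::patchSolved(const SolvedPatch &result) {
  mesh.accept(result);                      // advance the front
  while (Patch *p = mesh.nextReadyPatch()) {
    int slave = nextIdleSlave();            // e.g. simple round-robin
    slaves[slave].solve(*p);                // asynchronous send; no
  }                                         // waiting on a time-step
}
```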
Page 15
Results - Patches / Second
[Chart: Patches/Second vs. Processors (x-axis 0–40 processors, y-axis 0–250 patches/second)]
Page 16
Scaling Problems
• Speedup is ideal up to 4 slave processors
• After 4 slaves, diminishing speedup occurs
• Possible sources:
– Network bandwidth overload
– Charm++ system overhead (grainsize control)
– Mesh generator overload
• The problem doesn't scale down
– Adding more processors doesn't slow the computation down
Page 17
Network Bandwidth
• The size of a patch sent both ways is 2048 bytes (a very conservative estimate)
• Each CPU can compute 36 patches/second
• So each CPU needs 72 kbytes/second
• 100 Mbit Ethernet provides about 10 Mbytes/sec
• The network can support ~130 CPUs
– So the bottleneck must not be a lack of network bandwidth (see the arithmetic below)
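For reference, the arithmetic behind these bullets:

$$36\ \text{patches/s} \times 2048\ \text{bytes/patch} \approx 72\ \text{kbytes/s per CPU},\qquad \frac{10\ \text{Mbytes/s}}{72\ \text{kbytes/s}} \approx 135\ \text{CPUs},$$

consistent with the ~130 CPUs quoted above.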
Page 18
Charm++ System Overhead (Grainsize Control)
• Grainsize is a measure of the smallest unit of work
• If it is too small, overhead dominates
– Network latency overhead
– Object creation overhead
• Each patch takes 1.7 ms of connection-setup time to send (both ways)
• So we can send ~550 patches/sec to remote processors
– Again, higher than the observed patches/second rate
• Overhead can be reduced by sending multiple patches at once (a larger grainsize)
– This speeds up the computation, but speedup still flattens out after 8 processors
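A sanity check on that figure, assuming the 1.7 ms setup cost dominates per-patch messaging:

$$\frac{1}{1.7\ \text{ms/patch}} \approx 588\ \text{patches/s},$$

which is consistent in magnitude with the quoted ~550 patches/sec once remaining per-message overhead is accounted for.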
Page 19
Mesh Generation
• With 0 slave processors: 31 ms/patch
• With 1 slave processor: 27 ms/patch
• The geometry code takes 4 ms to generate a patch
– The mesh generator needs a bit more time than that due to Charm++ message-sending overhead
• This caps throughput at fewer than 250 patches/second (see the arithmetic below)
• This can't trivially be sped up
– We would have to parallelize mesh generation
– Parallel mesh generation would also lighten the network load if the mesh were fully distributed to the slave nodes
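The sub-250 ceiling follows directly from the 4 ms geometry cost, since the centralized generator serializes all patch creation:

$$\frac{1}{4\ \text{ms/patch}} = 250\ \text{patches/s}.$$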
Page 20
Testing the Mesh Generator Bottleneck
• Does speeding up the mesh generator give better results?
• This leaves the question of how to speed up the mesh generator
– The cluster used is built from 500 MHz P3 Xeons
– So run the mesh generator on something faster (a 2.8 GHz P4)
– Everything stays on the 100 Mbit network
Page 21
Fast Mesh Generator Results
[Chart: Patches/Sec vs. Processors with the fast mesh generator (x-axis 0–35 processors, y-axis 0–900 patches/sec)]
Page 22
Future Directions
• Parallelize geometry/mesh generation
– Easy to do in theory
– More complex in practice with refinement and coarsening
– Lessens network bandwidth consumption
• Only the border elements of each submesh would need to be sent
• Compared to all elements being sent right now
– Better cache performance
Page 23
More Future Directions
• Send only the necessary data
– Currently we send everything, needed or not
• Use migration to balance load rather than slaves
– This means we also get checkpoint/restart and out-of-core execution for free
– It also means we can load-balance away some of the network communication
• Integrate the 2D mesh generation/physics code
– Nothing in the parallel code knows the dimensionality