Parallel Programming with Object Assemblies

Swarat Chaudhuri (Penn State), Roberto Lublinerman (Penn State), Pavol Cerny (IST Austria)

Post on 15-Dec-2015



Data parallelism:

- Highly coarse-grained (MapReduce)
- Highly fine-grained (numeric computations on dense arrays)
- Problem-specific methods

Taming parallelism

Task parallelism

Message passing

Taming parallelism

Our target:

Data-parallel computations over large, unstructured, shared-memory graphs

Unknown granularity

High-level correctness as well as efficiency.

Delaunay mesh refinement

• Triangulate a given set of points.
• Delaunay property: No point is contained within the circumcircle of a triangle.
• Quality property: No bad triangles, i.e., triangles with an angle > 120°.
• Mesh refinement: Fix bad triangles through an iterative algorithm.
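The quality property above can be checked per triangle from its vertex coordinates. The following is a minimal sketch, assuming triangles are given as three non-degenerate 2D points; the class and method names (`TriangleQuality`, `cosAt`, `isBad`) are hypothetical and do not come from the slides.

```java
// Hypothetical helper, not from the slides: flags triangles with an angle above 120 degrees.
final class TriangleQuality {
    // cos(120 degrees) = -0.5; an angle exceeds 120 degrees exactly when its cosine is below -0.5.
    private static final double COS_120 = -0.5;

    /** Cosine of the angle at vertex (px, py) between the edges to (qx, qy) and (rx, ry). */
    static double cosAt(double px, double py, double qx, double qy, double rx, double ry) {
        double ux = qx - px, uy = qy - py;
        double vx = rx - px, vy = ry - py;
        double dot = ux * vx + uy * vy;
        return dot / (Math.hypot(ux, uy) * Math.hypot(vx, vy));
    }

    /** True if any of the three interior angles is larger than 120 degrees. */
    static boolean isBad(double ax, double ay, double bx, double by, double cx, double cy) {
        return cosAt(ax, ay, bx, by, cx, cy) < COS_120
            || cosAt(bx, by, cx, cy, ax, ay) < COS_120
            || cosAt(cx, cy, ax, ay, bx, by) < COS_120;
    }
}
```

For example, the flat triangle (0,0), (10,0), (5,1) has a wide angle at its apex and is reported as bad, while a near-equilateral triangle is not.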

Retriangulation

Cavity: all triangles whose circumcircle contains new point.

Quality constraint may not hold for all new triangles.
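The cavity definition above rests on an in-circumcircle test. Below is a minimal sketch using the standard determinant form of that test. It assumes each triangle is stored as six coordinates and finds the cavity by scanning the whole mesh, which is a simplification for illustration: the refinement code later in the deck expands a cavity from a starting triangle. The names `CavityFinder`, `inCircumcircle`, and `cavity` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the slides.
final class CavityFinder {
    /** Triangle t = {ax, ay, bx, by, cx, cy}; true if (px, py) lies strictly inside its circumcircle. */
    static boolean inCircumcircle(double[] t, double px, double py) {
        double adx = t[0] - px, ady = t[1] - py;
        double bdx = t[2] - px, bdy = t[3] - py;
        double cdx = t[4] - px, cdy = t[5] - py;
        double asq = adx * adx + ady * ady;
        double bsq = bdx * bdx + bdy * bdy;
        double csq = cdx * cdx + cdy * cdy;
        // 3x3 in-circle determinant, expanded along its first row.
        double det = adx * (bdy * csq - bsq * cdy)
                   - ady * (bdx * csq - bsq * cdx)
                   + asq * (bdx * cdy - bdy * cdx);
        // The sign of det depends on vertex order, so correct for orientation.
        double orient = (t[2] - t[0]) * (t[5] - t[1]) - (t[3] - t[1]) * (t[4] - t[0]);
        return orient > 0 ? det > 0 : det < 0;
    }

    /** Brute-force cavity: every triangle whose circumcircle contains the new point. */
    static List<double[]> cavity(List<double[]> mesh, double px, double py) {
        List<double[]> result = new ArrayList<>();
        for (double[] t : mesh) {
            if (inCircumcircle(t, px, py)) {
                result.add(t);
            }
        }
        return result;
    }
}
```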

Sequential mesh refinement

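Mesh m = /* read input mesh */;
Worklist wl = new Worklist(m.getBad());   // start from all bad triangles
foreach triangle t in wl {
  Cavity c = new Cavity(t);
  c.expand();                // grow the cavity around t
  c.retriangulate();         // replace the cavity with new triangles
  m.updateMesh(c);
  wl.add(c.getBad());        // new triangles may themselves be bad
}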

• Cavities are contiguous “regions” in the mesh.
• Worst-case cavities can encompass the whole mesh.

Parallelization

• Computation over complex, unstructured graphs. Mesh = heap-allocated graph. Nodes = triangles. Edges = adjacency.
• Atomicity: Cavities must be retriangulated atomically.
• Non-overlapping cavities can be processed in parallel.
• Seems impossible to handle with static analysis:
  – Shape of data structure changes greatly over time.
  – Shape of data structure is highly input-dependent.
  – Without deep algorithmic knowledge, impossible to say statically whether cavities will overlap.

• Lots of recent work, notably by Pingali et al.

List of similar applications

• Delaunay mesh refinement, Delaunay triangulation
• Agglomerative clustering, ray tracing
• Social network maintenance
• Minimum spanning tree, maximum flow
• N-body simulation, epidemiological simulation
• Sparse matrix-vector multiplication, sparse Cholesky factorization
• Belief propagation, survey propagation in Bayesian inference
• Iterative dataflow analysis, Petri net simulation
• Finite-difference PDE solution

Locality of updates in Chorus

Cavity

• On a mesh of ~100,000 triangles from the Lonestar benchmarks: average cavity size = 3.75 triangles; maximum cavity size = 12 triangles.
• Average-case locality is the essence of parallelism.
• Chorus: parallel computation driven by “neighborhoods” in heaps.

Heaps, regions, assemblies

• Heap = directed graph. Nodes = objects. Labeled edges = pointers.
• Region = induced subgraph.
• Assembly = region + thread of control.

Typically speculative and short-lived.

Programs, assembly classes

• Assembly class = set of local variables + set of guarded updates + constructor + public variables.
• Program = set of classes.
• Synchronization happens in guard evaluation.

Assembly states:
• busy: executing an update
• ready: to be preempted or to execute the next update
• terminated

:: Guard: Update

• g is a condition on the local variables and owned objects of the assembly.
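The guarded-update model and the three states can be mimicked in plain Java. This is a sequential sketch under stated assumptions, not the Chorus runtime: the class `GuardedAssembly`, its `step` method, and the "first enabled guard wins" policy are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sequential model of an assembly with guarded updates.
final class GuardedAssembly {
    enum State { READY, BUSY, TERMINATED }

    private static final class GuardedUpdate {
        final BooleanSupplier guard;
        final Runnable update;

        GuardedUpdate(BooleanSupplier guard, Runnable update) {
            this.guard = guard;
            this.update = update;
        }
    }

    private final List<GuardedUpdate> updates = new ArrayList<>();
    private State state = State.READY;
    int counter = 0; // example local variable that guards may inspect

    void addUpdate(BooleanSupplier guard, Runnable update) {
        updates.add(new GuardedUpdate(guard, update));
    }

    State state() {
        return state;
    }

    void terminate() {
        state = State.TERMINATED;
    }

    /** Runs the first update whose guard holds (an assumed policy); false if none is enabled. */
    boolean step() {
        if (state != State.READY) {
            return false;
        }
        for (GuardedUpdate gu : updates) {
            if (gu.guard.getAsBoolean()) {
                state = State.BUSY; // busy while the update body executes
                try {
                    gu.update.run();
                } finally {
                    if (state == State.BUSY) {
                        state = State.READY; // back to ready unless the update terminated us
                    }
                }
                return true;
            }
        }
        return false;
    }
}
```

Calling `step()` in a loop runs updates until no guard is enabled, at which point the assembly simply stays in the ready state.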

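Guards can merge assemblies

:: merge (u.f): S

:: merge (u.f) when g: S

• The merging assembly gets a bigger region and keeps its local state.
• The absorbed assembly (the owner of the object that u.f points to) dies.
• The absorbed assembly must be in the ready state while the merge happens.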
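The effect of a merge on ownership can be sketched in sequential Java. This is an illustration under assumptions, with no real concurrency or synchronization: `MergeSketch`, `Node`, `Assembly`, and `mergeAlong` are hypothetical names, and each node carries a single pointer field `f` to mirror the `u.f` notation.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sequential model of merge(u.f); not the Chorus implementation.
final class MergeSketch {
    enum State { READY, BUSY, TERMINATED }

    static final class Node {
        final String name;
        Node f;          // single labeled pointer, as in u.f
        Assembly owner;  // every node is owned by exactly one assembly

        Node(String name) {
            this.name = name;
        }
    }

    static final class Assembly {
        final Set<Node> region = new HashSet<>();
        State state = State.READY;
        int localCount = 0; // stands in for local state, which survives a merge

        Assembly(Node... nodes) {
            for (Node n : nodes) {
                region.add(n);
                n.owner = this;
            }
        }

        /** Tries to absorb the assembly owning the target of u.f; fails unless that owner is READY. */
        boolean mergeAlong(Node u) {
            if (u.owner != this || u.f == null) {
                return false;
            }
            Assembly other = u.f.owner;
            if (other == this || other.state != State.READY) {
                return false;
            }
            for (Node n : other.region) {
                n.owner = this; // ownership transfer, no aliasing
            }
            region.addAll(other.region);
            other.region.clear();
            other.state = State.TERMINATED; // the absorbed assembly dies
            return true;
        }
    }
}
```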

Updates can split an assembly

split(T)

• Split into assemblies of class T.
• Other assemblies are not affected.
• Not a synchronization construct.

Local updates

x = u.f;
x.f = y;

• Attempts to access objects outside the region lead to exceptions.
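Region-checked local updates and splitting can be sketched as follows. This is an illustrative model under assumptions: the names `SplitSketch`, `OutOfRegionException`, `read`, `write`, and `split` are hypothetical, the split here produces one new assembly per owned object, and the sketch ignores the target class T.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sequential model of local updates and split; not the Chorus implementation.
final class SplitSketch {
    static final class Node {
        final String name;
        Node f;

        Node(String name) {
            this.name = name;
        }
    }

    static final class OutOfRegionException extends RuntimeException {
        OutOfRegionException(String nodeName) {
            super("object outside region: " + nodeName);
        }
    }

    static final class Assembly {
        final Set<Node> region = new HashSet<>();

        Assembly(Node... nodes) {
            for (Node n : nodes) {
                region.add(n);
            }
        }

        private void requireOwned(Node n) {
            if (!region.contains(n)) {
                throw new OutOfRegionException(n.name);
            }
        }

        /** x = u.f : reading a field requires owning u. */
        Node read(Node u) {
            requireOwned(u);
            return u.f;
        }

        /** x.f = y : writing a field requires owning x. */
        void write(Node x, Node y) {
            requireOwned(x);
            x.f = y;
        }

        /** Assumed split semantics: one new single-object assembly per owned object. */
        List<Assembly> split() {
            List<Assembly> parts = new ArrayList<>();
            for (Node n : region) {
                parts.add(new Assembly(n));
            }
            region.clear();
            return parts;
        }
    }
}
```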

Delaunay mesh refinement

• Use two assembly classes: Triangle and Cavity.
  – Cavity = local region in mesh.
• Each triangle:
  – Determines if it is bad (local check).
  – If so, merges with neighbors to become a cavity.
• Each cavity:
  – Determines if it is complete (local check).
  – If no, merges with a neighbor.
  – If yes, retriangulates (locally) and splits.

Delaunay mesh refinement: sketch

assembly Triangle:: ...
  action:: merge (v.f, Cavity) when isBad:
    skip

assembly Cavity:: ...
  action:: merge (v.f) when (not isComplete):
    ...
  isComplete: retriangulate(); split(Triangle)

Delaunay mesh refinement: sketch

assem Triangle:: ...
  action:: merge (v.f, Cavity, u) when bad?:
    skip

assem Cavity:: ...
  action:: merge (v.f) when (not complete?):
    skip
  complete?: retriangulate(); split(Triangle)

What happens on a conflict?

• Cavity i “absorbed” by cavity j.
• Cavity j now has some “unnecessary” triangles.
• j will later split.

Boruvka’s algorithm for minimum spanning tree

• Assembly = spanning tree.
• Initially, each assembly has one node.
• As the algorithm progresses, trees merge.
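For reference, the tree-merging structure of Boruvka's algorithm can be sketched sequentially. This is a plain Java illustration, not a Chorus program: the `comp` array stands in for "which assembly owns this node", the names `BoruvkaSketch` and `mstWeight` are hypothetical, and ties between equal weights are broken by edge index so that no cycle can form.

```java
import java.util.Arrays;

// Hypothetical sequential sketch of Boruvka's algorithm.
final class BoruvkaSketch {
    /** edges[i] = {u, v, weight}; returns the total weight of a minimum spanning forest. */
    static long mstWeight(int n, int[][] edges) {
        int[] comp = new int[n]; // comp[v] = id of the tree currently containing v
        for (int v = 0; v < n; v++) {
            comp[v] = v;         // initially, each tree has one node
        }
        long total = 0;
        boolean merged = true;
        while (merged) {
            merged = false;
            int[] best = new int[n]; // best[c] = index of the cheapest edge leaving tree c
            Arrays.fill(best, -1);
            for (int i = 0; i < edges.length; i++) {
                int cu = comp[edges[i][0]];
                int cv = comp[edges[i][1]];
                if (cu == cv) {
                    continue;        // edge is internal to one tree
                }
                if (best[cu] == -1 || less(edges, i, best[cu])) {
                    best[cu] = i;
                }
                if (best[cv] == -1 || less(edges, i, best[cv])) {
                    best[cv] = i;
                }
            }
            for (int c = 0; c < n; c++) {
                if (best[c] == -1) {
                    continue;
                }
                int[] e = edges[best[c]];
                int cu = comp[e[0]];
                int cv = comp[e[1]];
                if (cu == cv) {
                    continue;        // the two trees were already merged in this round
                }
                total += e[2];
                for (int v = 0; v < n; v++) {
                    if (comp[v] == cv) {
                        comp[v] = cu; // tree cv is absorbed by tree cu
                    }
                }
                merged = true;
            }
        }
        return total;
    }

    /** Strict total order on edges: by weight, then by index. */
    private static boolean less(int[][] edges, int i, int j) {
        return edges[i][2] < edges[j][2] || (edges[i][2] == edges[j][2] && i < j);
    }
}
```

Relabeling the `comp` array on every merge is linear in the number of nodes; it is kept here only for readability.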

Race-freedom

• No aliasing, only ownership transfer.

• An assembly can merge with another assembly only when the latter is not in the middle of an update.

Deadlock-freedom

• Classic definition: Process P waits for a resource from Q and vice versa.
• Deadlock in Chorus:
  – Assembly i has a locally enabled merge with assembly j.
  – Assembly j has a locally enabled merge with assembly i.
  – No other progress is possible.
• But one of the merges can always be carried out. (An assembly can always be killed at its ready state.)

JChorus

• Chorus + sequential Java.

• Assembly classes in addition to object classes.

assembly Cavity {
  action { // expand cavity
    merge(outgoingedges, TriangleObject t):
      { outgoingedges.remove(t);
        frontier.add(t);
        build(); }
  }
  Set members; Set border;
  Queue frontier;       // current frontier
  List outgoingedges;   // outgoing edges on which to merge
  TriangleObject initial;
  ...

Division-based implementation

• Division = set of assemblies mapped to a core.
• Local access: merge actions within a division, split actions, local updates.
• Remote access: merge actions issued across divisions.

• Uses assembly-level locks.

Implementation strategies

• Adaptive divisions. Heuristic for reducing the number of remote merges.

• During a merge, not only the target assembly, but also assemblies reachable by k pointer indirections, are migrated.

• Adaptation heuristic does elementary load balancing.

• Union-find data structure to relate objects to the assemblies they belong to.
• Needed for splits and merges.
• Token-passing for deadlock prevention and termination detection.
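One way to picture the union-find bookkeeping is the sketch below, which maps object ids to a representative of their owning assembly. It is an assumption-laden illustration rather than the JChorus data structure: the class `OwnershipMap` and its methods are hypothetical, and `split` only works when given every member of the assembly being split.

```java
// Hypothetical union-find over object ids; not the JChorus implementation.
final class OwnershipMap {
    private final int[] parent;
    private final int[] rank;

    OwnershipMap(int objects) {
        parent = new int[objects];
        rank = new int[objects];
        for (int i = 0; i < objects; i++) {
            parent[i] = i; // each object starts in its own assembly
        }
    }

    /** Representative object of the assembly owning x, with path halving. */
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /** Merge: the assemblies owning a and b become one (union by rank). */
    void merge(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) {
            return;
        }
        if (rank[ra] < rank[rb]) {
            int tmp = ra;
            ra = rb;
            rb = tmp;
        }
        parent[rb] = ra;
        if (rank[ra] == rank[rb]) {
            rank[ra]++;
        }
    }

    /** Split: members must list ALL objects of one assembly; each becomes a singleton again. */
    void split(int[] members) {
        for (int m : members) {
            parent[m] = m;
            rank[m] = 0;
        }
    }
}
```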

Experiments: Delaunay refinement from Lonestar benchmarks

• Large dataset from the Lonestar benchmarks.
  – 100,364 triangles.
  – 47,768 initially bad.
• 1 to 8 threads.
• Competing approaches:
  – Object-level locking
  – DSTM (software transactions)

Locality: mesh snapshots

[Figures: the initial mesh and its divisions; the mesh after several thousand retriangulations]

Delaunay: Speedup over sequential

Delaunay: Self-relative speedup

Delaunay: Conflicts

Related models

• Threads + explicit locking: Global heap abstraction, arbitrary aliasing.

• Software transactions: Burden of reasoning passed to transaction manager. In most implementations, heap is viewed as global.

• Static data partitioning: Unpredictable nature of the computation makes static analysis hard.

• Actors: Based on low-level messaging. If sending references, potential of races. If copying triangles, inefficient.

• Pingali et al.’s Galois: Targets the same problem; Chorus is an alternative approach.

More information

Parallel programming with object assemblies. Roberto Lublinerman, Swarat Chaudhuri, Pavol Cerny. OOPSLA 2009.

http://www.cse.psu.edu/~swarat/chorus
