X10 and APGAS at Petascale

Olivier Tardieu 1, Benjamin Herta 1, David Cunningham 2, David Grove 1, Prabhanjan Kambadur 1, Vijay Saraswat 1, Avraham Shinnar 1, Mikio Takeuchi 3, Mandana Vaziri 1

1 IBM T.J. Watson Research Center
2 Google Inc.
3 IBM Research – Tokyo

This material is based upon work supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002.
Background
§ X10 tackles the challenge of programming at scale
  § HPC, cluster, cloud
  § scale out: run across many distributed nodes è this talk & PPAA talk
  § scale up: exploit multi-core and accelerators è CGO tutorial
  § resilience and elasticity è next talk

§ X10 is
  § a programming language
    § imperative, object-oriented, strongly-typed, garbage-collected (like Java)
    § concurrent and distributed: Asynchronous Partitioned Global Address Space model (see the short sketch after this list)
  § an open-source tool chain developed at IBM Research è X10 2.4.2 just released
  § a growing community
    § X10 workshop at PLDI’14 è CFP at http://x10-lang.org

§ Double goal: productivity and performance
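For context, a minimal sketch of the APGAS constructs in X10: places partition the global address space, async spawns a task, at shifts the place of execution, and finish waits for all transitively spawned tasks. The class name HelloAPGAS and the message strings are ours, added for illustration; the pattern follows the standard X10 samples.

// Minimal APGAS sketch (illustrative class name, not from the talk)
public class HelloAPGAS {
  public static def main(Rail[String]) {
    finish for (p in Place.places()) {
      at (p) async {
        Console.OUT.println("Hello from " + here); // runs as a task at place p
      }
    }
    // reached only after every remote task above has terminated
    Console.OUT.println("Done at " + here);
  }
}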
Outline
§ X10
  § programming model: Asynchronous Partitioned Global Address Space

§ Optimizations for scale out
  § distributed termination detection
  § high-performance networks
  § memory management

§ Performance results
  § Power 775 architecture
  § benchmarks

§ Global load balancing
  § Unbalanced Tree Search at scale
Example: Fibonacci

def fib(n:Long):Long {
  if (n < 2) return n;     // base case
  val x:Long;
  val y:Long;
  finish {                 // wait for every task spawned in this block
    async x = fib(n-1);    // spawn fib(n-1) as a concurrent task
    y = fib(n-2);          // compute fib(n-2) in the current task
  }
  return x + y;            // safe: finish guarantees x has been assigned
}
Example: BlockDistRail.x10
public class BlockDistRail[T] {
  protected val sz:Long; // block size
  protected val raw:PlaceLocalHandle[Rail[T]]; // one Rail[T] of sz elements per place

  public def this(sz:Long, places:Long){T haszero} {
    this.sz = sz;
    raw = PlaceLocalHandle.make[Rail[T]](PlaceGroup.make(places), ()=>new Rail[T](sz));
  }

  // write: run the assignment at the place that owns block i/sz
  public operator this(i:Long) = (v:T) { at (Place(i/sz)) raw()(i%sz) = v; }
  // read: fetch the value from the owning place
  public operator this(i:Long) = at (Place(i/sz)) raw()(i%sz);

  public static def main(Rail[String]) {
    val rail = new BlockDistRail[Long](5, 4); // 4 blocks of 5 elements: indices 0..19
    rail(7) = 8;                              // element 7 lives on Place 1
    Console.OUT.println(rail(7));
  }
}
[Figure: the rail's 20 indices split into blocks of 5 across Place 0 – Place 3; after rail(7) = 8, index 7 (in Place 1's block) holds 8 and every other element is 0.]
Optimizations for Scale Out
Distributed Termination Detection
§ Local finish is easy
  § synchronized counter: increment when a task is spawned, decrement when a task ends

§ Distributed finish is non-trivial
  § the network can reorder increment and decrement messages

§ X10 algorithm: disambiguation in space è space overhead (see the sketch below)
  § one row of n counters per place, with n places
  § when place p spawns a task at place q, increment counter q at place p
  § when a task terminates at place p, decrement counter p at place p
  § finish triggered when the sum of each column is zero

§ Charm++ algorithm: disambiguation in time è communication overhead
  § successive non-overlapping waves of termination detection
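To make the space-disambiguation idea concrete, here is a minimal X10-style sketch; the class and method names are illustrative and do not correspond to the actual runtime's finish implementation.

// Illustrative sketch only: for a given finish, each place keeps one row of
// nPlaces counters (an n x n matrix across the whole computation).
class FinishCountersSketch {
  val hereId:Long;    // id of the place that owns this row
  val row:Rail[Long]; // one counter per destination place

  def this(hereId:Long, nPlaces:Long) {
    this.hereId = hereId;
    row = new Rail[Long](nPlaces); // zero-initialized
  }

  // this place spawns a task destined for place q: increment counter q
  def spawned(q:Long) { atomic row(q) = row(q) + 1; }

  // a task under this finish terminates here: decrement this place's own counter
  def terminated() { atomic row(hereId) = row(hereId) - 1; }

  // the place running the finish gathers all rows; the finish is quiescent
  // when every column sums to zero
  static def quiescent(rows:Rail[Rail[Long]]):Boolean {
    val n = rows.size;
    for (q in 0..(n-1)) {
      var sum:Long = 0;
      for (p in 0..(n-1)) sum += rows(p)(q);
      if (sum != 0) return false;
    }
    return true;
  }
}

Because the increment for a task bound for place q is recorded in the spawner's row and the matching decrement in q's own row, the column-sum test does not depend on the order in which increment and decrement reports arrive.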
Optimized Distributed Termination Detection
§ Source optimizations
  § aggregate messages at the source
  § compress messages

§ Software routing
  § aggregate messages at intermediate nodes

§ Pattern-based specialization (see the sketch below)
  § “put”: a finish governing a single task è wait for one ack
  § “get”: a finish governing a round trip è wait for the return task
  § local finish: a finish with no remote tasks è single counter
  § SPMD finish: a finish with no nested remote tasks è single counter
  § irregular/dense finish: a finish with lots of links è software routing
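As one example of such a specialization, a hedged X10-style sketch of the single-counter case (local or SPMD finish); again the names are illustrative, not the runtime's actual classes.

// Illustrative sketch only: a finish with no nested remote tasks needs just
// one counter instead of a row of counters per place.
class SingleCounterFinishSketch {
  private var live:Long = 0;

  def spawned()    { atomic live = live + 1; } // a task is forked under this finish
  def terminated() { atomic live = live - 1; } // a task under this finish ended
  def waitDone()   { when (live == 0) {} }     // block the finish body until quiescence
}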
Memory Management

§ large pages required to minimize TLB misses
§ registered pages required for RDMAs
§ congruent addresses required for RDMAs at scale

§ solution: dedicated memory allocator è issue is contained
  § congruent registered pages
  § large pages if available
  § only used for performance-critical arrays
  § only impacts allocation & deallocation