ParalleX: A Cure for Scaling Impaired Parallel Applications
Hartmut Kaiser ([email protected])

Source: ParalleX, Louisiana State University (stellar.cct.lsu.edu/pubs/Cetraro.pdf, posted 2011-10-09). Cetraro Workshop 2011, 6/28/2011.

Transcript
Page 1:

ParalleX A Cure for Scaling Impaired Parallel Applications

Hartmut Kaiser ([email protected])

Page 2:

Tianhe-1A 2.566 Petaflops Rmax

Heterogeneous Architecture:

• 14,336 Intel Xeon CPUs

• 7,168 Nvidia Tesla M2050 GPUs

• More than 100 racks

• 4.04 megawatts

Cetraro Workshop 2011

2

6/28/2011

Page 3:

Technology Demands new Response


Page 5:

Amdahl’s Law

• P: Proportion of parallel code

• N: Number of processors

Speedup: S = 1 / ((1 − P) + P/N)

Figure courtesy of Wikipedia (http://en.wikipedia.org/wiki/Amdahl's_law)
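The formula is easy to evaluate directly. A minimal sketch (plain Python; the function name is ours) shows why scaling-impaired applications hit a wall: even a 95% parallel code tops out below 20× on 1024 processors.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Predicted speedup for parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# A fully parallel code scales linearly with the processor count ...
print(amdahl_speedup(1.0, 8))        # 8 processors -> 8.0
# ... but 5% serial work caps 1024 processors below 20x.
print(amdahl_speedup(0.95, 1024))
```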


Page 6:

The 4 Horsemen of the Apocalypse: SLOW

• Starvation

• Latencies

• Overheads

• Waiting for contention resolution

Page 7:

Efficiency Factors

• Starvation
▫ Insufficient concurrent work to maintain high utilization of resources
▫ Inadequate global or local parallelism due to poor load balancing

• Latency
▫ Time-distance delay of remote resource access and services
▫ E.g., memory access and system-wide message passing

• Overhead
▫ Critical path work for management of parallel actions and resources
▫ Work not necessary for the sequential variant

• Waiting for contention resolution
▫ Delay due to lack of availability of an oversubscribed shared resource
▫ Bottlenecks in the system, e.g., memory bank access and network bandwidth


Page 9:

A Game Changer

Page 10:

Adaptive Mesh Refinement (AMR)

Page 11:

Why Adaptive Mesh Refinement (AMR)?

• From 31 Mar 2010 to 31 Mar 2011, at least 68,394,791 SUs were dedicated on TeraGrid to finite-difference-based AMR applications (out of ~1.407 billion SUs allocated) – about 5% of runs

• Nearly all of the publicly available AMR toolkits use MPI

• Strong scaling of AMR applications is typically very poor

• ParalleX functionality fits nicely with the AMR algorithm: global address space, “work stealing”, parallelism discovery, dynamic threads, implicit load balancing

Page 12:

Constraint-based Synchronization for AMR

• Compute dependencies at task instantiation time

• No global barriers; uses constraint-based synchronization

• Computation flows at its own pace

• Message driven

• Symmetry between local and remote task creation/execution
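The idea above can be sketched in a few lines (standard-library Python, not HPX's C++ API; the stencil and all names are ours): each task's start is constrained only by its own input futures, fixed when the task is created, so there is no step-wide barrier.

```python
from concurrent.futures import ThreadPoolExecutor

def average(left, centre, right):
    # Blocks only on this zone's three inputs -- the constraint --
    # never on a global barrier across all zones.
    return (left.result() + centre.result() + right.result()) / 3.0

with ThreadPoolExecutor(max_workers=4) as pool:
    # Previous timestep's values, wrapped as futures.
    prev = [pool.submit(lambda v=v: v) for v in (0.0, 0.0, 1.0, 0.0, 0.0)]
    padded = [prev[0]] + prev + [prev[-1]]   # clamp the boundary
    # Dependencies are recorded at task-instantiation time:
    step = [pool.submit(average, padded[i], padded[i + 1], padded[i + 2])
            for i in range(len(prev))]
    result = [f.result() for f in step]
```

Each zone of the new timestep proceeds as soon as its three inputs exist; zones never wait for unrelated zones to finish.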

Page 13:

What’s ParalleX?

• Active global address space (AGAS) instead of PGAS

• Message driven instead of message passing

• Lightweight control objects instead of global barriers

• Latency hiding instead of latency avoidance

• Adaptive locality control instead of static data distribution

• Fine-grained parallelism of lightweight threads instead of Communicating Sequential Processes (CSP/MPI)

• Moving work to data instead of moving data to work

Page 14:

The Runtime System – A Game Changer

• Runtime system
▫ is: ephemeral, dedicated to and exists only with an application
▫ is not: the OS, persistent and dedicated to the hardware system

• Moves us from a static to a dynamic operational regime
▫ Exploits situational awareness for causality-driven adaptation
▫ A guided missile with continuous course correction rather than a fired projectile with a fixed trajectory

• Based on foundational assumptions
▫ Untapped system resources to be harvested
▫ More computational work will yield reduced time and lower power
▫ Opportunities for enhanced efficiencies discovered only in flight
▫ New methods of control to deliver superior scalability

• “Undiscovered Country” – adding a dimension of systematics
▫ Adding a new component to the system stack
▫ Path-finding through the new trade-off space

Page 15:

HPX Runtime System Design

• The current version of HPX provides the following infrastructure on conventional systems, as defined by the ParalleX execution model:

▫ Active Global Address Space (AGAS)

▫ ParalleX Threads and ParalleX Thread Management

▫ Parcel Transport and Parcel Management

▫ Local Control Objects (LCOs)


Page 17:

Main Runtime System Tasks

• Manage parallel execution for the application (Starvation)
▫ Delineating parallelism; runtime-adaptive management of parallelism
▫ Synchronizing parallel tasks
▫ Thread scheduling; static and dynamic load balancing

• Mitigate latencies for the application (Latencies)
▫ Latency hiding through overlap of computation and communication
▫ Latency avoidance through locality management
▫ Dynamic copy-semantics support

• Reduce overhead for the application (Overheads)
▫ Synchronization, scheduling, load balancing, communication, context switching, memory management, address translation

• Resolve contention for the application (Contention)
▫ Adaptive routing, resource scheduling, load balancing
▫ Localized request buffering for logical resources

Page 18:

Active Global Address Space

• Global address space throughout the system
▫ Removes the dependency on static data distribution
▫ Enables dynamic load balancing of application and system data

• AGAS assigns global names (identifiers: unstructured 128-bit integers) to all entities managed by HPX

• Unlike PGAS, AGAS provides mechanisms for resolving global identifiers into corresponding local virtual addresses (LVAs)
▫ An LVA comprises a locality ID, the type of the entity being referred to, and its local memory address
▫ Moving an entity to a different locality updates this mapping
▫ The current implementation is based on a centralized database storing the mappings, accessible over the local area network
▫ Local caching policies have been implemented to prevent bottlenecks and minimize the number of required round-trips

• The current implementation allows autonomous creation of globally unique IDs in the locality where the entity is initially located, and supports memory pooling of similar objects to minimize overhead
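The essential AGAS contract can be sketched as a tiny name service (Python toy model, not HPX's implementation; all names are ours): a 128-bit global id maps to a (locality, local virtual address) pair, and migrating an entity updates only that mapping, so every holder of the id keeps a valid name.

```python
import uuid

class MiniAGAS:
    """Toy resolver: global id -> (locality, local virtual address).
    A sketch of the AGAS idea only, not HPX's implementation."""
    def __init__(self):
        self._table = {}

    def register(self, locality, lva):
        gid = uuid.uuid4().int            # unstructured 128-bit integer
        self._table[gid] = (locality, lva)
        return gid

    def resolve(self, gid):
        return self._table[gid]

    def migrate(self, gid, locality, lva):
        # Moving the entity updates only the mapping; every holder of
        # gid still has a valid name (unlike a static PGAS distribution).
        self._table[gid] = (locality, lva)

agas = MiniAGAS()
gid = agas.register(locality=0, lva=0x7F00)
agas.migrate(gid, locality=3, lva=0x1000)   # entity moved; name unchanged
```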


Page 19:

Thread Management

• The thread manager is modular and implements work-queue-based management as specified by the ParalleX execution model

• Threads are cooperatively scheduled at user level without requiring a kernel transition

• Specially designed synchronization primitives such as semaphores, mutexes, etc. allow synchronization of HPX threads in the same way as conventional threads

• Thread management currently supports several key modes:

▫ Global thread queue

▫ Local queue (work stealing)

▫ Local priority queue (work stealing)
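The scheduling policy of the work-stealing modes can be sketched in miniature (Python toy, not HPX's scheduler; names ours): each core serves its own queue from the front, and an idle core steals from the opposite end of a victim's queue.

```python
from collections import deque

class Worker:
    """Toy scheduler core: pop your own deque from the front; when it
    runs dry, steal from the back of a victim's deque (a sketch of the
    'local queue (work stealing)' mode, not HPX's implementation)."""
    def __init__(self):
        self.queue = deque()

    def next_task(self, victims):
        if self.queue:
            return self.queue.popleft()    # own work: FIFO order
        for victim in victims:
            if victim.queue:
                return victim.queue.pop()  # steal from the other end
        return None                        # nothing anywhere: starvation

w0, w1 = Worker(), Worker()
w0.queue.extend(["t0", "t1", "t2"])
stolen = w1.next_task(victims=[w0])   # idle w1 steals w0's newest task
own = w0.next_task(victims=[w1])      # w0 still serves its own queue first
```

Stealing from the opposite end keeps the two cores mostly off each other's "hot" end of the queue, which is one reason this mode scales better than a single global queue.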


Page 20:

Parcel Management

• Any inter-locality messaging is based on Parcels

▫ In HPX implementation parcels are represented as polymorphic objects

▫ An HPX entity on creating a parcel object sends it to the parcel handler.

• The parcel handler serializes the parcel where all dependent data is bundled along with the parcel.

• At the receiving locality the parcel is received using the standard TCP/IP protocols,

• The action manager de-serializes the parcel and creates HPX threads out of the specification

[Figure: parcel flow between two localities – an HPX entity on Locality 1 put()s a parcel object to the parcel handler; the serialized parcel travels to Locality 2, where the action manager de-serializes it and creates HPX threads]
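The send/receive path above can be sketched end to end (Python stand-in, not HPX's C++ parcel types; the action table and names are ours): the sender bundles an action name and its arguments, the receiver de-serializes and runs the action on a fresh thread.

```python
import pickle
import threading

results = []
# Hypothetical action table; in HPX, actions are C++ function objects.
ACTIONS = {"add": lambda a, b: results.append(a + b)}

def make_parcel(gid, action, args):
    """Sending side: bundle destination id, action name, and all
    dependent data into one serialized blob."""
    return pickle.dumps({"gid": gid, "action": action, "args": args})

def action_manager(wire_bytes):
    """Receiving side: de-serialize the parcel and run its action on a
    new thread (standing in for an HPX thread)."""
    parcel = pickle.loads(wire_bytes)
    t = threading.Thread(target=ACTIONS[parcel["action"]],
                         args=parcel["args"])
    t.start()
    t.join()

# 'Send' a parcel to ourselves; over the wire this would be TCP/IP.
action_manager(make_parcel(gid=42, action="add", args=(2, 3)))
```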

Page 21:

Exemplar LCO: Futures

• In HPX, the futures LCO refers to an object that acts as a proxy for a result that is initially not known

• When user code invokes a future (using future.get()), the thread does one of two things:

▫ If the remote data/arguments are available, the future.get() operation fetches the data and the execution of the thread continues

▫ If the remote data is NOT available, the thread may continue until it requires the actual value; then the thread suspends, allowing other threads to continue execution. The original thread re-activates as soon as the data dependency is resolved
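The same semantics can be sketched with Python's concurrent.futures (an analogy, not HPX's C++ interface; names ours): the caller overlaps local work with the pending computation and only blocks at the point where the value is actually needed.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def remote_computation():
    time.sleep(0.05)      # stand-in for work happening on locality 2
    return 42

with ThreadPoolExecutor() as pool:
    future = pool.submit(remote_computation)
    # The calling thread keeps doing useful work ...
    local_part = sum(range(10))
    # ... and blocks in .result() (HPX: future.get()) only if the
    # value is not ready yet. In HPX the blocked thread would suspend
    # and free its core for other PX threads rather than spin.
    answer = future.result() + local_part
```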


[Figure: future.get() across localities – thread 1 on Locality 1 suspends on future.get(), thread 2 executes, and thread 1 reactivates when the result arrives from Locality 2. Note: thread 1 is suspended only if the results from Locality 2 are not readily available; if they are, thread 1 continues to complete execution.]

Page 22:

Based on HPX – An exemplar implementation of ParalleX for conventional systems

Page 23:

Starvation: Non-uniform Workload

[Figure: AMR example mesh structure – wave amplitude (0.000–0.010) vs. computational domain radius (4–12), with regions at 0, 1, and 2 levels of refinement (LoR)]


Page 25:

Starvation: Non-uniform Workload

Page 26:

Grain Size: The New Freedom

Page 27:

Overhead: Load Balancing

Competing effects for optimal grain size: overheads vs. load balancing (starvation)


Page 29:

Overhead: Threads

[Figure: execution time [s] (0–120) vs. number of OS threads/cores (0–48) for 1,000,000 PX threads, at per-thread workloads of 0 μs, 3.5 μs, 7 μs, 14.5 μs, 29 μs, 58 μs, and 115 μs]


Page 31:

Scaling: AMR using MPI and HPX

[Figure: scaling of the MPI AMR application – scaling normalized to 1 core (0–14) vs. levels of AMR refinement (0–8), for 1, 2, 4, 10, and 20 cores]

Page 32:

Scaling: AMR using MPI and HPX

[Figure: scaling normalized to 1 core (0–14) vs. levels of AMR refinement (0–8), for 1, 2, 4, 10, and 20 cores – left: MPI AMR application; right: HPX AMR application]

Page 33:

Performance: AMR using MPI and HPX

[Figure: wallclock time ratio MPI/HPX (HPX = 1; range 0.5–4) vs. number of cores (1, 2, 5, 10, 20, 30), depending on levels of refinement (0–3 LoR); pollux.cct.lsu.edu, 32 cores]

Page 34:

A Cure for Scaling Impaired Parallel Applications?

Page 35:

ParalleX – Is it a Cure?

• Not completely sure yet

▫ Halfway through

▫ Promising results on SMP systems

▫ First (promising) results on distributed systems

• No code changes required!

• Current projects

▫ Custom hardware (FPGAs) accelerating system functionality

▫ Improving performance of AGAS, parcel transport, …

▫ Redefining I/O

Page 36:

ParalleX – Is it a Cure?

• The ParalleX execution model can be implemented without adding significantly more overhead than MPI does

• Implicit load balancing for AMR simulations, based on finer-grained parallelism, is highly beneficial

• There are regimes and applications that can benefit from this highly parallel model

• Runtime granularity control is crucial for optimal scaling