Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.1: Hardware Dr. Peter Tröger + Teaching Team
OpenHPI - Parallel Programming Concepts - Week 5

Jun 25, 2015

Peter Tröger

Week 5 in the OpenHPI course on parallel programming concepts is about parallel applications in distributed systems.

Find the whole course at http://bit.ly/1l3uD4h.
Transcript
Page 1: OpenHPI - Parallel Programming Concepts - Week 5

Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.1: Hardware

Dr. Peter Tröger + Teaching Team

Page 2: OpenHPI - Parallel Programming Concepts - Week 5

Summary: Week 4

■  Accelerators enable major speedup for data parallelism
   □  SIMD execution model (no branching)
   □  Memory latency managed with many light-weight threads
■  Tackle diversity with OpenCL
   □  Loop parallelism with index ranges
   □  Kernels in C, compiled at runtime
   □  Complex memory hierarchy supported
■  Getting fast is easy, getting faster is hard
   □  Best practices for accelerators
   □  Hardware knowledge needed

What if my computational problem still demands more power?

Page 3: OpenHPI - Parallel Programming Concepts - Week 5

Parallelism for …

■  Speed – compute faster
■  Throughput – compute more in the same time
■  Scalability – compute faster / more with additional resources
   □  Huge scalability only with shared-nothing systems
   □  Also depends on application characteristics

[Figure: scaling up – adding processing elements that share one main memory – vs. scaling out – adding more machines, each with its own main memory]

Page 4: OpenHPI - Parallel Programming Concepts - Week 5

Parallel Hardware

■  Shared memory system
   □  Typically a single machine, common address space for tasks
   □  Hardware scaling is limited (power / memory wall)
■  Shared nothing (distributed memory) system
   □  Tasks on multiple machines, can only access local memory
   □  Global task coordination by explicit messaging
   □  Easy scale-out by adding machines to the network

[Figure: shared memory system – tasks on processing elements with caches accessing one shared memory – vs. shared-nothing system – tasks with local memory exchanging messages]

Page 5: OpenHPI - Parallel Programming Concepts - Week 5

Parallel Hardware

■  Shared memory system → collection of processors
   □  Integrated machine for capacity computing
   □  Prepared for a large variety of problems
■  Shared-nothing system → collection of computers
   □  Clusters and supercomputers for capability computing
   □  Installation to solve a few problems in the best way
   □  Parallel software must be able to leverage multiple machines at the same time
   □  Difference to distributed systems (Internet, Cloud)
      ◊  Single organizational domain, managed as a whole
      ◊  Single parallel application at a time, no separation of client and server application
      ◊  Hybrids are possible (e.g. HPC in the Amazon AWS cloud)

Page 6: OpenHPI - Parallel Programming Concepts - Week 5

Shared Nothing: Clusters

■  Collection of stand-alone machines connected by a local network
   □  Cost-effective technique for a large-scale parallel computer
   □  Users are builders, have control over their system
   □  Synchronization much slower than in shared memory
   □  Task granularity becomes an issue

[Figure: cluster – processing elements with local memory, tasks exchanging messages over the network]

Page 7: OpenHPI - Parallel Programming Concepts - Week 5

Shared Nothing: Supercomputers

■  Supercomputers / Massively Parallel Processing (MPP) systems
   □  (Hierarchical) cluster with a lot of processors
   □  Still standard hardware, but specialized setup
   □  High-performance interconnection network
   □  For massive data-parallel applications, mostly simulations (weapons, climate, earthquakes, airplanes, car crashes, ...)
■  Examples (Nov 2013)
   □  BlueGene/Q, 1.5 million cores, 1.5 PB memory, 17.1 PFlops
   □  Tianhe-2, 3.1 million cores, 1 PB memory, 17,808 kW power, 33.86 PFlops (quadrillions of calculations per second)
■  Annual ranking with the TOP500 list (www.top500.org)

Page 8: OpenHPI - Parallel Programming Concepts - Week 5

Example

[Figure: Blue Gene/Q packaging hierarchy – © 2011 IBM Corporation, IBM System Technology Group]

Blue Gene/Q
1. Chip: 16+2 µP cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s

•  Sustained single-node performance: 10x P, 20x L
•  MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
•  Software and hardware support for programming models for exploitation of node hardware concurrency

Page 9: OpenHPI - Parallel Programming Concepts - Week 5

Interconnection Networks

■  Bus systems
   □  Static approach, low costs
   □  Shared communication path, broadcasting of information
   □  Scalability issues with a shared bus
■  Completely connected networks
   □  Static approach, high costs
   □  Only direct links, optimal performance
■  Star-connected networks
   □  Static approach with a central switch
   □  Fewer links, still very good performance
   □  Scalability depends on the central switch

[Figure: bus network, completely connected network, and star-connected network of processing elements (PEs) with a central switch]

Page 10: OpenHPI - Parallel Programming Concepts - Week 5

Interconnection Networks

■  Crossbar switch
   □  Dynamic switch-based network
   □  Supports multiple parallel direct connections without collisions
   □  Fewer edges than a completely connected network, but still scalability issues
■  Fat tree
   □  Use ‘wider’ links in higher parts of the interconnect tree
   □  Combine tree design advantages with a solution for root node scalability
   □  Communication distance between any two nodes is no more than 2 log(#PEs)

[Figure: crossbar switch connecting PE1…PEn, and a fat tree of switches above the processing elements]

Page 11: OpenHPI - Parallel Programming Concepts - Week 5

Interconnection Networks

■  Linear array
■  Ring
   □  Linear array with connected endings
■  N-way D-dimensional mesh
   □  Matrix of processing elements
   □  Not more than N neighbor links
   □  Structured in D dimensions
■  N-way D-dimensional torus
   □  Mesh with “wrap-around” connection

[Figure: linear array, ring, mesh, and torus arrangements of processing elements]
■  Ring: only two connections per PE (almost optimal cost)
■  Fully connected graph: optimal connectivity, but high cost
■  Mesh and torus: compromise between cost and connectivity (e.g. 4-way 2D mesh, 8-way 2D mesh, 4-way 2D torus)

Page 12: OpenHPI - Parallel Programming Concepts - Week 5

Example: Blue Gene/Q 5D Torus

■  5D torus interconnect in the Blue Gene/Q supercomputer
   □  2 GB/s on all 10 links, 80 ns latency to direct neighbors
   □  Additional link for communication with I/O nodes

[Figure: Blue Gene/Q 5D torus – IBM]

Page 13: OpenHPI - Parallel Programming Concepts - Week 5

Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.2: Granularity and Task Mapping

Dr. Peter Tröger + Teaching Team

Page 14: OpenHPI - Parallel Programming Concepts - Week 5

Workload

■  Last week showed that task granularity may be flexible
   □  Example: OpenCL work group size
■  But: communication overhead becomes significant now
   □  What is the right level of task granularity?

Page 15: OpenHPI - Parallel Programming Concepts - Week 5

Surface-To-Volume Effect

■  Envision the work to be done (in parallel) as a sliced 3D cube
   □  Not a demand on the application data, just a representation
■  Slicing represents splitting into tasks
■  Computational work of a task
   □  Proportional to the volume of the cube slice
   □  Represents the granularity of decomposition
■  Communication requirements of the task
   □  Proportional to the surface of the cube slice
■  “Communication-to-computation” ratio
   □  Fine granularity: communication high, computation low
   □  Coarse granularity: communication low, computation high
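A rough worked example (not on the slide): splitting a cube of n³ cells into p slabs gives each task n³/p cells of computation, but about 2n² boundary values to exchange with its two neighbors. The communication-to-computation ratio is therefore roughly 2n² / (n³/p) = 2p/n – it grows with finer decomposition and shrinks with larger problem size.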

Page 16: OpenHPI - Parallel Programming Concepts - Week 5

Surface-To-Volume Effect

[Figure: fine-grained vs. coarse-grained slicing of the cube – nicerweb.com]

■  Fine-grained decomposition for using all processing elements?
■  Coarse-grained decomposition to reduce communication overhead?
■  A tradeoff question!

Page 17: OpenHPI - Parallel Programming Concepts - Week 5

Surface-To-Volume Effect

■  Heatmap example with 64 data cells

■  Version (a): 64 tasks
   □  64 x 4 = 256 messages, 256 data values

□  64 processing elements used in parallel

■  Version (b): 4 tasks

□  16 messages, 64 data values

□  4 processing elements used in parallel

[Figure: heatmap decompositions (a) and (b) – Foster]

Page 18: OpenHPI - Parallel Programming Concepts - Week 5

Surface-To-Volume Effect

■  Rule of thumb
   □  Agglomerate tasks to avoid communication
   □  Stop when parallelism is no longer exploited well enough
   □  Agglomerate in all dimensions at the same time
■  Influencing factors
   □  Communication technology + topology
   □  Serial performance per processing element
   □  Degree of application parallelism
■  Task communication vs. network topology
   □  Resulting task graph must be mapped to the network topology
   □  Task-to-task communication may need multiple hops

[Figure: task graph mapped onto a network topology – Foster]

Page 19: OpenHPI - Parallel Programming Concepts - Week 5

The Task Mapping Problem

■  Given …
   □  … a number of homogeneous processing elements with performance characteristics,
   □  … some interconnection topology of the processing elements with performance characteristics,
   □  … an application dividable into parallel tasks.
■  Questions:
   □  What is the optimal task granularity?
   □  How should the tasks be placed on processing elements?
   □  Do we still get speedup / scale-up by this parallelization?
■  Task mapping is still research, mostly manual tuning today
■  More options with configurable networks / dynamic routing
   □  Reconfiguration of hardware communication paths

Page 20: OpenHPI - Parallel Programming Concepts - Week 5

Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.3: Programming with MPI

Dr. Peter Tröger + Teaching Team

Page 21: OpenHPI - Parallel Programming Concepts - Week 5

Message Passing

■  Parallel programming paradigm for “shared nothing” environments
   □  Implementations for shared memory available, but typically not the best approach
■  Users submit their message passing program & data as a job
■  Cluster management system creates program instances

[Figure: the submission host hands the application and its data as a job to the cluster management software, which starts Instance 0…3 on the execution hosts]

Page 22: OpenHPI - Parallel Programming Concepts - Week 5

Single Program Multiple Data (SPMD)

[Figure: one SPMD program – the MPI ring example shown in full in Unit 5.3 – is started as Instance 0…4; every instance runs the same code on its share of the input data, distinguished only by its rank]

Page 23: OpenHPI - Parallel Programming Concepts - Week 5

Message Passing Interface (MPI)

■  Many optimized messaging libraries for “shared nothing” environments, developed by networking hardware vendors
■  Need for a standardized API solution: Message Passing Interface
   □  Definition of API syntax and semantics
   □  Enables source code portability, not interoperability
   □  Software independent from hardware concepts
■  Fixed number of process instances, defined on startup
   □  Point-to-point and collective communication
■  Focus on efficiency of communication and memory usage
■  MPI Forum standard
   □  Consortium of industry and academia
   □  MPI 1.0 (1994), 2.0 (1997), 3.0 (2012)

Page 24: OpenHPI - Parallel Programming Concepts - Week 5

MPI Communicators

■  Each application instance (process) has a rank, starting at zero
■  Communicator: handle for a group of processes
   □  Unique rank numbers inside the communicator group
   □  Instance can determine communicator size and own rank
   □  Default communicator MPI_COMM_WORLD
   □  Instance may be in multiple communicator groups
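A minimal sketch (not from the slides) of the “determine rank and comm_size” step that the later code examples assume:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, comm_size;
    MPI_Init(&argc, &argv);                      // start the MPI runtime
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        // own rank within the communicator
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);   // number of instances in the group
    printf("Instance %d of %d\n", rank, comm_size);
    MPI_Finalize();
    return 0;
}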

[Figure: a communicator group of four instances – each instance sees size 4 and its own rank 0…3]

Page 25: OpenHPI - Parallel Programming Concepts - Week 5

Communication

■  Point-to-point communication between instances
   int MPI_Send(void* buf, int count, MPI_Datatype type, int destRank, int tag, MPI_Comm com);
   int MPI_Recv(void* buf, int count, MPI_Datatype type, int sourceRank, int tag, MPI_Comm com, MPI_Status* status);
■  Parameters
   □  Send / receive buffer + size + data type
   □  Sender provides receiver rank, receiver provides sender rank
   □  Arbitrary message tag
■  Source / destination identified by [tag, rank, communicator] tuple
■  Default send / receive will block until the match occurs
■  Useful constants: MPI_ANY_TAG, MPI_ANY_SOURCE
■  Variations in the API for different buffering behavior
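A minimal point-to-point sketch (assuming rank has been determined as above; the tag value 99 and the payload are made up for illustration):

int value = 42;
if (rank == 0) {
    // rank 0 sends one integer to rank 1, tagged 99
    MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
} else if (rank == 1) {
    // rank 1 blocks until a matching message (source 0, tag 99) arrives
    MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}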

Page 26: OpenHPI - Parallel Programming Concepts - Week 5

Example: Ring communication

// … (determine rank and comm_size) …
int token;
if (rank != 0) {
    // Receive from your 'left' neighbor if you are not rank 0
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, rank - 1);
} else {
    // Set the token's value if you are rank 0
    token = -1;
}
// Send your local token value to your 'right' neighbor
MPI_Send(&token, 1, MPI_INT, (rank + 1) % comm_size, 0, MPI_COMM_WORLD);
// Now rank 0 can receive from the last rank.
if (rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, comm_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           rank, token, comm_size - 1);
}

[mpitutorial.com]

Page 27: OpenHPI - Parallel Programming Concepts - Week 5

Deadlocks

Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
} else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send is blocking, there is a deadlock.

int MPI_Send(void* buf, int count, MPI_Datatype type, int destRank, int tag, MPI_Comm com);
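One possible fix (a sketch, not from the slide): let rank 1 post its receives in the same order as rank 0 posts its sends, so the first blocking send can complete:

} else if (myrank == 1) {
    // Matching order: receive the tag-1 message first, then the tag-2 message
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
}

Alternatives are non-blocking sends (MPI_Isend) or the combined MPI_Sendrecv call.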

Page 28: OpenHPI - Parallel Programming Concepts - Week 5

Collective Communication

■  Point-to-point communication vs. collective communication
■  Use cases: synchronization, data distribution & gathering
■  All processes in a (communicator) group communicate together
   □  One sender with multiple receivers (one-to-all)
   □  Multiple senders with one receiver (all-to-one)
   □  Multiple senders and multiple receivers (all-to-all)
■  Typical pattern in supercomputer applications
■  Participants continue once the group communication is done
   □  Always a blocking operation
   □  Must be executed by all processes in the group
   □  No assumptions on the state of other participants on return

Page 29: OpenHPI - Parallel Programming Concepts - Week 5

Barrier

■  Communicator members block until everybody reaches the barrier
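A minimal usage sketch (assuming rank has been determined as in the earlier examples):

printf("Process %d before the barrier\n", rank);
MPI_Barrier(MPI_COMM_WORLD);   // blocks until every process in the group has arrived
printf("Process %d after the barrier\n", rank);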

[Figure: each instance of the SPMD program (the ring example from earlier) reaches MPI_Barrier(comm) at a different point in time; all of them block there until the last one arrives]

Page 30: OpenHPI - Parallel Programming Concepts - Week 5

Broadcast

■  int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int rootRank, MPI_Comm comm)
   □  rootRank is the rank of the chosen root process
   □  Root process broadcasts data in buffer to all other processes, itself included
   □  On return, all processes have the same data in their buffer
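A minimal usage sketch (assuming rank has been determined as before; the variable name config and the value are made up for illustration):

int config = 0;
if (rank == 0) {
    config = 42;   // only the root holds the value initially
}
// afterwards every process in MPI_COMM_WORLD has config == 42
MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);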

[Figure: broadcast – data item D0 of the root process is copied to every process in the group]

Page 31: OpenHPI - Parallel Programming Concepts - Week 5

Scatter

■  int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
   □  sendbuf buffer on the root process is divided, parts are sent to all processes, including root
   □  MPI_SCATTERV allows a varying count of data per rank

[Figure: scatter – items D0…D5 from the root's send buffer are distributed one per process; gather is the inverse operation]

Page 32: OpenHPI - Parallel Programming Concepts - Week 5

Gather

■  int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int rootRank, MPI_Comm comm)
   □  Each process (including the root process) sends the data in its sendbuf buffer to the root process
   □  Incoming data in recvbuf is stored in rank order
   □  recvbuf parameter is ignored for all non-root processes
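A minimal usage sketch (assuming rank and comm_size have been determined and <stdlib.h> is included; the contributed value rank * rank is made up for illustration):

int my_value = rank * rank;                      // each process contributes one value
int *gathered = NULL;
if (rank == 0) {
    gathered = malloc(comm_size * sizeof(int));  // receive buffer only needed on the root
}
MPI_Gather(&my_value, 1, MPI_INT,
           gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
// on rank 0, gathered[i] now holds i*i, stored in rank order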

[Figure: gather – one item from each process is collected into the root's receive buffer in rank order]

Page 33: OpenHPI - Parallel Programming Concepts - Week 5

Reduction

■  int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int rootRank, MPI_Comm comm)
   □  Similar to MPI_Gather
   □  Additional reduction operation op to aggregate received data: maximum, minimum, sum, product, boolean operators, maximum location, minimum location
■  MPI implementation can overlap communication and reduction calculation for faster results

[Figure: reduce with '+' – the values D0A, D0B, D0C from the processes are combined into D0A + D0B + D0C at the root]

Page 34: OpenHPI - Parallel Programming Concepts - Week 5

Example: MPI_Scatter + MPI_Reduce

/* -- E. van den Berg 07/10/2001 -- */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int data[] = {1, 2, 3, 4, 5, 6, 7};  // Size must be >= #processors
    int rank, i = -1, j = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Scatter((void *)data, 1, MPI_INT, (void *)&i,
                1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("[%d] Received i = %d\n", rank, i);

    MPI_Reduce((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD,
               0, MPI_COMM_WORLD);

    printf("[%d] j = %d\n", rank, j);
    MPI_Finalize();
    return 0;
}

Page 35: OpenHPI - Parallel Programming Concepts - Week 5

What Else

■  Variations: MPI_Isend, MPI_Sendrecv, MPI_Allgather, MPI_Alltoall, … (see the sketch below)
■  Definition of virtual topologies for better task mapping
■  Complex data types
■  Packing / unpacking (sprintf / sscanf)
■  Group / communicator management
■  Error handling
■  Profiling interface
■  Several implementations available
   □  MPICH – Argonne National Laboratory
   □  OpenMPI – consortium of universities and industry
   □  ...
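A minimal sketch of the non-blocking variant mentioned above (assuming rank set up as before; destination rank 1, tag 0, and the value are made up for illustration):

MPI_Request request;
int value = 42;
MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);  // returns immediately
/* ... overlap communication with other work here ... */
MPI_Wait(&request, MPI_STATUS_IGNORE);  // complete the send before reusing the buffer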

Page 36: OpenHPI - Parallel Programming Concepts - Week 5

Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.4: Programming with Channels

Dr. Peter Tröger + Teaching Team

Page 37: OpenHPI - Parallel Programming Concepts - Week 5

Communicating Sequential Processes

■  Formal process algebra to describe concurrent systems
   □  Developed by Tony Hoare at the University of Oxford (1977)
      ◊  Also inventor of QuickSort and Hoare logic
   □  Computer systems act and interact with the environment
   □  Decomposition into subsystems (processes) that operate concurrently inside the system
   □  Processes interact with other processes, or the environment
■  Book: C. A. R. Hoare, Communicating Sequential Processes, 1985
■  A mathematical theory, described with algebraic laws
■  CSP channel concept available in many programming languages for “shared nothing” systems
■  Complete approach implemented in the Occam language

Page 38: OpenHPI - Parallel Programming Concepts - Week 5

CSP: Processes

■  Behavior of real-world objects can be described through their interaction with other objects
   □  Leave out internal implementation details
   □  Interface of a process is described as a set of atomic events
■  Example: ATM and user, both modeled as processes
   □  card event – insertion of a credit card in the ATM card slot
   □  money event – extraction of money from the ATM dispenser
■  Alphabet – set of relevant events for an object description
   □  Events may never happen; interaction is restricted to these events
   □  αATM = αUser = {card, money}
■  A CSP process is the behavior of an object, described with its alphabet

Page 39: OpenHPI - Parallel Programming Concepts - Week 5

Communication in CSP

■  Special class of event: communication
   □  Modeled as a unidirectional channel between processes
   □  Channel name is a member of the alphabets of both processes
   □  Send activity described by multiple c.v events (channel c transmits value v)
■  Channel approach assumes rendezvous behavior
   □  Sender and receiver block on the channel operation until the message is transmitted
   □  Implicit barrier based on communication
■  With the formal foundation, mathematical proofs are possible
   □  When two concurrent processes communicate with each other only over a single channel, they cannot deadlock.
   □  A network of non-stopping processes which is free of cycles cannot deadlock.
   □  …

Page 40: OpenHPI - Parallel Programming Concepts - Week 5

What's the Deal?

■  Any possible system can be modeled through event chains
   □  Enables mathematical proofs for deadlock freedom, based on the basic assumptions of the formalism (e.g. single channel assumption)
■  Some tools available (check readings page)
■  CSP was the formal base for the Occam language
   □  Language constructs follow the formalism
   □  Mathematical reasoning about the behavior of written code
■  Still active research (Welsh University), channel concept frequently adopted
   □  CSP channel implementations for Java, MPI, Go, C, Python, …
   □  Other formalisms based on CSP, e.g. the Task/Channel model

Page 41: OpenHPI - Parallel Programming Concepts - Week 5

Channels in Scala

Scope-based channel sharing:

actor {
  var out: OutputChannel[String] = null
  val child = actor {
    react {
      case "go" => out ! "hello"
    }
  }
  val channel = new Channel[String]
  out = channel
  child ! "go"
  channel.receive {
    case msg => println(msg.length)
  }
}

Sending channels in messages:

case class ReplyTo(out: OutputChannel[String])
val child = actor {
  react {
    case ReplyTo(out) => out ! "hello"
  }
}
actor {
  val channel = new Channel[String]
  child ! ReplyTo(channel)
  channel.receive {
    case msg => println(msg.length)
  }
}

Page 42: OpenHPI - Parallel Programming Concepts - Week 5

Channels in Go

package main

import "fmt"

// Concurrent sayHello function: puts a value into channel ch1
func sayHello(ch1 chan string) {
    ch1 <- "Hello World\n"
}

// Program start: create the channel, run sayHello concurrently,
// then read the value from ch1 and print it
func main() {
    ch1 := make(chan string)
    go sayHello(ch1)
    fmt.Printf(<-ch1)
}

$ 8g chanHello.go               (compile application)
$ 8l -o chanHello chanHello.8   (link application)
$ ./chanHello                   (run application)
Hello World
$

Page 43: OpenHPI - Parallel Programming Concepts - Week 5

Channels in Go

■  select concept allows switching between available channels
   □  All channels are evaluated
   □  If multiple can proceed, one is chosen randomly
   □  Default clause if no channel is available
■  Channels are typically first-class language constructs
   □  Example: client provides a response channel in the request
■  Popular solution to get deterministic behavior

select {
case v := <-ch1:
    fmt.Println("channel 1 sends", v)
case v := <-ch2:
    fmt.Println("channel 2 sends", v)
default: // optional
    fmt.Println("neither channel was ready")
}

Page 44: OpenHPI - Parallel Programming Concepts - Week 5

Task/Channel Model

■  Computational model for multi-computers by Ian Foster
■  Similar concepts to CSP
■  Parallel computation consists of one or more tasks
   □  Tasks execute concurrently
   □  Number of tasks can vary during execution
   □  Task: serial program with local memory
   □  A task has in-ports and out-ports as interface to the environment
   □  Basic actions: read / write local memory, send message on out-port, receive message on in-port, create new task, terminate

Page 45: OpenHPI - Parallel Programming Concepts - Week 5

Task/Channel Model

■  Out-port / in-port pairs are connected by channels
   □  Channels can be created and deleted
   □  Channels can be referenced as ports, which can be part of a message
   □  Send operation is non-blocking
   □  Receive operation is blocking
   □  Messages in a channel stay in order
■  Tasks are mapped to physical processors by the execution environment
   □  Multiple tasks can be mapped to one processor
■  Data locality is an explicit part of the model
■  Channels can model control and data dependencies

Page 46: OpenHPI - Parallel Programming Concepts - Week 5

Programming With Channels

■  Channel-only parallel programs have advantages
   □  Performance optimization does not influence semantics
      ◊  Example: shared-memory channels for some parts
   □  Task mapping does not influence semantics
      ◊  Align the number of tasks with the problem, not with the execution environment
      ◊  Improves scalability of the implementation
   □  Modular design with well-defined interfaces
■  Communication should be balanced between tasks
■  Each task should only communicate with a small group of neighbors

Page 47: OpenHPI - Parallel Programming Concepts - Week 5

Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.5: Programming with Actors

Dr. Peter Tröger + Teaching Team

Page 48: OpenHPI - Parallel Programming Concepts - Week 5

Actor Model

■  Carl Hewitt, Peter Bishop and Richard Steiger: “A Universal Modular Actor Formalism for Artificial Intelligence”, IJCAI 1973
   □  Mathematical model for concurrent computation
   □  Actor as computational primitive
      ◊  Makes local decisions, concurrently sends / receives messages
      ◊  Has a mailbox for incoming messages
      ◊  Concurrently creates more actors
   □  Asynchronous one-way message sending
   □  Changing topology allowed, typically no order guarantees
      ◊  Recipient is identified by mailing address
      ◊  Actors can send their own identity to other actors
■  Available as programming language extension or library in many environments

Page 49: OpenHPI - Parallel Programming Concepts - Week 5

Erlang – Ericsson Language

■  Functional language with actor support
■  Designed for large-scale concurrency
   □  First version in 1986 by Joe Armstrong, Ericsson Labs
   □  Available as open source since 1998
■  Language goals driven by Ericsson product development
   □  Scalable distributed execution of phone call handling software with a large number of concurrent activities
   □  Fault-tolerant operation under timing constraints
   □  Online software updates
■  Users
   □  Amazon EC2 SimpleDB, Delicious, Facebook chat, T-Mobile SMS and authentication, Motorola call processing, Ericsson GPRS and 3G mobile network products, CouchDB, EJabberD, …

Page 50: OpenHPI - Parallel Programming Concepts - Week 5

Concurrency in Erlang

■  Concurrency Oriented Programming
   □  Actor processes are completely independent (shared nothing)
   □  Synchronization and data exchange with message passing
   □  Each actor process has an unforgeable name
   □  If you know the name, you can send a message
   □  Default approach is fire-and-forget
   □  You can monitor remote actor processes
■  Using this gives you …
   □  Opportunity for massive parallelism
   □  No additional penalty for distribution, despite latency issues
   □  Easier fault tolerance capabilities
   □  Concurrency by default

Page 51: OpenHPI - Parallel Programming Concepts - Week 5

Actors in Erlang

■  Communication via message passing is part of the language
■  Send never fails, works asynchronously (PID ! Message)
■  Actors have mailbox functionality
   □  Queue of received messages, selective fetching
   □  Only messages from the same source arrive in order
   □  receive statement with a set of clauses, pattern matching
   □  Process is suspended in the receive operation until a match occurs

receive
    Pattern1 when Guard1 -> expr1, expr2, ..., expr_n;
    Pattern2 when Guard2 -> expr1, expr2, ..., expr_n;
    Other -> expr1, expr2, ..., expr_n
end

Page 52: OpenHPI - Parallel Programming Concepts - Week 5

Erlang Example: Ping Pong Actors

-module(tut15).
%% Functions exported + #args
-export([test/0, ping/2, pong/0]).

%% Ping actor, sending messages to Pong
ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("Ping finished~n", []);
ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
    %% Blocking recursive receive, scanning the mailbox
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_PID).

%% Pong actor
pong() ->
    %% Blocking recursive receive, scanning the mailbox
    receive
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            %% Sending message to Ping
            Ping_PID ! pong,
            pong()
    end.

%% Start Ping and Pong actors
test() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).

[erlang.org]

Page 53: OpenHPI - Parallel Programming Concepts - Week 5

Actors in Scala

■  Actor-based concurrency in Scala, similar to Erlang
■  Concurrency abstraction on top of threads or processes
■  Communication by non-blocking send operation and blocking receive operation with matching functionality

actor {
  var sum = 0
  loop {
    receive {
      case Data(bytes) => sum += hash(bytes)
      case GetSum(requester) => requester ! sum
    }
  }
}

■  All constructs are library functions (actor, loop, receive, !)
■  Alternative self.receiveWithin() call with timeout
■  Case classes act as message type representation

Page 54: OpenHPI - Parallel Programming Concepts - Week 5

Scala Example: Counter Actor

import scala.actors.Actor
import scala.actors.Actor._

// Case classes, acting as message types
case class Inc(amount: Int)
case class Value

// Implementation of the counter actor
class Counter extends Actor {
  var counter: Int = 0;

  def act() = {
    while (true) {
      // Blocking receive loop, scanning the mailbox
      receive {
        case Inc(amount) => counter += amount
        case Value =>
          println("Value is " + counter)
          exit()
      }
    }
  }
}

object ActorTest extends Application {
  val counter = new Counter
  counter.start()                    // Start the counter actor
  for (i <- 0 until 100000) {
    counter ! Inc(1)                 // Send an Inc message to the counter actor
  }
  counter ! Value                    // Send a Value message to the counter actor
  // Output: Value is 100000
}

Page 55: OpenHPI - Parallel Programming Concepts - Week 5

Actor Deadlocks

■  Synchronous send operator "!?" available in Scala
   □  Sends a message and blocks in receive afterwards
   □  Intended for the request-response pattern
■  Original asynchronous send makes deadlocks less probable

[http://savanne.be/articles/concurrency-in-erlang-scala/]

Deadlock with synchronous sends:

// actorA
actorB !? Msg1(value) match {
  case Response1(r) => // …
}
receive {
  case Msg2(value) => reply(Response2(value))
}

// actorB
actorA !? Msg2(value) match {
  case Response2(r) => // …
}
receive {
  case Msg1(value) => reply(Response1(value))
}

Deadlock-free with asynchronous sends:

// actorA
actorB ! Msg1(value)
while (true) {
  receive {
    case Msg2(value) => reply(Response2(value))
    case Response1(r) => // ...
  }
}

// actorB
actorA ! Msg2(value)
while (true) {
  receive {
    case Msg1(value) => reply(Response1(value))
    case Response2(r) => // ...
  }
}

Page 56: OpenHPI - Parallel Programming Concepts - Week 5

Parallel Programming Concepts OpenHPI Course Week 5 : Distributed Memory Parallelism Unit 5.6: Programming with MapReduce

Dr. Peter Tröger + Teaching Team

Page 57: OpenHPI - Parallel Programming Concepts - Week 5

MapReduce

■  Programming model for parallel processing of large data sets
   □  Inspired by map() and reduce() in functional programming
   □  Intended for best scalability in data parallelism
■  Huge interest started with the Google Research publication
   □  Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
   □  Google products rely on an internal implementation
■  Apache Hadoop: widely known open source implementation
   □  Scales to thousands of nodes
   □  Has been shown to process petabytes of data
   □  Cluster infrastructure with custom file system (HDFS)
■  Parallel programming on a very high abstraction level

Page 58: OpenHPI - Parallel Programming Concepts - Week 5

MapReduce Concept

■  Map step
   □  Convert input tuples [key, value] with a map() function into one or multiple intermediate tuples [key2, value2] per input
■  Shuffle step: collect all intermediate tuples with the same key
■  Reduce step
   □  Combine all intermediate tuples with the same key by some reduce() function into one result per key
■  Developer just defines stateless map() and reduce() functions
■  Framework automatically ensures parallelization
■  Persistence layer needed for input and output only

[Figure: MapReduce map, shuffle, and reduce steps – developers.google.com]

Page 59: OpenHPI - Parallel Programming Concepts - Week 5

Example: Character Counting

Page 60: OpenHPI - Parallel Programming Concepts - Week 5

Java Example: Hadoop Word Count

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  ...
}

[hadoop.apache.org]

Page 61: OpenHPI - Parallel Programming Concepts - Week 5

MapReduce Data Flow

[Figure: MapReduce data flow – developer.yahoo.com]

Page 62: OpenHPI - Parallel Programming Concepts - Week 5

Advantages

■  Developer never implements communication or synchronization, implicitly done by the framework
   □  Allows transparent fault tolerance and optimization
■  Running map and reduce tasks are stateless
   □  Only rely on their input, produce their own output
   □  Repeated execution in case of failing nodes
   □  Redundant execution for compensating nodes with different performance characteristics
■  Scale-out only limited by
   □  Distributed file system performance (input / output data)
   □  Shuffle step communication performance
■  Chaining of map/reduce tasks is very common in practice
■  But: demands an embarrassingly parallel problem

Page 63: OpenHPI - Parallel Programming Concepts - Week 5

Summary: Week 5

■  “Shared nothing” systems provide very good scalability
   □  Adding new processing elements is not limited by “walls”
   □  Different options for interconnect technology
■  Task granularity is essential
   □  Surface-to-volume effect
   □  Task mapping problem
■  De-facto standard is MPI programming
■  High-level abstractions with
   □  Channels
   □  Actors
   □  MapReduce

“What steps / strategy would you apply to parallelize a given compute-intensive program?”