Transcript
Page 1: Attacking the programming model wall

Attacking the programming model wall

Marco Danelutto
Dept. Computer Science, Univ. of Pisa
Belfast, February 28th 2013

Page 2: Attacking the programming model wall

Setting the scenario (HW)

Market pressure
Hw advances
Power wall

Page 3: Attacking the programming model wall

Market pressure

Moore law
• From components
• To cores

New needs
• Gesture/voice interaction
• 3D graphics

Supercomputing
• New applications
• Larger data sizes

Page 4: Attacking the programming model wall

Multicores: Moore law from components to cores. Simpler cores, shared memory, cache coherent, full interconnect.

Name       Cores  Contexts  Sockets  Cores x board
AMD 6176   12     1         4        48
E5-4650    8      2         4        64
SPARC T4   8      8         4        512

Page 5: Attacking the programming model wall

Manycores: even simpler cores, shared memory, cache coherent, regular interconnection, co-processors (via PCIe). Options for cache coherence, more complex inter core communication protocols.

Name         Core   Cores  Contexts  Interconnect  Mem controllers
TileraPro64  VLIW   64     64        Mesh          4
Intel PHI    IA-64  60     240       Ring          8 (2 way)

Page 6: Attacking the programming model wall

GPUs: ALUs + instruction sequencers, large and fast memory access, co-processors (via PCIe). Data parallel computations only.

Name          Cores  Mem interface  Mem bandwidth
nVidia C2075  448    384 bit        144 GB/s
nVidia K20X   2688   384 bit        250 GB/s

Page 7: Attacking the programming model wall

FPGA: low scale manufacturing, accelerators (mission critical sw), GP computing (PCIe co-processors, CPU socket replacement), possibly hosting GP CPU/cores. Non-standard programming tools.

Name      Cells      Block RAM  Mem bandwidth
Artix 7   215,000    13 Mb      1,066 MB/s
Virtex 7  2,000,000  68 Mb      1,866 MB/s

Page 8: Attacking the programming model wall

Power wall

Power cost > hw cost
Thermal dissipation cost
FLOP/Watt is "the must"

Page 9: Attacking the programming model wall

Power wall (2)

Reducing idle costs
◦ E4 CARMA CLUSTER: ARM + nVIDIA, spare Watt → GPU

Reducing the cooling costs
◦ Eurotech AURORA TIGON: Intel technology, water cooling, spare Watt → CPU

Page 10: Attacking the programming model wall

Setting the scenario (SW)

Close to metal programming models
SIMD/MIMD abstraction programming models
High level, high productivity programming models

Page 11: Attacking the programming model wall

Programming models

Low abstraction level
Pros
◦ Performance / efficiency
◦ Heterogeneous hw targeting
Cons
◦ Huge application programmer responsibilities
◦ Portability (functional, performance)
◦ Quantitative parallelism exploitation

High abstraction level
Pros
◦ Expressive power
◦ Separation of concerns
◦ Qualitative parallelism exploitation
Cons
◦ Performance / efficiency
◦ Hw targeting

Page 12: Attacking the programming model wall

Separation of concerns

Functional
◦ What has to be computed
◦ Function from input data to output data
◦ Domain specific
◦ Application dependent

Non functional
◦ How the result is computed
◦ Parallelism, power management, security, fault tolerance, …
◦ Target hw specific
◦ Factorizable

Page 13: Attacking the programming model wall

Supported programming paradigms

Current programming frameworks (along an expressive power axis): OpenCL, OpenMP, MPI, CILK, TBB

Page 14: Attacking the programming model wall

Urgencies

Market pressure + HW advances + low level programming models → need for:
a) Parallel programming models
b) Parallel programmers

Page 15: Attacking the programming model wall

Structured parallel programming

Algorithmic skeletons
◦ From the HPC community
◦ Started in the early '90s (M. Cole's PhD thesis)
◦ Pre-defined parallel patterns, exposed to programmers as programming constructs/library calls

Parallel design patterns
◦ From the SW engineering community
◦ Started in the early '00s
◦ "Recipes" to handle parallelism (name, problem, forces, solutions, …)

Page 16: Attacking the programming model wall

Similarities

◦ Compiling tools + run time systems
◦ Clear programming model
◦ High level abstractions

Page 17: Attacking the programming model wall

Algorithmic skeletons

◦ Common, parametric, reusable parallelism exploitation patterns (from the HPC community)
◦ Exposed to programmers as constructs, library calls, objects, higher order functions, components, …
◦ Composable. Two tier model: "stream parallel" skeletons with inner "data parallel" skeletons

Page 18: Attacking the programming model wall

Sample classical skeletons

Stream parallel
◦ Parallel computation of different items from an input stream
◦ Task farm (master/worker), pipeline

Data parallel
◦ Parallel computation on (possibly overlapped) partitions of the same input data
◦ Map, stencil, reduce, scan, mapreduce (map sketched below)
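To fix intuition, here is a minimal hand-rolled map sketch in plain C++ (my own illustration, not a skeleton framework API; parallel_map is a hypothetical name): every worker applies the same function to its own partition of the input data, which is exactly what a map skeleton automates.

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Minimal map sketch: apply f in place, one worker per partition.
    template <typename T, typename F>
    void parallel_map(std::vector<T>& data, F f, unsigned nw) {
        std::vector<std::thread> workers;
        std::size_t chunk = (data.size() + nw - 1) / nw;  // partition size
        for (unsigned k = 0; k < nw; ++k) {
            std::size_t lo = k * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            workers.emplace_back([&data, &f, lo, hi] {
                for (std::size_t i = lo; i < hi; ++i) data[i] = f(data[i]);
            });
        }
        for (auto& t : workers) t.join();  // implicit barrier: map done
    }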

Page 19: Attacking the programming model wall

Evolution of the concept

'90s: complex patterns, no composition; targeting clusters; mostly libraries (RTS)
'00s: simple data/stream parallel patterns; composable, targeting COW/NOW; libraries + first compilers
'10s: optimized, composable building blocks; targeting clusters of heterogeneous multicores; quite complex tool chain (compiler + RTS)

Page 20: Attacking the programming model wall

Evolution of the concept (2)

'90s: Cole's PhD thesis skeletons; P3L (Pisa); SCL (Imperial College London)
'00s: Lithium/Muskel (Pisa), Skandium (INRIA); Muesli (Muenster), SkeTo (Tokyo); OSL (Orleans), Mallba (La Laguna)
'10s: SkePU (Linkoping); FastFlow (Pisa/Torino); TBB? (Intel), TPL? (Microsoft)

Page 21: Attacking the programming model wall

Implementing skeletons

Template based
◦ Skeleton implemented by instantiating a "concurrent activity graph template"
◦ Performance models used to instantiate quantitative parameters
◦ P3L, Muesli, SkeTo, FastFlow

Macro data flow based
◦ Skeleton program compiled to macro data flow graphs
◦ Rewriting/refactoring compiling process
◦ Parallel MDF graph interpreter (sketched below)
◦ Muskel, Skipper, Skandium
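To illustrate the macro data flow side, a sketch of the general idea (not the actual Muskel or Skandium runtime): instructions carry a count of missing input tokens; when the count reaches zero they become fireable and an interpreter executes them, delivering output tokens to their successors.

    #include <functional>
    #include <queue>
    #include <vector>

    // Sketch of a macro data flow (MDF) instruction.
    struct MDFInstr {
        int missing;                     // input tokens still to arrive
        std::function<void()> body;      // coarse grain computation
        std::vector<MDFInstr*> succ;     // consumers of the output token
    };

    // Sequential interpreter loop; a real RTS runs several of these in
    // parallel on a shared fireable queue (with atomic token counters).
    void interpret(std::queue<MDFInstr*>& fireable) {
        while (!fireable.empty()) {
            MDFInstr* ins = fireable.front(); fireable.pop();
            ins->body();                          // execute the instruction
            for (MDFInstr* s : ins->succ)         // deliver output tokens
                if (--s->missing == 0) fireable.push(s);
        }
    }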

Page 22: Attacking the programming model wall

Refactoring skeletons

Formally proven rewriting rules:

Farm(Δ) = Δ
Pipe(Δ1, Δ2) = SeqComp(Δ1, Δ2)
Pipe(Map(Δ1), Map(Δ2)) = Map(Pipe(Δ1, Δ2))

Page 23: Attacking the programming model wall

Sample refactoring: normal form

Pipe(Farm(Δ1), Δ2)
• Service time = max_{i=1,2} { stage_i }
• Nw = nw(farm) + 1

Pipe(Δ1, Δ2)
• Higher service time
• Nw = 2

SeqComp(Δ1, Δ2)
• Sequential service time
• Nw = 1

Farm(SeqComp(Δ1, Δ2))
• Service time < original
• With fewer resources (normal form)

Page 24: Attacking the programming model wall

Performance modelling

Page 25: Attacking the programming model wall

Sample performance models

Pipeline service time: max_{i=1..k} { serviceTime(Stage_i) }
Pipeline latency: Σ_{i=1..k} { serviceTime(Stage_i) }
Farm service time: max { taskSchedTime, resGathTime, workerTime / #workers }
Map latency: partitionTime + workerTime + gatherTime
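These models are directly computable. A small sketch (times all in the same unit; the example numbers in the comment are hypothetical) that evaluates the farm model and derives the smallest worker count at which the workers stop being the bottleneck:

    #include <algorithm>
    #include <cmath>

    // Farm service time: slowest among scheduling, gathering and the
    // per-worker time divided by the number of workers.
    double farmServiceTime(double sched, double gather, double worker, int nw) {
        return std::max({sched, gather, worker / nw});
    }

    // Smallest nw such that worker/nw no longer dominates the maximum.
    int workersNeeded(double sched, double gather, double worker) {
        return (int)std::ceil(worker / std::max(sched, gather));
    }
    // e.g. sched = gather = 1, worker = 8  =>  workersNeeded(...) == 8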

Page 26: Attacking the programming model wall

Key strengths

◦ Full parallel structure of the application exposed to the skeleton framework: exploited by optimizations and by support for autonomic non functional concern management
◦ Framework responsibility for architecture targeting: write once, run everywhere code, with architecture specific compiler and back end (run time) tools
◦ Only functional debugging required from application programmers

Page 27: Attacking the programming model wall

Ideally

◦ Expressive power reduces time to deploy
◦ The exposed parallel structure guarantees performance

Page 28: Attacking the programming model wall

Assessments

Separation of concerns
• Application programmer: WHAT
• System programmer: HOW

Inversion of control
• Structure suggested
• Interpreted by tools

Performance
• Close to hand coded programs
• At a fraction of the development time

Page 29: Attacking the programming model wall

Parallel design patterns

Carefully describe a parallelism exploitation pattern, including
◦ Applicability
◦ Forces
◦ Possible implementations/problem solutions
As text, at different levels of abstraction

Page 30: Attacking the programming model wall

Pattern spaces

◦ Finding concurrency
◦ Algorithm space
◦ Supporting structure
◦ Implementation mechanism

Page 31: Attacking the programming model wall

Patterns

Collapsed in algorithmic skeletons:
◦ Application programmer → concurrency and algorithm spaces
◦ Skeleton implementation (system programmer) → supporting structures and implementation mechanisms

Page 32: Attacking the programming model wall

Structured parallel programming: design patterns

Problem → design patterns (follow, learn, use) → programming tools → low level code

Page 33: Attacking the programming model wall

Structured parallel programming: skeletons

Problem → skeleton library (instantiate, compose) → high level code

Page 34: Attacking the programming model wall

Structured parallel programming

Problem → design patterns (use knowledge to instantiate, compose) → skeletons → high level code

Page 35: Attacking the programming model wall

Working unstructured

Tradeoffs
◦ CPU/GPU threads
◦ Processes/threads
◦ Coarse/fine grain tasks

Target architecture dependent decisions on the concurrent activity set: threads/processes, synchronization, memory.

Page 36: Attacking the programming model wall

Threads/processes

Creation
◦ Thread pool vs. on-the-fly creation
Pinning
◦ Operating system dependent effectiveness (see the sketch below)
Memory management
◦ Embarrassingly parallel patterns may benefit from process memory space separation (see the Memory slide, next)
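As an example of pinning, a Linux-specific sketch using the standard pthread affinity call (error handling omitted; on other operating systems the mechanism, and its effectiveness, differ):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one core (Linux, glibc).
    int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }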

Page 37: Attacking the programming model wall

Memory

Cache friendly algorithms
◦ Minimization of cache coherency traffic
◦ Data alignment/padding (see the sketch below)
Memory wall
◦ 1-2 memory interfaces per 4-8 cores
◦ 4-8 memory interfaces per 60-64 cores (+ internal routing)
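A typical alignment/padding fix, sketched in plain C++ (the 64-byte line size is an assumption that holds on most current x86 parts): giving each worker's counter its own cache line removes false sharing and the coherency traffic it causes.

    #include <cstddef>

    constexpr std::size_t CACHE_LINE = 64;   // assumed line size

    // One counter per worker, each padded to a full cache line:
    // concurrent updates no longer invalidate each other's line.
    struct alignas(CACHE_LINE) PaddedCounter {
        long value = 0;
    };

    PaddedCounter hits[16];                  // e.g. one slot per worker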

Page 38: Attacking the programming model wall

Synchronization

High level, general purpose mechanisms
◦ Passive wait
◦ High latency
Low level mechanisms
◦ Active wait
◦ Smaller latency
Eventually
◦ Synchronization on memory (fences)
(both options sketched below)
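The two ends of the spectrum, sketched with standard C++ primitives (an illustration, not the internals of any specific framework): the condition variable blocks the thread in the OS (passive wait, wakeup latency in the microseconds), while the spin loop keeps the core busy but reacts almost immediately (active wait).

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    std::atomic<bool> ready{false};
    std::mutex m;
    std::condition_variable cv;

    void active_wait() {                 // low latency, burns a core
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    }

    void passive_wait() {                // higher latency, frees the core
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return ready.load(); });
    }

    void signal() {                      // producer side
        { std::lock_guard<std::mutex> lk(m); ready = true; }
        cv.notify_all();
    }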

Page 39: Attacking the programming model wall

Devising parallelism degree

Ideally
◦ As many parallel activities as necessary to sustain the input data rate
Base measures
◦ Estimated input pressure, task processing time, communication overhead
Compile vs. run time choices
◦ Try to devise statically some optimal values (see the sketch below)
◦ Adjust initial settings dynamically based on observations
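The static part of the choice reduces to simple arithmetic; a sketch (times in the same unit, example values hypothetical):

    #include <cmath>

    // Workers needed so that taskTime / nw <= interArrival, i.e. the
    // stage keeps up with the input data rate.
    int parallelismDegree(double taskTime, double interArrival) {
        return (int)std::ceil(taskTime / interArrival);
    }
    // e.g. a task every 2 ms, 9 ms to process one => 5 workers;
    // the run time then adjusts this initial value from observations.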

Page 40: Attacking the programming model wall

NUMA memory exploitation

Auto scheduling
◦ Idle workers request tasks from a "global" queue
◦ Far nodes request fewer tasks than near ones
Affinity scheduling
◦ Tasks scheduled on the producing cores
◦ Round robin allocation of dynamically allocated chunks
(both policies sketched below)
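A sketch of the two policies combined (my own illustration, coarsely locked for brevity): tasks are queued on the core that produced their data; an idle worker drains its own queue first (affinity) and only then takes work from the other queues (auto scheduling fallback).

    #include <deque>
    #include <mutex>
    #include <vector>

    struct Task { /* … */ };

    struct NumaScheduler {
        std::vector<std::deque<Task>> local;   // one queue per core
        std::mutex mtx;                        // single lock: sketch only

        explicit NumaScheduler(int ncores) : local(ncores) {}

        void put(int core, Task t) {           // affinity: enqueue where
            std::lock_guard<std::mutex> lk(mtx);
            local[core].push_back(t);          // the data was produced
        }

        bool get(int core, Task& t) {
            std::lock_guard<std::mutex> lk(mtx);
            if (!local[core].empty()) {        // near (local) work first
                t = local[core].front(); local[core].pop_front();
                return true;
            }
            for (auto& q : local)              // otherwise take far work
                if (!q.empty()) { t = q.back(); q.pop_back(); return true; }
            return false;
        }
    };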

Page 41: Attacking the programming model wall

More separation of concerns

Algorithmic code: computing the application results out of the input data.

Non functional code: programming performance, security, fault tolerance, power management.

Page 42: Attacking the programming model wall

Behavioural skeletons

The structured parallel algorithm code (in the diagram, a skeleton tree of Pipe, Map and Seq nodes) exposes sensors & actuators; an NFC (non functional concern) autonomic manager reads them, driven by an ECA rule based program.

◦ Autonomic manager: executes a MAPE loop. At each iteration, an ECA (Event Condition Action) rule system is executed using monitored values, possibly operating actions on the structured parallel pattern.
◦ Sensors: determine what can be perceived of the computation.
◦ Actuators: determine what can be affected/changed in the computation.

Page 43: Attacking the programming model wall

Sample rules

Rule 1
◦ Event: inter arrival time changes
◦ Condition: faster than service time
◦ Action: increase the parallelism degree

Rule 2
◦ Event: fault at worker
◦ Condition: service time low
◦ Action: recruit a new worker resource

(a coded sketch of such rules follows)
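A minimal sketch of how such rules could be coded (illustrative only; actual behavioural skeleton frameworks express rules declaratively, and every name here is hypothetical): the manager's MAPE loop evaluates each rule's event and condition against the monitored values and fires the action through an actuator.

    #include <functional>
    #include <vector>

    struct Monitored {                 // values read from the sensors
        double interArrival, serviceTime;
        bool workerFault;
    };

    struct Rule {                      // one ECA rule
        std::function<bool(const Monitored&)> event, condition;
        std::function<void()> action;  // drives an actuator
    };

    // One MAPE iteration: analyze the monitored values, then execute
    // the actions of the rules whose event and condition both hold.
    void mapeIteration(const Monitored& m, std::vector<Rule>& rules) {
        for (auto& r : rules)
            if (r.event(m) && r.condition(m)) r.action();
    }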

Page 44: Attacking the programming model wall

Behavioural skeletons: assessments

Page 45: Attacking the programming model wall

Yes, nice, but then what?

We already have MPI, OpenMP, CUDA, OpenCL …

Page 46: Attacking the programming model wall

FastFlow

Full C++, skeleton based, streaming parallel processing framework.

◦ Streaming network patterns: pipeline, farm — composable, customizable
◦ Arbitrary streaming networks: lock free SPMC, MPSC & MPMC queues
◦ Simple streaming networks: lock free SPSC queue, general threading model

http://mc-fastflow.sourceforge.net

Page 47: Attacking the programming model wall

Bring skeletons to your desk

◦ Full POSIX/C++ compliance: g++, make, gprof, gdb, pthread, …
◦ Reuse existing code, via proper wrappers
◦ Run from laptops to clusters & clouds, with the same skeleton structure

Page 48: Attacking the programming model wall

Basic abstraction: ff_node

    class RedEye: public ff_node {
      …
      int svc_init() { … }
      void svc_end() { … }
      void *svc(void *task) {
        Image *in = (Image *)task;
        Image *out = …;
        return (void *)out;
      }
      …
    };

An ff_node reads a task from its input channel, applies svc to it, and writes the result to its output channel.

Page 49: Attacking the programming model wall

Basic stream parallel skeletons

Farm(Worker, Nw)
◦ Embarrassingly parallel computations on streams
◦ Computes Worker in parallel (Nw copies)
◦ Implemented as emitter + string of workers + collector

Pipeline(Stage1, …, StageN)
◦ StageK processes the output of Stage(K-1) and delivers to Stage(K+1)

Feedback(Skel, Cond)
◦ Routes results from Skel back to the input or forward to the output depending on Cond

Page 50: Attacking the programming model wall

Setting up a pipeline

    ff_pipeline myImageProcessingPipe;

    ff_node *startNode = new Reader(…);
    ff_node *redEye    = new RedEye();
    ff_node *light     = new LightCalibration();
    ff_node *sharpen   = new Sharpen();
    ff_node *endNode   = new Writer(…);

    myImageProcessingPipe.addStage(startNode);
    myImageProcessingPipe.addStage(redEye);
    myImageProcessingPipe.addStage(light);
    myImageProcessingPipe.addStage(sharpen);
    myImageProcessingPipe.addStage(endNode);

    myImageProcessingPipe.run_and_wait_end();

Page 51: Attacking the programming model wall

Refactoring (farm introduction)

    ff_farm<> thirdStage;
    std::vector<ff_node *> w;
    for(int i = 0; i < nworkers; ++i)
      w.push_back(new Sharpen());
    thirdStage.add_workers(w);
    …
    // myImageProcessingPipe.addStage(sharpen);   // sequential stage,
    myImageProcessingPipe.addStage(thirdStage);   // replaced by the farm
    …

Page 52: Attacking the programming model wall

Refactoring (map introduction)

    ff_farm<> thirdStage;
    std::vector<ff_node *> w;
    for(int i = 0; i < nworkers; ++i)
      w.push_back(new Sharpen());
    thirdStage.add_workers(w);
    Emitter em;    // scatters data to the workers
    Collector co;  // collects results from the workers
    thirdStage.add_emitter(em);
    thirdStage.add_collector(co);
    …
    myImageProcessingPipe.addStage(thirdStage);
    …

Page 53: Attacking the programming model wall

FastFlow accelerator

1. Create a suitable skeleton accelerator
2. Offload tasks from the main (sequential) business logic code
3. The accelerator exploits the "spare cores" on your machine

Page 54: Attacking the programming model wall

FastFlow accelerator

    ff_farm<> farm(true);      // create the accelerator
    std::vector<ff_node *> w;
    for(int i = 0; i < nworkers; ++i)
      w.push_back(new Worker);
    farm.add_workers(w);
    farm.add_collector(new Collector);
    farm.run_then_freeze();

    while(…) {
      …
      farm.offload(x);         // offload tasks
      …
    }
    …
    while(farm.load_result(&res)) {
      …                        // eventually process results
    }

Page 55: Attacking the programming model wall

0.5 msec tasks

Page 56: Attacking the programming model wall

5 msec tasks

Page 57: Attacking the programming model wall

50 msec tasks

Page 58: Attacking the programming model wall

Scalability (vs. PLASMA)

Page 59: Attacking the programming model wall

Farm vs. Macro Data Flow

Page 60: Attacking the programming model wall

FastFlow evolution

Data parallelism
◦ Generalized map
◦ Reduce

Distributed version
◦ Transport level channels
◦ Distributed orchestration of FF applications

Heterogeneous hw targeting
◦ Simple map & reduce offloading
◦ Fair scheduling to CPU/GPU cores

Page 61: Attacking the programming model wall

Distributed version

Page 62: Attacking the programming model wall

Cloud offloading

◦ Application runs on a COW, not enough throughput
◦ Monitored on a single cloud resource
◦ Performance model: number of necessary cloud resources
◦ Distributed version (local + cloud) with the required throughput

Page 63: Attacking the programming model wall

Cloud offloading (2)

Page 64: Attacking the programming model wall

GPU offloading

Performance modelling of the percentage of tasks to offload (in a map or in a reduce).

Page 65: Attacking the programming model wall

GPU map offloading

Page 66: Attacking the programming model wall

Moving to MIC

◦ FastFlow ported to Tilera Pro64 (paper this afternoon)
◦ With minimal intervention to ensure functional portability
◦ And some more changes to support better hw optimizations

Page 67: Attacking the programming model wall

FastFlow on Tilera Pro64

Page 68: Attacking the programming model wall

Conclusions

Page 69: Attacking the programming model wall

Thanks to Marco Aldinucci, Massimo Torquati, Peter Kilpatrick, Sonia Campa, Giorgio Zoppi, Daniele Buono, Silvia Lametti, Tudor Serban

Any questions?

[email protected]