Transcript
Page 1: Attacking the programming model wall

Attacking the programming model wall

Marco Danelutto
Dept. Computer Science, Univ. of Pisa
Belfast, February 28th 2013

Page 2: Attacking the programming model wall

Setting the scenario (HW)

Market pressure
Hw advances
Power wall

Page 3: Attacking the programming model wall

Market pressure

Moore law
• From components
• To cores

New needs
• Gesture/voice interaction
• 3D graphics

Supercomputing
• New applications
• Larger data sizes

Page 4: Attacking the programming model wall

Multicores: Moore law from components to cores. Simpler cores, shared memory, cache coherent, full interconnect.

Name       Cores  Contexts  Sockets  Cores x board
AMD 6176   12     1         4        48
E5-4650    8      2         4        64
SPARC T4   8      8         4        512

Page 5: Attacking the programming model wall

Manycores: even simpler cores, shared memory, cache coherent, regular interconnection, co-processors (via PCIe). Options for cache coherence, more complex inter core communication protocols.

Name         Core   Cores  Contexts  Interconnect  Mem controllers
TileraPro64  VLIW   64     64        Mesh          4
Intel PHI    IA-64  60     240       Ring          8 (2 way)

Page 6: Attacking the programming model wall

GPUs: ALUs + instruction sequencers, large and fast memory access, co-processors (via PCIe). Data parallel computations only.

Name          Cores  Mem interface  Mem bandwidth
nVidia C2075  448    384 bit        144 GB/s
nVidia K20X   2688   384 bit        250 GB/s

Page 7: Attacking the programming model wall

FPGA: low scale manufacturing, accelerators (mission critical sw), GP computing (PCIe co-processors, CPU socket replacement), possibly hosting GP CPU/cores. Non-standard programming tools.

Name      Cells      Block RAM  Mem bandwidth
Artix 7   215,000    13 Mb      1,066 MB/s
Virtex 7  2,000,000  68 Mb      1,866 MB/s

Page 8: Attacking the programming model wall

Power wall

Power cost > hw cost
Thermal dissipation cost
FLOP/Watt is "the must"

Page 9: Attacking the programming model wall

Power wall (2)

Reducing idle costs
◦ E4 CARMA CLUSTER: ARM + nVIDIA, spare Watt → GPU

Reducing the cooling costs
◦ Eurotech AURORA TIGON: Intel technology, water cooling, spare Watt → CPU

Page 10: Attacking the programming model wall

Setting the scenario (SW)

Close to metal programming models
SIMD/MIMD abstraction programming models
High level, high productivity programming models

Page 11: Attacking the programming model wall

Programming models

Low abstraction level
Pros
◦ Performance / efficiency
◦ Heterogeneous hw targeting
Cons
◦ Huge application programmer responsibilities
◦ Portability (functional, performance)
◦ Quantitative parallelism exploitation

High abstraction level
Pros
◦ Expressive power
◦ Separation of concerns
◦ Qualitative parallelism exploitation
Cons
◦ Performance / efficiency
◦ Hw targeting

Page 12: Attacking the programming model wall

Separation of concerns

Functional
◦ What has to be computed
◦ Function from input data to output data
◦ Domain specific
◦ Application dependent

Non functional
◦ How the result is computed
◦ Parallelism, power management, security, fault tolerance, …
◦ Target hw specific
◦ Factorizable

Page 13: Attacking the programming model wall

Supported programming paradigms

Current programming frameworks (along an expressive power axis): OpenCL, OpenMP, MPI, CILK, TBB

Page 14: Attacking the programming model wall

Urgencies

Market pressure + HW advances + low level programming models → need for:
a) Parallel programming models
b) Parallel programmers

Page 15: Attacking the programming model wall

Structured parallel programming

Algorithmic skeletons
◦ From the HPC community
◦ Started in the early '90s (M. Cole's PhD thesis)
◦ Pre-defined parallel patterns, exposed to programmers as programming constructs/library calls

Parallel design patterns
◦ From the SW engineering community
◦ Started in the early '00s
◦ "Recipes" to handle parallelism (name, problem, forces, solutions, …)

Page 16: Attacking the programming model wall

Similarities

◦ Compiling tools + run time systems
◦ Clear programming model
◦ High level abstractions

Page 17: Attacking the programming model wall

Algorithmic skeletons

◦ Common, parametric, reusable parallelism exploitation patterns (from the HPC community)
◦ Exposed to programmers as constructs, library calls, objects, higher order functions, components, …
◦ Composable. Two tier model: "stream parallel" skeletons with inner "data parallel" skeletons

Page 18: Attacking the programming model wall

Sample classical skeletons

Stream parallel
◦ Parallel computation of different items from an input stream
◦ Task farm (master/worker), pipeline

Data parallel
◦ Parallel computation on (possibly overlapped) partitions of the same input data
◦ Map, stencil, reduce, scan, mapreduce (map sketched below)
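To fix intuition, here is a minimal hand-rolled map sketch in plain C++ (my own illustration, not a skeleton framework API; parallel_map is a hypothetical name): every worker applies the same function to its own partition of the input data, which is exactly what a map skeleton automates.

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Minimal map sketch: apply f in place, one worker per partition.
    template <typename T, typename F>
    void parallel_map(std::vector<T>& data, F f, unsigned nw) {
        std::vector<std::thread> workers;
        std::size_t chunk = (data.size() + nw - 1) / nw;  // partition size
        for (unsigned k = 0; k < nw; ++k) {
            std::size_t lo = k * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            workers.emplace_back([&data, &f, lo, hi] {
                for (std::size_t i = lo; i < hi; ++i) data[i] = f(data[i]);
            });
        }
        for (auto& t : workers) t.join();  // implicit barrier: map done
    }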

Page 19: Attacking the programming model wall

Evolution of the concept

'90s: complex patterns, no composition; targeting clusters; mostly libraries (RTS)
'00s: simple data/stream parallel patterns; composable, targeting COW/NOW; libraries + first compilers
'10s: optimized, composable building blocks; targeting clusters of heterogeneous multicores; quite complex tool chain (compiler + RTS)

Page 20: Attacking the programming model wall

Evolution of the concept (2)

'90s: Cole's PhD thesis skeletons; P3L (Pisa); SCL (Imperial College London)
'00s: Lithium/Muskel (Pisa), Skandium (INRIA); Muesli (Muenster), SkeTo (Tokyo); OSL (Orleans), Mallba (La Laguna)
'10s: SkePU (Linkoping); FastFlow (Pisa/Torino); TBB? (Intel), TPL? (Microsoft)

Page 21: Attacking the programming model wall

Implementing skeletons

Template based
◦ Skeleton implemented by instantiating a "concurrent activity graph template"
◦ Performance models used to instantiate quantitative parameters
◦ P3L, Muesli, SkeTo, FastFlow

Macro data flow based
◦ Skeleton program compiled to macro data flow graphs
◦ Rewriting/refactoring compiling process
◦ Parallel MDF graph interpreter (sketched below)
◦ Muskel, Skipper, Skandium
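To illustrate the macro data flow side, a sketch of the general idea (not the actual Muskel or Skandium runtime): instructions carry a count of missing input tokens; when the count reaches zero they become fireable and an interpreter executes them, delivering output tokens to their successors.

    #include <functional>
    #include <queue>
    #include <vector>

    // Sketch of a macro data flow (MDF) instruction.
    struct MDFInstr {
        int missing;                     // input tokens still to arrive
        std::function<void()> body;      // coarse grain computation
        std::vector<MDFInstr*> succ;     // consumers of the output token
    };

    // Sequential interpreter loop; a real RTS runs several of these in
    // parallel on a shared fireable queue (with atomic token counters).
    void interpret(std::queue<MDFInstr*>& fireable) {
        while (!fireable.empty()) {
            MDFInstr* ins = fireable.front(); fireable.pop();
            ins->body();                          // execute the instruction
            for (MDFInstr* s : ins->succ)         // deliver output tokens
                if (--s->missing == 0) fireable.push(s);
        }
    }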

Page 22: Attacking the programming model wall

Refactoring skeletons

Formally proven rewriting rules:

Farm(Δ) = Δ
Pipe(Δ1, Δ2) = SeqComp(Δ1, Δ2)
Pipe(Map(Δ1), Map(Δ2)) = Map(Pipe(Δ1, Δ2))

Page 23: Attacking the programming model wall

Sample refactoring: normal form

Pipe(Farm(Δ1), Δ2)
• Service time = max_{i=1,2} { stage_i }
• Nw = nw(farm) + 1

Pipe(Δ1, Δ2)
• Higher service time
• Nw = 2

SeqComp(Δ1, Δ2)
• Sequential service time
• Nw = 1

Farm(SeqComp(Δ1, Δ2))
• Service time < original
• With fewer resources (normal form)

Page 24: Attacking the programming model wall

Performance modelling

Page 25: Attacking the programming model wall

Sample performance models

Pipeline service time: max_{i=1..k} { serviceTime(Stage_i) }
Pipeline latency: Σ_{i=1..k} { serviceTime(Stage_i) }
Farm service time: max { taskSchedTime, resGathTime, workerTime / #workers }
Map latency: partitionTime + workerTime + gatherTime
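These models are directly computable. A small sketch (times all in the same unit; the example numbers in the comment are hypothetical) that evaluates the farm model and derives the smallest worker count at which the workers stop being the bottleneck:

    #include <algorithm>
    #include <cmath>

    // Farm service time: slowest among scheduling, gathering and the
    // per-worker time divided by the number of workers.
    double farmServiceTime(double sched, double gather, double worker, int nw) {
        return std::max({sched, gather, worker / nw});
    }

    // Smallest nw such that worker/nw no longer dominates the maximum.
    int workersNeeded(double sched, double gather, double worker) {
        return (int)std::ceil(worker / std::max(sched, gather));
    }
    // e.g. sched = gather = 1, worker = 8  =>  workersNeeded(...) == 8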

Page 26: Attacking the programming model wall

Key strengths

◦ Full parallel structure of the application exposed to the skeleton framework: exploited by optimizations and by support for autonomic non functional concern management
◦ Framework responsibility for architecture targeting: write once, run everywhere code, with architecture specific compiler and back end (run time) tools
◦ Only functional debugging required from application programmers

Page 27: Attacking the programming model wall

Ideally

◦ Expressive power reduces time to deploy
◦ The exposed parallel structure guarantees performance

Page 28: Attacking the programming model wall

Assessments

Separation of concerns
• Application programmer: WHAT
• System programmer: HOW

Inversion of control
• Structure suggested
• Interpreted by tools

Performance
• Close to hand coded programs
• At a fraction of the development time

Page 29: Attacking the programming model wall

Parallel design patterns

Carefully describe a parallelism exploitation pattern, including
◦ Applicability
◦ Forces
◦ Possible implementations/problem solutions
As text, at different levels of abstraction

Page 30: Attacking the programming model wall

Pattern spaces

◦ Finding concurrency
◦ Algorithm space
◦ Supporting structure
◦ Implementation mechanism

Page 31: Attacking the programming model wall

Patterns

Collapsed in algorithmic skeletons:
◦ Application programmer → concurrency and algorithm spaces
◦ Skeleton implementation (system programmer) → supporting structures and implementation mechanisms

Page 32: Attacking the programming model wall

Structured parallel programming: design patterns

Problem → design patterns (follow, learn, use) → programming tools → low level code

Page 33: Attacking the programming model wall

Structured parallel programming: skeletons

Problem → skeleton library (instantiate, compose) → high level code

Page 34: Attacking the programming model wall

Structured parallel programming

Problem → design patterns (use knowledge to instantiate, compose) → skeletons → high level code

Page 35: Attacking the programming model wall

Working unstructured

Tradeoffs
◦ CPU/GPU threads
◦ Processes/threads
◦ Coarse/fine grain tasks

Target architecture dependent decisions on the concurrent activity set: threads/processes, synchronization, memory.

Page 36: Attacking the programming model wall

Threads/processes

Creation
◦ Thread pool vs. on-the-fly creation
Pinning
◦ Operating system dependent effectiveness (see the sketch below)
Memory management
◦ Embarrassingly parallel patterns may benefit from process memory space separation (see the Memory slide, next)
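As an example of pinning, a Linux-specific sketch using the standard pthread affinity call (error handling omitted; on other operating systems the mechanism, and its effectiveness, differ):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one core (Linux, glibc).
    int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }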

Page 37: Attacking the programming model wall

Memory

Cache friendly algorithms
◦ Minimization of cache coherency traffic
◦ Data alignment/padding (see the sketch below)
Memory wall
◦ 1-2 memory interfaces per 4-8 cores
◦ 4-8 memory interfaces per 60-64 cores (+ internal routing)
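A typical alignment/padding fix, sketched in plain C++ (the 64-byte line size is an assumption that holds on most current x86 parts): giving each worker's counter its own cache line removes false sharing and the coherency traffic it causes.

    #include <cstddef>

    constexpr std::size_t CACHE_LINE = 64;   // assumed line size

    // One counter per worker, each padded to a full cache line:
    // concurrent updates no longer invalidate each other's line.
    struct alignas(CACHE_LINE) PaddedCounter {
        long value = 0;
    };

    PaddedCounter hits[16];                  // e.g. one slot per worker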

Page 38: Attacking the programming model wall

Synchronization

High level, general purpose mechanisms
◦ Passive wait
◦ High latency
Low level mechanisms
◦ Active wait
◦ Smaller latency
Eventually
◦ Synchronization on memory (fences)
(both options sketched below)
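The two ends of the spectrum, sketched with standard C++ primitives (an illustration, not the internals of any specific framework): the condition variable blocks the thread in the OS (passive wait, wakeup latency in the microseconds), while the spin loop keeps the core busy but reacts almost immediately (active wait).

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    std::atomic<bool> ready{false};
    std::mutex m;
    std::condition_variable cv;

    void active_wait() {                 // low latency, burns a core
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    }

    void passive_wait() {                // higher latency, frees the core
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return ready.load(); });
    }

    void signal() {                      // producer side
        { std::lock_guard<std::mutex> lk(m); ready = true; }
        cv.notify_all();
    }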

Page 39: Attacking the programming model wall

Devising parallelism degree

Ideally
◦ As many parallel activities as necessary to sustain the input data rate
Base measures
◦ Estimated input pressure, task processing time, communication overhead
Compile vs. run time choices
◦ Try to devise statically some optimal values (see the sketch below)
◦ Adjust initial settings dynamically based on observations
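The static part of the choice reduces to simple arithmetic; a sketch (times in the same unit, example values hypothetical):

    #include <cmath>

    // Workers needed so that taskTime / nw <= interArrival, i.e. the
    // stage keeps up with the input data rate.
    int parallelismDegree(double taskTime, double interArrival) {
        return (int)std::ceil(taskTime / interArrival);
    }
    // e.g. a task every 2 ms, 9 ms to process one => 5 workers;
    // the run time then adjusts this initial value from observations.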

Page 40: Attacking the programming model wall

NUMA memory exploitation

Auto scheduling
◦ Idle workers request tasks from a "global" queue
◦ Far nodes request fewer tasks than near ones
Affinity scheduling
◦ Tasks scheduled on the producing cores
◦ Round robin allocation of dynamically allocated chunks
(both policies sketched below)
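A sketch of the two policies combined (my own illustration, coarsely locked for brevity): tasks are queued on the core that produced their data; an idle worker drains its own queue first (affinity) and only then takes work from the other queues (auto scheduling fallback).

    #include <deque>
    #include <mutex>
    #include <vector>

    struct Task { /* … */ };

    struct NumaScheduler {
        std::vector<std::deque<Task>> local;   // one queue per core
        std::mutex mtx;                        // single lock: sketch only

        explicit NumaScheduler(int ncores) : local(ncores) {}

        void put(int core, Task t) {           // affinity: enqueue where
            std::lock_guard<std::mutex> lk(mtx);
            local[core].push_back(t);          // the data was produced
        }

        bool get(int core, Task& t) {
            std::lock_guard<std::mutex> lk(mtx);
            if (!local[core].empty()) {        // near (local) work first
                t = local[core].front(); local[core].pop_front();
                return true;
            }
            for (auto& q : local)              // otherwise take far work
                if (!q.empty()) { t = q.back(); q.pop_back(); return true; }
            return false;
        }
    };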

Page 41: Attacking the programming model wall

More separation of concerns

Algorithmic code: computing the application results out of the input data.

Non functional code: programming performance, security, fault tolerance, power management.

Page 42: Attacking the programming model wall

Behavioural skeletons

The structured parallel algorithm code (in the diagram, a skeleton tree of Pipe, Map and Seq nodes) exposes sensors & actuators; an NFC (non functional concern) autonomic manager reads them, driven by an ECA rule based program.

◦ Autonomic manager: executes a MAPE loop. At each iteration, an ECA (Event Condition Action) rule system is executed using monitored values, possibly operating actions on the structured parallel pattern.
◦ Sensors: determine what can be perceived of the computation.
◦ Actuators: determine what can be affected/changed in the computation.

Page 43: Attacking the programming model wall

Sample rules

Rule 1
◦ Event: inter arrival time changes
◦ Condition: faster than service time
◦ Action: increase the parallelism degree

Rule 2
◦ Event: fault at worker
◦ Condition: service time low
◦ Action: recruit a new worker resource

(a coded sketch of such rules follows)
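A minimal sketch of how such rules could be coded (illustrative only; actual behavioural skeleton frameworks express rules declaratively, and every name here is hypothetical): the manager's MAPE loop evaluates each rule's event and condition against the monitored values and fires the action through an actuator.

    #include <functional>
    #include <vector>

    struct Monitored {                 // values read from the sensors
        double interArrival, serviceTime;
        bool workerFault;
    };

    struct Rule {                      // one ECA rule
        std::function<bool(const Monitored&)> event, condition;
        std::function<void()> action;  // drives an actuator
    };

    // One MAPE iteration: analyze the monitored values, then execute
    // the actions of the rules whose event and condition both hold.
    void mapeIteration(const Monitored& m, std::vector<Rule>& rules) {
        for (auto& r : rules)
            if (r.event(m) && r.condition(m)) r.action();
    }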

Page 44: Attacking the programming model wall

Behavioural skeletons: assessments

Page 45: Attacking the programming model wall

Yes, nice, but then what?

We already have MPI, OpenMP, CUDA, OpenCL …

Page 46: Attacking the programming model wall

FastFlow

Full C++, skeleton based, streaming parallel processing framework.

◦ Streaming network patterns: pipeline, farm — composable, customizable
◦ Arbitrary streaming networks: lock free SPMC, MPSC & MPMC queues
◦ Simple streaming networks: lock free SPSC queue, general threading model

http://mc-fastflow.sourceforge.net

Page 47: Attacking the programming model wall

Bring skeletons to your desk

◦ Full POSIX/C++ compliance: g++, make, gprof, gdb, pthread, …
◦ Reuse existing code, via proper wrappers
◦ Run from laptops to clusters & clouds, with the same skeleton structure

Page 48: Attacking the programming model wall

Basic abstraction: ff_node

    class RedEye: public ff_node {
      …
      int svc_init() { … }
      void svc_end() { … }
      void *svc(void *task) {
        Image *in = (Image *)task;
        Image *out = …;
        return (void *)out;
      }
      …
    };

An ff_node reads a task from its input channel, applies svc to it, and writes the result to its output channel.

Page 49: Attacking the programming model wall

Basic stream parallel skeletons

Farm(Worker, Nw)
◦ Embarrassingly parallel computations on streams
◦ Computes Worker in parallel (Nw copies)
◦ Implemented as emitter + string of workers + collector

Pipeline(Stage1, …, StageN)
◦ StageK processes the output of Stage(K-1) and delivers to Stage(K+1)

Feedback(Skel, Cond)
◦ Routes results from Skel back to the input or forward to the output depending on Cond

Page 50: Attacking the programming model wall

Setting up a pipeline

    ff_pipeline myImageProcessingPipe;

    ff_node *startNode = new Reader(…);
    ff_node *redEye    = new RedEye();
    ff_node *light     = new LightCalibration();
    ff_node *sharpen   = new Sharpen();
    ff_node *endNode   = new Writer(…);

    myImageProcessingPipe.addStage(startNode);
    myImageProcessingPipe.addStage(redEye);
    myImageProcessingPipe.addStage(light);
    myImageProcessingPipe.addStage(sharpen);
    myImageProcessingPipe.addStage(endNode);

    myImageProcessingPipe.run_and_wait_end();

Page 51: Attacking the programming model wall

Refactoring (farm introduction)

    ff_farm<> thirdStage;
    std::vector<ff_node *> w;
    for(int i = 0; i < nworkers; ++i)
      w.push_back(new Sharpen());
    thirdStage.add_workers(w);
    …
    // myImageProcessingPipe.addStage(sharpen);   // sequential stage,
    myImageProcessingPipe.addStage(thirdStage);   // replaced by the farm
    …

Page 52: Attacking the programming model wall

Refactoring (map introduction)

    ff_farm<> thirdStage;
    std::vector<ff_node *> w;
    for(int i = 0; i < nworkers; ++i)
      w.push_back(new Sharpen());
    thirdStage.add_workers(w);
    Emitter em;    // scatters data to the workers
    Collector co;  // collects results from the workers
    thirdStage.add_emitter(em);
    thirdStage.add_collector(co);
    …
    myImageProcessingPipe.addStage(thirdStage);
    …

Page 53: Attacking the programming model wall

FastFlow accelerator

1. Create a suitable skeleton accelerator
2. Offload tasks from the main (sequential) business logic code
3. The accelerator exploits the "spare cores" on your machine

Page 54: Attacking the programming model wall

FastFlow accelerator

    ff_farm<> farm(true);      // create the accelerator
    std::vector<ff_node *> w;
    for(int i = 0; i < nworkers; ++i)
      w.push_back(new Worker);
    farm.add_workers(w);
    farm.add_collector(new Collector);
    farm.run_then_freeze();

    while(…) {
      …
      farm.offload(x);         // offload tasks
      …
    }
    …
    while(farm.load_result(&res)) {
      …                        // eventually process results
    }

Page 55: Attacking the programming model wall

0.5 msec tasks

Page 56: Attacking the programming model wall

5 msec tasks

Page 57: Attacking the programming model wall

50 msec tasks

Page 58: Attacking the programming model wall

Scalability (vs. PLASMA)

Page 59: Attacking the programming model wall

Farm vs. Macro Data Flow

Page 60: Attacking the programming model wall

FastFlow evolution

Data parallelism
◦ Generalized map
◦ Reduce

Distributed version
◦ Transport level channels
◦ Distributed orchestration of FF applications

Heterogeneous hw targeting
◦ Simple map & reduce offloading
◦ Fair scheduling to CPU/GPU cores

Page 61: Attacking the programming model wall

Distributed version

Page 62: Attacking the programming model wall

Cloud offloading

◦ Application runs on a COW, not enough throughput
◦ Monitored on a single cloud resource
◦ Performance model: number of necessary cloud resources
◦ Distributed version (local + cloud) with the required throughput

Page 63: Attacking the programming model wall

Cloud offloading (2)

Page 64: Attacking the programming model wall

GPU offloading

Performance modelling of the percentage of tasks to offload (in a map or in a reduce).

Page 65: Attacking the programming model wall

GPU map offloading

Page 66: Attacking the programming model wall

Moving to MIC

◦ FastFlow ported to Tilera Pro64 (paper this afternoon)
◦ With minimal intervention to ensure functional portability
◦ And some more changes to support better hw optimizations

Page 67: Attacking the programming model wall

FastFlow on Tilera Pro64

Page 68: Attacking the programming model wall

Conclusions

Page 69: Attacking the programming model wall

Thanks to Marco Aldinucci, Massimo Torquati, Peter Kilpatrick, Sonia Campa, Giorgio Zoppi, Daniele Buono, Silvia Lametti, Tudor Serban

Any questions?

[email protected]