Programming Models for Accelerator-Based Architectures
R. Govindarajan
HPC Lab, SERC, IISc
[email protected]
Jan. 2009 © RG@SERC,IISc
Dec 25, 2015
HPC Design Using Accelerators
• High level of performance from accelerators
• Variety of general-purpose hardware accelerators
– GPUs: NVIDIA, ATI, …
– Accelerators: ClearSpeed, Cell BE, …
– Plethora of instruction sets, even for SIMD
• Programmable accelerators, e.g., FPGA-based
• HPC design using accelerators
– Exploit instruction-level parallelism
– Exploit data-level parallelism on SIMD units
– Exploit thread-level parallelism on multiple units/multi-cores
• Challenges
– Portability across different generations and platforms
– Ability to exploit different types of parallelism
Programming in Accelerator-Based Architectures
• Develop a framework that
– Is programmed in a higher-level language, and is efficient
– Can exploit different types of parallelism on different hardware
– Exploits parallelism across heterogeneous functional units
– Is portable across platforms – not device specific!
Jointly with Prof. Matthew Jacob, Architecture Lab., SERC, IISc
Existing Approaches
[Figure: existing language-to-target flows]
– StreamIt → compiler → RAW, Cell BE
– Accelerator → runtime system → GPUs
– Brook → compiler → GPUs
– C/C++ → autovectorizer → SSE/AltiVec
Two-Pronged Approach
[Figure: stream programs are compiled along two paths – (1) a profile-based compiler generating CUDA for GPUs and multicores, and (2) PLASMA, a high-level intermediate representation with its own compiler and runtime system]
Stream Programming Model
• Higher-level programming model in which nodes represent computation and channels represent communication (producer/consumer relations) between them
• Exposes pipelined parallelism and task-level parallelism
• Temporal streaming of data
• Synchronous Data Flow (SDF), Stream Flow Graph, StreamIt, Brook, …
• Compiling techniques for achieving rate-optimal, buffer-optimal, software-pipelined schedules
• Mapping applications to accelerators such as GPUs and Cell BE
The StreamIt Language
• StreamIt programs are a hierarchical composition of three basic constructs:
– Pipeline
– SplitJoin
• Round-robin or duplicate splitter
– FeedbackLoop
• Stateful filters
• Peek values
[Figure: a SplitJoin – a splitter feeding parallel filter streams merged by a joiner – and a FeedbackLoop – joiner → body → splitter with a loop-back stream]
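The constructs above can be sketched in plain Python (an illustrative model, not StreamIt syntax; `Filter` and `run_pipeline` are hypothetical names): each filter declares fixed pop/push rates and filters communicate over FIFO channels, as in a StreamIt pipeline.

```python
# Minimal sketch (hypothetical, not actual StreamIt): filters with fixed
# pop/push rates connected by FIFO channels forming a pipeline.
from collections import deque

class Filter:
    def __init__(self, pop, push, work):
        self.pop, self.push, self.work = pop, push, work

    def fire(self, inp, out):
        # Consume 'pop' items from the input channel, produce 'push' items.
        args = [inp.popleft() for _ in range(self.pop)]
        out.extend(self.work(args))

def run_pipeline(filters, source, firings):
    # One channel feeding each filter, plus one output channel.
    chans = [deque(source)] + [deque() for _ in filters]
    for i, (f, n) in enumerate(zip(filters, firings)):
        for _ in range(n):
            f.fire(chans[i], chans[i + 1])
    return list(chans[-1])

# A two-stage pipeline: duplicate each item, then sum adjacent pairs.
dup = Filter(pop=1, push=2, work=lambda a: [a[0], a[0]])
add = Filter(pop=2, push=1, work=lambda a: [a[0] + a[1]])
print(run_pipeline([dup, add], [1, 2, 3], [3, 3]))  # [2, 4, 6]
```

Because the pop/push rates are compile-time constants, the number of firings of each filter can be computed statically, which is what makes fully static scheduling possible.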
StreamIt
• Number of push/pop values fixed and known at compile time
• Multi-rate firing
[Figure: 2-band equalizer – signal source → duplicate splitter → two (bandpass filter + amplifier) branches → combiner]
Multi-Rate Firing
• Consistent firing rate of nodes to ensure no data accumulation on channels
• If node A fires 3 times, B should fire twice, and C should fire 4 times
• Solving a set of linear equations!
N_A × 2 = N_B × 3
N_B × 4 = N_C × 2
• Multiple solutions possible
• Primitive steady-state solution gives the firing rates: N_A = 3, N_B = 2, N_C = 4
[Figure: chain A → B → C; A pushes 2 and B pops 3 on the first channel, B pushes 4 and C pops 2 on the second]
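The balance equations above can be solved mechanically for the smallest positive integer firing vector. A small Python sketch (illustrative; `firing_rates` is a hypothetical helper, and the graph is assumed acyclic and connected):

```python
# Solve SDF balance equations N_u * push(u,v) = N_v * pop(u,v) for the
# smallest positive integer firing vector (the primitive steady state).
from fractions import Fraction
from math import lcm

def firing_rates(edges, nodes):
    # edges: {(u, v): (push, pop)}; assumes a connected, consistent graph.
    rate = {nodes[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for (u, v), (push, pop) in edges.items():
            if u in rate and v not in rate:
                rate[v] = rate[u] * push / pop
                changed = True
            elif v in rate and u not in rate:
                rate[u] = rate[v] * pop / push
                changed = True
    # Scale rational rates up to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {n: int(r * scale) for n, r in rate.items()}

edges = {("A", "B"): (2, 3), ("B", "C"): (4, 2)}
print(firing_rates(edges, ["A", "B", "C"]))  # {'A': 3, 'B': 2, 'C': 4}
```

This reproduces the slide's solution: A fires 3 times, B twice, C 4 times per steady-state iteration.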
StreamIt on GPUs
• StreamIt provides a convenient way of programming GPUs
• More "natural" than frameworks like CUDA or CTM for most domains
• Easier learning curve than CUDA: the programmer does not need to think of the program in terms of "threads" or blocks, but only as a set of communicating filters
• StreamIt programs are easier to verify, since the I/O rates of each filter are static, and hence the schedule can be determined entirely at compile time.
Challenges on GPUs
• Work distribution between the multiprocessors
– GPUs have hundreds of processors (SMs and SIMD units)!
• Exploiting task-level and data-level parallelism
– Scheduling across the multiprocessors
– Multiple concurrent threads in an SM to exploit DLP
• Determining the execution configuration (number of threads for each filter) that minimizes execution time
• Register constraints (even though there are ~1000s of them)
• Lack of synchronization mechanisms between the multiprocessors of the GPU
• Managing CPU–GPU memory bandwidth efficiently
• "Stateless" filters exploit data parallelism, but "stateful" filters require special attention
Compiling Stream Programs to CUDA for GPUs
• Software pipeline the execution of the stream program on the GPU
– This takes care of synchronization and consistency issues, since the multiprocessors can execute their work in a decoupled fashion, with kernel invocations being the only synchronization points
– Work distribution and scheduling are accomplished by formulating the problem as a unified Integer Linear Program and solving it using standard ILP solvers
– The ILP formulation is simple enough to be solved in a few seconds on current hardware
Stream Graph Execution
Stream Graph / SIMD Execution
[Figure: SIMD execution of a stream graph (filters A, B, C, D) on SM1–SM4 over time steps 0–7 – all instances of a filter (A1–A4, then B1–B4, then the C/D instances) execute in lockstep, giving a buffer requirement of 4×]
Stream Graph Execution
Stream Graph / Software Pipelined Execution
[Figure: software-pipelined execution of the same stream graph (filters A, B, C, D) on SM1–SM4 over time steps 0–7 – filter instances from adjacent steady-state iterations overlap in pipeline stages, reducing the buffer requirement to 2×]
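A minimal sketch of why software pipelining reduces the amount of live data (illustrative only; `swp_schedule` is a hypothetical helper): for a two-stage chain A → B, stage s of iteration i runs at time i + s, so in steady state the A of iteration i+1 overlaps the B of iteration i and only one iteration's worth of A → B data is in flight at a time.

```python
# Hypothetical sketch: software-pipelined schedule for a 2-stage chain A -> B.
# Stage s of iteration i executes at time step i + s.
def swp_schedule(n_iters, stages=("A", "B")):
    sched = {}
    for i in range(n_iters):
        for s, f in enumerate(stages):
            sched.setdefault(i + s, []).append(f"{f}{i}")
    return [sched[t] for t in sorted(sched)]

for t, work in enumerate(swp_schedule(4)):
    print(t, work)
```

At every steady-state time step exactly one A and one B (from consecutive iterations) are active, so the channel between them never holds more than one iteration's output.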
Our Approach
• Good execution configuration determined by profiling
– Identify a near-optimal number of concurrent thread instances per filter
– Takes register constraints into consideration
• Formulate work scheduling and processor (SM) assignment as a unified Integer Linear Program
– Takes communication bandwidth restrictions into account
• Efficient buffer layout scheme to ensure all accesses to GPU memory are coalesced
• Stateful filters are assigned to CPUs – synergistic execution on CPUs and GPUs is ongoing work!
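The coalesced buffer layout idea can be sketched as follows (a simplified illustration, not the paper's exact scheme; the index helpers are hypothetical): storing element j of thread t at index j × num_threads + t makes the threads of a warp touch consecutive addresses at each access step.

```python
# Naive layout: each thread's items are contiguous, so at a given access
# step the threads of a warp stride far apart (uncoalesced).
def naive_index(t, j, items_per_thread):
    return t * items_per_thread + j

# Interleaved layout: element j of thread t lives at j * num_threads + t,
# so at each access step consecutive threads hit consecutive addresses.
def coalesced_index(t, j, num_threads):
    return j * num_threads + t

num_threads, items = 4, 3
# Addresses touched by all threads at access step j = 0:
print([naive_index(t, 0, items) for t in range(num_threads)])            # [0, 3, 6, 9]
print([coalesced_index(t, 0, num_threads) for t in range(num_threads)])  # [0, 1, 2, 3]
```

The second access pattern is the one GPU memory controllers can service as a single coalesced transaction.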
ILP Formulation
• Resource constraints:
– w_{k,v,p} = 1 if the kth instance of filter v is mapped to SM p (each instance is mapped to exactly one SM)
ILP Formulation
• Dependence constraint:
– Schedule time of the kth instance of filter v in steady-state iteration j, where
– o_{k,v} specifies the time within the SWP kernel, and
– f_{k,v} specifies the stage of the SWP kernel
• Filter execution must complete by the end of the kernel
ILP Formulation
• Dependence constraint (contd.):
– Admissibility of the schedule is given by:
• Solving the above constraints gives the schedule!
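As a toy stand-in for the ILP (this is NOT the paper's actual formulation; it is a brute-force search over a tiny instance with hypothetical names): pick an SM and a time step for each filter instance so that each instance runs on exactly one SM, no SM runs two instances at the same time, and dependences are respected, minimizing the schedule length.

```python
# Toy brute-force scheduler illustrating the constraints the ILP encodes.
from itertools import product

def schedule(instances, deps, num_sms, horizon):
    best = None
    slots = list(product(range(num_sms), range(horizon)))  # (sm, time) pairs
    for assign in product(slots, repeat=len(instances)):
        # Resource constraint: at most one instance per SM per time step.
        if len(set(assign)) != len(assign):
            continue
        times = dict(zip(instances, (t for _, t in assign)))
        # Dependence constraint: a producer finishes before its consumer.
        if all(times[a] < times[b] for a, b in deps):
            length = max(times.values()) + 1
            if best is None or length < best[0]:
                best = (length, dict(zip(instances, assign)))
    return best

# Chain A -> {B1, B2} -> C: the two B instances can run in parallel on 2 SMs.
instances = ["A", "B1", "B2", "C"]
deps = [("A", "B1"), ("A", "B2"), ("B1", "C"), ("B2", "C")]
length, assign = schedule(instances, deps, num_sms=2, horizon=3)
print(length)  # 3
```

A real instance is far too large for brute force, which is why the paper formulates the problem as an ILP and hands it to a standard solver.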
Experimental Results
• Speedup of stream programs on a GPU (8800) compared to the CPU
• Filters are coarsened before scheduling!
Experimental Results (contd.)
• Improvements due to buffer coalescing
• More results in the CGO-09 paper!
Two-Pronged Approach
[Figure: second prong highlighted – the PLASMA compiler/runtime-system path, alongside the profile-based CUDA compiler for GPUs and multicores]
What should a solution provide?
• Rich abstractions for functionality
– Not a lowest common denominator
• Independence from any single architecture
• Portability without compromising efficiency
– Don't forget the high-performance goals of the ISA
• Scale up and scale down
– From a single-core embedded processor to a multi-core workstation
• Take advantage of accelerators (GPU, Cell, etc.)
• Transparent distributed memory
PLASMA: Portable Programming for PLASTIC SIMD Accelerators
Our Approach
Stream Program → Intermediate Representation → CUDA, C with intrinsics, …
• Stream or other high-level programming model is lowered to a high-level intermediate language
– Perform suitable compiler optimizations
– Intermediate representation expressive enough to handle (target) machine specificities
• IR to target machine
– Exploit SIMD and thread-level parallelism
– Agnostic to SIMD width
– Manages heterogeneous memory
PLASMA IR
• Operator
– Add, Mult, …
• Vector
– 1-D bulk data type of base types
– E.g., <1, 2, 3, 4, 5>
• Distributor
– Distributes an operator over a vector
– Example: par add <1,2,3,4,5> <10,15,20,25,30> returns <11, 17, 23, 29, 35>
• Vector composition
– Concat, slice, gather, scatter, …
• Matrix-vector multiply (row i):
par mul, temp, A[i * n : i * n + n : 1], X
reduce add, Y[i : i + 1 : 1], temp
[Figure: dataflow – a Slice of matrix M feeds Par Mul together with vector V, whose result feeds Reduce Add]
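The par/reduce semantics of the matrix-vector example can be emulated in Python (a sketch under the assumption that `par` applies an operator elementwise and `reduce` folds a vector; the helper names are hypothetical, not PLASMA API):

```python
# Emulation of the PLASMA "par" and "reduce" distributors for the
# matrix-vector multiply example: for each row i, par mul forms the
# elementwise products of row i with X, and reduce add sums them into Y[i].
import operator
from functools import reduce as fold

def par(op, a, b):
    return [op(x, y) for x, y in zip(a, b)]

def red(op, v):
    return fold(op, v)

def matvec(A, X, n):
    Y = []
    for i in range(len(A) // n):
        temp = par(operator.mul, A[i * n : i * n + n], X)  # par mul
        Y.append(red(operator.add, temp))                  # reduce add
    return Y

A = [1, 2, 3,
     4, 5, 6]           # 2x3 matrix stored row-major, as on the slide
X = [1, 1, 1]
print(matvec(A, X, 3))  # [6, 15]
```

Because `par` is width-agnostic, the same IR-level program can be mapped to SSE lanes, CUDA threads, or scalar C, which is the portability argument of the slide.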
Our Framework
• "CPLASM", a prototype high-level assembly language
• Prototype PLASMA IR compiler
– Currently supported targets: C (scalar), SSE3, CUDA (NVIDIA GPUs)
– Future targets: Cell, ATI, ARM Neon, …
• Compiler optimizations for this "vector" IR
Experimental Results
• Kernel programs written in CPLASM
• Compiled to C or CUDA, exposing SIMD parallelism
• Execution on SSE2 or GPU
• Comparison with hand-optimized libraries
Initial Results
• Compares well with hand-optimized library kernels
• Blocking (tiling) optimization can lead to better performance
Future Directions
• Synergistic execution of stream programs on CPU and GPU
• Support for multiple heterogeneous functional units
• Retargeting PLASMA for multiple accelerators
• Extending the framework beyond stream programming models