Programming a Heterogeneous Data Parallel Coprocessor using Cn
Ray McConnell, CTO – ClearSpeed, 2006 (www.clearspeed.com)

Transcript
Page 1:

Programming a Heterogeneous Data Parallel Coprocessor using Cn

Ray McConnell, CTO

Page 2:

Extremely Cool - Extremely Fast

For the world's most compute-intensive applications, ClearSpeed provides low power, high performance parallel processing solutions.

Page 3:

CSX Technology

Page 4:

What are ClearSpeed’s products?

• New acceleration coprocessor, the CSX600
  – Assists the serial CPU by running compute-intensive math libraries
  – Can be integrated next to the main CPU on the motherboard
  – …or installed on add-in cards, e.g. PCI-X, PCI Express
  – …or embedded, e.g. aerospace, automotive, medical, defence

• Significantly accelerates libraries and applications
  – Libraries: Level 3 BLAS, LAPACK, FFTW
  – ISV apps: MATLAB, LS-DYNA, ABAQUS, AMBER, etc.
  – In-house codes: using the SDK to port kernels

• ClearSpeed’s Advance™ board: aimed at the server market
  – Dual CSX600 coprocessors
  – R∞ ≈ 50 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – 133 MHz PCI-X
  – Low power: less than 25 watts

Page 5:

CSX600 coprocessor layout

• Array of 96 Processor Elements
• 250 MHz clock
• IBM 0.13 µm FSG process, 8-layer copper metal
• 47% logic, 53% memory
  – More logic than most processors!
  – About 50% of the logic is FPUs
  – Hence around one quarter of the chip is floating-point hardware
• 15 mm × 15 mm die size
• 128 million transistors
• Approx. 10 watts

Page 6:

CSX600 processor core

• Multi-Threaded Array Processing
  – Programmed in high-level languages
  – Hardware multi-threading for latency tolerance
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set
  – Bi-endian (compatible with the host CPU)

• Array of 96 Processor Elements (PEs)
  – Each is a Very Long Instruction Word (VLIW) core, not just an ALU
  – Flexible data parallel processing
  – Built-in PE fault tolerance and resiliency

• High performance, low power dissipation

Page 7:

CSX600 Processing Elements

Each PE is a VLIW core:

• Multiple execution units
• 4-stage floating-point adder
• 4-stage floating-point multiplier
• Divide/square root unit
  – The floating-point units handle 32/64-bit IEEE 754
• Fixed-point MAC, 16×16 → 32+64
• Integer ALU with shifter
• Load/store
• High-bandwidth, 5-port register file (3 read, 2 write)
• Closely coupled 6 KB SRAM for data
• High-bandwidth per-PE DMA (PIO)
• Per-PE address generators
• Complete pointer model, including parallel pointer chasing and vectors of addresses

Page 8:

Advance™ Dual CSX600 PCI-X accelerator board

– 50 DGEMM GFLOPS sustained
– 0.4 million 1K-point complex single-precision FFTs/s (20 GFLOPS)
– ~200 GB/s aggregate bandwidth to on-chip memories
– 6.4 GB/s aggregate bandwidth to local ECC DDR2 DRAM
– 1 GB of local DRAM (512 MB per CSX600)
– ~1 GB/s to/from the board via PCI-X @ 133 MHz
– < 25 watts for the entire card (8-inch, single-slot PCI-X)

Page 9:

Advance™ Dual CSX600 PCI-X accelerator board

[Board block diagram with link bandwidths of 3.2 GB/s, 3.2 GB/s, 1.6 GB/s and 1 GB/s, and 512 MB–2 GB of local memory]

Page 10:

Which applications can be accelerated?

Any application with significant data parallelism:
• Fine-grained – vector operations
• Medium-grained – unrolled independent loops
• Coarse-grained – multiple simultaneous data channels/sets

Example applications and libraries include:
• Linear algebra – BLAS, LAPACK
• Bio-informatics – AMBER, GROMACS, GAUSSIAN, CPMD
• Computational finance – Monte Carlo, genetic algorithms
• Signal processing – FFT (1D, 2D, 3D), FIR, wavelet
• Simulation – FEA, N-body, CFD
• Image processing – filtering, image recognition, DCTs
• Oil & gas – Kirchhoff time/wave migration
• Intelligent systems – artificial neural networks

Page 11:

ClearSpeed applications strategy

[Diagram: standard math libraries (BLAS matrix arithmetic, LAPACK linear algebra, FFT fast Fourier transforms) and the codes that depend on them – LINPACK/Top500, OpenMD molecular dynamics, AMBER, CPMD, density functional theory (DFT), GAMESS/Gaussian computational chemistry, seismic processing for oil & gas, MATLAB, Mathematica and CAE – with strong and weak dependences marked]

Page 12:

ClearSpeed applications strategy

• Provide transparent acceleration of widely used standard libraries
  – Initially target BLAS, LAPACK, FFTW
• Compatible with Intel MKL, AMD ACML, …
  – Works just like OpenGL, via shared libraries and dynamically-linked libraries (DLLs)
  – Plug-and-play acceleration under Linux and Windows
• Port key, widely used applications
  – Choose open source where possible, for dissemination
  – Have ported GROMACS, now porting AMBER
• Create “template” example applications
• Encourage the creation and adoption of standard libraries
  – OpenMD, OpenFFT
• Work with customers to port proprietary codes

Page 13:

Application acceleration structure

Page 14:

BLAS/LAPACK/FFTW uses

• Software known to use BLAS, LAPACK, FFTW, …
  – MATLAB, Mathematica, Maple, Octave, …
  – LINPACK, HPCC
  – IMSL, BCSLIB-EXT, SuperLU, NAG
• FEA, CFD and finance codes
  – ABAQUS, ANSYS, MSC (Nastran, Marc, ADAMS), …
  – LS-DYNA parallel implicit (uses BCSLIB-EXT)
  – CPMD, Molpro, NWChem, GAMESS, Gaussian, …
  – Some silicon design (EDA) tools
  – Numerous oil & gas in-house codes
  – Many, many more!
• ClearSpeed has a profiler (ClearTrace) for analysing an application’s use of standard libraries

Page 15:

High Performance LINPACK (HPL)

Consider a LINPACK run of 10,000 unknowns. It makes many matrix multiply (DGEMM) calls, starting at a size of ≈ 25×10⁹ FMACs (≈ 50×10⁹ floating-point operations) and reducing in size with each call.

On the host alone, the first DGEMM therefore takes e.g. 10 s at 5 GFLOPS, the next DGEMM takes 9.5 s, and so on.

[Diagram: the application issues a DGEMM call to the BLAS system library, which runs on the main CPU(s) against system memory, and the result is passed back on DGEMM return]

Page 16:

Speeding up HPL via accelerated BLAS

Now consider exactly the same system, but with a ClearSpeed accelerator board installed. The ClearSpeed BLAS library intercepts calls to the system BLAS library and offloads them to the board for acceleration.

The first DGEMM now takes e.g. 1 s at 50 GFLOPS, the next DGEMM takes 0.95 s, and so on.

[Diagram: the DGEMM call is intercepted ahead of the BLAS system library and routed to the ClearSpeed accelerator board; the main CPU(s), system memory and DGEMM return path are unchanged]

A sketch of the interception mechanism follows.
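To make the interception step concrete, here is a minimal sketch in plain C of how a shared-library shim can catch the standard Fortran dgemm_ entry point, forward small problems to the host BLAS, and hand large ones to the board. It is an illustration only: OFFLOAD_THRESHOLD and csx_offload_dgemm() are hypothetical placeholders, not ClearSpeed’s actual API.

  /* Sketch of BLAS interception via a shared-library shim (illustrative only).
     csx_offload_dgemm() and OFFLOAD_THRESHOLD are hypothetical placeholders. */
  #define _GNU_SOURCE
  #include <dlfcn.h>

  typedef void (*dgemm_fn)(const char*, const char*,
                           const int*, const int*, const int*,
                           const double*, const double*, const int*,
                           const double*, const int*,
                           const double*, double*, const int*);

  #define OFFLOAD_THRESHOLD 512           /* only offload large matrices */

  /* Hypothetical entry point into the accelerator's runtime. */
  extern void csx_offload_dgemm(const char *ta, const char *tb,
                                int m, int n, int k,
                                double alpha, const double *a, int lda,
                                const double *b, int ldb,
                                double beta, double *c, int ldc);

  void dgemm_(const char *ta, const char *tb,
              const int *m, const int *n, const int *k,
              const double *alpha, const double *a, const int *lda,
              const double *b, const int *ldb,
              const double *beta, double *c, const int *ldc)
  {
      if (*m < OFFLOAD_THRESHOLD || *n < OFFLOAD_THRESHOLD || *k < OFFLOAD_THRESHOLD) {
          /* Small call: fall through to the next dgemm_ in link order,
             i.e. the host's own BLAS library. */
          dgemm_fn host = (dgemm_fn)dlsym(RTLD_NEXT, "dgemm_");
          host(ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
          return;
      }
      /* Large call: hand the whole operation to the accelerator board. */
      csx_offload_dgemm(ta, tb, *m, *n, *k, *alpha, a, *lda, b, *ldb, *beta, c, *ldc);
  }

Under Linux such a shim can be placed ahead of the system BLAS with LD_PRELOAD or ordinary link order, which is one way to get the plug-and-play behaviour described on the strategy slide.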

Page 17:

CSX600 Level 3 BLAS performance

Source: vendor websites

Matrix Multiply (DGEMM) performance, GFLOPS:
• IBM BG/L (700 MHz): 5.2
• IBM PowerPC 970 (2.2 GHz): 6.5
• IBM POWER5 (1.9 GHz): 14.4
• AMD Opteron 285 (2.6 GHz): 9.4
• Intel Pentium D 950 (3.4 GHz): 12.2
• Intel Itanium 2 (1.6 GHz): 5.9
• NEC SX-8 (2 GHz): 14.0
• ClearSpeed Advance board: 50.0

Page 18:

CSX600 Level 3 BLAS power efficiency

Source: vendor websites

Matrix Multiply (DGEMM) power efficiency, MFLOPS per watt:
• IBM BG/L (700 MHz): 431
• IBM PowerPC 970 (2.2 GHz): 82
• IBM POWER5 (1.9 GHz): 120
• AMD Opteron 285 (2.6 GHz): 99
• Intel Pentium D 950 (3.4 GHz): 129
• Intel Itanium 2 (1.6 GHz): 60
• NEC SX-8 (2 GHz): 108
• ClearSpeed Advance board: 2000

Page 19:

DGEMM performance from hardware

[Chart: DGEMM GFLOPS (0–50) versus matrix size (0 to 6144, in steps of 384)]

Page 20:

CSX600 LINPACK performance on 4-core machine

Page 21:

ClearSpeed applications strategy

[Chart: applications ranked from bandwidth-limited (e.g. Stream, DAXPY, DDOT, SparseMV) to core-limited (e.g. Linpack, DGEMM), with roughly 1 byte per flop marking the boundary; other codes shown include SPECfp2000, NWChem, fluid dynamics, ocean models, petroleum reservoir, auto NVH, auto crash, weather, seismic and GAMESS. ClearSpeed plays at the core-limited end today; PCI-e and next generations broaden its reach]

Page 22:

MATLAB acceleration

Plug-and-play MATLAB acceleration

• Original time on a 3.2 GHz x86: 8.1 seconds

• Time with ClearSpeed FFT acceleration: 1.6 seconds

• Time with ClearSpeed convolution acceleration as well: 1.2 seconds

• 6X acceleration!

Page 23:

Software

Page 24:

Software development environment

Software Development Kit (SDK):
• Cn compiler (ANSI C based commercial compiler), assembler, libraries, ddd/gdb-based debugger, newlib-based C run-time library, etc.
• Extensive documentation and training
• CSX600 dual-processor development boards
• Microcode Development Kit (MDK): microcode compiler, debugger, and the standard ISET
• Available for Linux and Windows

Page 25:

Gdb/ddd debugger

Port of standard gdb enables most GUIs to “just work” with the CSX600:

• Hardware supports single-step, breakpoints, etc.

• gdb port is multi-everything (thread, processor and board)

• Visualize all the state in the PEs

• Hardware performance counters also exposed via gdb

Page 26:

Thread profiler

• The CSX600 is 8-way threaded
  – Typically 1 compute thread and 1 or more I/O threads

• The hardware supports tracing in real time
  – Thread switches
  – I/O operation start/finish

Page 27:

CSX600 from a programmer’s perspective

• Mono execution unit and poly execution unit
  – Instructions can be executed in either domain
  – mono variables are scalar (single value)
  – poly variables are vector (multiple values)

• 2 domains, 2 types of memory:
  – mono memory (e.g. card memory)
  – poly memory (embedded in the poly execution unit)
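As a small, hedged sketch (not from the deck) of what the two memory spaces look like from a Cn program:

  /* Sketch: one array in each memory space; sizes are illustrative. */
  void memory_spaces_demo(void)
  {
      mono double table[1024];   /* one copy, in mono (card) memory           */
      poly double local[64];     /* 64 doubles in every PE's 6 KB poly SRAM   */

      table[0] = 1.0;            /* a single store on the mono side           */
      local[0] = 0.0;            /* 96 stores, one in each PE's poly memory   */
  }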

Page 28:

Cn: Extending C for SIMD Array Programming

• New keywords: mono and poly storage qualifiers
  – mono is a serial (single) variable
  – poly is a parallel (vector) variable

• Contrast the two types:
  – mono:
    • One copy exists, on the mono execution unit
    • Visible to all processing elements in the poly execution unit
    • mono is assumed unless poly is specified
  – poly:
    • One copy per processing element in the poly execution unit
    • Visible to a single processing element
    • Data can be shared between PEs via “swazzle”
    • Not visible to the mono execution unit
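A minimal sketch (not from the deck) showing both qualifiers together; pe_num is used informally as the per-PE index, as it is in the daxpy port later in the deck:

  /* Sketch: one mono value shared by all PEs, one poly value per PE. */
  void mono_poly_demo(void)
  {
      mono int scale = 4;   /* mono is the default; a single copy, visible to all PEs */
      poly int id, r;

      id = pe_num;          /* each PE holds its own index, 0..95                     */
      r  = id * scale;      /* the mono operand is broadcast; every PE computes its r */
  }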

Page 29:

Cn - Variables

• poly variables are akin to an array of mono variables. Consider:

  int ma, mb, mc;
  poly int pa, pb, pc;

  mc = ma + mb;   /* one addition, on the mono execution unit */
  pc = pa + pb;   /* one addition on every PE, in parallel    */

• The variables pa, pb, pc exist on all PEs
  – Default configuration: 96 PEs

Page 30:

Cn - Broadcast

  int ma;
  poly int pb, pc;

  pc = ma + pb;

• The mono variable ma is broadcast to every PE in the poly execution unit

Page 31:

Cn - Pointers

• mono and poly can be used with pointers:

  mono int * mono mPmi;   /* mono pointer to mono int */
  poly int * mono mPpi;   /* mono pointer to poly int */
  mono int * poly pPmi;   /* poly pointer to mono int */
  poly int * poly pPpi;   /* poly pointer to poly int */

• Most commonly used type: a mono pointer to a poly type

  poly <type> * mono <varname>;
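A sketch (not from the deck) of that common case: the pointer value is the same on every PE, but each dereference touches that PE’s own poly memory.

  /* Sketch: scale an array that exists once per PE. */
  void scale_poly_block(poly double * mono data, mono unsigned n, double alpha)
  {
      mono unsigned i;

      for (i = 0; i < n; i++)          /* one mono loop, issued once for all PEs */
          data[i] = data[i] * alpha;   /* 96 multiplies per iteration            */
  }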

Page 32:

Cn mono to poly pointers

• mono pointer to poly int: poly int * mono mPpi;
  Note: it points to the same location in each PE’s poly memory

[Diagram: a single pointer held in mono memory addressing an int at the same offset in every PE’s poly memory]

Page 33:

Cn – Poly to mono pointers

• De-referencing a poly pointer to mono is not permitted:

  mono int * poly pPmi;
  poly int Pi;

  Pi = *pPmi;   /* Not permitted */

• Instead, the copy is available through a Cn library function call:

  mono int * poly pPmi;
  poly int Pi;

  memcpym2p(&Pi, pPmi, sizeof(int));   /* OK */
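The copy in the opposite direction, from poly memory back to mono memory, uses memcpyp2m (it appears in the daxpy port two slides later). A minimal sketch, again using pe_num informally as the per-PE index:

  /* Sketch: each PE writes its own result to a distinct element of a mono array;
     &results[pe_num] is a poly pointer to mono memory. */
  void gather_results(double *results)   /* results lives in mono memory */
  {
      poly double r;

      r = pe_num;                                       /* per-PE value, for illustration */
      memcpyp2m(&results[pe_num], &r, sizeof(double));  /* 96 writes, one per PE          */
  }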

Page 34:

Cn - Conditionals

• if statements in Cn depend on the multiplicity of the condition
• A poly execution unit (SIMD) can NOT skip poly conditional code:
  – Single instruction stream for all PEs
  – All PEs execute instructions in lockstep
  – All code must be issued, but not necessarily executed
• Example:

  if (a == b)   /* true on some PEs, false on some */
      /* Always issued, may be ignored */
  else
      /* Always issued, may be ignored */
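A hedged sketch (not from the deck) of how a poly conditional typically reads in practice; both assignments are issued to the whole array, and each PE’s enable state decides which one it actually performs:

  /* Sketch: per-PE maximum of two poly values. */
  poly int poly_max(poly int a, poly int b)
  {
      poly int r;

      if (a > b)
          r = a;   /* executed only on PEs where the condition holds */
      else
          r = b;   /* executed only on the remaining PEs             */

      return r;
  }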

Page 35:

Porting code

Original serial C:

  void daxpy(double *c, double *a, double alpha, uint N) {
    uint i;
    for (i = 0; i < N; i++)
      c[i] = c[i] + a[i] * alpha;
  }

Cn port (each PE handles one element per strip of num_pes):

  void daxpy(double *c, double *a, double alpha, uint N) {
    uint i;
    poly double cp, ap;
    for (i = 0; i < N; i += num_pes) {
      memcpym2p(&cp, &c[i + pe_num], sizeof(double));   /* gather this PE's element   */
      memcpym2p(&ap, &a[i + pe_num], sizeof(double));
      cp = cp + ap * alpha;                             /* compute on all PEs at once */
      memcpyp2m(&c[i + pe_num], &cp, sizeof(double));   /* scatter the result back    */
    }
  }
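The port above assumes N is a multiple of num_pes. A hedged variant (not from the deck) guards the final strip with a poly condition so that PEs whose element index falls past the end stay disabled:

  /* Sketch: the same loop with a per-PE guard for a ragged final strip. */
  for (i = 0; i < N; i += num_pes) {
    if (i + pe_num < N) {                               /* poly condition */
      memcpym2p(&cp, &c[i + pe_num], sizeof(double));
      memcpym2p(&ap, &a[i + pe_num], sizeof(double));
      cp = cp + ap * alpha;
      memcpyp2m(&c[i + pe_num], &cp, sizeof(double));
    }
  }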

Page 36:

Example: Cn radix-2 FFT

  void cn_fft(poly float *xy, poly float *w, short n)
  {
      poly short n1, n2, ie, ia, i, j, k, l;
      poly float xt, yt, c, s;

      n2 = n;
      ie = 1;
      for (k = n; k > 1; k = (k >> 1)) {
          n1 = n2;
          n2 = n2 >> 1;
          ia = 0;
          for (j = 0; j < n2; j++) {
              c = w[2*ia];
              s = w[2*ia+1];
              ia = ia + ie;
              for (i = j; i < n; i += n1) {
                  l = i + n2;
                  xt = xy[2*l] - xy[2*i];
                  xy[2*i] = xy[2*i] + xy[2*l];
                  yt = xy[2*l+1] - xy[2*i+1];
                  xy[2*i+1] = xy[2*i+1] + xy[2*l+1];
                  xy[2*l] = (c*xt + s*yt);
                  xy[2*l+1] = (c*yt - s*xt);
              }
          }
          ie = ie << 1;
      }
  }

Page 37:

ClearSpeed Advance™ and CSX600 summary

ClearSpeed’s Advance™ board delivers new levels of floating-point and integer performance, performance per watt, and ease of use:

• Acceleration of standard libraries and applications, such as Level 3 BLAS, LAPACK, FFTW, MATLAB, Mathematica, ANSYS, ABAQUS, GAMESS, Gaussian, AMBER, …
• 50 GFLOPS sustained from a ClearSpeed Advance™ board
• Callable from C/C++, Fortran, etc.
• ~25 watts per single-slot board
• Multiple Advance™ boards can be combined for even higher performance