Page 1

Massimiliano Fatica, Stanford University

The Merrimac Project: towards a Petaflop computer

CASPUR, June 23, 2003

Page 2

Motivation

Overview of the Merrimac project

Hardware

Software

Applications: StreamFLO

Brooktran

Conclusions

Outline

Page 3

Motivation

Page 4

How to make the next step:

Special purpose hardware: Grape (gravitational N-body problem)

MD-Grape (Molecular dynamics)

General purpose hardware: Keep building “monster” clusters

New architectures:

HTMT Petaflop project (led by Sterling): use of esoteric technology, multithreaded architecture

BlueGene/L

Merrimac project: streaming supercomputer

From Teraflops to Petaflops

Page 5

Performance

The theoretical peak performance of the ASCI machines is in the Teraflops range, but sustained performance with real applications is far from the peak

Salinas, one of the 2002 Gordon Bell Award winners, sustained 1.16 Tflops on ASCI White (less than 10% of peak)

On the Earth Simulator, a custom-engineered system with exceptional memory bandwidth, interconnect performance and vector processing capabilities, a global atmospheric simulation achieved 65% of the 40 Tflops peak performance

Page 6

Price/Performance

[Figure: log-log plot of price ($, 10^4 to 10^8) versus performance (GFlops, 10^0 to 10^6) locating the ASCI machines, Linux clusters, the Earth Simulator (ES), Red Storm, Merrimac and a desktop Merrimac in the TFlops-to-PFlops range.]

Page 7

Performance/Cost Comparisons

• Earth Simulator (today)
  – Peak 40 TFLOPS, ~$450M
  – 0.09 MFLOPS/$
  – Sustained 0.04 MFLOPS/$

• Red Storm (2004)
  – Peak 40 TFLOPS, ~$90M
  – 0.44 MFLOPS/$

• Merrimac (proposed 2006)
  – Peak 2 PFLOPS; 40 TFLOPS for < $1M ($312K)
  – 128 MFLOPS/$
  – Sustained 64 MFLOPS/$ (single node)

Numbers are sketchy today, but even if we are off by 2x, the improvement over the status quo is large
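As a cross-check of the ratios above: 40 TFLOPS / $450M ≈ 0.09 MFLOPS/$, 40 TFLOPS / $90M ≈ 0.44 MFLOPS/$, and 40 TFLOPS / $312K ≈ 128 MFLOPS/$.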

Page 8

Merrimac project

Page 9

The Merrimac Team

HARDWARE: B. Dally

I. Ahn, A. Das, M. Horowitz, U. Kapasi, N. Jayasena

SOFTWARE: P. Hanrahan

I. Buck, M. Erez, J. Gummaraju, T. Knight, C. Kozyrakis, F. Labonte, M. Rosenblum

APPLICATIONS: J. Alonso

T. Barth, E. Darve, M. Fatica, R. Fedkiw, F. Losasso, A. Wray

Page 10

How did we achieve that?

Abundant, inexpensive arithmetic
– Can put 100s of 64-bit ALUs on a chip
– 20 pJ per FP operation

(Relatively) high off-chip bandwidth
– 1 Tb/s demonstrated, 2 nJ per word off chip

Memory is inexpensive: $100/GByte

Examples: Velio VC3003 (1 Tb/s I/O BW), nVidia GeForce4 (~120 Gflops/sec, ~1.2 Tops/sec)

VLSI Makes Computation Plentiful

Page 11

Objectives for the streaming architecture:

Parallelism:
– To keep 100s of ALUs per chip (thousands per board, millions per system) busy

Locality of data:
– To match 20 Tb/s ALU bandwidth to ~100 Gb/s chip bandwidth

Latency tolerance:
– To cover 500-cycle remote memory access time

Exploit VLSI technology

Arithmetic is cheap, global bandwidth is expensive: local << global on-chip << off-chip << global system

[Figure: architecture of a Pentium 4]

Current architecture: few ALUs per chip = expensive and limited performance.

Page 12

Benefits of the streaming architecture

Modern VLSI technology makes arithmetic cheap
– 100s of GFLOPS/chip today, TFLOPS in 2010

But bandwidth is expensive

Streams change the ratio of arithmetic to bandwidth
– By exposing producer-consumer locality
  • Cannot be exploited by caches – no reuse, no spatial locality

Streams also expose parallelism
– To keep 100s of FPUs per processor busy

High-radix networks reduce the cost of bandwidth where it is needed
– Simplifies programming

Streaming scientific computations exploit the capabilities of VLSI

Page 13

Stream processors as a building block

High-performance interconnection provides good global bandwidth

New programming paradigm to exploit this new architecture:

Strong interaction between application developers and language development group

It will scale from a 2 Tflops workstation to a 2 Pflops machine with 16K processors

Merrimac project

Page 14

Merrimac project: Hardware

Page 15

The Imagine Stream Processor

[Figure: block diagram of the Imagine Stream Processor: host processor, stream controller, microcontroller, stream register file, network interface, eight ALU clusters (0-7), and a streaming memory system with SDRAM.]

Programmable signal and image processor

Peak performance of 20 GFlops (32 bit), sustains over 12 GFlops on key signal processing benchmarks

Has shown the potential of the streaming architecture

It is not easy to program (Kernel-C, Stream-C)

Page 16

Merrimac Architecture

Page 17

Merrimac node

Page 18

Merrimac stream processor chip

Page 19

Flat memory bandwidth within a 16-node board

4:1 concentration within a 32-node backplane, 8:1 across a 32-backplane system

Routers with bandwidth B = 640 Gb/s route messages of length L = 128 b
– Requires high radix to exploit

Merrimac network

High-radix routers enable economical global memory

Page 20

Merrimac Bandwidth

Bandwidth taper matches capabilities of VLSI to demands of scientific applications

Page 21

Bandwidth hierarchy enabled by streams

Level                          Bandwidth per node (GB/s)
Cluster registers              3,840
Stream register file           512
Stream cache                   64
Local memory                   16
Board memory (16 nodes)        16
Cabinet memory (1K nodes)      4
Global memory (16K nodes)      2

Page 22

Software

Merrimac project

Page 23

C with streaming constructs:

Make data parallelism explicit

Declare communication pattern

Streams:

Streams are views of memory

Records operated on in parallel

Brook: streaming language

[Figure: a kernel reads Stream A and produces Stream B.]

Page 24

Kernels are functions which operate only on streams

Stream arguments are read-only or write-only

Special reduction variables

No “state” or static variables allowed

No global memory access

Brook kernels
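As a rough plain-C sketch of these constraints (this is not Brook syntax; the names below are made up for illustration), a kernel amounts to a pure function applied record by record to its streams, with nothing but its arguments to read or write:

```c
#include <stddef.h>

/* One "kernel body": consumes one record of each read-only input stream and
   produces one record of the write-only output stream.  No statics, no
   globals, no state carried between records. */
static float axpy_body(float a, float x, float y) {
    return a * x + y;
}

/* What the stream runtime conceptually does with such a kernel: apply the
   body to every record.  In Brook the iteration is implicit and parallel;
   here it is spelled out as a loop. */
void run_axpy_kernel(size_t n, float a,
                     const float *x, const float *y, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = axpy_body(a, x[i], y[i]);
}
```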

Page 25

To port a Fortran/C code to Brook:

define the data layout and access patterns

Define the computation on data

Decouple the data access pattern from the computation

The code is cleaner and easier to understand

How to write a stream application?

Page 26

Brook Example

struct Grid_struct { float x; float y; };
typedef stream struct Grid_struct Gridstream;
typedef struct Grid_struct Gridarray;
typedef stream float Floats;
typedef stream Gridstream **grid2d_s;

int main(int argc, char** argv) {
  Gridarray* mesh;  float* vol;
  Gridstream meshstream;  Floats volstream;
  grid2d_s grid2d "2,2";
  float volmin, volmax;
  int nx=3, ny=2;
  mesh = malloc(sizeof(Gridarray)*(nx+1)*(ny+1));
  vol  = malloc(sizeof(float)*nx*(ny+2));
  ...........
  streamLoad(meshstream, mesh, (nx+1)*(ny+1));
  streamShape(meshstream, meshstream, 2, nx+1, ny+1);
  streamStencil(grid2d, meshstream, STREAM_STENCIL_HALO, 2, 0, 1, 0, 1);
  ComputeMetric(grid2d, volstream, &volmin, &volmax);
  streamStore(volstream, vol+nx, nx, ny);
  .........
}

kernel void ComputeMetric(grid2d_s grid, out Floats volume,
                          reduce float *volmin, reduce float *volmax) {
  volume = .5*((grid[1][1].x-grid[0][0].x)*(grid[0][1].y-grid[1][0].y)
              -(grid[1][1].y-grid[0][0].y)*(grid[0][1].x-grid[1][0].x));
  *volmin = *volmin < volume ? *volmin : volume;
  *volmax = *volmax > volume ? *volmax : volume;
}

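For reference, the quantity ComputeMetric evaluates is the usual finite-volume cell area: half the cross product of the cell's two diagonals, with running minimum and maximum kept as reduction variables. A plain-C sketch of the same computation over an nx x ny array of cells (the mesh layout and index order are assumptions made for the illustration):

```c
#include <float.h>
#include <stddef.h>

struct GridPoint { float x, y; };

/* Area of the cell with corners p00, p10, p01, p11: half the cross product
   of the two diagonals (p11 - p00) and (p01 - p10), the same expression as
   in the ComputeMetric kernel above. */
static float cell_area(struct GridPoint p00, struct GridPoint p10,
                       struct GridPoint p01, struct GridPoint p11) {
    return 0.5f * ((p11.x - p00.x) * (p01.y - p10.y)
                 - (p11.y - p00.y) * (p01.x - p10.x));
}

/* Loop over an nx x ny array of cells defined by an (nx+1) x (ny+1) mesh
   stored row by row; track the min/max volume as the reductions do. */
void compute_metric(size_t nx, size_t ny, const struct GridPoint *mesh,
                    float *vol, float *volmin, float *volmax) {
    *volmin = FLT_MAX;
    *volmax = -FLT_MAX;
    for (size_t j = 0; j < ny; j++) {
        for (size_t i = 0; i < nx; i++) {
            struct GridPoint p00 = mesh[ j      * (nx + 1) + i    ];
            struct GridPoint p10 = mesh[ j      * (nx + 1) + i + 1];
            struct GridPoint p01 = mesh[(j + 1) * (nx + 1) + i    ];
            struct GridPoint p11 = mesh[(j + 1) * (nx + 1) + i + 1];
            float v = cell_area(p00, p10, p01, p11);
            vol[j * nx + i] = v;
            if (v < *volmin) *volmin = v;
            if (v > *volmax) *volmax = v;
        }
    }
}
```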

Page 27

There are two Brook implementations:

1. A compiler based on the Metacompiler, which translates Brook to C and uses a run-time library: it is used to develop and debug codes.

2. A compiler based on the Imagine tools: it is used to assess performance and to study different hardware configurations.

A new compiler infrastructure is being developed using the Open64 compiler

Brook compilers

Page 28

Major tasks

Page 29

3 major applications:

StreamFEM

StreamFLO

StreamMD

Applications

Page 30

StreamFLO

Page 31

StreamFLO

StreamFLO is a streaming version of FLO82 [Jameson] for the solution of the inviscid flow around an airfoil

The code uses a cell centered finite volume formulation with a multigrid acceleration to solve the 2D Euler equations

The structure of the code is similar to TFLO and the algorithm is found in many compressible solvers

Page 32

There is an external driver that controls the multigrid strategy and the multigrid data movement (restriction and prolongation). All the multigrid levels are stored in a 1D array.

On each multigrid level, the code operates on 2D grids where it needs to compute the time step. Most of the operations are performed on stencils (3x3 and 5x5)

Code organization

[Figure: multigrid cycle showing restriction and prolongation between levels.]

Page 33

Original FORTRAN subroutine:

C     ******************************************************************
C     *                                                                *
C     *   TRANSFERS THE SOLUTION TO A COARSER MESH                     *
C     *                                                                *
C     ******************************************************************
      ……….
      DO N=1,4
         JJ = 1
         DO J=2,JL,2
            JJ = JJ + 1
            II = 1
            DO I=2,IL,2
               II = II + 1
               WWR(II,JJ,N) = (DW(I,J,N)*VOL(I,J) + DW(I+1,J,N)*VOL(I+1,J)
     .                      +  DW(I,J+1,N)*VOL(I,J+1) + DW(I+1,J+1,N)*VOL(I+1,J+1))
     .                      / (VOL(I,J) + VOL(I+1,J) + VOL(I,J+1) + VOL(I+1,J+1))
            END DO
         END DO
      END DO
      ……….

[Figure: fine mesh cells A, B, C, D (indices 1..IL+1, 1..JL+1) averaged into single coarse-mesh cells AA.]

Page 34

Equivalent Brook code

Define access pattern:

// Fine mesh: flow is a 2D stream of shape (nx+2,ny+2)
// Coarse mesh: coarse_flow is a 2D stream of shape (nx/2,ny/2)
// Select interior points on the fine mesh
streamDomain(flow, flow, 2, 2, nx+1, 2, ny+1);
streamDomain(vol,  vol,  2, 2, nx+1, 2, ny+1);
// Generate a stream with the groups {(A,B,C,D), (A,B,C,D), (A,B,C,D), (A,B,C,D)}
streamGroup(local_flow2d, flow, STREAM_GROUP_HALO, 2, 2, 2);
streamGroup(local_vol2d,  vol,  STREAM_GROUP_HALO, 2, 2, 2);
// Apply restriction operator to generate a stream with (AA, AA, AA, AA)
TransferFieldFineCoarse(local_flow2d, local_vol2d, coarse_flow);

Define computational kernel:

kernel void TransferFieldFineCoarse(flow2d_s fine_flow, float2d_s vol,
                                    out Flow coarse_flow) {
  coarse_flow.rho = (vol[0][0]*fine_flow[0][0].rho + vol[0][1]*fine_flow[0][1].rho +
                     vol[1][0]*fine_flow[1][0].rho + vol[1][1]*fine_flow[1][1].rho) /
                    (vol[0][0] + vol[0][1] + vol[1][0] + vol[1][1]);
  .......
}

[Figure: the same fine-to-coarse restriction: four fine-mesh cells (A, B, C, D) are combined into one coarse-mesh cell (AA).]
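A plain-C sketch of what this access pattern plus kernel compute, under the grouping described in the comments above (non-overlapping 2x2 blocks of interior fine-mesh cells); the split between a gather step and a small per-block kernel mirrors the streamGroup / TransferFieldFineCoarse pair, while the array layout and halo width here are assumptions:

```c
#include <stddef.h>

struct Flow { float rho; };   /* only density shown, as in the kernel above */

/* "Kernel" part: one coarse value from a 2x2 group of fine values and
   volumes (volume-weighted average, as in TransferFieldFineCoarse). */
static float restrict_cell(const struct Flow f[2][2], const float vol[2][2]) {
    float num = 0.0f, den = 0.0f;
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++) {
            num += vol[a][b] * f[a][b].rho;
            den += vol[a][b];
        }
    return num / den;
}

/* "Access pattern" part: walk the interior of the fine mesh in
   non-overlapping 2x2 blocks and apply the kernel to each block.
   Assumed layout: row-major (ny+2) x (nx+2) fine arrays with one halo cell
   on each side, (ny/2) x (nx/2) coarse array, nx and ny even. */
void restrict_to_coarse(size_t nx, size_t ny,
                        const struct Flow *flow, const float *vol,
                        float *coarse_rho) {
    size_t stride = nx + 2;                    /* fine-mesh row length */
    for (size_t j = 0; j < ny; j += 2)
        for (size_t i = 0; i < nx; i += 2) {
            struct Flow f[2][2];
            float v[2][2];
            for (int a = 0; a < 2; a++)
                for (int b = 0; b < 2; b++) {
                    size_t idx = (j + 1 + a) * stride + (i + 1 + b); /* skip halo */
                    f[a][b] = flow[idx];
                    v[a][b] = vol[idx];
                }
            coarse_rho[(j / 2) * (nx / 2) + i / 2] = restrict_cell(f, v);
        }
}
```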

Page 35

Equivalent Brook code: stream definitions

struct Flow_struct {
  float rho;  /* density */
  float u;    /* momentum in x direction = density*velocity_x */
  float v;    /* momentum in y direction = density*velocity_y */
  float e;    /* total energy = density*enthalpy - pressure */
  float p;    /* pressure */
};

typedef stream struct Flow_struct Flow;
typedef stream float floats;
typedef stream Flow **flow2d_s;
typedef stream floats **float2d_s;

main(int argc, char** argv)
{
  Flow flow, local_flow, interior_flow, coarse_flow;
  flow2d_s local_flow2d "2,2";
  ……..
}

Page 36

Preliminary performance: kernel schedule, H-CUSP dissipative flux

[Figure: VLIW kernel schedules showing the operations issued on each functional unit (MUL, DIV, INO, SP, COMM, MC) over time, for the 5x5 stencil setup and for the flux computation.]

Set up of the 5x5 stencil: 20% utilization

Computation of the flux: 90% utilization

This kernel currently runs at 37 GFlops on the Merrimac simulator [1]:

58% of the peak performance

Improving the performance of the stencil setup will bring the performance up to 50 GFlops

[1] The simulator does not yet support MADD instructions: the peak performance is 64 GFlops
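(As a check, 37 GFlops out of the 64 GFlops no-MADD peak is indeed about 58%.)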

Page 37

Working on 3D version of StreamFLO

Moving from Euler to Navier-Stokes

Evaluate different strategies for flux computations

Ongoing work

Page 38

Brooktran

Page 39

Why do we need Fortran support?

Most scientific and high performance codes are in Fortran

National labs, NASA, aerospace companies have a huge investment in Fortran codes

The codes have been thoroughly tested and validated

They can be HUGE

Even if rewriting a code in a different language is not a big deal, the validation process is

A code not fully validated can be acceptable in academia but not for real missions.

Fortran

Page 40

The path from C to Brook is much easier than the one from Fortran to Brook:

C to Brook: similar to OpenMP parallelization in extent of changes

Start from original code and, one by one, “streamify” functions.

You can start working on the time consuming part of the code

Very easy to check the results since all the I/O & utility functions are working

Fortran to Brook: more extensive changes than even MPI parallelization

The code has to be rewritten from scratch

A lot of time must be spent rewriting I/O and utility functions

Checking the results and debugging is very time consuming

Porting codes to Merrimac

Page 41

Mixed language programming: Fortran + Brook

Use the original structure of the Fortran code.

The subroutines that are floating-point intensive are replaced by Brook kernels

Streams are a view of memory, so we just need to pass the proper memory information to Brook

It requires some “glue” code and a standard Fortran compiler

Possible paths from Fortran to Brook (1)

Page 42

Example of mixed language programming

// FORTRAN main
program sample
  real, allocatable, dimension(:) :: a, b, c
  integer :: n
  n = 1000
  allocate(a(n), b(n), c(n))
  call brook_sum(a, b, c, n)
end program sample

// Brook function
void brook_sum(a, b, c, n)
float a[], b[], c[];
int *n;
{
  floats stream_a, stream_b, stream_c;
  streamLoad(stream_a, a, *n);
  streamLoad(stream_b, b, *n);
  add_array(stream_a, stream_b, stream_c);
  streamStore(stream_c, c, *n);
}

// Brook kernel
kernel void add_array(floats a, floats b, out floats c) { c = a + b; }
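For comparison, a plain-C version of what the glue computes (this is not Brook): Fortran passes its arguments by reference, which is why the length arrives as int *n, and the streamLoad / kernel / streamStore sequence collapses to a single elementwise loop. Linking C against Fortran also typically requires matching the compiler's name-mangling convention (often a trailing underscore), which is omitted here.

```c
/* Plain-C equivalent of brook_sum above.  Fortran passes arguments by
   reference, so the array length arrives as a pointer. */
void plain_sum(const float a[], const float b[], float c[], const int *n) {
    for (int i = 0; i < *n; i++)
        c[i] = a[i] + b[i];      /* what add_array does record by record */
}
```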

Page 43

Brooktran: streaming language that uses Fortran syntax

The setup of streams is done through library calls

The kernels are written using a Fortran syntax and have the same constraints as Brook kernels

Possible paths from Fortran to Brook (2)

Page 44

// FORTRAN main
program sample
  real, allocatable, dimension(:) :: a, b, c
  stream, real, dimension(:) :: stream_a, stream_b, stream_c
  integer :: n
  n = 1000
  allocate(a(n), b(n), c(n))
  call streamLoad(stream_a, a, n)
  call streamLoad(stream_b, b, n)
  call add_array(stream_a, stream_b, stream_c)
  call streamStore(stream_c, c, n)
end program sample

// Brooktran kernel
kernel subroutine add_array(a, b, c)
  stream, real, intent(in)  :: a, b
  stream, real, intent(out) :: c
  c = a + b
end subroutine add_array

Example of Brooktran

Page 45

A Fortran syntax for Brook will help in porting legacy codes to Merrimac

Open64 already has a Fortran95 front-end

Fortran 9x array syntax makes stream code very compact

Brooktran

Page 46

Brooktran syntax

Page 47

We need to add the following keywords to Fortran:

stream: used to define a stream; it is a native compound object much like an array

kernel: used to specify a function or subroutine that can be executed by the streaming processor unit

reduce: used for reduction arguments in kernels

New keywords

Page 48

Kernel functions or subroutines are declared by placing the “kernel” keyword before the function or subroutine name

Arguments of the call have the same restrictions as in Brook.

All arguments need to have explicit “intents”

Kernel

kernel subroutine streamsum(a, b, c, sum)
  stream, real, intent(in)  :: a, b
  stream, real, intent(out) :: c
  real, intent(reduce) :: sum

  c = a + b
  sum = sum + c
end subroutine streamsum
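A plain-C sketch of what the reduce argument means (not Brooktran syntax): the kernel body runs once per stream record, and the reduction variable is combined across all records with the reduction operator, so the runtime is free to evaluate the partial sums in any order.

```c
#include <stddef.h>

/* Conceptual expansion of the streamsum kernel above: c = a + b element by
   element, with `sum` reduced over every record of c. */
float stream_sum(size_t n, const float *a, const float *b, float *c) {
    float sum = 0.0f;                 /* the reduce variable */
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];           /* per-record kernel body */
        sum += c[i];                  /* reduction across records */
    }
    return sum;
}
```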

Page 49

Streams are a native compound object like arrays

The shape is defined by the dimension given in the declaration

Stream

stream, type(real), dimension(:,:) :: a

For streams that are generated from stencil or group operators, we can specify the "shape" of each element:

stream, type(real), dimension(:,:) :: b(3,3)
real, dimension(:,:) :: mesh
call streamSource(a, mesh, 2, 100, 100)
call streamGroup(b, a, STREAM_STENCIL_HALO, 2, -1, 1, -1, 1)

Page 50

Example

program compute_mesh
  type gridcell
    real :: x
    real :: y
  end type gridcell
  type(gridcell), dimension(:,:), allocatable :: mesh
  real, dimension(:,:), allocatable :: vol
  stream, type(gridcell), dimension(:,:) :: a, b(2,2)
  stream, type(real), dimension(:,:) :: c
  nx = 3; ny = 2
  allocate(mesh(nx+1,ny+1), vol(nx,0:ny+1))
  ...........
  call streamSource(a, mesh, 2, nx+1, ny+1)
  call streamStencil(b, a, STREAM_STENCIL_HALO, 2, 0, 1, 0, 1)
  call ComputeMetric(b, c, volmin, volmax)
  call streamSink(c, vol(1,1), nx, ny)

kernel subroutine ComputeMetric(grid, volume, volmin, volmax)
  stream, intent(in), type(gridcell) :: grid(2,2)
  stream, intent(out), type(real) :: volume
  real, intent(reduce) :: volmin, volmax
  volume = .5*((grid(2,2).x-grid(1,1).x)*(grid(1,2).y-grid(2,1).y) &
             - (grid(2,2).y-grid(1,1).y)*(grid(1,2).x-grid(2,1).x))
  volmin = min(volmin, volume)
  volmax = max(volmax, volume)
end subroutine ComputeMetric


Page 51

Stream load/store, domain, etc. is done with function calls:

call streamSource(Y,X,100*200)

or

Y=streamSource(X,100*200)

In Brooktran, streams have an associated shape. We should modify the load operator

call streamSource(Y,X,2,100,200)

We can use the Fortran 9x array syntax:

call streamSource(Y, X(51:100,101:200), 50*100)

Stream manipulation
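A plain-C sketch of what loading the array section X(51:100,101:200) involves (the declared shape real X(100,200) is an assumption for the illustration): gathering a 50 x 100 sub-block of a column-major, 1-based array into a contiguous buffer of 50*100 elements.

```c
#include <stddef.h>

/* Gather the section X(51:100, 101:200) of a Fortran array declared as
   real X(100,200) into a contiguous buffer of 50*100 elements.  Fortran
   arrays are column-major and 1-based, hence the index arithmetic below. */
void load_section(const float *X, float *buf) {
    const int ld = 100;                         /* leading dimension of X */
    size_t k = 0;
    for (int col = 101; col <= 200; col++)      /* second index: 101..200 */
        for (int row = 51; row <= 100; row++)   /* first index:  51..100  */
            buf[k++] = X[(size_t)(col - 1) * ld + (row - 1)];
}
```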

Page 52

Conclusions

Page 53

Conclusions

Cost/Performance: 100:1 compared to clusters.

Programmable: applicable to a large class of scientific applications.

Porting and developing new code made easier: stream language, support of legacy codes.

Page 54

Arithmetic intensity is sufficient: bandwidth is not going to be the limiting factor in these applications, and the computation can be naturally organized in a streaming fashion.

The interaction between the application developers and the language development group has helped ensure that Brook can be used to code real scientific applications.

The architecture has been refined in the process of evaluating these applications.

Implementation is much easier than MPI: Brook hides most of the parallelization complexity from the user. The code is very clean and easy to understand. The streaming versions of these applications are in the range of 1000-5000 lines of code.

Page 55

Plan

FY02 Demonstrate simple 2D applications on single-node stream processor simulator

FY03 Demonstrate more complex 3D applications on multi-node stream processor simulator

FY04 Detailed microarchitecture, refine programming tools

FY05 Detailed design of streaming supercomputer

FY06 Construct prototype streaming supercomputers

Current effort is technology development
– Demonstrate feasibility of stream technology
– Reduce risk, solve key technical issues

Next step is full-scale development

Page 56

For additional information

http://merrimac.stanford.edu

http://cits.stanford.edu

• (Chapter 5, technical report)