The Merrimac Project: towards a Petaflop computer
Massimiliano Fatica, Stanford University
CASPUR, June 23, 2003
Motivation
Overview of the Merrimac project: hardware and software
Applications: StreamFLO
Brooktran
Conclusions
Outline
Motivation
How to make the next step:
Special-purpose hardware: Grape (gravitational N-body problem)
MD-Grape (Molecular dynamics)
General-purpose hardware: keep building “monster” clusters
New architectures:
HTMT Petaflop project (led by Sterling): use of esoteric technology, multithreaded architecture
BlueGene/L
Merrimac project: streaming supercomputer
From Teraflops to Petaflops
Performance
The theoretical peak performance of the ASCI machines is in the Teraflops range, but sustained performance with real applications is far from the peak
Salinas, one of the 2002 Gordon Bell Award winners, sustained 1.16 Tflops on ASCI White (less than 10% of peak)
On the Earth Simulator, a custom-engineered system with exceptional memory bandwidth, interconnect performance and vector processing capabilities, a global atmospheric simulation achieved 65% of the 40 Tflops peak performance
Price/Performance
[Figure: price/performance plot (GFlops vs. cost in $, log-log) comparing ASCI machines and Linux clusters with the Earth Simulator (ES), Red Storm, Desktop Merrimac and Merrimac, spanning the TFlops to PFlops range]
Performance/Cost Comparisons
• Earth Simulator (today)
  – Peak 40 TFLOPS, ~$450M
  – 0.09 MFLOPS/$, sustained 0.04 MFLOPS/$
• Red Storm (2004)
  – Peak 40 TFLOPS, ~$90M
  – 0.44 MFLOPS/$
• Merrimac (proposed 2006)
  – Peak 2 PFLOPS: 40 TFLOPS, < $1M ($312K)
  – 128 MFLOPS/$, sustained 64 MFLOPS/$ (single node)
Numbers are sketchy today, but even if we are off by 2x, improvement over status quo is large
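(As a quick check of the ratios above: 40 TFLOPS for roughly $312K is 40×10^6 MFLOPS / 3.12×10^5 $ ≈ 128 MFLOPS/$; the sustained figure of 64 MFLOPS/$ assumes about half of peak.)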
Merrimac project
The Merrimac Team
HARDWARE: B. Dally
I. Ahn, A. Das, M. Horowitz, U. Kapasi, N. Jayasena
SOFTWARE: P. Hanrahan
I. Buck, M. Erez, J. Gummaraju, T. Knight, C. Kozyrakis, F. Labonte, M. Rosenblum
APPLICATIONS: J. Alonso
T. Barth, E. Darve, M. Fatica, R. Fedkiw, F. Losasso, A. Wray
How did we achieve that?
Abundant, inexpensive arithmetic
  – Can put 100s of 64-bit ALUs on a chip
  – 20 pJ per FP operation
(Relatively) high off-chip bandwidth
  – 1 Tb/s demonstrated, 2 nJ per word off chip
Memory is inexpensive: $100/Gbyte
[Figure: Velio VC3003 with 1 Tb/s I/O bandwidth; nVidia GeForce4, ~120 Gflops/s and ~1.2 Tops/s]
VLSI Makes Computation Plentiful
Objectives for the streaming architecture:
Parallelism:
  – Keep 100s of ALUs per chip (thousands per board, millions per system) busy
Locality of data:
  – Match 20 Tb/s of ALU bandwidth to ~100 Gb/s of chip bandwidth
Latency tolerance:
  – Cover 500-cycle remote memory access time
Exploit VLSI technology
Arithmetic is cheap, global bandwidth is expensive: local << global on-chip << off-chip << global system
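A rough energy comparison using the figures quoted above: an FP operation costs about 20 pJ, while moving one word off chip costs about 2 nJ, so a single off-chip transfer is worth on the order of 100 arithmetic operations. This is the gap the streaming architecture is designed to bridge.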
Architecture of Pentium 4
Current architectures: few ALUs per chip = expensive and limited performance.
Benefits of the streaming architecture
Modern VLSI technology makes arithmetic cheap
  – 100s of GFLOPS/chip today, TFLOPS in 2010
But bandwidth is expensive
Streams change the ratio of arithmetic to bandwidth
  – By exposing producer-consumer locality
    • Cannot be exploited by caches – no reuse, no spatial locality
Streams also expose parallelism
  – To keep 100s of FPUs per processor busy
High-radix networks reduce the cost of bandwidth when it is needed
  – Simplifies programming
Streaming scientific computations exploit the capabilities of VLSI
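To make the producer-consumer point concrete, here is a minimal plain-C sketch (not Brook and not the Merrimac toolchain; CHUNK, produce, consume and stream_pipeline are purely illustrative names): two "kernels" are chained so that the intermediate values live only in a small on-chip-sized buffer and never travel back to main memory.

#include <stddef.h>

#define CHUNK 1024   /* stand-in for the capacity of on-chip stream storage */

/* producer kernel: computes an intermediate value per element */
static void produce(const float *in, float *tmp, size_t n) {
    for (size_t i = 0; i < n; i++) tmp[i] = in[i] * in[i];
}

/* consumer kernel: uses the intermediate value immediately */
static void consume(const float *tmp, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = 1.0f + tmp[i];
}

/* driver: the intermediate stream 'tmp' stays in a CHUNK-sized buffer,
   so the producer-consumer traffic never costs off-chip bandwidth */
void stream_pipeline(const float *in, float *out, size_t n) {
    float tmp[CHUNK];
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t len = (n - base < CHUNK) ? n - base : CHUNK;
        produce(in + base, tmp, len);
        consume(tmp, out + base, len);
    }
}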
Stream processors as a building block
A high-performance interconnect provides good global bandwidth
New programming paradigm to exploit this new architecture:
Strong interaction between application developers and language development group
It will scale from a 2 Tflops workstation to a 2 Pflops machine with 16K processors
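(A quick consistency check on the scaling claim: 2 Pflops spread over 16K stream processors is roughly 128 Gflops per processor, so a 16-processor board, the board size mentioned later, peaks at about 2 Tflops.)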
Merrimac project: Hardware
The Imagine Stream Processor
[Block diagram of the Imagine stream processor: host processor, stream controller, microcontroller, stream register file and network interface, eight ALU clusters (0–7), and a streaming memory system connected to SDRAM]
Programmable signal and image processor
Peak performance of 20 GFlops (32 bit), sustains over 12 GFlops on key signal processing benchmarks
It has demonstrated the potential of the streaming architecture
It is not easy to program (Kernel-C, Stream-C)
Merrimac Architecture
Merrimac node
Merrimac stream processor chip
Flat memory bandwidth within a 16-node board
4:1 concentration within a 32-node backplane, 8:1 across a 32-backplane system
Routers with bandwidth B = 640 Gb/s route messages of length L = 128 b
  – Requires high radix to exploit
Merrimac network
High radix routers enable economical global memory
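One way to read the router figures above: a 128-bit message occupies a 640 Gb/s channel for only 0.2 ns, so short messages cannot keep a few very wide channels busy; splitting the same bandwidth across many narrower ports (a high-radix router) uses it far more efficiently and keeps the network diameter small.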
Merrimac Bandwidth
Bandwidth taper matches capabilities of VLSI to demands of scientific applications
Bandwidth hierarchy enabled by streams
Level Bandwidth per Node (GB/s)
Cluster registers 3,840
Stream register file 512
Stream cache 64
Local Memory 16
Board Memory (16 Nodes) 16
Cabinet Memory (1K Nodes) 4
Global Memory (16K Nodes) 2
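Reading the taper: cluster-register bandwidth (3,840 GB/s) is roughly 2,000 times the global-memory bandwidth (2 GB/s), so an application must perform on the order of a thousand register-level operations for every word it touches in global memory if it is to remain compute-bound.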
Merrimac project: Software
C with streaming constructs:
Make data parallelism explicit
Declare communication pattern
Streams:
Streams are views of memory
Records operated on in parallel
Brook: streaming language
[Diagram: Stream A → Kernel → Stream B]
Kernels are functions which operate only on streams
Stream arguments are read-only or write-only
Special reduction variables
No “state” or static variables allowed
No global memory access
Brook kernels
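As a minimal sketch of these restrictions, written in the Brook syntax used in the examples later in this talk (the kernel name and its body are illustrative, not part of the Merrimac code base): a kernel that scales an input stream and accumulates a reduction, with no state and no global memory access.

typedef stream float floats;

kernel void scale_and_sum(floats a,          // read-only input stream
                          out floats b,      // write-only output stream
                          reduce float *sum) // reduction variable
{
  b = 2.0f * a;       // element-wise operation, no state kept between elements
  *sum = *sum + b;    // reduction accumulates across the whole stream
}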
To port a Fortran/C code to Brook:
Define the data layout and access patterns
Define the computation on the data
Decouple the data access pattern from the computation
The code is cleaner and easier to understand
How to write a stream application?
Brook Example

struct Grid_struct { float x; float y; };
typedef stream struct Grid_struct Gridstream;
typedef struct Grid_struct Gridarray;
typedef stream float Floats;
typedef stream Gridstream **grid2d_s;

int main(int argc, char** argv) {
  Gridarray* mesh; float* vol;
  float volmin, volmax;
  Gridstream meshstream; Floats volstream;
  grid2d_s grid2d "2,2";
  int nx=3, ny=2;
  mesh = malloc(sizeof(Gridarray)*(nx+1)*(ny+1));
  vol  = malloc(sizeof(float)*nx*(ny+2));
  ...........
  streamLoad(meshstream, mesh, (nx+1)*(ny+1));
  streamShape(meshstream, meshstream, 2, nx+1, ny+1);
  streamStencil(grid2d, meshstream, STREAM_STENCIL_HALO, 2, 0, 1, 0, 1);
  ComputeMetric(grid2d, volstream, &volmin, &volmax);
  streamStore(volstream, vol+nx, nx, ny);
  .........
}

kernel void ComputeMetric(grid2d_s grid, out Floats volume,
                          reduce float *volmin, reduce float *volmax) {
  volume = .5*((grid[1][1].x-grid[0][0].x)*(grid[0][1].y-grid[1][0].y)
             - (grid[1][1].y-grid[0][0].y)*(grid[0][1].x-grid[1][0].x));
  *volmin = *volmin < volume ? *volmin : volume;
  *volmax = *volmax > volume ? *volmax : volume;
}
There are two Brook implementations:
1. A compiler based on the Metacompiler, which translates Brook to C and uses a run-time library: it is used to develop and debug codes.
2. A compiler based on the Imagine tools: it is used to assess performance and to study different hardware configurations.
A new compiler infrastructure is being developed using the Open64 compiler
Brook compilers
Major tasks
3 major applications:
StreamFEM
StreamFLO
StreamMD
Applications
StreamFLO
StreamFLO is a streaming version of FLO82 [Jameson] for the solution of the inviscid flow around an airfoil
The code uses a cell centered finite volume formulation with a multigrid acceleration to solve the 2D Euler equations
The structure of the code is similar to TFLO and the algorithm is found in many compressible solvers
There is an external driver that controls the multigrid strategy and the multigrid data movement (restriction and prolongation). All the multigrid levels are stored in a 1D array.
On each multigrid level, the code operates on 2D grids where it needs to compute the time step. Most of the operations are performed on stencils (3x3 and 5x5)
Code organization
[Diagram: multigrid V-cycle, with restriction steps going down to coarser meshes and prolongation steps going back up]
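A rough plain-C sketch of the driver structure just described (all names are hypothetical; the real code is Fortran/Brook): every multigrid level lives in one packed 1D array, and the driver walks the levels calling restriction on the way down and prolongation on the way back up.

#include <stddef.h>

/* One multigrid level: grid size and its start index in the packed 1D array. */
typedef struct { int nx, ny; size_t offset; } Level;

/* Placeholder kernels: in StreamFLO these are the per-level stencil computations. */
static void smooth(float *w, const Level *lvl)                    { (void)w; (void)lvl; }
static void restrict_to(float *w, const Level *f, const Level *c) { (void)w; (void)f; (void)c; }
static void prolong_to(float *w, const Level *c, const Level *f)  { (void)w; (void)c; (void)f; }

/* External driver: controls the multigrid strategy and the data movement. */
void vcycle(float *w, const Level *levels, int nlevels) {
    for (int l = 0; l < nlevels - 1; l++) {        /* descend: smooth, then restrict */
        smooth(w, &levels[l]);
        restrict_to(w, &levels[l], &levels[l + 1]);
    }
    smooth(w, &levels[nlevels - 1]);               /* coarsest level */
    for (int l = nlevels - 2; l >= 0; l--) {       /* ascend: prolong, then smooth */
        prolong_to(w, &levels[l + 1], &levels[l]);
        smooth(w, &levels[l]);
    }
}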
C     ******************************************************************
C     *                                                                *
C     *   TRANSFERS THE SOLUTION TO A COARSER MESH                     *
C     *                                                                *
C     ******************************************************************
      ..........
      DO N=1,4
        JJ = 1
        DO J=2,JL,2
          JJ = JJ + 1
          II = 1
          DO I=2,IL,2
            II = II + 1
            WWR(II,JJ,N) = (DW(I,J,N)*VOL(I,J) + DW(I+1,J,N)*VOL(I+1,J)
     .        + DW(I,J+1,N)*VOL(I,J+1) + DW(I+1,J+1,N)*VOL(I+1,J+1))
     .        / (VOL(I,J) + VOL(I+1,J) + VOL(I,J+1) + VOL(I+1,J+1))
          END DO
        END DO
      END DO
      ..........
Original FORTRAN subroutine
[Diagram: restriction operator on the mesh, indices 1..IL+1 and 1..JL+1; each 2x2 group of fine-mesh cells (A, B, C, D) is combined into one coarse-mesh cell (AA)]
Equivalent Brook Code

Define the access pattern:
// Fine mesh: flow is a 2D stream of shape (nx+2, ny+2)
// Coarse mesh: coarse_flow is a 2D stream of shape (nx/2, ny/2)
// Select interior points on the fine mesh
streamDomain(flow, flow, 2, 2, nx+1, 2, ny+1);
streamDomain(vol, vol, 2, 2, nx+1, 2, ny+1);
// Generate a stream with the groups {(A,B,C,D), (A,B,C,D), (A,B,C,D), (A,B,C,D)}
streamGroup(local_flow2d, flow, STREAM_GROUP_HALO, 2, 2, 2);
streamGroup(local_vol2d, vol, STREAM_GROUP_HALO, 2, 2, 2);
// Apply the restriction operator to generate a stream with (AA, AA, AA, AA)
TransferFieldFineCoarse(local_flow2d, local_vol2d, coarse_flow);

Define the computational kernel:
kernel void TransferFieldFineCoarse(flow2d_s fine_flow, float2d_s vol,
                                    out Flow coarse_flow)
{
  coarse_flow.rho = (vol[0][0]*fine_flow[0][0].rho + vol[0][1]*fine_flow[0][1].rho
                   + vol[1][0]*fine_flow[1][0].rho + vol[1][1]*fine_flow[1][1].rho)
                  / (vol[0][0] + vol[0][1] + vol[1][0] + vol[1][1]);
  .......
}
Equivalent Brook code: stream definitions

struct Flow_struct {
  float rho;  /* density */
  float u;    /* momentum in x direction = density*velocity_x */
  float v;    /* momentum in y direction = density*velocity_y */
  float e;    /* total energy = density*enthalpy - pressure */
  float p;    /* pressure */
};

typedef stream struct Flow_struct Flow;
typedef stream float floats;
typedef stream Flow **flow2d_s;
typedef stream floats **float2d_s;

main(int argc, char** argv)
{
  Flow flow, local_flow, interior_flow, coarse_flow;
  flow2d_s local_flow2d "2,2";
  ........
}
Preliminary performance: kernel schedule, H-CUSP dissipative flux
[Figure: kernel schedule plots from the scheduler, showing per-cycle occupancy of the functional units (MUL0–MUL3, DIV0, INO0–INO7, SP, COM, MC, JUK, VAL) over roughly 1060 cycles; one plot for the 5x5 stencil setup (dominated by SPREAD, SELECT and communication operations) and one for the flux computation (dominated by FMUL, FADD, FSUB and FINVSQRT arithmetic)]
Set up of the 5x5 stencil: 20% utilization
Computation of the flux: 90% utilization
This kernel currently runs at 37 GFlops on the Merrimac simulator [1]:
58% of the peak performance
Improving the performance of the stencil setup will bring the performance up to 50 GFlops
[1] The simulator does not yet support MADD instructions: the peak performance is 64 GFlops
Working on a 3D version of StreamFLO
Moving from Euler to Navier-Stokes
Evaluating different strategies for flux computations
Ongoing work
Brooktran
Why do we need Fortran support?
Most scientific and high-performance codes are in Fortran
National labs, NASA, aerospace companies have a huge investment in Fortran codes
The codes have been thoroughly tested and validated
They can be HUGE
Even if rewriting a code in a different language is not a big deal, the validation process is
A code not fully validated can be acceptable in academia but not for real missions.
Fortran
The path from C to Brook is much easier than the one from Fortran to Brook:
C to Brook: similar to OpenMP parallelization in extent of changes
Start from the original code and “streamify” functions one by one.
You can start working on the time-consuming parts of the code
Very easy to check the results since all the I/O & utility functions are working
Fortran to Brook: more extensive changes than even MPI parallelization
The code has to be rewritten from scratch
A lot of time must be spent rewriting I/O and utility functions
Checking the results and debugging is very time consuming
Porting codes to Merrimac
Mixed-language programming: Fortran + Brook
Use the original structure of the Fortran code.
The subroutines that are floating-point intensive are replaced by Brook kernels
Streams are a view of memory, so we just need to pass the proper memory information to Brook
It requires some “glue” code and a standard Fortran compiler
Possible paths from Fortran to Brook (1)
Example of mixed-language programming

! FORTRAN main
program sample
  real, allocatable, dimension(:):: a, b, c
  integer:: n
  n = 1000
  allocate(a(n), b(n), c(n))
  call brook_sum(a, b, c, n)
end program sample

// Brook function
void brook_sum(a, b, c, n)
float a[], b[], c[];
int *n;
{
  floats stream_a, stream_b, stream_c;
  streamLoad(stream_a, a, n);
  streamLoad(stream_b, b, n);
  add_array(stream_a, stream_b, stream_c);
  streamStore(stream_c, c, n);
}

// Brook kernel
kernel void add_array(floats a, floats b, out floats c)
{
  c = a + b;
}
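One practical detail (an assumption about a typical build setup, not something stated on the slide): Fortran passes all arguments by reference, and most Fortran compilers append a trailing underscore to external names, so the C-side glue routine would usually be compiled as brook_sum_ and receives every argument as a pointer, which is why n is declared as int *n above.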
Brooktran: streaming language that uses Fortran syntax
The setup of streams is done through library calls
The kernels are written using a Fortran syntax and have the same constraints as Brook kernels
Possible paths from Fortran to Brook (2)
! FORTRAN main
program sample
  real, allocatable, dimension(:):: a, b, c
  stream, real, dimension(:):: stream_a, stream_b, stream_c
  integer:: n
  n = 1000
  allocate(a(n), b(n), c(n))
  call streamLoad(stream_a, a, n)
  call streamLoad(stream_b, b, n)
  call add_array(stream_a, stream_b, stream_c)
  call streamStore(stream_c, c, n)
end program sample

! Brooktran kernel
kernel subroutine add_array(a, b, c)
  stream, real, intent(in):: a, b
  stream, real, intent(out):: c
  c = a + b
end subroutine add_array
Example of Brooktran
A Fortran syntax for Brook will help port legacy codes to Merrimac
Open64 already has a Fortran95 front-end
Fortran 9x array syntax makes stream code very compact
Brooktran
Brooktran syntax
We need to add the following keywords to Fortran:
stream: used to define a stream; it is a native compound object much like an array
kernel: used to specify a function or subroutine that can be executed by the streaming processor unit
reduce: used for reduction arguments in kernels
New keywords
Kernel functions or subroutines are declared by placing the “kernel” keyword before the function or subroutine name
Arguments of the call have the same restrictions as in Brook.
All arguments need to have explicit “intents”
Kernel
kernel subroutine streamsum(a, b, c, sum)
  stream, real, intent(in):: a, b
  stream, real, intent(out):: c
  real, intent(reduce):: sum

  c = a + b
  sum = sum + c
end subroutine streamsum
Streams are a native compound object like arrays
The shape is defined by the dimension given in the declaration
Stream
stream, type(real), dimension(:,:):: a

For streams that are generated from stencil or group operators, we can specify the “shape” of each element:

stream, type(real), dimension(:,:):: b(3,3)
real, dimension(:,:):: mesh
streamSource(a, array, 2, 100, 100)
streamGroup(b, a, STREAM_STENCIL_HALO, 2, -1, 1, -1, 1)
Example

program compute_mesh
  type gridcell
    real:: x
    real:: y
  end type gridcell
  type(gridcell), dimension(:,:), allocatable:: mesh
  real, dimension(:,:), allocatable:: vol
  stream, type(gridcell), dimension(:,:):: a, b(2,2)
  stream, type(real), dimension(:,:):: c
  nx=3; ny=2
  allocate(mesh(nx+1,ny+1), vol(nx,0:ny+1))
  ...........
  call streamSource(a, mesh, 2, nx+1, ny+1)
  call streamStencil(b, a, STREAM_STENCIL_HALO, 2, 0, 1, 0, 1)
  call ComputeMetric(b, c, volmin, volmax)
  call streamSink(c, vol(1,1), nx, ny)

kernel subroutine ComputeMetric(grid, volume, volmin, volmax)
  stream, intent(in), type(gridcell):: grid(2,2)
  stream, intent(out), type(real):: volume
  real, intent(reduce):: volmin, volmax
  volume = .5*((grid(2,2).x-grid(1,1).x)*(grid(1,2).y-grid(2,1).y) &
        &     -(grid(2,2).y-grid(1,1).y)*(grid(1,2).x-grid(2,1).x))
  volmin = min(volmin, volume)
  volmax = max(volmax, volume)
end subroutine ComputeMetric
Stream load/store, domain selection, etc. are done with function calls:
call streamSource(Y,X,100*200)
or
Y=streamSource(X,100*200)
In Brooktran, streams have an associated shape. We should modify the load operator
call streamSource(Y,X,2,100,200)
We can use the Fortran 9x array syntax:
call streamSource(Y, X(51:100,101:200), 50*100)
Stream manipulation
Conclusions
Conclusions
Cost/performance: 100:1 compared to clusters.
Programmable: applicable to a large class of scientific applications.
Porting and developing new code made easier: stream language, support for legacy codes.
Arithmetic intensity is sufficient: bandwidth is not going to be the limiting factor in these applications, and the computation can be naturally organized in a streaming fashion.
The interaction between the application developers and the language development group has helped ensure that Brook can be used to code real scientific applications.
The architecture has been refined in the process of evaluating these applications.
Implementation is much easier than MPI: Brook hides most of the parallelization complexity from the user.
The code is very clean and easy to understand; the streaming versions of these applications are in the range of 1000-5000 lines of code.
Plan
FY02 Demonstrate simple 2D applications on single-node stream processor simulator
FY03 Demonstrate more complex 3D applications on multi-node stream processor simulator
FY04 Detailed microarchitecture, refine programming tools
FY05 Detailed design of streaming supercomputer
FY06 Construct prototype streaming supercomputers
Current effort is technology development
  – Demonstrate feasibility of stream technology
  – Reduce risk, solve key technical issues
Next step is full-scale development
For additional information
http://merrimac.stanford.edu
http://cits.stanford.edu
• (Chapter 5, technical report )