Stanford Streaming Supercomputer. Eric Darve, Mechanical Engineering Department, Stanford University. Overview of the Streaming Project. Main PIs: Pat Hanrahan ([email protected]), Bill Dally ([email protected]). Objective: cost/performance of 100:1 compared to clusters.
Transcript
Stanford Streaming Supercomputer
Eric Darve
Mechanical Engineering Department
Stanford University
[Block diagram: a Stream Register File connects Clusters 0–15 through an inter-cluster crossbar; a micro-controller and stream controller sequence execution; a scalar processor with a scalar cache handles scalar code; a memory system/network connects to DRAM.]
12/10/2002 Eric Darve - Stanford Streaming Supercomputer 2/33
Sandia National Laboratories and Cray Inc. finalize $90 million contract for new supercomputer
Collaboration on Red Storm System under Department of Energy’s Advanced Simulation and Computing Program (ASCI)
ALBUQUERQUE, N.M. and SEATTLE, Wash. — The Department of Energy’s Sandia National Laboratories and Cray Inc. (Nasdaq NM: CRAY) today announced that they have finalized a multiyear contract, valued at approximately $90 million, under which Cray will collaborate with Sandia to develop and deliver a new massively parallel processing (MPP) supercomputer called Red Storm. InJune 2002, Sandia reported that Cray had been selected for the award, subject to successful contract negotiations.
• Numbers are sketchy today, but even if we are off by 2x, improvement over status quo is large
[Chart: projected GFLOPS for the Earth Simulator (ES), Red Storm, a desktop SSS, and SSS-based ASCI machines.]
How did we achieve that?
VLSI Makes Computation Plentiful
VLSI: very large-scale integration, the current level of microchip miniaturization (chips containing hundreds of thousands of transistors or more).
• Abundant, inexpensive arithmetic
– Can put 100s of 64-bit ALUs on a chip
– 20pJ per FP operation
• (Relatively) high off-chip bandwidth
– 1Tb/s demonstrated, 2nJ per word off chip
• Memory is inexpensive $100/Gbyte
NVIDIA GeForce4: ~120 Gflops/s, ~1.2 Tops/s
Velio VC3003: 1 Tb/s I/O BW
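The energy numbers above imply a large gap between computing and communicating: moving one word off chip (2 nJ) costs about 100 times as much as one FP operation (20 pJ). A minimal sketch of that arithmetic (the function name is ours, not from the slides):

```c
/* Ratio of off-chip word energy to FP-op energy.
   Slide numbers: 2 nJ per word off chip, 20 pJ per FP operation. */
double offchip_to_flop_energy_ratio(double word_nj, double flop_pj) {
    return (word_nj * 1000.0) / flop_pj;   /* convert nJ to pJ, then divide */
}
```

With the slide's numbers the ratio is 100, which is why streaming architectures work so hard to keep data on chip.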
But VLSI imposes some constraints
Current architecture: few ALUs per chip = expensive and limited performance.
Objective for the SSS architecture:
• Keep hundreds of ALUs per chip busy.
Difficulty:
• Locality of data: we need to match 20 Tb/s of ALU bandwidth to ~100 Gb/s of chip bandwidth.
• Preliminary schedules obtained using the Imagine architecture:
– High arithmetic intensity: all ALUs are kept busy, so sustained Gflops are expected to be very high.
– SRF bandwidth is sufficient: about 1 word per 30 instructions.
• These results helped guide architectural decisions for the SSS.
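The bandwidth mismatch above can be quantified: at ~20 Tb/s of ALU demand against ~100 Gb/s of chip bandwidth, each word brought on chip must be reused roughly 200 times. A small sketch of that calculation (the function name is ours):

```c
/* On-chip reuse factor needed to feed the ALUs:
   ALU bandwidth (Tb/s) divided by chip bandwidth (Gb/s). */
double required_reuse(double alu_tbps, double chip_gbps) {
    return (alu_tbps * 1000.0) / chip_gbps;   /* Tb/s -> Gb/s */
}
```

This 200:1 reuse requirement is exactly what the bandwidth hierarchy (registers, SRF, off-chip memory) is designed to satisfy.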
Observations
• Arithmetic intensity is sufficient: bandwidth is not going to be the limiting factor in these applications, and the computation can be organized naturally in a streaming fashion.
• The interaction between the application developers and the language development group has helped ensure that Brook can be used to code real scientific applications.
• The architecture has been refined in the process of evaluating these applications.
• Implementation is much easier than with MPI: Brook hides all the parallelization complexity from the user. The code is clean and easy to understand, and the streaming versions of these applications are in the range of 1000–5000 lines of code.
A GPU is a stream processor
• The GPU on a graphics card is a streaming processor.
• NVIDIA recently announced that their latest graphics card, the NV30, will be programmable and capable of delivering 51 Gflops peak performance (vs. 1.6 Gflops for a Pentium 4).
Can we use this computing power for scientific applications?
Cg: Assembly or High-level?
Assembly…
DP3 R0, c[11].xyzx, c[11].xyzx;
RSQ R0, R0.x;
MUL R0, R0.x, c[11].xyzx;
MOV R1, c[3];
MUL R1, R1.x, c[0].xyzx;
DP3 R2, R1.xyzx, R1.xyzx;
RSQ R2, R2.x;
MUL R1, R2.x, R1.xyzx;
ADD R2, R0.xyzx, R1.xyzx;
DP3 R3, R2.xyzx, R2.xyzx;
RSQ R3, R3.x;
MUL R2, R3.x, R2.xyzx;
DP3 R2, R1.xyzx, R2.xyzx;
MAX R2, c[3].z, R2.x;
MOV R2.z, c[3].y;
MOV R2.w, c[3].y;
LIT R2, R2;
...
Cg
COLOR cPlastic = Ca + Cd * dot(Nf, L) + Cs * pow(max(0, dot(Nf, H)), phongExp);
Cg uses separate vertex and fragment programs
[Pipeline diagram: Application → Vertex Processor → Assembly & Rasterization → Fragment Processor → Framebuffer Operations → Framebuffer. The vertex and fragment processors each run their own program; the fragment processor also reads textures.]
Characteristics of NV30 & Cg
• Characteristics of the GPU:
– Optimized for 4-vector arithmetic
– Cg has vector data types and operations, e.g. float2, float3, float4
– Cg also has matrix data types, e.g. float3x3, float3x4, float4x4
• Some math:
– sin/cos/etc.
– normalize
• Dot product: dot(v1, v2);
• Matrix multiply:
– matrix-vector: mul(M, v); // returns a vector
– vector-matrix: mul(v, M); // returns a vector
– matrix-matrix: mul(M, N); // returns a matrix
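To make the Cg intrinsics above concrete, here is a plain-C sketch of dot() on a float4 and matrix-vector mul() with a float4x4 (the types and names are our analogues, not Cg's own):

```c
/* Plain-C analogues of Cg's float4 dot() and mul(M, v). */
typedef struct { float v[4]; } float4;
typedef struct { float m[4][4]; } float4x4;

float dot4(float4 a, float4 b) {
    float s = 0.0f;
    for (int i = 0; i < 4; i++) s += a.v[i] * b.v[i];
    return s;
}

/* matrix-vector: like Cg's mul(M, v), returns a vector */
float4 mul_mv(float4x4 M, float4 x) {
    float4 r;
    for (int i = 0; i < 4; i++) {
        r.v[i] = 0.0f;
        for (int j = 0; j < 4; j++) r.v[i] += M.m[i][j] * x.v[j];
    }
    return r;
}
```

On the GPU each of these collapses to one or a few vector instructions (e.g. DP4), which is where the 4-vector hardware pays off.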
Innermost loop in C: computation of LJ and Coulomb interactions.
for (k=nj0; k<nj1; k++) {          // loop over indices in neighbor list
  jnr = jjnr[k];                   // get index of next j atom (array LOAD)
  j3 = 3*jnr;                      // calc j atom index in coord & force arrays
  jx = pos[j3];                    // load x,y,z coordinates for j atom
  jy = pos[j3+1];
  jz = pos[j3+2];
  qq = iq*charge[jnr];             // load j charge and calc. product
  dx = ix - jx;                    // calc vector distance i-j
  dy = iy - jy;
  dz = iz - jz;
  rsq = dx*dx + dy*dy + dz*dz;     // calc square distance i-j
  rinv = 1.0/sqrt(rsq);            // 1/r
  rinvsq = rinv*rinv;              // 1/(r*r)
  vcoul = qq*rinv;                 // potential from this interaction
  fscal = vcoul*rinvsq;            // scalar force / |dr|
  vctot += vcoul;                  // add to temporary potential variable
  fix += dx*fscal;                 // add to i atom temporary force variable
  fiy += dy*fscal;                 // F = dr * scalarforce / |dr|
  fiz += dz*fscal;
  force[j3]   -= dx*fscal;         // subtract from j atom forces
  force[j3+1] -= dy*fscal;
  force[j3+2] -= dz*fscal;
}
Example: MD
Inner loop in Cg
/* Find the index of the j atoms */
jnr = f4tex1D(jjnr, k);
/* Get the atom positions */
j1 = f3tex1D(pos, jnr.x);
j2 = f3tex1D(pos, jnr.y);
j3 = f3tex1D(pos, jnr.z);
j4 = f3tex1D(pos, jnr.w);
/* Get the vectorial distances, and r^2 */
d1 = i - j1;
d2 = i - j2;
d3 = i - j3;
d4 = i - j4;
We compute four interactions at a time to take advantage of the high performance of vector arithmetic. The atom coordinates are fetched from data stored as a texture. The output contains the x, y and z coordinates of the force and the total energy: the total force due to the 4 interactions and the total potential energy for this particle.
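The four-wide strategy in the Cg fragment above maps naturally onto plain C as well. A sketch of the distance step, one i atom against four j atoms per iteration (the types and names are ours):

```c
/* Four-wide distance step: squared distances |i - j_n|^2, n = 0..3,
   mirroring the Cg float4 strategy of four interactions per pass. */
typedef struct { float x, y, z; } vec3f;

void dist4_sq(vec3f i, const vec3f j[4], float rsq[4]) {
    for (int n = 0; n < 4; n++) {
        float dx = i.x - j[n].x;
        float dy = i.y - j[n].y;
        float dz = i.z - j[n].z;
        rsq[n] = dx*dx + dy*dy + dz*dz;
    }
}
```

On the GPU the four lanes execute as vector operations, so this loop costs roughly what one scalar iteration does.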
Conclusion
• Three representative applications show high bandwidth ratios: StreamMD, StreamFLO, StreamFEM.
• Feasibility of streaming established for scientific applications: arithmetic intensity is high and the bandwidth hierarchy is sufficient.
• Available today: NVIDIA NV30 graphics card.
• Future work:
– Port StreamMD to GROMACS (Folding@Home)
– Extend StreamFEM and StreamFLO to 3D
– Multinode versions of all applications
– Sparse solvers for implicit time-stepping
– Adaptive meshing
– Numerics