[Figure: NAMD scaling on Lemieux: seconds/step for F1-ATPase (PME), 327,506 atoms]
*
Some scaling successes at PSC
- NAMD now scales to 3000 processors, > 1 Tf sustained
- Earthquake simulation code: 2048 processors, 87% parallel efficiency
- Real-time tele-immersion code scales to 1536 processors
- Increased scaling of the Car-Parrinello Ab-Initio Molecular Dynamics (CPAIMD) code from its previous limit of 128 processors (for 128 states) to 1536 processors
*
Payoffs: insight into important real-life problems
- Structure to function of biomolecules
- Increased realism to confront experimental data: earthquakes and the design of buildings; QCD
- Novel uses of HPC: tele-immersion; Internet simulation
*
How Aquaporins Work (Schulten group, University of Illinois)
Aquaporins are proteins that conduct large volumes of water through cell membranes while filtering out charged particles such as hydrogen ions. Start with the known crystal structure, then simulate 12 nanoseconds of molecular dynamics of over 100,000 atoms using NAMD.
*
Aquaporin mechanism
- Water moves through aquaporin channels in single file, with the oxygen leading the way in.
- At the most constricted point of the channel, the water molecule flips. Protons can't do this.
- The animation was cited in the 2003 Nobel Prize in Chemistry announcement.
*
High Resolution Forward and Inverse Earthquake Modeling on Terascale Computers
Volkan Akcelik, Jacobo Bielak, Ioannis Epanomeritakis, Antonio Fernandez, Omar Ghattas, Eui Joong Kim, Julio Lopez, David O'Hallaron, Tiankai Tu (Carnegie Mellon University)
George Biros (University of Pennsylvania)
John Urbanic (Pittsburgh Supercomputing Center)
*
Complexity of earthquake ground motion simulation
- Multiple spatial scales: wavelengths vary from O(10 m) to O(1000 m); basin/source dimensions are O(100 km)
- Multiple temporal scales: O(0.01 s) to resolve the highest frequencies of the source; O(10 s) to resolve shaking within the basin (see the sizing sketch after this list)
- So unstructured grids are needed, even though good parallel performance is harder to achieve
- Highly irregular basin geometry
- Highly heterogeneous soil material properties
- Geology and source parameters observable only indirectly
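The spatial and temporal scales above translate directly into problem size. A back-of-the-envelope sketch (grid spacing of order the minimum wavelength is an illustrative assumption; the other numbers are the scales quoted above):

```python
# Rough sizing for a uniform grid at the finest scale.
basin_extent_m = 100e3   # O(100 km) basin dimension
spacing_m      = 10.0    # grid spacing of order the O(10 m) minimum wavelength (assumed)
duration_s     = 10.0    # O(10 s) of shaking
dt_s           = 0.01    # O(0.01 s) time step for the highest source frequencies

pts_per_dim   = basin_extent_m / spacing_m    # ~1e4
grid_points   = pts_per_dim ** 3              # ~1e12 for a uniform grid
time_steps    = duration_s / dt_s             # ~1e3
point_updates = grid_points * time_steps      # ~1e15 point-updates per run

print(f"{grid_points:.0e} grid points x {time_steps:.0f} steps = {point_updates:.0e} point-updates")
# A uniform grid at the finest resolution is already petascale-sized work,
# which is why the project uses unstructured meshes that refine only where
# the wavelengths are short.
```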
*
Performance of forward earthquake modeling code on PSC Terascale system
Largest simulation: 28 Oct 2001 Compton aftershock in the Greater LA Basin
- Maximum resolved frequency: 1.85 Hz
- 100 m/s minimum shear wave velocity
- Physical size: 100 x 100 x 37.5 km^3
- Number of elements: 899,591,066
- Number of grid points: 1,023,371,641
- Number of slave nodes: 125,726,862
- 25 sec wallclock/time step on 1024 PEs
- 65 GB input
- lemieux at PSC
*
Role of PSC
Assistance in:
- Optimization
- Efficient I/O of terabyte-size datasets
- Expediting scheduling
- Visualization
*
Inverse problem: use records of past seismic events to improve the velocity model
- Significant Southern California earthquakes since 1812
- Seismometer locations and intensity map for the Northridge earthquake
*
Inverse problem
*
Major recognition
This entire effort won the Gordon Bell prize for special achievement in 2003, the premier prize for outstanding computations in HPC. It is given to the entry that uses innovative techniques to demonstrate the most dramatic gain in sustained performance for an important class of real-world applications.
*
QCD
Increased realism to confront experimental data
- QCD: compelling evidence for the need to include quark virtual degrees of freedom
- Improvements due to continued algorithmic development, access to major platforms, and sustained effort over decades
*
Tele-immersion (real time)
Can process 6 frames/sec (640 x 480) from 10 camera triplets using 1800 processors.
Henry Fuchs, University of North Carolina
*
Simulating network traffic (almost real time)
George Riley et al. (Georgia Tech)
- Simulating networks with > 5 million elements
- Modeled 106 million packet transmissions in one second of wall clock time, using 1500 processors
- Near real time web traffic simulation: empirical HTTP traffic model [Mah, Infocom 97]
- 1.1M nodes, 1.0M web browsers, 20.5M TCP connections
- 541 seconds of wall clock time on 512 processors to simulate 300 seconds of network operation (ratio checked below)
- Fastest detailed computer simulations of computer networks ever constructed
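A quick check of the "almost real time" claim, using only the figures quoted above (a minimal sketch):

```python
# Slowdown factor of the web-traffic simulation relative to real time.
wall_clock_s = 541.0   # wall clock seconds on 512 processors
simulated_s  = 300.0   # seconds of network operation simulated

slowdown = wall_clock_s / simulated_s
print(f"slowdown vs. real time: {slowdown:.1f}x")   # ~1.8x, hence "almost real time"
```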
*
Where are grids in all this?
Grids are aimed primarily at:
- Availability: computing on demand
- Reducing the effect of geographic distance
- Making services more transparent
- Motivated by remote data, on-line instruments, and sensors, as well as computers
They also contribute to the highest end by aggregating resources.
"The emerging vision is to use cyberinfrastructure to build more ubiquitous, comprehensive digital environments that become interactive and functionally complete for research communities in terms of people, data, information, tools, and instruments and that operate at unprecedented levels of computational, storage, and data transfer capacity." (NSF Blue Ribbon Panel on Cyberinfrastructure)
*
DTF (2001)
- IA-64 clusters at 4 sites
- 10 Gb/s point-to-point links
- Can deliver 30 Gb/s between 2 sites
[Diagram: physical topology (full mesh) linking SDSC, NCSA, Caltech, and ANL through LA and Chicago]
*
Extensible Terascale Facility (2002)
- Make the network scalable, so introduce hubs
- Allow heterogeneous architecture, and retain interoperability
- First step is integration of PSC's TCS machine
- Many more computer science interoperability issues
3 new sites approved in 2003 (Texas, Oak Ridge, Indiana)
*
Examples of Science Drivers
GriPhyN - particle physics: Large Hadron Collider at CERN
- Overwhelming amount of data for analysis (> 1 PB/year)
- Find rare events resulting from the decays of massive new particles in a dominating background
- Need new services to support world-wide data access and remote collaboration for coordinated management of distributed computation and data without centralized control
*
Examples of Science Drivers
NVO - National Virtual Observatory
- Breakthroughs in telescope, detector, and computer technology allow astronomical surveys to produce terabytes of images and catalogues, in different wavebands, from gamma- and X-rays through optical and infrared to radio.
- Soon it will be easier to "dial up" a part of the sky than to wait many months for access to a telescope.
- Need multi-terabyte on-line databases interoperating seamlessly, interlinked catalogues, and sophisticated query engines.
- Research results from on-line data will be just as rich as those from "real" telescopes.
*
UK-TeraGrid HPC-Grid Experiment
TeraGyroid: lattice-Boltzmann simulations of defect dynamics in amphiphilic liquid crystals
Peter Coveney (University College London), Richard Blake (Daresbury Lab), Stephen Pickles (Manchester), Bruce Boghosian (Tufts), ANL
*
Project Partners
RealityGrid partners:
- University College London (Application, Visualisation, Networking)
- University of Manchester (Application, Visualisation, Networking)
- Edinburgh Parallel Computing Centre (Application)
- Tufts University (Application)
UK High-End Computing Services:
- HPCx, run by the University of Edinburgh and CCLRC Daresbury Laboratory (Compute, Networking, Coordination)
- CSAR, run by the University of Manchester and CSC (Compute and Visualisation)
TeraGrid sites:
- Argonne National Laboratory (Visualization, Networking)
- National Center for Supercomputing Applications (Compute)
- Pittsburgh Supercomputing Center (Compute, Visualisation)
- San Diego Supercomputer Center (Compute)
*
Project explanation
- Amphiphiles are chemicals with hydrophobic (water-avoiding) tails and hydrophilic (water-attracting) heads. When dispersed in solvents or oil/water mixtures, they self-assemble into complex shapes; some (gyroids) are of particular interest in biology.
- Shapes depend on parameters such as the abundance and initial distribution of each component and the strength of the surfactant-surfactant coupling.
- Desired structures can sometimes only be seen in very large systems; e.g. how smaller regions form gyroids in different directions, and how they then interact, is of major significance.
- Project goal: study defect pathways and dynamics in gyroid self-assembly.
*
Networking
[Diagram: TeraGrid to UK network path via Amsterdam; BT provision, NetherLight]
*
Distribution of function
- Computations run at HPCx, CSAR, SDSC, PSC and NCSA (7 TB memory, ~5K processors in the integrated resource)
- One gigabit of LB3D data is generated per simulation time-step
- Visualisation run at Manchester / UCL / Argonne
- Scientists steer calculations from UCL and Boston over the Access Grid
- Steering requires reliable near-real-time data transport across the Grid to visualisation engines
- Visualisation output and collaborations multicast to SC03 in Phoenix and visualised on the show floor in the University of Manchester booth
*
Exploring parameter space through computational steering
Rewind and restart from checkpoint.
- Initial condition: random water/surfactant mixture
- Self-assembly starts
- Lamellar phase: surfactant bilayers between water layers
- Cubic micellar phase, low surfactant density gradient
- Cubic micellar phase, high surfactant density gradient
*
Results
Linking these resources allowed computation of the largest set of lattice-Boltzmann (LB) simulations ever performed, involving lattices of over one billion sites.
*
How do upcoming developments deal with the major technical issues?
Memory bandwidth (see the worked example after this list):
- Old Crays: 2 loads and a store per clock = 12 B/flop
- TCS: better than most commodity processors, 1 B/flop
- Earth Simulator: 4 B/flop
Power:
- TCS: ~700 kW
- Earth Simulator: ~4 MW
Space:
- TCS: ~2500 sq ft
- ASCI Q: new machine room of ~40,000 sq ft
- Earth Simulator: 3250 sq m
Reliability
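To make the bytes-per-flop figures concrete, here is a minimal sketch of how such a ratio is derived. The load/store pattern is the one quoted above for the old Crays; the flops-per-clock value is an assumption for illustration:

```python
# Bytes/flop from a machine's memory system and arithmetic rate.
word_bytes      = 8      # 64-bit words
words_per_clock = 3      # 2 loads + 1 store per clock, as quoted above
flops_per_clock = 2      # assumed add + multiply per clock

bytes_per_flop = (words_per_clock * word_bytes) / flops_per_clock
print(f"{bytes_per_flop:.0f} B/flop")   # 12 B/flop, matching the slide

# By the same measure, a node sustaining 6.4 GB/s of memory bandwidth
# against a 6.4 Gflop/s peak would sit at 1 B/flop, the TCS-class figure.
```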
*
Short term responses
- Livermore: BlueGene/L
- Sandia: Red Storm
*
BlueGene/L (Livermore)
- System on a chip: IBM PowerPC with reduced clock (700 MHz) for lower power consumption
- 2 processors/node, each 2.8 GF peak
- 256 MB/node (small, but the design allows up to 2 GB/node)
- Memory on chip, to increase memory bandwidth to 2 bytes/flop
- Communications processor on chip speeds interprocessor communication (175 MB/s/link)
- Total 360 Tf, 65,536 nodes in a 3D torus (checked in the sketch below)
- Total power 1 MW
- Floor space 2500 sq ft
- Very fault tolerant (expect 1 failure/week)
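The peak figure follows directly from the node count and per-processor rate quoted above; a minimal check:

```python
# Peak performance of BlueGene/L from the per-node figures on this slide.
nodes           = 65_536
procs_per_node  = 2
gflops_per_proc = 2.8          # GF peak per processor

peak_tf = nodes * procs_per_node * gflops_per_proc / 1000.0
print(f"peak: {peak_tf:.0f} Tf")   # ~367 Tf, quoted as 360 Tf on the slide
```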
*
BlueGene/L Science
- Protein folding (molecular dynamics needs small memory and large floating point capability)
- Materials science (again molecular dynamics)
*
Red Storm (Sandia)
- Inspired by the T3E, a true MPP
- Opteron chip from AMD: 2 GHz clock, 4 Gflop, 1.3 B/flop memory bandwidth
- High-bandwidth proprietary interconnect (from Cray): 6.4 GB/sec, as good as local memory
- 10,000 CPUs, 3D torus
- 40 Tf peak, 10 TB memory
Problems needing memory exceeding 256 GByte: analysis of cosmic microwave background data (rough runtimes sketched below)
- MAXIMA data: 5.3 x 10^16 flops, 100 GByte memory
- BOOMERANG data: 10^19 flops, 3.2 TByte memory
- Future MAP data: 10^20 flops, 16 TByte memory
- Future PLANCK data: 10^23 flops, 1.6 PByte memory
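A rough sense of what those flop counts mean on a 40 Tf machine (a minimal sketch; it assumes the full peak rate is sustained, which real analyses will not achieve):

```python
# Time to execute the quoted flop counts at Red Storm's 40 Tf peak.
peak_flops = 40e12   # 40 Tf, assumed fully sustained (optimistic)

workloads = {
    "MAXIMA":    5.3e16,
    "BOOMERANG": 1e19,
    "MAP":       1e20,
    "PLANCK":    1e23,
}

for name, flops in workloads.items():
    seconds = flops / peak_flops
    print(f"{name:10s} {seconds:10.2e} s  (~{seconds / 86400:.1f} days)")
# PLANCK alone is ~2.5e9 s, i.e. decades even at peak -- one reason future
# datasets keep justifying more powerful systems.
```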
*
Take home lessons
- Fielding capability computing takes considerable thought and expertise
- Computing power continues to grow, and it will be accessible to you
- Think of challenging problems to stress existing systems and justify more powerful ones
Much too long for 45 minutes; only got through the first 31 slides effectively. Have to trim back considerably for a 45 minute talk.
A balanced national portfolio also requires access to the highest-end capability.
"Constellation" means a collection of SMPs (here > 16). MPPs differ from clusters by size (an MPP has over 1000 processors). Clusters are increasingly important, but definitely not the whole story. Distinguish between a tighter interconnect and a looser one (GigE, Myrinet), which is better for parameter studies, Monte Carlo, etc.
Users rewrite code. On the T3D, the Alpha cost 10x what the rest of the engineering cost.
We often hear that MPPs only sustain a few percent of peak. We're getting 20% on real, irregular applications, and can get up to 75% on special applications; Linpack is not the best measure. Recall that the Cray XMP, YMP, and C90 only averaged around 25%.
Chose the EV68. Chose Quadrics: low latency, high bandwidth, 2 rails for redundancy.
More data than it makes sense to move around: aggregate local disk 40 TB, aggregate bandwidth > 65 GB/sec.
Will say more about what we have done tomorrow at the machine workshop. Encouraged the vendor to test on our machine ("you wouldn't let us do that"); we were ending up doing that anyway.
Partly due to the Scaling Advantage program, started in May. Every factor of 3? 10? requires a rethinking of the problem. Processor speeds and even bandwidths have progressed much faster than latency reduction. However, the general view was that latency was much less of a problem than inadequate memory bandwidth, because latency-hiding tricks are quite powerful.
Asynchronous example: rather than waiting after a send for acknowledgment, put the message in a buffer, go off and do work, and later check that the message has been sent. Post non-blocking receives first (sketched below).
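A minimal sketch of this latency-hiding pattern using mpi4py (the library, buffer sizes, and two-rank setup are assumptions for illustration, not part of the original talk):

```python
# Non-blocking send/receive: post the receive first, start the send,
# do useful work, and only then wait for completion.
# Requires mpi4py; run with e.g. `mpiexec -n 2 python overlap.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                        # assumes exactly 2 ranks for this sketch

send_buf = np.full(1_000_000, rank, dtype='d')
recv_buf = np.empty(1_000_000, dtype='d')

recv_req = comm.Irecv(recv_buf, source=peer)   # post the receive first
send_req = comm.Isend(send_buf, dest=peer)     # start the send, don't wait

local = send_buf.sum()                 # overlap: useful work while data is in flight

MPI.Request.Waitall([send_req, recv_req])      # now confirm both completed
print(f"rank {rank}: local={local:.0f}, received mean={recv_buf.mean():.1f}")
```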
Schulten and Kale, University of Illinois. Did 25 nanoseconds (2 different versions of the protein). Took more than 70 hours, using 512 Compaq EV68 processors, to simulate 12 nanoseconds (throughput worked out below).
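For context, the simulation rate implied by those numbers (a minimal sketch using only the figures in the note above):

```python
# Molecular dynamics throughput implied by the aquaporin run.
hours        = 70.0    # wall-clock hours on 512 EV68 processors
simulated_ns = 12.0    # nanoseconds of dynamics simulated

ns_per_day = simulated_ns / (hours / 24.0)
print(f"~{ns_per_day:.1f} ns/day")   # ~4 ns/day for a >100,000-atom system
```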
Make the point that 10^4 points in each spatial direction = 10^12 points, plus 10^3 time steps, gives petascale work -> exascale computing. Therefore we pursue unstructured mesh methods (despite the difficulty of generating large irregular meshes and obtaining good parallel performance). Grids are generated on the Marvels, which need large shared memory.
Mention > 20% of peak, on par with the Earth Simulator for unstructured meshes. Good scaling (real). Picture of the PSC machine (6 Tf, 3000 Alphas). Used 2048 or 3000 processors. Show the QuickTime movie (iso_acoustic_129.mov) to show accuracy.
The inverse problem has begun. Scale: 30+ km x 30+ km x 15+ km. Generate target model and fault. Show velocity isosurfaces from the inverse model (pause it to show unresolved detail in the target).
Today this is severely stressing the capabilities of the most powerful systems in the country. Current forward runs will take weeks of dedicated TCS time (to get up to 2-3 Hz); inverse runs even longer. Need to deal with larger geographic regions and higher frequencies.
Due to algorithmic improvements, access to major platforms, and sustained effort over decades.
Independent camera images are sent to PSC; AI techniques predict the next frame and only send corrections. Images ??
Riley at Georgia Tech; the faculty member driving the project is Richard Fujimoto.
128^3 in many places, 512^3 on HPCx, 1024^3 on TCS. Explain how to reach different states of the system by rewinding and changing parameters.
The ES is 12.3 GB/sec per node, 64 Gflop, so 0.2 B/flop. But the topologies are different.
From David Bailey; DOE workshops.