[Figure: NAMD scaling on Lemieux: seconds/step for F1-ATPase (PME), 327,506 atoms]
*
Some scaling successes at PSC
- NAMD now scales to 3000 processors, > 1 Tf sustained
- Earthquake simulation code: 2048 processors, 87% parallel efficiency
- Real-time tele-immersion code scales to 1536 processors
- Increased scaling of the Car-Parrinello Ab-Initio Molecular Dynamics (CPAIMD) code from its previous limit of 128 processors (for 128 states) to 1536 processors
*
Payoffs: insight into important real-life problems
- Structure to function of biomolecules
- Increased realism to confront experimental data: earthquakes and the design of buildings; QCD
- Novel uses of HPC: tele-immersion; Internet simulation
*
How Aquaporins Work (Schulten group, University of Illinois)
Aquaporins are proteins that conduct large volumes of water through cell membranes while filtering out charged particles such as hydrogen ions. Start with the known crystal structure, then simulate 12 nanoseconds of molecular dynamics of over 100,000 atoms using NAMD.
*
Aquaporin mechanism
- Water moves through aquaporin channels in single file, with the oxygen leading the way in.
- At the most constricted point of the channel, the water molecule flips. Protons can't do this.
- The animation was cited in the 2003 Nobel Prize in Chemistry announcement.
*
High Resolution Forward and Inverse Earthquake Modeling on Terascale Computers
Volkan Akcelik, Jacobo Bielak, Ioannis Epanomeritakis, Antonio Fernandez, Omar Ghattas, Eui Joong Kim, Julio Lopez, David O'Hallaron, Tiankai Tu (Carnegie Mellon University)
George Biros (University of Pennsylvania)
John Urbanic (Pittsburgh Supercomputing Center)
*
Complexity of earthquake ground motion simulation
- Multiple spatial scales: wavelengths vary from O(10 m) to O(1000 m); basin/source dimensions are O(100 km)
- Multiple temporal scales: O(0.01 s) to resolve the highest frequencies of the source; O(10 s) to resolve shaking within the basin (see the sizing sketch after this list)
- So unstructured grids are needed, even though good parallel performance is harder to achieve
- Highly irregular basin geometry
- Highly heterogeneous soil material properties
- Geology and source parameters observable only indirectly
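The spatial and temporal scales above translate directly into problem size. A back-of-the-envelope sketch (grid spacing of order the minimum wavelength is an illustrative assumption; the other numbers are the scales quoted above):

```python
# Rough sizing for a uniform grid at the finest scale.
basin_extent_m = 100e3   # O(100 km) basin dimension
spacing_m      = 10.0    # grid spacing of order the O(10 m) minimum wavelength (assumed)
duration_s     = 10.0    # O(10 s) of shaking
dt_s           = 0.01    # O(0.01 s) time step for the highest source frequencies

pts_per_dim   = basin_extent_m / spacing_m    # ~1e4
grid_points   = pts_per_dim ** 3              # ~1e12 for a uniform grid
time_steps    = duration_s / dt_s             # ~1e3
point_updates = grid_points * time_steps      # ~1e15 point-updates per run

print(f"{grid_points:.0e} grid points x {time_steps:.0f} steps = {point_updates:.0e} point-updates")
# A uniform grid at the finest resolution is already petascale-sized work,
# which is why the project uses unstructured meshes that refine only where
# the wavelengths are short.
```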
*
Performance of forward earthquake modeling code on PSC Terascale system
Largest simulation: 28 Oct 2001 Compton aftershock in the Greater LA Basin
- Maximum resolved frequency: 1.85 Hz
- 100 m/s minimum shear wave velocity
- Physical size: 100 x 100 x 37.5 km^3
- Number of elements: 899,591,066
- Number of grid points: 1,023,371,641
- Number of slave nodes: 125,726,862
- 25 sec wallclock/time step on 1024 PEs
- 65 GB input
- lemieux at PSC
*
Role of PSC
Assistance in:
- Optimization
- Efficient I/O of terabyte-size datasets
- Expediting scheduling
- Visualization
*
Inverse problem: use records of past seismic events to improve the velocity model
- Significant Southern California earthquakes since 1812
- Seismometer locations and intensity map for the Northridge earthquake
*
Inverse problem
*
Major recognition
This entire effort won the Gordon Bell prize for special achievement in 2003, the premier prize for outstanding computations in HPC. It is given to the entry that uses innovative techniques to demonstrate the most dramatic gain in sustained performance for an important class of real-world applications.
*
QCD
Increased realism to confront experimental data
- QCD: compelling evidence for the need to include quark virtual degrees of freedom
- Improvements due to continued algorithmic development, access to major platforms, and sustained effort over decades
*
Tele-immersion (real time)
Can process 6 frames/sec (640 x 480) from 10 camera triplets using 1800 processors.
Henry Fuchs, University of North Carolina
*
Simulating network traffic (almost real time)
George Riley et al. (Georgia Tech)
- Simulating networks with > 5 million elements
- Modeled 106 million packet transmissions in one second of wall clock time, using 1500 processors
- Near real time web traffic simulation: empirical HTTP traffic model [Mah, Infocom 97]
- 1.1M nodes, 1.0M web browsers, 20.5M TCP connections
- 541 seconds of wall clock time on 512 processors to simulate 300 seconds of network operation (ratio checked below)
- Fastest detailed computer simulations of computer networks ever constructed
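A quick check of the "almost real time" claim, using only the figures quoted above (a minimal sketch):

```python
# Slowdown factor of the web-traffic simulation relative to real time.
wall_clock_s = 541.0   # wall clock seconds on 512 processors
simulated_s  = 300.0   # seconds of network operation simulated

slowdown = wall_clock_s / simulated_s
print(f"slowdown vs. real time: {slowdown:.1f}x")   # ~1.8x, hence "almost real time"
```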
*
Where are grids in all this?
Grids are aimed primarily at:
- Availability: computing on demand
- Reducing the effect of geographic distance
- Making services more transparent
- Motivated by remote data, on-line instruments, and sensors, as well as computers
They also contribute to the highest end by aggregating resources.
"The emerging vision is to use cyberinfrastructure to build more ubiquitous, comprehensive digital environments that become interactive and functionally complete for research communities in terms of people, data, information, tools, and instruments and that operate at unprecedented levels of computational, storage, and data transfer capacity." (NSF Blue Ribbon Panel on Cyberinfrastructure)
*
DTF (2001)
- IA-64 clusters at 4 sites
- 10 Gb/s point-to-point links
- Can deliver 30 Gb/s between 2 sites
[Diagram: physical topology (full mesh) linking SDSC, NCSA, Caltech, and ANL through LA and Chicago]
*
Extensible Terascale Facility (2002)
- Make the network scalable, so introduce hubs
- Allow heterogeneous architecture, and retain interoperability
- First step is integration of PSC's TCS machine
- Many more computer science interoperability issues
3 new sites approved in 2003 (Texas, Oak Ridge, Indiana)
*
Examples of Science Drivers
GriPhyN - particle physics: Large Hadron Collider at CERN
- Overwhelming amount of data for analysis (> 1 PB/year)
- Find rare events resulting from the decays of massive new particles in a dominating background
- Need new services to support world-wide data access and remote collaboration for coordinated management of distributed computation and data without centralized control
*
Examples of Science Drivers
NVO - National Virtual Observatory
- Breakthroughs in telescope, detector, and computer technology allow astronomical surveys to produce terabytes of images and catalogues, in different wavebands, from gamma- and X-rays through optical and infrared to radio.
- Soon it will be easier to "dial up" a part of the sky than to wait many months for access to a telescope.
- Need multi-terabyte on-line databases interoperating seamlessly, interlinked catalogues, and sophisticated query engines.
- Research results from on-line data will be just as rich as those from "real" telescopes.
*
UK-TeraGrid HPC-Grid Experiment
TeraGyroid: lattice-Boltzmann simulations of defect dynamics in amphiphilic liquid crystals
Peter Coveney (University College London), Richard Blake (Daresbury Lab), Stephen Pickles (Manchester), Bruce Boghosian (Tufts), ANL
*
Project Partners
RealityGrid partners:
- University College London (Application, Visualisation, Networking)
- University of Manchester (Application, Visualisation, Networking)
- Edinburgh Parallel Computing Centre (Application)
- Tufts University (Application)
UK High-End Computing Services:
- HPCx, run by the University of Edinburgh and CCLRC Daresbury Laboratory (Compute, Networking, Coordination)
- CSAR, run by the University of Manchester and CSC (Compute and Visualisation)
TeraGrid sites:
- Argonne National Laboratory (Visualization, Networking)
- National Center for Supercomputing Applications (Compute)
- Pittsburgh Supercomputing Center (Compute, Visualisation)
- San Diego Supercomputer Center (Compute)
*
Project explanation
- Amphiphiles are chemicals with hydrophobic (water-avoiding) tails and hydrophilic (water-attracting) heads. When dispersed in solvents or oil/water mixtures, they self-assemble into complex shapes; some (gyroids) are of particular interest in biology.
- Shapes depend on parameters such as the abundance and initial distribution of each component and the strength of the surfactant-surfactant coupling.
- Desired structures can sometimes only be seen in very large systems; e.g. how smaller regions form gyroids in different directions, and how they then interact, is of major significance.
- Project goal: study defect pathways and dynamics in gyroid self-assembly.
*
Networking
[Diagram: TeraGrid to UK network path via Amsterdam; BT provision, NetherLight]
*
Distribution of function
- Computations run at HPCx, CSAR, SDSC, PSC and NCSA (7 TB memory, ~5K processors in the integrated resource)
- One gigabit of LB3D data is generated per simulation time-step
- Visualisation run at Manchester / UCL / Argonne
- Scientists steer calculations from UCL and Boston over the Access Grid
- Steering requires reliable near-real-time data transport across the Grid to visualisation engines
- Visualisation output and collaborations multicast to SC03 in Phoenix and visualised on the show floor in the University of Manchester booth
*
Exploring parameter space through computational steering
Rewind and restart from checkpoint.
- Initial condition: random water/surfactant mixture
- Self-assembly starts
- Lamellar phase: surfactant bilayers between water layers
- Cubic micellar phase, low surfactant density gradient
- Cubic micellar phase, high surfactant density gradient
*
Results
Linking these resources allowed computation of the largest set of lattice-Boltzmann (LB) simulations ever performed, involving lattices of over one billion sites.
*
How do upcoming developments deal with the major technical issues?
Memory bandwidth (see the worked example after this list):
- Old Crays: 2 loads and a store per clock = 12 B/flop
- TCS: better than most commodity processors, 1 B/flop
- Earth Simulator: 4 B/flop
Power:
- TCS: ~700 kW
- Earth Simulator: ~4 MW
Space:
- TCS: ~2500 sq ft
- ASCI Q: new machine room of ~40,000 sq ft
- Earth Simulator: 3250 sq m
Reliability
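To make the bytes-per-flop figures concrete, here is a minimal sketch of how such a ratio is derived. The load/store pattern is the one quoted above for the old Crays; the flops-per-clock value is an assumption for illustration:

```python
# Bytes/flop from a machine's memory system and arithmetic rate.
word_bytes      = 8      # 64-bit words
words_per_clock = 3      # 2 loads + 1 store per clock, as quoted above
flops_per_clock = 2      # assumed add + multiply per clock

bytes_per_flop = (words_per_clock * word_bytes) / flops_per_clock
print(f"{bytes_per_flop:.0f} B/flop")   # 12 B/flop, matching the slide

# By the same measure, a node sustaining 6.4 GB/s of memory bandwidth
# against a 6.4 Gflop/s peak would sit at 1 B/flop, the TCS-class figure.
```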
*
Short term responses
- Livermore: BlueGene/L
- Sandia: Red Storm
*
BlueGene/L (Livermore)
- System on a chip: IBM PowerPC with reduced clock (700 MHz) for lower power consumption
- 2 processors/node, each 2.8 GF peak
- 256 MB/node (small, but the design allows up to 2 GB/node)
- Memory on chip, to increase memory bandwidth to 2 bytes/flop
- Communications processor on chip speeds interprocessor communication (175 MB/s/link)
- Total 360 Tf, 65,536 nodes in a 3D torus (checked in the sketch below)
- Total power 1 MW
- Floor space 2500 sq ft
- Very fault tolerant (expect 1 failure/week)
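The peak figure follows directly from the node count and per-processor rate quoted above; a minimal check:

```python
# Peak performance of BlueGene/L from the per-node figures on this slide.
nodes           = 65_536
procs_per_node  = 2
gflops_per_proc = 2.8          # GF peak per processor

peak_tf = nodes * procs_per_node * gflops_per_proc / 1000.0
print(f"peak: {peak_tf:.0f} Tf")   # ~367 Tf, quoted as 360 Tf on the slide
```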
*
BlueGene/L Science
- Protein folding (molecular dynamics needs small memory and large floating point capability)
- Materials science (again molecular dynamics)
*
Red Storm (Sandia)
- Inspired by the T3E, a true MPP
- Opteron chip from AMD: 2 GHz clock, 4 Gflop, 1.3 B/flop memory bandwidth
- High-bandwidth proprietary interconnect (from Cray): 6.4 GB/sec, as good as local memory
- 10,000 CPUs, 3D torus
- 40 Tf peak, 10 TB memory
Problems needing memory exceeding 256 GByte: analysis of cosmic microwave background data (rough runtimes sketched below)
- MAXIMA data: 5.3 x 10^16 flops, 100 GByte memory
- BOOMERANG data: 10^19 flops, 3.2 TByte memory
- Future MAP data: 10^20 flops, 16 TByte memory
- Future PLANCK data: 10^23 flops, 1.6 PByte memory
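A rough sense of what those flop counts mean on a 40 Tf machine (a minimal sketch; it assumes the full peak rate is sustained, which real analyses will not achieve):

```python
# Time to execute the quoted flop counts at Red Storm's 40 Tf peak.
peak_flops = 40e12   # 40 Tf, assumed fully sustained (optimistic)

workloads = {
    "MAXIMA":    5.3e16,
    "BOOMERANG": 1e19,
    "MAP":       1e20,
    "PLANCK":    1e23,
}

for name, flops in workloads.items():
    seconds = flops / peak_flops
    print(f"{name:10s} {seconds:10.2e} s  (~{seconds / 86400:.1f} days)")
# PLANCK alone is ~2.5e9 s, i.e. decades even at peak -- one reason future
# datasets keep justifying more powerful systems.
```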
*
Take home lessons
- Fielding capability computing takes considerable thought and expertise
- Computing power continues to grow, and it will be accessible to you
- Think of challenging problems to stress existing systems and justify more powerful ones
Much too long for 45 minutes; only got through the first 31 slides effectively. Have to trim back considerably for a 45 minute talk.
A balanced national portfolio also requires access to the highest-end capability.
"Constellation" means a collection of SMPs (here > 16). MPPs differ from clusters by size (an MPP has over 1000 processors). Clusters are increasingly important, but definitely not the whole story. Distinguish between a tighter interconnect and a looser one (GigE, Myrinet), which is better for parameter studies, Monte Carlo, etc.
Users rewrite code. On the T3D, the Alpha cost 10x what the rest of the engineering cost.
We often hear that MPPs only sustain a few percent of peak. We're getting 20% on real, irregular applications, and can get up to 75% on special applications; Linpack is not the best measure. Recall that the Cray XMP, YMP, and C90 only averaged around 25%.
Chose the EV68. Chose Quadrics: low latency, high bandwidth, 2 rails for redundancy.
More data than it makes sense to move around: aggregate local disk 40 TB, aggregate bandwidth > 65 GB/sec.
Will say more about what we have done tomorrow at the machine workshop. Encouraged the vendor to test on our machine ("you wouldn't let us do that"); we were ending up doing that anyway.
Partly due to the Scaling Advantage program, started in May. Every factor of 3? 10? requires a rethinking of the problem. Processor speeds and even bandwidths have progressed much faster than latency reduction. However, the general view was that latency was much less of a problem than inadequate memory bandwidth, because latency-hiding tricks are quite powerful.
Asynchronous example: rather than waiting after a send for acknowledgment, put the message in a buffer, go off and do work, and later check that the message has been sent. Post non-blocking receives first (sketched below).
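A minimal sketch of this latency-hiding pattern using mpi4py (the library, buffer sizes, and two-rank setup are assumptions for illustration, not part of the original talk):

```python
# Non-blocking send/receive: post the receive first, start the send,
# do useful work, and only then wait for completion.
# Requires mpi4py; run with e.g. `mpiexec -n 2 python overlap.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank                        # assumes exactly 2 ranks for this sketch

send_buf = np.full(1_000_000, rank, dtype='d')
recv_buf = np.empty(1_000_000, dtype='d')

recv_req = comm.Irecv(recv_buf, source=peer)   # post the receive first
send_req = comm.Isend(send_buf, dest=peer)     # start the send, don't wait

local = send_buf.sum()                 # overlap: useful work while data is in flight

MPI.Request.Waitall([send_req, recv_req])      # now confirm both completed
print(f"rank {rank}: local={local:.0f}, received mean={recv_buf.mean():.1f}")
```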
Schulten and Kale, University of Illinois. Did 25 nanoseconds (2 different versions of the protein). Took more than 70 hours, using 512 Compaq EV68 processors, to simulate 12 nanoseconds (throughput worked out below).
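For context, the simulation rate implied by those numbers (a minimal sketch using only the figures in the note above):

```python
# Molecular dynamics throughput implied by the aquaporin run.
hours        = 70.0    # wall-clock hours on 512 EV68 processors
simulated_ns = 12.0    # nanoseconds of dynamics simulated

ns_per_day = simulated_ns / (hours / 24.0)
print(f"~{ns_per_day:.1f} ns/day")   # ~4 ns/day for a >100,000-atom system
```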
Make the point that 10^4 points in each spatial direction = 10^12 points, plus 10^3 time steps, gives petascale work -> exascale computing. Therefore we pursue unstructured mesh methods (despite the difficulty of generating large irregular meshes and obtaining good parallel performance). Grids are generated on the Marvels, which need large shared memory.
Mention > 20% of peak, on par with the Earth Simulator for unstructured meshes. Good scaling (real). Picture of the PSC machine (6 Tf, 3000 Alphas). Used 2048 or 3000 processors. Show the QuickTime movie (iso_acoustic_129.mov) to show accuracy.
The inverse problem has begun. Scale: 30+ km x 30+ km x 15+ km. Generate target model and fault. Show velocity isosurfaces from the inverse model (pause it to show unresolved detail in the target).
Today this is severely stressing the capabilities of the most powerful systems in the country. Current forward runs will take weeks of dedicated TCS time (to get up to 2-3 Hz); inverse runs even longer. Need to deal with larger geographic regions and higher frequencies.
Due to algorithmic improvements, access to major platforms, and sustained effort over decades.
Independent camera images are sent to PSC; AI techniques predict the next frame and only send corrections. Images ??
Riley at Georgia Tech; the faculty member driving the project is Richard Fujimoto.
128^3 in many places, 512^3 on HPCx, 1024^3 on TCS. Explain how to reach different states of the system by rewinding and changing parameters.
The ES is 12.3 GB/sec per node, 64 Gflop, so 0.2 B/flop. But the topologies are different.
From David Bailey; DOE workshops.