Science on the TeraGrid Daniel S. Katz [email protected] Director of Science, TeraGrid GIG Senior Computational Researcher, Computation Institute, University.

Science on the TeraGrid

Daniel S. Katz

[email protected]

Director of Science, TeraGrid GIG

Senior Computational Researcher, Computation Institute, University of Chicago & Argonne National

Laboratory

Affiliate Faculty, Center for Computation & Technology, LSU

Adjunct Associate Professor, Electrical and Computer Engineering Department,

LSU

Outline

•Introduction to TeraGrid

•Current Science

•Future Science

What is the TeraGrid• World’s largest distributed cyberinfrastructure for open scientific

research, supported by US NSF

• Integrated high performance computers (>2 PF HPC & >27000 HTC CPUs), data resources (>3 PB disk, >60 PB tape, data collections), visualization, experimental facilities (VMs, GPUs, FPGAs), network at 11 Resource Provider sites

• Allocated to US researchers and their collaborators through national peer-review process

• DEEP: provide powerful computational resources to enable research that can’t otherwise be accomplished

• WIDE: grow the community of computational science and make the resources easily accessible

• OPEN: connect with new resources and institutions

• Integration: Single: portal, sign-on, help desk, allocations process, advanced user support, EOT, campus champions

Governance

•11 Resource Providers (RPs) funded under separate agreements with NSF– Different start and end dates– Different goals– Different agreements– Different funding models

•1 Coordinating Body – Grid Integration Group (GIG)– University of Chicago/Argonne National Laboratory– Subcontracts to all RPs and six other universities– 7-8 Area Directors– Working groups with members from many RPs

•TeraGrid Forum with Chair

TG New Large Resources

•Ranger@TACC– First NSF ‘Track2’ HPC system– 504 TF– 15,744 Quad-Core AMD

Opteron processors– 123 TB memory, 1.7 PB disk

•Kraken@NICS (UT/ORNL)– Second NSF ‘Track2’ HPC system– 1 PF Cray XT5 system– 16,512 compute sockets, 99,072 cores– 129 TB memory, 3.3 PB disk

Blue Waters@NCSANSF Track 110 PF peakComing in 2011

Who Uses TeraGrid (2008)

How TeraGrid Is Used

Use ModalityUse ModalityCommunity SizeCommunity Size

(rough est. - number of (rough est. - number of users)users)

Batch Computing on Individual Resources 850Exploratory and Application Porting 650Workflow, Ensemble, and Parameter Sweep 250Science Gateway Access 500Remote Interactive Steering and Visualization 35Tightly-Coupled Distributed Computation 102006 data

How One Uses TeraGrid

ComputeService

VizService

DataService

Network, Accounting, …

RP 1

RP 3

RP 2

TeraGrid Infrastructure (Accounting, Network, Authorization,…)

POPS (for now)

Science Gateways

UserPortal

Command Line

Slide courtesy of Dane Skow and Craig Stewart

User Portal: portal.teragrid.org

Access to resources

•Terminal: ssh, gsissh•Portal: TeraGrid user portal, Gateways– Once logged in to

portal, click on “Login”

•Also, SSO from command-line

Science Gateways

•A natural extension of Internet & Web 2.0• Idea resonates with Scientists

– Researchers can imagine scientific capabilities provided through familiar interface• Mostly web portal or web or client-server program

•Designed by communities; provide interfaces understood by those communities– Also provide access to greater capabilities (back end)– Without user understand details of capabilities– Scientists know they can undertake more complex analyses

and that’s all they want to focus on– TeraGrid provides tools to help developer

•Seamless access doesn’t come for free– Hinges on very capable developer

Slide courtesy of Nancy Wilkins-Diehr

TeraGrid -> XD Future

•Current RP agreements end in March 2011– Except track 2 centers (current and future)

•TeraGrid XD (eXtreme Digital) starts in April 2011– Era of potential interoperation with OSG and others– New types of science applications?

•Current TG GIG continues through July 2011– Allows four months of overlap in coordination– Probable overlap between GIG and XD members

•Blue Waters (track 1) production in 2011

12

Outline


•Current Science

•Future Science

TG App: Predicting storms• Hurricanes and tornadoes cause massive

loss of life and damage to property• TeraGrid supported spring 2007 NOAA and

University of Oklahoma Hazardous Weather Testbed– Major Goal: assess how well ensemble

forecasting predicts thunderstorms, including the supercells that spawn tornadoes

– Nightly reservation at PSC, spawning jobs at NCSA as needed for details

– Input, output, and intermediate data transfers– Delivers “better than real time” prediction– Used 675,000 CPU hours for the season– Used 312 TB on HPSS storage at PSC

Slide courtesy of Dennis Gannon, ex-IU, and LEAD Collaboration

TG App: SCEC-PSHA

• Part of SCEC (Tom Jordan, USC)

• Using the large scale simulation data, estimate probablistic seismic hazard (PSHA) curves for sites in southern California (probability that ground motion will exceed some threshold over a given time period)

• Used by hospitals, power plants, schools, etc. as part of their risk assessment

• For each location, need a Cybershake run followed by roughly 840,000 parallel short jobs (420,000 rupture forecasts, 420,000 extraction of peak ground motion)– Parallelize across locations, not individual

workflows

• Completed 40 locations to date, targeting 200 in 2009, and 2000 in 2010

Managing these requires effective grid workflow tools for job submission, data management and error recovery, using Pegasus (ISI) and DAGman (Wisconsin)

15Information/image courtesy of Phil Maechling

TG App: GridChem

Slide courtesy of Joohyun Kim

TG Apps: Genius and Materials

HemeLB on LONILAMMPS on TeraGrid

Fully-atomistic simulations of clay-polymer nanocomposites

Slide courtesy of Steven Manos and Peter Coveney

Why cross-site / distributed runs?

1.Rapid turnaround, conglomeration of idle processors to run a single large job

2.Run big compute & big memory jobs not possible on a single machine

Modeling blood flow before (during?) surgery

Outline


•Current Science

•Future Science

ENZO• ENZO simulated cosmological

structure formation• Big current production simulation:

– 4096x4096x4096 non-adaptive mesh, 16 fields per mesh point

– 64 billion dark matter particles – About 4000 MPI processes,

1-8 OpenMP threads per process– Reads 5 TB input data– Writes 8 TB data files

• Restart reads latest 8 TB file

– All I/O uses HDF5, each MPI process reading/writing their own data– Over few months for simulation, >100 data files written, >20 read for

restarts– 24 hour batch runs

• 5-10 data files output per run• Needs ~100 TB free disk space at start of run

– (adaptive case is different, but I/O is roughly similar)

Slide courtesy of Robert Harkness

ENZO Calculation Stages1. Generate initial conditions for density, matter velocity field, dark matter particle

position and velocity (using parallel HDF5)– Using NICS Kraken w/ 4K MPI processes, though TACC Ranger is reasonable alternative– ~5 TB initial data created in 10 0.5-TB files

2. Decompose initial conditions for # of MPI tasks in simulation (using sequential HDF5)– Decomposition of mesh into "tiles" needed for MPI tasks requires strided reads in large data cube

• Very costly on NICS Kraken, but can be done more efficiently on TACC Ranger• If done on Ranger, then 2 TB (4 512-GB files) must be transmitted from NICS to TACC, and after running MPI

decomposition task (with 4K MPI tasks), 8K files (2 TB) must be returned to NICS

– Dark matter particle sort onto "tiles" is most efficient on NICS Kraken because it has a superior interconnect• Sort usually run in 8 slices using 4K MPI tasks

3. Evolve time (using sequential HDF5) – Dump data files during run– Archive data files (8 TB every couple of hours -> 1100 MB/sec but NICS HPSS only reaches 300

MB/sec)

4. Derive data products – Capture 5-6 fields from each data file (~256 GB each)– Send to ANL or SDSC for data analysis or viz– Archive output of data analysis of viz (back at NICS)

Overall run produces >1 PB data, >100 TB to be archived, >100 TB free disk space

Slide courtesy of Robert Harkness

Science Portals 2.0• Workspace customized by user for specific projects• Pluggable into iGoogle or other compliant (and open source) containers• Integrates with user workspace to provide a complete and dynamic view of the user’s

science, alongside other aspects of their lives (e.g. weather, news)• Integrates with social networks (e.g. FriendConnect, MySpace) to support collaborative

science• TG User Portal provides this same information, but this view is more dynamic and more

reusable, and can be more flexibly integrated into the user’s workspace• Gadgets suitable for use on mobile devices

Technology Detail• Gadgets are HTML/JavaScript embedded in XML• Gadgets conform to specs that are supported by many containers (iGoogle, Shindig, Orkut, MySpace)

Technology Detail• Gadgets are HTML/JavaScript embedded in XML• Gadgets conform to specs that are supported by many containers (iGoogle, Shindig, Orkut, MySpace)

Resource Load gadget shows current view of load on available resourcesResource Load gadget shows current view of load on available resources

Job Status gadget shows current view of job queues by site, user, statusJob Status gadget shows current view of job queues by site, user, status

File Transfer gadget for lightweight access to data stores, simple global file searching, and reliable transfers

File Transfer gadget for lightweight access to data stores, simple global file searching, and reliable transfers

Domain science gadgets complement general purpose gadgets to encompass the full range of scientists’ interests

Domain science gadgets complement general purpose gadgets to encompass the full range of scientists’ interests

Slide courtesy of Wenjun Wu, Thomas Uram, Michael Papka

OLSGW Gadgets

•OLSGW Integrates bio-informatics applications

•BLAST, InterProScan, CLUSTALW , MUSCLE, PSIPRED, ACCPRO, VSL2

•454 Pyrosequencing service under development

•Four OLSGW gadgets have been published in the iGoogle gadget directory. Search for “TeraGrid Life Science”.

•OLSGW Integrates bio-informatics applications

•BLAST, InterProScan, CLUSTALW , MUSCLE, PSIPRED, ACCPRO, VSL2

•454 Pyrosequencing service under development

•Four OLSGW gadgets have been published in the iGoogle gadget directory. Search for “TeraGrid Life Science”.

Slide courtesy of Wenjun Wu, Thomas Uram, Michael Papka

PolarGrid

• Goal: Work with Center for Remote Sensing of Ice Sheets

• Requirements:– View CReSIS data sets, run

filters, and view results through Web map interfaces;

– See/Share user’s events in a Calendar;

– Update results to a common repository with appropriate access controls;

– Post the status of computational experiments.

– Support collaboration and information exchange by interfacing to blogs and discussion areas

Slide courtesy of Raminder Singh, Gerald Guo, Marlon Pierce

Login ScreenInterface to create new users and login using existing accounts. Integrated with OpenID API for authentication.

Solution: Web 2.0-enabled PolarGrid Portal

PolarGrid

Home Page with a set of gadgets like Google Calendar, Picasa, Facebook, Blog, Twitter

Slide courtesy of Raminder Singh, Gerald Guo, Marlon Pierce

Google Gadgets & Satellite Data• Purdue: disseminating remote sensing products

– Flash- and HTML-based gadgets, programmed by undergraduate students

• Bring traffic/usage to a broad set of satellite products at Purdue Terrestrial Observatory

• Backend system including commercial satellite data receivers, a smaller cluster, TG data collection, etc.

• User control includes zoom, pause/resume, frame step through, etc.

MODIS satellite viewer

GOES-12 satellite viewerSlide courtesy of Carol Song

Future App: Real-time High Resolution Radar Data

•Delivering 3D visualization of radar data via a Google gadget

•LiveRadar3D – Super high res, real-time

NEXRAD data– Continuously updated as new

data comes– 3D rendering that includes

multiple stations in the US– Significant processing (high

throughout) and rendering supported by TG systems

– To be released next spring

Slide courtesy of Carol Song

Understanding Distributed Applications

Development Objectives

• Interoperability: Ability to work across multiple distributed resources

•Distributed Scale-Out: The ability to utilize multiple distributed resources concurrently

•Extensibility: Support new patterns/abstractions, different programming systems, functionality & Infrastructure

•Adaptivity: Response to fluctuations in dynamic resource and availability of dynamic data

•Simplicity: Accommodate above distributed concerns at different levels easily…

Challenge: How to develop DA effectively and efficiently with the above as first-class objectives?

Developed by Jha, Katz, Parashar, Rana, Weissman

David H

ockney Pearblossom

Highw

ay 1986Montage Workflow

1 2 3

mProject 1 mProject 2 mProject 3

1 2 3

mDiff 1 2 mDiff 2 3

mFitplane D12 mFitplane D23

ax + by + c = 0 dx + ey + f = 0

a1x + b1y + c1 = 0

a2x + b2y + c2 = 0

a3x + b3y + c3 = 0

mBackground 1 mBackground 2 mBackground 3

1 2 3

D12D23

mConcatFit

mBgModel

ax + by + c = 0

dx + ey + f = 0

mAdd 1 mAdd 2

Final Mosaic(Overlapping Tiles)

Montage on the Grid UsingPegasus

Example DAG for 10 input files

mAdd

mBackground

mBgModel

mProject

mDiff

mFitPlane

mConcatFit

Data Stage-in nodes

Montage compute nodes

Data stage-out nodes

Registration nodes

Pegasus

Grid Information Systems

Information about available resources, data location

Grid

Condor DAGMan

Maps an abstract workflow to an executable form

Executes the workflow

MyProxy

User’s grid credentials

http://pegasus.isi.edu/

Application Development Phase

Generation & Exec. Planning Phase

Execution Phase

DAG-based Workflow Applications:Extensibility Approach

Done with Andre Merzky, Katerina Stamou, Shantenu Jha

SAGA-based DAG ExecutionPreserving Performance

Done with Andre Merzky, Katerina Stamou, Shantenu Jha

32

Credit: Yaakoub El-Khamra, Shantenu Jha

• Suppose you want to perform a history match on a 1 million grid cell problem, with a thousand ensemble members

• The entire system will have a few billion degrees of freedom

• This will increase the need for autonomy, fault tolerance, self healing etc...

App: EnKF for Oil Reservoir Sim.

• Ensemble Kalman Filters – recursive filters that can be used to handle large, noisy data; the data are sent through the Kalman filter to obtain the true state of the data

• Used here to model oil field, w/ data from sensors, to build a model that represents reality

• Using EnKF, scientific problem solved using “multiple models” (ensemble members)

• Physical models represented as ensembles that vary in size from large MPI-style jobs to long-running single processor tasks

• Varying parameters sometimes also lead to varying systems of equations and entirely new scenarios, increasing both computational and memory requirements

• Each model must converge before the next stage can begin, hence dynamically load-balancing to ensure that all models complete as close to each other as possible is a desired aim

• In the general case the number of jobs required varies between stages.

EnKF Results: Scaling-Out

•Using more machines decreases the TTC and variation between experiments

•Using BQP (batch queue prediction) decreases the TTC & variation between experiments further

•Lowest time to completion achieved when using BQP and all available resources

Credit: Yaakoub El-Khamra, Shantenu Jha

Khamra & Jha, GMAC, ICAC’09

R=Ranger, Q=Queen Bee, A=Abe

TeraGrid: Both Operations and Research

•Operations – Facilities/services on which researchers rely– Infrastructure on which other providers build

AND

•R&D – Learning how to do distributed, collaborative science on a

global, federated infrastructure– Learning how to run multi-institution shared infrastructure

http://teragrid.org/

34

Science on the TeraGrid Daniel S. Katz [email protected] Director of Science, TeraGrid GIG Senior Computational Researcher, Computation Institute, University.

Documents