From the Heroic to the Logistical
Programming Model Implications of New Supercomputing Applications

Ian Foster
Computation Institute
Argonne National Lab & University of Chicago
May 10, 2015
What will we do with 1+ Exaflops and 1M+ cores?
Or, If You Prefer, A Worldwide Grid (or Cloud)
EGEE
1) Tackle Bigger and Bigger Problems
Computational Scientist as Hero
2) Tackle More Complex Problems
Computational Scientist as Logistics Officer
“More Complex Problems”
• Ensemble runs to quantify climate model uncertainty
• Identify potential drug targets by screening a database of ligand structures against target proteins
• Study economic model sensitivity to parameters
• Analyze turbulence dataset from many perspectives
• Perform numerical optimization to determine optimal resource assignment in energy problems
• Mine collection of data from advanced light sources
• Construct databases of computed properties of chemical compounds
• Analyze data from the Large Hadron Collider
• Analyze log data from 100,000-node parallel computations
Programming Model Issues
• Massive task parallelism
• Massive data parallelism
• Integrating black-box applications
• Complex task dependencies (task graphs)
• Failure, and other execution management issues
• Data management: input, intermediate, output
• Dynamic computations (task graphs)
• Dynamic data access to large, diverse datasets
• Long-running computations
• Documenting provenance of data products
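Many of these issues show up even in the simplest many-task campaigns. As a rough sketch, not drawn from the talk and using a hypothetical ./simulate executable and file names, the basic pattern of dispatching large numbers of independent black-box invocations with simple failure handling looks like this in plain Python:

```python
# Illustrative sketch only: many loosely coupled tasks, each an invocation of
# a black-box executable, with naive retry on failure. Real systems (Swift,
# Falkon, DAGMan) add data staging, provenance, and smarter fault handling.
import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_task(task_id, max_retries=3):
    """Run one black-box application instance; retry on non-zero exit."""
    for _ in range(max_retries):
        result = subprocess.run(
            ["./simulate", "--input", f"in_{task_id}.dat",
             "--output", f"out_{task_id}.dat"])
        if result.returncode == 0:
            return task_id
    raise RuntimeError(f"task {task_id} failed after {max_retries} attempts")

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=64) as pool:
        futures = [pool.submit(run_task, i) for i in range(100_000)]
        for f in as_completed(futures):
            f.result()  # re-raises if a task exhausted its retries
```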
Problem Types
[Figure: problem classes plotted by number of tasks (1 to 1M) against input data size (Lo to Hi)]
• Few tasks, little data: heroic MPI tasks
• Few tasks, much data: data analysis, mining
• Many tasks, little data: many loosely coupled tasks
• Many tasks, much data: much data and complex tasks
An Incomplete and Simplistic View of Programming Models and Tools
• Many tasks: DAGMan+Pegasus; Karajan+Swift
• Much data: MapReduce/Hadoop; Dryad
• Complex tasks, much data: Dryad, Pig, Sawzall; Swift+Falkon
• Single task, modest data: MPI, etc., etc., etc.
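To make the “much data” row concrete: the MapReduce model expresses data-parallel analysis as a map over records followed by a keyed reduction. A minimal, illustrative word count in that style (plain Python, not the Hadoop API):

```python
# Minimal MapReduce-style word count, for illustration only. Hadoop and Dryad
# supply the same programming model plus distributed storage, shuffling of
# intermediate pairs, and fault tolerance.
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Emit (word, 1) pairs for one input record."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(chain.from_iterable(map_phase(d) for d in documents)))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```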
Many Tasks: Climate Ensemble Simulations (using FOAM, 2005)
• NCAR computer + grad student: 160 ensemble members in 75 days
• TeraGrid + “Virtual Data System”: 250 ensemble members in 4 days
Image courtesy Pat Behling and Yun Liu, UW Madison
Many, Many Tasks: Identifying Potential Drug Targets
2M+ ligands × protein target(s)
(Mike Kubal, Benoit Roux, and others)
[Figure: virtual screening workflow, from start to report]
• Inputs: PDB protein descriptions (1 protein, ~1MB); ZINC 3-D structures (2M ligands, ~6 GB); manually prepared DOCK6 and FRED receptor files (1 per protein, defining the pocket to bind to); NAB script template and parameters (defining flexible residues and # of MD steps)
• DOCK6 / FRED screening: ~4M tasks × 60 s × 1 CPU ≈ 60K CPU-hours; select best ~5K complexes
• Amber scoring (1. AmberizeLigand, 2. AmberizeReceptor, 3. AmberizeComplex, 4. perl: gen NAB script, 5. run NAB script): ~10K tasks × 20 min × 1 CPU ≈ 3K CPU-hours; select best ~500
• GCMC: ~500 tasks × 10 hr × 100 CPUs ≈ 500K CPU-hours
For 1 target: 4 million tasks, 500,000 CPU-hours (50 CPU-years)
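The overall shape of the campaign, cheap docking of millions of ligands, selection of the best candidates, then much more expensive rescoring of the survivors, is easy to express as a two-stage many-task pipeline. A hedged sketch in plain Python, not the actual Swift/Falkon scripts; dock() and amber_score() are placeholders for the real tools:

```python
# Sketch of the screen -> select -> rescore pattern used in virtual screening.
# dock() and amber_score() are placeholders; real code would invoke DOCK6/FRED
# and the Amber/NAB toolchain through the task execution system.
import random
from concurrent.futures import ProcessPoolExecutor

def dock(ligand, receptor):
    """Stage 1: fast docking of one ligand against the receptor pocket."""
    return ligand, random.random()          # placeholder score

def amber_score(ligand, receptor):
    """Stage 2: expensive MD-based rescoring of one surviving candidate."""
    return ligand, random.random()          # placeholder score

def screen(ligands, receptor, keep_dock=5000, keep_amber=500):
    with ProcessPoolExecutor() as pool:
        scored = list(pool.map(dock, ligands, [receptor] * len(ligands)))
        best = sorted(scored, key=lambda s: s[1])[:keep_dock]       # best ~5K
        survivors = [lig for lig, _ in best]
        rescored = list(pool.map(amber_score, survivors,
                                 [receptor] * len(survivors)))
    return sorted(rescored, key=lambda s: s[1])[:keep_amber]        # best ~500
```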
DOCK on SiCortex
• CPU cores: 5760
• Tasks: 92160
• Elapsed time: 12821 sec (does not include ~800 sec to stage input data)
• Compute time: 1.94 CPU-years
• Average task time: 660.3 sec
Ioan Raicu, Zhao Zhang
DOCK on BG/P: ~1M Tasks on 118,000 CPUs
• CPU cores: 118784
• Tasks: 934803
• Elapsed time: 7257 sec
• Compute time: 21.43 CPU-years
• Average task time: 667 sec
• Relative efficiency: 99.7% (from 16 to 32 racks)
• Utilization: 99.6% sustained, 78.3% overall
Per-task data access:
• GPFS: 1 script (~5KB), 2 file reads (~10KB), 1 file write (~10KB)
• RAM (cached from GPFS on first task per node): 1 binary (~7MB), static input data (~45MB)
[Figure: task activity vs. time (secs)]
Ioan Raicu, Zhao Zhang, Mike Wilde
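The per-node caching noted above, where the binary and static input data are pulled from GPFS by the first task on a node and then reused from RAM by every later task on that node, is what keeps 118K CPUs busy despite the slow shared filesystem. A hedged sketch of that idea, with made-up paths and no locking:

```python
# Sketch of per-node caching of static inputs (paths are hypothetical).
# The first task on a node stages the shared-filesystem data to node-local
# storage (here a RAM disk); subsequent tasks read the local copy.
import os
import shutil

SHARED = "/gpfs/app/static"       # slow, shared filesystem (e.g. GPFS)
LOCAL = "/dev/shm/app/static"     # fast, node-local storage (RAM disk)

def local_copy(filename):
    """Return a node-local path for filename, staging it in if needed."""
    src = os.path.join(SHARED, filename)
    dst = os.path.join(LOCAL, filename)
    if not os.path.exists(dst):
        os.makedirs(LOCAL, exist_ok=True)
        shutil.copy(src, dst)     # paid once per node, not once per task
    return dst

# Each task then opens local_copy("static_input.dat") instead of the GPFS path.
# A production version would also guard against two tasks staging concurrently.
```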
Managing 120K CPUs
[Figure: Falkon dispatch architecture, using high-speed local disk on the compute nodes in front of slower shared storage]
MARS Economic Model Parameter Study
• 2,048 BG/P CPU cores
• Tasks: 49,152
• Micro-tasks: 7,077,888
• Elapsed time: 1,601 secs
• CPU hours: 894
[Figure: idle CPUs, busy CPUs, wait queue length, and completed micro-tasks vs. time (sec)]
Zhao Zhang, Mike Wilde
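The ratio between the 49,152 dispatched tasks and the 7,077,888 micro-tasks (about 144 to 1) reflects bundling: many short model evaluations are packed into each dispatched task so that per-task dispatch overhead is amortized. A hedged sketch of that idea in plain Python (not Falkon itself):

```python
# Illustrative micro-task bundling: pack many cheap evaluations into each
# dispatched task so dispatch overhead is paid per bundle, not per evaluation.
from concurrent.futures import ProcessPoolExecutor

def evaluate(params):
    """One micro-task: a single cheap model evaluation (placeholder math)."""
    return sum(p * p for p in params)

def run_bundle(bundle):
    """One dispatched task: a whole batch of micro-tasks."""
    return [evaluate(p) for p in bundle]

def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == "__main__":
    parameter_sets = [(i, i + 1, i + 2) for i in range(10_000)]
    with ProcessPoolExecutor() as pool:
        results = [r for batch in pool.map(run_bundle,
                                           chunks(parameter_sets, 144))
                   for r in batch]
```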
AstroPortal Stacking Service
• Purpose: on-demand “stacks” of random locations within a ~10TB dataset (Sloan/SDSS data), delivered via a web page or web service
• Challenge: rapid access to 10-10K “random” files; time-varying load
Sample workloads:

Locality   Number of Objects   Number of Files
1          111700              111700
1.38       154345              111699
2          97999               49000
3          88857               29620
4          76575               19145
5          60590               12120
10         46480               4650
20         40460               2025
30         23695               790
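For context, a “stack” here co-adds many small image cutouts into one image, so each request touches many files chosen essentially at random across the dataset. A toy illustration, assuming a hypothetical file layout and synthetic pixel data rather than real SDSS images:

```python
# Toy illustration of a "stack": co-add small cutouts read from many files.
# read_cutout() returns synthetic data; the real service reads SDSS images.
import numpy as np

def read_cutout(path, x, y, size=32):
    """Read one size x size cutout centred at (x, y) from an image file."""
    rng = np.random.default_rng(abs(hash((path, x, y))) % (2**32))
    return rng.random((size, size))

def stack(locations, size=32):
    """Sum cutouts from many (file, x, y) locations into a single image."""
    total = np.zeros((size, size))
    for path, x, y in locations:
        total += read_cutout(path, x, y, size)
    return total

result = stack([(f"img_{i:05d}.fits", 100, 200) for i in range(500)])
```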
AstroPortal Stacking Service with Data Diffusion
• Aggregate throughput: 39 Gb/s, 10X higher than GPFS
• Reduced load on GPFS: 0.49 Gb/s, 1/10 of the original load
[Figure: aggregate throughput (Gb/s) vs. locality, for data diffusion (local, cache-to-cache, GPFS) and for GPFS alone (FIT and GZ workloads)]
Big performance gains as locality increases
[Figure: time (ms) per stack per CPU vs. locality, for data diffusion and GPFS (GZ and FIT workloads)]
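Data diffusion, roughly, lets compute nodes cache the files they have already fetched and routes later accesses to those caches (local or on a peer node) before falling back to shared storage; locality in the workload is what makes the caches pay off. A much-simplified sketch, ignoring the real system's index service, eviction policy, and data-aware scheduling:

```python
# Much-simplified sketch of data diffusion: try the local cache, then a
# peer's cache (cache-to-cache transfer), and only then shared storage.
local_cache = {}          # file name -> bytes cached on this node
peer_caches = [{}, {}]    # stand-ins for other nodes' caches

def read_shared(name):
    """Fetch from slow shared storage (e.g. GPFS); placeholder contents."""
    return b"contents of " + name.encode()

def fetch(name):
    if name in local_cache:                    # 1. local cache hit
        return local_cache[name]
    for peer in peer_caches:                   # 2. cache-to-cache transfer
        if name in peer:
            local_cache[name] = peer[name]
            return peer[name]
    data = read_shared(name)                   # 3. miss: go to shared storage
    local_cache[name] = data                   #    ...and diffuse it locally
    return data
```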
Ioan Raicu, 11:15am TOMORROW
Montage Benchmark
(Yong Zhao, Ioan Raicu, U.Chicago; B. Berriman, J. Good, Caltech; J. Jacob, D. Katz, JPL)
• MPI: ~950 lines of C for one stage
• Pegasus: ~1200 lines of C + tools to generate DAG for specific dataset
• SwiftScript: ~92 lines for any dataset
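Part of why the scripted version stays short is that it is written against logical datasets rather than enumerated files: the same parallel loop applies to whatever inputs are discovered at run time. A rough Python analogue of that style (not SwiftScript; the stage function is a placeholder for the corresponding Montage tool):

```python
# Rough analogue of dataset-independent scripting: discover inputs at run
# time and apply the same parallel stage to any dataset, large or small.
import glob
from concurrent.futures import ProcessPoolExecutor

def project(image):
    """Reproject one input image (placeholder; real code would invoke the
    corresponding Montage tool here)."""
    return image.replace(".fits", "_proj.fits")

if __name__ == "__main__":
    images = glob.glob("input/*.fits")        # works for any dataset size
    with ProcessPoolExecutor() as pool:
        projected = list(pool.map(project, images))
    # ...further stages (background matching, co-addition) would follow
```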
Summary
• Peta- and exa-scale computers enable us to tackle new problems at greater scales
  - Parameter studies, ensembles, interactive data analysis, “workflows” of various kinds
• Such apps frequently stress petascale hardware and software in interesting ways
• New programming models and tools required
  - Mixed task/data parallelism, task management, complex data management, failure, …
  - Tools (DAGMan, Swift, Hadoop, …) exist but need refinement
• Interesting connections to distributed systems
More info: www.ci.uchicago.edu/swift