1. Big process for big data
   Process automation for data-driven science
   Ian Foster
   Computation Institute; Mathematics and Computer Science Division; Department of Computer Science
   Argonne National Laboratory & The University of Chicago
   Talk at DOE Big Data Technology Summit, Washington DC, October 9, 2012
   www.ci.anl.gov www.ci.uchicago.edu
2. Big data is not new at DOE
   Large Hadron Collider: "Higgs discovery only possible because of the extraordinary achievements of grid computing" (Rolf Heuer, CERN DG)
   Figure labels: 15 PB/year; 173 TB/day; 500 MB/sec; LHC Computing Grid (10+ GB/sec)
3. But it is now ubiquitous: e.g., genomics
   Source: Kahn, Science, 331 (6018): 728-729
4. But it is now ubiquitous: e.g., genomics
   In 6 years: computing x10 (x30 at DOE)
   Source: Kahn, Science, 331 (6018): 728-729
5. But it is now ubiquitous: e.g., genomics
   In 6 years: computing x10 (x30 at DOE); genome sequencing x10^5
   Source: Kahn, Science, 331 (6018): 728-729
6. Now ubiquitous: e.g., light sources
   18 orders of magnitude in 5 decades!
   12 orders of magnitude in 6 decades
   Credit: Linda Young
7. Now ubiquitous: e.g., light sources
   Source: Francesco de Carlo
8. Local flows already exceed those of LHC
   [Diagram: Argonne data flows in TB/day (estimates). Nodes: external data sources, Advanced Photon Source, Argonne Leadership Computing Facility, short-term storage, long-term storage, data analysis, and other sources that remain to be quantified. Flow labels: 163, 99, 143, 10, 100, 50, 150, 100.]
9. Big data demands new analysis models
   [Figure: Today vs. Desired]
   Source: Francesco de Carlo
10. Its velocity and variety as well as volume
   [Diagram labels: Proteomics, Phenotypes, Transcriptomics, Genomes, Growth curves, Metabolomics, Assembly, Annotation, Metabolic model, Reconciled model, Integrated model, Regulatory model, Phenotype predictions, Flux predictions, Regulon prediction, Pathway designs, Hypotheses]
   Credit: Chris Henry et al.
11. Exponentially increasing complexity
   Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data
12. [Slide without text]
13. Tripit exemplifies process automation
   Me: Book flights; Book hotel
   Other services: Record flights; Suggest hotel; Record hotel; Get weather; Prepare maps; Share info; Monitor prices; Monitor flight
14. Big data requires big process
   Run experiment; Collect data; Move data; Check data; Annotate data; Share data; Find similar data; Link to literature; Analyze data; Publish data
   Research IT as a service: Outsourced; Intuitive; Integrative; Secure; Performant; Reliable
15. Characterizing big process requirements
   In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting
   [Diagram labels: Telescope, Simulation, Next-gen genome sequencer, Staging, Ingest, Registry, Community Repository, Analysis, Archive, Mirror]
   Accelerate discovery and innovation by outsourcing difficult tasks
16. Characterizing big process requirements
   In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting
   Data movement is a frequent challenge:
   Between facilities, archives, researchers
   Many files, large data volumes
   With security, reliability, performance
   [Diagram labels: Telescope, Simulation, Next-gen genome sequencer, Staging, Ingest, Registry, Community Repository, Analysis, Archive, Mirror]
   Accelerate discovery and innovation by outsourcing difficult tasks
17. Globus Online: Big process for big data
   Data movement as a service
   Secure, automated, reliable, high-speed movement, synchronization of many files
18. 6,000 users; 500 M files, 7 PB moved; 99.9% availability
19. Examples of Globus Online in action
   K. Heitmann (ANL) moves 22 TB cosmology data at 5 Gb/s (LANL - ANL)
   B. Winjum (UCLA) moves 900K-file plasma physics datasets (UCLA - NERSC)
   Dan Kozak (Caltech) replicates 1 PB LIGO astronomy data for resilience
   Supercomputer centers, genome facilities, light sources, universities all recommend it
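To give a feel for the first example above, here is a back-of-envelope sketch (not part of the talk) of how long a transfer takes at a sustained rate. It assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits per second) and a constant rate with no protocol overhead; the function name `transfer_time_hours` is hypothetical, chosen only for illustration.

```python
def transfer_time_hours(size_bytes: float, rate_bits_per_s: float) -> float:
    """Hours needed to move size_bytes at a constant rate_bits_per_s."""
    if rate_bits_per_s <= 0:
        raise ValueError("rate must be positive")
    seconds = size_bytes * 8 / rate_bits_per_s  # bytes -> bits, then divide by rate
    return seconds / 3600


# Assumed decimal (SI) units
TB = 10**12          # bytes per terabyte
GBIT_PER_S = 10**9   # bits per second in 1 Gb/s

# 22 TB at 5 Gb/s, the figures quoted in the cosmology example
hours = transfer_time_hours(22 * TB, 5 * GBIT_PER_S)
print(f"{hours:.1f} hours")  # prints "9.8 hours"
```

Under these assumptions the 22 TB move is roughly a ten-hour job, which is why unattended, fault-tolerant transfer matters more than raw speed alone.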
20. Automation expands use of networks
   [Plot: Transfers Jan-June 2012, size in bytes (bytes_xfered, log scale from 1e+00 to 1e+12) vs time (Jan to Jul). Slide note: size of circles proportional to log size; plot legend: size = log(transfer rate). Colors: Red = NERSC/LBL/ESnet; Green = ORNL/BNL (plot legend reads "ORNL, LBL"); Blue = ANL; Yellow = FNAL; Grey = Other.]
21. Need much more than data movement
   In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, burdensome reporting
   [Diagram labels: Telescope, Simulation, Next-gen genome sequencer, Staging, Ingest, Registry, Community Repository, Analysis, Archive, Mirror]
   Accelerate discovery and innovation by outsourcing difficult tasks
22. Need much more than data movement
   Ingest, cataloging, integration
   Sharing, collaboration, annotation
   Identity, groups, security
   Analysis, simulation, visualization
   ...
   [Diagram labels: Next-gen genome sequencer, Staging, Ingest, Registry, Community Repository, Analysis, Archive, Mirror]
   Accelerate discovery and innovation by outsourcing difficult tasks
23. Earth System Grid: Data movement
   Outsource data transfer: client data download; replication between sites
   No ESGF client software needed
   20+ times faster than HTTP
   earthsystemgrid.org
24. Kbase: Identity, group, data movement
   kbase.science.energy.gov
25. Genomics: Data movement and analysis
   Galaxy-based workflow management: Web-based UI; drag-n-drop workflow creation; easily add new tools; analytical tools run on scalable computers
   Globus Online provides high-performance, fault-tolerant, secure file transfer between all data endpoints
   [Diagram labels: Public Data, Sequencing Centers, Seq Center, Research Lab, Storage, Local Cluster/Cloud, Globus Online, Integrated Galaxy data libraries, Galaxy in Cloud; panels: Data management, Data analysis]
   Source: Ravi Madduri
26. Integrating observation and simulation
   1. Cloud properties and precipitation characteristics in large-scale models and cloud-resolving models (e.g., CMIP5 models, GCRM)
   2. Precipitating storm structures; storm lifecycles; statistical representation of storm scale properties
   3. Predictive cloud models
   [Figure: Percentage of mapped radar domain in Darwin with returns >10 dBZ over the period 19 to 22 January 2006. Workflow labels: Retrieve; Compare; Construct structured 4-D atmospheric state (CAN); Analytics]
   Credit: Scott Collis
27. Integrating observation and simulation
   Level 1: PBs; Level 2: TBs; Level 3: GBs
   Credit: Salman Habib, Katrin Heitmann
28. Integrating observation and simulation
   Credit: Salman Habib, Katrin Heitmann
29. In summary: Big process for big data
   Accelerate discovery and innovation worldwide by providing research IT as a service
   Outsource time-consuming tasks to:
   provide large numbers of researchers with unprecedented access to powerful tools;
   enable a massive shortening of cycle times in time-consuming research processes; and
   reduce research IT costs via economies of scale
   Accelerate existing science; enable new science
30. Thank you!
   foster@anl.gov
   www.ci.anl.gov
   www.mcs.anl.gov
   www.globusonline.org