Ian Foster Accelerating data-driven discovery in energy science Distinguished Fellow
Life Sciences and Biology
Advanced Materials
Condensed Matter Physics
Chemistry and Catalysis
Soft Materials
Environmental and Geo Sciences
Can we determine pathways that lead to novel states and nonequilibrium assemblies?
Can we observe – and control – nanoscale chemical transformations in macroscopic systems?
Can we create new materials with extraordinary properties – by engineering defects at the atomic scale?
Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems?
Can we unravel the secrets of biological function – across length scales?
Can we understand physical and chemical processes in the most extreme environments?
New tools are needed to answer the most pressing scientific Qs
The resulting data deluge: spans biology, climate, cosmology, materials, physics, urban sciences, …
Simulation data: petascale to exascale simulations; simulation datasets as laboratories; high-throughput characterization; etc.
Experimental data: light sources, genome sequencing, next-gen ARM radar, sky surveys, high-throughput experiments, etc.
New research methods that depend on coupling
1) Of computation and experiment: inverse problems, computer control
2) Across data sources and types: knowledge integration, analysis
Scientific progress requires collaborative discovery engines
informatics analysis
high-throughput experiments
problem specification
modeling and simulation
analysis & visualization
experimental design
analysis & visualization
Integrated databases
Rick Stevens
Example: A discovery engine for disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
Sample
Experimental scattering
Material composition
Simulated structure
Simulated scattering
La 60%, Sr 40%
Detect errors (secs to mins)
Knowledge base: past experiments; simulations; literature; expert knowledge
Select experiments (mins to hours)
Contribute to knowledge base
Simulations driven by experiments (mins to days)
Knowledge-driven decision making
Evolutionary optimization
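The simulate, compare, and evolve loop named on this slide can be illustrated with a toy sketch. It is not the Argonne workflow: `simulate_scattering` is a hypothetical stand-in for a real structure simulation, the only free parameter is a Sr fraction, and the population size, mutation width, and generation count are arbitrary assumptions.

```python
import random

def simulate_scattering(sr_fraction):
    """Hypothetical forward model: composition -> fake 'scattering' profile."""
    return [sr_fraction * q + (1.0 - sr_fraction) * q * q
            for q in (0.1, 0.2, 0.3, 0.4, 0.5)]

def misfit(candidate, observed):
    """Sum of squared differences between simulated and observed profiles."""
    simulated = simulate_scattering(candidate)
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))

def evolve(observed, generations=60, pop_size=20, sigma=0.05, seed=0):
    """Elitist evolutionary search over a single parameter in [0, 1]."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank candidates by how well their simulation matches the experiment.
        population.sort(key=lambda c: misfit(c, observed))
        parents = population[: pop_size // 4]
        children = []
        while len(parents) + len(children) < pop_size:
            # Mutate a randomly chosen parent and clamp to the valid range.
            child = rng.choice(parents) + rng.gauss(0.0, sigma)
            children.append(min(1.0, max(0.0, child)))
        population = parents + children
    return min(population, key=lambda c: misfit(c, observed))

# Pretend the "experiment" measured a sample with 40% Sr.
observed = simulate_scattering(0.40)
best = evolve(observed)
print(f"recovered Sr fraction: {best:.3f}")
```

In a real discovery engine each `misfit` call would be a full simulation, which is why such searches are run as many simulations in parallel.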
Accelerating data-driven discovery in energy science
(1) Eliminate data friction
Eliminating data friction is essential to modern science
Civilization advances by extending the number of important operations which we can perform without thinking about them (Whitehead, 1912)
Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005-2015)
Software as a service (SaaS) as lubricant
Customer relationship management (CRM):
A knowledge-intensive process
• Historically, handled manually or via expensive, inflexible on-premise software
• SaaS has revolutionized how CRM is consumed
• Outsource to provider who runs software on cloud
• Access via simple interfaces
Ease of use, cost, flexibility, complexity
SaaS vs. on-premise
Globus: Research data management as a service
Essential research data management services
• File transfer
• Data sharing
• Data publication
• Identity and groups
Builds on 15 years of DOE research
Outsourced and automated
• High availability, reliability, performance, scalability
Convenient for
• Casual users: Web interfaces
• Power users: APIs
• Administrators: Install, manage
globus.org
“I need to easily, quickly, & reliably move data to other locations.”
Research Computing HPC Cluster
Lab Server
Campus Home Filesystem
Desktop Workstation
Personal Laptop
DOE supercomputer
Public Cloud
“I need to get data from a scientific instrument to my analysis system.”
Next-Gen Sequencer
Light Sheet Microscope
MRI
Advanced Light Source
“I need to easily and securely share my data with my colleagues.”
Globus and the research data lifecycle
1) Researcher initiates transfer request; or requested automatically by script, science gateway
2) Globus transfers files reliably, securely
3) Researcher selects files to share, selects user or group, and sets access permissions
4) Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5) Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
6) Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
7) Curator reviews and approves; data set published on campus or other system
8) Peers, collaborators search and discover datasets; transfer and share using Globus
Instrument, Compute Facility, Publication Repository, Personal Computer
Transfer, Share, Publish, Discover
• SaaS: only a web browser required
• Use storage system of your choice
• Access using your campus credentials
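The sharing steps in this lifecycle (a researcher sets permissions, access is enforced where the files already live, a collaborator reads without a local account) can be sketched with a small in-memory model. This is an illustration only and not the Globus API: the `SharedEndpoint` class, its methods, and the user, group, and path names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SharedEndpoint:
    """Hypothetical in-memory model of a shared storage location (not the Globus API)."""
    files: set
    rules: dict = field(default_factory=dict)    # (principal, path prefix) -> "r" or "rw"
    groups: dict = field(default_factory=dict)   # group name -> set of user names

    def grant(self, principal, path_prefix, permission="r"):
        """Researcher selects a user or group and sets access permissions."""
        self.rules[(principal, path_prefix)] = permission

    def can_read(self, user, path):
        """Access is checked against the rules; files stay on existing storage."""
        # A user matches rules for their own name and for any group they belong to.
        principals = {user} | {g for g, members in self.groups.items() if user in members}
        return path in self.files and any(
            principal in principals and path.startswith(prefix)
            for (principal, prefix) in self.rules
        )

endpoint = SharedEndpoint(files={"/run42/scan.h5", "/private/notes.txt"})
endpoint.groups["beamline-team"] = {"alice", "bob"}
endpoint.grant("beamline-team", "/run42/")

# A collaborator needs no local account, only a matching rule.
print(endpoint.can_read("alice", "/run42/scan.h5"))
print(endpoint.can_read("alice", "/private/notes.txt"))
print(endpoint.can_read("carol", "/run42/scan.h5"))
```

The point of the design is that permissions are metadata attached to paths, so sharing never requires copying data to a separate cloud store.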
Globus at a glance
• 4 major services
• 13 national labs use Globus services
• 100 PB (petabytes) transferred
• 8,000 active endpoints
• 20 billion files processed
• >300 users are active daily
• 25,000 registered users
• 99.95% uptime over the past two years
• >30 subscribers
• The biggest transfer to date is 1 petabyte
• The longest-running transfer to date took 3 months
We’re eager to learn what you want to do with Globus services
One APS node connects to 125 locations through mid-2014
Same node (1 Gbps link)
Globus and DOE: Terabytes per month
Globus and DOE: Running total terabytes
Globus and DOE: Active users per month
Response has been gratifying
"Really great software." - Benjamin Mayer, Research Associate, Climate Change Science Institute, Oak Ridge National Laboratory
"Whoa! Transfer from NERSC to BNOC (data transfer node) using Globus is screaming!" - Gary Bates, Professional Research Assistant, NOAA
"…Now my users have a fast, easy way to get their data wherever it needs to go, and the setup process was trivial." - Brock Palen, Associate Director, University of Michigan Advanced Research Computing
"... we just had a 153TB transfer that got 20Gb/s and another with 144TB at 25Gb/s! That's pretty insane!" - Jason Alt, Systems Management and Development Lead at National Center for Supercomputing Applications
"We were thrilled by how well Globus worked. We've never seen such high transfer rates, and the service was trivial to install and use." - Dale Land, IT Chief Engineer, Los Alamos National Laboratory
"The system is reliable and secure - and also amazingly easy to use. …It just works." - David Skinner, NERSC user
"I moved 400 GB of files and didn’t even have to think about it." - Jeff Porter, STAR Experiment, Lawrence Berkeley National Lab
"We have been extremely impressed with Globus and how easy it is to use." - Pete Eby, Linux System Administrator, Oak Ridge National Laboratory
"Drag and drop archiving is an incredibly useful feature." - Shreyas Cholia, NERSC user
"The time before Globus now seems like the dark ages!" - Galen Arnold, Systems Engineer, NCSA and Blue Waters PRAC support team, NCSA
Globus service APIs serve as a science platform
Identity, Group, and Profile Management
… Globus Toolkit
Globus APIs
Globus Connect
Data Publication & Discovery
File Sharing
File Transfer & Replication
Globus platform services enable new application capabilities
Publication as service for ACME
Globus platform accelerates development of new services
Operating a sustainable service
Globus is a not-for-profit service for researchers
We adopt a subscription-supported freemium model: subscribers get extra features, rapid support
We’re engaged in crossing the chasm
Support from DOE will contribute to long-term success
Accelerating data-driven discovery in energy science
(2) Liberate scientific data
Q: What is the biggest obstacle to data sharing in science?
A: The vast majority of data is lost, or not online; if online, not described; if described, not indexed
• Not accessible
• Not discoverable
• Not used
Contrast with common practice for consumer photos (iPhoto)
• Automated capture
• Publish then curate
• Processing to add value
• Outsourced storage
We must automate the capture, linking, and indexing of all data
Globus publication service encodes and automates data publication pipelines
Example application: Materials Data Facility for materials simulation and experiment data
Proposed distributed virtual collections index, organize, tag, & manage distributed data
Think iPhoto on steroids – backed by domain knowledge and supercomputing power
We must automate the capture, linking, and indexing of all data
chiDB: Human-computer collaboration to extract Flory-Huggins (χ) parameters from polymers literature (R. Tchoua et al.)
Plenario: Spatially and temporally integrated, linked, and searchable database of urban data (C. Catlett, B. Goldstein, T. Malik et al.)
“I need to publish my data so that others can find it and use it.”
Scholarly Publication
Reference Dataset
Research Community Collaboration
Publish dashboard
Start a new submission
Describe submission: 1) Dublin Core
Describe submission: 2) Science metadata
Assemble the dataset
Transfer files to submission endpoint
Check dataset is assembled correctly
Submission now in curation workflow
Search published datasets
Search across collections
Discover a published dataset
Select a published dataset
View downloaded dataset
Configuring a publication pipeline: Publication “facets”
identifier: URL, Handle, DOI
description: none, standard, custom, domain-specific
curation: none, acceptance, machine-validated, human-validated
access: anonymous, public, collaborators, embargoed
preservation: transient, project lifetime, “forever”, archive
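One way to picture a pipeline built from these facets is as a validated configuration record with one choice per facet. The sketch below assumes the facet grouping reads as identifier, description, curation, access, and preservation; `PublicationPipeline` and `FACETS` are hypothetical names, not part of the Globus publication service.

```python
from dataclasses import dataclass

# Allowed values per facet, as read from the slide (grouping is an assumption).
FACETS = {
    "identifier": ("URL", "Handle", "DOI"),
    "description": ("none", "standard", "custom", "domain-specific"),
    "curation": ("none", "acceptance", "machine-validated", "human-validated"),
    "access": ("anonymous", "public", "collaborators", "embargoed"),
    "preservation": ("transient", "project lifetime", "forever", "archive"),
}

@dataclass(frozen=True)
class PublicationPipeline:
    """Hypothetical pipeline configuration: exactly one value per facet."""
    identifier: str
    description: str
    curation: str
    access: str
    preservation: str

    def __post_init__(self):
        # Reject any value that is not listed for its facet.
        for facet, allowed in FACETS.items():
            value = getattr(self, facet)
            if value not in allowed:
                raise ValueError(f"{facet}={value!r} is not one of {allowed}")

# Example: a curated, publicly readable reference dataset.
reference_dataset = PublicationPipeline(
    identifier="DOI",
    description="domain-specific",
    curation="human-validated",
    access="public",
    preservation="archive",
)
print(reference_dataset)
```

A lightweight project collection might instead choose `identifier="URL"`, `curation="none"`, and `preservation="project lifetime"`.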
Accelerating data-driven discovery in energy science
(3) Create discovery engines at DOE facilities
Recall: A discovery engine for disordered structures
Simulation: characterize, predict, assimilate, steer data acquisition
Data analysis: reconstruct, detect features, auto-correlate, particle distributions, …
Science automation services: scripting, security, storage, cataloging, transfer
~0.001-0.5 GB/s/flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (Today: x10 in 5 yrs)
Integration: optimize, fit, …
Configure, Check, Guide
Batch
Immediate
0.001, 1, 100+ PFlops
Precompute material database
Reconstruct image
Auto-correlation
Feature detection
Scientific opportunities
• Probe material structure and function at unprecedented scales
Technical challenges
• Many experimental modalities
• Data rates and computation needs vary widely; increasing
• Knowledge management, integration, synthesis
Towards discovery engines for energy science (Argonne LDRD)
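The quoted data rates can be put side by side with a little arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a 30-day month, neither of which is stated on the slide; it converts the monthly volume to an average rate and compares it with the quoted burst rate.

```python
TB = 1e12                            # bytes per terabyte (decimal units assumed)
GB = 1e9                             # bytes per gigabyte
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month

monthly_volume_tb = 200              # "~200 TB/month"
burst_gb_per_s = 2.0                 # "~2 GB/s total burst"

# Average sustained rate implied by the monthly volume.
average_gb_per_s = monthly_volume_tb * TB / SECONDS_PER_MONTH / GB
# How much larger a burst is than the long-run average.
burst_to_average = burst_gb_per_s / average_gb_per_s
# The slide's "x10 in 5 yrs" applied to the monthly volume.
projected_monthly_tb = monthly_volume_tb * 10

print(f"average rate: {average_gb_per_s:.3f} GB/s")
print(f"burst / average: {burst_to_average:.0f}x")
print(f"projected volume: {projected_monthly_tb} TB/month")
```

Under these assumptions the average is under 0.1 GB/s, so the infrastructure has to be sized for bursts well above the mean rather than for the monthly total.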
Linking experiment and computation
Single-crystal diffuse scattering: Defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP).
Near-field high-energy X-ray diffraction microscopy: Microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on APS cluster or months if data taken home. Used to detect errors in one run that would have resulted in total waste of beamtime.
X-ray nano/microtomography: Bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method gives reconstruction of 360x2048x1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on cluster: enables quasi-instant response.
2-BM
1-ID
6-ID
Populate
Sim Sim
Select
Sim
Microstructure of a copper wire, 0.2mm diameter
Advanced Photon Source
Experimental and simulated scattering from manganite
0: Develop or reuse script
1: Run script (EL1.layer)
2: Lookup file (name=EL1.layer, user=Anton, type=reconstruction)
3: Transfer inputs
4: Run app
5: Transfer results
6: Update catalogs
Storage locations
Compute facilities
External collaborators
Collaboration catalogs
Provenance
Files & Metadata
Script libraries
Researchers
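The numbered steps on this slide amount to a lookup, transfer, compute, transfer, catalog loop. The sketch below walks through them with in-memory stand-ins: the catalog is a list of dicts, storage is a dict, transfers are plain assignments, and `sorted` stands in for a reconstruction code. The function names and record fields are hypothetical; only the lookup attributes (name, user, type) come from the slide.

```python
def lookup(catalog, **attrs):
    """Step 2: return the first catalog record matching all given attributes."""
    return next(r for r in catalog if all(r.get(k) == v for k, v in attrs.items()))

def run_workflow(catalog, storage, app, **query):
    record = lookup(catalog, **query)              # 2: lookup file
    inputs = storage[record["location"]]           # 3: transfer inputs (simulated)
    result = app(inputs)                           # 4: run app
    out_location = record["location"] + ".out"
    storage[out_location] = result                 # 5: transfer results (simulated)
    catalog.append({                               # 6: update catalogs
        "name": record["name"] + ".out",
        "user": record["user"],
        "type": "result",
        "location": out_location,
        "derived_from": record["name"],            # provenance link
    })
    return out_location

catalog = [{"name": "EL1.layer", "user": "Anton", "type": "reconstruction",
            "location": "store/EL1.layer"}]
storage = {"store/EL1.layer": [3, 1, 2]}

# 1: run the script against the catalog query shown on the slide.
where = run_workflow(catalog, storage, app=sorted,
                     name="EL1.layer", user="Anton", type="reconstruction")
print(where, storage[where], catalog[-1]["derived_from"])
```

Recording `derived_from` at step 6 is what lets collaborators later trace a result back to the input and script that produced it.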
Tying it all together: An energy sciences infrastructure
Summary: Big opportunities and challenges for energy data
Immediate opportunities
• Reduce data friction and accelerate discovery by deploying Globus services across all DOE facilities
• Develop new services to capture, link energy data
Important research agenda
• Discovery engines to answer major scientific questions
• New research modalities linking computation and data
• Organization and analysis of massive science data
Thank you to our sponsors!
U.S. DEPARTMENT OF ENERGY
For more information: [email protected]
Thanks to co-authors and Globus team
Globus services (globus.org)
• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
• Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014.
• Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014.
Publication (globus.org/data-publication)
• Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.
Discovery engines
• Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.