Accelerating Data-driven Discovery in Energy Science

Ian Foster

Distinguished Fellow

Life Sciences and Biology

Advanced Materials

Condensed Matter Physics

Chemistry and Catalysis

Soft Materials

Environmental and Geo Sciences

Can we determine pathways that lead to novel states and nonequilibrium assemblies?

Can we observe – and control – nanoscale chemical transformations in macroscopic systems?

Can we create new materials with extraordinary properties – by engineering defects at the atomic scale?

Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems?

Can we unravel the secrets of biological function – across length scales?

Can we understand physical and chemical processes in the most extreme environments?

New tools are needed to answer the most pressing scientific questions

The resulting data deluge

Spans biology, climate, cosmology, materials, physics, urban sciences, …

Simulation data: Petascale to exascale simulations; simulation datasets as laboratories; high-throughput characterization; etc.

Experimental data: Light sources, genome sequencing, next-gen ARM radar, sky surveys, high-throughput experiments, etc.

New research methods that depend on coupling:
1) Of computation and experiment: inverse problems, computer control
2) Across data sources and types: knowledge integration, analysis

Scientific progress requires collaborative discovery engines

informatics analysis

high-throughput experiments

problem specification

modeling and simulation

analysis & visualization

experimental design

Integrated databases

Rick Stevens

Example: A discovery engine for disordered structures

Diffuse scattering images from Ray Osborn et al., Argonne

Sample

Experimental scattering

Material composition

Simulated structure

Simulated scattering

La 60%, Sr 40%

Detect errors (secs to mins)

Knowledge base: Past experiments; simulations; literature; expert knowledge

Select experiments (mins to hours)

Contribute to knowledge base

Simulations driven by experiments (mins to days)

Knowledge-driven decision making

Evolutionary optimization

(1) Eliminate data friction

Eliminating data friction is essential to modern science

"Civilization advances by extending the number of important operations which we can perform without thinking about them" (Whitehead, 1912)

Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005-2015)

Software as a service (SaaS) as lubricant

Customer relationship management (CRM):
• A knowledge-intensive process
• Historically, handled manually or via expensive, inflexible on-premise software

SaaS has revolutionized how CRM is consumed:
• Outsource to provider who runs software on cloud
• Access via simple interfaces

Chart: SaaS vs. on-premise software, compared on ease of use, cost, flexibility, and complexity

Globus: Research data management as a service

Essential research data management services:
• File transfer
• Data sharing
• Data publication
• Identity and groups

Builds on 15 years of DOE research

Outsourced and automated:
• High availability, reliability, performance, scalability

Convenient for:
• Casual users: Web interfaces
• Power users: APIs
• Administrators: Install, manage

globus.org

“I need to easily, quickly, & reliably move data to other locations.”

Research Computing HPC Cluster

Lab Server

Campus Home Filesystem

Desktop Workstation

Personal Laptop

DOE Supercomputer

Public Cloud

“I need to get data from a scientific instrument to my analysis system.”

Next-Gen Sequencer

Light Sheet Microscope

MRI

Advanced Light Source

“I need to easily and securely share my data with my colleagues.”

Globus and the research data lifecycle

1. Researcher initiates transfer request; or requested automatically by script, science gateway

2. Globus transfers files reliably, securely

3. Researcher selects files to share, selects user or group, and sets access permissions

4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!

5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus

6. Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)

7. Curator reviews and approves; data set published on campus or other system

8. Peers, collaborators search and discover datasets; transfer and share using Globus

Diagram locations: Instrument, Compute Facility, Publication Repository, Personal Computer

Diagram stages: Transfer, Share, Publish, Discover

• SaaS: Only a web browser required

• Use storage system of your choice

• Access using your campus credentials
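The sharing steps of the lifecycle (the researcher grants access to a user or group, and the service then checks access to files that stay on their existing storage) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `SharedFolder`, `grant`, `can_read`, and `can_write` are hypothetical names invented for this illustration, and the example path and identities are made up. It is not the Globus API.

```python
from dataclasses import dataclass, field


@dataclass
class SharedFolder:
    """Hypothetical model of a folder shared in place on existing storage."""

    path: str
    # Maps a user or group name to a permission string: "r" or "rw".
    permissions: dict = field(default_factory=dict)

    def grant(self, principal: str, permission: str) -> None:
        # The researcher selects a user or group and sets a permission.
        if permission not in ("r", "rw"):
            raise ValueError(f"unknown permission: {permission!r}")
        self.permissions[principal] = permission

    def can_read(self, user: str, groups: tuple = ()) -> bool:
        # Any entry for the user, or for a group the user is in, allows reading.
        return any(p in self.permissions for p in (user, *groups))

    def can_write(self, user: str, groups: tuple = ()) -> bool:
        # Writing requires an explicit "rw" entry.
        return any(self.permissions.get(p) == "rw" for p in (user, *groups))


# Illustrative data only: the path and identities are made up.
folder = SharedFolder("/instrument/run42/")
folder.grant("alice@example.edu", "rw")
folder.grant("beamline-team", "r")
```

In this model the files never move: only the permission table changes, which mirrors the point that shared data need not be copied to cloud storage.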

Globus at a glance

4 major services

13 national labs use Globus services

100 PB transferred

8,000 active endpoints

20 billion files processed

>300 users are active daily

25,000 registered users

99.95% uptime over the past two years

>30 subscribers

The biggest transfer to date is 1 petabyte

The longest-running transfer to date took 3 months

We’re eager to learn what you want to do with Globus services
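To put the uptime figure in perspective, a quick calculation converts 99.95% availability over two years into a downtime budget. This is a back-of-the-envelope sketch that assumes 365-day years; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes non-leap years


def max_downtime_hours(uptime_percent: float, years: float) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    return (1.0 - uptime_percent / 100.0) * years * HOURS_PER_YEAR


downtime = max_downtime_hours(99.95, 2)
print(f"{downtime:.2f} hours")  # about 8.76 hours over two years
```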

One APS node connects to 125 locations through mid-2014

Same node (1 Gbps link)

Globus and DOE: Terabytes per month

Globus and DOE: Running total terabytes

Globus and DOE: Active users per month

Response has been gratifying

"Really great software." - Benjamin Mayer, Research Associate, Climate Change Science Institute, Oak Ridge National Laboratory

"Whoa! Transfer from NERSC to BNOC (data transfer node) using Globus is screaming!" - Gary Bates, Professional Research Assistant, NOAA

"…Now my users have a fast, easy way to get their data wherever it needs to go, and the setup process was trivial." - Brock Palen, Associate Director, University of Michigan Advanced Research Computing

"... we just had a 153TB transfer that got 20Gb/s and another with 144TB at 25Gb/s! That's pretty insane!" - Jason Alt, Systems Management and Development Lead at National Center for Supercomputing Applications

"We were thrilled by how well Globus worked. We've never seen such high transfer rates, and the service was trivial to install and use." - Dale Land, IT Chief Engineer, Los Alamos National Laboratory

"The system is reliable and secure - and also amazingly easy to use. …It just works." - David Skinner, NERSC user

"I moved 400 GB of files and didn’t even have to think about it." - Jeff Porter, STAR Experiment, Lawrence Berkeley National Lab

"We have been extremely impressed with Globus and how easy it is to use." - Pete Eby, Linux System Administrator, Oak Ridge National Laboratory

"Drag and drop archiving is an incredibly useful feature." - Shreyas Cholia, NERSC user

"The time before Globus now seems like the dark ages!" - Galen Arnold, Systems Engineer, NCSA and Blue Waters PRAC support team, NCSA

Globus service APIs serve as a science platform

Identity, Group, and Profile Management

Data Publication & Discovery

File Sharing

File Transfer & Replication

Globus APIs

Globus Connect

… Globus Toolkit

Globus platform services enable new application capabilities

Publication as service for ACME

Globus platform accelerates development of new services

Operating a sustainable service

Globus is a not-for-profit service for researchers

We adopt a subscription-supported freemium model: subscribers get extra features, rapid support

We’re engaged in crossing the chasm

Support from DOE will contribute to long-term success


(2) Liberate scientific data

Q: What is the biggest obstacle to data sharing in science?

A: The vast majority of data is lost, or not online; if online, not described; if described, not indexed

Not accessible

Not discoverable

Not used

Contrast with common practice for consumer photos (iPhoto):
• Automated capture
• Publish then curate
• Processing to add value
• Outsourced storage

We must automate the capture, linking, and indexing of all data

Globus publication service encodes and automates data publication pipelines

Example application: Materials Data Facility for materials simulation and experiment data

Proposed distributed virtual collections index, organize, tag, & manage distributed data

Think iPhoto on steroids, backed by domain knowledge and supercomputing power


chiDB: Human-computer collaboration to extract Flory-Huggins (χ) parameters from polymers literature (R. Tchoua et al.)

Plenario: Spatially and temporally integrated, linked, and searchable database of urban data (C. Catlett, B. Goldstein, T. Malik et al.)

“I need to publish my data so that others can find it and use it.”

Scholarly Publication

Reference Dataset

Research Community Collaboration

Publish dashboard

Start a new submission

Describe submission: 1) Dublin Core

Describe submission: 2) Science metadata

Assemble the dataset

Transfer files to submission endpoint

Check dataset is assembled correctly

Submission now in curation workflow

Search published datasets

Search across collections

Discover a published dataset

Select a published dataset

View downloaded dataset

Configuring a publication pipeline: Publication “facets”

identifier: URL, Handle, DOI

description: none, standard, custom, domain-specific

curation: none, acceptance, machine-validated, human-validated

access: anonymous, public, collaborators, embargoed

preservation: transient, project lifetime, “forever”, archive
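One way to picture a publication pipeline configured from these facets is as a small dictionary that picks one option per facet, checked against the allowed values. The sketch below is illustrative only: the option names are taken from the slide, but `FACETS`, `validate_pipeline`, and the example configuration are hypothetical and do not reflect how the Globus publication service is actually configured.

```python
# Allowed options per facet, transcribed from the slide.
FACETS = {
    "identifier": ("URL", "Handle", "DOI"),
    "description": ("none", "standard", "custom", "domain-specific"),
    "curation": ("none", "acceptance", "machine-validated", "human-validated"),
    "access": ("anonymous", "public", "collaborators", "embargoed"),
    "preservation": ("transient", "project lifetime", "forever", "archive"),
}


def validate_pipeline(config: dict) -> list:
    """Return a list of problems; an empty list means the config is valid."""
    problems = []
    for facet, allowed in FACETS.items():
        if facet not in config:
            problems.append(f"missing facet: {facet}")
        elif config[facet] not in allowed:
            problems.append(f"{facet}: {config[facet]!r} is not one of {allowed}")
    for facet in config:
        if facet not in FACETS:
            problems.append(f"unknown facet: {facet}")
    return problems


# Hypothetical pipeline for a curated, DOI-identified collection.
example_pipeline = {
    "identifier": "DOI",
    "description": "domain-specific",
    "curation": "human-validated",
    "access": "public",
    "preservation": "archive",
}
problems = validate_pipeline(example_pipeline)
```

A different collection could choose, say, a Handle identifier with no curation and transient preservation, without changing any code.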


(3) Create discovery engines at DOE facilities

Recall: A discovery engine for disordered structures

Simulation: Characterize, predict

Assimilate

Steer data acquisition

Data analysis: Reconstruct, detect features, auto-correlate, particle distributions, …

Science automation services: Scripting, security, storage, cataloging, transfer

Data rates: ~0.001-0.5 GB/s/flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (Today: x10 in 5 yrs)

Integration: Optimize, fit, …

Configure, Check, Guide

Batch

Immediate

Compute scale (PFlops): 0.001, 1, 100+

Precompute material database

Reconstruct image

Auto-correlation

Feature detection

Scientific opportunities
• Probe material structure and function at unprecedented scales

Technical challenges
• Many experimental modalities
• Data rates and computation needs vary widely; increasing
• Knowledge management, integration, synthesis
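The data rates quoted for this slide (~200 TB/month in total, with bursts of ~2 GB/s) can be compared with a short calculation. The sketch assumes a 30-day month and decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes); both are assumptions, since the slide does not say which it uses.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month
TB = 1e12  # decimal terabyte, in bytes (assumption)
GB = 1e9   # decimal gigabyte, in bytes (assumption)


def average_rate_gb_per_s(tb_per_month: float) -> float:
    """Sustained rate that would move the given monthly volume."""
    return tb_per_month * TB / SECONDS_PER_MONTH / GB


average = average_rate_gb_per_s(200)  # about 0.077 GB/s sustained
burst_to_average = 2.0 / average      # burst of ~2 GB/s is about 26x the average
```

Under these assumptions the traffic is strongly bursty: infrastructure sized for the monthly average would fall well short during acquisition peaks.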

Towards discovery engines for energy science (Argonne LDRD)

Linking experiment and computation

Single-crystal diffuse scattering: Defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP).

Near-field high-energy X-ray diffraction microscopy: Microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on APS cluster or months if data taken home. Used to detect errors in one run that would have resulted in total waste of beamtime.

X-ray nano/microtomography: Bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method gives reconstruction of 360x2048x1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on cluster: enables quasi-instant response.
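The inverse-modeling idea in the diffuse scattering example (run many simulations, keep the candidate structures whose simulated scattering best matches the experiment, and mutate them) can be shown on a toy problem. The sketch below is not the authors' method or code: the forward model is a made-up single Gaussian peak, the parameter names and population settings are arbitrary assumptions, and everything runs serially rather than on 100K+ cores.

```python
import math
import random


def simulate_scattering(params, q_values):
    """Hypothetical forward model: one Gaussian peak (center, width)."""
    center, width = params
    return [math.exp(-((q - center) ** 2) / (2.0 * width ** 2)) for q in q_values]


def misfit(params, q_values, observed):
    """Sum of squared differences between simulated and observed intensity."""
    simulated = simulate_scattering(params, q_values)
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


def evolve(q_values, observed, generations=200, population=40, parents=6, seed=0):
    """Simple elitist evolutionary search over (center, width)."""
    rng = random.Random(seed)
    pop = [(rng.uniform(-2.0, 2.0), rng.uniform(0.1, 2.0)) for _ in range(population)]
    sigma = 0.5  # mutation step size, shrunk each generation
    for _ in range(generations):
        pop.sort(key=lambda p: misfit(p, q_values, observed))
        elite = pop[:parents]  # best candidates survive unchanged
        children = []
        while len(elite) + len(children) < population:
            center, width = rng.choice(elite)
            child = (center + rng.gauss(0.0, sigma),
                     max(0.05, abs(width + rng.gauss(0.0, sigma))))
            children.append(child)
        pop = elite + children
        sigma *= 0.97
    return min(pop, key=lambda p: misfit(p, q_values, observed))


# Synthetic "experiment": data generated from known parameters (0.5, 0.3).
q_values = [i / 10.0 for i in range(-30, 31)]
observed = simulate_scattering((0.5, 0.3), q_values)
best = evolve(q_values, observed)
```

Each misfit evaluation is independent of the others, which is why this style of search parallelizes naturally across many simulations.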

APS beamlines: 2-BM, 1-ID, 6-ID

Diagram labels: Populate, Sim, Select

Microstructure of a copper wire, 0.2mm diameter

Advanced Photon Source

Experimental and simulated scattering from manganite

0: Develop or reuse script

1: Run script (EL1.layer)

2: Lookup file (name=EL1.layer, user=Anton, type=reconstruction)

3: Transfer inputs

4: Run app

5: Transfer results

6: Update catalogs

Diagram labels: Researchers; Script libraries; Storage locations; Compute facilities; Collaboration catalogs (Provenance, Files & Metadata); External collaborators
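The numbered workflow (look up files in a catalog, transfer inputs, run the application, transfer results, update the catalog) can be mocked up in a few lines. This is a toy in-memory sketch under stated assumptions: `Catalog`, `transfer`, `run_workflow`, the endpoint strings, and the `derived` type label are all hypothetical, and no real data movement or Globus call takes place.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical in-memory stand-in for a collaboration catalog."""

    entries: list = field(default_factory=list)

    def add(self, **metadata):
        self.entries.append(dict(metadata))

    def lookup(self, **query):
        # Step 2: return every entry whose metadata matches all query fields.
        return [e for e in self.entries
                if all(e.get(k) == v for k, v in query.items())]


def transfer(files, destination, log):
    # Steps 3 and 5: a real system would move bytes; here the move is only logged.
    log.append(("transfer", tuple(files), destination))
    return [f"{destination}/{name.rsplit('/', 1)[-1]}" for name in files]


def run_workflow(catalog, app, log, **query):
    inputs = catalog.lookup(**query)                                        # step 2
    staged = transfer([e["path"] for e in inputs], "compute:/scratch", log)  # step 3
    outputs = app(staged)                                                   # step 4
    stored = transfer(outputs, "storage:/results", log)                     # step 5
    for path in stored:                                                     # step 6
        # Record provenance: which catalog entries the result was derived from.
        catalog.add(path=path, type="derived", user=query.get("user"),
                    derived_from=tuple(e["path"] for e in inputs))
    return stored


catalog = Catalog()
catalog.add(name="EL1.layer", user="Anton", type="reconstruction",
            path="storage:/raw/EL1.layer")
log = []
results = run_workflow(catalog, lambda paths: [p + ".out" for p in paths], log,
                       name="EL1.layer", user="Anton", type="reconstruction")
```

Because the catalog update happens in the same script as the computation, provenance is captured automatically rather than relying on the researcher to record it afterwards.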

Tying it all together: An energy sciences infrastructure

informatics analysis

high-throughput experiments

problem specification

modeling and simulation

experimental design

Integrated databases

Summary: Big opportunities and challenges for energy data

Immediate opportunities
• Reduce data friction and accelerate discovery by deploying Globus services across all DOE facilities
• Develop new services to capture, link energy data

Important research agenda
• Discovery engines to answer major scientific questions
• New research modalities linking computation and data
• Organization and analysis of massive science data


Thank you to our sponsors!

U.S. Department of Energy

For more information: [email protected]

Thanks to co-authors and Globus team

Globus services (globus.org)
• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
• Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014.
• Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014.

Publication (globus.org/data-publication)
• Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.

Discovery engines
• Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.
