Services for science: Creating knowledge in the Internet age. Ian Foster, Computation Institute, Argonne National Lab & University of Chicago
Transcript
Page 1: Services For Science April 2009

Ian Foster

Computation Institute

Argonne National Lab & University of Chicago

Services for science: Creating knowledge in the Internet age

Page 2: Services For Science April 2009

Page 3: Services For Science April 2009

Knowledge generation in astronomy, ~1600

[Figure: timeline with intervals labeled 30 years, ? years, 10 years, 6 years, 2 years]

Page 4: Services For Science April 2009

Astronomy from 1600 to 2010

Automation: 10^-1 to 10^8 Hz data capture
Community: 10^0 to 10^4 astronomers (10^6 amateur)
Data: 10^6 to 10^15 B aggregate
Computation: 10^-1 to 10^15 Hz peak
Literature: 10^1 to 10^5 pages/year

Page 5: Services For Science April 2009

Knowledge generation in medicine, ~1600

Page 6: Services For Science April 2009

Biomedical research ~2010

[Figure: data types in biomedical research, each annotated with an approximate count between >10^5 and >10^9]

DNA sequences, alignments: ...atcgaattccaggcgtcacattctcaattcca...
Protein sequences: MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...
Protein structure: 2º structure, 3º structure, 4º structure
Protein-protein interactions: metabolism, pathways, receptor-ligand
Polymorphism and variants: genetic variants, individual patients, epidemiology
Physiology, cellular biology, biochemistry, neurobiology, endocrinology, etc.
ESTs, expression patterns, large-scale screens
Genetics and maps: linkage, cytogenetic, clone-based

From John Wooley

Page 7: Services For Science April 2009

More data does not always mean more knowledge

Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.

Page 8: Services For Science April 2009

Knowledge generation as a systems problem

• Many diverse actors
• Complex, often rapidly evolving processes
• Need for scalability in multiple dimensions

With systemic properties:
• Rate of knowledge generation (throughput)
• Time to answer questions (latency)
• Completeness of exploration
• Robustness to errors

Page 9: Services For Science April 2009

Data

An incomplete list of process steps

Discover

Access

Integrate

Analyze

Mine

Publish

Annotate

Validate

Curate

Share

Artisanal

Industrial

Data

Analyses

Models

Experiments

Literature

Page 10: Services For Science April 2009

SOA as an integrating framework?

We expose data and software as services …

which others discover, decide to use, …

and compose to create new functions ...

which they publish as new services.
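The cycle above (expose, discover, compose, publish) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the API of any system named in this talk: `ServiceRegistry`, `compose`, the tag names, and the two toy services are all hypothetical, and a real registry would be a network service rather than an in-memory dictionary.

```python
from typing import Callable, Dict, List


class ServiceRegistry:
    """Hypothetical in-memory registry standing in for a networked one."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable] = {}
        self._tags: Dict[str, List[str]] = {}

    def publish(self, name: str, func: Callable, tags: List[str]) -> None:
        # Expose a piece of data or software under a name, with searchable tags.
        self._services[name] = func
        self._tags[name] = list(tags)

    def discover(self, tag: str) -> List[str]:
        # Return the names of all services carrying the given tag.
        return sorted(name for name, tags in self._tags.items() if tag in tags)

    def lookup(self, name: str) -> Callable:
        return self._services[name]


def compose(*funcs: Callable) -> Callable:
    """Chain services so that each one's output feeds the next."""
    def pipeline(value):
        for func in funcs:
            value = func(value)
        return value
    return pipeline


registry = ServiceRegistry()

# Two toy services: one returns data, one analyzes it (both made up).
registry.publish("sequence-data", lambda _query: "atcgaattcc", tags=["data"])
registry.publish(
    "gc-content",
    lambda seq: sum(base in "gc" for base in seq) / len(seq),
    tags=["analysis"],
)

# Someone else discovers the services, composes them, and publishes the result.
fetch = registry.lookup(registry.discover("data")[0])
analyze = registry.lookup(registry.discover("analysis")[0])
registry.publish("gc-of-query", compose(fetch, analyze), tags=["analysis", "composite"])

result = registry.lookup("gc-of-query")("any query")
print(result)
```

The composite is registered exactly like its parts, which is the point of the last line of the slide: a composition becomes a new service that others can discover in turn.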

Technical and socio-technical challenges

Technical:
• Complexity
• Semantics
• Distribution
• Scale

Socio-technical:
• Incentives
• Policy, trust
• Reproducibility
• Life cycle

“Service-Oriented Science”, Science, 2005

Page 11: Services For Science April 2009

Proteomics, genomics, transcriptomics, protein sequence prediction, phenotypic studies, phylogeny, sequence analysis, protein structure prediction, protein-protein interaction, metabolomics, model organism collections, systems biology, health epidemiology, organisms, disease, …

1070 molecular biology databases, Nucleic Acids Research, Jan 2008 (96 in Jan 2001)

Slide: Carole Goble

Page 12: Services For Science April 2009

The cancer Biomedical Informatics Grid

Globus

Page 13: Services For Science April 2009

As of Sept 18, 2008:

122 participants
81 services (62 data, 19 analytical)

Page 14: Services For Science April 2009

As of Oct 19, 2008:

122 participants
105 services (70 data, 35 analytical)

Page 15: Services For Science April 2009

Automating the routine

Location A: microarray, protein, image data
Location B: microarray, protein, image data
Location C: microarray, protein, image data
Location C: image analysis
Location D: image analysis
Microarray and protein databases at other institutions

Different database systems, data representations, security
Different program invocation, remote access, data transfer

Page 16: Services For Science April 2009

Automating the routine

Location A: microarray, protein, image data
Location B: microarray, protein, image data
Location C: microarray, protein, image data
Location C: image analysis
Location D: image analysis
Microarray & protein databases at other institutions

caGrid Service Interfaces
caGrid Environment
Registered Object Definitions
Advertisement
Discovery
Log on, Grid credentials
Query and Analysis Workflow

Globus

Page 17: Services For Science April 2009

Location A: microarray, protein, image data
Location B: microarray, protein, image data
Location C: microarray, protein, image data
Location C: image analysis
Location D: image analysis
Microarray & protein databases at other institutions

caGrid Service Interfaces
caGrid Environment
Registered Object Definitions
Advertisement
Discovery
Log on, Grid credentials
Query and Analysis Workflow

Service authoring
Metadata services
Service registries
Security services
Quality control
Queries, workflows
Compute resources

caGrid

Globus

Page 18: Services For Science April 2009

Lifecycle issues

caGrid

Cancer Data Standards Repository

Discovery Composition

Execution Analysis

Community

reuse

generate

Page 19: Services For Science April 2009

[Figure: caGrid service architecture]

Service: Grid Service, Service Definition, Data Type Definitions, Service API, WSDL, XSD
Client: Grid Client, Client API
Core Services (Metadata Services): Cancer Data Standards Repository (Object Definitions), Enterprise Vocabulary Services (Objects), Global Model Exchange (GME)
Relationships shown: Registered In, Semantically Described In, XML Objects Serialize To, Validates Against, Client Uses

Page 20: Services For Science April 2009

Kepler

BPEL

Ptolemy II

Taverna

Trident

Page 21: Services For Science April 2009

Microarray clustering (Taverna)*

Query and retrieve microarray data of interest from a caArrayScrub data service at Columbia University

Preprocess (normalize) the microarray data using the GenePattern analytical service at the Broad Institute at MIT

Run hierarchical clustering using the geWorkbench analytical service at Columbia University

[Figure legend: workflow in/output, caGrid services, “shim” services, others]

*Wei Tan, Ravi Madduri, Kiran Keshav, Baris E. Suzek, Scott Oster, Ian Foster. Orchestrating caGrid Services in Taverna. ICWS 08.

Wei Tan
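The three steps of the microarray workflow above (query, normalize, cluster) can be pictured as a simple function pipeline. The sketch below is only an illustration: the three functions are hypothetical local stand-ins for the remote data and analytical services, the expression values are made up, and z-score normalization and single-linkage clustering are assumptions chosen for brevity rather than the methods those services actually use. It does not use Taverna or caGrid APIs.

```python
import math
from typing import Dict, List

Matrix = Dict[str, List[float]]  # gene name -> expression values per sample


def query_microarray() -> Matrix:
    """Stand-in for the data service: returns made-up expression values."""
    return {
        "geneA": [1.0, 2.0, 3.0],
        "geneB": [2.0, 4.0, 6.0],
        "geneC": [3.0, 2.0, 1.0],
        "geneD": [9.0, 6.0, 3.0],
    }


def normalize(data: Matrix) -> Matrix:
    """Stand-in for the preprocessing service: z-score each gene's profile."""
    out: Matrix = {}
    for gene, values in data.items():
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
        out[gene] = [(v - mean) / std for v in values]
    return out


def distance(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clustering(data: Matrix, k: int) -> List[List[str]]:
    """Stand-in for the clustering service: merge closest clusters until k remain."""
    clusters = [[gene] for gene in sorted(data)]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    distance(data[a], data[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters


# The workflow is the composition of the three steps.
clusters = hierarchical_clustering(normalize(query_microarray()), k=2)
print(clusters)
```

In the real workflow each function call would be a remote service invocation, with “shim” services converting between the data formats that neighbouring steps expect.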

Page 22: Services For Science April 2009

Execution trace

Execution result as XML

1936 gene expressions

Page 23: Services For Science April 2009

Workflows as communication

Experimental method

Know-how

Standing operating procedures

Transparent science

Intellectual property

First class scientific assets

Memes

Variant design

To be reused and mashed up

Hard to design, esp. for reuse

Hard to reuse, esp. across discipline boundaries

Slide: Carole Goble

Page 24: Services For Science April 2009

Reproducible science means:
• context
• trust
• easy access to methods

Page 25: Services For Science April 2009

Workflows are another form of scholarly outcome to publish, curate, cite, and archive along with data and publications

Page 26: Services For Science April 2009

Functional Magnetic Resonance Imaging (fMRI)

Mike Wilde

Page 27: Services For Science April 2009

Parallel scripting

Page 28: Services For Science April 2009

Computation as a first-class entity

Capture information about relationships among:
• Data (varying locations and representations)
• Programs (& inputs, outputs, constraints)
• Computations (& execution environments)

Apply this information to:
• Discovery of data and programs
• Computation management
• Provenance
• Planning, scheduling, performance optimization

[Figure: Data, Program, and Computation linked by the relationships operates-on, execution-of, created-by, consumed-by]

A Virtual Data System for Representing, Querying & Automating Data Derivation [SSDBM02]
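One way to picture the three entities and their four relationships is as plain record types. The following is a minimal sketch under stated assumptions, not the schema of the virtual data system cited above: the class and field names, the helper functions, and the example locations and environment are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Data:
    name: str
    location: str         # where this copy lives (hypothetical field)
    representation: str   # e.g. a file format or type name


@dataclass
class Program:
    name: str
    inputs: List[str]     # input types
    outputs: List[str]    # output types


@dataclass
class Computation:
    execution_of: Program                              # execution-of
    consumed: List[Data]                               # data consumed-by this run
    created: List[Data] = field(default_factory=list)  # data created-by this run
    environment: str = "unspecified"


def created_by(dataset: Data, computations: List[Computation]) -> List[Computation]:
    """Which computations created this dataset?"""
    return [c for c in computations if dataset in c.created]


def consumed_by(dataset: Data, computations: List[Computation]) -> List[Computation]:
    """Which computations consumed this dataset?"""
    return [c for c in computations if dataset in c.consumed]


def operates_on(program: Program, computations: List[Computation]) -> List[Data]:
    """Which datasets has this program operated on, across all recorded runs?"""
    return [d for c in computations if c.execution_of == program for d in c.consumed]


# Illustrative records; the locations and environment name are made up.
image = Data("3a.i", "site-A:/data/3a.i", "subjectImage")
warp = Data("3a.w", "site-B:/scratch/3a.w", "warp")
align_warp = Program("align_warp", inputs=["subjectImage"], outputs=["warp"])
run = Computation(align_warp, consumed=[image], created=[warp], environment="cluster-1")
runs = [run]

print(created_by(warp, runs)[0].execution_of.name)
```

Recording the computation as its own record, rather than only the program and its files, is what makes provenance and scheduling questions answerable later.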

Page 29: Services For Science April 2009

Example: fMRI analysis

[Figure: workflow graph]

Inputs: subject images 3a, 4a, 5a, 6a (each a .h/.i pair) and the reference image ref.h, ref.i
align_warp/1, /3, /5, /7: produce 3a.w, 4a.w, 5a.w, 6a.w
reslice/2, /4, /6, /8: produce 3a.s.h/3a.s.i, 4a.s.h/4a.s.i, 5a.s.h/5a.s.i, 6a.s.h/6a.s.i
softmean/9: produces atlas.h, atlas.i
slicer/10, /12, /14: produce atlas_x.ppm, atlas_y.ppm, atlas_z.ppm
convert/11, /13, /15: produce atlas_x.jpg, atlas_y.jpg, atlas_z.jpg

First Provenance Challenge, http://twiki.ipaw.info/ [CCPE06]

Page 30: Services For Science April 2009

Query examples

• Query by procedure signature: Show procedures that have inputs of type subjectImage and output types of warp
• Query by actual arguments: Show align_warp calls (including all arguments), with argument model=rigid
• Query by annotation: List anonymized subject images for young subjects: find datasets of type subjectImage, annotated with privacy=anonymized and subjectType=young
• Basic lineage graph queries: Find all datasets derived from dataset ‘5a’
• Graph pattern matching: Show me all output datasets of softmean calls that were aligned with model=affine
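Three of the query styles above can be illustrated over a small in-memory record of procedure calls. This is a sketch only: the `calls` list is a made-up, much-reduced fragment of the fMRI workflow, the function names are hypothetical, and the pattern-matching query assumes that a softmean output qualifies if any upstream align_warp call used the given model.

```python
from typing import Dict, List, Set

# Hypothetical call records: procedure, arguments, input and output datasets.
calls: List[Dict] = [
    {"proc": "align_warp", "args": {"model": "rigid"}, "inputs": ["3a"], "outputs": ["3a.w"]},
    {"proc": "align_warp", "args": {"model": "affine"}, "inputs": ["5a"], "outputs": ["5a.w"]},
    {"proc": "reslice", "args": {}, "inputs": ["3a.w"], "outputs": ["3a.s"]},
    {"proc": "reslice", "args": {}, "inputs": ["5a.w"], "outputs": ["5a.s"]},
    {"proc": "softmean", "args": {}, "inputs": ["3a.s", "5a.s"], "outputs": ["atlas"]},
]


def calls_with_argument(proc: str, key: str, value: str) -> List[Dict]:
    """Query by actual arguments: calls of `proc` where args[key] == value."""
    return [c for c in calls if c["proc"] == proc and c["args"].get(key) == value]


def derived_from(dataset: str) -> Set[str]:
    """Basic lineage query: every dataset downstream of `dataset`."""
    found: Set[str] = set()
    frontier = [dataset]
    while frontier:
        current = frontier.pop()
        for call in calls:
            if current in call["inputs"]:
                for out in call["outputs"]:
                    if out not in found:
                        found.add(out)
                        frontier.append(out)
    return found


def upstream_calls(datasets: List[str]) -> List[Dict]:
    """All calls that contributed, directly or indirectly, to the given datasets."""
    seen: Set[int] = set()
    result: List[Dict] = []
    frontier = list(datasets)
    while frontier:
        current = frontier.pop()
        for index, call in enumerate(calls):
            if current in call["outputs"] and index not in seen:
                seen.add(index)
                result.append(call)
                frontier.extend(call["inputs"])
    return result


def softmean_outputs_aligned_with(model: str) -> List[str]:
    """Graph pattern matching: softmean outputs with an upstream align_warp using `model`."""
    outputs: List[str] = []
    for call in calls:
        if call["proc"] != "softmean":
            continue
        if any(
            c["proc"] == "align_warp" and c["args"].get("model") == model
            for c in upstream_calls(call["inputs"])
        ):
            outputs.extend(call["outputs"])
    return outputs


print(sorted(derived_from("5a")))
print(softmean_outputs_aligned_with("affine"))
```

Queries by procedure signature and by annotation would need type and annotation records in addition to the call records shown here.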

Page 31: Services For Science April 2009

Challenges of scale

Number of participants

Volume of data

Diversity of data

Number of data producers

Amount of computation

Page 32: Services For Science April 2009

Hosting and provisioning

People create services (data or function) …

which others discover, decide to use, …

and compose to create a new function ...

which they publish as a new service.

I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!

I hope that this “someone else” can manage security, reliability, scalability, …

“Service-Oriented Science”, Science, 2005

Page 33: Services For Science April 2009

Provisioning for data-intensive workloads

Example: on-demand “stacking” of arbitrary locations within ~10TB sky survey

Challenges:
• Random data access
• Much computing
• Time-varying load

Data diffusion

[Figure: image cutouts from Sloan data summed into a single stacked image]

Ioan Raicu

Page 34: Services For Science April 2009

“Sine” workload, 2M tasks, 10MB:10ms ratio, 100 nodes, GCC policy, 50GB caches/node

Page 35: Services For Science April 2009

Same scenario, but with dynamic resource provisioning

Page 36: Services For Science April 2009

DOCK on BG/P: ~1M Tasks on 118,000 CPUs

CPU cores: 118784
Tasks: 934803
Elapsed time: 7257 sec
Compute time: 21.43 CPU years
Average task time: 667 sec
Relative efficiency: 99.7% (from 16 to 32 racks)
Utilization: sustained 99.6%, overall 78.3%

• GPFS: 1 script (~5KB), 2 file reads (~10KB), 1 file write (~10KB)
• RAM (cached from GPFS on first task per node): 1 binary (~7MB), static input data (~45MB)

[Figure: plot against time (secs)]

Ioan Raicu, Zhao Zhang, Mike Wilde
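The overall utilization reported for the DOCK run can be cross-checked from the other figures on the slide. The sketch below assumes that overall utilization means compute time divided by the CPU time available (cores times elapsed time) and that a CPU year is 365 days; the slide does not state either definition, so treat the result as a rough consistency check.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year

# Figures reported for the run.
cpu_cores = 118784
elapsed_sec = 7257
compute_cpu_years = 21.43

# CPU time that was available versus CPU time spent computing.
available_cpu_sec = cpu_cores * elapsed_sec
used_cpu_sec = compute_cpu_years * SECONDS_PER_YEAR

# Assumed definition of overall utilization.
overall_utilization = used_cpu_sec / available_cpu_sec
print(f"Overall utilization: {overall_utilization:.1%}")
```

Under these assumptions the result comes out at roughly 78%, in line with the reported overall utilization of 78.3%.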

Page 37: Services For Science April 2009

Efficiency relative to no-I/O case for 4 second tasks and varying data size (1KB to 1MB) for CIO and GPFS up to 32K processors

Page 38: Services For Science April 2009

Thanks!

DOE Office of Science

National Science Foundation

National Institutes of Health

Colleagues at Argonne, U.Chicago, USC/ISI, and elsewhere

Page 39: Services For Science April 2009

Knowledge generation as a systems problem

• Many diverse actors
• Complex, often rapidly evolving processes
• Need for scalability in multiple dimensions

With systemic properties:
• Rate of knowledge generation (throughput)
• Time to answer questions (latency)
• Completeness of exploration
• Robustness to errors

SOA as an integrating framework?

Page 40: Services For Science April 2009

Service-oriented science

People create services (data or function) …

which others discover, decide to use, …

and compose to create a new function ...

which they publish as a new service.

I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!

I hope that this “someone else” can manage security, reliability, scalability, …

“Service-Oriented Science”, Science, 2005

Page 41: Services For Science April 2009

People create services (data or function) …

which others discover, decide to use, …

and compose to create a new function ...

which they publish as a new service.

Service-oriented science

Profoundly revolutionary:

Accelerates the pace of enquiry

Introduces a new notion of “result”

Requires new reward structures, training, infrastructure

“Service-Oriented Science”, Science, 2005

Page 42: Services For Science April 2009

And big challenges …

Complexity and semantics

Documentation of results

Scaling in many dimensions

Sociology and incentives