1. eScience -- A Transformed Scientific Method
Jim Gray, eScience Group, Microsoft Research, http://research.microsoft.com/~Gray
in collaboration with Alex Szalay, Dept. of Physics & Astronomy, Johns Hopkins University, http://www.sdss.jhu.edu/~szalay/
2. Talk Goals
- Explain eScience (and what I am doing) &
- Recommend CSTB foster tools for
- data capture (lab info management systems)
- data curation (schemas, ontologies, provenance)
- data analysis (workflow, algorithms, databases, data visualization)
- data+doc publication (active docs, data-doc integration)
- peer review (editorial services)
- access (doc + data archives and overlay journals)
- scholarly communication (wikis for each article and dataset)
3. eScience: What is it?
- Synthesis of information technology and science
- Science methods are evolving (tools)
- Science is being codified/objectified: how do we represent scientific information and knowledge in computers?
- Science faces a data deluge: how do we manage and analyze the information?
- Scientific communication is changing
  - publishing data & literature (curation, access, preservation)
4. Science Paradigms
- Thousand years ago: science was empirical
  - describing natural phenomena
- Last few hundred years: a theoretical branch
  - using models, generalizations
- Last few decades: a computational branch
  - simulating complex phenomena
- Today: data exploration (eScience)
  - unify theory, experiment, and simulation
  - data captured by instruments or generated by simulators
  - information/knowledge stored in computers
  - scientists analyze databases/files using data management and statistics
5. X-Info
- The evolution of X-Info and Comp-X for each discipline X
- How to codify and represent our knowledge
- Building and executing models
- Integrating data and Literature
- Curation and long-term preservation
[Diagram: the generic problems -- facts and questions flowing among Experiments & Instruments, Simulations, Literature, and Other Archives]
6. Experiment Budgets: Software
- Millions of lines of code
- Repeated for experiment after experiment
- Not much sharing or learning
7. Experiment Budgets Software
- Millions of lines of code
- Repeated for experiment after experiment
- Not much sharing or learning
Action item: Foster tools and foster tool support
8. Project Pyramids
- In most disciplines there are a few giga-projects, several mega-consortia, and then many small labs.
- Often some instrument creates the need for a giga- or mega-project: polar station, accelerator, telescope, remote sensor, genome sequencer, supercomputer.
- Tier 1, 2, 3 facilities to use the instrument + data: international, multi-campus, single lab.
Lab 9. Pyramid Funding
- Giga-projects need giga-funding (Major Research Equipment grants)
- Need projects at all scales
  - computing example: supercomputers + departmental clusters + lab clusters
- Fully fund giga-projects; partially fund smaller projects so they get matching funds from other sources
- "Petascale Computational Systems: Balanced Cyber-Infrastructure in a Data-Centric World," IEEE Computer, Vol. 39, No. 1, pp. 110-112, January 2006.
10. Action item: Invest in tools at all levels
11. Need Lab Info Management Systems (LIMSs)
- Pipeline instrument + simulator data to archive & publish to the web
- NASA: Level 0 (raw) data -> Level 1 (calibrated) -> Level 2 (derived)
- Needs a workflow tool to manage the pipeline (see the sketch after this slide)
- Examples: SDSS, LifeUnderYourFeet, MBARI Shore Side Data System
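Below is a minimal sketch, in Python, of the Level 0 -> Level 1 -> Level 2 pipeline idea. The calibration constants, the derived quantity, and the archive layout are invented for illustration; they are not the actual SDSS or MBARI processing.

    # Sketch of a three-stage instrument pipeline (Level 0 -> 1 -> 2).
    # The calibration and derivation steps are placeholders; a real LIMS
    # would also track provenance, schema, and failures for every step.
    import json, pathlib

    def calibrate(raw):                      # Level 0 -> Level 1
        gain, offset = 1.7, -0.2             # assumed instrument constants
        return [gain * x + offset for x in raw]

    def derive(calibrated):                  # Level 1 -> Level 2
        return {"mean": sum(calibrated) / len(calibrated),
                "n": len(calibrated)}

    def run_pipeline(raw, outdir="archive"):
        out = pathlib.Path(outdir)
        out.mkdir(exist_ok=True)
        level1 = calibrate(raw)
        level2 = derive(level1)
        # publish each level plus minimal provenance to the archive
        (out / "level1.json").write_text(json.dumps(level1))
        (out / "level2.json").write_text(json.dumps(level2))
        (out / "provenance.json").write_text(json.dumps(
            {"steps": ["calibrate", "derive"], "inputs": len(raw)}))

    run_pipeline([10.1, 10.4, 9.9, 10.2])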
12. Need Lab Info Management Systems (LIMSs)
- Pipeline instrument + simulator data to archive & publish to the web
- NASA: Level 0 (raw) data -> Level 1 (calibrated) -> Level 2 (derived)
- Needs a workflow tool to manage the pipeline
- Examples: SDSS, LifeUnderYourFeet, MBARI Shore Side Data System
Action item: Foster generic LIMS
13. Science Needs Info Management
- Simulators produce lots of data
- Experiments produce lots of data
  - each simulation run produces a file
  - each instrument-day produces a file
  - each process step produces a file
  - files have descriptive names
  - files have similar formats (described elsewhere)
- Projects have millions of files (or soon will)
- No easy way to manage or analyze the data (see the catalog sketch after this slide)
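One way to get a handle on millions of descriptively named files is to pull the metadata out of the names into a queryable catalog. A minimal sketch follows; the naming convention (instrument_YYYYMMDD_run.dat) is an assumption for illustration.

    # Sketch: index descriptively named data files into a small catalog so
    # they can be queried instead of walked. The naming convention is made up.
    import re, sqlite3, pathlib

    PATTERN = re.compile(r"(?P<instrument>\w+)_(?P<date>\d{8})_(?P<run>\d+)\.dat$")

    def build_catalog(data_dir, db_path="catalog.db"):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS files"
                    "(path TEXT, instrument TEXT, date TEXT, run INTEGER)")
        for p in pathlib.Path(data_dir).glob("*.dat"):
            m = PATTERN.search(p.name)
            if m:
                con.execute("INSERT INTO files VALUES (?,?,?,?)",
                            (str(p), m["instrument"], m["date"], int(m["run"])))
        con.commit()
        return con

    # e.g. con = build_catalog("raw_data")
    #      con.execute("SELECT COUNT(*) FROM files WHERE instrument='ccd1'").fetchone()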
14. Data Analysis
- Needles in haystacks: the Higgs particle
- Haystacks: dark matter, dark energy
- Needles are easier than haystacks
- Global statistics have poor scaling
  - correlation functions are N^2, likelihood techniques N^3
- Must accept approximate answers: new algorithms (see the sketch after this slide)
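To make the N^2 scaling concrete, here is a toy sketch: an exact pair-count statistic versus a randomly subsampled approximation. The estimator and distances are illustrative, not the algorithms the surveys actually use.

    # Toy illustration of why exact global statistics scale as N^2:
    # counting pairs closer than r requires examining every pair, while a
    # random sample of pairs gives an approximate answer at a fraction of the cost.
    import random

    def pair_fraction(points, r):
        """Exact fraction of pairs within distance r -- O(N^2) work."""
        n, close = len(points), 0
        for i in range(n):
            for j in range(i + 1, n):
                if abs(points[i] - points[j]) < r:
                    close += 1
        return close / (n * (n - 1) / 2)

    def approx_pair_fraction(points, r, samples=1000):
        """Approximate the same statistic from a random sample of pairs."""
        hits = 0
        for _ in range(samples):
            i, j = random.sample(range(len(points)), 2)
            hits += abs(points[i] - points[j]) < r
        return hits / samples

    data = [random.random() for _ in range(2000)]
    print(pair_fraction(data, 0.01), approx_pair_fraction(data, 0.01))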
15. Analysis and Databases
- Much statistical analysis deals with:
  - assembling relevant subsets
  - counting and building histograms
  - generating Monte Carlo subsets
- Traditionally performed on files
- These tasks are better done in a structured store (see the sketch after this slide)
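A minimal sketch of "do the counting in the structured store": SQLite stands in for a real science database, and the table and column names are made up.

    # Sketch: counting/histogramming inside the database instead of pulling
    # rows into files. SQLite stands in for a real archive database; the
    # 'objects' table and its columns are illustrative only.
    import random, sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, magnitude REAL)")
    con.executemany("INSERT INTO objects (magnitude) VALUES (?)",
                    [(random.gauss(20.0, 1.5),) for _ in range(100_000)])

    # Histogram of magnitudes in 0.5-mag bins, computed entirely in SQL.
    histogram = con.execute("""
        SELECT CAST(magnitude / 0.5 AS INTEGER) * 0.5 AS bin, COUNT(*)
        FROM objects
        WHERE magnitude BETWEEN 15 AND 25
        GROUP BY bin ORDER BY bin
    """).fetchall()

    # A random (Monte Carlo style) subset, also selected server-side.
    subset = con.execute(
        "SELECT id, magnitude FROM objects ORDER BY RANDOM() LIMIT 100").fetchall()
    print(histogram[:3], len(subset))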
16. Data Delivery: Hitting a Wall
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years
- Oh, and 1 PB is ~4,000 disks
- At some point you need indices to limit the search, and parallel data search and analysis
- This is where databases can help
- You can FTP 1 MB in 1 sec
FTP and GREP are not adequate (the arithmetic is sketched after this slide)
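The back-of-the-envelope arithmetic behind these numbers, assuming a single sequential stream of roughly 10 MB/s (a rough 2006-era per-stream figure; the exact rate is an assumption):

    # Back-of-the-envelope scan times: how long a single sequential pass over
    # a dataset takes at an assumed per-stream rate of ~10 MB/s.
    RATE_MB_PER_S = 10

    def scan_time(size_mb):
        seconds = size_mb / RATE_MB_PER_S
        for unit, span in (("years", 365 * 86400), ("days", 86400),
                           ("minutes", 60), ("seconds", 1)):
            if seconds >= span:
                return f"{seconds / span:.1f} {unit}"
        return f"{seconds:.1f} seconds"

    for label, size_mb in [("1 MB", 1), ("1 GB", 1_000),
                           ("1 TB", 1_000_000), ("1 PB", 1_000_000_000)]:
        print(f"{label}: {scan_time(size_mb)}")
    # 1 PB at 10 MB/s is roughly 3 years -- hence indices and parallel,
    # in-database search rather than GREP/FTP over the whole archive.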
17. Accessing Data
- If there is too much data to move around, take the analysis to the data!
- Do all data manipulations at the database
  - build custom procedures and functions in the database (see the sketch after this slide)
- Automatic parallelism guaranteed
- Easy to build in custom functionality
  - databases & procedures are being unified
  - example: temporal and spatial indexing
- Easy to reorganize the data
  - multiple views, each optimal for certain analyses
  - building hierarchical summaries is trivial
- Scalable to petabyte datasets
Active databases!
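A minimal sketch of "procedures inside the database": register a user-defined function with the engine so the computation runs next to the data. SQLite stands in for a real server; the great-circle distance function and the 'stars' table are illustrative assumptions, not SkyServer's actual spatial procedures.

    # Sketch of "take the analysis to the data": a user-defined function
    # registered inside the database engine filters rows where they live.
    import math, sqlite3

    def angular_distance(ra1, dec1, ra2, dec2):
        """Rough angular separation in degrees between two sky positions."""
        ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
        cosd = (math.sin(dec1) * math.sin(dec2) +
                math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cosd))))

    con = sqlite3.connect(":memory:")
    con.create_function("ang_dist", 4, angular_distance)
    con.execute("CREATE TABLE stars (ra REAL, dec REAL, mag REAL)")
    con.executemany("INSERT INTO stars VALUES (?,?,?)",
                    [(180.1, 2.0, 19.5), (180.2, 2.1, 20.1), (10.0, -5.0, 18.0)])

    # The filter runs inside the engine -- only matching rows come back.
    nearby = con.execute(
        "SELECT ra, dec, mag FROM stars WHERE ang_dist(ra, dec, 180.0, 2.0) < 0.5"
    ).fetchall()
    print(nearby)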
18. Analysis and Databases
- Much statistical analysis deals with:
  - assembling relevant subsets
  - counting and building histograms
  - generating Monte Carlo subsets
- Traditionally performed on files
- These tasks are better done in a structured store
Action item: Foster data management, data analysis, and data visualization algorithms & tools
19. Let 100 Flowers Bloom
- Comp-X has some nice tools
- These tools grew from the community
- It's HARD to see a common pattern
  - Linux vs. FreeBSD: why was Linux more successful? Community, personality, timing, ...?
- Lesson: let 100 flowers bloom.
20. Talk Goals
- Explain eScience (and what I am doing) &
- Recommend CSTB foster tools for
- data capture (lab info management systems)
- data curation (schemas, ontologies, provenance)
- data analysis (workflow, algorithms, databases, data visualization)
- data+doc publication (active docs, data-doc integration)
- peer review (editorial services)
- access (doc + data archives and overlay journals)
- scholarly communication (wikis for each article and dataset)
21. All Scientific Data Online
- Many disciplines overlap and use data from other sciences.
- The Internet can unify all literature and data
- Go from literature to computation to data and back to literature.
- Information at your fingertips, for everyone, everywhere
- Increase scientific information velocity
- Huge increase in science productivity
[Diagram: Raw Data -> Derived and Recombined Data -> Literature]
22. Unlocking Peer-Reviewed Literature
- Agencies and foundations are mandating that research be public domain
  - NIH ($30B/yr, 40k PIs) (see http://www.taxpayeraccess.org/)
  - Japan, China, Italy, South Africa, ...
  - Public Library of Science ...
- Other agencies will follow NIH
23. How Does the New Library Work?
- Who pays for storage/access (unfunded mandate)?
  - it's cheap: ~1 milli-dollar per access
- But curation is not cheap:
  - author/title/subject/citation/...
  - NLM has a 6,000-line XSD for documents: http://dtd.nlm.nih.gov/publishing
- Need to capture document structure from the author
  - sections, figures, equations, citations, ...
  - NCBI-PubMedCentral is doing this
  - preparing for 1M articles/year
24. Pub Med Central International
- Information at your fingertips
- Deployed in the US, China, England, Italy, South Africa, Japan
- UK PMCI: http://ukpmc.ac.uk/
- Each site can accept documents
- Federate through web services
- Working to integrate Word/Excel/... with PubMedCentral, e.g. WordML, XSD, ...
- To be clear: NCBI is doing 99.99% of the work.
25. Overlay Journals
- Articles and data in public archives
- Journal title page in a public archive
- All covered by Creative Commons license
  - http://creativecommons.org/
[Diagram: data archives holding articles and data sets]
26. Overlay Journals
- Articles and data in public archives
- Journal title page in a public archive
- All covered by Creative Commons license
  - http://creativecommons.org/
[Diagram: a journal management system over data archives holding articles, title pages, and data sets]
27. Overlay Journals
- Articles and data in public archives
- Journal title page in a public archive
- All covered by Creative Commons license
  - http://creativecommons.org/
[Diagram: journal management and collaboration systems over data archives holding articles, title pages, comments, and data sets]
28. Overlay Journals
- Articles and data in public archives
- Journal title page in a public archive
- All covered by Creative Commons license
  - http://creativecommons.org/
[Diagram: journal management and collaboration systems over data archives holding articles, title pages, comments, and data sets]
Action item: Do for other sciences what NLM has done for bio with GenBank-PubMedCentral
29. Better Authoring Tools
- Extend authoring tools to:
  - capture document metadata (NLM tag set)
  - represent documents in a standard format
  - make active documents (words and data)
30. Conference Management Tool
- Currently a conference peer-review system (~300 conferences)
31. Publishing Peer Review
- Improve the author-reader experience
- Capture the classroom: ConferenceXP
- Moderated discussions of published articles
32. Why Not a Wiki?
- A wiki is completely transparent
- But peer review is different: there is a degree of confidentiality
- And, incidentally, review of proposals and projects is more like peer review
- Let's have a moderated wiki for published literature: PLoS ONE is doing this
33. Why Not a Wiki?
- A wiki is completely transparent
- But peer review is different: there is a degree of confidentiality
- And, incidentally, review of proposals and projects is more like peer review
- Let's have a moderated wiki for published literature: PLoS ONE is doing this
Action item: Foster new document authoring and publication models and tools
34. So What about Publishing Data?
- How precise? How accurate? (42.5 ± .01)
- Show your work: data provenance
35. Thought Experiment
- You have collected some data and want to publish science based on it.
- How do you publish the data so that others can read it and reproduce your results in 100 years?
  - How do you document the collection process?
  - How do you document the data processing (scrubbing & reducing the data)?
36. Objectifying Knowledge
- This requires agreement about:
  - measurements: who/what/when/where/how
  - what's a planet, star, galaxy, ...?
  - what's a gene, protein, pathway?
- Need to objectify science:
  - what are the methods (in the OO sense)? (see the sketch after this slide)
- This is mostly physics/bio/eco/econ/..., but CS can do generic things
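A toy illustration of "methods in the OO sense": a class whose attributes are the agreed-upon measurement vocabulary and whose methods are the operations a community agrees on. The fields and the single method here are invented for illustration, not a real ontology.

    # Toy illustration of "objectifying" scientific knowledge: the agreed
    # vocabulary becomes attributes, the agreed operations become methods.
    from dataclasses import dataclass

    @dataclass
    class Observation:
        who: str          # observer or instrument
        what: str         # controlled-vocabulary term, e.g. "galaxy"
        when: str         # ISO timestamp
        where: tuple      # e.g. (ra_deg, dec_deg), by convention
        how: str          # instrument / method descriptor
        value: float
        uncertainty: float

        def is_consistent_with(self, other: "Observation", k: float = 2.0) -> bool:
            """Do two measurements of the same kind of thing agree within k sigma?"""
            return (self.what == other.what and
                    abs(self.value - other.value)
                    <= k * (self.uncertainty + other.uncertainty))

    a = Observation("SDSS", "galaxy", "2006-01-01T00:00Z", (180.0, 2.0), "photometry", 19.5, 0.1)
    b = Observation("2MASS", "galaxy", "2006-01-02T00:00Z", (180.0, 2.0), "photometry", 19.7, 0.1)
    print(a.is_consistent_with(b))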
37. Objectifying Knowledge
- This requires agreement about:
  - measurements: who/what/when/where/how
  - what's a planet, star, galaxy, ...?
  - what's a gene, protein, pathway?
- Need to objectify science:
  - what are the methods (in the OO sense)?
- This is mostly physics/bio/eco/econ/..., but CS can do generic things
Warning! Painful discussions ahead: the O word (Ontology), the S word (Schema), the CV words (Controlled Vocabulary). Domain experts do not agree.
38. The Best Example: Entrez-GenBank, http://www.ncbi.nlm.nih.gov/
- Sequence data is deposited with GenBank
- The literature references the GenBank ID
- Entrez integrates and searches
[Diagram: Entrez links nucleotide sequences, protein sequences, taxon/phylogeny, MMDB 3-D structure, PubMed abstracts, and complete genomes, fed by publishers and genome centers]
39. Publishing Data
- Projects last at least 3-5 years
- Data is sent upwards only at the end of the project
- Data will never be centralized
- More responsibility on projects
  - becoming publishers and curators
- Data will reside with projects
  - analyses must be close to the data
Roles:        Authors          Publishers         Curators          Consumers
Traditional:  Scientists       Journals           Libraries         Scientists
Emerging:     Collaborations   Project www site   Bigger archives   Scientists
40. Data Pyramid
- Very extended distribution of data sets:
- Most datasets are small, and manually maintained (Excel
spreadsheets)
- Total volume dominated by multi-TB archives
- But, small datasets have real value
- Most data is born digital: collected via electronic sensors or generated by simulators.
41. Data Sharing/Publishing
- What is the business model (reward/career benefit)?
- Three tiers (power law!):
  - (b) value-added, refereed products
  - (c) ad-hoc data, on-line sensors, images, outreach info
- Need a journal for data to solve (b)
- Need a VO-Flickr (a simple interface) for (c)
- Mashups are emerging in science
- Need an integrated environment for virtual excursions for education (C. Wong)
42. The Best Example: Entrez-GenBank, http://www.ncbi.nlm.nih.gov/
- Sequence data is deposited with GenBank
- The literature references the GenBank ID
- Entrez integrates and searches
Action item: Foster digital data libraries (not metadata, real data) and their integration with the literature
[Diagram: Entrez links nucleotide sequences, protein sequences, taxon/phylogeny, MMDB 3-D structure, PubMed abstracts, and complete genomes, fed by publishers and genome centers]
43. Talk Goals
43. Talk Goals
- Explain eScience (and what I am doing) &
- Recommend CSTB foster tools for
- data capture (lab info management systems)
- data curation (schemas, ontologies, provenance)
- data analysis (workflow, algorithms, databases, data visualization)
- data+doc publication (active docs, data-doc integration)
- peer review (editorial services)
- access (doc + data archives and overlay journals)
- scholarly communication (wikis for each article and dataset)
44. Backup
45. Astronomy
- Help build the World-Wide Telescope
  - all astronomy data and literature online and cross-indexed
  - tools to analyze the data
- OpenSkyQuery: a federation of ~20 observatories
  - it works and is used every day
  - spatial extensions in SQL Server 2005
  - a good example of a data grid
  - good examples of web services
46. World Wide Telescope Virtual Observatory
http://www.us-vo.org/ http://www.ivoa.net/
- Premise: most data is (or could be) online
- So, the Internet is the world's best telescope:
  - it has data on every part of the sky
  - in every measured spectral band: optical, X-ray, radio, ...
  - as deep as the best instruments (2 years ago)
  - it is up when you are up, and the seeing is always great (no working at night, no clouds, no moons, no ...)
  - it's a smart telescope: it links objects and data to the literature on them
47. Why Astronomy Data?
- It has no commercial value
  - can freely share results with others
  - great for experimenting with algorithms
- It is real and well documented
  - high-dimensional data (with confidence intervals)
- Many different instruments from many different places and many different times
- There is a lot of it (petabytes)
[Image panels: IRAS 100, ROSAT ~keV, DSS optical, 2MASS 2, IRAS 25, NVSS 20cm, WENSS 92cm, GB 6cm]
48. Time and Spectral Dimensions: The Multiwavelength Crab Nebula
- X-ray, optical, infrared, and radio views of the nearby Crab Nebula, now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers.
- Slide courtesy of Robert Brunner @ Caltech.
49. SkyServer.SDSS.org
- Access to the Sloan Digital Sky Survey spectroscopic and optical surveys
  - raw pixel data lives in file servers
  - catalog data (derived objects) lives in a database
  - online query to any and all
- 150 hours of online astronomy
  - implicitly teaches data analysis
- Client query interface via a Java applet
- Query from Emacs, Python, ... (see the sketch after this slide)
- Cloned by other surveys (a template design)
- Web services are the core of it
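A sketch of "query from Python": send an SQL query to a SkyServer-style HTTP endpoint and read the CSV reply. The URL below is a placeholder for whatever SQL-search web service the archive exposes, and the table and column names are illustrative.

    # Sketch: SQL over HTTP against a SkyServer-style search endpoint.
    import csv, io, urllib.parse, urllib.request

    SQL_ENDPOINT = "http://skyserver.example.org/x_sql"   # placeholder URL

    def run_query(sql):
        url = SQL_ENDPOINT + "?" + urllib.parse.urlencode(
            {"cmd": sql, "format": "csv"})
        with urllib.request.urlopen(url) as response:
            text = response.read().decode("utf-8")
        return list(csv.DictReader(io.StringIO(text)))

    rows = run_query(
        "SELECT TOP 10 objID, ra, dec, r FROM PhotoObj WHERE r BETWEEN 18 AND 18.1")
    for row in rows:
        print(row)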
50. SkyServer SkyServer.SDSS.org
- Like TerraServer, but looking the other way: a picture of the universe
- Sloan Digital Sky Survey data: pixels + data mining
- About 400 attributes per object
- Spectrograms for 1% of objects
51. Demo of SkyServer
- Shows standard web server
- Explore sets of objects (data mining)
52. SkyQuery ( http://skyquery.net/ )
- Distributed Query tool using a set of web services
- Many astronomy archives from Pasadena, Chicago, Baltimore, and Cambridge (England)
- Has grown from 4 to 15 archives, now becoming an international standard
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o, t) < 2
53. SkyQuery Structure
[Diagram: the SkyQuery portal federating the 2MASS, INT, SDSS, and FIRST archives, plus an image cutout service]
54. Schema (aka Metadata)
- Everyone starts with the same schema; then they start arguing about semantics.
- Virtual Observatory: http://www.ivoa.net/
- Metadata based on Dublin Core: http://www.ivoa.net/Documents/latest/RM.html
- Universal Content Descriptors (UCDs): http://vizier.u-strasbg.fr/doc/UCD.htx
  - capture quantitative concepts and their units
  - reduced from ~100,000 tables in the literature to ~1,000 terms
- VOTable, a schema for answers to questions: http://www.us-vo.org/VOTable/
- Common queries: Cone Search and Simple Image Access Protocol, SQL (see the sketch after this slide)
- Registry: http://www.ivoa.net/Documents/latest/RMExp.html (still a work in progress)
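A sketch of a VO-style cone search call: an HTTP GET with RA, DEC, and a search radius (SR) in degrees, returning a VOTable (XML) of matching objects. The service URL is a placeholder; real endpoints are listed in the VO registries, and the VOTable namespace version varies by service.

    # Sketch: VO cone search = HTTP GET with RA/DEC/SR, VOTable XML back.
    import urllib.parse, urllib.request
    import xml.etree.ElementTree as ET

    CONE_SEARCH_URL = "http://vo.example.org/conesearch"   # placeholder

    def cone_search(ra_deg, dec_deg, radius_deg):
        url = CONE_SEARCH_URL + "?" + urllib.parse.urlencode(
            {"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
        with urllib.request.urlopen(url) as response:
            votable = ET.parse(response)
        # Count table rows (<TR> elements); namespace version is assumed.
        rows = votable.getroot().iter("{http://www.ivoa.net/xml/VOTable/v1.1}TR")
        return sum(1 for _ in rows)

    # e.g. print(cone_search(180.0, 2.0, 0.1), "objects within 0.1 degrees")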
55. SkyServer/SkyQuery Evolution: MyDB and Batch Jobs
- Problem: need multi-step data analysis (not just a single query)
- Solution: allow personal databases on the portal
- Problem: some queries are monsters
- Solution: batch scheduling on the portal, which deposits the answer in the personal database (see the sketch after this slide)
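A minimal sketch of the MyDB pattern: a long query materializes its result in a personal scratch table, and later steps query that table instead of the full archive. SQLite stands in for both databases, and the tables are invented for illustration.

    # Sketch of the MyDB / batch-job pattern: select once from the archive,
    # deposit into a personal database, then iterate on the small table.
    import random, sqlite3

    archive = sqlite3.connect(":memory:")
    archive.execute("CREATE TABLE photo (objid INTEGER, ra REAL, dec REAL, r REAL)")
    archive.executemany("INSERT INTO photo VALUES (?,?,?,?)",
                        [(i, random.uniform(0, 360), random.uniform(-90, 90),
                          random.gauss(20, 2)) for i in range(50_000)])

    mydb = sqlite3.connect("mydb.sqlite")        # the personal database
    mydb.execute("DROP TABLE IF EXISTS bright")
    mydb.execute("CREATE TABLE bright (objid INTEGER, ra REAL, dec REAL, r REAL)")

    # Step 1 (the "batch job"): run the expensive query, deposit into MyDB.
    rows = archive.execute("SELECT * FROM photo WHERE r < 18").fetchall()
    mydb.executemany("INSERT INTO bright VALUES (?,?,?,?)", rows)
    mydb.commit()

    # Step 2..n: follow-up analysis hits only the small personal table.
    print(mydb.execute("SELECT COUNT(*), AVG(r) FROM bright").fetchone())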
56. Ecosystem Sensor Net LifeUnderYourFeet.Org
- Small sensor net monitoring soil
- Sensors feed to a database
- Helping build a system to collect & organize the data
- Working on data analysis tools
- Prototype for other LIMSs (Laboratory Information Management Systems)
57. RNA Structural Genomics
- Goal: predict secondary and tertiary structure from sequence; deduce the tree of life
- Technique: analyze sequence variations sharing a common structure across the tree of life
- Representing structurally aligned sequences is a key challenge
- Creating a database-driven alignment workbench accessing public and private sequence data
58. VHA Health Informatics
- VHA: the largest standardized electronic medical records system in the US
- Design, populate, and tune a ~20 TB data warehouse and analytics environment
- Evaluate population health and treatment outcomes
- Support epidemiological studies
  - 1 billionth vital sign loaded in April 2006
  - 30 minutes to a population-wide obesity analysis (next slide; see the BMI sketch after this slide)
  - discovered seasonality in blood pressure (NEJM, fall 2006)
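The calculation underneath that analysis is just the BMI formula, BMI = weight (kg) / height (m)^2, applied to the vitals and bucketed with the standard WHO cut-offs. A minimal sketch; the patient records shown are made up, and the real analysis ran in SQL over the ~20 TB warehouse.

    # Minimal sketch of the vitals-based BMI calculation and bucketing.
    def bmi(weight_kg, height_m):
        return weight_kg / (height_m ** 2)

    def bmi_category(value):
        if value < 18.5:
            return "underweight"
        if value < 25:
            return "normal"
        if value < 30:
            return "overweight"
        return "obese"

    patients = [("A", 58.0, 1.62), ("B", 82.0, 1.75), ("C", 104.0, 1.80)]
    counts = {}
    for pid, kg, m in patients:
        cat = bmi_category(bmi(kg, m))
        counts[cat] = counts.get(cat, 0) + 1
    print(counts)   # e.g. {'normal': 1, 'overweight': 1, 'obese': 1}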
59. HDR Vitals-Based Body Mass Index Calculation on the VHA FY04 Population
- Source: VHA Corporate Data Warehouse
- Patient counts by BMI group: 23,876 (0.7%), 701,089 (21.6%), 1,177,093 (36.2%), 1,347,098 (41.5%); total patients: 3,249,156 (100%)