Hypothesis-Driven Data Analysis of Science Repositories. Yolanda Gil, Information Sciences Institute and Department of Computer Science, University of Southern California. http://www.isi.edu/~gil @yolandagil [email protected]
Hypothesis-Driven Data Analysis of Science Repositories · AI’s Coming of Age ... Linked Data and Entities: Semantic Web Objects as RDF + URIs ... Linked Data and Linked Knowledge

May 10, 2018

1

Hypothesis-Driven Data Analysis of Science Repositories

Yolanda Gil

Information Sciences Institute and Department of Computer Science

University of Southern California

http://www.isi.edu/~gil

@yolandagil

[email protected]

2

■ WINGS contributors: Varun Ratnakar, Daniel Garijo, Rajiv Mayani, Ricky Sethi, Hyunjoon Jo, Jihie Kim, Yan Liu, Dave Kale (USC), Ralph Bergmann (U Trier), William Cheung (HKBU), Oscar Corcho (UPM), Pedro Gonzalez & Gonzalo Castro (UCM), Paul Groth (VUA)

■ WINGS collaborators: Chris Mattmann & Paul Ramirez & Dan Crichton & Rishi Verma (JPL), Ewa Deelman & Gaurang Mehta & Karan Vahi (USC), Sofus Macskassy (ISI), Natalia Villanueva & Ari Kassin (UTEP)

■ Organic Data Science contributors: Felix Michel and Matheus Hauder (TUM); Varun Ratnakar (ISI); Chris Duffy (PSU); Paul Hanson, Hilary Dugan, Craig Snortheim (U Wisconsin); Jordan Read (USGS); Neda Jahanshad (USC), Julien Emile-Geay (USC), Nick McKay (NAU)

■ DISK contributors: Parag Mallick, Ravali Adusumilli, Hunter Boyce (Stanford); Suzanne Pierce, John Gentle (UT Austin)

■ Biomedicine: Parag Mallick & Ravali Adusumilli & Hunter Boyce (Stanford U.), Phil Bourne & Sarah Kinnings (UCSD), Chris Mason (Cornell); Joel Saltz & Tahsin Kurc (Emory U.); Jill Mesirov & Michael Reich (Broad); Randall Wetzel (CHLA); Shannon McWeeney & Christina Zhang (OHSU)

■ Geosciences: Suzanne Pierce & John Gentle (UT Austin), Chris Duffy (PSU); Paul Hanson (U Wisconsin); Tom Harmon & Sandra Villamizar (U Merced); Tom Jordan & Phil Maechling (USC); Kim Olsen (SDSU)

■ And many others!

Acknowledgments

3

Intelligent Systems for Geosciences

http://www.IS-GEO.org

NSF Workshop on Discovery Informatics

February 2-3, 2012

Arlington, VA

Final Workshop Report

August 31, 2012

This workshop was sponsored by the Division of Information and Intelligent Systems of the Directorate for Computer and Information Sciences at the National Science Foundation under grant number IIS-1151951.

Community Workshops

4

Outline

■ Artificial intelligence and scientific discovery
• The knowledge systems tradition

■ Our recent work on capturing knowledge about data analysis strategies
• Hypothesis-driven data analysis

■ Representing and capturing data analysis knowledge
• About data, software, methods, meta-analysis

■ Summary of AI challenges

5

Outline

■ Artificial intelligence and scientific discovery
• The knowledge systems tradition

■ Our recent work on capturing knowledge about data analysis strategies
• Hypothesis-driven data analysis

■ Representing and capturing data analysis knowledge
• About data, software, methods, meta-analysis

■ Summary of AI challenges

6

AI’s Coming of Age

IBM Watson

Google Knowledge Graph

Apple Siri

RoboCup Soccer

Tesla AutoPilot

Netflix Recommenders

https://en.wikipedia.org/wiki/Watson_(computer)#/media/File:IBM_Watson.PNG
https://en.wikipedia.org/wiki/Siri#/media/File:SirioniOS9.png
https://commons.wikimedia.org/wiki/File:Google_Knowledge_Panel.png
https://commons.wikimedia.org/wiki/File:13-06-28-robocup-eindhoven-005.jpg
http://www.greencarreports.com/news/1100482_tesla-autopilot-the-10-most-important-things-you-need-to-know
https://en.wikipedia.org/wiki/Netflix#/media/File:NetflixDVD.jpg

7

AI’s Coming of Age

IBM Watson

Google Knowledge Graph

Apple Siri

Robot Teams

Autonomous Drivers

Recommenders

https://en.wikipedia.org/wiki/Watson_(computer)#/media/File:IBM_Watson.PNG
https://en.wikipedia.org/wiki/Siri#/media/File:SirioniOS9.png
https://commons.wikimedia.org/wiki/File:Google_Knowledge_Panel.png
https://commons.wikimedia.org/wiki/File:13-06-28-robocup-eindhoven-005.jpg
http://www.greencarreports.com/news/1100482_tesla-autopilot-the-10-most-important-things-you-need-to-know
https://en.wikipedia.org/wiki/Netflix#/media/File:NetflixDVD.jpg

Problem Solving Tradition

Data Systems Tradition

8

Artificial Intelligence and Scientific Discovery: The Problem Solving Tradition

Pittsburgh Post-Gazette Archives

9

Outline

■ Artificial intelligence and scientific discovery
• The knowledge systems tradition

■ Our recent work on capturing knowledge about data analysis strategies
• Hypothesis-driven data analysis

■ Representing and capturing data analysis knowledge
• About data, software, methods, meta-analysis

■ Summary of AI challenges

10

DISK: Automated DIscovery of Scientific Knowledge

■ Long-Term Goal: Human directs automated intelligent system to explore hypotheses of interest

• Hypothesis-driven data analysis and discovery
– Systematic and reproducible analyses

• Report of findings with explanations (“Friday” meeting)

■ Approach: Intelligent system that captures common data analysis strategies used by scientists in a domain

• Build on WINGS intelligent workflow system that can adapt data analysis given the constraints of algorithmic steps

Work with P. Mallick, R. Adusumilli, H. Boyce (Stanford U.)

11

Scientific Data Analysis Today: Inefficient, Incomplete, Irreproducible, One-Shot

Hypothesis

Meta-analysis of results

Run workflows (methods)

Formulate line of inquiry (select method)

New data

12

Capturing Data Analysis Strategies through Lines of Inquiry

[Figure: analysis cycle with the labels Hypothesis, Formulate line of inquiry (select method), Query to Retrieve Data, Run workflows (methods), Meta-analysis of results, New data, and New Hypothesis. Supporting components: Data Catalog, Workflow Library, Analytic Workflows, Meta-Workflows, Confidence Estimation, Benchmarking.]

Lines of Inquiry: Specify relevant analytic methods (workflows), type of data needed, and how to combine results
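
A line of inquiry, as described on this slide, bundles three things: the analytic workflows to run, the type of data they need, and how to combine their results. The Python sketch below shows one way such a record could look. The class name, field names, query descriptions, and workflow names are all hypothetical illustrations and are not taken from DISK.

```python
from dataclasses import dataclass, field


@dataclass
class LineOfInquiry:
    """Hypothetical container for one data analysis strategy."""
    hypothesis_pattern: str   # kind of hypothesis this strategy can test
    data_query: str           # description of the type of data needed
    workflows: list = field(default_factory=list)  # analytic methods to run
    meta_workflow: str = ""   # how to combine the workflow results


def matching_lines(hypothesis_kind, library):
    """Return the lines of inquiry whose pattern matches the hypothesis kind."""
    return [loi for loi in library if loi.hypothesis_pattern == hypothesis_kind]


# Illustrative entries only; every name here is made up for this sketch.
library = [
    LineOfInquiry("protein associatedWith disease",
                  "RNA-seq samples for the disease",
                  ["genomic_workflow"], "confidence_estimation"),
    LineOfInquiry("protein associatedWith disease",
                  "RNA-seq and proteomic samples for the disease",
                  ["genomic_workflow", "proteomic_workflow"],
                  "confidence_estimation"),
]

selected = matching_lines("protein associatedWith disease", library)
print([loi.workflows for loi in selected])
```

Keeping the data requirement and the combination step in the same record as the workflows is what would let a system rerun the whole strategy when new data arrives.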

13

Knowledge about Multi-Omics Data Analysis Captured in Workflows

14

Knowledge about Evidence Aggregation Captured in Meta-Workflows

■ After running the workflows, meta-workflows analyze their results and generate a combined confidence value

[Figure: meta-workflow diagram with the nodes Wf 0, Wf 1, Wf 2, simMetrics, comparison, hypothesis, revisedHyp, and hypothesisRevision]
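
The slide says that a meta-workflow turns the results of several workflows into one combined confidence value, but it does not say how. The sketch below assumes, purely for illustration, that each workflow yields a score between 0 and 1 and that the scores are averaged. The function name, the score range, and the averaging rule are assumptions, not the method used in DISK.

```python
def combined_confidence(workflow_scores):
    """Average per-workflow scores in [0, 1] into one confidence value.

    Averaging is an assumption made for this sketch only.
    """
    if not workflow_scores:
        raise ValueError("need at least one workflow result")
    for name, score in workflow_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name} is outside [0, 1]")
    return sum(workflow_scores.values()) / len(workflow_scores)


# Hypothetical scores for the three workflows named in the figure.
scores = {"Wf0": 0.5, "Wf1": 0.75, "Wf2": 1.0}
confidence = combined_confidence(scores)
print(confidence)
```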

15

Knowledge about Data Analytic Strategies Captured in Lines of Inquiry

[Figure: a line of inquiry linking data, workflows (Wf 0, Wf 1, Wf 2), and a meta-workflow (simMetrics, comparison, hypothesis, revisedHyp, hypothesisRevision)]

Hypothesis: Protein PRKCDBP is expressed in samples of patient P36

Revision: PRKCDBP mutation is expressed in P36

16

Representing Hypothesis Evolution as New Data is Available and Analyzed

[Figure: graph of hypotheses H1, H2, and H3 and the lines of inquiry that connect them, with numbered steps 1 to 5.]

Statements: Protein PRKCDBP associatedWith Colon Cancer; Protein PRKCDBP associatedWith Colon Cancer Subtype

Events: Genomic data becomes available; Proteomic data becomes available

Lines of inquiry: Genomic Line of Inquiry (LOI 1); Proteogenomic Line of Inquiry (LOI 2)

Data: RNASEQ data; Proteomic data

Edge labels: QUERY, hasHypothesis, resultingHypothesis, revisionOf, confidenceValue (C1, C2)
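
The hypothesis graph on this slide can be read as a set of subject-predicate-object triples. The sketch below stores such triples as plain Python tuples and follows the revisionOf links back to the original hypothesis. The predicates revisionOf and confidenceValue appear in the figure; the "statement" predicate, and the assignment of statements and confidence values to H1, H2, and H3, are assumptions made for illustration.

```python
# Triples (subject, predicate, object). Which statement and confidence
# value belongs to which hypothesis is assumed, not read from the slide.
triples = [
    ("H1", "statement", "Protein PRKCDBP associatedWith Colon Cancer"),
    ("H2", "statement", "Protein PRKCDBP associatedWith Colon Cancer"),
    ("H2", "confidenceValue", "C1"),
    ("H2", "revisionOf", "H1"),
    ("H3", "statement", "Protein PRKCDBP associatedWith Colon Cancer Subtype"),
    ("H3", "confidenceValue", "C2"),
    ("H3", "revisionOf", "H2"),
]


def revision_history(hypothesis, triples):
    """Follow revisionOf links from a hypothesis back to the original one."""
    parent = {s: o for s, p, o in triples if p == "revisionOf"}
    history = [hypothesis]
    while history[-1] in parent:
        history.append(parent[history[-1]])
    return history


history = revision_history("H3", triples)
print(history)
```

Recording each revision as its own node, rather than overwriting the hypothesis, keeps the earlier statements and their confidence values available for explanation.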

17

Addressing Problems with Current Science Data Analysis Practice

■ Integrative studies of multi-source data are rare
• They require collaborations to cover all the specialized expertise needed, so they take years

■ Data resources are constantly growing
• Scientists rarely re-evaluate evidence for hypotheses as new data becomes available

“Automated Hypothesis Testing with Large Scientific Data Repositories.” Y. Gil, D. Garijo, V. Ratnakar, R. Mayani, R. Adusumilli, H. Boyce, and P. Mallick. Proceedings of the 4th Annual Conference on Advances in Cognitive Systems (ACS), 2016.

“Towards Continuous Scientific Data Analysis and Hypothesis Evolution.” Y. Gil, D. Garijo, V. Ratnakar, R. Mayani, R. Adusumilli, H. Boyce, A. Srivastava, and P. Mallick. To appear in the Proceedings of the 31st Annual Conference of the Association for the Advancement of Artificial Intelligence (AAAI), 2017.

18

Representation Challenges in Hypothesis-Driven Discovery from Scientific Data Repositories

■ A domain-independent framework that works across scientific disciplines

■ Hypothesis representation and evolution

■ Meta-reasoning strategies to select and prioritize analyses

■ Aggregation of evidence from multiple types of data/observations

■ Generation of explanations from provenance records

■ Learning to improve from user feedback

■ Identifying “interestingness” in results

■ Designing new data analysis methods

■ Incorporating science knowledge and theories to guide hypothesis formation

19

Outline

■ Artificial intelligence and scientific discovery
• The knowledge systems tradition

■ Our recent work on capturing knowledge about data analysis strategies
• Hypothesis-driven data analysis

■ Representing and capturing data analysis knowledge
• About data, software, methods, meta-analysis

■ Summary of AI challenges

20

Representing and Capturing Scientific Knowledge

Data

Software

Workflows

Provenance

Meta-Workflows

21

From: http://www.ncdc.noaa.gov/paleo/metadata/noaa-coral-1865.html

{{ #ask: [[Is a::dataset]]
 | ?Domain=geochemistry
 | ?Archive
 | ?MeasurementMaterial
 | ?MeasurementStandard
 | ?MeasurementUnits
}}

Knowledge about Data: Linked Earth

Work with Julien Emile-Geay of USC and Nick McKay of NAU

AI opportunities:
- collection
- normalization
- organization

22

Linked Data and Entities: Semantic Web Objects as RDF + URIs

[Figure: linked entities: Palmyra Atoll, Palmyra Coral 20C, Oxygen-16, Sediment Core, Isotopes]

23

Knowledge about Software: OntoSoft

PIHM PIHMgis DrEICH TauDEM WBMsed

Work with C. Duffy (PSU), C. Mattmann (JPL), S. Peckham (CU), and E. Robinson (ESIP)

AI opportunities:
- functional descriptions
- organization
- linking to data

24

Knowledge About Analytic Methods: WINGS

Knowledge about data and analytics: “This TDT method works for familial genomic datasets with more than 50 families”

AI opportunities:
- abstraction
- mining
- composition
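
The quoted rule about the TDT method is an example of a constraint that ties a method to properties of the data. The sketch below shows how such a rule could be checked against dataset metadata before a workflow runs. The dictionary keys and dataset names are hypothetical, and this is not how WINGS encodes its constraints.

```python
def tdt_applicable(dataset):
    """Check the rule quoted on the slide: familial genomic data with
    more than 50 families. The metadata keys are hypothetical."""
    return (dataset.get("kind") == "familial_genomic"
            and dataset.get("num_families", 0) > 50)


# Made-up dataset descriptions for illustration.
datasets = [
    {"name": "study_a", "kind": "familial_genomic", "num_families": 120},
    {"name": "study_b", "kind": "familial_genomic", "num_families": 30},
    {"name": "study_c", "kind": "case_control", "num_families": 0},
]

usable = [d["name"] for d in datasets if tdt_applicable(d)]
print(usable)
```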

25

Knowledge About Provenance: W3C PROV

[Figure: provenance graph relating SensorData-August2011, SensorData-TimePeriod, Metabolism-August2011, and Metabolism-TimePeriod; values shown: 23, 8, 5, 800]

AI opportunities:
- explanation
- reuse
- updating

26

Knowledge about Meta-Processes: Organic Data Science

Work with P. Hanson (U Wisc) and C. Duffy (PSU)

AI opportunities:
- collaboration
- group formation
- community health

27

Capture and Interlink Scientific Knowledge

[Figure: interlinked entities: Neotoma, Navier-Stokes, Vegetation Estimates, Oxygen-16, Isotopes, Physical sample, DISK, Genomics of crops, Estimate Age of Water, Palmyra Atoll, Palmyra Coral 20C, Sediment Core]

28

Linked Data and Linked Knowledge

[Figure: interlinked entities: Neotoma, Navier-Stokes, Vegetation Estimates, Oxygen-16, Isotopes, Physical sample, DISK, Springflow levels, Estimate Age of Water, Palmyra Atoll, Palmyra Coral 20C, Sediment Core]

AI opportunities:
- interlinking
- analysis
- recommenders

29

Geoscience Paper of the Future

Scientific Paper of the Future

Modern Paper
- Text: Narrative of the method, some data is in tables, figures/plots, and the software used is mentioned

Reproducible Publication
- Data: Include data as supplementary materials and pointers to data repositories
- Software: For data preparation, data analysis, and visualization
- Provenance and methods: Workflow/scripts specifying dataflow, codes, configuration files, parameter settings, and runtime dependencies

Open Science
- Sharing: Deposit data and software (and provenance/workflow) in publicly shared repositories
- Open licenses: Open source licenses for data and software (and provenance/workflow)
- Metadata: Structured descriptions of the characteristics of data and software (and provenance/workflow)

Digital Scholarship
- Persistent identifiers: For data, software, and authors (and provenance/workflow)
- Citations: Citations for data and software (and provenance/workflow)

30

Outline

■ Artificial intelligence and scientific discovery
• The knowledge systems tradition

■ Our recent work on capturing knowledge about data analysis strategies
• Hypothesis-driven data analysis

■ Representing and capturing data analysis knowledge
• About data, software, methods, meta-analysis

■ Summary of AI challenges

31

Artificial Intelligence and Scientific Discovery: Challenges for Knowledge Systems

Metadata:
- collection
- normalization
- organization

Software:
- functional descriptions
- organization
- linking to data

Methods:
- abstraction
- mining
- composition

Provenance:
- explanation
- reuse
- updating

Meta-analysis:
- collaboration
- group formation
- community health

Hypothesis-driven discovery:
- A domain-independent framework across scientific disciplines
- Hypothesis representation and evolution
- Capturing analytic knowledge
- Meta-reasoning strategies to select and prioritize analyses
- Aggregation of evidence from multiple types of data/observations
- Generation of explanations from provenance records
- Learning to improve from user feedback
- Identifying “interestingness” in results
- Designing new data analysis methods
- Incorporating science models to guide hypothesis formation

Knowledge Representation Challenges

Problem Solving Challenges