Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science
University of Southern California
gil@isi.edu
http://www.isi.edu/~gil
Scientific Reproducibility through Semantic Workflows and Shared Provenance Representations
Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming]
CNV Detection
Variant Discovery from Resequencing
Transmission Disequilibrium Test (TDT)
Association Tests
10
Major Features
Workflow system manages set up and execution
• Wings – set up
• Pegasus – execution
Initial collection of workflows captures common genomic analyses
Users can upload their own datasets
• Including collections of datasets
User data is secure
• Not accessible by others
11
Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]
12
Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]
13
Observations about Reproducibility with Workflows [Gil et al, forthcoming]
Effort involved in reproducing results is minor
• 30 seconds to set up a workflow
A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
• Our workflows were independently developed and used “as is”
Semantic representations abstract the analysis method from the software that implements it
• Our workflows used different analytic tools than the original studies
• Many implementations of same algorithm, some proprietary
Semantic constraints can be added to workflows to avoid analysis errors
• Eg: in association analysis workflow, added constraint to remove duplicate individuals initially to avoid problems downstream
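The duplicate-removal constraint described above can be sketched in Python. This is an illustrative sketch only, not the Wings constraint mechanism: the function names (`remove_duplicate_individuals`, `check_no_duplicates`, `run_association_step`) and the `individual_id` field are hypothetical, and the "association step" is a stand-in that only counts individuals.

```python
from typing import Dict, List

Sample = Dict[str, str]


def remove_duplicate_individuals(samples: List[Sample]) -> List[Sample]:
    """Keep the first record seen for each individual_id (hypothetical field)."""
    seen = set()
    unique = []
    for sample in samples:
        key = sample["individual_id"]
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def check_no_duplicates(samples: List[Sample]) -> None:
    """Constraint check: fail early instead of letting duplicates flow downstream."""
    ids = [s["individual_id"] for s in samples]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate individuals in input")


def run_association_step(samples: List[Sample]) -> int:
    """Stand-in for a downstream analysis step; it only counts individuals."""
    check_no_duplicates(samples)
    return len(samples)


samples = [
    {"individual_id": "A1", "status": "case"},
    {"individual_id": "B2", "status": "control"},
    {"individual_id": "A1", "status": "case"},  # duplicate of the first record
]
cleaned = remove_duplicate_individuals(samples)
n_individuals = run_association_step(cleaned)
```

Placing the check at the entry of the downstream step means an input that skipped the deduplication step is rejected immediately rather than producing a subtly wrong result later.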
14
Benefits of Semantic Workflows [Gil JSP-09]
Execution management:
• Automation of workflow execution
• Managing distributed computation
• Managing large data sets
• Security and access control
• Provenance recording
• Low-cost high fidelity reproducibility
Semantics and reasoning:
• User assistance to correctly explore analysis “design space”
• Validation of analyses
• Automated generation of metadata
• Workflow retrieval and discovery
• “Conceptual” reproducibility
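Provenance recording, listed above under execution management, can be illustrated with a minimal Python sketch. It assumes a simplified model in which each workflow step is a function from bytes to bytes; the `ProvenanceRecord` and `ProvenanceLog` classes and the `to_upper` component are hypothetical and do not reflect how Wings or Pegasus store provenance.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, List


def digest(data: bytes) -> str:
    """Content hash used to identify a dataset independently of its file name."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    step: str                 # name of the workflow step
    component: str            # name of the software that implemented it
    input_digests: List[str]  # hashes of the inputs consumed
    output_digest: str        # hash of the output produced


@dataclass
class ProvenanceLog:
    records: List[ProvenanceRecord] = field(default_factory=list)

    def run(self, step: str, component: Callable[[bytes], bytes], data: bytes) -> bytes:
        """Execute one step and record what was used and what was produced."""
        result = component(data)
        self.records.append(
            ProvenanceRecord(step, component.__name__, [digest(data)], digest(result))
        )
        return result


def to_upper(data: bytes) -> bytes:
    """Toy component standing in for a real analysis tool."""
    return data.upper()


log = ProvenanceLog()
out = log.run("normalize", to_upper, b"acgt")
```

Recording the step name separately from the component name mirrors the idea of separating the analysis method from the software that implements it: two runs can share a step while differing in the component.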
15
W3C Provenance Group (Y. Gil, chair): Goals
Provide state-of-the-art understanding and develop a roadmap for development and possible standardization
Articulate requirements for accessing and reasoning about provenance information
• Develop use cases
Identify issues in provenance that are of direct concern to the Semantic Web
• Articulate relationships with other aspects of Web architecture
Report on state-of-the-art work on provenance
Report on a roadmap for provenance in the Semantic Web
• Identify starting points for provenance representations
• Identify elements of a provenance architecture that would benefit from standardization
16
W3C Provenance Group: Products of the Group to Date
Group formed in September 2009, open to new members
• All information is public: http://www.w3.org/2005/Incubator/prov/wiki/
Developed a set of key dimensions for provenance (11/09)
• Grouped into three major categories: content, management, use
Developed use cases for provenance (12/09)
• More than 30 use cases, including ~10 in science but others are relevant
Developed requirements for provenance from use cases (1/10)
• User requirements: what is the purpose of the provenance information
• Technical requirements: derived from the user requirements
Report on “Requirements for Provenance on the Web”
Currently developing state-of-the-art report (expected 6/10)
Started to develop recommendations (expected 9/10)
• Mappings across provenance vocabularies (eg: DC, OPM, SWAN,…)