Metadata in myGrid: Finding Services for in silico Science. Dr Katy Wolstencroft, myGrid, University of Manchester

Jan 15, 2016

Metadata in myGrid: Finding Services for in silico Science

Dr Katy Wolstencroft

myGrid

University of Manchester

……or how to use metadata and semantics to add value in a ‘standards free’ environment

Outline

• Introduction to Taverna, myGrid and myExperiment
• Bioinformatics – use of Web services and other services
• Semantic Service Discovery in myGrid
• myGrid ontology
• Our experiences
• BioCatalogue – bioinformatics service registry

Taverna Workflow Workbench

• Design and execution of workflows

• Access to local and remote resources and analysis tools

• Automation of data flow

• Iteration over large data sets

• Part of the myGrid project

[Figure: myGrid architecture diagram. Labels: Client Applications; Taverna Workbench; GUI; myExperiment Web Interface; Taverna Workflow Enactor; Feta Information Services; LogBook Provenance Management; Service Management; Service Ontology; Provenance Ontology; Resources; 3rd Party Resources (Web Services, Grid Services); Workflow Warehouse; Service / Component Catalogue; Custom Datasets; Default Results; Provenance Warehouse; myGrid]

Lots of Resources

NAR 2008 – over 1000 databases

Where From?

• Over 3500 services available

• Major Service Providers
– European Bioinformatics Institute
– DNA DataBank of Japan
– NCBI – USA

• ‘Boutique’ Services
– Individual research labs producing public data sets
– Specialist tools for niche experiments

What types of services?

• HTML
• WSDL Web Services
• BioMart
• R-processor
• BioMoby
• Soaplab
• Local Java services
• Beanshell
• Workflows

Variable or non-existent documentation or help

Taverna in an ‘open’ world

Advantages
• Connection to lots of resources
• Flexible system
• Can adapt to new technologies

Disadvantages
• Services are developed for other purposes
• We can’t control how they work
• We have to deal with the heterogeneity

Taverna Use

• Users worldwide
• Over 48500 downloads
• Bioinformatics – largest group of users
• Other users from
– astronomy
– chemoinformatics
– health informatics
– systems biology
– social sciences

http://www.genomics.liv.ac.uk/tryps/trypsindex.html

Andy Brass, Steve Kemp, Paul Fisher

• Sleeping Sickness in African Cattle
• Caused by infection by parasite (Trypanosoma brucei)
• Some cattle breeds more resistant than others
• Differences between resistant and susceptible cattle?
• Can we breed cattle resistant to infection?

Fisher et al. (2007). Nucleic Acids Res. 35(16):5625-33

High throughput experiments

Microarray

QTL analysis

Bioinformatics Workflows

• Workflows allow high throughput experiments and automation

• Workflows are encapsulations of experiments

• Workflows developed for one experiment can be reused for others

• Easier to share, reuse and repurpose

The METHODS section of a scientific publication

Workflow Reuse

• Downloaded 836 times
• Viewed 799 times

• Jo Pennock, lab biologist with no bioinformatics experience – Mouse whipworm infection

• Identified no candidate genes in 2 years with manual analysis

• Identified candidate genes in several hours using Paul’s workflow

Workflows are combinations of different services

Locations and descriptions of services required at the design phase

Reusing workflows – need to understand what they do

In Silico Science Life Cycle

Finding Services

When using services, scientists need to:
• Find them – in distributed locations, produced by different host institutions
• Interpret them – what do the services do, and what experiments can they perform using them?
• Know how to invoke them – what data and initial parameters do they need to supply?

We could Google for them…

• If a service is called by the name you expect, you’ll find it
– Search for ‘clustalw’ and ‘web service’

• What if it’s not?
– The clustalw program from EMBOSS is called ‘emma’
– What if it’s the only web service version of clustalw?
– Does it stop you designing your workflow?

Metadata from a WSDL

<wsdl:message name="getGlimmersResponse">
  <wsdl:part name="getGlimmersReturn" type="xsd:string"/>
</wsdl:message>
<wsdl:message name="aboutServiceRequest"/>
<wsdl:message name="getGlimmersRequest">
  <wsdl:part name="in0" type="xsd:string"/>
  <wsdl:part name="in1" type="xsd:string"/>
  <wsdl:part name="in2" type="xsd:string"/>
  <wsdl:part name="in3" type="xsd:string"/>
  <wsdl:part name="in4" type="xsd:string"/>
  <wsdl:part name="in5" type="xsd:string"/>
  <wsdl:part name="in6" type="xsd:string"/>
  <wsdl:part name="in7" type="xsd:int"/>
  <wsdl:part name="in8" type="xsd:string"/>

Pathport Web service from the Virginia Bioinformatics Institute

http://pathport.vbi.vt.edu/services/wsdls/beta/glimmer.wsd

Name of the service

Uninformative names for parameters

What kind of string?
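
To make the point concrete, here is a minimal sketch that pulls out everything a WSDL message declaration says about its parameters. It uses a shortened copy of the fragment above, wrapped in a root element and closed so that it parses; the wrapper and the function name list_parts are assumptions for illustration, not part of the Pathport service.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"

# Shortened copy of the slide's fragment. The definitions wrapper and the
# closing message tag are added here only so the text is well-formed.
FRAGMENT = """
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <wsdl:message name="getGlimmersRequest">
    <wsdl:part name="in0" type="xsd:string"/>
    <wsdl:part name="in1" type="xsd:string"/>
    <wsdl:part name="in7" type="xsd:int"/>
  </wsdl:message>
</wsdl:definitions>
"""


def list_parts(wsdl_text):
    """Return (message, part, type) triples found in the WSDL text."""
    root = ET.fromstring(wsdl_text)
    triples = []
    for message in root.iter(f"{{{WSDL_NS}}}message"):
        for part in message.findall(f"{{{WSDL_NS}}}part"):
            triples.append(
                (message.get("name"), part.get("name"), part.get("type"))
            )
    return triples


for message, part, xsd_type in list_parts(FRAGMENT):
    # All a client learns is a positional name and a primitive type.
    print(message, part, xsd_type)
```

Nothing in the result says whether "in0" is a sequence, a database name or a file path, which is the gap the semantic annotations are meant to fill.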

Semantics and Web Services

• SAWSDL – Semantic Annotations for WSDL working group

• Virtually no uptake by bioinformatics service providers

• Doesn’t address non-WSDL services

Adding Semantics – Annotating Services

Find services by their function instead of their name

• The services might be distributed, but a registry of service descriptions can be central and queried

• We need to annotate services with semantics

In myGrid, we use the Feta Semantic Discovery tool and a semantic annotation tool – and expert curation

myGrid Ontology

Logically separated into two parts:

• Service ontology: Physical and operational features of (web) services

• Domain ontology: Annotation vocabulary for core bioinformatics data, data types and their relationships

Service Ontology

• Models services from the point of view of the scientist
– Where is it?
– How many inputs/outputs?
– Who hosts it?

• Invocation details are hidden by the Taverna workbench

• Differs from related initiatives in this respect

Domain Ontology

• Informatics: captures the key concepts of data, data structures, databases and metadata.

• Bioinformatics: The domain-specific data sources (e.g. the model organism sequencing databases), and domain-specific algorithms for searching and analyzing data (e.g. the sequence alignment algorithm, clustalw).

• Molecular biology: Concepts include, for example, protein sequence and nucleic acid sequence.

• Formats: A hierarchy describing bioinformatics file formats. For example, fasta format for sequence data, or phylip format for phylogenetic data.

• Tasks: A hierarchy describing the generic tasks a service operation can perform. Examples include retrieving, displaying, and aligning.

myGrid Ontology

[Figure: structure of the myGrid ontology. Component ontologies: Web Service ontology, Task ontology, Informatics ontology, Molecular Biology ontology, Bioinformatics ontology, connected by "Specialises" and "Contributes to" relations. Example data concepts: sequence, biological_sequence, protein_sequence, nucleotide_sequence, DNA_sequence, protein_structure_feature. Example service concepts: Similarity Search Service, BLAST service, BLASTp service, InterProScan service]

Example Service Annotation

• Example: BLAST from the DDBJ
– Performs task: Alignment
– Uses Method: Similarity Search Algorithm
– Uses Resources: DNA/Protein sequence databases
– Inputs:
  • biological sequence (and format)
  • database name (and format)
  • blast program (and format)
– Outputs: Blast Report

• Minimum Information model
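
As a rough illustration, the BLAST annotation above could be held in a small record like the one below. The class and field names are hypothetical and chosen only to mirror the bullets; this is not the Feta data model.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    semantic_type: str      # ontology term, e.g. "biological sequence"
    data_format: str = ""   # left empty until the format is annotated


@dataclass
class ServiceAnnotation:
    name: str
    provider: str
    task: str
    method: str
    resources: list
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


def input_types(annotation):
    """Return the semantic types of all inputs, in declaration order."""
    return [p.semantic_type for p in annotation.inputs]


# Values copied from the slide; formats are not given there, so they stay empty.
blast = ServiceAnnotation(
    name="BLAST",
    provider="DDBJ",
    task="Alignment",
    method="Similarity Search Algorithm",
    resources=["DNA/Protein sequence databases"],
    inputs=[
        Parameter("biological sequence"),
        Parameter("database name"),
        Parameter("blast program"),
    ],
    outputs=[Parameter("Blast Report")],
)

print(input_types(blast))
```

Keeping the format separate from the semantic type reflects the "(and format)" note on each input: what the data is and how it is encoded are annotated independently.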

Minimum Models in Biology

• MIBBI – Minimum Information for Biological and Biomedical Investigations
– MIAME – Microarray experiments
– MIAPE – Proteomics
– MIRIAM – Biochemical models (SBML models)
– Etc.
– MIOAWS – Minimum Information About the Operation of the Web Service

myGrid Ontology

First version of the ontology ~ 2002

Originally developed in DAML+OIL

Now developed in OWL and a version exported to RDFS

Number of classes in the ontology ~750

Domain and service ontology used by myGrid users and developers of myGrid related plugins

Service ontology also used by BioMoby

W3C compliant with respect to ontology modelling

How do we use the ontology?

Two methods of service description

1. Decision Making - reasoning

Single description – whole service model

Ontology used to build a single, complete service description, and annotations are classified

Enables automated composition of workflows

2. Decision Support - querying

Composite matches to ontology terms

Multiple terms are used to query the annotations

Originally – Decision Making

• Difficult and time consuming to produce the detailed service descriptions

• Assumption that people would want automated workflow composition

[Figure: example workflow. Sequence → RepeatMasker Web service → GenePrediction Web service → Blast Web service → Predicted Genes out. Callouts: "Only 1 exists"; "Many different algorithms – effective with different organisms etc"; "Works over underlying databases"]

Resource Compatibility Difference?

• Scientist’s choice – can they be sure the experiments are equivalent?

Example: Nucleotide sequence databases
• GenBank – USA
• EMBL – Europe
• DDBJ – Japan

Nightly updates – mirrored data BUT the sequence annotation could be different

myGrid – Decision Support

– Reducing the list of known services from thousands to several

– Scientist makes the final decision about which of a selection of services to use

– Services are ‘tagged’ with terms from the ontology – very simple!

– No requirement for OWL-DL reasoning

– Generating service annotations is much easier
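
The decision-support style of discovery can be sketched as plain set matching over tags. The registry contents and tag names below are made up for illustration, and find_services is a hypothetical helper, not the Feta query interface.

```python
def find_services(registry, required_terms):
    """Return names of services tagged with every term in required_terms."""
    wanted = set(required_terms)
    return sorted(name for name, tags in registry.items() if wanted <= tags)


# Hypothetical registry: service name -> set of ontology terms it is tagged with.
registry = {
    "ddbj_blast": {"alignment", "similarity_search", "biological_sequence"},
    "emma": {"alignment", "multiple_alignment", "biological_sequence"},
    "interproscan": {"protein_sequence", "protein_structure_feature"},
}

# Several terms combined narrow the list; the scientist picks from what is left.
shortlist = find_services(registry, ["alignment", "biological_sequence"])
print(shortlist)
```

Note that ‘emma’ is found by what it does rather than by its name, and no reasoner is involved: the query is a subset test on tags.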

So why do we need OWL?

Building workflows is a two-stage process

1. Assembly – identifying services that perform the scientific functions needed for the experiment

2. Gluing – identifying how (or more usually, if) these services are compatible

If they are incompatible – we need services that convert data formats and act as connectors – we call these services Shims
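
A shim is typically a few lines of format conversion. The sketch below is a hypothetical example, not one of the myGrid shims: it assumes one service emits a single-record FASTA string while the next expects a bare sequence.

```python
def fasta_to_raw(fasta_text):
    """Shim: drop FASTA header lines and join the remaining sequence lines.

    Assumes a single record; several records would be concatenated.
    """
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


record = ">seq1 demo\nACGT\nTTGA\n"
print(fasta_to_raw(record))
```

The conversion changes the encoding of the data but not its scientific content, which is why a shim can be inserted without the scientist having to choose it.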

Cases for using the OWL version

• Automatic shim integration
– Shims don’t do anything scientific, so choosing one over another makes no difference

• Detecting mismatches
– A scientist has built a workflow and the output of processor 1 is incompatible with processor 2
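
Mismatch detection can be sketched as a subsumption check between the annotated output type of one processor and the input type of the next. The parent links below are an assumed fragment built from concept names in the ontology, and a real implementation would ask an OWL reasoner rather than walk a dictionary; the processor names are hypothetical.

```python
# Assumed fragment of the type hierarchy: child term -> parent term.
PARENT = {
    "protein_sequence": "biological_sequence",
    "nucleotide_sequence": "biological_sequence",
    "DNA_sequence": "nucleotide_sequence",
    "biological_sequence": "sequence",
}


def is_a(term, ancestor):
    """True if term equals ancestor or is a descendant of it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def find_mismatches(links):
    """links: (producer, output_type, consumer, input_type) per workflow link.

    Returns the (producer, consumer) pairs whose types do not fit.
    """
    return [(p, c) for p, out_t, c, in_t in links if not is_a(out_t, in_t)]


links = [
    ("fetch_dna", "DNA_sequence", "blast", "biological_sequence"),
    ("fetch_dna", "DNA_sequence", "interproscan", "protein_sequence"),
]
print(find_mismatches(links))
```

A link is accepted when the output is the same as, or more specific than, the expected input; each reported pair is a place where a shim or a different service is needed.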

Limitations of the Current Model

• Feta discovery tool is only accessible from the Taverna Workbench

• Only pertinent to Taverna users – other people need to find and use web services

• Focuses on finding services, but not workflows. For reuse, we need to do both

• Closed annotation system - myGrid curator provides service descriptions – only 700 so far!

BioCatalogue: Public Bioinformatics Service Registry

• Collaboration between University of Manchester and EBI
• Expanding from a service for Taverna users to a service for anyone using bio web services
• Combine service and workflow discovery
• Accelerating the process of gathering service descriptions/annotations by engaging the scientific community

• Combines the myGrid initiative with BioMoby etc

Combining Service and Workflow Discovery

myExperiment – social networking – Web 2.0
• Workflows tagged
• No formal model
• No control

• Services – semantically described, ontology terms

• Access each through the same interface

• Exchanging metadata objects

‘Shopping’ for Services and Workflows

[Screenshot of bio service shopping site]

Getting the Minimum

Community annotation
• Must be easy and quick
• Must allow partial descriptions
• Multiple annotations of the same service

• What is the minimum information to enable
– service discovery
– service invocation

• Tagging terms to formal models – OWL, SKOS intermediate?

Grading Services

• Bronze – enough to locate the service. Example of service invocation

• Silver

• Gold

• Platinum – full description. All properties annotated – including dependencies between them – reliability metrics etc

Annotation Provenance

• Who said what about what?
• Harvesting community annotation
• Verifying and augmenting by a curator
• ‘Trust’ Models

• Annotation versions
– In a workflow context
– As stand-alone services
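
One way to picture "who said what about what" is a record per assertion plus a rule for choosing between competing values. The record layout and the rule below (curator first, then majority) are assumptions made for illustration; the slides leave the actual trust model open.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    service: str            # what is being described
    prop: str               # which property, e.g. "task"
    value: str              # the asserted value
    author: str             # who said it
    curated: bool = False   # True once a curator has verified it


def trusted_value(annotations, service, prop):
    """Prefer a curator-verified value, else the most common community value."""
    matches = [a for a in annotations if a.service == service and a.prop == prop]
    curated = [a.value for a in matches if a.curated]
    if curated:
        return curated[0]
    if not matches:
        return None
    return Counter(a.value for a in matches).most_common(1)[0][0]


# Hypothetical community annotations for one service.
annotations = [
    Annotation("ddbj_blast", "task", "alignment", "user_a"),
    Annotation("ddbj_blast", "task", "alignment", "user_b"),
    Annotation("ddbj_blast", "task", "retrieving", "user_c"),
]
print(trusted_value(annotations, "ddbj_blast", "task"))
```

Because every assertion keeps its author, a curator can later verify or override community input without discarding it.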

Annotation Process

Open Issues

• ‘Open’ world means we cannot impose metadata standards

• Lots of heterogeneity
• Ontology modelling – stable standards to build upon
• Web services – shifting standards – need flexibility for future-proofing
• Other services as well as web services
• Combining and exchanging metadata objects behind interfaces
• Can we adopt something from the digital library community? e.g. OAI and ORE (Open Archives Initiative Object Reuse and Exchange)

myGrid acknowledgements

Carole Goble, Norman Paton, Robert Stevens, Anil Wipat, David De Roure, Steve Pettifer

• OMII-UK Tom Oinn, Katy Wolstencroft, Daniele Turi, June Finch, Stuart Owen, David Withers, Stian Soiland, Franck Tanoh, Matthew Gamble, Alan Williams, Ian Dunlop

• Research Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, Antoon Goderis, Alastair Hampshire, Qiuwei Yu, Wang Kaixuan.

• Current contributors Matthew Pocock, James Marsh, Khalid Belhajjame, PsyGrid project, Bergen people, EMBRACE people.

• User Advocates and their bosses Simon Pearce, Claire Jennings, Hannah Tipney, May Tassabehji, Andy Brass, Paul Fisher, Peter Li, Simon Hubbard, Tracy Craddock, Doug Kell, Marco Roos, Matthew Pocock, Mark Wilkinson

• Past Contributors Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Juri Papay, Savas Parastatidis, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Victor Tan, Paul Watson, and Chris Wroe.

• Industrial Dennis Quan, Sean Martin, Michael Niemi (IBM), Chimatica.

• Funding EPSRC, Wellcome Trust.

http://www.mygrid.org.uk

http://www.myexperiment.org