The Changing Nature of Biomedical Research: Semantic e-Science

Post on 21-May-2015


Keynote talk at the KR4HC workshop at Artificial Intelligence in Medicine Europe, Verona, 2009

Robert Stevens

BioHealth Informatics Group

University of Manchester

Robert.Stevens@manchester.ac.uk

Introduction

• (Modern bio-molecular) Science
• E-Science
• Semantics and science
• Semantic e-Science

Ernest Rutherford

“All science is either physics or stamp collecting”

Image: http://en.wikipedia.org/wiki/File:Ernest_Rutherford2.jpg

Mathematical Sciences

Laws in Biology

Charles Darwin

Image: http://en.wikipedia.org/wiki/File:Charles_Darwin_01.jpg

On The Origin of Species - 1859

Central Dogma

Image: http://cellbio.utmb.edu/CELLBIO/DNA-RNA.jpg

Classic and Modern Biology

Genotype Phenotype

Modern biology

Classic biology

Speed of sequencing

• First human genome

– 10+ years to produce
– Cost $500 million
– Huge international effort

• Now done in 10 weeks

– (for $399)
– http://tinyurl.com/genomecost
– http://www.23andme.com

1000+ databases

• according to Nucleic Acids Research

PubMed: 2 papers per minute

• ~700,000 individual papers
• Grows at 2 papers per minute

(see http://blogs.bbsrc.ac.uk for details)

Biology now has lots of facts

Lots of catalogues

Genome

Proteome

Transcriptome

Interactome

Metabolome

PHENOME

Creating Woods, not Trees

Genes

Proteins

Pathways

Interactions

Literature

Complex Machines

Virtual Organism

…. from biological facts, we make a system that is some model of a real organism

Networks of Chemicals

Image: http://genome-www.stanford.edu/rap_sir/images/Web_FigF_RAP1_glycolysis.gif

Systems within Systems

Image: http://www.ehponline.org/members/2007/10373/fig1.jpg

UniProt: a protein database?

Navigating the Web of Knowledge in Bioinformatics

Bioinformatics Experiments are Data pipelines

Resources/Services

Investigate the evolutionary relationships between proteins

Protein sequences

Multiple sequence alignment

Query

[Peter Li]

My data

My tool

Linking together data resources

Hypo Science – the routine for the many

Hyper Science – big projects, big science

The In Silico Experiment

• We can mine these data for possible hypotheses

• “what are the genes that are involved in some disease phenotype?”

• Correlate genes in QTL with differentially regulated genes in microarray via pathways; query the literature base with these genes, pathways and phenotype; …

• Resulting facts form some hypothesis: A co-ordinated set of SNPs increase cholesterol biosynthesis in macrophage, while delaying apoptosis of these cells; increased super-oxide production aids tolerance to trypanosomiasis in cattle
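The correlation step in the bullets above can be sketched with plain set operations. This is a minimal illustration only, not the method used in the study: the gene names, pathway names, and the function `candidate_pathways` are all hypothetical, and a real analysis would draw these sets from QTL, microarray, and pathway resources.

```python
# Hypothetical toy data; gene and pathway names are illustrative only.
qtl_genes = {"geneA", "geneB", "geneC", "geneD"}
diff_expressed = {"geneB", "geneD", "geneE"}
pathways = {
    "pathway1": {"geneA", "geneB"},
    "pathway2": {"geneD", "geneE"},
    "pathway3": {"geneF"},
}


def candidate_pathways(qtl, expressed, pathway_members):
    """Return pathways holding at least one gene that lies in the QTL
    region and is also differentially expressed, with the supporting genes."""
    candidates = qtl & expressed
    hits = {}
    for name, members in pathway_members.items():
        support = members & candidates
        if support:
            hits[name] = sorted(support)
    return hits


result = candidate_pathways(qtl_genes, diff_expressed, pathways)
print(result)
```

The surviving pathways, together with their supporting genes, are the terms one would then take to the literature.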

How bioinformatics was Done

Integrating data sets

• Slave labour
• Collections of Scripts
• Warehouses
• Applications
– Galaxy
– Gaggle
– Integr8
– Ensembl
– …..

• Workflows!

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta

Workflows: E. Science laboris

• Data preparation and analysis pipelines.
• Data preparation pipelines
• Data integration pipelines
• Data analysis pipelines
• Data annotation pipelines
• Warehouse population refreshing
• Data and text mining
• Knowledge extraction.
• Parameter sweeps over simulations/computations
• Model building and verification
• Knowledge management and model population
• Hypothesis generation and modelling

• A workflow is a specification.
• A workflow management system (WFmS) is the machinery for coordinating the execution of (scientific) services and linking together (scientific) resources.

• Handles cross cutting concerns like: error handling, service invocation, data movement, data streaming, data provenance tracking, process auditing, execution monitoring, security access, blah blah…..

• Agile software development
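The split between a workflow as a specification and the machinery that runs it can be sketched in a few lines. This is a minimal sketch under stated assumptions, not any particular workflow system: the function `enact`, the step names, and the lambda "services" are hypothetical stand-ins, and only three of the cross-cutting concerns (data movement, provenance tracking, error handling) are shown.

```python
def enact(workflow, data):
    """Run each (name, service) step in order, passing data from one step
    to the next and recording a provenance log. A failing step is logged
    and stops the run."""
    provenance = []
    for name, service in workflow:
        try:
            output = service(data)
        except Exception as exc:
            provenance.append({"step": name, "input": data, "error": repr(exc)})
            return None, provenance
        provenance.append({"step": name, "input": data, "output": output})
        data = output
    return data, provenance


# The workflow itself is only a specification: an ordered list of named steps.
spec = [
    ("clean", lambda seq: seq.replace(" ", "").upper()),
    ("length", len),
]

final, log = enact(spec, "acat ttct")
print(final)
for entry in log:
    print(entry)
```

Because the specification is just data, the same `spec` can be re-run, shared, or inspected without touching the enactment code.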

Workflows: E. Science laboris

Enactment Engine

My data

My tool

Workflow Execution Engine

Workflow execution engine

• Local desktop and remote server
• Implicit iteration over large data collections
• Nested workflows
• Automated data flow
• Event history log and data provenance tracking
• Within-workflow programming
• Extensibility points for plug-ins
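One reading of "implicit iteration" is that a service written for a single item is applied automatically to every item when it is handed a collection. The sketch below illustrates that reading only; `implicit_iterate` and `gc_count` are hypothetical names, and the behaviour of a real execution engine may differ.

```python
def implicit_iterate(service, value):
    """Apply a single-item service to a value. If the value is a list,
    apply the service to each element, recursing into nested lists."""
    if isinstance(value, list):
        return [implicit_iterate(service, item) for item in value]
    return service(value)


def gc_count(seq):
    """Count G and C bases in one sequence string."""
    return sum(base in "GC" for base in seq.upper())


single = implicit_iterate(gc_count, "ccc")
nested = implicit_iterate(gc_count, ["acgt", ["gg", "at"]])
print(single, nested)
```

The service author writes `gc_count` once for one sequence; the shape of the input decides how many times it runs.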

Graphical workbench

• For Professionals
• Plug-in architecture
• Incorporate new service without coding. Services as they are.
• Access to local and remote resources and analysis tools

Re-Design

Rewritten

• Comparing resistant vs. susceptible strains – Microarrays

• Mapping quantitative traits – Classical genetics QTL

• Integrated Microarray data, genomic sequences, pathway data, literature mining.

Trypanosomiasis Study

Paul Fisher et al., Nucleic Acids Research, 2007, 35(16)

Genotype to Pathway

Created by Paul Fisher

Pathway to Phenotype

Created by Paul Fisher

• Eliminated user bias and premature filtering

• The scale and complexity of data and literature.

• Systematic data analysis

• Data analysis provenance

• Manageable amount of output data for biologists to interpret and verify

• Data driven science

“Looking where others hadn’t”

“make sense of this data” -> “does this make sense?”

http://www.youtube.com/watch?v=Y6_Kz5L010g

Transferring Characteristics

Uncharacterised protein

Tra1 La2 La3

High similarity transfer characteristics
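The idea of transferring characteristics on high similarity can be sketched as follows. Everything here is an illustrative assumption: the sequences and annotations are invented, the 80% threshold is arbitrary, and percent identity over pre-aligned, equal-length strings stands in for a real similarity search.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two equal-length,
    already-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def transfer_annotation(query, characterised, threshold=80.0):
    """Return (annotation, source name, score) from the most similar
    characterised protein if it meets the threshold, otherwise None."""
    best_name, best_score = None, 0.0
    for name, (seq, _annotation) in characterised.items():
        score = percent_identity(query, seq)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score >= threshold:
        return characterised[best_name][1], best_name, best_score
    return None


# Hypothetical characterised proteins: name -> (sequence, annotation).
known = {
    "protX": ("MKTAYIAKQR", "hypothetical kinase"),
    "protY": ("MKTGGGGGGG", "hypothetical transporter"),
}

close_match = transfer_annotation("MKTAYIAKQK", known)
no_match = transfer_annotation("GGGGGGGGGG", known)
print(close_match, no_match)
```

Returning the source protein and score alongside the annotation keeps a record of where a transferred characteristic came from.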

… A Fact Based Discipline

• Rather than laws captured in mathematics….
• We have lots of facts: the discipline’s knowledge
• Rather than “calculating” what a protein does, we investigate and write it down
• Equivalent to writing down the trajectories of all thrown objects and not doing ballistics!
• To do biology one needs “the knowledge”

Heterogeneity

• 28 ways to format the representations of a biological sequence

• Though one way to represent the bases or amino acids…

• Different words same concept
• Different concepts same words
• Different and implicit data schema
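The point that formats differ while the underlying bases do not can be shown with two layouts of the same fragment. This is a sketch covering only two of the many formats, a FASTA-style record and a numbered, space-blocked layout like the sequence fragment shown earlier in the talk; the record header and both function names are hypothetical.

```python
def from_fasta(text):
    """Join the sequence lines of a single-record FASTA string,
    skipping the '>' header line."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(l for l in lines if l and not l.startswith(">")).lower()


def from_numbered(text):
    """Strip position numbers and spacing from a numbered,
    space-blocked sequence layout."""
    return "".join(ch for ch in text if ch.isalpha()).lower()


fasta_record = ">example hypothetical record\nACATTTCTAC\nCAACAGTGGA\n"
numbered_record = "12181 acatttctac caacagtgga"

seq_a = from_fasta(fasta_record)
seq_b = from_numbered(numbered_record)
print(seq_a == seq_b, seq_a)
```

Each extra format needs its own small parser like these, which is the plumbing cost the slide is pointing at.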

An Identity Crisis

• Database entries have identifiers unique within their database

• The type of entity described in an entry doesn’t have an identifier

• Different entries about the same type talk about it differently

• How do we know when an entry in one DB talks about the same thing as another entry in another DB?

• That’s the skill of a bioinformatician
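Deciding when entries in different databases talk about the same thing can be sketched as grouping identifiers from pairwise "same as" assertions. This is a minimal sketch with invented identifiers (`dbA:P1` and so on); in practice the hard part is producing trustworthy assertions, which is the bioinformatician's skill, not the grouping itself.

```python
def group_equivalent(xrefs):
    """Group identifiers into sets that denote the same entity, given
    pairwise 'same as' assertions (a simple union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in xrefs:
        parent[find(a)] = find(b)

    groups = {}
    for ident in list(parent):
        groups.setdefault(find(ident), set()).add(ident)
    return list(groups.values())


# Hypothetical cross-references between three databases.
assertions = [
    ("dbA:P1", "dbB:77"),
    ("dbB:77", "dbC:x9"),
    ("dbA:P2", "dbB:80"),
]

groups = group_equivalent(assertions)
print(sorted(sorted(g) for g in groups))
```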

Categories and Category Labels

GO:0000368

U2-type nuclear mRNA 5' splice site recognition

spliceosomal E complex formation

spliceosomal E complex biosynthesis

spliceosomal CC complex formation

U2-type nuclear mRNA 5'-splice site recognition
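The slide's point, one category with several labels, can be sketched as a lookup from any label to the single identifier. The labels and the identifier are those listed above; the normalisation rule (lower-casing and treating hyphens as spaces) and the function names are assumptions made for illustration.

```python
import re

# Labels as listed on the slide, all attached to one category identifier.
LABELS = {
    "GO:0000368": [
        "U2-type nuclear mRNA 5' splice site recognition",
        "spliceosomal E complex formation",
        "spliceosomal E complex biosynthesis",
        "spliceosomal CC complex formation",
        "U2-type nuclear mRNA 5'-splice site recognition",
    ],
}


def normalise(label):
    """Lower-case a label and collapse hyphens and whitespace to one space."""
    return re.sub(r"[\s\-]+", " ", label.strip().lower())


# Index from normalised label to category identifier.
INDEX = {
    normalise(label): ident
    for ident, labels in LABELS.items()
    for label in labels
}


def lookup(label):
    """Return the category identifier for a label, or None if unknown."""
    return INDEX.get(normalise(label))


print(lookup("Spliceosomal E Complex Formation"))
print(len(INDEX))
```

Under this normalisation the two "5' splice site recognition" spellings collapse to one index entry, so five labels yield four keys.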

The Role of Knowledge

• A lot of facts
• Perhaps organised into a system
• No equivalent of “laws of mechanics” – we can’t do this biology with mathematics
• Or at least not without knowing what the numbers mean...
• This is why we’ve been using ontologies!

Uses of Ontology in Bioinformatics

Post-Genomic Biology

• Fly, mouse, yeast, worm all have their own terminologies

• I want to compare genomes
• How?
• The genomic sequence is easily dealt with computationally and comparisons are easy
• This is not true of the annotations or knowledge of those sequences
• Need a common understanding

Annotation of Data

• Big effort to create controlled vocabularies using ontologies

• A huge annotation effort – describe the entities in DB with terms from ontologies

• The Gene Ontology (http://www.geneontology.org)
• The Open Biomedical Ontologies Consortium

GO in Analysis

• Microarray analysis was one of the original visions for GO
• Modulated genes cluster about functional attributes of their proteins
• GO also used in, for example, semantic similarity; text analysis; etc.
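One common way to ask whether modulated genes cluster about a functional attribute is an over-representation test. The sketch below uses a hypergeometric tail probability; this is a widely used technique swapped in for illustration, not necessarily what the talk had in mind, and the gene counts are invented.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 genes on the array, 4 annotated with a term,
# 3 genes modulated, 2 of those modulated genes carry the term.
p = hypergeom_pvalue(population=10, annotated=4, sample=3, hits=2)
print(p)
```

A small value suggests the term is over-represented among the modulated genes; with many terms tested, a multiple-testing correction would also be needed.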

Biocatalogue content screenshot

Shield users and applications from service interoperability and incompatibility plumbing.

Turn your app into a service

Service providers Not only web services

How a bioinformatician assumes stuff should work

Pettifer, University of Manchester

inside

A collection of interactive tools for analysing protein sequence and structure

http://utopia.cs.manchester.ac.uk/

Semantic Descriptions of All

• Not just bio-entities in data
• The laboratory experiments by which they were generated
• The protocols for their analysis
• The services for their analysis

Semantic Integration

• Same identifiers mean integration and interoperation
• Most workflows are hobbled by syntactic and semantic heterogeneity
• Syntactic integration (Bio2RDF)
• Semantic integration via ontologies and naming schemes
• Enables better e-Science through semantic science
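Why shared identifiers give integration almost for free can be sketched with two record sets keyed on the same identifiers. The identifiers, field names, and `merge_records` are hypothetical; the sketch assumes the identifier reconciliation has already been done.

```python
def merge_records(*sources):
    """Merge dictionaries of records keyed by a shared identifier,
    combining the fields known about each identifier."""
    merged = {}
    for source in sources:
        for ident, fields in source.items():
            merged.setdefault(ident, {}).update(fields)
    return merged


# Two hypothetical resources that happen to use the same identifiers.
expression = {"gene:1": {"fold_change": 2.5}, "gene:2": {"fold_change": 0.4}}
annotation = {"gene:1": {"term": "GO:0000368"}, "gene:3": {"term": "GO:0000368"}}

integrated = merge_records(expression, annotation)
print(integrated)
```

Without shared identifiers, the merge key itself has to be inferred first, which is where most of the effort goes.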

Fact Management

• When “stamp collecting” we’re collecting facts
• Biology is a fact management activity
• Knowing what these facts mean is very important
• Science is performed on data and the semantics of data enable us to do science
• Semantic e-Science

Summary

• The nature of modern biology gives it interesting knowledge (fact) management issues

• It is a knowledge based discipline
• Not unique, but often extreme
• Ontologies seen as one component in management (but not a panacea)
• E-Science gives infrastructure for management; semantics enable analysis
• Actually, very light use of semantics

More Acknowledgements

• Phil Lord
• Simon Jupp
• Carole Goble
