Science, Workflows and Collections Professor Carole Goble The University of Manchester, UK carole.goble@manchester.ac.uk

Dec 28, 2015

Page 1: Science, Workflows and Collections Professor Carole Goble The University of Manchester, UK carole.goble@manchester.ac.uk.

Science, Workflows and Collections

Professor Carole Goble

The University of Manchester, UK
carole.goble@manchester.ac.uk

Page 2

Roadmap

How bioinformaticians will work (and are now)

The myGrid project - workflows
Using publications in workflows
Workflow implications for serials

Page 3

Williams-Beuren Syndrome

Contiguous sporadic gene deletion disorder
1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis

Haploinsufficiency of the region results in the phenotype

[Figure: physical map of chromosome 7 (~155 Mb) showing the ~1.5 Mb region at 7q11.23, the WBS and SVAS patient deletions, clones CTA-315H11 and CTB-51J22, and the 'Gap'. Credit: Hannah Tipney]

Page 4

1. Identify new, overlapping sequence of interest
2. Characterise the new sequence at nucleotide and amino acid level

Cutting and pasting between numerous web-based services, e.g. BLAST, InterProScan etc.

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
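The excerpt on this slide uses a numbered layout: a position, then blocks of ten bases. Before such text can be pasted into a service like BLAST, the numbers and spaces have to go. A minimal sketch of that clean-up follows, assuming the input contains only the letters a, c, g, t and n; the function name `join_numbered_sequence` is illustrative and not part of any tool named in the talk.

```python
import re


def join_numbered_sequence(text: str) -> str:
    """Join a numbered, space-separated excerpt into one contiguous sequence."""
    # Position numbers and whitespace are layout only; keep the base letters.
    # Assumption: only a, c, g, t and n occur (other IUPAC codes are dropped).
    return "".join(re.findall(r"[acgtn]+", text.lower()))


# First row of the slide's excerpt plus the start of the second row.
excerpt = (
    "12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt "
    "12241 cagtctttta aattttaacc"
)
sequence = join_numbered_sequence(excerpt)
print(len(sequence), sequence[:20])
```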

Page 5

In Life Sciences: Data, Publication, it's all the same

It's just part of the experiment

No separation between data and publications

Publications are the context for data

Break the silo between published papers and published data


Page 6

Aside: A heretic speaks

Life Scientists read journals
I’m a Computer Scientist. I don’t.
It's on the Web
It's in podcast talks or PowerPoint
Google is the Lord’s work
What PhD students are for
Journal publications too outdated

Page 7

Bioinformatics pipelines on the web

Copy and paste from one web based application to another

Annotate by hand
Disadvantages: time consuming, error prone, tacit procedure so difficult to share both protocol and results

RepeatMasker BLASTn Twinscan

Page 8

Workflows for Science


Page 9

“Workflow at its simplest is the movement of documents and/or tasks through a work process.

More specifically, workflow is the operational aspect of a work procedure: how tasks are structured, who performs them, what their relative order is, how they are synchronized, how information flows to support the tasks and how tasks are being tracked”.

Workflows for Science

Page 10

[Figure: Sequence in -> RepeatMasker Web Service -> BLASTn Web Service -> Twinscan Web Service -> Predicted genes out]

Simple scripting language specifies how steps of a pipeline link together
Hides all the fiddling about.
Advantages: automation, quick to write, easier to explain, share, relocate, and record provenance of results in a standard way
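The idea of a script that links pipeline steps and records provenance in a standard way can be sketched in a few lines of Python. This is an illustration only, not the myGrid or Taverna workflow language. The three step functions are hypothetical stand-ins for the remote web services, and the `Workflow` class and its field names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[str], str]]


@dataclass
class Workflow:
    """An ordered chain of named steps; each step's output feeds the next."""

    steps: List[Step]
    provenance: List[Dict[str, str]] = field(default_factory=list)

    def run(self, data: str) -> str:
        for name, service in self.steps:
            result = service(data)
            # Record every step in the same shape so runs can be compared.
            self.provenance.append({
                "step": name,
                "recorded_at": datetime.now(timezone.utc).isoformat(),
                "input": data,
                "output": result,
            })
            data = result
        return data


# Hypothetical stand-ins: real steps would call the remote web services.
def mask_repeats(seq: str) -> str:
    return seq.replace("tttttttt", "NNNNNNNN")


def similarity_search(seq: str) -> str:
    return seq  # placeholder: passes the sequence through unchanged


def predict_genes(seq: str) -> str:
    return f"predicted genes for {len(seq)} bases"


pipeline = Workflow([
    ("RepeatMasker", mask_repeats),
    ("BLASTn", similarity_search),
    ("Twinscan", predict_genes),
])
print(pipeline.run("taggtgacttgcctgtttttttttaattgg"))
print([record["step"] for record in pipeline.provenance])
```

Because the chain is plain data (a list of named steps), it can be shared, rerun elsewhere, or edited to swap one service for another without touching the rest.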

Workflows for Science

Page 11

Workflows describe the scientist's in silico experiment
Link together and cross reference data in different repositories
And that includes serials!

Remote, third party, external applications and services
Accessible to the workflow machinery
And that includes serials!

Results management
Semantic metadata annotation of data
Provenance tracking of results

Sharing and replicating know-how
Reuse of workflows

Workflows for Science

Page 12

Page 13

WBS
The first complete and accurate map of the region of chromosome 7 involved in Williams-Beuren Syndrome
Time to perform one WBS pipeline reduced from 2 weeks to 2 hours

Faster, automated, systematic and shareable

Page 14

Trypanosomiasis in cattle

Chicken genome

Reuse: adapting and sharing best practice and know-how across a community by publishing workflows

Mouse genome

Graves' disease

Williams-Beuren Syndrome

Page 15

Trypanosomiasis in cattle

Identify the genetic difference responsible for resistance to trypanosomiasis and breed into productive cattle.

Mice as a model.
Gene expression and microarray analysis
The literature

Associations between upregulated genes

Links between changed genes and genes in the Tir1 region

Page 16

Page 17

Page 18

Page 19

PubMed Text Mining results

Page 20

Page 21

Chilibot text mining in Taverna

Page 22

Taverna output

Chilibot web page

Page 23

• Trypanosomes need cholesterol and have a scavenger receptor specific for HDL

• Resistant mice reduce available HDL, slowing trypanosome growth

New hypothesis: Resistance and susceptibility in mice is a function of the cholesterol recycling pathway. Mice love lard.

lipoprotein and cholesterol

Page 24

Biological pathway, highlighted with RNA molecules (orange) and DNA QTL molecules (pink), discovered with the aid of Chilibot text mining over PubMed.

Page 25

Page 26

myGrid/Discovery Net

Specialist Term recognition software

Assigning Gene Ontology terms to papers in MedLine

Page 27

Science: Knowledge-driven

MEDLINE abstract; marked up by SciBorg

HTML-CML version

Page 28

“the development of online submission systems for scientific manuscripts provides a mechanism for including a mapping of the information in the manuscript to controlled terminologies as an integral part of the publishing process. It is not hard to envision that the indexing of a paper to controlled terms for anatomical, gene nomenclature, or functional terminologies would be a necessary requirement for acceptance of a paper for publication. This, then, would enable the rapid incorporation of the paper and its contents into bioinformatics systems.”

Judith Blake, Bio-ontologies—fast and furious, Nature Biotechnology 22, 773-774 (2004)

Page 29

[Figure: The scholarly knowledge cycle. Liz Lyon, Ariadne, July 2003.]

Workflows: Learning & Teaching workflows; Research & e-Science workflows

Components:
Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
Data analysis, transformation, mining, modelling
Repositories: institutional, e-prints, subject, data, learning objects
Aggregator services: national, commercial
Presentation services: subject, media-specific, data, commercial portals
Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
Peer-reviewed publications: journals, conference proceedings
Quality assurance bodies
Learning object creation, re-use

Links between components: Harvesting metadata; Resource discovery, linking, embedding; Deposit / self-archiving; Searching, harvesting, embedding; Publication; Validation

This work is licensed under a Creative Commons License Attribution-ShareAlike 2.0

© Liz Lyon (UKOLN, University of Bath), 2005

Page 30

eBank UK Project
Aggregator service harvests metadata from institutional repository (e-crystals archive)
eBank service embedded in PSIgate portal for 3rd party search
Service linking from data to derived research publication
Embedding eBank service in learning workflows

UKOLN (lead), University of Southampton, University of Manchester

http://www.ukoln.ac.uk/projects/ebank-uk/

Page 31

Linking data to publications

Page 32

[Figure: CombeChem experiment shown as a "To Do List / Plan" and a "Process Record", drawn as a graph of numbered steps. CombeChem, 30 January 2004, gvh, hrm, gms.]

Ingredient List
Fluorinated biphenyl 0.9 g
Br11OCB 1.59 g
Potassium Carbonate 2.07 g
Butanone 40 ml

To Do List / Plan
Dissolve 4-fluorinated biphenyl in butanone
Add K2CO3 powder
Heat at reflux for 1.5 hours
Cool and add Br11OCB
Heat at reflux until completion
Cool and add water (30 ml)
Extract with DCM (3 x 40 ml)
Combine organics, dry over MgSO4 & filter
Remove solvent in vacuo
Fuse compound to silica & column in ether/petrol

Process Record
Processes: Add, Reflux, Cool, Liquid-liquid extraction, Dry, Filter (Buchner), Remove Solvent by Rotary Evaporation, Fuse, Column Chromatography
Inputs: sample of 4-fluorinated biphenyl, butanone, sample of K2CO3 powder, sample of Br11OCB, water, DCM, MgSO4, silica, ether/petrol ratio
Weigh observations: 0.9031 grammes, 2.0719 g, 1.5918 g
Measure observations: 40 ml, 30 ml, 3 of 40 ml, excess
Annotate observations (text):
Butanone dried via silica column and measured into 100 ml RB flask.
Used 1 ml extra solvent to wash out container.
Started reflux at 13.30. (Had to change heater stirrer.) Only reflux for 45 min, next step 14:15.
Inorganics dissolve, 2 layers. Added brine ~20 ml.
Organics are yellow solution.
Washed MgSO4 with DCM ~50 ml.

Observation Types
weight - grammes
measure - ml, drops
annotate - text
temperature - K, °C

Key: Process, Input, Literal, Observation

Future Questions
Whether to have many subclasses of processes or fewer with annotations
How to depict destructive processes
How to depict taking lots of samples
What is the observation/process boundary? e.g. MRI scan

Provenance: log what, where, when, who

For data and for publications
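A provenance log of "what, where, when, who" maps naturally onto a small record type. The sketch below is one possible shape, not the CombeChem data model: the class name, field names, location label and operator label are all hypothetical, and only the "Weigh" step and its 0.9031 g reading come from the slide.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEntry:
    what: str        # process or observation, e.g. "Weigh"
    where: str       # location label (hypothetical)
    when: str        # ISO 8601 timestamp
    who: str         # operator label (hypothetical)
    value: str = ""  # recorded literal, e.g. "0.9031 g"


def log_entry(what: str, where: str, who: str, value: str = "",
              when: Optional[str] = None) -> ProvenanceEntry:
    """Create an entry, stamping the current UTC time unless one is given."""
    stamp = when if when is not None else datetime.now(timezone.utc).isoformat()
    return ProvenanceEntry(what, where, stamp, who, value)


# Illustrative values only: the location and operator are made up.
entry = log_entry("Weigh", "bench-1", "operator-a", "0.9031 g")
print(json.dumps(asdict(entry), sort_keys=True))
```

The same record shape could describe a step that produced a publication rather than a measurement, which is the point of the slide: one log for data and for publications.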

Page 33

Workflows

Web services

Text mining

Bioinformatics

Semantic mark-up

Page 34

Workflows

Web services

Text mining

Bioinformatics

Semantic mark-up

Publications have to be computational services – web services
They will be read and processed by machines

Licensing that works!

Authorisation, Authentication and digital rights management (e.g. Shibboleth)

Integration of data and publications
Workflows are linking results, whatever the source

Common ids and persistent ids for citation (DOI, LSID, InChI)

No silos
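Persistent identifiers only help a workflow if it can take them apart reliably. As one example, an LSID is a URN of the form urn:lsid:authority:namespace:object with an optional revision. A minimal parsing sketch follows; the `Lsid` type and `parse_lsid` function are names chosen for this illustration, and the identifier used is hypothetical.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> Lsid:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:]):
        raise ValueError(f"empty component in: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier, for illustration only.
print(parse_lsid("urn:lsid:example.org:sequences:12345:1"))
```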

Page 35

Workflows

Web services

Text mining

Bioinformatics

Semantic mark-up

Semantic publishing at source
In order to automate we need better ways of interpreting the publication content
They will be read and processed by machines

Integration of data and publications
Common vocabularies

Accessible full texts for text mining, not just abstracts.

Page 36

Workflows

Bioinformatics

Data

Publications

Semantic markup

Provenance

Page 37

Workflows

Bioinformatics

Data

Publications

Semantic markup

Provenance

Publish workflows with data with publications
Privacy? Intellectual property?

Licensing models for services, so that results and workflows can be reused and shared.

Page 38

Take home

Machines are reading your journals, not just people
And if the journals are not online then they go unread
Workflows are another form of outcome to publish alongside data, metadata and publications
Google rocks – I don’t use anything else!

http://www.mygrid.org.uk
http://www.ukoln.ac.uk/projects/ebank-uk/
http://www.combechem.org

Page 39

Acknowledgements

The myGrid Team, esp. Tom Oinn, Chris Wroe, Antoon Goderis, Andy Brass, Paul Fisher, Hannah Tipney, May Tassabehji, Rob Gaizauskas, Ian Roberts

Discovery Net / Inforsense: Vasa Curcin, Moustafa M Ghanem

BioBank / CombeChem: David De Roure, Liz Lyon

Scientists: Peter Murray-Rust, Judith Blake, Mike Ashburner

Page 40

Digital Library workflows

Workflows for data capture, deposit, preservation, citation, discovery, mining &&….

Multiple workflows interacting together
Workflows may call on each other, in a defined order
Multiple workflows may use “common” services e.g. Assign (identifier)
Require sequential or parallel execution, have dependencies, be time-limited, repetitive
Have an owner (control)
Include essential human interventions
? ? ?
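Workflows that depend on each other, with some running in sequence and others in parallel, can be scheduled by a topological sort of their dependencies. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The workflow names and the dependency map are hypothetical, loosely based on the activities listed on this slide, and `execution_batches` is a name chosen for the example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical dependency map: each workflow lists the workflows it waits for.
dependencies: Dict[str, Set[str]] = {
    "capture": set(),
    "assign_identifier": {"capture"},
    "deposit": {"assign_identifier"},
    "preservation": {"deposit"},
    "citation": {"deposit"},
    "discovery": {"deposit"},
}


def execution_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group workflows into batches; members of one batch can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


for batch in execution_batches(dependencies):
    print(batch)
```

Each printed batch must finish before the next starts. Time limits, ownership and human intervention steps are outside what this sketch covers.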