Curation of Chemistry Data from the Laboratory to Publication

Jeremy Frey & Simon Coles

School of Chemistry, University of Southampton

AHM2006 Data Curation Workshop

The CombeChem Project

End-to-end linking of data and information

Laboratory to publication and back again

Very long data chains can be involved, e.g. from a chemistry lab to mouse genetic expression

The exponential world of combinatorial synthesis and high-throughput analysis meets the exponentially growing power of computing: “Automation, Semantics & the Grid”

[Figure: the Smart Laboratory. Diagram labels: Goal, Literature, Plan & COSHH, Synthesis, Analysis, Digital Model, Information Integration, Knowledge, Report, Smart Storage, Smart Dissemination, Smart HCI.]

Not just one laboratory but many co-laboratories working together

Problems with ‘Small Laboratory’ Working Practice

“Data from experiments conducted as recently as six months ago might be suddenly deemed important, but those researchers may never find those numbers – or if they did might not know what those numbers meant”

“Lost in some research assistant’s computer, the data are often irretrievable or an undecipherable string of digits”

“To vet experiments, correct errors, or find new breakthroughs, scientists desperately need better ways to store and retrieve research data”

“Data from Big Science is … easier to handle, understand and archive. Small Science is horribly heterogeneous and far more vast. In time Small Science will generate 2-3 times more data than Big Science.”

‘Lost in a Sea of Science Data’, S. Carlson, The Chronicle of Higher Education (23/06/2006)

The concept of Publication@Source

Trace all the way back from publication to the original data: provenance

The data is the key: DataGrid

Start as you mean to go on: ELNs are a necessity

Curation of subsequently produced data
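The idea of tracing back from a publication to the original data can be sketched in a few lines of Python. This is only an illustration: the identifiers (`paper:figure-2`, `notebook:entry-17` and so on) and the `derived_from` mapping are hypothetical and are not taken from CombeChem.

```python
# Hypothetical identifiers: each item maps to the item it was derived from.
DERIVED_FROM = {
    "paper:figure-2": "dataset:processed-spectrum",
    "dataset:processed-spectrum": "dataset:raw-spectrum",
    "dataset:raw-spectrum": "notebook:entry-17",
}


def trace_provenance(item, derived_from):
    """Follow derived-from links from a published item back to its origin."""
    chain = [item]
    seen = {item}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:
            # A cycle would mean the provenance record itself is broken.
            raise ValueError(f"provenance cycle at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


chain = trace_provenance("paper:figure-2", DERIVED_FROM)
```

The chain only reaches the notebook entry if every intermediate link was recorded when the data was produced, which is the argument for starting with an ELN.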

Observations are never collected on note pads, filter paper or other temporary paper for later transfer into a notebook.

If you are caught using the “scrap of paper” technique, your improperly recorded data may be confiscated by your TA.

Lab books are a big block to publication@source: if it’s not digital, it is more difficult to share

Need a usable digital lab book. Design by analogy to help Chemists and Computer Scientists work together.

Only some equipment is networked

This is where it all starts: The Lab & The Lab Book

COSHH

Leverage off things we already have to do

[Figure: CombeChem process diagram, 30 January 2004 (gvh, hrm, gms), relating the Plan (To Do List) for a synthesis to its Process Record.]

Ingredient List
Fluorinated biphenyl 0.9 g
Br11OCB 1.59 g
Potassium carbonate 2.07 g
Butanone 40 ml

Plan
Dissolve 4-fluorinated biphenyl in butanone
Add K2CO3 powder
Heat at reflux for 1.5 hours
Cool and add Br11OCB
Heat at reflux until completion
Cool and add water (30 ml)
Liquid-liquid extraction: extract with DCM (3 x 40 ml)
Combine organics, dry over MgSO4 & filter
Remove solvent in vacuo
Fuse compound to silica & column in ether/petrol

Process Record
Processes: Add, Reflux, Cool, Liquid-liquid extraction, Dry, Filter (Buchner), Remove Solvent by Rotary Evaporation, Fuse, Column Chromatography
Inputs: sample of 4-fluorinated biphenyl, butanone, sample of K2CO3 powder, sample of Br11OCB, water, DCM, MgSO4, silica, ether/petrol ratio

Observations
Weigh (4-fluorinated biphenyl): 0.9031 grammes
Measure (butanone): 40 ml
Weigh (K2CO3 powder): 2.0719 g
Weigh (Br11OCB): 1.5918 g
Measure (water): 30 ml
Measure (DCM): 3 of 40 ml
Measure: excess

Annotations
Butanone dried via silica column and measured into 100 ml RB flask.
Used 1 ml extra solvent to wash out container.
Started reflux at 13.30. (Had to change heater stirrer.) Only reflux for 45 min, next step 14:15.
Inorganics dissolve 2 layers. Added brine ~20 ml.
Organics are yellow solution.
Washed MgSO4 with DCM ~50 ml.

Observation Types
weight - grammes
measure - ml, drops
annotate - text
temperature - K, °C

Key: Process, Input, Literal, Observation

Future Questions
Whether to have many subclasses of processes or fewer with annotations
How to depict destructive processes
How to depict taking lots of samples
What is the observation/process boundary? e.g. MRI scan

[Figure: detail of the first three steps of the process diagram (Add, Add, Reflux), showing the plan against the process record.]

Ingredient List: Fluorinated biphenyl 0.9 g, Br11OCB 1.59 g, Potassium carbonate 2.07 g, Butanone 40 ml

Plan: Dissolve 4-fluorinated biphenyl in butanone. Add K2CO3 powder. Heat at reflux for 1.5 hours.

Observations: Weigh 0.9031 grammes. Measure 40 ml. Weigh 2.0719 g.

Annotations: Butanone dried via silica column and measured into 100 ml RB flask. Used 1 ml extra solvent to wash out container. Started reflux at 13.30. (Had to change heater stirrer.) Only reflux for 45 min, next step 14:15.
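Keeping the plan and the process record side by side makes it possible to check what was done against what was intended. The sketch below uses the planned and recorded values shown on the slide. The `Quantity` class, the `flag_deviations` helper and the 1% tolerance are assumptions made for illustration and are not part of the CombeChem software.

```python
from dataclasses import dataclass


@dataclass
class Quantity:
    name: str
    planned: float
    observed: float
    unit: str

    def deviation_percent(self):
        """Signed difference between record and plan, as a percentage of the plan."""
        return 100.0 * (self.observed - self.planned) / self.planned


# Planned values come from the ingredient list and plan; observed values
# come from the weighings, measurement and annotation in the process record.
RECORD = [
    Quantity("Fluorinated biphenyl", 0.9, 0.9031, "g"),
    Quantity("Potassium carbonate", 2.07, 2.0719, "g"),
    Quantity("Butanone", 40.0, 40.0, "ml"),
    Quantity("First reflux", 90.0, 45.0, "min"),
]


def flag_deviations(record, tolerance_percent):
    """Return the names of quantities that differ from the plan by more than the tolerance."""
    return [q.name for q in record if abs(q.deviation_percent()) > tolerance_percent]


flagged = flag_deviations(RECORD, tolerance_percent=1.0)  # tolerance is an arbitrary choice
```

With these values only the shortened reflux is flagged. The weighings differ from the plan by well under 1%.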

Making Tea (Smarttea.org)

[Figure: RDF graph of the tea-making example, showing the Steps, the Plan and the Process Record. 13:41:36, 14 July 2004. © 2004 University of Southampton.]

File: combechem/process/tea.rdf
Ontology: combechem/process/process-record.rdfs

Key: Process, Input, Literal, Observation

Experiment: MakingTea
experiment-pretty-name: The basic tea experiment
experiment-description: Add tea leaves to hot water, refluxing, filtering, drinking (maybe)
experimenter: http://www.ecs.soton.ac.uk/info/#person-00389

Steps (step-text)
step1: Add tea to hot water
step2: Heat tea for 5 minutes
step3: Filter off tea leaves

Plan
Materials: planned_tea_leaves, planned_some_water, plan-to-tea_in_water, plan-to-hot_tea, plan-to-finished_tea
Processes: plan-to-weigh_tea_leaves, plan-to-measure_some_water, plan-to-add_tea_to_water, plan-to-heat_tea_in_water, plan-to-filter_tea
Observations: planned-weight_of_tea_leaves (5, &cec;massunit-gramme), planned-volume_of_some_water (300, &cec;volumeunit-millilitre)

Process Record
Materials: tea_leaves, some_water, tea_in_water, hot_tea, finished_tea
Processes: weighing_tea_leaves, measuring_some_water, add_tea_to_water, heat_tea_in_water, filter_tea, watching_tea_boil
Observations: weight_of_tea_leaves (5.021, &cec;massunit-gramme), volume_of_some_water (310, &cec;volumeunit-millilitre), heat_tea_notes (value: <tabletscribble>)

Properties: processed-by, processed-by-iv, material-observed-by, process-observed-by, produces-observation, produces-substance, has-unitvalue, value, next-step, part-of-step, step-text, starting-step, starting-process, process-record-of, material-record-of, material-is-ingredient-of, experiment-goal, experiment-pretty-name, experiment-description, experimenter

Namespaces
rdf    http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs   http://www.w3.org/2000/01/rdf-schema#
xsd    http://www.w3.org/2001/XMLSchema#
akt    http://www.aktors.org/ontology/portal#
cml    http://www.xml-cml.org/schema/cml2/core
cec    http://www.combechem.org/ontology/process/0.1#
st     http://smarttea.org/#

getRecord()

There is a potential containment problem in pulling back partial RDF graphs from the triple store.

Solved by using multiple triple stores but boundaries are a major issue for the future.
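The containment problem can be illustrated with a toy in-memory triple list. The sketch below is not the CombeChem `getRecord()` call, whose behaviour is not described here. The function `get_record_subgraph`, its hop limit and the handful of triples (loosely modelled on the tea example) are all hypothetical.

```python
from collections import deque

# Illustrative triples loosely modelled on the tea example; not the contents of tea.rdf.
TRIPLES = [
    ("tea_leaves", "material-observed-by", "weighing_tea_leaves"),
    ("weighing_tea_leaves", "produces-observation", "weight_of_tea_leaves"),
    ("weight_of_tea_leaves", "has-unitvalue", "5.021 g"),
    ("tea_leaves", "processed-by", "add_tea_to_water"),
    ("some_water", "processed-by", "add_tea_to_water"),
    ("add_tea_to_water", "produces-substance", "tea_in_water"),
    ("tea_in_water", "processed-by", "heat_tea_in_water"),
    ("heat_tea_in_water", "produces-substance", "hot_tea"),
    ("hot_tea", "processed-by", "filter_tea"),
    ("filter_tea", "produces-substance", "finished_tea"),
]


def get_record_subgraph(triples, start, max_hops):
    """Breadth-first walk over outgoing edges from start, at most max_hops deep."""
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((p, o))

    seen = {start}
    queue = deque([(start, 0)])
    subgraph = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # boundary reached: nothing beyond this node is returned
        for p, o in outgoing.get(node, []):
            subgraph.append((node, p, o))
            if o not in seen:
                seen.add(o)
                queue.append((o, depth + 1))
    return subgraph


shallow = get_record_subgraph(TRIPLES, "tea_leaves", max_hops=1)
deep = get_record_subgraph(TRIPLES, "tea_leaves", max_hops=10)
```

Even the deep walk never returns the `some_water` triple, because that triple only points into the record. Where to draw the boundary of a "record" is a design decision, not something the graph decides for you.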

Architecture

[Figure: architecture diagram. Labels: Applications (Bench, Planner0, Viewer0, Weights & Measures); “Client” libraries (PHP, Java); SOAP; Jena; SURIG; data stores; semantic data; other services; institutional archives and metadata publication.]

The Analytical Laboratory

Capture information from places you would not want to put your eyes

Capture environmental data automatically

Capture people and movements

Provide this information in real time as well as for the laboratory record

Pub-Sub systems provide the flexible & extensible approach to distribution

[Figure: publish-subscribe architecture. Labels: Data Source (two), Message Broker, Translator Service, Archive Client, Web Client, Mobile phone, PDA, BLOG.]
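The publish-subscribe pattern can be sketched with a small in-process broker. This is an illustration under stated assumptions, not the broker used in the laboratory: the `MessageBroker` class, the topic name `lab/temperature` and the 30 °C alert threshold are all hypothetical.

```python
from collections import defaultdict


class MessageBroker:
    """Minimal in-process broker: data sources publish, clients subscribe by topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher does not know who is listening; the broker fans the message out.
        for callback in list(self._subscribers.get(topic, [])):
            callback(topic, message)


archive = []  # stands in for an archive client
alerts = []   # stands in for a mobile-phone or web client


def alert_if_hot(topic, message):
    # Hypothetical threshold, chosen only for the example.
    if message["celsius"] > 30.0:
        alerts.append(message["sensor"])


broker = MessageBroker()
broker.subscribe("lab/temperature", lambda topic, message: archive.append((topic, message)))
broker.subscribe("lab/temperature", alert_if_hot)

broker.publish("lab/temperature", {"sensor": "room", "celsius": 21.5})
broker.publish("lab/temperature", {"sensor": "laser", "celsius": 35.0})
```

Adding a new consumer, such as a blog feed, means adding one more `subscribe` call. The data sources themselves do not change, which is where the extensibility comes from.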

Temperature – room, laser

Door & interlock, Motion Sensors

Air Conditioning failed

Databases - Our experience

What do you do when the actual users keep changing their minds?

Is a traditional relational database suitable?

Danger of reinforcing scientific bias against relational databases for laboratory data.

RDF & triple stores were again the solution
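One reason a triple store copes with changing requirements is that a new kind of statement needs no schema change. The following is a minimal sketch of that property only. The `TripleStore` class and the property names (`has-weight-g`, `has-colour`) are hypothetical and do not describe 3Store or Jena.

```python
class TripleStore:
    """Toy store: every fact is a (subject, predicate, object) triple."""

    def __init__(self):
        self._triples = []

    def add(self, subject, predicate, obj):
        self._triples.append((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the given pattern; None acts as a wildcard."""
        return [
            (s, p, o)
            for (s, p, o) in self._triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)
            and (obj is None or o == obj)
        ]


store = TripleStore()
store.add("sample-1", "has-weight-g", "0.9031")

# Later the users decide they also want to record colour.
# No table has to be altered; the new predicate is simply used.
store.add("sample-1", "has-colour", "yellow")

about_sample = store.match(subject="sample-1")
```

In a relational design the same change would typically mean altering a table or adding a new one, which is the friction the slide refers to.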

RDF/RDFS High level Schema for chemical properties

Triple Stores - The Heart of the Semantic Web

Scaling - 3Store response

Memory leak in testing program!

Scaling the triplestores

Moved from…

A model of harvesting data from multiple sources into one scalable store

to

A model of distributed RDF sources and caching what is needed for the task at hand into multiple stores fit-for-purpose

The Semantic Web!
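The second model, caching only what a task needs from several distributed sources, can be sketched as follows. The source names, the triples and the `build_task_store` helper are hypothetical. In practice each source would be a remote RDF service, not an in-memory list.

```python
# Hypothetical distributed sources, each holding its own triples.
SOURCES = {
    "lab-notebook": [
        ("sample-1", "weight-g", "0.9031"),
        ("sample-2", "weight-g", "2.0719"),
    ],
    "crystal-archive": [
        ("sample-1", "has-structure", "structure-A"),
        ("sample-3", "has-structure", "structure-B"),
    ],
}


def build_task_store(sources, wanted_subjects):
    """Copy only the triples about the wanted subjects into a small task-specific store."""
    wanted = set(wanted_subjects)
    store = []
    for source_name, triples in sources.items():
        for s, p, o in triples:
            if s in wanted:
                # Keep the source name so the cached triple can be traced back.
                store.append((s, p, o, source_name))
    return store


task_store = build_task_store(SOURCES, ["sample-1"])
```

The task store stays small regardless of how large the sources grow, at the cost of having to decide up front which subjects the task needs.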

Experiments on the Grid: The NCS Service

HTTPS

Binary raw data archived in Atlas Datastore

x300

ADS£’s

A Data-Rich Subject - the Crystallography Problem

[Figure: chemical structure diagrams; only the atom labels (Cl, O, N, N+) survive in the transcript.]

30,000,000

1,500,000

450,000

The eCrystals Digital Repository

http://ecrystals.chem.soton.ac.uk

Access to the underlying data

The eCrystals ‘Global’ Model

[Figure: data flow between repositories and services. Components: Laboratory repository; Institutional data repositories; Aggregator services; Presentation services / portals; Publishers (peer-review journals, conference proceedings, etc.). Flows and activities: Deposit; Validation; Publication; Search, harvest; Data discovery, linking, citation; Data analysis, transformation, mining, modelling; Preservation and curation.]

Laboratory Repositories and Information Management

Need for a data archive in the laboratory

Not just the published spectra!

The R4L Repository

[Figure: screenshots of the R4L repository. Labels: Deposit; Search / Browse; Create new compound; Add experiment data and metadata.]

Several groups making and analysing; the library Administrative Domains transfer or share the data

[Figure: administrative domains. Labels: Researcher, Research Group, Research Group, Institution, National Archive, International Database.]

SVG “active” graphics

Link to data, follow links back to the raw data archive

Link to simulation, full simulation data archived in BioSimGrid

R4L

Paper organized using RDF

Summary:

Making sure other people can find, understand and re-use your data easily and with confidence (even when there is a huge amount of it!)

Make use of Plans to inform the digital context - metadata in advance

Have concern for the “End-to-End life cycle” of chemistry information from the start.

Understanding Usability and Human Computer Interaction is vital for adoption
