Biohackathon2013: Tripling Bioinformatics Productivity

Jan 15, 2015

jervenbolleman

Talking about RDF/SPARQL and what it means for bioinformatics. The main point is that SPARQL is a universal API to data.
Transcript
Page 1: Biohackathon2013: Tripling Bioinformatics Productivity

Tripling Bioinformatics Productivity

Jerven Bolleman

Developer

UniProtKB/Swiss-Prot

Page 2: Biohackathon2013: Tripling Bioinformatics Productivity

© 2013 SIB

Thank you

Page 3: Biohackathon2013: Tripling Bioinformatics Productivity

UniProt.rdf

UniProt.rdf SPARQL

Page 5: Biohackathon2013: Tripling Bioinformatics Productivity

Page 6: Biohackathon2013: Tripling Bioinformatics Productivity

Data first

• Biocuration
  – Recover information ‘lost’ in papers
• curation ≠ data entry
  – Extract knowledge from data
• Structuring knowledge
  – to integrate with related data
  – to answer further questions

Page 7: Biohackathon2013: Tripling Bioinformatics Productivity

Biocuration

Page 8: Biohackathon2013: Tripling Bioinformatics Productivity

• And the rock gets larger every day

Biocuration

Page 9: Biohackathon2013: Tripling Bioinformatics Productivity

MADNESS !

THIS iS Swiss-Prot !

Page 10: Biohackathon2013: Tripling Bioinformatics Productivity

63% more triples in a year

Page 11: Biohackathon2013: Tripling Bioinformatics Productivity

Make data retrieval worthwhile

• If your data is not easily accessible, then no one will query it.
• Simple would be nice, but:
  – you cannot make it simpler than your data
  – if the biology is difficult, so is your database
• After retrieval you must:
  – visualize
  – summarize

Page 12: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL? Give me a better pipette

Page 13: Biohackathon2013: Tripling Bioinformatics Productivity

Visualization is work

Page 15: Biohackathon2013: Tripling Bioinformatics Productivity

Page 16: Biohackathon2013: Tripling Bioinformatics Productivity

www.ebi.ac.uk/fgpt/gwas/

Page 17: Biohackathon2013: Tripling Bioinformatics Productivity

UniProt.rdf

SPARQL

CSV

SERVICE

UniProt.rdf SPARQL

Page 18: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL or CLAY

Page 19: Biohackathon2013: Tripling Bioinformatics Productivity

Progression of query languages

Language        Standardized
SQL             1986-2011
XPath/XQuery    1999-2008
SPARQL          2008-2013

SPARQL

Page 20: Biohackathon2013: Tripling Bioinformatics Productivity

SQL is not standardized

• 7th ISO standard version
• Yet...
  – SHOW TABLES
  – SELECT table_name FROM user_tables
  – LIST TABLES
• Schemas are not fully transferable
  – VARCHAR2 or VARCHAR or CHAR or TEXT...
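To make the dialect problem concrete, here is a minimal Python sketch. The three commands on the slide are, as far as can be told, the MySQL, Oracle and DB2 idioms respectively (that attribution is an assumption, the slide does not name vendors). SQLite, used here only because it ships with Python, needs yet another statement. The table `protein` is made up for illustration.

```python
import sqlite3

# Dialect-specific ways to list tables. The first three come from the slide;
# the vendor names attached to them are an assumption for illustration.
LIST_TABLES = {
    "mysql": "SHOW TABLES",
    "oracle": "SELECT table_name FROM user_tables",
    "db2": "LIST TABLES",
    "sqlite": "SELECT name FROM sqlite_master WHERE type = 'table'",
}


def list_tables(conn):
    """Return the table names of an SQLite connection, sorted."""
    rows = conn.execute(LIST_TABLES["sqlite"]).fetchall()
    return sorted(name for (name,) in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (accession TEXT, name TEXT)")  # hypothetical table
tables = list_tables(conn)
print(tables)
```

Running any of the other three statements against this connection fails, which is the slide's point: the same question needs a different spelling per engine.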

SPARQL

Page 21: Biohackathon2013: Tripling Bioinformatics Productivity

XPath/XQuery

• Fully standardized
  – Also in the marketplace
• Tree-based document query model
  – Assumes all data is in one document

SPARQL

Page 22: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL

• Fully standardized
  – Also in the marketplace
• Graph-based document query model
  – Assumes all data is reachable via the internet
  – Assumes nothing about the storage model

SPARQL

Page 23: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against

• RDBMS
  – R2RML -> D2RQ, Ultrawrap, XSPARQL...
• Programs
  – SADI...
• Triplestore
  – MarkLogic, OWLIM, uRiKA, Oracle Spatial or NoSQL...
• Key-value
  – Redis
• Bioinformatics flat file formats
  – sparql-bed
• CSV/TSV/Spreadsheets
  – Tarql, Sparqlify

SPARQL

Page 24: Biohackathon2013: Tripling Bioinformatics Productivity

UniProt.rdf

SPARQL

UniProt.rdf SPARQL

CSV

SERVICE

Page 25: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against CSV

bed file

chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530

[Diagram: the same two records with their fields shown separately: chr7; 127471196, 127472363, pos1; 127472363, 127473530, pos2; 0 +]

SPARQL

Page 26: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against CSV

• SPARQL works on relations between things
  – subject (thing)
  – predicate (relation)
  – object (thing)
• CSV is a relation between fields via headers

SPARQL

chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530

Start End
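The idea that a header turns each field into a relation can be sketched in a few lines of Python. This is an illustration only, not the mapping used by sparql-bed: the column names follow the usual BED field order, and the `line:N` subject identifier is hypothetical.

```python
# Names of the first eight BED columns. Using them as predicates is an
# illustrative choice, not the vocabulary of any particular tool.
BED_HEADERS = ["chrom", "start", "end", "name", "score", "strand",
               "thickStart", "thickEnd"]


def bed_line_to_triples(line, line_number):
    """Turn one BED line into (subject, predicate, object) tuples."""
    subject = f"line:{line_number}"  # hypothetical identifier for the row
    fields = line.split()
    return [(subject, header, value)
            for header, value in zip(BED_HEADERS, fields)]


triples = bed_line_to_triples(
    "chr7 127471196 127472363 Pos1 0 + 127471196 127472363", 1)
for triple in triples:
    print(triple)
```

Each row becomes a "thing" (the subject), each header a relation (the predicate), and each cell a value (the object), which is all a triple pattern needs to match against.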

Page 27: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against CSV

SPARQL

chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530

faldo:start    faldo:end    a faldo:ExactPosition

Page 28: Biohackathon2013: Tripling Bioinformatics Productivity

SPARQL against CSV

SPARQL

chr7 127471196 127472363 Pos1 0 + 127471196 127472363
chr7 127472363 127473530 Pos2 0 + 127472363 127473530

?start ?end

Page 29: Biohackathon2013: Tripling Bioinformatics Productivity

Sesame

@Override
public CloseableIteration getStatements(Resource subj,
        URI pred, Value obj, Resource... namedgraph)
        throws QueryEvaluationException {
    return new EmptyIteration();
}
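The Sesame stub returns no statements for any pattern. A hedged sketch of what a non-empty version might do, written in Python rather than Java and not using Sesame's API: generate one triple per field and keep those that match whichever of subject, predicate and object are bound. The `rowN` and `colM` identifiers are hypothetical.

```python
def get_statements(lines, subj=None, pred=None, obj=None):
    """Yield triples from whitespace-separated lines that match the pattern.

    None acts as a wildcard, like an unbound variable in a triple pattern.
    Subjects and predicates are hypothetical: 'row<N>' and 'col<M>'.
    """
    for row, line in enumerate(lines, start=1):
        for col, value in enumerate(line.split(), start=1):
            triple = (f"row{row}", f"col{col}", value)
            if all(want is None or want == got
                   for want, got in zip((subj, pred, obj), triple)):
                yield triple


bed = ["chr7 127471196 127472363 Pos1 0 +",
       "chr7 127472363 127473530 Pos2 0 +"]

# Ask for every statement whose predicate is the second column (the start).
starts = list(get_statements(bed, pred="col2"))
print(starts)
```

Because the function is a generator, statements are produced lazily while the file is read, which matches the iterator style of the Java signature.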

Page 30: Biohackathon2013: Tripling Bioinformatics Productivity

Big-O compared to other approaches

• If the SPARQL engine:
  – detects query is per CSV “line”
    • O(number of lines)
  – else
    • O(number of lines * number of joins)
• Same as
  – cat | perl -ne
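The two cost cases can be illustrated by counting line reads. This is a sketch under simplifying assumptions: the "else" branch is modelled as one full re-scan of the file per join, and the class and function names are made up for the example.

```python
class CountingSource:
    """Wraps a list of lines and counts how many times a line is read."""

    def __init__(self, lines):
        self.lines = lines
        self.reads = 0

    def scan(self):
        for line in self.lines:
            self.reads += 1
            yield line.split()


def per_line_query(source, min_start):
    """All patterns concern one line at a time, so a single scan suffices."""
    return [fields[3] for fields in source.scan() if int(fields[1]) >= min_start]


def scan_per_join(source, joins):
    """Fallback sketch: re-scan the whole file once for every join."""
    for _ in range(joins):
        for _fields in source.scan():
            pass


lines = ["chr7 127471196 127472363 Pos1 0 +",
         "chr7 127472363 127473530 Pos2 0 +"]

single = CountingSource(lines)
names = per_line_query(single, 127472000)

repeated = CountingSource(lines)
scan_per_join(repeated, joins=3)

print(names, single.reads, repeated.reads)
```

The single-pass case reads each line once, like `cat | perl -ne`; the fallback reads every line once per join.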

Page 31: Biohackathon2013: Tripling Bioinformatics Productivity

• Strengths
  – Isolates data format from querying
  – Easy to put data on the web
    • (public SPARQL endpoints)
  – Single point of optimization
    • e.g. parallel query execution
  – Other programs can still access data
• Weaknesses
  – Time to code SPARQL to CSV translation
  – Latency
  – Harder to hack the code to see what is going on
    • (no pipe > to temporary file)

Page 32: Biohackathon2013: Tripling Bioinformatics Productivity

Doing this in PERL

wget ftp://ftp.ncbi...human_9606/VCF/00-All.vcf.gz
tabix 00-All.vcf.gz -B target_locations.bed | perl -ane '
  BEGIN{%patient=split /(\S+\n)/s, `cat target_locations.bed`}
  $alt_bases = $patient{"$F[0]\t$F[1]\t".($F[1]+length($F[3])-1)."\t"};
  chomp $alt_bases;
  print join("\t", @F[0..4], $1), "\n" if $F[4] eq $alt_bases and /MAF=(\d\.\d+)/'

Page 33: Biohackathon2013: Tripling Bioinformatics Productivity

SELECT ?patientSnp ?dbSnp ?maf {
  ?patientSnp a ?mutationType ;
              faldo:begin ?patientBegin ;
              faldo:end   ?patientEnd ;
              rdf:value   ?patientValue .
  ?mutationType rdfs:subClassOf :mutation .
  SERVICE <ftp://ftp.ncbi.../human_9606/VCF/00-All.vcf.gz> {
    ?dbSnp a ?mutationType ;
           faldo:begin ?patientBegin ;
           faldo:end   ?patientEnd ;
           rdf:value   ?patientValue ;
           :MinorAlleleFrequency ?maf .
  }
}

Doing this in SPARQL
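What the query asks the engine to do is a join on the variables shared by both patterns: mutation type, begin, end and value. A minimal Python sketch of that join, using a hash index; the records and field names are invented for illustration, and the `rdfs:subClassOf` check is left out.

```python
def join_snps(patient_snps, db_snps):
    """Join on (type, begin, end, value), the variables both patterns share."""
    index = {}
    for snp in db_snps:
        key = (snp["type"], snp["begin"], snp["end"], snp["value"])
        index.setdefault(key, []).append(snp)

    results = []
    for snp in patient_snps:
        key = (snp["type"], snp["begin"], snp["end"], snp["value"])
        for match in index.get(key, []):
            results.append((snp["id"], match["id"], match["maf"]))
    return results


# Made-up records standing in for the patient data and the remote VCF.
patient = [{"id": "p1", "type": "snv", "begin": 100, "end": 100, "value": "A"},
           {"id": "p2", "type": "snv", "begin": 250, "end": 250, "value": "G"}]
dbsnp = [{"id": "rs1", "type": "snv", "begin": 100, "end": 100, "value": "A", "maf": 0.12},
         {"id": "rs2", "type": "snv", "begin": 250, "end": 250, "value": "T", "maf": 0.30}]

rows = join_snps(patient, dbsnp)
print(rows)
```

In the SPARQL version none of this plumbing is written by hand: reusing the same variable names in both graph patterns is what expresses the join.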

Page 34: Biohackathon2013: Tripling Bioinformatics Productivity

At your SERVICE

Page 35: Biohackathon2013: Tripling Bioinformatics Productivity

SELECT ?doi ?citatingDoi
WHERE {
  uniprot:P06280 up:annotation ?annotation ;
                 up:citation ?citation .
  ?citation dc:identifier ?doiRaw ;
            up:name "Nature" .
  ?annotation a up:Disease_Annotation .
  BIND (substr(?doiRaw, 5) AS ?doi)
  SERVICE <http://data.nature.com/sparql> {
    ?article prism:doi ?doi ;
             nature:hasCitation ?citationCitingCitation .
    ?citationCitingCitation prism:doi ?citatingDoi
  }
}
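SPARQL's SUBSTR counts characters from 1, so `substr(?doiRaw, 5)` drops the first four characters before the value is matched at the remote endpoint. A minimal Python sketch, assuming the raw identifier carries a four-character `doi:` prefix (an inference from the offset, not stated on the slide); the DOI below is made up.

```python
def sparql_substr(text, start):
    """Mimic SPARQL SUBSTR(text, start) for start >= 1 (SPARQL counts from 1)."""
    if start < 1:
        raise ValueError("this sketch only handles start >= 1")
    return text[start - 1:]


doi = sparql_substr("doi:10.1038/example", 5)  # hypothetical identifier
print(doi)
```

Normalising the key on the local side like this is what lets the two datasets join even though they write the same identifier differently.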

Page 36: Biohackathon2013: Tripling Bioinformatics Productivity

Benefits of SERVICE

• In a world where data keeps growing
  – upload a 1KB query = cheap
  – download a 500GB dataset = expensive
• SPARQL viable via the web
  – 400GB of UniProt data can stay at UniProt
  – Your NGS data can stay in your data centre
• Easiest data compression is avoiding 100 copies ;)
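A back-of-the-envelope calculation shows the size of the gap. The 100 Mbit/s link speed is an assumption chosen for illustration, and the sketch ignores latency and the size of the query results that come back.

```python
KB = 1_000
GB = 1_000_000_000

BANDWIDTH = 12_500_000  # bytes per second; assumed 100 Mbit/s link


def transfer_seconds(size_bytes, bytes_per_second):
    """Time to move size_bytes over a link, ignoring latency and overhead."""
    return size_bytes / bytes_per_second


query_s = transfer_seconds(1 * KB, BANDWIDTH)
dataset_s = transfer_seconds(500 * GB, BANDWIDTH)
ratio = (500 * GB) // (1 * KB)

print(f"query: {query_s:.5f} s, dataset: {dataset_s / 3600:.1f} h, ratio: {ratio:,}x")
```

Under these assumptions the query travels in a fraction of a millisecond while the dataset takes about eleven hours, and every additional site that wants the data pays that cost again.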

Page 37: Biohackathon2013: Tripling Bioinformatics Productivity

Network of SPARQL endpoints

• Like a social network
  – value increases the more members there are

Page 39: Biohackathon2013: Tripling Bioinformatics Productivity
