BiSciCol: Linking Information for Biodiversity Scientists

Dec 14, 2014

John Deck

Describes the need for better ontologies and better identifier schemes in the quest to break down the walled gardens of biodiversity science.
Transcript
Page 1: BiSciCol: Linking Information for Biodiversity Scientists

The BiSciCol Project
Linking Information for Biodiversity Scientists
John Deck, UC Berkeley

BiSciCol Team: Reed Beaman, Nico Cellinese, Jonathan Coddington, Tom Conlin, Neil Davies, John Deck, Rob Guralnick, Bryan P. Heidorn, Chris Meyer, Tom Orrell, Rich Pyle, Brian Stucky, Rob Whitton

Page 2: BiSciCol: Linking Information for Biodiversity Scientists

Adapted from The Economist, by David Simonds

The Biodiversity Data Integration Challenge

Page 3: BiSciCol: Linking Information for Biodiversity Scientists

“I’m here to fight for truth, justice, and the American way.” – Superman

Ontologies, vocabularies, and standards help provide a common understanding of the structure of information, allowing us to break data down to its fundamental parts.

Page 4: BiSciCol: Linking Information for Biodiversity Scientists

“Your identity is your most valuable possession. Protect it. And if anything goes wrong, use your powers.” – Elastigirl

Identifiers allow us to tag, track, or reference any object or process. They must be awesome: persistent, unique, resolvable.

Page 5: BiSciCol: Linking Information for Biodiversity Scientists

The BiSciCol Strategy

Spreadsheets / DwC Archives / Raw Data

Break down to fundamental parts

Assign awesome identifiers

Re-assemble and integrate

Page 6: BiSciCol: Linking Information for Biodiversity Scientists

Back to Reality …

A Data Integration Experiment: Link records between VertNet and GenBank using the Darwin Core Triplet (InstitutionCode : CollectionCode : CatalogNumber)
• 1,400,000 VertNet records
• 460,739 GenBank records (filtered by VertNet institutions)

Question: What % of harvested GenBank records could be linked to VertNet voucher specimen records using the Darwin Core Triplet?

Less than 1%!
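
The linking experiment above can be illustrated with a minimal Python sketch. It is not the code used in the experiment; the function names, the sample records, and the normalization rules (trim whitespace, ignore case) are assumptions made for illustration. Only the field names `institutionCode`, `collectionCode`, and `catalogNumber` come from Darwin Core.

```python
def make_key(institution, collection, catalog):
    """Build a normalized triplet key, or None if any part is empty."""
    parts = tuple(p.strip().upper() for p in (institution, collection, catalog))
    return parts if all(parts) else None


def parse_voucher(text):
    """Parse an 'Institution:Collection:Catalog' string into a triplet key.

    Returns None when the string does not have exactly three parts.
    """
    parts = text.split(":")
    if len(parts) != 3:
        return None
    return make_key(*parts)


def link_records(specimen_records, voucher_strings):
    """Return (voucher, specimen) pairs whose triplet keys match."""
    index = {}
    for rec in specimen_records:
        key = make_key(rec["institutionCode"], rec["collectionCode"], rec["catalogNumber"])
        if key is not None:
            index[key] = rec
    links = []
    for voucher in voucher_strings:
        key = parse_voucher(voucher)
        if key in index:
            links.append((voucher, index[key]))
    return links


# Made-up sample data, for illustration only.
vertnet = [
    {"institutionCode": "MVZ", "collectionCode": "Mamm", "catalogNumber": "12345"},
]
vouchers = [
    "MVZ:Mamm:12345",      # well-formed triplet
    "mvz : mamm : 12345",  # same triplet with different case and spacing
    "MVZ 12345",           # not a triplet, cannot be parsed
    "MVZ:Herp:999",        # well-formed, but no matching specimen
]

links = link_records(vertnet, vouchers)
print(f"{len(links)} of {len(vouchers)} vouchers linked")
```

Even in this toy case, any voucher string that is not written as a clean three-part triplet drops out of the match, which hints at why the measured link rate can be so low when triplets are not validated at the source.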

Page 7: BiSciCol: Linking Information for Biodiversity Scientists

Identifier Challenges

Darwin Core triplets (at least as currently specified in standards, and implemented) do not do well for linking data.

NONE of the identifiers (that we found) employ strategies to ensure truly long-term persistence, decoupling metadata from the identifier itself.

Interim Solutions
Fix DwC Triplets standards/validation (that’s you, GenBank), build a Triplet resolver

PURL

Awesome Solutions

Page 8: BiSciCol: Linking Information for Biodiversity Scientists

BiSciCol Strategies to Address the Biodiversity Data Integration Challenge

Ontologies, vocabularies, standards
Biological Collections Ontology (http://code.google.com/p/bco)
Genomic, Biodiversity, and Ecological standards alignment

*+BCIDs
Free, persistent, scalable, resolvable and awesome identifiers for biodiversity data, built on CDL’s EZID system (http://biscicol.org/bcid/)

*Triplifier
Chunks raw data into fundamental parts then re-assembles as RDF and integrates with other data (http://biscicol.org/triplifier/)

*Learn more about these projects at the Software Bazaar
+More about BCIDs integrating with VertNet on Day 2
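
As a rough illustration of what "chunking raw data into fundamental parts and re-assembling as RDF" can mean, the Python sketch below turns one spreadsheet row into N-Triples statements. It is not the Triplifier's implementation or output format; the function names, the `http://example.org/` subject URI, and the sample row are hypothetical. The predicate namespace is the Darwin Core terms namespace.

```python
# Darwin Core terms namespace, used here to build predicate URIs.
DWC = "http://rs.tdwg.org/dwc/terms/"


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(subject_uri, row):
    """Emit one N-Triples statement per non-empty cell in the row.

    Each column name is treated as a Darwin Core term (an assumption
    made for this sketch) and each cell value as a plain literal.
    """
    lines = []
    for term, value in row.items():
        if value == "":
            continue  # skip empty cells rather than emit empty literals
        lines.append(f'<{subject_uri}> <{DWC}{term}> "{escape_literal(value)}" .')
    return lines


# Hypothetical spreadsheet row and subject identifier.
row = {
    "institutionCode": "MVZ",
    "collectionCode": "Mamm",
    "catalogNumber": "12345",
    "recordedBy": "",
}
triples = row_to_ntriples("http://example.org/specimen/1", row)
for line in triples:
    print(line)
```

Once every cell is an individual statement about an identified subject, integration with other data reduces to using the same subject identifier in both sources, which is where persistent identifiers such as BCIDs come in.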

Page 9: BiSciCol: Linking Information for Biodiversity Scientists
Page 10: BiSciCol: Linking Information for Biodiversity Scientists

Ontology / Vocabulary Challenges

Need to clarify assumptions behind concepts
• Individual / Material Sample / Specimen / Population
• Different interpretations x-domains: MIxS, INSDC, DwC, OBI

Varying degrees of formalism
• Checklists, spreadsheets, RDF, OBO, OWL

Insufficient support for standards organizations
• Consisting of tenuous structures maintained by informal networks of active volunteers

Solutions:
• Continually improve clarity in definitions
• Work towards more robust standards governance frameworks
• Implement test beds and better understand use cases