The DATS model: datasets descriptions for data discovery in DataMed

Jan 24, 2017

Page 1: The DATS model: datasets descriptions for data discovery in DataMed

Supported by the NIH grant 1U24 AI117966-01 to UCSD PI, Co-Investigators at:

Alejandra Gonzalez-Beltran, Susanna-Assunta Sansone, Philippe Rocca-Serra

Oxford e-Research Centre, University of Oxford, UK

Smart Descriptions & Smarter Vocabularies (SDSVoc)

30 November-1 December 2016, CWI Amsterdam Science Park

The DATS model: dataset descriptions for data discovery in DataMed

Page 2: The DATS model: datasets descriptions for data discovery in DataMed

Just as JATS (Journal Article Tag Suite) is used by PubMed to index literature, DATS (DatA Tag Suite) is needed as a scalable way to index data sources in the DataMed prototype

http://datamed.org

NIH BD2K Data Discovery Index prototype

Biomedical and healthcare datasets

Page 3: The DATS model: datasets descriptions for data discovery in DataMed

open community-driven development & documentation

http://biocaddie.org/workgroup-3-group-links

http://github.com/biocaddie/WG3-MetadataSpecifications

http://tiny.cc/datswebinar

Page 4: The DATS model: datasets descriptions for data discovery in DataMed

What is DATS' remit?

• Enabling discoverability: find and access datasets available in multiple repositories
• Focusing on surfacing key metadata descriptors, such as
  ◦ information and relations between datasets, creators, publications, funding sources, nature of biological signal and perturbation, etc.
• Not the perfect model to represent the experimental details
  ◦ The level of detail and metadata needed to ensure interoperability and reusability is left to the indexed databases
  ◦ We have aimed for maximum coverage of use cases with a minimal number of data elements and relations
  ◦ Only very few properties are required
  ◦ Follow Best Practices for Data on the Web

Page 5: The DATS model: datasets descriptions for data discovery in DataMed

The development process in a nutshell

Metadata elements identified by combining two complementary approaches:

USE CASES: top-down approach
SCHEMAS: bottom-up approach

Model serialized as JSON schemas, with a mapping to schema.org (v1.0, v1.1, v2.0, v2.1)
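
To illustrate what serializing the model as JSON schemas makes possible, here is a minimal sketch in Python. The schema fragment, its property names (`title`, `identifier`, `creators`) and the `check_instance` helper are assumptions made for illustration, not the published DATS schemas; a real deployment would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical schema fragment in JSON Schema style; the property names
# and the "required" list are illustrative, not the actual DATS schema.
DATASET_SCHEMA = json.loads("""
{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "identifier": {"type": "string"},
    "creators": {"type": "array"}
  },
  "required": ["title"]
}
""")

# Map JSON Schema type names to Python types (only the subset used here).
_TYPES = {"string": str, "array": list, "object": dict}


def check_instance(instance, schema):
    """Return a list of problems: missing required keys or wrong types."""
    problems = []
    for key in schema.get("required", []):
        if key not in instance:
            problems.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        expected = _TYPES.get(rule.get("type"))
        if key in instance and expected and not isinstance(instance[key], expected):
            problems.append(f"wrong type for property: {key}")
    return problems
```

Keeping the `required` list short mirrors the stated aim of a model with very few mandatory properties: a repository can publish a sparse description and still pass the check.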

Page 6: The DATS model: datasets descriptions for data discovery in DataMed

Using an existing model?

Considered multiple models, mapped/analyzed these ones (bottom-up approach):

Generic Models
• schema.org
• DataCite
• RIF-CS (Registry Interchange Format – Collection and Services)
• W3C HCLS dataset descriptions (mapping of many models including Dublin Core, DCAT, PROV, VoID, VoID-ext)
• Project Open Metadata (used by HealthData.gov)

Life Science / Biomedical Models
• ISA (Investigation/Study/Assay)
• BioProject
• BioSample
• MINiML
• PRIDE-ml
• MAGE-TAB
• GA4GH metadata schema
• SRA xml
• CDISC SDM / element of BRIDG model

Page 7: The DATS model: datasets descriptions for data discovery in DataMed

DATS model for scalable indexing

Convergence of elements extracted from competency questions and existing (generic and biomedical) data models (incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml, etc.)

[Figure: core entities and extended entities]

Adoption of elements extracted from [...] and from [...], plus elements from other models (e.g. dataset/distribution/catalog from DCAT)
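
The dataset/distribution/catalog pattern borrowed from DCAT can be sketched as three linked record types. This is a minimal illustration using Python dataclasses; the class names, field names and the `find` helper are hypothetical and do not reproduce the DATS or DCAT definitions, and the URL is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    """One concrete way to obtain a dataset (e.g. a downloadable file)."""
    access_url: str
    media_type: str = ""


@dataclass
class Dataset:
    """A described dataset, available through zero or more distributions."""
    title: str
    distributions: List[Distribution] = field(default_factory=list)


@dataclass
class Catalog:
    """A repository-level collection of dataset descriptions."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)

    def find(self, keyword: str) -> List[Dataset]:
        """Return datasets whose title contains the keyword (case-insensitive)."""
        needle = keyword.lower()
        return [d for d in self.datasets if needle in d.title.lower()]


# Placeholder content, for illustration only.
catalog = Catalog(
    "Example repository",
    [Dataset("Mouse liver RNA-seq",
             [Distribution("https://example.org/data.csv", "text/csv")])],
)
```

Separating the description (`Dataset`) from the means of access (`Distribution`) lets an index point at data held in many repositories without storing the data itself.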

Page 8: The DATS model: datasets descriptions for data discovery in DataMed

Serializations and use of schema.org

• DATS model represented as JSON schemas, instances as:
  ◦ JSON* format, and
  ◦ JSON-LD** with vocabulary from schema.org
  ◦ Serializations in other formats and with other vocabularies can also be done, as / if needed
• Benefits for DataMed and databases indexed by DataMed
  ◦ Increased visibility (by popular search engines), accessibility (via common query interfaces) and possibly improved ranking
• Use and extensions of schema.org
  ◦ Submitted missing DATS elements to their tracker
  ◦ Coordinating the extension of schema.org for life science via the bioschemas.org initiative (which ELIXIR is also part of)

* JavaScript Object Notation
** JavaScript Object Notation for Linked Data
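
The difference between the two instance serializations can be sketched as follows. The plain JSON record, its keys and the `to_jsonld` helper are assumptions made for illustration; `Dataset`, `name`, `description`, `identifier`, `distribution`, `DataDownload` and `contentUrl` are schema.org terms, but the mapping shown is not the official DATS-to-schema.org mapping.

```python
import json

# Hypothetical plain-JSON description of a dataset (keys are illustrative).
plain = {
    "title": "Example expression dataset",
    "description": "Placeholder description used for this sketch.",
    "identifier": "example:0001",
    "download_url": "https://example.org/data/0001.csv",
}


def to_jsonld(record):
    """Re-express a plain record as JSON-LD using schema.org terms."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": record["title"],
        "description": record["description"],
        "identifier": record["identifier"],
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": record["download_url"],
        },
    }


print(json.dumps(to_jsonld(plain), indent=2))
```

The `@context` and `@type` keys are what let a generic consumer, such as a search engine crawler, interpret the remaining keys as shared vocabulary terms rather than repository-specific field names.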

Page 9: The DATS model: datasets descriptions for data discovery in DataMed

Implementations and documentation

Other adopters exporting DATS in their APIs

To evaluate DATS model capabilities

Work in progress: documentation and curation guidelines for adopters

Page 10: The DATS model: datasets descriptions for data discovery in DataMed

Thanks!

bioCADDIE Working Groups
https://biocaddie.org/group/working-groups