NISO Working Group Connection LIVE! Research Data Metrics Landscape: An update from the NISO Altmetrics Working Group B: Output Types & Identifiers Monday, November 16, 2015 Presenters: Kristi Holmes, PhD, Director, Galter Health Sciences Library, Northwestern University Mike Taylor, Senior Product Manager, Informetrics, Elsevier Philippe Rocca-Serra, Ph.D., Technical Project Leader, Oxford Tom Demeranville, THOR Senior Project Officer & ORCiD Software Engineer Martin Fenner, Technical Director, DataCite Dr. Sarah Callaghan, Senior Researcher and Project Manager, British Atmospheric Data Centre Dr. Melissa Haendel, Associate Professor, Ontology Development Group, OHSU Library, Dept of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University http:// www.niso.org /news/events/2015/ wg_connections_live / altmetrics_wgb /
Transcript
NISO Working Group Connection LIVE! Research Data Metrics Landscape:
An update from the NISO Altmetrics Working Group B: Output Types & Identifiers
Philippe Rocca-Serra, Ph.D., Technical Project Leader, Oxford
Tom Demeranville, THOR Senior Project Officer & ORCiD Software Engineer
Martin Fenner, Technical Director, DataCite
Dr. Sarah Callaghan, Senior Researcher and Project Manager, British Atmospheric Data Centre
Dr. Melissa Haendel, Associate Professor, Ontology Development Group, OHSU Library, Dept of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University
Usage Stats
• aggregate DataONE usage log files from DataONE member nodes
• parse logs, applying COUNTER rules (double-click intervals, whitelisted user agents)
• two versions of usage stats: COUNTER-compliant, and partially compliant (includes some bots)

Average % not filtered:
• since 2005: COUNTER 63.57%, Partial 63.59%
• this past year: COUNTER 44.88%, Partial 47.05%
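The log-parsing step above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual DataONE pipeline: the record layout, the toy robot denylist, and the helper names are assumptions, though the 30-second "double-click" window follows the COUNTER Code of Practice.

```python
from datetime import datetime, timedelta

# Illustrative log records: (timestamp, user_agent, client_ip, dataset_id).
DOUBLE_CLICK_WINDOW = timedelta(seconds=30)  # COUNTER double-click interval
ROBOT_AGENTS = ("googlebot", "crawler", "spider")  # toy denylist, not the real list

def count_downloads(events):
    """Count COUNTER-style downloads: drop known robots and collapse
    repeat requests for the same dataset by the same client within
    the double-click window."""
    last_seen = {}
    total = 0
    for ts, agent, ip, dataset in sorted(events, key=lambda e: e[0]):
        if any(bot in agent.lower() for bot in ROBOT_AGENTS):
            continue
        key = (ip, dataset)
        prev = last_seen.get(key)
        if prev is not None and ts - prev <= DOUBLE_CLICK_WINDOW:
            last_seen[key] = ts  # repeat click; refresh the timestamp and skip
            continue
        last_seen[key] = ts
        total += 1
    return total
```

Running the partially compliant variant mentioned above would simply mean skipping the robot filter while keeping the double-click rule.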
Future Work
•Collect data citations from CrossRef
•Analyze usage statistics in more detail and provide input to COUNTER and NISO
•Analyze network graph, e.g. linked datasets and second order citations
•Turn research project into service, including integration of client applications for search and reporting
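The first future-work item, collecting data citations from CrossRef, might look like the sketch below, which tallies citation events per dataset DOI from records shaped like Crossref Event Data responses. The field names (`subj_id`, `obj_id`, `relation_type_id`) follow that API, but the input here is a local sample rather than a live call, and the default relation type is an assumption.

```python
from collections import Counter

def tally_data_citations(events, relation="cites"):
    """Count how often each dataset DOI appears as the object of a
    citation relation. `events` mimics Crossref Event Data records."""
    counts = Counter()
    for ev in events:
        if ev.get("relation_type_id") == relation:
            counts[ev["obj_id"]] += 1
    return counts
```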
Introducing the Metadata Model v1
Philippe Rocca-Serra, PhD, University of Oxford e-Research Centre
on behalf of WG3 Metadata WG
Supported by the NIH grant 1U24 AI117966-01 to the University of California, San Diego
A trans-NIH funding initiative established to enable biomedical research as a digital research enterprise:
• Facilitate broad use of biomedical digital assets by making them discoverable, accessible, and citable
• Conduct research and develop the methods, software, and tools needed to analyze biomedical Big Data
• A catalog to enable researchers to find and cite research datasets
• Ease the use of community standards to annotate datasets
Lucila Ohno-Machado (PI), Jeff Grethe
Pilot applications that ‘dock’ with the prototype and community-driven activities via Working Groups:
1. BD2K Centers of Excellence Collaboration
2. Data Identifiers Recommendation
3. Metadata Specifications
4. Use Cases and Testing Benchmarks
5. Dataset Citation Metrics
6. Criteria for Being Included in the DDI
7. Machine Actionable Licenses
8. Ranking Algorithm
9. End User Evaluation Criteria
10. Repository Collaboration
11. Outreach Meeting: Repository Operators
12. Standard-driven Curation Best Practices
13. Evaluation of Harvesting and NLP Pilot Projects
All this by August 2017!
Joint effort with BD2K Center for Expanded Data Annotation and Retrieval (CEDAR)
Synergies with BD2K cross-centers Metadata WG (co-chaired by M Musen/CEDAR, G Alter/bioCADDIE) and ELIXIR activities
WG3 Metadata - Goals
Define a set of metadata specifications that support intended capability of the Data Discovery Index prototype - being designed by the bioCADDIE Core Development Team - as outlined in the White Paper
Core metadata, designed to be future-proofed for progressive extensions (phase 1: May–July 2015), followed by a test and implementation phase
Domain specific metadata for more specialized data types (phase 2)
Use cases and competency questions have been used throughout the process to define the appropriate boundaries and level of granularity: which queries will be answered in full, which only partially, and which are out of scope.
WG3 Metadata – work to date
With contributions and comments from several WG3 members and colleagues, in particular: Joan Starr, George Alter, Ian Fore, Kevin Read, Stian Soiland-Reyes, Muhammad Amith, Michel Dumontier…
Contains lists of material reviewed:
• data discovery initiatives and metadata initiatives
• existing meta-models for representing metadata elements
Outlines the approach used to identify metadata descriptors:
• via use cases and competency questions (top-down approach)
• mapping generic and life science-specific metadata schemas (bottom-up approach), listed in the BioSharing collection for bioCADDIE
The results of both approaches have been compared and converged on the core set of metadata.
Bottom-up approach: survey of existing models
• schema.org
• DataCite
• HCLS dataset descriptors
• BioSample
• GEO MINiML
• PRIDE XML
• ISA-Tab/MAGE-TAB
• GA4GH metadata schema
• SRA XML
• BioProject
• CDISC SDM / elements of the BRIDG model
Use Cases and Derived Metadata
A representative set of competency questions was selected from the use cases workshop, the white paper, community submissions, and Phil Bourne. The questions have been abstracted, and key metadata elements have been highlighted, color-coded, and categorized. As the set of core and extended metadata elements is defined, it will become clearer which questions the Data Discovery Index will be able to answer in full and which only in part.
Processing use cases
All use cases on equal footing
• Term binning: Material, Process, Information, Property
• Relation identification
Core metadata elements and initial model
The combined approaches have delivered a set of core metadata elements; progressively, these will be extended to domain-specific ones in phase two, as needed. We aim for maximum coverage of use cases with a minimal number of data elements, but we foresee that not all questions can be answered in full.
Initial Set of Metadata Elements
Everything is on github
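To make the idea of a core element set concrete, a harvester might check candidate records against it before indexing. The element names below are purely illustrative placeholders; the actual bioCADDIE specification is the one published on GitHub.

```python
# Hypothetical core element names — the real list is defined in the
# bioCADDIE metadata specification, not here.
CORE_ELEMENTS = {"identifier", "title", "creators", "dataRepository", "landingPage"}

def missing_core_elements(record):
    """Return which core elements a candidate dataset record lacks —
    the kind of check an ingest pipeline might run before indexing."""
    present = {k for k, v in record.items() if v not in (None, "", [])}
    return CORE_ELEMENTS - present
```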
What’s next?
With this work, phase 1 has been completed and we have entered the evaluation phase. The model will be implemented and tested by the bioCADDIE Development Team with a number of data sources. The results will inform the activities in phase 2, where the metadata elements and the model may be revised, simplified, and/or enriched as needed.
Take Home Message
• primary goal: provide a general-purpose metadata schema that allows harvesting of key experimental and data descriptors from a variety of resources and enables indexing to support data discovery
– relations between authors, datasets, publications and funding sources
– nature of biological signal, nature of perturbation,
Outstanding issues
• prioritizing the use cases
• defining mechanisms to deal with domain-specific, granular data
• moving into phase 2 and devising data ingesters
– ETL activities
– interaction with other modeling efforts
• incorporating feedback from users and developers
Question Time
orcid.org
Contact Info: p. +1-301-922-9062 a. 10411 Motor City Drive, Suite 750, Bethesda, MD 20817 USA
ORCID, Metrics and Project THOR
Tom Demeranville, Senior Technical Officer – Project THOR
NISO Webinar, November 2015
What is ORCID?
• ORCID is an infrastructure that provides unique person identifiers.
• ORCID is a hub for linking identifiers for people with their activities.
• ORCID is researcher-centric, with 1.7 million registered identifiers.
• ORCID records are managed by the researchers themselves.
• ORCID is open source, community-governed and non-profit.
• ORCID has a public API that allows querying of non-private data.
• ORCID has a member API that enables updating and notifications.
• ORCID iDs are associated with over 4 million unique DOIs.
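Because ORCID iDs are plain identifiers rather than metrics, systems that link them mostly need to validate them. An ORCID iD ends in an ISO 7064 MOD 11-2 check digit, as described in ORCID's public documentation; the sketch below recomputes it. The example iD 0000-0002-1825-0097 is the sample ORCID itself publishes.

```python
def orcid_check_digit(base_digits):
    """Compute the ORCID iD check digit (ISO 7064 MOD 11-2).
    `base_digits` is the 15-digit string before the final character."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_orcid(orcid):
    """Validate an iD like '0000-0002-1825-0097' by recomputing its check digit."""
    digits = orcid.replace("-", "")
    if len(digits) != 16:
        return False
    return orcid_check_digit(digits[:15]) == digits[15]
```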
347 members, 4 national consortia, over 200 integrations
Members by sector: research inst 68%, publisher 12%, funder 5%, association 6%, repository MEA 3% (remaining 9% unlabeled).
Registrations by region: Europe 58%, Latin America 1%, North America 26%, Pacific 7%, Asia 5%.
What ORCID isn’t
ORCID is not a CRIS system. ORCID is not a researcher profile system. ORCID is not a research activity metadata store.
Research outputs
• ORCID includes links to publications, patents, datasets, software and more.
• ORCID uses the CASRAI Output vocabulary for work types
• ORCID references over 20 other output identifiers (more are being added!)
Other researcher activities
• Peer review
• Education
• Employment
ORCID and Metrics
ORCID doesn’t track metrics – it’s not our focus. ORCID is an enabling infrastructure. ORCID improves the robustness of metrics.
ORCID and Metrics
• ORCID improves the quality of research information and makes gathering and disseminating it easier.
• Other services use ORCID iDs to improve their data.
• ORCID iDs are found in DOI metadata, funder systems, publishers, CRIS systems, national reporting frameworks and more.
• Institutions can discover researcher-curated standard and non-standard outputs, or be notified when they are added.
Project THOR
http://project-thor.eu
An EC-funded H2020 project running 2.5 years
Establish seamless integration between articles, data, and researchers across the research lifecycle
Make persistent identifier use for people and research artefacts the default
Research – deciding what needs to be done
Integration – doing what needs to be done
Outreach – getting others involved
Sustainability – making sure it lasts
Community driven consensus on requirements is needed.
We need a way forward.
THOR will help by convening meetings with all interested parties in the community, including research institutions, funders, datacentres, publishers, standards bodies, existing organisation identifier and other identifier providers.
NISO Working Group Connections LIVE! Research Data Metrics Landscape:
An update from the NISO Altmetrics Working Group B: Output Types & Identifiers
Monday, November 16 from 11:00 a.m. – 1:00 p.m. (ET)
The UK’s Natural Environment Research Council (NERC) funds six data centres which between them have responsibility for the long-term management of NERC's environmental data holdings.
We deal with a variety of environmental measurements, along with the results of model simulations, in:
• Atmospheric science
• Earth sciences
• Earth observation
• Marine science
• Polar science
• Terrestrial & freshwater science, hydrology and bioinformatics
• Space weather
Who are we and why do we care about data?
Data, Reproducibility and Science
Science should be reproducible – other people doing the same experiments in the same way should get the same results.
Observational data is not reproducible (unless you have a time machine!)
Therefore we need to have access to the data to confirm the science is valid! (Image: http://www.flickr.com/photos/31333486@N00/1893012324/sizes/o/in/photostream/)
It used to be “easy”…
Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665
The Scientific Papers of William Parsons, Third Earl of Rosse 1800-1867
…but datasets have gotten so big, it’s not useful to publish them in hard copy anymore
Hard copy of the Human Genome at the Wellcome Collection
Creating a dataset is hard work!
"Piled Higher and Deeper" by Jorge Cham, www.phdcomics.com
Managing and archiving data so that it’s understandable by other researchers is difficult and time consuming too.
We want to reward researchers for putting that effort in!
Most people have an idea of what a publication is
Some examples of data (just from the Earth Sciences)
1. Time series, some still being updated, e.g. meteorological measurements
2. Large 4D synthesised datasets, e.g. climate, oceanographic, hydrological and numerical weather prediction model data generated on a supercomputer
3. 2D scans, e.g. satellite data, weather radar data
4. 2D snapshots, e.g. cloud camera
5. Traces through a changing medium, e.g. radiosonde launches, aircraft flights, ocean salinity and temperature
6. Datasets consisting of data from multiple instruments as part of the same measurement campaign
Dataset: "Recorded information, regardless of the form or medium on which it may be recorded including writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow, charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data."
(from the U.S. National Institutes of Health (NIH) Grants Policy Statement via DataCite's Best Practice Guide for Data Citation).
In my opinion a dataset is something that is:
• the result of a defined process
• scientifically meaningful
• well-defined (i.e. a clear definition of what is in the dataset and what isn’t)
Example metrics reported to our funders:
• Number of dataset discovery records visible from the NERC data discovery service – reflects compliance with NERC data management policy and how many data sets NERC has.
• Web site visits, quarterly (BADC: 61,600; NEODC: 10,200) – active use and visibility of the data centre; site visits from standard web log analysis systems, such as Webalizer, with sensible web crawler filters applied.
• Queries marked as resolved within the quarter – active use and visibility of the data centre; a query is a request for information, a problem, or an ad hoc data request.
We’re working with DataCite and Thomson Reuters to get data citation counts.
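Citation counting only works if datasets are cited in a recognisable form. DataCite recommends a citation of roughly the shape Creator (PublicationYear). Title. Publisher. Identifier; the helper below assembles such a string from its parts. The function name and input values are illustrative, not a DataCite API.

```python
def data_citation(creator, year, title, publisher, doi):
    """Assemble a DataCite-style dataset citation string.
    The DOI is rendered as a resolvable https://doi.org/ URL."""
    return f"{creator} ({year}). {title}. {publisher}. https://doi.org/{doi}"
```

For example, a hypothetical BADC dataset with DOI 10.5285/example would be cited as "Callaghan, S. (2013). Example atmospheric dataset. NERC British Atmospheric Data Centre. https://doi.org/10.5285/example".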
Altmetrics and social media for data?
Mainly focussing on citation as a first step, as it’s most commonly accepted by researchers.
We have a social media presence, @CEDAnews, mainly used for announcements about service availability.
We definitely want ways of showing our funders that we provide a good service to our users and the research community. And we want to be able to tell our depositors what impact their data has had!
RDA/WDS WG Bibliometrics Survey Results: Mostly Expected
Citations are preferred metrics, downloads next.
Standards are missing. Culture change is needed.
[Bar chart] What do you currently use to evaluate the impact of data? Options: nothing; data citation counts; downloads; social media (likes/shares/tweets); mentions in peer-reviewed papers; hits in search engines; mentions in blogs; bookmarks in Zotero and/or Mendeley; other (please specify).
[Pie chart] Are the methods you use to evaluate impact adequate for your needs? Yes: 31.5%, No: 68.5%.
Other projects in the data metrics space
1. CASRAI data level metrics
2. PLOS Making Data Count
3. NISO altmetrics
4. Jisc Giving Researchers Credit for their Data
Next steps for Bibliometrics for Data WG
Will be based on:
• WG survey results (presented at RDA P4 and P5)
• Spreadsheet of metrics being collected by repositories – still open for contributions! http://bit.ly/1MpyW4K
• Shared results from other projects – understanding the challenges and answering the questions posed in the case statement
• Preliminary analysis of data DOI resolutions
• Supporting and evaluating tools from other projects
• Preliminary guidance for the community – “minimal” rather than “best” practice – get people discussing the issues and coming up with solutions!
Outputs: binary redistribution package (installer); algorithm; data analytic software tool; analysis scripts; data cleaning; APIs; codebook (for content analysis); source code; software to make metadata for libraries, archives and museums; program codes (for modeling); commentary in code (thinking of open source – need to attribute code authors and commentators/enhancers/hackers, who can document what they did and why); computer language (a syntax to describe a set of operations or activities); software patch (set of changes to code to fix bugs, add features, etc.); digital workflow (automated sequence of programs, steps to an outcome); software library (non-stand-alone code that can be incorporated into something larger); software application (computer code that accomplishes something)
VIVO-ISF: Suite of ontologies that integrates and extends community standards
Credit extends beyond the original contribution
• Stacy creates mouse1
• Kristi creates mouse2
• Karen performs RNAseq analysis on mouse1 and mouse2 to generate dataset3, which she subsequently curates and analyzes
• Karen writes publication pmid:12345 about the results of her analysis
• Karen explicitly credits Stacy as an author, but not Kristi.
Credit is connected
Credit to Stacy is asserted, but credit to Kristi can be inferred
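The inference above can be sketched as a walk over a small provenance graph: anyone whose output is in the transitive input closure of the publication is an inferred contributor. The graph encodes the mouse1/mouse2 scenario from the previous slide; the structure and function names are illustrative, not the VIVO-ISF representation.

```python
# Toy provenance graph: each artifact maps to the artifacts it was derived from.
DERIVED_FROM = {
    "pmid:12345": ["dataset3"],
    "dataset3": ["mouse1", "mouse2"],
}
CREATOR = {"mouse1": "Stacy", "mouse2": "Kristi",
           "dataset3": "Karen", "pmid:12345": "Karen"}

def inferred_contributors(artifact):
    """Everyone whose output is in the transitive input closure of `artifact`."""
    seen, stack, people = set(), [artifact], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        people.add(CREATOR[node])
        stack.extend(DERIVED_FROM.get(node, []))
    return people
```

Walking back from pmid:12345 reaches dataset3, then both mice, so Kristi's contribution surfaces even though only Stacy was explicitly credited.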
Introducing openRIF
The Open Research Information Framework
[Diagram: openRIF, encompassing SciENcv, eagle-i, and VIVO-ISF]
Ensuring an openRIF that meets community needs
Data entry – Discovery – Interoperability
A domain configurable suite of ontologies to enable interoperability across systems
A community of developers, tools, data providers, and end-users
Developing a computable research ecosystem
Research information is scattered amongst:
• research networking tools
• citation databases (e.g., PubMed)
• award databases (e.g., NIH RePORTER)
• curated archives (e.g., GenBank)
• text (the research literature), where it is locked up
Map SciENcv data model to VIVO-ISF/openRIF
Enable bi-directional data exchange
Integrate SciENcv, ORCID data into