© Trustees of Indiana University. Released under Creative Commons 3.0 unported license; license terms on last slide. Building a Data Discovery Network for Sustainability Science. Robert H. McDonald, Deputy Director, Data to Insight (D2I) Center; Associate Dean, IU Libraries, Indiana University. [email protected] | @mcdonald @SEADdatanet. Presented at the VIVO 2012 Conference, Miami, FL, August 24, 2012. Available from: http://slidesha.re/Q9q8VW

Building a Data Discovery Network for Sustainability Science

Nov 17, 2014


Robert McDonald

This is the slide deck for my SEAD presentation at the 3rd International VIVO Conference, held on August 24, 2012, in Miami, FL.
Transcript
Page 1: Building a Data Discovery Network for Sustainability Science

© Trustees of Indiana University
Released under Creative Commons 3.0 unported license; license terms on last slide.

Building a Data Discovery Network for Sustainability Science

Robert H. McDonald
Deputy Director, Data to Insight (D2I) Center
Associate Dean, IU Libraries
Indiana University
[email protected] | @mcdonald @SEADdatanet

Presented at the VIVO 2012 Conference
Miami, FL, August 24, 2012
Available from: http://slidesha.re/Q9q8VW

Page 2: Building a Data Discovery Network for Sustainability Science

NSF DataNet Program

Motivation: “… one of the major challenges of this scientific generation: how to develop the new methods, management structures and technologies to manage the diversity, size, and complexity of current and future data sets and data streams.”

Response: DataNet creates “a set of exemplar national and global data research infrastructure organizations” to address this challenge.

Page 3: Building a Data Discovery Network for Sustainability Science

Current NSF DataNet Projects

• SEAD: http://sead-data.net

• DataONE: http://www.dataone.org

• DataNet Federation Consortium: http://datafed.org

• Terra Populus: https://www.pop.umn.edu/terra_pop

Page 4: Building a Data Discovery Network for Sustainability Science

Sustainable Environment Actionable Data (SEAD) - DataNet

• SEAD Strategy

― Serve scientists and researchers in the “long tail” of science

― Leverage social media for discovery of data, interest, and expertise

― Move data curation upstream in the data life cycle of science

― Take advantage of existing domain and institutional infrastructures (Institutional Repositories, ICPSR) for long-term preservation

SEAD Partners - http://sead-data.net

Page 5: Building a Data Discovery Network for Sustainability Science

SEAD TEAMS

Michigan: Margaret Hedstrom (PI), Ann Zimmerman (Co-PI), Karen Woollams, George Alter (ICPSR), Bryan Beecher (ICPSR), Jude Yew

Indiana: Beth Plale (Co-PI), Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Robert Ping

Rensselaer: James Myers (Co-PI), Ram Prasanna Govind Krishnan, Lindsay Todd

Illinois: Praveen Kumar (Co-PI), Md Aktaruzzaman, Terry McLaren (NCSA), Rob Kooper (NCSA), Luigi Marini (NCSA)

Page 6: Building a Data Discovery Network for Sustainability Science

SEAD 18-month Pilot Phase

• Domain Engagement
– National Center for Earth Systems Dynamics (NCESD), Illinois River Basin Observatory
– Requirements, Use Cases, Prioritization of Data Types and Services

• Active and Social Curation
– Pilot Active Content Repository, VIVO deployments
– Exemplar services for Data Ingest, Discovery, Re-use, Curation (Tupelo/Medici)

• CI for Long-term Access (Virtual Archive)
– Data model, protocol design/development
– Pilot Federated Repository infrastructure

• Education, Outreach, and Training
– Post-doc mentoring
– Web site, training materials, meetings, workshops, …

• Project Oversight
– Management, reporting, committees
– Business model development

Page 7: Building a Data Discovery Network for Sustainability Science

Sustainability Science

[Diagram labels: Science, Technology, Economics, Poverty & Justice, Policy, Cooperation]

Page 8: Building a Data Discovery Network for Sustainability Science

Data challenges

• Heterogeneity of all kinds
• Multiple scales
• Multidisciplinary
• Many small datasets

Page 9: Building a Data Discovery Network for Sustainability Science

The long tail of scientific research

• Small and derived data sets
• Heterogeneous data
• Multiple sources of data
• Short-lived data with long-term value
• Value of data grows when combined & integrated

Page 10: Building a Data Discovery Network for Sustainability Science
Page 11: Building a Data Discovery Network for Sustainability Science

SEAD notions of defined Data Phases

• Phases of data lifecycle acknowledge and accommodate the difference between public data and data still in work by a researcher.

• Research Data Phase: the data set is a research data collection, owned by an individual and under their control.
– Data need not be licensed at this time because it is not ready for broader release
– Data need not have permanent IDs because it is still a work in progress
– Corresponds to first existence in the Active Curation Repository

• Published Phase: the owner of the research data collection determines that the dataset is ready for publication
– License terms set
– Persistent ID
– Made available as part of public profile in VIVO
– Activated by user-controlled publish event
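The two phases can be read as a small state model. The sketch below is only an illustration of that reading, not SEAD's implementation: the class, field, and method names are hypothetical, and it assumes that a publish event is what sets the license terms and the persistent ID together.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    RESEARCH = "research"      # work in progress, under the owner's control
    PUBLISHED = "published"    # released by a user-controlled publish event


@dataclass
class DataCollection:
    """Hypothetical model of a research data collection and its phase."""
    owner: str
    title: str
    phase: Phase = Phase.RESEARCH
    license_terms: Optional[str] = None   # not required in the research phase
    persistent_id: Optional[str] = None   # not required in the research phase

    def publish(self, license_terms: str, persistent_id: str) -> None:
        """Move the collection to the published phase (owner-triggered)."""
        if self.phase is Phase.PUBLISHED:
            raise ValueError("collection is already published")
        if not license_terms or not persistent_id:
            raise ValueError("publication needs license terms and a persistent ID")
        self.license_terms = license_terms
        self.persistent_id = persistent_id
        self.phase = Phase.PUBLISHED

    def visible_in_public_profile(self) -> bool:
        """Only published collections would appear in a public profile."""
        return self.phase is Phase.PUBLISHED


if __name__ == "__main__":
    collection = DataCollection(owner="example-researcher", title="Example stream gauge data")
    print(collection.phase, collection.visible_in_public_profile())
    # The license string and identifier below are placeholders, not real values.
    collection.publish("CC-BY-3.0", "doi:10.0000/example")
    print(collection.phase, collection.visible_in_public_profile())
```

Keeping the license and identifier optional until `publish` is called mirrors the slide's point that research-phase data need neither.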

Page 12: Building a Data Discovery Network for Sustainability Science

SEAD CI Technical Approach

[Architecture diagram. Zone labels: Active and Social Curation; Curation Boundary; OAIS Repository Federation. Component labels:]

• User, Contributor
• Data Acquisition, Analysis and Simulation
• Search, Browse, Annotation, Visualization Tools
• Use, Reuse, Repurposing Tools
• Access Mechanisms and E-Scholarship Services
• Scholarly Communication
• Active Content Repository
• VIVO/Linked Data
• Metadata Management: DDI3, METS, PREMIS, MODS, DC, SensorML, OGC, …
• Automated Curation Workflow/Rule Engine: operates on Metadata, Content Objects and Trigger Events
• Digital Repository Federation (OAIS compliant)
• Appraisal and Selection
• Ingest, AIPs
• Ingest scripts: fixity, integrity, authentication, transformation
• Preservation Actions
• Migration and Emulation Tools
• Compound Objects - OAI-ORE
• Dissemination Packages
• Wide-Area File System

Page 13: Building a Data Discovery Network for Sustainability Science

SEAD CI Technical Approach

[Architecture diagram repeated from Page 12, here with the label "SEAD Active Data Systems" and the callout below.]

A standardized data model and federation capability over OAIS-Standard Institutional Repositories

Page 14: Building a Data Discovery Network for Sustainability Science
Page 15: Building a Data Discovery Network for Sustainability Science

SEAD CI Technical Approach

[Architecture diagram repeated from Page 12, here with the federation labeled "SEAD Trusted Digital Repository Federation (OAIS compliant)" and the callout below.]

A robust, replicated distributed file system used as a large-scale backing store

Page 16: Building a Data Discovery Network for Sustainability Science

SEAD CI Technical Approach

[Architecture diagram repeated from Page 12, here with the federation labeled "SEAD Trusted Digital Repository Federation (OAIS compliant)" and the callout below.]

An Active Content Repository based on standard global IDs and semantic web technologies, to collect and integrate data, metadata, and provenance information from multiple sources.

[Additional labels: Content (four boxes); Lustre File System; DC:Creator, OPM:wasDerivedFrom, SWAN:isEvidenceFor, …]

Page 17: Building a Data Discovery Network for Sustainability Science
Page 18: Building a Data Discovery Network for Sustainability Science

SEAD CI Technical Approach

[Architecture diagram repeated from Page 12, with the callouts below.]

SEAD will run a VIVO instance and may harvest Linked Data from other sources

VIVO Application: Open Source federatable Researcher Information – people, papers, projects, centers, fields, etc.

Page 19: Building a Data Discovery Network for Sustainability Science
Page 20: Building a Data Discovery Network for Sustainability Science

SEAD CI Technical Approach

[Architecture diagram repeated from Page 12, here with the federation labeled "SEAD Trusted Digital Repository Federation (OAIS compliant)" and the callout below.]

Active and Social Curation Services supporting automated and interactive use of SEAD, leveraging standard web application/web service toolkits and virtual machine infrastructure

Page 21: Building a Data Discovery Network for Sustainability Science

SEAD CI Technical Approach

[Architecture diagram repeated from Page 12, here with the federation labeled "SEAD Trusted Digital Repository Federation (OAIS compliant)" and the callout below.]

Curation and Preservation Services also leveraging standard web application/web service toolkits and virtual machine infrastructure

Page 22: Building a Data Discovery Network for Sustainability Science

SEAD Data Curation Lifecycle Elements

[Architecture diagram repeated from Page 12 (SEAD CI Technical Approach), shown under the title above with the same component labels.]

Page 23: Building a Data Discovery Network for Sustainability Science

SEAD Active/Social Curation Repository

Page 24: Building a Data Discovery Network for Sustainability Science
Page 25: Building a Data Discovery Network for Sustainability Science
Page 26: Building a Data Discovery Network for Sustainability Science
Page 27: Building a Data Discovery Network for Sustainability Science
Page 28: Building a Data Discovery Network for Sustainability Science
Page 29: Building a Data Discovery Network for Sustainability Science
Page 30: Building a Data Discovery Network for Sustainability Science
Page 31: Building a Data Discovery Network for Sustainability Science
Page 32: Building a Data Discovery Network for Sustainability Science
Page 33: Building a Data Discovery Network for Sustainability Science
Page 34: Building a Data Discovery Network for Sustainability Science
Page 35: Building a Data Discovery Network for Sustainability Science
Page 36: Building a Data Discovery Network for Sustainability Science
Page 37: Building a Data Discovery Network for Sustainability Science

SEAD VIVO: RIS2N3

Page 38: Building a Data Discovery Network for Sustainability Science

SEAD Virtual Archive

Page 39: Building a Data Discovery Network for Sustainability Science

Faceted search (Solr-based)

Facets
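As a rough illustration of what a Solr-backed faceted search request could look like, the sketch below builds a query URL using Solr's standard `facet`, `facet.field`, and `fq` request parameters. The host, core name, and field names are assumptions made for the example; the slides do not say how the SEAD index is laid out.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not give the Solr host or core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/sead/select"


def build_facet_query(text, facet_fields, filters=None, rows=10):
    """Return a Solr select URL that asks for hits plus facet counts.

    text          free-text query; an empty value falls back to match-all
    facet_fields  field names to facet on (hypothetical names in the example)
    filters       mapping of field -> value, applied as fq filter queries,
                  as when a user clicks on a facet value
    """
    params = [
        ("q", text or "*:*"),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", "1"),  # hide facet values with zero hits
    ]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # "creator" and "location" are illustrative field names only.
    print(build_facet_query("river", ["creator", "location"], {"creator": "Kumar"}))
```

Each click on a facet value would add one more `fq` filter, which narrows the result set without changing the main query.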

Page 40: Building a Data Discovery Network for Sustainability Science

Search Result

Page 41: Building a Data Discovery Network for Sustainability Science

A dataset or file looks like this

Page 42: Building a Data Discovery Network for Sustainability Science

Geospatial search (from Postgres index)

Page 43: Building a Data Discovery Network for Sustainability Science

Geospatial search results

Page 44: Building a Data Discovery Network for Sustainability Science

Login for data upload

Page 45: Building a Data Discovery Network for Sustainability Science

Upload file

Files from Medici can also be added

Page 46: Building a Data Discovery Network for Sustainability Science

Create collection (can have multiple files)

Page 47: Building a Data Discovery Network for Sustainability Science

Upload complete

Page 48: Building a Data Discovery Network for Sustainability Science

Data ingested to DSpace (Mississippi example)

Page 49: Building a Data Discovery Network for Sustainability Science

SEAD Virtual Archive Architecture

[Architecture diagram. Labels:]

• SEAD Ingest Client / UI
• SIP (Data + Metadata)
• Ack
• Data Validation (Fixity, Virus)
• Preservation Metadata Generation (Events)
• Feature Extraction from Data
• SIP breakdown
• AIP + Data
• Geospatial + Temporal Metadata
• Core Property + Domain Metadata
• Solr Index
• PostgreSQL Index
• Obtain DOI from DataCite
• IU DataCite ID Server
• IR (DSpace)
• Store data object, its metadata object, and its relationship record (latter as RDF) in IR as a collection
• Register DOI with VIVO
• VIVO server
• Register metadata to Solr and PostgreSQL for rapid retrieval of metadata
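Two of the ingest steps named on this slide, fixity validation and the SIP breakdown, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SEAD Virtual Archive code: the function names, the event dictionary layout, the metadata keys, and the choice of SHA-256 are all hypothetical, and the virus scan, DOI minting, and repository deposit steps are left out.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Tuple

# Hypothetical metadata keys treated as geospatial/temporal for this example.
GEO_TEMPORAL_KEYS = {"bounding_box", "start_date", "end_date"}


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the payload."""
    return hashlib.sha256(data).hexdigest()


def validate_fixity(data: bytes, expected_sha256: str) -> Dict[str, str]:
    """Compare the payload digest with the submitted one and return a
    preservation-event record (layout is illustrative only)."""
    actual = sha256_hex(data)
    return {
        "event_type": "fixity check",
        "algorithm": "SHA-256",
        "digest": actual,
        "outcome": "pass" if actual == expected_sha256.lower() else "fail",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def split_metadata(metadata: Dict[str, object]) -> Tuple[Dict[str, object], Dict[str, object]]:
    """Break SIP metadata into (geospatial + temporal, core + domain) parts,
    as a stand-in for the 'SIP breakdown' step."""
    geo_temporal = {k: v for k, v in metadata.items() if k in GEO_TEMPORAL_KEYS}
    core_domain = {k: v for k, v in metadata.items() if k not in GEO_TEMPORAL_KEYS}
    return geo_temporal, core_domain


if __name__ == "__main__":
    payload = b"example sensor readings"
    event = validate_fixity(payload, sha256_hex(payload))
    print(event["outcome"])
    geo, core = split_metadata({
        "title": "Example dataset",
        "creator": "Example Researcher",
        "bounding_box": [-91.5, 37.0, -87.5, 42.5],
        "start_date": "2011-01-01",
    })
    print(sorted(geo), sorted(core))
```

One plausible reading of the diagram, given that the geospatial search slide cites a Postgres index and the faceted search slide cites Solr, is that the first part feeds the PostgreSQL index and the second feeds Solr; the slide itself does not spell this out.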

Page 50: Building a Data Discovery Network for Sustainability Science

Key Questions for SEAD Prototype

• What could SEAD capture when?

• How can SEAD provide direct value to data producers, users, and curators?

• How can web 2.0/3.0 and social computing lower barriers and reduce/realign costs?

Page 51: Building a Data Discovery Network for Sustainability Science

Towards A Shared Data Future

Source: EU HLEG Report on Data Deluge: Riding the Wave, pg 31, 2010

[Diagram labels: Trust, Data Curation, Data Generators, Users]

Community Support Services

Common Data Services

User functionalities, data capture & transfer, virtual research environments

Data discovery & navigation, workflow generation, annotation, interpretability

Persistent storage, identification, authenticity (provenance), workflow execution, data mining

Page 52: Building a Data Discovery Network for Sustainability Science

Data Interoperability

• NSF OCI: DataNet and INTEROP, now DIBBs

• EUDAT

• Data Web Forum

• IETF Research Data Identifier BOF

• Upcoming Oct. US Meeting of DataNet, INTEROP, Data Web Forum

Page 53: Building a Data Discovery Network for Sustainability Science

Acknowledgements

SEAD is funded by the National Science Foundation under cooperative agreement #OCI0940824

• For more on SEAD go to: http://sead-data.net

• Follow us on Twitter @SEADdatanet

Page 54: Building a Data Discovery Network for Sustainability Science

License terms

• Please cite as: McDonald, R.H. et al. Building a Data Discovery Network for Sustainability Science. 3rd International VIVO Conference, Miami, FL, 24 August 2012. Available from: [http://slidesha.re/Q9q8VW]

• Thanks to Margaret Hedstrom, who’s guided the team through the (really) lengthy review process and to Jim Myers, Beth Plale, Praveen Kumar, Terry McLaren, Luigi Marini, Kavitha Chandrasekar and others who provided content for this presentation.

• The concepts and software being leveraged in SEAD represent the work of a broad range of people over multiple years – their contributions have been critical to launching SEAD.

• Items indicated with a © are under copyright and used here with permission. Such items may not be reused without permission from the holder of copyright except where license terms noted on a slide permit reuse.

• This document is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share – to copy, distribute and transmit the work and to remix – to adapt the work under the following conditions: attribution – you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.