Supporting Data-Rich Research on Many Fronts


21 May 2012

University of California Curation Center

California Digital Library

California Digital Library

CDL supports the research lifecycle

•  Collections

•  Digital Special Collections

•  Discovery & Delivery

•  Publishing Group

•  UC Curation Center (UC3)

Serving the University of California

•  10 campuses

•  360K students, faculty, and staff

•  100s of museums, art galleries, observatories, marine centers, botanical gardens

•  5 medical centers

•  5 law schools

•  3 National Laboratories

California Digital Library (CDL)

Our environment circa 2002-2008

Focus on preservation

For memory organizations

Infrastructure: static

Services: hosted

Content: museum & library

Sustainability: ?

Our environment since 2008

Focus on preservation → curation (lifecycle)

For memory organizations → and now data producers

Infrastructure: static → + cloud, VM, bitbucket

Services: hosted → + partnered, self-serve

Content: museum & library → + research, web crawls

Sustainability: ? → cost recovery, pay once

Today’s journey

Data service basics at CDL

•  Stable storage (Merritt)

•  Stable identifiers (EZID)

•  Data citation (DataCite)

•  Management (DMPTool)

•  Preservation cost modeling

... that enable

•  Federation (DataONE)

•  Data papers

•  Capture (WAS web archiving)

•  Excel add-in (DCXL)

The scientific record is at risk

Data dissemination is rare, risky, expensive, labor-intensive, domain-specific, and receives little credit as research output

Global Change, Galactic Change

The changing landscape

•  Ever-increasing number, size, and diversity of content

•  Ever-increasing diversity of partners and stakeholders

•  Decreasing resources

•  Inevitability of disruptive change

–  Technology

–  Institutional mission

[Figure: chart with axes labeled TIME and RESOURCES]

Stable storage: Merritt repository

•  Curation repository open to the UC community and beyond

•  Discipline / content agnostic

•  Micro-services architecture

•  Easy-to-use UI or API

•  Hosted or locally deployed

Primary Functions

1. Deposit

2. Manage (metadata, versions, etc)

3. Access (expose)

4. Share (with other researchers)

5. Preserve

EZID: Long-term identifiers made easy

•  Precise identification of a dataset (DOI or ARK)

•  Credit to data producers and data publishers

•  A link from the traditional literature to the data (DataCite)

•  Exposure and research metrics for datasets (Web of Knowledge, Google)

Primary Functions

1. Create persistent identifiers

2. Manage identifiers (and associated metadata) over time

3. Resolve identifiers

Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation
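The three primary functions above (create, manage, resolve) can be sketched with a toy in-memory registry. This is a minimal illustration only: the class and method names are hypothetical and do not reflect EZID's real interface, and the identifier and URLs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class IdentifierRecord:
    """One persistent identifier with its current target and metadata."""
    identifier: str
    target_url: str
    metadata: dict = field(default_factory=dict)


class ToyIdentifierRegistry:
    """Hypothetical in-memory registry; not EZID's actual API."""

    def __init__(self):
        self._records = {}

    def create(self, identifier, target_url, **metadata):
        # 1. Create: an identifier is assigned once and never reused.
        if identifier in self._records:
            raise ValueError(f"{identifier} already exists")
        record = IdentifierRecord(identifier, target_url, dict(metadata))
        self._records[identifier] = record
        return record

    def update(self, identifier, target_url=None, **metadata):
        # 2. Manage: the identifier string stays fixed; only the target
        #    and the descriptive metadata change over time.
        record = self._records[identifier]
        if target_url is not None:
            record.target_url = target_url
        record.metadata.update(metadata)
        return record

    def resolve(self, identifier):
        # 3. Resolve: map the stable name to its current location.
        return self._records[identifier].target_url


# Made-up identifier and URLs, for illustration only.
registry = ToyIdentifierRegistry()
registry.create("ark:/99999/fk4example",
                "http://old.example.org/data.csv",
                title="Example dataset")
registry.update("ark:/99999/fk4example",
                target_url="http://new.example.org/data.csv")
print(registry.resolve("ark:/99999/fk4example"))
```

The point of the design is that a citation can keep using the same identifier even after the data moves, because only the registry entry is updated.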

Discovery: DataCite consortium

•  Technische Informationsbibliothek (TIB), Germany

•  Australian National Data Service (ANDS)

•  The British Library

•  California Digital Library, USA

•  Canada Institute for Scientific and Technical Information (CISTI)

•  L’Institut de l’Information Scientifique et Technique (INIST), France

•  Library of the ETH Zürich

•  Library of TU Delft, The Netherlands

•  Office of Scientific and Technical Information, US Department of Energy

•  Purdue University, USA

•  Technical Information Center of Denmark

DMPTool

•  Connect researchers to resources to create a data management plan

•  NSF and directorates, NIH, NEH, IMLS, foundations plus

•  Customizable

Meeting funding agencies’ data management plan requirements

Primary Functions

1. Step-by-step “wizard”

2. Templates and examples

3. Links to institutional resources and agency information

4. Plan publication and sharing

Number of Plans Created Oct 2011 – Feb 2012

Cost Model 1: Pay as you go

•  Billed/paid annually

•  Model components are represented by two terms: the number of units and the per-unit cost, e.g., k · S

–  Costs for archival System (A), Workflows (W), Content Types (C), Monitoring (M), and Interventions (V) are considered common goods, and are apportioned equally across all n Producers (P)

–  Storage cost (S) accounted on a per-Producer basis

\begin{cases} P & \text{if year} = 0 \\ 0 & \text{if year} > 0 \end{cases}
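One plausible reading of the pay-as-you-go model can be worked through in a few lines. The sketch assumes that a Producer's annual charge is an equal 1/n share of the common costs, plus that Producer's own storage cost, plus the P term in year 0 only. The function name, parameter names, and all numbers are illustrative assumptions, not values from the slide.

```python
def annual_cost_per_producer(common_costs, n_producers, storage_units,
                             storage_unit_cost, producer_cost, year):
    """Annual pay-as-you-go charge for one Producer (assumed form)."""
    # Each common component is (number of units, per-unit cost).
    shared_total = sum(units * unit_cost
                       for units, unit_cost in common_costs.values())
    # Common goods are apportioned equally across all n Producers.
    shared_share = shared_total / n_producers
    # Storage is accounted on a per-Producer basis: k * S.
    storage = storage_units * storage_unit_cost
    # The P term applies in year 0 only.
    one_time = producer_cost if year == 0 else 0
    return shared_share + storage + one_time


# Made-up numbers, for illustration only.
common = {
    "A": (1, 1000.0),  # archival System
    "W": (2, 100.0),   # Workflows
    "C": (3, 50.0),    # Content Types
    "M": (1, 200.0),   # Monitoring
    "V": (1, 50.0),    # Interventions
}
for year in (0, 1):
    cost = annual_cost_per_producer(common, n_producers=10,
                                    storage_units=5, storage_unit_cost=20.0,
                                    producer_cost=40.0, year=year)
    print(year, cost)
```

With these inputs the shared share is 160 and storage is 100, so the only difference between year 0 and later years is the one-time P term.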

Model 2: Pay once, preserve for “T” years

•  Paid-up price for fixed term T

–  A function of r, the annual investment return, and d, the annual decrease in unit cost of preservation

–  G is the cost of providing a year’s preservation service; G0 includes the added first year expense of Producer engagement and registration

–  Setting T = ∞ calculates the price for “forever”
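The slide names the inputs (r, d, G, G0, T) but not the formula. A minimal sketch, assuming a standard present-value sum: the cost of service in year t shrinks by (1 - d) each year and is discounted by (1 + r) each year, so each later year costs a factor q = (1 - d) / (1 + r) of the one before. The function name and the sample numbers are hypothetical.

```python
import math


def paid_up_price(g0, g, r, d, term):
    """Up-front price for `term` years of preservation (assumed form).

    g0   -- first-year cost, including engagement and registration
    g    -- cost of one ordinary year of service at today's prices
    r    -- annual investment return
    d    -- annual decrease in unit cost of preservation
    term -- number of years T, or math.inf for "forever"
    """
    q = (1 - d) / (1 + r)  # combined yearly shrink factor
    if math.isinf(term):
        if q >= 1:
            raise ValueError("price diverges unless (1 - d) / (1 + r) < 1")
        # Geometric series: g * (q + q^2 + ...) = g * q / (1 - q)
        return g0 + g * q / (1 - q)
    if term < 1:
        raise ValueError("term must be at least one year")
    # Year 0 costs g0; years 1 .. T-1 cost g * q^t in today's money.
    return g0 + sum(g * q ** t for t in range(1, int(term)))


# Made-up numbers, for illustration only.
for term in (1, 10, math.inf):
    print(term, round(paid_up_price(g0=150.0, g=100.0, r=0.03, d=0.10,
                                    term=term), 2))
```

Under this assumption the "forever" price is finite whenever q < 1, which is why T = ∞ can be priced at all.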

New distributed framework

Member Nodes

•  diverse institutions

•  serve local community

•  provide resources for managing their data

Coordinating Nodes

•  retain complete metadata catalog

•  subset of all data

•  perform basic indexing

•  provide network-wide services

•  ensure data availability (preservation)

•  provide replication services

Flexible, scalable, sustainable network
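The division of labor between the two node types can be sketched as a toy model. Everything here is a simplifying assumption: the class names, the replica count, and the keyword search are hypothetical, and this coordinating node keeps only metadata and copy locations rather than a subset of the data itself.

```python
class MemberNode:
    """Holds data objects for its local community."""

    def __init__(self, name):
        self.name = name
        self.objects = {}  # identifier -> data bytes held locally


class CoordinatingNode:
    """Keeps the complete metadata catalog and arranges replication."""

    def __init__(self, replicas=2):
        self.replicas = replicas
        self.catalog = {}    # identifier -> metadata dict
        self.locations = {}  # identifier -> names of nodes holding a copy

    def register(self, origin, identifier, data, metadata, members):
        # The originating member node keeps its own copy.
        origin.objects[identifier] = data
        self.catalog[identifier] = metadata
        holders = [origin]
        # Replicate to other member nodes until enough copies exist.
        for node in members:
            if len(holders) >= self.replicas:
                break
            if node is not origin:
                node.objects[identifier] = data
                holders.append(node)
        self.locations[identifier] = [node.name for node in holders]

    def search(self, keyword):
        # Basic indexing: match the keyword against catalog titles.
        keyword = keyword.lower()
        return sorted(identifier
                      for identifier, meta in self.catalog.items()
                      if keyword in meta.get("title", "").lower())


# Made-up nodes and dataset, for illustration only.
nodes = [MemberNode("node-a"), MemberNode("node-b"), MemberNode("node-c")]
coordinator = CoordinatingNode(replicas=2)
coordinator.register(nodes[0], "example:1", b"site,temp\nA,12.1\n",
                     {"title": "Stream temperatures"}, nodes)
print(coordinator.search("stream"))
print(coordinator.locations["example:1"])
```

Because the catalog and copy locations live at the coordinating node, a dataset stays discoverable and retrievable even if its originating member node goes offline.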

Traditional articles vs data papers

The collective data product

Need to save data + processing

Algorithms + Data Structures = Programs

Vision for a “data paper”

•  Wrap the unfamiliar in a familiar façade

•  A “data paper” is minimally a cover sheet and a set of links to archived artifacts

•  Cover sheet contains familiar elements: title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.)

•  Just enough to permit basic exposure and discovery

–  Building a basic data citation

–  Indexing by services such as Web of Science, Google Scholar

–  Instilling confidence in the identifier’s stability
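The "cover sheet plus links" idea maps naturally onto a small data structure. The sketch below is illustrative only: the class names, the citation layout, and the sample values are assumptions, not a format defined by CDL or DataCite.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverSheet:
    """The familiar elements listed on the slide."""
    title: str
    date: str          # ISO date, e.g. "2012-05-21"
    authors: List[str]
    abstract: str
    identifier: str    # persistent identifier (DOI, ARK, etc.)


@dataclass
class DataPaper:
    """Minimally a cover sheet and a set of links to archived artifacts."""
    cover: CoverSheet
    artifact_links: List[str] = field(default_factory=list)

    def citation(self):
        # An assumed, simplified citation layout built from the cover sheet.
        year = self.cover.date[:4]
        authors = "; ".join(self.cover.authors)
        return f"{authors} ({year}). {self.cover.title}. {self.cover.identifier}"


# Made-up values, for illustration only.
paper = DataPaper(
    cover=CoverSheet(
        title="Example stream temperatures",
        date="2012-05-21",
        authors=["Doe, J.", "Roe, R."],
        abstract="Hourly temperature readings from two example sites.",
        identifier="doi:10.5072/FK2EXAMPLE",
    ),
    artifact_links=["http://repository.example.org/objects/1"],
)
print(paper.citation())
```

Keeping the cover sheet this small is the point: it carries just enough for citation and indexing, while the data and processing live behind the links.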

•  24 institutions

•  120+ archives total

•  43 public archives

•  58K crawls

•  7,500+ sites

•  600 million+ URLs

•  40+ TB

Developed with LoC support by CDL, UNT, and others

What are people using WAS for?

•  Archiving at-risk government websites and publications

•  Archiving their own university domains

•  Building web archives to complement library collections

•  Documenting web coverage of significant events

Data curation for Excel

•  Excel is the database of choice for many researchers

•  Make it easy to share, archive, and publish data

•  Keep up to date at dcxl.cdlib.org

Surveyed users and found:

•  Most researchers are unaware of preservation options

•  Documentation practices are poor

•  Excel is just one tool in workflows

Primary Functions

1. An Excel add-in and web application

2. Metadata description (through extraction and augmentation)

3. Check for good data practices

4. Transfer to repository
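The "check for good data practices" step can be illustrated on a table exported as CSV. The slide does not say which checks DCXL performs, so the three rules below (missing headers, duplicate headers, empty or ragged rows) are assumptions chosen for illustration, and the function name is hypothetical.

```python
import csv
import io


def check_table(csv_text):
    """Return a list of human-readable problems found in a CSV table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["table is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    seen = set()
    for index, name in enumerate(header, start=1):
        label = name.strip()
        if not label:
            problems.append(f"column {index} has no header")
        elif label.lower() in seen:
            problems.append(f"duplicate header: {label}")
        else:
            seen.add(label.lower())
    # Line numbers count the header as line 1.
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {line_no} has {len(row)} cells, expected {len(header)}")
        elif any(not cell.strip() for cell in row):
            problems.append(f"row {line_no} has an empty cell")
    return problems


# Made-up table, for illustration only.
sample = "site,temp,temp,\nA,12.1,12.3,x\nB,,11.9,y\n"
for problem in check_table(sample):
    print(problem)
```

Checks like these are cheap to run before deposit and catch the kinds of documentation gaps that make a spreadsheet hard to reuse later.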

A data curation approach at CDL

•  New “data paper” publishing model [GBMF]

•  DataCite consortium and citation standards

•  Other fronts:

•  DataONE global data network [NSF]

•  Merritt: general-purpose data repository

•  EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids

•  Data management plan generator

•  Web archiving service [Library of Congress]

•  Open-source Excel add-in [MS Research & GBMF]

Questions?

[email protected]

California Digital Library

http://www.cdlib.org/
