Supporting Data-Rich Research on Many Fronts 21 May 2012 University of California Curation Center California Digital Library
May 13, 2015
California Digital Library
CDL supports the research lifecycle
• Collections
• Digital Special Collections
• Discovery & Delivery
• Publishing Group
• UC Curation Center (UC3)
Serving the University of California
• 10 campuses
• 360K students, faculty, and staff
• 100s of museums, art galleries, observatories, marine centers, botanical gardens
• 5 medical centers
• 5 law schools
• 3 National Laboratories
California Digital Library (CDL)
Our environment circa 2002-2008
Focus on preservation
For memory organizations
Infrastructure: static
Services: hosted
Content: museum & library
Sustainability: ?
Our environment since 2008
Focus on preservation, curation (lifecycle)
For memory organizations, and now data producers
Infrastructure: static + cloud, VM, bitbucket
Services: hosted + partnered, self-serve
Content: museum & library + research, web crawls
Sustainability: cost recovery, pay once
Today’s journey
Data service basics at CDL
• Stable storage (Merritt)
• Stable identifiers (EZID)
• Data citation (DataCite)
• Management (DMPTool)
• Preservation cost modeling
... that enable
• Federation (DataONE)
• Data papers
• Capture (WAS web archiving)
• Excel add-in (DCXL)
The scientific record is at risk
Data dissemination is rare, risky, expensive, labor-intensive, domain-specific, and receives little credit as research output
Global Change Galactic Change
The changing landscape
• Ever-increasing number, size, and diversity of content
• Ever-increasing diversity of partners and stakeholders
• Decreasing resources
• Inevitability of disruptive change
– Technology
– Institutional mission
[Figure labels: TIME, RESOURCES]
Stable storage: Merritt repository
• Curation repository open to the UC community and beyond
• Discipline / content agnostic
• Micro-services architecture
• Easy-to-use UI or API
• Hosted or locally deployed
Primary Functions
1. Deposit
2. Manage (metadata, versions, etc)
3. Access (expose)
4. Share (with other researchers)
5. Preserve
EZID: Long-term identifiers made easy
• Precise identification of a dataset (DOI or ARK)
• Credit to data producers and data publishers
• A link from the traditional literature to the data (DataCite)
• Exposure and research metrics for datasets (Web of Knowledge, Google)
Primary Functions
1. Create persistent identifiers
2. Manage identifiers (and associated metadata) over time
3. Resolve identifiers
Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation
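The "resolve identifiers" function can be illustrated with a minimal sketch that maps a DOI or ARK string to a resolver URL. It assumes the public resolvers doi.org (for DOIs) and n2t.net (for ARKs); the slides do not say which resolver EZID itself uses, and the function name `resolver_url` and the sample identifiers are hypothetical placeholders.

```python
def resolver_url(identifier: str) -> str:
    """Map a 'doi:...' or 'ark:...' identifier to a resolver URL.

    Assumption: DOIs resolve through https://doi.org/ and ARKs through
    https://n2t.net/. This is an illustration, not EZID's own code.
    """
    scheme, sep, rest = identifier.strip().partition(":")
    if not sep or not rest:
        raise ValueError(f"not a scheme-prefixed identifier: {identifier!r}")

    scheme = scheme.lower()
    if scheme == "doi":
        # doi:10.5072/FK2EXAMPLE -> https://doi.org/10.5072/FK2EXAMPLE
        return "https://doi.org/" + rest
    if scheme == "ark":
        # ark:/99999/fk4example -> https://n2t.net/ark:/99999/fk4example
        return "https://n2t.net/ark:" + rest
    raise ValueError(f"unsupported identifier scheme: {scheme!r}")


if __name__ == "__main__":
    # Placeholder identifiers, not real datasets.
    print(resolver_url("doi:10.5072/FK2EXAMPLE"))
    print(resolver_url("ark:/99999/fk4example"))
```

Keeping the scheme prefix in the stored identifier is what lets one function handle both DOI and ARK, in line with the scheme-agnostic goal stated for EZID.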
Discovery: DataCite consortium
• Technische Informationsbibliothek (TIB), Germany
• Australian National Data Service (ANDS)
• The British Library
• California Digital Library, USA
• Canada Institute for Scientific and Technical Information (CISTI)
• L’Institut de l’Information Scientifique et Technique (INIST), France
• Library of the ETH Zürich
• Library of TU Delft, The Netherlands
• Office of Scientific and Technical Information, US Department of Energy
• Purdue University, USA
• Technical Information Center of Denmark
DMPTool
• Connect researchers to resources to create a data management plan
• NSF and directorates, NIH, NEH, IMLS, foundations, plus
• Customizable
Meeting funding agencies' data management plan requirements
Primary Functions
1. Step-by-step “wizard”
2. Templates and examples
3. Links to institutional resources and agency information
4. Plan publication and sharing
Number of Plans Created Oct 2011 – Feb 2012
Cost Model 1: Pay as you go
• Billed/paid annually
– Costs for archival System (A), Workflows (W), Content Types (C), Monitoring (M), and Interventions (V) are considered common goods and are apportioned equally across all n Producers (P)
– Storage cost (S) accounted on a per-Producer basis
• Model components are represented by two terms: the number of units and the per-unit cost, e.g., k · S
– The Producer (P) term applies only in the first year: $\begin{cases} P & \text{if year} = 0 \\ 0 & \text{if year} > 0 \end{cases}$
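The pay-as-you-go model can be sketched in a few lines of Python. This is one reading of the slide, not the authors' formula: it assumes the common-good components are summed and divided equally among n Producers, that storage is added per Producer as k · S, and that the P term is a one-time Producer cost charged only in year 0. The names `Component` and `annual_cost_per_producer` and all figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Component:
    """A model component: number of units times per-unit cost (e.g. k * S)."""
    units: float
    unit_cost: float

    @property
    def cost(self) -> float:
        return self.units * self.unit_cost


def annual_cost_per_producer(
    common: Dict[str, Component],  # A, W, C, M, V: shared "common goods"
    n_producers: int,              # n: Producers sharing the common costs
    storage: Component,            # k * S: this Producer's own storage
    producer_cost: float,          # P: assumed one-time first-year cost
    year: int,
) -> float:
    if n_producers < 1:
        raise ValueError("n_producers must be at least 1")
    shared = sum(c.cost for c in common.values()) / n_producers
    first_year = producer_cost if year == 0 else 0.0
    return shared + storage.cost + first_year


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    common = {
        "A": Component(1, 1000.0),
        "W": Component(2, 500.0),
        "C": Component(4, 250.0),
        "M": Component(1, 600.0),
        "V": Component(2, 200.0),
    }
    storage = Component(units=5, unit_cost=100.0)  # e.g. 5 TB at 100 per TB
    for year in (0, 1):
        print(year, annual_cost_per_producer(common, 10, storage, 300.0, year))
```

Because the shared term is divided by n, the per-Producer bill falls as more Producers join, while the storage term scales only with that Producer's own usage.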
Cost Model 2: Pay once, preserve for “T” years
• Paid-up price for fixed term T
– A function of r, the annual investment return, and d, the annual decrease in unit cost of preservation
– G is the cost of providing a year’s preservation service; G0 includes the added first-year expense of Producer engagement and registration
– Setting T = ∞ calculates the price for “forever”
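A minimal sketch of how such a paid-up price might be computed follows. The slide does not give the formula, so this assumes a standard discounted sum: the first year costs G0, and each later year t costs G reduced by the factor (1 − d)^t and discounted by (1 + r)^t. The function name `paid_up_price` and the sample values are hypothetical.

```python
import math


def paid_up_price(g0: float, g: float, r: float, d: float,
                  t_years: float = math.inf) -> float:
    """Paid-up price for preserving content for t_years.

    Assumed model (not taken from the slides):
        price = G0 + sum_{t=1}^{T-1} G * q**t,  with q = (1 - d) / (1 + r)
    For T = infinity and q < 1 the geometric series converges to
        price = G0 + G * q / (1 - q)
    """
    q = (1.0 - d) / (1.0 + r)
    if t_years == math.inf:
        if q >= 1.0:
            raise ValueError("the 'forever' price diverges unless (1-d)/(1+r) < 1")
        return g0 + g * q / (1.0 - q)
    if t_years < 1:
        raise ValueError("t_years must be at least 1")
    return g0 + sum(g * q ** t for t in range(1, int(t_years)))


if __name__ == "__main__":
    # Made-up inputs: G0 = 150, G = 100, r = 0 %, d = 50 % per year.
    for term in (1, 3, 10, math.inf):
        print(term, paid_up_price(150.0, 100.0, 0.0, 0.5, term))
```

Under these assumptions the "forever" price is finite whenever unit costs fall or investments earn a positive return, which is why a one-time payment can cover an unbounded term.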
New distributed framework
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
Coordinating Nodes
• retain complete metadata catalog
• subset of all data
• perform basic indexing
• provide network-wide services
• ensure data availability (preservation)
• provide replication services
Flexible, scalable, sustainable network
Traditional articles vs data papers
The collective data product
Need to save data + processing
Algorithms + Data Structures = Programs
Vision for a “data paper”
• Wrap the unfamiliar in a familiar façade
• A “data paper” is minimally a cover sheet and a set of links to archived artifacts
• Cover sheet contains familiar elements: title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.)
• Just enough to permit basic exposure and discovery
– Building a basic data citation
– Indexing by services such as Web of Science, Google Scholar
– Instilling confidence in the identifier’s stability
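The "cover sheet plus links" idea can be sketched as a small data structure. The field names follow the cover-sheet elements listed above; the class names, the citation layout, and the sample values are assumptions made for illustration, not a format defined in the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverSheet:
    """Familiar bibliographic elements of a data paper."""
    title: str
    date: str              # assumed ISO 8601, e.g. "2012-05-21"
    authors: List[str]
    abstract: str
    identifier: str        # persistent identifier (DOI, ARK, etc.)


@dataclass
class DataPaper:
    """Minimal data paper: a cover sheet and links to archived artifacts."""
    cover: CoverSheet
    artifact_links: List[str] = field(default_factory=list)

    def citation(self) -> str:
        # Hypothetical layout: Authors (Year): Title. Identifier
        year = self.cover.date[:4]
        authors = "; ".join(self.cover.authors)
        return f"{authors} ({year}): {self.cover.title}. {self.cover.identifier}"


if __name__ == "__main__":
    paper = DataPaper(
        cover=CoverSheet(
            title="Sample Dataset",
            date="2012-05-21",
            authors=["Doe, J.", "Roe, R."],
            abstract="Placeholder abstract.",
            identifier="doi:10.5072/FK2EXAMPLE",  # placeholder identifier
        ),
        artifact_links=["https://example.org/data.csv"],
    )
    print(paper.citation())
```

The cover sheet alone is enough to build a basic citation; the artifact links carry the reader from the familiar façade to the archived data.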
• 43 public archives
• 120+ archives total
• 58K crawls
• 7,500+ sites
• 600 million+ URLs
• 40+ TB
• 24 institutions
Developed with LoC support by CDL, UNT, and others
What are people using WAS for?
• Archiving at-risk government websites and publications
• Archiving their own university domains
• Building web archives to complement library collections
• Documenting web coverage of significant events
• Excel is the database of choice for many researchers
• Make it easy to share, archive, and publish data
• Keep up to date at dcxl.cdlib.org
Surveyed users and found:
• Most researchers are unaware of preservation options
• Documentation practices are poor
• Excel is just one tool in workflows
Data curation for Excel
Primary Functions
1. An Excel add-in and web application
2. Metadata description (through extraction and augmentation)
3. Check for good data practices
4. Transfer to repository
A data curation approach at CDL
• New “data paper” publishing model [GBMF]
• DataCite consortium and citation standards
• Other fronts:
• DataONE global data network [NSF]
• Merritt: general-purpose data repository
• EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids
• Data management plan generator
• Web archiving service [Library of Congress]
• Open-source Excel add-in [MS Research & GBMF]