Supporting Data-Rich Research on Many Fronts, 21 May 2012, University of California Curation Center, California Digital Library
Transcript
Page 1: Supporting Data-Rich Research on Many Fronts

Supporting Data-Rich Research on Many Fronts

21 May 2012

University of California Curation Center

California Digital Library

Page 2: Supporting Data-Rich Research on Many Fronts

California Digital Library

CDL supports the research lifecycle

• Collections
• Digital Special Collections
• Discovery & Delivery
• Publishing Group
• UC Curation Center (UC3)

Serving the University of California

• 10 campuses
• 360K students, faculty, and staff
• 100s of museums, art galleries, observatories, marine centers, botanical gardens
• 5 medical centers
• 5 law schools
• 3 National Laboratories

Page 3: Supporting Data-Rich Research on Many Fronts

California Digital Library (CDL)

Page 4: Supporting Data-Rich Research on Many Fronts

Our environment circa 2002-2008

Focus on preservation
For memory organizations
Infrastructure: static
Services: hosted
Content: museum & library
Sustainability: ?

Page 5: Supporting Data-Rich Research on Many Fronts

Our environment since 2008

Focus on preservation / curation (lifecycle)
For memory organizations / and now data producers
Infrastructure: static / + cloud, VM, bitbucket
Services: hosted / + partnered, self-serve
Content: museum & library / + research, web crawls
Sustainability: ? / cost recovery, pay once

Page 6: Supporting Data-Rich Research on Many Fronts

Today’s journey

Data service basics at CDL

• Stable storage (Merritt)
• Stable identifiers (EZID)
• Data citation (DataCite)
• Management (DMPTool)
• Preservation cost modeling

... that enable

• Federation (DataONE)
• Data papers
• Capture (WAS web archiving)
• Excel add-in (DCXL)

Page 7: Supporting Data-Rich Research on Many Fronts

The scientific record is at risk

Data dissemination is rare, risky, expensive, labor-intensive, domain-specific, and receives little credit as research output

[Figure labels: Global Change; Galactic Change]

Page 8: Supporting Data-Rich Research on Many Fronts

The changing landscape

• Ever increasing number, size, and diversity of content
• Ever increasing diversity of partners and stakeholders
• Decreasing resources
• Inevitability of disruptive change
  – Technology
  – Institutional mission

[Figure axes: TIME; RESOURCES]

Page 9: Supporting Data-Rich Research on Many Fronts

Stable storage: Merritt repository

• Curation repository open to the UC community and beyond
• Discipline / content agnostic
• Micro-services architecture
• Easy-to-use UI or API
• Hosted or locally deployed

Primary Functions

1. Deposit
2. Manage (metadata, versions, etc.)
3. Access (expose)
4. Share (with other researchers)
5. Preserve

Page 10: Supporting Data-Rich Research on Many Fronts

EZID: Long term identifiers made easy

• Precise identification of a dataset (DOI or ARK)
• Credit to data producers and data publishers
• A link from the traditional literature to the data (DataCite)
• Exposure and research metrics for datasets (Web of Knowledge, Google)

Primary Functions

1. Create persistent identifiers
2. Manage identifiers (and associated metadata) over time
3. Resolve identifiers

Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation
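As a rough illustration of the "resolve identifiers" function, the sketch below maps a DOI or ARK string to a resolver URL. It assumes the public resolvers at doi.org and n2t.net; the helper name `resolver_url` and the example identifiers are hypothetical placeholders and do not come from the slides or from the EZID API.

```python
def resolver_url(identifier: str) -> str:
    """Return a resolver URL for a DOI or ARK.

    Illustrative helper only: it is not part of EZID, and it assumes
    the public resolvers doi.org (DOIs) and n2t.net (ARKs).
    """
    ident = identifier.strip()
    lowered = ident.lower()
    if lowered.startswith("doi:"):
        # Drop the "doi:" scheme label; doi.org expects the bare DOI name.
        return "https://doi.org/" + ident[len("doi:"):]
    if lowered.startswith("ark:"):
        # ARKs keep their "ark:" label when appended to the resolver.
        return "https://n2t.net/" + ident
    raise ValueError(f"unsupported identifier scheme: {identifier!r}")


if __name__ == "__main__":
    # Placeholder identifiers, for demonstration only.
    print(resolver_url("doi:10.5072/FK2EXAMPLE"))
    print(resolver_url("ark:/99999/fk4example"))
```

Keeping the scheme check in one place means a citation can store the short identifier and build the clickable link only when it is displayed.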

Page 11: Supporting Data-Rich Research on Many Fronts

Discovery: DataCite consortium

• Technische Informationsbibliothek (TIB), Germany
• Australian National Data Service (ANDS)
• The British Library
• California Digital Library, USA
• Canada Institute for Scientific and Technical Information (CISTI)
• L’Institut de l’Information Scientifique et Technique (INIST), France
• Library of the ETH Zürich
• Library of TU Delft, The Netherlands
• Office of Scientific and Technical Information, US Department of Energy
• Purdue University, USA
• Technical Information Center of Denmark

Page 12: Supporting Data-Rich Research on Many Fronts

DMPTool

• Connect researchers to resources to create a data management plan
• NSF and directorates, NIH, NEH, IMLS, foundations plus
• Customizable

Meeting funding agencies’ data management plan requirements

Primary Functions

1. Step-by-step “wizard”
2. Templates and examples
3. Links to institutional resources and agency information
4. Plan publication and sharing

Page 13: Supporting Data-Rich Research on Many Fronts

Number of Plans Created, Oct 2011 – Feb 2012

Page 14: Supporting Data-Rich Research on Many Fronts

Cost Model 1: Pay as you go

• Billed/paid annually
  – Costs for archival System (A), Workflows (W), Content Types (C), Monitoring (M), and Interventions (V) are considered common goods, and are apportioned equally across all n Producers (P)
• Model components are represented by two terms: the number of units and the per-unit cost, e.g., k·S
  – Storage cost (S) accounted on a per-Producer basis

$$\begin{cases} P & \text{if year} = 0 \\ 0 & \text{if year} > 0 \end{cases}$$
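The slide's full formula did not survive extraction, so the following is only a minimal sketch of how the stated pieces might combine: common-good costs (A, W, C, M, V) split equally across n Producers, storage charged per Producer as units times unit cost (k·S), and the Producer term P applied in year 0 only. The names `Component` and `annual_charge`, the additive combination, and all numbers are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Component:
    """One model component: number of units times per-unit cost (e.g. k * S)."""
    units: float
    unit_cost: float

    def cost(self) -> float:
        return self.units * self.unit_cost


def annual_charge(common: Sequence[Component], n_producers: int,
                  storage: Component, producer_cost: float, year: int) -> float:
    """Annual bill for one Producer under an ASSUMED additive pay-as-you-go model."""
    if n_producers < 1:
        raise ValueError("n_producers must be at least 1")
    # Common goods (A, W, C, M, V) are apportioned equally across all Producers.
    shared = sum(c.cost() for c in common) / n_producers
    # The Producer term P applies only in year 0, matching the piecewise fragment.
    onboarding = producer_cost if year == 0 else 0.0
    # Storage is accounted on a per-Producer basis.
    return shared + storage.cost() + onboarding


if __name__ == "__main__":
    # Made-up figures, purely to exercise the function.
    common = [Component(1, 1000.0), Component(2, 500.0)]
    storage = Component(10, 3.0)
    for year in (0, 1):
        print(year, annual_charge(common, 4, storage, 200.0, year))
```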

Page 15: Supporting Data-Rich Research on Many Fronts

Model 2: Pay once, preserve for “T” years

• Paid-up price for fixed term T
  – A function of r, the annual investment return, and d, the annual decrease in unit cost of preservation
  – G is the cost of providing a year’s preservation service; G0 includes the added first year expense of Producer engagement and registration
  – Setting T = ∞ calculates the price for “forever”
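The slides name the inputs (r, d, G, G0, T) but the formula itself is not in the transcript. The sketch below therefore assumes a simple discounted series: year 0 costs G0, and each later year t costs G shrunk by (1 - d)^t and discounted by (1 + r)^t. The function name `paid_up_price` and this exact form are assumptions, not the authors' published model.

```python
import math


def paid_up_price(g0: float, g: float, r: float, d: float, term_years: float) -> float:
    """Pay-once price under an ASSUMED model:

        price = G0 + sum_{t=1}^{T-1} G * q**t,  with q = (1 - d) / (1 + r)

    Pass term_years=math.inf for the "forever" price.
    """
    if r <= -1.0:
        raise ValueError("r must be greater than -1")
    q = (1.0 - d) / (1.0 + r)
    if math.isinf(term_years):
        # The infinite geometric series converges only when |q| < 1.
        if not abs(q) < 1.0:
            raise ValueError("price diverges unless |(1 - d) / (1 + r)| < 1")
        return g0 + g * q / (1.0 - q)
    years = int(term_years)
    if years < 1:
        raise ValueError("term_years must be at least 1")
    # Year 0 is covered by G0; years 1 .. T-1 each cost G * q**t.
    return g0 + sum(g * q ** t for t in range(1, years))


if __name__ == "__main__":
    # Made-up inputs: G0 = 150, G = 100, r = 5%, d = 10%.
    for term in (1, 10, 50, math.inf):
        print(term, round(paid_up_price(150.0, 100.0, 0.05, 0.10, term), 2))
```

Under this assumption the "forever" price is finite whenever costs fall or investments earn a positive return, which is what makes a one-time payment workable.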

Page 16: Supporting Data-Rich Research on Many Fronts

New distributed framework

Member Nodes

• diverse institutions
• serve local community
• provide resources for managing their data

Coordinating Nodes

• retain complete metadata catalog
• subset of all data
• perform basic indexing
• provide network-wide services
• ensure data availability (preservation)
• provide replication services

Flexible, scalable, sustainable network
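The division of labor between the two node types can be pictured with a toy in-memory model: member nodes hold the data, while a coordinating node keeps the metadata catalog and arranges replicas. This is only a sketch; the class and method names are hypothetical and it does not use or represent the actual DataONE API.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class MemberNode:
    """Holds data objects for a local community (toy model)."""
    name: str
    objects: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class CoordinatingNode:
    """Keeps a metadata catalog and tracks which member nodes hold each object."""
    titles: Dict[str, str] = field(default_factory=dict)
    holders: Dict[str, Set[str]] = field(default_factory=dict)

    def register(self, node: MemberNode, identifier: str, title: str, data: bytes) -> None:
        # The data stays on the member node; only metadata goes in the catalog.
        node.objects[identifier] = data
        self.titles[identifier] = title
        self.holders.setdefault(identifier, set()).add(node.name)

    def replicate(self, identifier: str, source: MemberNode, target: MemberNode) -> None:
        # Copy the object to a second member node and record the new holder.
        target.objects[identifier] = source.objects[identifier]
        self.holders[identifier].add(target.name)


if __name__ == "__main__":
    cn = CoordinatingNode()
    a, b = MemberNode("node-a"), MemberNode("node-b")
    cn.register(a, "ark:/99999/fk4example", "Example dataset", b"col1,col2\n1,2\n")
    cn.replicate("ark:/99999/fk4example", a, b)
    print(sorted(cn.holders["ark:/99999/fk4example"]))
```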

Page 17: Supporting Data-Rich Research on Many Fronts

Traditional articles vs data papers

Page 18: Supporting Data-Rich Research on Many Fronts

The collective data product

Page 19: Supporting Data-Rich Research on Many Fronts

Need to save data + processing

Algorithms + Data Structures = Programs

Page 20: Supporting Data-Rich Research on Many Fronts

Vision for a “data paper”

• Wrap the unfamiliar in a familiar façade
• A “data paper” is minimally a cover sheet and a set of links to archived artifacts
• Cover sheet contains familiar elements: title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.)
• Just enough to permit basic exposure and discovery
  – Building a basic data citation
  – Indexing by services such as Web of Science, Google Scholar
  – Instilling confidence in the identifier’s stability
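The cover-sheet idea maps naturally onto a small record type. The sketch below holds the elements listed above (title, date, authors, abstract, persistent identifier) plus links to archived artifacts, and builds a basic citation string from them. The class name `DataPaper`, the field names, and the citation layout are assumptions chosen for illustration; the slides do not prescribe a format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataPaper:
    """Minimal 'data paper': a cover sheet plus links to archived artifacts."""
    title: str
    date: str                # ISO 8601 date, e.g. "2012-05-21"
    authors: List[str]
    abstract: str
    identifier: str          # persistent identifier (DOI, ARK, etc.)
    artifact_links: List[str] = field(default_factory=list)

    def citation(self) -> str:
        # Assumed layout: Authors (Year): Title. Identifier
        year = self.date[:4]
        names = "; ".join(self.authors)
        return f"{names} ({year}): {self.title}. {self.identifier}"


if __name__ == "__main__":
    # Placeholder values, for demonstration only.
    paper = DataPaper(
        title="Example dataset",
        date="2012-05-21",
        authors=["Doe, J.", "Roe, R."],
        abstract="Placeholder abstract.",
        identifier="doi:10.5072/FK2EXAMPLE",
        artifact_links=["https://example.org/data.csv"],
    )
    print(paper.citation())
```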

Page 21: Supporting Data-Rich Research on Many Fronts

• 43 public archives
• 120+ archives total
• 58K crawls
• 7,500+ sites
• 600 million+ URLs
• 40+ TB
• 24 institutions

Developed with LoC support by CDL, UNT, and others

Page 22: Supporting Data-Rich Research on Many Fronts

What are people using WAS for?

• Archiving at-risk government websites and publications
• Archiving their own university domains
• Building web archives to complement library collections
• Documenting web coverage of significant events

Page 23: Supporting Data-Rich Research on Many Fronts

Data curation for Excel

• Excel is the database of choice for many researchers
• Make it easy to share, archive, and publish data
• Keep up to date at dcxl.cdlib.org

Surveyed users and found:

• Most researchers are unaware of preservation options
• Documentation practices are poor
• Excel is just one tool in workflows

Primary Functions

1. An Excel add-in and web application
2. Metadata description (through extraction and augmentation)
3. Check for good data practices
4. Transfer to repository

Page 24: Supporting Data-Rich Research on Many Fronts

A data curation approach at CDL

• New “data paper” publishing model [GBMF]
• DataCite consortium and citation standards
• Other fronts:
• DataONE global data network [NSF]
• Merritt: general-purpose data repository
• EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids
• Data management plan generator
• Web archiving service [Library of Congress]
• Open-source Excel add-in [MS Research & GBMF]

Page 25: Supporting Data-Rich Research on Many Fronts

Questions?

[email protected]  

California Digital Library

http://www.cdlib.org/