ADDRESSING THE NEXT CHALLENGES IN DATA SHARING: LARGE-SCALE DATA AND SENSITIVE DATA
Mercè Crosas, Ph.D., Chief Data Science and Technology Officer, Institute for Quantitative Social Science, Harvard University, @mercecrosas

Mercè Crosas, Ph.D., Chief Data Science and ... - datatags.org/files/datatags/files/niso-largedatasets-crosas.pdf · Dataverse on the Massachusetts Open Cloud ...

Sep 03, 2020


Page 2

Data sharing: good for you and good for the world

Researchers: Get credit for their data

Publishers and journals: Verify published work

Federal funding agencies: Make public assets accessible

Science: Validate, reuse and extend previous work

Page 3

Data Sharing (or Publishing)

A formal data citation
• Reference
• Access (persistent identifier)

Information about the data (metadata)
• Discovery
• Use

A trusted data repository
• Access (long-term archival)

Data sharing needs to support data discovery, referencing, access, and reuse.

Page 4

dataverse.org

Open-source software developed at Harvard's IQSS since 2006

Used to share, publish, cite and archive research data

Installed in 12 sites worldwide

Serving 100s of universities and organizations

Page 5

Harvard Dataverse: dataverse.harvard.edu

Started as a community repository for Social Science
Now open to all research fields and all researchers

More than 1300 dataverses
More than 59,000 datasets
More than 1,400,000 downloads

Page 6

Data Sharing with Dataverse

Now
• No sensitive data
• Seldom versioning
• Datasets up to ~GB

The Next 5 Years
• Highly-sensitive data
• Streaming or frequently updated data
• Datasets > GBs, TBs, PBs
– Thousands of files per dataset
– Large dataset in a Big Data, NoSQL storage (MongoDB, Cassandra, Lucene)

Page 7

Large-scale data sharing needs to continue supporting discovery, referencing, access and reuse.

Page 8

Adhering to the same high standards for large-scale data

• Metadata for discovery:
– citation metadata
– domain-specific descriptive metadata
– file-level or variable metadata
• Data citation for reference and access:
– for entire dataset and for subsets of the dataset (based on time of retrieval or variables selected)
• Fast queries, data exploration and visualizations for reuse:
– might not be able to download entire dataset
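The idea of citing a subset "based on time of retrieval or variables selected" can be sketched in a few lines. This is only an illustration, not the way Dataverse builds citations: the function names, the citation layout, the fingerprint scheme, and the placeholder identifier `doi:10.0000/EXAMPLE` are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def subset_fingerprint(variables, retrieved_at):
    """Return a short fingerprint that ignores the order of the variables.

    Hypothetical scheme: hash the sorted variable names plus the retrieval time.
    """
    key = "|".join(sorted(variables)) + "@" + retrieved_at.isoformat()
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def cite_subset(authors, year, title, persistent_id, variables, retrieved_at):
    """Build a citation string for a subset of a dataset (illustrative layout)."""
    fingerprint = subset_fingerprint(variables, retrieved_at)
    selected = ", ".join(sorted(variables))
    return (
        f'{authors}, {year}, "{title}", {persistent_id}, '
        f"subset [{selected}] retrieved {retrieved_at.isoformat()}, "
        f"fingerprint {fingerprint}"
    )


# Example with made-up values.
when = datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc)
print(cite_subset("Doe, J.", 2015, "Example Survey", "doi:10.0000/EXAMPLE",
                  ["income", "age"], when))
```

The dataset-level persistent identifier stays the same for every subset; only the trailing part changes, so a reader can still resolve the full dataset while the fingerprint pins down which variables were retrieved and when.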

Page 9

Data retrieval, exploration and visualization of large-scale datasets require data repositories to be closer to computing resources.

Page 10

Current collaborations to address the next challenges in data sharing

• SB Grid Data Repository (HMS, IQSS)
• Social Science Big Data (IQSS)
• Data Provenance (SEAS, IQSS)
• Privacy Tools to share sensitive data (SEAS, Berkman, Privacy Lab, IQSS, MIT)

Page 11

Sharing and Preserving Large Structural Biology Data

Funded by

https://data.sbgrid.org/

Page 12

Structural Biology Primary Data

1 Dataset is 180-360 images of X-ray diffraction data, 3.5-7 GB; ~1 TB per dataset, with a total up to 100 PBs

Integration with Dataverse:
• Long-term access
• Formal Data Citation
• Standard Metadata
• Data Exploration (OME)
• Preservation, with copies in multiple sites (following dataPASS approach)
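The per-image size implied by the image counts and sizes above can be checked with a short calculation. This is a sketch that assumes decimal units (1 GB = 10^9 bytes) and that the low and high ends of the two ranges pair up (180 images with 3.5 GB, 360 images with 7 GB); the slide states neither assumption.

```python
GB = 10**9  # decimal gigabyte, an assumption
MB = 10**6  # decimal megabyte, an assumption


def per_image_mb(dataset_gb, n_images):
    """Average size of one diffraction image, in MB."""
    return dataset_gb * GB / n_images / MB


low = per_image_mb(3.5, 180)   # small end of both ranges
high = per_image_mb(7.0, 360)  # large end of both ranges
print(f"{low:.1f} MB per image (low end), {high:.1f} MB per image (high end)")
```

Under these assumptions both ends give the same average, about 19.4 MB per image, which suggests the two ranges on the slide scale together.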

Page 13

Dataverse on the Massachusetts Open Cloud (MOC): Computing closer to data storage

Current Architecture
• Network File System (data files)
• UI Layer (PrimeFaces, js)
• Application Logic (Java EE)
• API
• PostgreSQL (user data, metadata)
• Solr (Index)
• RServe (R ingest, analysis)

On the MOC
• COMPUTE SERVICES (R, Python, Spark, Hadoop, …)
• CINDER block storage
• SWIFT object storage
• Dataverse
– UI Layer (PrimeFaces, js)
– Application Logic (Java EE)
– API
– PostgreSQL (user data, metadata)
– Solr (Index)
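One way to read the two architecture columns is as two sets of components and the difference between them. The sketch below does that comparison; the assignment of each component to "current" or "on the MOC" is inferred from the order of labels in the flattened diagram and is an assumption, not something the slide states explicitly.

```python
# Component lists as read from the slide; the grouping is an assumption.
current = {
    "Network File System (data files)",
    "UI Layer (PrimeFaces, js)",
    "Application Logic (Java EE)",
    "API",
    "PostgreSQL (user data, metadata)",
    "Solr (index)",
    "RServe (R ingest, analysis)",
}

on_moc = {
    "Compute services (R, Python, Spark, Hadoop, ...)",
    "Cinder block storage",
    "Swift object storage",
    "UI Layer (PrimeFaces, js)",
    "Application Logic (Java EE)",
    "API",
    "PostgreSQL (user data, metadata)",
    "Solr (index)",
}

unchanged = sorted(current & on_moc)  # the Dataverse application itself
replaced = sorted(current - on_moc)   # pieces that appear only in the current column
added = sorted(on_moc - current)      # cloud services that appear only on the MOC

print("unchanged:", unchanged)
print("only in current architecture:", replaced)
print("only on the MOC:", added)
```

Read this way, the Dataverse application layers stay the same, while file storage and R-based analysis give way to cloud storage and general compute services that sit next to the data.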

Page 14

Sharing Sensitive Data with Confidence: DataTags System

DataTag: A set of security features and access requirements for file handling

Sweeney, Crosas, Bar-Sinai, 2015, "Sharing Sensitive Data with Confidence: The DataTags System", Technology Science

Page 15

Data Sharing Workflow for Sensitive Data

Sensitive Dataset

Sensitive Dataset

Direct Access

Privacy Preserving Access

http://datatags.org
http://privacytools.seas.harvard.edu

Authorized
Signed DUA
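The workflow labels above suggest two routes to a sensitive dataset. The sketch below encodes one possible reading: direct access for requesters who are authorized and have signed a data use agreement (DUA), and privacy-preserving access otherwise. Attaching the "Authorized, Signed DUA" label to the direct route is an assumption, and the class and function names are hypothetical; this is not the DataTags implementation.

```python
from dataclasses import dataclass
from enum import Enum


class AccessRoute(Enum):
    DIRECT = "direct access"
    PRIVACY_PRESERVING = "privacy-preserving access"


@dataclass(frozen=True)
class Requester:
    """Hypothetical description of someone asking for a sensitive dataset."""
    authorized: bool
    signed_dua: bool


def route_request(requester):
    """Pick an access route under the assumed reading of the slide."""
    if requester.authorized and requester.signed_dua:
        return AccessRoute.DIRECT
    # Anyone else only gets results through privacy-preserving tools.
    return AccessRoute.PRIVACY_PRESERVING


print(route_request(Requester(authorized=True, signed_dua=True)).value)
print(route_request(Requester(authorized=True, signed_dua=False)).value)
```

A real system would likely have more than two outcomes, for example denying access outright, but the slide only names these two routes.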

Page 16

THANKS

Piotrek Sliz (SBGrid, HMS), Latanya Sweeney (Data Privacy Lab, Harvard), Dataverse team (IQSS, Harvard)

@mercecrosas