Persistent Archives: Long-term sustainability of data based on policy and data virtualization
Arcot (Raja) Rajasekar
University of North Carolina at Chapel Hill
[email protected]
http://irods.diceresearch.org
NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype” (2008-2012)
NSF SDCI 0721400 “Data Grids for Community Driven Applications” (2007-2010)
Topics
• Data Grids for Preservation & Sharing
  – Brief Intro
  – Why are they suitable for deploying scalable persistent archives?
  – iRODS as an exemplar Data Grid
• Two Examples:
  – DIGARCH: Preservation of Multi-media Collection
  – TPAP: NARA Testbed of Persistent Archives
Data Preservation Challenges
• Data driven research generates massive data collections
  – Data sources are remote and distributed
  – Collaborators are remote
  – Wide variety of data types: observational data, experimental data, simulation data, real-time data, office products, web pages, multi-media
• Collections contain millions of files
  – Logical arrangement is needed for distributed data
  – Discovery requires the addition of descriptive metadata
• Long-term retention requires migration of output into a reference collection
  – Automation of administrative functions is essential to minimize long-term labor support costs
  – Creation of representation information for describing file context
  – Validation of assessment criteria (authenticity, integrity)
What is a Data Grid?
• Geographically distributed heterogeneous resources that are managed autonomously
• Active with data resources being added and removed
• Users like to share/discover data using contextual information
What is a Data Grid?
• Data Grid – a network of data resources that is presented as a single, accessible collection of data.
• Data Grid – provisions for associating metadata & annotations
• Data Grid – enables discovery, access & server-side processing
• Metadata-based data virtualization
• Policy Virtualization
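Metadata-based discovery can be illustrated with a minimal sketch. It assumes a catalog held in memory as a plain list; the `DataObject` class, the `discover` function, and the logical names and attributes are all hypothetical and are not part of any data grid API.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """A catalog entry: a logical name plus attribute-value metadata (hypothetical model)."""
    logical_name: str
    metadata: dict = field(default_factory=dict)


def discover(catalog, **criteria):
    """Return logical names of objects whose metadata matches every given attribute."""
    return [
        obj.logical_name
        for obj in catalog
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Illustrative catalog; names and attribute values are made up for the example.
catalog = [
    DataObject("/zone/video/session-001.mov", {"type": "video", "year": 1982}),
    DataObject("/zone/video/session-001.txt", {"type": "transcript", "year": 1982}),
    DataObject("/zone/video/session-002.mov", {"type": "video", "year": 1990}),
]

# Discovery by contextual information rather than by storage location.
videos_1982 = discover(catalog, type="video", year=1982)
```

The query names no server or path, only attributes; that is the sense in which the collection is presented as a single one.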
Why Data Grids?
• Data Virtualization: Shared Collections Concept
  – Common Abstract Name Spaces: physical-independence
    • Data objects and collections: logical names
    • Users/collaborators: global user name space
    • Shared resources & uniform access: location & protocol transparency
    • Common typing conventions for objects & actions
  – Provide technology independence
    • Platform & vendor independence
    • High scalability
  – Need discovery metadata
    • Descriptive attributes for each name space
    • System & domain-specific information
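A minimal sketch of the logical name space idea, assuming a simple in-memory mapping from logical names to physical replicas. `Replica`, `LogicalNamespace`, and the resource names, protocols, and paths are hypothetical; this is not the iRODS implementation. It shows one way physical independence could work: storage is migrated underneath while the logical name users hold stays fixed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    resource: str  # hypothetical storage resource name
    protocol: str  # hypothetical access protocol
    path: str      # physical path on that resource


class LogicalNamespace:
    """Maps logical names to the physical replicas that currently hold the data."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, replica):
        self._replicas.setdefault(logical_name, []).append(replica)

    def migrate(self, logical_name, old_resource, new_replica):
        """Swap the replica on old_resource for new_replica; the logical name is unchanged."""
        kept = [r for r in self._replicas[logical_name] if r.resource != old_resource]
        self._replicas[logical_name] = kept + [new_replica]

    def resolve(self, logical_name):
        """Return a copy of the replica list for a logical name."""
        return list(self._replicas[logical_name])


ns = LogicalNamespace()
name = "/archive/collection/report.pdf"
ns.register(name, Replica("disk-a", "posix", "/vault1/0001"))
ns.register(name, Replica("tape-b", "hpss", "/tape/0001"))

# Technology migration: retire disk-a, add an object store; users keep the same name.
ns.migrate(name, "disk-a", Replica("objstore-c", "s3", "bucket/0001"))
resources = [r.resource for r in ns.resolve(name)]
```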
Why Data Grids?
• Policy Virtualization: Automate Operations

  – Carolina Digital Repository at University of North Carolina
  – Duke Medical Archive
• Regional data grids
  – RENCI data grid linking 7 engagement centers in North Carolina
  – HASTAC data grid linking humanities collections across 9 UC campuses
• National data grids
  – NARA Transcontinental Persistent Archive Prototype
  – NSF Temporal Dynamics of Learning Center data grid
  – NSF Ocean Observatories Initiative data grid
  – NASA Center for Computational Sciences archive
  – JPL Planetary Data System data grid
• International data grids
  – Australian Research Collaboration Service - ARCS
  – French National Library
User Interfaces
• C library calls - Application level
• Unix shell commands - Scripting languages
• Java I/O class library (JARGON) - Web services
• SAGA - Grid API
• Web browser (Java-python) - Web interface
• Windows browser - Windows interface
• WebDAV - iPhone interface
• Fedora digital library middleware - Digital library middleware
• DSpace digital library - Digital library services
• Parrot - Unification interface
• Kepler workflow - Scientific workflow
• FUSE user-level file system - Unix file system
Case 1: NARA TPAP
• National Archives Electronic Records Administration Research Program (funded through NSF)
• Transcontinental Persistent Archive Prototype
  – Use federation of data grid technology to build a preservation environment
  – Conduct research on preservation concepts
    • Infrastructure independence
    • Enforcement of preservation properties
    • Validation of assessment criteria
    • Automation of administrative processes
    • Show technology migration
  – Demonstrate preservation on selected NARA digital holdings
National Archives and Records Administration Transcontinental Persistent Archive Prototype
[Figure: Federation of Seven Independent Data Grids, each with its own MCAT catalog: U Md, UCSD, Georgia Tech, NARA I, NARA II, Rocket Center, U NC]
Extensible Environment, can federate with additional research and education sites. Each data grid uses different vendor products.
ISO MOIMS-repository assessment criteria
• We are developing 150 rules that implement the assessment criteria
• Examples:
  90 Verify descriptive metadata and source against SIP template and set SIP compliance flag
  91 Verify descriptive metadata against semantic term list
  92 Verify status of metadata catalog backup (create a snapshot of metadata catalog)
  93 Verify consistency of preservation metadata after hardware change or error
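To make rule 90 concrete, here is a minimal sketch in Python. It is not iRODS rule syntax and not the project's actual rule. The template fields, the `sip_compliant` flag name, and the function name are assumptions chosen only for illustration.

```python
# Hypothetical SIP template: descriptive attributes every submission must carry.
SIP_TEMPLATE = ("title", "creator", "date", "source")


def verify_sip_compliance(record, sip_template=SIP_TEMPLATE):
    """Check a record against the template and set its compliance flag.

    Returns the list of missing (absent or empty) attributes so the
    failure can be reported or written to an audit trail.
    """
    missing = [name for name in sip_template if not record.get(name)]
    record["sip_compliant"] = not missing
    return missing


complete = {"title": "Interview", "creator": "UCTV", "date": "1982", "source": "tape 12"}
partial = {"title": "Interview", "creator": "UCTV", "date": "1982", "source": ""}

missing_complete = verify_sip_compliance(complete)
missing_partial = verify_sip_compliance(partial)
```

Because the check is a plain function over metadata, it could in principle run automatically on ingest or periodically across a collection, which is the kind of automation the assessment rules aim for.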
Case Study 2: DIGARCH
• Preservation of Video Files
  – By Integrating a Video Production Pipeline
  – With a Preservation Workflow
Digital Preservation Lifecycle Management
Building a demonstration prototype for the preservation of large-scale multi-media collections
• San Diego Supercomputer Center, Univ. of California, San Diego: Arcot Rajasekar (PI), Richard Marciano, Reagan Moore, Chien-Yi Hou, Francine Berman (co-PI)
• UCSD-TV, Univ. of California, San Diego: Lynn Burstan (co-PI), Steve Anderson, Mellisa McEwen, Bee Bornheimer
• UCTV-Berkeley: Harry Kreisler
• UCSD Libraries, Univ. of California, San Diego: Brian Schottlaender (co-PI), Luc DeClerck, Brad Westbrook, Arwen Hutt, Ardys Kozbial, Chris Frymann, Vivian Chu
Our Proposal
• Design and Development of a Prototype for Preserving Digital Video Collections
  – Management of Authenticity, Integrity, and Infrastructure Independence
  – Preservation Life-cycle meshing seamlessly with the content production
    • Minimal impact to production life-cycle
  – Workflow system that automates accession, description, organization and preservation of video and associated contents
    • Metadata definition, extraction and ingestion
    • Long-term retention and technology migration
  – At risk Collection: ‘Conversation with History’ video collection
    • Video, audio, text transcripts, web-based material
    • Databases of administrative and descriptive metadata
    • Derived products
Exemplar Collection
• Conversation with History - UCTV - from 1982
  – Hour-long interviews with internationally prominent individuals
  – Institute of International Affairs, UC Berkeley
  – Available in 15 million homes nationwide via UCTV
  – 40 program segments annually
  – Web-site for downloading older segments
  – Among UCTV's most accessed on-line programs
  – Programs used in educational material
Preservation Processes
• Generation of a Globally Unique Identifier (GUID) for each interview session
• Retrieval of the original video session
• Retrieval of each of the segments associated with a video session
• Retrieval of the transmission scripts for each video segment
• Retrieval of the material published on the Web page for each segment
• Processing of each Web page to redirect internal URLs into handles within the preservation logical name space for digital entities
• Retrieval of the rights statement for each session
• Retrieval of the header associated with each video segment
• Retrieval of the trailer associated with each video segment
• Retrieval of the administrative, structural, and descriptive metadata stored in the Filemaker Pro database
• Retrieval of the annotations stored with the Web pages
• Specification of Preservation Metadata for AIP
• Creation of AIPs for the above material
• Creation of containers for physically aggregating material for storage in the preservation environment
• Storage of containers within the preservation environment
• Specification of preservation management metadata such as access controls, storage location, and replication
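A few of the steps above (GUID generation, AIP creation, and preservation management metadata) can be sketched as follows. This is a minimal illustration, not the DIGARCH workflow itself. It assumes each component is available as bytes, and it adds SHA-256 checksums as one common way to support the integrity checks mentioned earlier. The manifest layout and the `build_aip` and `verify_aip` names are hypothetical.

```python
import hashlib
import uuid


def build_aip(session_title, files, replicas=2):
    """Assemble a minimal AIP manifest for one interview session.

    files maps a component name (e.g. "video", "transcript") to its bytes.
    """
    components = {
        name: {"size": len(data), "sha256": hashlib.sha256(data).hexdigest()}
        for name, data in files.items()
    }
    return {
        "guid": str(uuid.uuid4()),             # globally unique identifier for the session
        "title": session_title,
        "components": components,              # fixity information per component
        "management": {"replicas": replicas},  # preservation management metadata
    }


def verify_aip(aip, files):
    """Integrity check: recompute each checksum and compare with the manifest."""
    return all(
        name in files and hashlib.sha256(files[name]).hexdigest() == entry["sha256"]
        for name, entry in aip["components"].items()
    )


session_files = {"video": b"abc", "transcript": b"hello"}
aip = build_aip("Example session", session_files)
intact = verify_aip(aip, session_files)
tampered = verify_aip(aip, {"video": b"abd", "transcript": b"hello"})
```

Recomputing the checksums after a storage migration would give a simple test that the migrated copy still matches what was originally accessioned.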
Utility of Data Grids
• Logical Universal Identifier
• Uniform Access to Distributed Data
• Data Discovery through Metadata
• Open Policy for Provenance & Data Management
• Rule-based workflow for Metadata Extraction & Analysis
• Audit trails to capture provenance
• Provides a Platform for Data Publication
• Provides a means to Uniquely identify datasets
• Can be used to enforce metadata requirement - policy
• Cross-referencing provenance through workflow-derived data
• Provides a means to perform data attribution