Slide 1 - Tetherless World Constellation - Tetherless World Wiki

Persistent Archives: Long-term sustainability of data based on policy and data virtualization

Arcot (Raja) Rajasekar
University of North Carolina at Chapel Hill

[email protected]
http://irods.diceresearch.org

NSF OCI-0848296 “NARA Transcontinental Persistent Archives Prototype” (2008-2012)
NSF SDCI 0721400 “Data Grids for Community Driven Applications” (2007-2010)

Topics
• Data Grids for Preservation & Sharing
  – Brief Intro
  – Why are they suitable for deploying scalable persistent archives?
  – iRODS as an exemplar Data Grid
• Two Examples:
  – DIGARCH: Preservation of Multi-media Collection
  – TPAP: NARA Testbed of Persistent Archives

Data Preservation Challenges
• Data driven research generates massive data collections
  – Data sources are remote and distributed
  – Collaborators are remote
  – Wide variety of data types: observational data, experimental data, simulation data, real-time data, office products, web pages, multi-media
• Collections contain millions of files
  – Logical arrangement is needed for distributed data
  – Discovery requires the addition of descriptive metadata
• Long-term retention requires migration of output into a reference collection
  – Automation of administrative functions is essential to minimize long-term labor support costs
  – Creation of representation information for describing file context
  – Validation of assessment criteria (authenticity, integrity)

What is a Data Grid?

• Geographically distributed heterogeneous resources that are managed autonomously

• Active with data resources being added and removed

• Users like to share/discover data using contextual information


What is a Data Grid?
• Data Grid – a network of data resources that is presented as a single, accessible collection of data.
• Data Grid – provisions for associating metadata & annotations
• Data Grid – enables discovery, access & server-side processing
• Metadata-based data virtualization
• Policy Virtualization


Why Data Grids?
• Data Virtualization: Shared Collections Concept
  – Common Abstract Name Spaces: physical-independence
    • Data objects and collections : logical names
    • Users/collaborators : global user name space
    • Shared resources & uniform access : location & protocol transparency
    • Common typing conventions for objects & actions
  – Provide technology independence
    • Platform & Vendor-independence
    • High scalability
  – Need discovery metadata
    • Descriptive attributes for each name space
    • System & Domain-specific information
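The shared-collection idea above can be illustrated with a minimal sketch. It assumes a toy in-memory catalog; the class names, resource names, and paths are all hypothetical and are not part of any data grid's API. Each logical name maps to one or more physical replicas, and objects are discovered by metadata attributes rather than by location.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # hypothetical storage resource name, e.g. "unc-disk"
    physical_path: str  # where the bytes live on that resource


@dataclass
class DataObject:
    logical_name: str
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class LogicalCatalog:
    """Toy catalog: maps logical names to replicas and descriptive metadata."""

    def __init__(self):
        self._objects = {}

    def register(self, logical_name, resource, physical_path, **metadata):
        obj = self._objects.setdefault(logical_name, DataObject(logical_name))
        obj.replicas.append(Replica(resource, physical_path))
        obj.metadata.update(metadata)
        return obj

    def resolve(self, logical_name, preferred_resource=None):
        """Return one replica; the caller never needs to know the location."""
        replicas = self._objects[logical_name].replicas
        for replica in replicas:
            if replica.resource == preferred_resource:
                return replica
        return replicas[0]

    def find(self, **criteria):
        """Discover objects by metadata attributes instead of by path."""
        return sorted(
            name for name, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


catalog = LogicalCatalog()
catalog.register("/grid/home/project/run42.dat", "unc-disk", "/vault1/a/0017",
                 data_type="simulation")
catalog.register("/grid/home/project/run42.dat", "ucsd-tape", "/hpss/x/9931")
catalog.register("/grid/home/project/obs01.dat", "unc-disk", "/vault1/a/0018",
                 data_type="observational")
```

In this sketch a client asks for `/grid/home/project/run42.dat` and receives whichever replica is available, so storage can be swapped out without changing the logical name.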

Why Data Grids?
• Policy Virtualization: Automate Operations
  – System-centric Policies & Obligations:
    • Manage retention, disposition, distribution, replication, integrity, authenticity, chain of custody, access controls, representation information, descriptive information requirement, logical arrangement, audit trails, authorization, authentication
  – Domain-specific Policies:
    • Identification & Extraction of Metadata
    • Ingestion Control for Provenance Attribution
    • Processing of Data on Ingestion
      – Creation of multi-resolution images, type-identification, anonymization, …
    • Processing of Data on Access
      – IRB Approval for data access, Data sub-setting, Merging of multiple images, conversion, redaction, …

Preservation is an Integral Part of the Data Life Cycle

• Organize project data into a shared collection
• Publish data in a digital library for use by other researchers
• Enable data-discovery & data-driven analyses
• Preserve reference collection for use by future research initiatives
• Associate new collection against prior state-of-the-art data
• Define & Enforce Policies for long-term management and curation

Exemplar Data Grid: iRODS
• Integrated Rule-Oriented Data System

• It is a data grid system – data virtualization

– A distributed file system, based on a client-server architecture.

– Allows users to access files seamlessly across a distributed environment, based upon their attributes & GUID rather than locations

– It replicates, syncs and archives data, connecting heterogeneous resources in a logical and abstracted manner.

• It is a server-side workflow system – policy virtualization

– Actions are coded as functions/scripts (micro-services)

– Micro-services can be chained into Policies (rules)

– Rules are interpreted by a distributed rule engine

– The chains can be triggered on an event and condition (rules)

– Micro-services communicate through parameters, shared contexts, and out-of-band message queues.
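The event, condition, and micro-service chain pattern described above can be sketched in a few lines. This is an illustrative Python model only: it is not the iRODS rule language or API, and the engine class, event name, and micro-service functions are all hypothetical. Micro-services here communicate through a shared context dictionary.

```python
class RuleEngine:
    """Toy event-condition-action engine; not the iRODS rule language."""

    def __init__(self):
        self._rules = {}  # event name -> list of (condition, chain)

    def add_rule(self, event, condition, chain):
        self._rules.setdefault(event, []).append((condition, chain))

    def fire(self, event, context):
        """Run every chain whose condition holds for this event."""
        for condition, chain in self._rules.get(event, []):
            if condition(context):
                for micro_service in chain:
                    micro_service(context)
        return context


# Hypothetical micro-services; each reads and writes the shared context.
def ms_identify_type(ctx):
    ctx["type"] = ctx["path"].rsplit(".", 1)[-1].lower()


def ms_extract_metadata(ctx):
    ctx.setdefault("metadata", {})["format"] = ctx["type"]


def ms_log(ctx):
    ctx.setdefault("audit", []).append(f"ingested {ctx['path']}")


engine = RuleEngine()
engine.add_rule(
    "on_ingest",
    lambda ctx: ctx["path"].startswith("/grid/video/"),  # condition
    [ms_identify_type, ms_extract_metadata, ms_log],      # chained actions
)

result = engine.fire("on_ingest", {"path": "/grid/video/interview01.MOV"})
skipped = engine.fire("on_ingest", {"path": "/grid/other/notes.txt"})
```

The second call matches the event but fails the condition, so no micro-service runs and its context is returned unchanged.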

Open Policy and Uniform Access

Policy/Rule Examples
• Automatically extract metadata for files of certain types and store it in a domain-centric metadata catalog

• Notify the owner if a file's metadata is missing N days after ingestion

• Automatically “audit” derived datasets – provenance gathering

• Periodically check for integrity of files in a collection and repair them if needed/possible

• Allow only users who log in with a certificate to access files in a collection – multi-lock control

• Automatically migrate a file to “slow” storage location after N days of non-use – storage management

• Automatically replicate a file that falls into a collection into 3 geo-distributed sites – replication strategies

• When too many users from site A are using a file from site B, keep a copy in site A – data placement strategies

• Send a notification when a file with a certain type of data is ingested.
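As one illustration of the policies listed above, the periodic integrity check with repair might look like the following sketch. It assumes replicas are held in memory as byte strings keyed by hypothetical site names and that a SHA-256 checksum was recorded at ingestion. A real deployment would read from storage resources and a metadata catalog instead.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_and_repair(replicas: dict, expected: str) -> list:
    """Check each replica against the recorded checksum and overwrite bad
    copies from a good one. Returns the names of the repaired replicas.

    `replicas` maps a (hypothetical) site name to the bytes stored there.
    """
    good = [site for site, data in replicas.items() if checksum(data) == expected]
    if not good:
        raise ValueError("no intact replica left; repair is not possible")
    bad = [site for site in replicas if site not in good]
    for site in bad:
        replicas[site] = replicas[good[0]]  # copy from the first intact replica
    return bad


original = b"interview segment 1"
recorded = checksum(original)  # checksum stored at ingestion time
replicas = {
    "site-a": original,
    "site-b": b"interview segmXnt 1",  # simulated corruption
    "site-c": original,
}
repaired = verify_and_repair(replicas, recorded)
```

The "if needed/possible" qualifier in the policy corresponds to the two branches here: nothing is rewritten when all replicas match, and an error is raised when no intact copy remains.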

Overview of iRODS Architecture
[Figure: Overview of iRODS Data System. A user can search, access, add and manage data & metadata, accessing data with a Web-based browser, the iRODS GUI, or command-line clients. The iRODS Data System consists of the iRODS Data Server (disk, tape, etc.), the iRODS Metadata Catalog (tracks data), and the iRODS Rule Engine (tracks policies).]

integrated Rule-Oriented Data System
[Figure labels: Client Interface, Admin Interface, Current State, Rule Invoker, Service Manager, Engine, Rule, Confs, Rule Base, Micro-Service Modules (Metadata-based Services), Micro-Service Modules (Resource-based Services), Resources, Metadata Persistent Repository, Rule Modifier Module, Config Modifier Module, Metadata Modifier Module, Consistency Check Modules]

iRODS Components
[Figure labels: Rule Engine, Execution Control, Messaging System, Execution Engine, Virtualization, Server Side Workflow, Persistent State information, Scheduling, Policy Management, Data Transport, Metadata Catalog]

iRODS Applications
• Institutional repositories
  – Carolina Digital Repository at University of North Carolina
  – Duke Medical Archive
• Regional data grids
  – RENCI data grid linking 7 engagement centers in North Carolina
  – HASTAC data grid linking humanities collections across 9 UC campuses
• National data grids
  – NARA Transcontinental Persistent Archive Prototype
  – NSF Temporal Dynamics of Learning Center data grid
  – NSF Ocean Observatories Initiative data grid
  – NASA Center for Computational Sciences archive
  – JPL Planetary Data System data grid
• International data grids
  – Australian Research Collaboration Service - ARCS
  – French National Library

User Interfaces
• C library calls - Application level
• Unix shell commands - Scripting languages
• Java I/O class library (JARGON) - Web services
• SAGA - Grid API
• Web browser (Java-python) - Web interface
• Windows browser - Windows interface
• WebDAV - iPhone interface
• Fedora digital library middleware - Digital library middleware
• DSpace digital library - Digital library services
• Parrot - Unification interface
• Kepler workflow - Scientific workflow
• FUSE user-level file system - Unix file system

Case 1: NARA TPAP
• National Archives Electronic Records Administration Research Program (funded thru NSF)
• Transcontinental Persistent Archive Prototype
  – Use federation of data grid technology to build a preservation environment
  – Conduct research on preservation concepts
    • Infrastructure independence
    • Enforcement of preservation properties
    • Validation of assessment criteria
    • Automation of administrative processes
    • Show technology migration
  – Demonstrate preservation on selected NARA digital holdings

National Archives and Records Administration Transcontinental Persistent Archive Prototype
[Figure: Federation of Seven Independent Data Grids, each with its own MCAT: U Md, UCSD, Georgia Tech, NARA I, NARA II, Rocket Center, U NC]
Extensible Environment, can federate with additional research and education sites. Each data grid uses different vendor products.

ISO MOIMS-repository assessment criteria
• We are developing 150 rules that implement the assessment criteria
• Examples:
  90 Verify descriptive metadata and source against SIP template and set SIP compliance flag
  91 Verify descriptive metadata against semantic term list
  92 Verify status of metadata catalog backup (create a snapshot of metadata catalog)
  93 Verify consistency of preservation metadata after hardware change or error
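Rules 90 and 91 above can be illustrated with a small sketch. The template fields, the term list, and the report format are assumptions made for the example; the slides do not specify the actual SIP template or vocabulary. The check compares descriptive metadata against a template of required fields and against a controlled term list, then sets a compliance flag.

```python
def check_sip(metadata: dict, template: dict, term_lists: dict) -> dict:
    """Return a report with problem fields and a SIP compliance flag.

    template:   field name -> required Python type (hypothetical SIP template)
    term_lists: field name -> set of allowed terms (hypothetical vocabulary)
    """
    # A field fails if it is absent or has the wrong type.
    missing_or_invalid = sorted(
        name for name, kind in template.items()
        if not isinstance(metadata.get(name), kind)
    )
    # A field fails if it is present but uses a term outside the list.
    bad_terms = sorted(
        name for name, allowed in term_lists.items()
        if name in metadata and metadata[name] not in allowed
    )
    return {
        "missing_or_invalid": missing_or_invalid,
        "bad_terms": bad_terms,
        "sip_compliant": not missing_or_invalid and not bad_terms,
    }


TEMPLATE = {"title": str, "source": str, "year": int}
TERMS = {"genre": {"interview", "lecture"}}

ok = check_sip(
    {"title": "Session 12", "source": "UCTV", "year": 1998, "genre": "interview"},
    TEMPLATE, TERMS,
)
bad = check_sip(
    {"title": "Session 13", "year": "1999", "genre": "panel"},
    TEMPLATE, TERMS,
)
```

In a rule-based system, a check like this would typically run as a micro-service on ingestion, with the flag written back to the metadata catalog.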

Case Study 2: DIGARCH
• Preservation of Video Files
  – By Integrating a Video Production Pipeline
  – With a Preservation Workflow

Digital Preservation Lifecycle Management
Building a demonstration prototype for the preservation of large-scale multi-media collections

San Diego Supercomputer Center, Univ. of California, San Diego: Arcot Rajasekar (PI), Richard Marciano, Reagan Moore, Chien-Yi Hou, Francine Berman (co-PI)

UCSD-TV, Univ. of California, San Diego: Lynn Burstan (co-PI), Steve Anderson, Mellisa McEwen, Bee Bornheimer

UCTV-Berkeley: Harry Kreisler

UCSD Libraries, Univ. of California, San Diego: Brian Schottlaender (co-PI), Luc DeClerck, Brad Westbrook, Arwen Hutt, Ardys Kozbial, Chris Frymann, Vivian Chu

Our Proposal
• Design and Development of a Prototype for Preserving Digital Video Collections
  – Management of Authenticity, Integrity, and Infrastructure Independence
  – Preservation Life-cycle meshing seamlessly with the content production
    • Minimal impact to production life-cycle
  – Workflow system that automates accession, description, organization and preservation of video and associated contents
    • Metadata definition, extraction and ingestion
    • Long-term retention and technology migration
  – At risk Collection: ‘Conversation with History’ video collection
    • Video, audio, text transcripts, web-based material
    • Databases of administrative and descriptive metadata
    • Derived products

Exemplar Collection
• Conversation with History - UCTV - from 1982
  – Hour-long interviews with internationally prominent individuals
  – Institute of International Affairs, UC Berkeley
  – Available in 15 million homes nationwide via UCTV
  – 40 program segments annually
  – Web-site for downloading older segments
  – Among UCTV's most accessed on-line programs
  – Programs used in educational material

[Figure: workflow diagram relating three tracks: TV production Lifecycle, Metadata Definition & Capture Workflow, and Persistent Archival Workflow. Step labels: Pre-Interview, Interview, Transcription, Post-Interview, Broadcast/Transfer; Metadata Analysis, Schema Generation, SIP/AIP Definitions, Capture Scripts, Metadata DB Capture, Interview Metadata Capture, Metadata Validation; Make SIPS, Aggregate AIP/Verify, Store/Replicate/Preserve]

Preservation Processes
• Generation of a Globally Unique Identifier (GUID) for each interview session
• Retrieval of the original video session
• Retrieval of each of the segments associated with a video session
• Retrieval of the transmission scripts for each video segment
• Retrieval of the material published on the Web page for each segment
• Processing of each Web page to redirect internal URLs into handles within the preservation logical name space for digital entities
• Retrieval of the rights statement for each session
• Retrieval of the header associated with each video segment
• Retrieval of the trailer associated with each video segment
• Retrieval of the administrative, structural, and descriptive metadata stored in the Filemaker Pro database
• Retrieval of the annotations stored with the Web pages
• Specification of Preservation Metadata for AIP
• Creation of AIPs for the above material
• Creation of containers for physically aggregating material for storage in the preservation environment
• Storage of containers within the preservation environment
• Specification of preservation management metadata such as access controls, storage location, and replication
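Two of the steps above, GUID generation and redirecting internal URLs into handles, can be sketched as follows. The namespace URL, the `hdl:example/` prefix, and the host name are placeholders invented for the example; the slides do not say how the project actually minted identifiers or handles.

```python
import re
import uuid

# Hypothetical namespace and handle prefix; neither comes from the project.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/archive")
HANDLE_PREFIX = "hdl:example/"


def session_guid(session_key: str) -> str:
    """Deterministic GUID: the same session key always yields the same id."""
    return str(uuid.uuid5(NAMESPACE, session_key))


def redirect_internal_urls(html: str, internal_host: str) -> str:
    """Replace links to internal_host with handles derived from the URL path."""
    pattern = re.compile(
        r'href="https?://' + re.escape(internal_host) + r'(/[^"]*)"'
    )

    def to_handle(match):
        return f'href="{HANDLE_PREFIX}{session_guid(match.group(1))}"'

    return pattern.sub(to_handle, html)


page = (
    '<a href="http://tv.example.org/seg/1.html">1</a> '
    '<a href="http://other.org/x">x</a>'
)
rewritten = redirect_internal_urls(page, "tv.example.org")
```

Name-based UUIDs are used here so that re-running the workflow on the same session yields the same identifier. A random `uuid.uuid4()` would also give unique identifiers but would not be reproducible. Links to external hosts are left untouched.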

Utility of Data Grids
• Logical Universal Identifier
• Uniform Access to Distributed Data
• Data Discovery through Metadata
• Open Policy for Provenance & Data Management
• Rule-based workflow for Metadata Extraction & Analysis
• Audit trails to capture provenance
• Provides a Platform for Data Publication
• Provides a means to Uniquely identify datasets
• Can be used to enforce metadata requirement - policy
• Cross-referencing provenance through workflow-derived data
• Provides a means to perform data attribution
