Preservation Metadata: between theory and practice
Preservation Metadata Workshop (2), The Hague, the Netherlands, 19 June 2014
Titia van der Werf
Adapted from: Rebecca Guenther, “Metadata for preservation of digital objects: background, functions, and standards”, Preservation Metadata Workshop (1), Hilversum, The Netherlands, 4 March 2014
OUTLINE
1. General introduction to preservation metadata
2. The PREMIS Data Dictionary
3. A use case: the Preservation Health Check
Introduction to preservation metadata
metadata
Function
• Discovery
• Access
• Management
• Control intellectual property rights
• Identification
• Certify authenticity
• Mark content structure
• Indicate status
• Describe processes
• Etc.
Type
• Descriptive
• Administrative
• Technical
• Rights/Access
• Structural
• Meta-metadata
• Etc.
digital preservation
Digital preservation is part and parcel of the “management and preservation” tasks and responsibilities of a heritage institution. Digital information poses its own set of challenges to preservation:
• The overwhelming volume of digital information created daily and the uncontrolled duplication of information;
• The complexity of digital information (content, structure, context, presentation, behaviour) and the evolving boundaries of the scholarly record and the cultural record;
• The dependency on software/hardware (incl. incompatible, obscure or proprietary systems);
• The rapid technological change and the danger of obsolescence;
• The ease of (accidental or malicious) content alteration;
• Doubts about the reliability and integrity of electronic records and the need to vouch for their authenticity.
digital preservation (same slide repeated, with two challenges highlighted)
• The dependency on software/hardware (incl. incompatible, obscure or proprietary systems)
• The rapid technological change and the danger of obsolescence
preservation metadata in 2000
“We can then say that the main problem metadata for long term preservation will help to solve is the problem of technological obsolescence.” (p.4)
Six essential properties of successful digital preservation
metadata and preservation metadata
METADATA: “Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource”
PRESERVATION METADATA: “Metadata that supports and documents the digital preservation process”
supporting and documenting the digital preservation process
• Provenance:
– The chain of custody/ownership of the digital object; info about the depositor; etc.
• Authenticity:
– The documentation of changes affecting the authenticity of the digital object during the preservation process
• Preservation Activity:
– The documentation of actions taken to preserve the digital object
• Technical Environment:
– The documentation of the dependencies on and changes in the technical environment needed to render and use the digital object
• Rights:
– The documentation of the rights and permissions for carrying out preservation activities on the digital object (duplication, migration, transformations)
OAIS Information Model
Information Package Concepts and Relationships (Figure 2-3)
Preservation Description Information (Figure 4-16, June 2012 version)
• Reference Information: identifiers of the Content
• Provenance Information: history of the custody
• Context Information: relation of the Content to other objects
• Fixity Information: a data integrity checksum of the Content
• Access Rights Information: permissions for preservation operations
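As a minimal sketch of how the five PDI categories might be held together in a repository, the Python below models them as one record and uses the Fixity Information to test the Content. The class name, field names, the `sha256:` prefix and the identifiers are illustrative assumptions; OAIS does not prescribe any of them.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PreservationDescriptionInformation:
    """Hypothetical container for the five PDI categories."""
    reference: str        # identifier of the Content
    provenance: tuple     # history of custody, oldest entry first
    context: tuple        # identifiers of related objects
    fixity: str           # checksum of the Content, here "sha256:<hex>"
    access_rights: tuple  # permitted preservation operations


def sha256_fixity(content: bytes) -> str:
    """Compute a fixity value for the Content bytes."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def fixity_matches(pdi: PreservationDescriptionInformation, content: bytes) -> bool:
    """Return True if the Content still matches the recorded checksum."""
    return pdi.fixity == sha256_fixity(content)


content = b"example content"
pdi = PreservationDescriptionInformation(
    reference="ark:/00000/example",                  # made-up identifier
    provenance=("deposited by Example Depositor",),  # made-up custody entry
    context=("ark:/00000/related",),
    fixity=sha256_fixity(content),
    access_rights=("duplication", "migration"),
)
print(fixity_matches(pdi, content))         # True
print(fixity_matches(pdi, content + b"!"))  # False
```

The record is frozen on purpose: any change to the PDI has to produce a new record rather than silently alter the old one.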
How to record and manage change
OAIS rule: if the PDI changes, the AIP version changes.
Implementation choices: e.g. keep the fixity information in the source AIP, and keep the log of data integrity checks and their outcomes separate from the AIP, so that routine checks do not change the PDI and force a new AIP version.
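The rule and the implementation choice can be sketched together, assuming a very simple in-memory model. `AIP`, `update_pdi` and `FixityLog` are hypothetical names invented for this illustration, not part of OAIS or of any repository system.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class AIP:
    identifier: str
    version: int
    pdi: dict  # treated as read-only in this sketch


def update_pdi(aip: AIP, changes: dict) -> AIP:
    """Apply PDI changes; a real change yields a new AIP version."""
    new_pdi = {**aip.pdi, **changes}
    if new_pdi == aip.pdi:
        return aip  # nothing changed, so the version stays the same
    return replace(aip, version=aip.version + 1, pdi=new_pdi)


@dataclass
class FixityLog:
    """Outcomes of integrity checks, stored outside the AIP."""
    entries: list = field(default_factory=list)

    def record(self, aip: AIP, passed: bool) -> None:
        checked_at = datetime.now(timezone.utc)
        self.entries.append((aip.identifier, aip.version, checked_at, passed))


aip = AIP("aip-001", 1, {"fixity": "sha256:abc"})
log = FixityLog()
log.record(aip, passed=True)  # logging a check leaves the AIP untouched
aip_v2 = update_pdi(aip, {"provenance": "migrated to new storage"})
print(aip.version, aip_v2.version, len(log.entries))  # 1 2 1
```

Because the check outcomes live in the log, a repository can run integrity checks as often as its policy requires without multiplying AIP versions.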
OAIS compliance relevant to preservation metadata
OAIS Mandatory Responsibilities:
1. Negotiating and accepting information
2. Obtaining sufficient control of the information to ensure long-term preservation
3. Determining the "designated community"
4. Ensuring that information is independently understandable
5. Following documented policies and procedures
6. Making the preserved information available
Digital repository certification
– RLG-NARA Task Force on Digital Repository Certification
– Various other certification initiatives (CRL, DCC, nestor, DRAMBORA)
– Trusted Repositories Audit & Certification (TRAC): Criteria and
Functions of a trusted digital repository relevant to preservation metadata
• Maintains precise descriptions of actions necessary to ensure that objects are preserved
• Has mechanisms for monitoring and notification when formats are becoming obsolete
• Uses tools and resources such as format registries to establish semantic and technical context
• Has processes for storage media and/or hardware changes
• Tracks and manages intellectual property rights and restrictions
• Ensures that agreements applicable to access conditions are adhered to
• Maintains descriptive metadata for access and retrieval and associates it with object
PREMIS
Standards that address preservation metadata: technical
• PREMIS
• Images
– NISO Z39.87 and MIX
– Adobe and XMP (Extensible Metadata Platform)
– Exif (Exchangeable Image File Format)
– IPTC (International Press Telecommunications Council)/XMP
• Text: textMD
• Sound
– AES57-2011: Audio Object XML Schema
– AES60-2011: Core Audio Metadata
– AudioMD (Library of Congress)
Standards that address preservation metadata: technical (continued)
• Video
– VideoMD
– SMPTE RP210
– Technical metadata in EBUCore, PBCore
– U.S. Federal Agencies Digitization Guidelines
– MPEG-7 and MPEG-21 for video
Standards that address preservation metadata: Structural
§ METS
§ PREMIS
§ MPEG-21 Digital Item Declaration
§ OAI/ORE
§ Specific format types
– MXF
– AVI
Standards that address preservation metadata: Rights
• PREMIS
• METS Rights
• CDL Copyright schema
• Creative Commons
• PLUS for images
• MPEG-21 REL for moving images
• ONIX for licensing terms
• Full rights expression languages
– XrML/MPEG-21
– ODRL
PREMIS Data Dictionary
• May 2005: Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group
• March 2008: PREMIS Data Dictionary for Preservation Metadata, version 2.0
• Jan. 2011: version 2.1
• April 2012: version 2.2
• Announced in September 2013: version 3.0
• Data Dictionary:
– Comprehensive view of information needed to support digital preservation
• Guidelines/recommendations to support creation, use, management
– Based on deep pool of institutional experiences in setting up and managing operational
• Preservation metadata: maintain viability, renderability, understandability, authenticity, identity in a preservation context
• Core: What most preservation repositories need to know to preserve digital materials over the long term
• Implementable: rigorously defined; supported by usage guidelines/recommendations; emphasis on automated workflows and metadata generation
• Technical neutrality: no assumptions about technologies, systems and architectures, where metadata is stored
Scope
• What PREMIS DD is:
– Common data model for organizing/thinking about preservation metadata
– Guidance for local implementations
– Standard for exchanging information packages between repositories
– Compatible with the OAIS reference and information model
• What PREMIS DD is not:
– Out-of-the-box solution: need to instantiate as metadata elements in repository system
– All needed metadata: excludes business rules, format-specific technical metadata, descriptive metadata for access, non-core preservation metadata
– Lifecycle management of objects outside repository
– Rights management: limited to permissions regarding actions taken within repository
PREMIS Data Model
[Figure: the five entities of the data model: Intellectual Entities, Objects, Rights Statements, Agents, Events]
Intellectual Entities
Examples:
• The Chamber by John Grisham (an ebook)
• “Maggie at the beach” (a photograph)
• The Metropolitan New York Library Council Website (a website)
• Set of content that is considered a single intellectual unit for purposes of management and description (e.g., a book, a photograph, a map, a database)
• Has one or more digital representations
• May include other Intellectual Entities (e.g. a website that includes a web page)
• Not fully described in PREMIS DD, but can be linked to in metadata describing digital representation (THIS WILL CHANGE IN 3.0)
Objects
Examples:
§ A PDF file
§ A book composed of several XML files and many images
§ A TIFF file containing a header and 2 images
Objects are what the repository actually preserves:
• FILE: named and ordered sequence of bytes that is known by an operating system
• REPRESENTATION: set of files, including structural metadata, that, taken together, constitute a complete rendering of an Intellectual Entity
• BITSTREAM: data within a file with properties relevant for preservation purposes (but needs additional structure or reformatting to be a stand-alone file)
• FILESTREAMS (files within files) are considered files since they can be rendered alone
Agents
Examples:
§ Rebecca Guenther (a person)
§ New York Public Library (an organization)
§ JHOVE version 1.0 (a software program)
• Person, organization, or software program/system associated with an Event or a Right (permission statement)
• Agents are associated only indirectly with Objects, through Events or Rights
• Not defined in detail in PREMIS DD; not considered core preservation metadata beyond identification
Semantic units pertaining to Agents
• Agent Identifier
• Agent Name
• Agent Type
• Agent Note
• Agent Extension
• Linking Event Identifier
• Linking Rights Identifier
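Since Agents reach Objects only indirectly, finding out who or what acted on an object means walking the Events. The sketch below assumes a simplified `Event` record with made-up identifiers; it is not the PREMIS XML schema, only an illustration of the linking idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    event_type: str
    object_ids: tuple  # linking object identifiers
    agent_ids: tuple   # linking agent identifiers


def agents_for_object(object_id, events):
    """Collect the agents linked to an object through Events only."""
    found = set()
    for event in events:
        if object_id in event.object_ids:
            found.update(event.agent_ids)
    return sorted(found)


# Made-up identifiers for illustration.
events = [
    Event("ev-1", "validation", ("file-1",), ("jhove-1.0",)),
    Event("ev-2", "ingestion", ("file-1", "file-2"), ("nypl",)),
    Event("ev-3", "fixity check", ("file-2",), ("checker",)),
]
print(agents_for_object("file-1", events))  # ['jhove-1.0', 'nypl']
```

A complete version would walk Rights Statements in the same way, since those are the other route from an Agent to an Object.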
Rights Statements
Example:
§ Priscilla Caplan grants FCLA digital repository permission to make three copies of metadata_fundamentals.pdf for preservation purposes.
• An agreement with a rights holder that grants permission for the repository to undertake an action(s) associated with an Object(s) in the repository.
• Not a full rights expression language; focuses exclusively on permissions that take the form:
– Agent X grants Permission Y to the repository in regard to Object Z.
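The "Agent X grants Permission Y in regard to Object Z" form lends itself to a simple lookup before a preservation action is carried out. The following is a sketch under assumptions: the class, the field names, the `limit` field and the act label `replicate` are invented for illustration and are not PREMIS semantic units.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RightsStatement:
    granting_agent: str           # Agent X
    act: str                      # Permission Y: the permitted action
    object_id: str                # Object Z
    limit: Optional[int] = None   # e.g. maximum number of copies


def is_permitted(statements, act, object_id, already_done=0):
    """Check whether the repository may perform `act` on the object."""
    for statement in statements:
        if statement.act == act and statement.object_id == object_id:
            if statement.limit is None or already_done < statement.limit:
                return True
    return False


statements = [
    RightsStatement("Priscilla Caplan", "replicate",
                    "metadata_fundamentals.pdf", limit=3),
]
pdf = "metadata_fundamentals.pdf"
print(is_permitted(statements, "replicate", pdf, already_done=2))  # True
print(is_permitted(statements, "replicate", pdf, already_done=3))  # False
print(is_permitted(statements, "migrate", pdf))                    # False
```

Anything not explicitly granted is refused, which matches the scope stated on the slide: permissions only, not a full rights expression language.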
Semantic units pertaining to Rights
• Rights Statement
• Rights Statement Identifier
• Rights Basis
• Copyright Information
• License Information
• Statute Information
• Other Rights Information
• Conformance statement issued in 2010
• PREMIS Conformance Working Group active now
• Levels of conformance:
– Level 1: A repository uses an internal metadata schema whose elements can be mapped to PREMIS. The mapped metadata can satisfy the principles of use at both the semantic unit and Data Dictionary levels. The repository is able to produce documentation demonstrating such mapping for representative samples of its holdings.
– Level 2: A repository implements the PREMIS Data Dictionary as its internal metadata schema in a way that satisfies the principles of use at both the semantic unit and Data Dictionary levels and in a form that does not require further mapping or conversion.
• PREMIS Implementers Group list http://listserv.loc.gov/listarch/pig.html
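Level 1 conformance rests on a documented mapping from an internal schema to PREMIS. A minimal sketch of such a mapping follows; the internal field names on the left are hypothetical, while the names on the right are PREMIS semantic units. Reporting the unmapped fields is one way to surface what the mapping documentation still has to account for.

```python
# Hypothetical internal field names mapped to PREMIS semantic units.
INTERNAL_TO_PREMIS = {
    "file_id": "objectIdentifierValue",
    "checksum": "messageDigest",
    "checksum_algo": "messageDigestAlgorithm",
    "byte_size": "size",
    "format": "formatName",
}


def to_premis(record):
    """Map an internal record; return (mapped values, unmapped field names)."""
    mapped, unmapped = {}, []
    for key, value in record.items():
        if key in INTERNAL_TO_PREMIS:
            mapped[INTERNAL_TO_PREMIS[key]] = value
        else:
            unmapped.append(key)
    return mapped, sorted(unmapped)


record = {
    "file_id": "f-42",
    "checksum": "abc123",
    "checksum_algo": "SHA-256",
    "shelf_note": "box 7",  # internal-only field with no PREMIS equivalent
}
mapped, unmapped = to_premis(record)
print(mapped["messageDigestAlgorithm"])  # SHA-256
print(unmapped)                          # ['shelf_note']
```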
A use case: the preservation health check
- Open Planets Foundation (OPF): A community hub for digital preservation whose main goal is to jointly manage and improve tools and research outcomes for practical use.
- OCLC Research: A community resource for shared R&D that addresses challenges facing libraries and archives in a rapidly changing information technology environment.
- Bibliothèque nationale de France: The BnF runs a fully operational trusted digital repository (SPAR). They volunteered to become a PHC-pilot site.
What is the Preservation Health Check Pilot?
As part of their preservation management task, repository managers need to be able to monitor the preservation status of the content of their repository.
We are looking at regular “routine check-ups” that can support this monitoring task.
– Monitoring should be made easy (automatically generated reports or dashboard)
– Monitoring should be based on objective data, generated by the repository (e.g. preservation metadata)
The Preservation Health Check proposition
The analogy
If a Preservation Health Check is a monitoring activity to be performed on a repository with digital content:
1. What are empirical indicators (i.e. measures) for PHCs?
2. Are preservation metadata recorded by repositories useful as health indicators for PHCs?
Monitoring is about tracking change ... intentional and unintentional change.
The research question
Goal: To develop an implementable logic (or protocol) to support PHCs, and to test this logic against the store of preservation metadata maintained by an operational preservation repository.
The BnF runs a fully operational trusted digital repository (SPAR). They volunteered to become a PHC-pilot site.
The empirical data consists of:
1. A sample (200 GB) of the PREMIS data (AIP-METS files), covering the following collections:
– Gallica = digitised periodicals, monographs, still images and manuscripts (TIFF + OCR-files)
– Legal deposit Web harvests (warc files)
– 3rd party collection (Centre Pompidou)
The pilot site
The empirical data consists of (continued):
2. All the Reference Information packages in SPAR that contain reference information/code/specifications of (external) tools used during INGEST (ex. JHOVE) and of formats ingested;
3. Per collection: SLAs defining policy agreements with SIP suppliers concerning the preservation regime to be applied at the INGEST and ARCHIVAL STORAGE stages.
The pilot site
Mapping PREMIS on to SPOT
[Figure: the PREMIS Data Model (Intellectual Entities, Objects, Agents, Rights, Events, with their Semantic Units) mapped on to the SPOT Model (Availability, Identity, Persistence, Renderability, Understandability, Authenticity, with their Threats)]
preservation metadata in 2005
“Preservation metadata (…) metadata supporting the functions of maintaining viability, renderability, understandability, authenticity, and identity in a preservation context.” (p. ix)
http://www.loc.gov/standards/premis/
Findings: coverage
SPOT property       # of PREMIS semantic units*
Availability        16
Identity            19
Persistence         10
Renderability       15
Understandability   14
Authenticity        16
*Container level only; Agents, Events, Rights considered one semantic unit
Findings: coverage
• What does coverage in terms of “number of PREMIS semantic units” mean?
• More meaningful: Do the PREMIS semantic units address the threats associated with a SPOT property?
Example of a gap between SPOT and PREMIS:
SPOT property: Understandability
We found no PREMIS semantic units that provide information that aids in the understanding or interpretation of the content of the archived digital object.
A repository usually implements a large number of explicit and implicit policy decisions; however, PREMIS currently makes few provisions for recording these in preservation metadata (the semantic unit preservationLevel being a notable exception).
Findings: preservation policies
PREMIS conformance does not require explicit encoding of metadata if the information applies to all objects in the repository.
This impedes the provision of automated PHC services (by a third-party provider) because efficient provision of this service would likely require the information in semantic units to be explicitly recorded, and implemented in a standard way.
Findings: explicit encoding
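One way a repository could close this gap before handing metadata to a third-party PHC service is to write its repository-wide values into each record on export. This is only a sketch: the function name and the default values are assumptions, with `preservationLevelValue` and `storageMedium` borrowed from PREMIS as example keys.

```python
# Made-up repository-wide values that would otherwise stay implicit.
REPOSITORY_DEFAULTS = {
    "preservationLevelValue": "full",
    "storageMedium": "tape",
}


def make_explicit(object_metadata, defaults):
    """Return a copy in which repository-wide values are recorded explicitly.

    Values already present on the object take precedence over the defaults.
    """
    return {**defaults, **object_metadata}


exported = make_explicit(
    {"objectIdentifierValue": "f-42", "storageMedium": "disk"},
    REPOSITORY_DEFAULTS,
)
print(exported["preservationLevelValue"])  # full
print(exported["storageMedium"])           # disk
```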
Logic for assessing Persistence
[Figure: SPOT Model, six essential properties of successful digital preservation (Availability, Persistence, Identity, Renderability, Understandability, Authenticity), with their Threats]
• If storage medium information is not available in PREMIS metadata, the PHC will need to take other information sources into account – such as audit reports generated by storage management systems.
• We note that there are no pre-defined events for Corruption and Readability in PREMIS, which means that the repositories need to define their own events. PREMIS does provide a list of recommended event labels for the semantic unit eventType, but it is just a “suggested starter list”.
• The repository should have policies in place that prescribe frequencies of fixity checks, of medium refreshment, backup policy, etc. The PREMIS semantic unit preservationLevel does not address such policies. The PHC flow thus needs to get the policy information from other sources.
Logic for assessing Persistence
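A small part of such a logic can be sketched as a check of recorded fixity events against a policy interval. The assumptions are stated plainly: events are simplified to dictionaries with invented keys, the repository has chosen `fixity check` as its event label, and the maximum interval comes from a policy source outside PREMIS, as the findings above require. This is an illustration, not the pilot's actual protocol.

```python
from datetime import date, timedelta


def persistence_status(events, object_id, max_interval_days, today):
    """Classify one object from its recorded fixity check events."""
    checks = sorted(
        (e for e in events
         if e["object"] == object_id and e["type"] == "fixity check"),
        key=lambda e: e["date"],
    )
    if not checks:
        return "no fixity check recorded"
    latest = checks[-1]
    if latest["outcome"] != "pass":
        return "latest fixity check failed"
    if today - latest["date"] > timedelta(days=max_interval_days):
        return "fixity check overdue"
    return "ok"


# Made-up events; the 30-day interval stands in for an external policy.
events = [
    {"object": "obj-1", "type": "fixity check",
     "date": date(2014, 6, 1), "outcome": "pass"},
    {"object": "obj-2", "type": "fixity check",
     "date": date(2014, 1, 10), "outcome": "pass"},
    {"object": "obj-3", "type": "ingestion",
     "date": date(2014, 5, 5), "outcome": "pass"},
]
today = date(2014, 6, 19)
for object_id in ("obj-1", "obj-2", "obj-3"):
    print(object_id, persistence_status(events, object_id, 30, today))
```

Keeping "no check recorded" apart from "overdue" matters for a health check: the first may point to metadata that was never encoded, the second to a policy that is not being followed.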
A use case: the preservation health check (to be continued)