An Introduction to Digital Preservation at the Library of Congress
Leslie Johnston, Library of Congress

Dec 17, 2014

Introduction to digital preservation initiatives at the Library of Congress and the National Digital Information Infrastructure and Preservation Program
Transcript
Page 1: An Introduction to digital preservation at the Library of Congress

An Introduction to Digital Preservation at the Library of Congress

Leslie Johnston
Library of Congress

Page 2: An Introduction to digital preservation at the Library of Congress

NDIIPP: National Digital Information Infrastructure and Preservation Program

MISSION: Ensure access over time to a rich body of digital content through establishment of a national network of partners committed to selecting, collecting and preserving at-risk digital information.

http://www.digitalpreservation.gov/

Page 3: An Introduction to digital preservation at the Library of Congress

NDIIPP

Learn By Doing

Catalyze Activity

Support Collaboration

Page 4: An Introduction to digital preservation at the Library of Congress

NDIIPP Focus Areas

Digital Content

Partnerships: Government, Industry, Academia

Technical Infrastructure

Education

Page 5: An Introduction to digital preservation at the Library of Congress

Access Drives Preservation

Page 6: An Introduction to digital preservation at the Library of Congress

There are Important Non-Technical Issues

Legal: intellectual property, copyright, privacy, national security classification

Collaboration: new models needed for institutions, communities to work together

Institutional culture: staff need new skills, new policies need to be made, leaders need to integrate analog and digital

Cost: many cost variables; economic sustainability is an issue

Page 7: An Introduction to digital preservation at the Library of Congress

Digital Content can be Copyrighted, Private, Confidential

Societal norms and expectations for privacy are shifting

Especially on the Internet

Data mining and other techniques allow for new kinds of access and new policies

Email – public and personal

Personal digital archives in special collections

Page 8: An Introduction to digital preservation at the Library of Congress

Economic Issues

http://ncdd.nl/en/document/EnglishSummary.pdf

http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf

Hard to know the ongoing costs for digital preservation, lots of variables

Institutions often need to support the preservation of analog and digital collections with tight budgets

Demonstrate value of preserved digital content through use and reuse

Page 9: An Introduction to digital preservation at the Library of Congress

Organizational Issues

http://ncdd.nl/en/document/EnglishSummary.pdf

Digital preservation is a big challenge

New models are needed for institutions, communities to work together

Preservationists need to be involved much earlier in the lifecycle of a digital object

A variety of new skills and training opportunities are needed.

Page 10: An Introduction to digital preservation at the Library of Congress

Examples of Digital Preservation Initiatives

Open Planets Foundation: European project using a solution adopted by national heritage organizations and others

National Archives and Records Administration: Developing Electronic Records Archives system to meet federal records management and archival needs

National Library of New Zealand: Developing National Digital Heritage Archive for digital collections

International Internet Preservation Consortium: Group of national libraries and other organizations collaborating in web content preservation and developing common tools

Page 11: An Introduction to digital preservation at the Library of Congress

What are examples of some of the collecting and preservation challenges at the Library of Congress?

Page 12: An Introduction to digital preservation at the Library of Congress

National Digital Newspaper Program

chroniclingamerica.loc.gov/

Some researchers want to search for stories in historic newspapers.

Some researchers want to mine newspaper OCR for trends across time periods and geographic areas. Requests have come in to analyze all 6 million pages.

The site gets approximately 5 million views per day.

The program has:
Multiple producers (25 now, ultimately 54)
Free and open public access
APIs for machine access and automated processes

Files:
TIFFs, JPEGs, JPEG 2000s, and XML
Over 6 million newspaper pages ingested to date
Over 250 TB of data

Page 13: An Introduction to digital preservation at the Library of Congress

eDeposit for eSerials

eDeposit for eSerials is a collaborative effort between the U.S. Copyright Office and the Library of Congress.

Copyright Mandatory Deposit represents the largest acquisitions channel for the Library. In general, all U.S. publishers are legally required to submit for deposit two copies of each of their publications to the Copyright Office. This mechanism has allowed the Library to build the collection and to preserve the publications.

eSerials became subject to mandatory deposit in January 2010, with the publication of a new interim regulation. Demands began in June 2010 and files began to arrive in October 2010.

The files must come to the Library “as published”, in whatever formats they were originally issued. This means a wide variety of XML content and metadata, HTML, and PDFs. We have received 49 different file extensions…so far.

Page 14: An Introduction to digital preservation at the Library of Congress

Packard Campus National Audio-Visual Center

Preserving Film, Broadcast Television, and Audio

The Packard Campus supports a variety of preservation workflows, including those for obsolete physical formats such as wire recordings, wax cylinders, and 2" videotape. The Campus is fully equipped to play back and preserve all antique film, video and sound formats, and to maintain that capability far into the future.

The facility also handles born-digital video and audio received directly from producers.

The formats include MPEG-4, MP3, BWF, AVI, and a wide variety of specialized commercial formats.

Over 3.5 PB of files.

Page 15: An Introduction to digital preservation at the Library of Congress

Web Archiving

http://www.loc.gov/webarchiving/

lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records.

Websites are complex objects:
multiple formats
interrelated elements
distributed authors
ownership is not transparent

The concept of publishing on the Web does not match the legal definition

The volume of content is immense

Website publishing technology is constantly changing

When we began archiving election web sites, we imagined users browsing through the web pages. But when our first researchers came to the Library, they wanted to mine the collections.

Files:
Every format possible on the web
Approximately 7 billion files
Over 400 TB

Page 16: An Introduction to digital preservation at the Library of Congress

The Twitter Archive

Every public tweet since Twitter’s launch in March 2006.

The Library’s researcher services will not recreate Twitter, and the archive cannot be openly accessible.

Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language.

The collection comprises only a few TB, but tens of billions of tweets.

A White Paper is available at http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/

[Figure: keyword labels: status, privacy, commercial, personal, events, social media, visualization, social science]

Page 17: An Introduction to digital preservation at the Library of Congress

Personal Digital Archiving

Libraries/archives/museums have reasons to engage with individuals about personal digital preservation

May bring in personal digital collections
Raise institutional visibility
Answer patron questions

Guidance for the general public on saving their digital stuff: documents, photos, music, video, email, websites etc.

Public Events

Further How-to’s and tutorials

“Personal Archiving Day”

http://www.digitalpreservation.gov/personalarchiving/

Page 18: An Introduction to digital preservation at the Library of Congress

What are some of the technological challenges of managing and preserving large digital collections in many formats, and making them available for re-use?

Page 19: An Introduction to digital preservation at the Library of Congress

Sheer amount.

Huge variation in file formats.

Unclear and undocumented rights.

Security.

Missing metadata.

Data citation and identifier issues.

Discovery expectations: discovery across collections and institutions together.

Cost.

Page 20: An Introduction to digital preservation at the Library of Congress

I will mention infrastructure only in passing

There are scale issues related to:

Bandwidth

Storage

Backup and tape archiving

Software development

Staffing for processing

Page 21: An Introduction to digital preservation at the Library of Congress

Preservation Architecture

There is no national preservation architecture, system, or storage backend.

Highly variable institution by institution, but commonalities in backend repository systems, ingest models, and discovery models.

Community- and discipline-based repositories, often with an unclear relationship to libraries or archives.

Multiple methods for certifying the trust level for a repository.

Agreed-upon protocols and mechanisms for the transfer of files, but no single standard for the interchange of files and metadata between environments.

Synchronization and versioning are not just a technical challenge; they complicate management, preservation, and access.

Page 22: An Introduction to digital preservation at the Library of Congress

And at the Library of Congress?

The Library has an active digital reformatting program across all formats.

The Library is currently modifying its preservation and collection security policies around digital collections.

The Library has repository services that inventory its file assets and maintain multiple copies of files on servers and on tape, in geographically distributed locations.

The Library developed the BagIt transfer specification for the movement of files between and within organizations.

http://www.digitalpreservation.gov/documents/bagitspec.pdf

The Library has documented sustainability factors for file formats. http://www.digitalpreservation.gov/formats/

For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.

http://www.copyright.gov/circs/circ07b.pdf
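
The BagIt transfer specification mentioned above packages payload files under a data/ directory alongside a declaration file and a checksum manifest. The following is a minimal sketch of that layout, not the Library's own tooling: the function names make_bag and verify_bag are hypothetical, and the version string "0.97" and the choice of SHA-256 for the manifest are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> Path:
    """Write payload files under data/ and add the two core tag files.

    Hypothetical helper: a sketch of the bag layout, not a full implementation.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest entries pair a checksum with a path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # Bag declaration; the version value here is an assumption.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> list[str]:
    """Return payload paths whose current checksum differs from the manifest."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            failures.append(rel_path)
    return failures
```

A receiving organization could run verify_bag after a transfer; an empty list means every payload file still matches the checksum recorded by the sender.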

Page 23: An Introduction to digital preservation at the Library of Congress

What are the Library’s strategies for formats?

The Library has documented sustainability factors for file formats.

http://www.digitalpreservation.gov/formats/

For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.

http://www.copyright.gov/circs/circ07b.pdf

The Library is ready to start developing Digital Format Preservation Action Plans.

Page 24: An Introduction to digital preservation at the Library of Congress

What are the Digital Preservation Services?

We must develop sufficient infrastructure for distributed, replicated preservation storage.

We will spend an increasing amount of time auditing our files and storage to ensure that no issues have arisen.

We may need to process all files to create a variety of derivatives that are more sustainable, and that might be required for various forms of use and analysis before ingesting them and providing access.

We must develop sufficient infrastructure to support large scale discovery.

We are comfortable with self-service through the institutional repository model, but can libraries ingest, manage and provide access to an increasing number of digital collections without any mediation?

We are providing quite a bit of guidance to researchers on digital preservation standards and personal digital preservation.
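
Auditing replicated storage, as described above, can be illustrated by comparing checksums of the same file across copies. This is a sketch under stated assumptions, not a description of the Library's services: it assumes each replica is reachable as a local directory with the same relative layout, and the names sha256_of and audit_replicas are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files are not loaded into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_replicas(replica_roots: list[Path]) -> dict[str, str]:
    """Map relative path -> problem, for files whose copies are missing or differ."""
    # Build one inventory (relative path -> checksum) per replica.
    inventories = []
    for root in replica_roots:
        inventories.append({
            p.relative_to(root).as_posix(): sha256_of(p)
            for p in root.rglob("*")
            if p.is_file()
        })

    problems = {}
    all_paths = set().union(*inventories)
    for rel in sorted(all_paths):
        digests = [inv.get(rel) for inv in inventories]
        if None in digests:
            problems[rel] = "missing from at least one replica"
        elif len(set(digests)) > 1:
            problems[rel] = "checksum mismatch between replicas"
    return problems
```

An empty result means every file is present in every replica with identical content. A disagreement only shows that the copies differ; deciding which copy is correct would need a checksum recorded at ingest.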

Page 25: An Introduction to digital preservation at the Library of Congress

And where are the digital preservation innovations?

Page 26: An Introduction to digital preservation at the Library of Congress

The Cloud is a supplement – NOT a replacement – for local preservation storage resources.

Page 27: An Introduction to digital preservation at the Library of Congress

In content characterization tools, such as JHOVE and DROID and FITS, so we can understand the risks inherent in the files in our collections.
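
One basic step in content characterization is identifying a file's format from its leading bytes rather than its extension. The sketch below illustrates only that idea and does not use or reproduce the APIs of JHOVE, DROID, or FITS. The SIGNATURES table and the identify functions are hypothetical names, and the list covers just a few well-known signatures.

```python
# A few well-known file signatures ("magic numbers"); real tools use far
# larger registries and also validate internal structure.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000"),
]


def identify(header: bytes) -> str:
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path) -> str:
    """Read the first bytes of a file and identify its format."""
    with open(path, "rb") as f:
        return identify(f.read(16))
```

Checking signatures instead of extensions matters for collections received "as published", where an extension may be missing or misleading.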

Page 28: An Introduction to digital preservation at the Library of Congress

In the adaptation and use of forensics tools for the creation of complete and authentic copies of unique digital media.

Page 29: An Introduction to digital preservation at the Library of Congress

In virtualization and emulation technologies used to recreate environments needed for digital preservation and for access.

Page 30: An Introduction to digital preservation at the Library of Congress

Preservation Partnerships are a Necessary Innovation

The Library cannot collect everything on its own, so it works as part of:

The National Digital Stewardship Alliance http://www.digitalpreservation.gov/ndsa/

The International Internet Preservation Consortium http://netpreserve.org/about/index.php

among others…

Page 31: An Introduction to digital preservation at the Library of Congress

What is Success for any Digital Preservation Initiative?

Success must be measured in concrete goals and deliverables that are widely and openly distributed.

Success is also measured in enthusiasm, participation, and in adoption by the community.

Page 32: An Introduction to digital preservation at the Library of Congress

Summary

Digital information presents tough issues in terms of preservation and access

Libraries and archives must address these issues even though there are no ideal solutions and some open questions

Progress is evident through the application of shared concepts

Initiatives are underway around the world testing different approaches to preservation

There are a number of significant non-technical issues

Digital preservation is also relevant on the personal level

Page 33: An Introduction to digital preservation at the Library of Congress

NDIIPP Digital Preservation Outreach

http://www.digitalpreservation.gov/formats/index.shtml

The Library of Congress “Sustainability of Digital Formats” site, which analyzes the preservation merits of a variety of digital file formats.

Page 34: An Introduction to digital preservation at the Library of Congress

NDIIPP Digital Preservation Outreach

http://www.digitalpreservation.gov

The Library of Congress Digital Preservation web site

Page 35: An Introduction to digital preservation at the Library of Congress

NDIIPP Digital Preservation Outreach

http://blogs.loc.gov/digitalpreservation

The NDIIPP blog “The Signal”: Where we post, and discuss, the many issues, news items and project updates about digital preservation and library technology, both inside and outside of the Library of Congress.

Page 36: An Introduction to digital preservation at the Library of Congress

Leslie Johnston
Library of Congress

[email protected]@loc.gov