Leslie Johnston: Big Data at Libraries, Georgetown University Law School Symposium on Big Data, January 2013

Big Data: New Challenges for Digital Preservation and Digital Services

Leslie Johnston

Acting Director, National Digital Information Infrastructure & Preservation Program

Library of Congress

What are the Biggest Insights that we have Learned in Fifteen Years of Building Digital Collections?

We can never guess every way that our collections will be used.

Researchers do not use digital collections the same way that they use analog collections.

Stewardship organizations have, until recently, spoken of “collections” and “content” and “records” and even “files.”

Now it’s also data.

Data is not just generated by satellites, identified during experiments, or collected during surveys.

Datasets are not just scientific and business tables and spreadsheets.

We have Big Data in our Libraries, Archives and Museums.

What are examples of some of the challenges of collecting and preserving large scale collections in many formats, and making them usable as collections and as data?

More and more researchers want to use collections as a whole, mining and organizing the information in novel ways.

Researchers use algorithms to mine the rich information and tools to create pictures that translate that information into knowledge.

Researchers may want to interact with a collection of artifacts, or they may want to work with a data corpus.

We still have collections. But what we also have is Big Data, which requires us to rethink the infrastructure that is needed to support Big Data services. Our community used to expect researchers to come to us, ask us questions about our collections, and use our digital collections in our environment.

Now our collections are, more often than not, self-serve.

What are some use cases?

National Digital Newspaper Program

chroniclingamerica.loc.gov/

Some researchers want to search for stories in historic newspapers.

Some researchers want to mine newspaper OCR for trends across time periods and geographic areas.

Requests have come in to analyze all 5 million pages.

The site gets approximately 4 million hits per day.

The program has:

Multiple producers (25 now, ultimately 54)

Free and open public access

APIs for machine access and automated processes

Over 5.25 million newspaper pages ingested to date

Over 250 TB of data
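To make the "mining newspaper OCR for trends" use case concrete, here is a minimal sketch of counting a term across time periods and geographic areas. It is only an illustration: the sample records, the `PAGES` list, and the `term_counts_by` helper are hypothetical and are not part of the program's API. A real analysis would read OCR text obtained through the program's machine-access interfaces.

```python
import re
from collections import defaultdict

# Hypothetical sample records: (publication year, state, OCR text).
# These stand in for OCR text a researcher has already downloaded.
PAGES = [
    (1898, "New York", "War declared. The fleet sails for Cuba."),
    (1898, "Kansas", "Wheat prices rise as war news spreads."),
    (1905, "New York", "Subway expansion approved by the city."),
    (1905, "Kansas", "Wheat harvest sets a record; no war talk here."),
]


def term_counts_by(pages, term, key):
    """Count whole-word occurrences of `term`, grouped by key(record)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = defaultdict(int)
    for record in pages:
        counts[key(record)] += len(pattern.findall(record[2]))
    return dict(counts)


# Trend across time periods (group by year).
by_year = term_counts_by(PAGES, "war", key=lambda r: r[0])

# Trend across geographic areas (group by state).
by_state = term_counts_by(PAGES, "war", key=lambda r: r[1])

print(by_year)
print(by_state)
```

The same grouping function serves both questions. Only the key changes, which is why researchers tend to ask for the whole corpus rather than a search box.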

Packard Campus National Audio-Visual Center

Preserving Film, Broadcast Television, and Audio

The Packard Campus runs a variety of preservation workflows, including those for obsolete physical formats such as wire recordings, wax cylinders, and 2" videotape. The Campus is fully equipped to play back and preserve all antique film, video and sound formats, and to maintain that capability far into the future.

The facility also handles born-digital video and audio received directly from producers and copyright owners.

Over 3 PB of files.

WEB ARCHIVES

http://www.loc.gov/webarchiving/

lcweb2.loc.gov/diglib/lcwa/html/lcwa-home.html

The Library has been archiving the web since 2000. Subject area specialists curate the collections, and Library catalogers create collection-level metadata records. Permission requirements vary by site category.

The collections include:

• U.S. elections

• Web sites created by members of the House and Senate

• Thematic collections around events, such as elections in the Philippines, the Iraq war, and the appointment of Supreme Court Justices

• Collections around an area of study, such as Legal “Blawgs”

When we began archiving election web sites, we imagined users browsing through the web pages, studying the graphics or use of phrases or links. But when our first researchers came to the Library, they wanted to know about all those topics, but they used scripts to query for them and sort them into categories. They were not very much interested in reading web pages.

Approximately 6 billion files

Over 300 TB

THE TWITTER ARCHIVE

Every public tweet since Twitter’s launch in March 2006.

Research requests have included users looking for their own Twitter history, the study of the geographic spread of news, the study of the spread of epidemics, and the study of the transmission of new uses of language.

The collection comprises only a few TB, but it contains tens of billions of tweets.

A White Paper is available online at: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/

[Slide graphic; keywords shown: status, privacy, commercial, personal, events, social media, visualization, social science]

eSerials

Copyright Mandatory Deposit represents a large acquisitions channel for the Library. In general, all U.S. publishers are legally required to submit for deposit two copies of each of their publications to the Copyright Office. This mechanism has allowed the Library to build the collection and to preserve the publications.

eSerials became subject to mandatory deposit in January 2010, with the publication of a new interim regulation. Demands began in June 2010 and files began to arrive in October 2010.

The files must come to the Library “as published” – in whatever their original formats are.

Articles may be accompanied by their associated datasets.

RESEARCH DATASETS

The datasets generated/used in the research process.

Datasets can be:

Small, such as surveys of a small sample population

Medium, such as a corpus of images

Big Data, such as years of observational astronomical data.

It is not enough to be collecting publications.

We have to collect and preserve research data, in addition to recognizing that the collections we already have are Big Data to be mined.

Are our institutions ready?

We are building large digital collections and must consider new ways in which they should be managed and used.

I will mention infrastructure only in passing.

There are scale issues related to:

Storage

Backup and tape archiving

Bandwidth

Software development

Staffing for processing

Library of Congress Preservation Infrastructure

The Library developed the BagIt transfer specification for the movement of files between and within organizations.

http://www.digitalpreservation.gov/documents/bagitspec.pdf
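As a rough illustration of what a BagIt-style transfer package involves, the sketch below writes a payload directory, a `bagit.txt` declaration, and a checksum manifest, then re-verifies the payload after a simulated corruption. It is a simplified sketch and not the Library's tooling: the `make_bag` and `verify_bag` helpers, the file names, the declared version string, and the choice of SHA-256 are assumptions made for the example, and optional tag files described in the specification are omitted.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict) -> None:
    """Write payload files under data/ plus a declaration and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest line: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration; the version string is an assumption for this sketch.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> list:
    """Return payload paths whose checksum no longer matches the manifest."""
    failures = []
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "example-bag"
    make_bag(bag, {
        "page1.txt": b"OCR text for page one",
        "page2.txt": b"OCR text for page two",
    })
    clean_result = verify_bag(bag)        # nothing has changed yet
    (bag / "data" / "page2.txt").write_bytes(b"corrupted in transit")
    damaged_result = verify_bag(bag)      # the altered file is reported

print(clean_result, damaged_result)
```

The manifest is what lets the receiving organization confirm, without contacting the sender, that every file arrived intact.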

The Library inventories incoming files, and is gradually inventorying all digital content.

The Library maintains multiple copies of files on servers and on tape, in geographically distributed locations.

The Library has documented sustainability factors for file formats.

http://www.digitalpreservation.gov/formats/

For cases where we do have control over what comes in, we have a “Best Edition” Preferred Formats statement, which is currently being updated.

http://www.copyright.gov/circs/circ07b.pdf

There are many new activities to be planned for with new researcher uses and expectations.

How much ingest processing should be done with data collections, or collections that can be treated as data?

Should collections be processed to create a variety of derivatives that might be used in various forms of analysis before ingesting them?

Do libraries have sufficient infrastructure to create full-text indexes for millions/billions of files to support full discovery?
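For readers unfamiliar with what a full-text index is, the toy sketch below builds an in-memory inverted index (a map from each word to the documents containing it) and answers an all-words query. The `DOCS` sample, `build_index`, and `search` are hypothetical names for illustration. At the scale of millions or billions of files the same structure has to be sharded, compressed, and stored on disk, which is where the infrastructure question arises.

```python
import re
from collections import defaultdict

# Hypothetical sample documents keyed by an identifier.
DOCS = {
    "page-1": "Election results announced in the Senate",
    "page-2": "Senate hearing on the Supreme Court",
    "page-3": "Local election results",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


index = build_index(DOCS)
print(sorted(search(index, "election results")))
print(sorted(search(index, "senate")))
```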

Do libraries support analysis? Analytical tools are still in early days for the scale of large datasets.

And what are the service models?

If libraries decide that they will simply provide access to data, do they limit it to the native format or provide pre-processed or on-the-fly format transformation services for downloads?

Can libraries handle the download traffic?

Can staff develop the expertise to provide guidance to researchers in using analytical tools? Or is the expectation that researchers will fend for themselves?

Libraries are increasingly looking towards self-service – researchers need not ask to download or tell us that they have. We may never know.

BUT, libraries do have collections that are limited to on-site only access due to licenses or gift agreements. In that case, libraries may have to consider providing high-powered workstations with analytical tools for researchers to work with these collections and take analysis outputs away with them.

Both have policy implications and implications for public service staffing.

But the benefits outweigh the challenges.

Libraries are managing and preserving the datasets and big data necessary for re-use and replicability.

This is an important new role for libraries in enabling new research.

And libraries need to make the deposit and management of such data easier to accomplish.

Discussion…

Leslie Johnston

lesliej@loc.gov
