Profiling Web Archives
Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529
Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar
Los Alamos National Laboratory, Los Alamos, NM
David S. H. Rosenthal
Stanford University Libraries, Stanford, CA
Supported in part by the International Internet Preservation Consortium (IIPC)
Memento Aggregator
Long Tail of Archives
● The 400B+ web pages at the Internet Archive (IA) do not cover everything
● The top three archives after IA produce a full TimeMap 52% of the time (AlSum et al., TPDL 2013)
● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
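
As a rough illustration of how a profile might drive Memento query routing, the sketch below treats each profile as a small map from URI keys to counts and sends a lookup only to archives whose profile contains a matching key. The profile layout, the reversed-host key scheme, and the archive names are assumptions made for this example, not the format proposed in these slides.

```python
from urllib.parse import urlsplit

def host_key(uri_r):
    """Return a reversed-host key such as 'uk,co,bbc' for a URI-R (assumed key scheme)."""
    host = urlsplit(uri_r).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return ",".join(reversed(host.split(".")))

def key_prefixes(key):
    """Yield the key and each shorter prefix: 'uk,co,bbc', then 'uk,co', then 'uk'."""
    parts = key.split(",")
    for n in range(len(parts), 0, -1):
        yield ",".join(parts[:n])

def route(uri_r, profiles):
    """Return names of archives whose profile lists the URI-R's host key or a prefix of it."""
    prefixes = list(key_prefixes(host_key(uri_r)))
    return sorted(name for name, profile in profiles.items()
                  if any(p in profile for p in prefixes))

# Hypothetical profiles: key -> number of URI-Rs held under that key.
profiles = {
    "archive-a": {"uk": 1200000},
    "archive-b": {"com,example": 42},
    "archive-c": {"is": 90000},
}

print(route("http://www.bbc.co.uk/news", profiles))  # ['archive-a']
print(route("http://example.com/about", profiles))   # ['archive-b']
```

Matching on prefixes keeps the profile small: an archive that covers a whole TLD needs only one entry, while a special focus archive can list individual domains.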
Available Profiling Resources
● Client request
● Archive response
● Archive index (CDX files)
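
Of these resources, the archive index is the most direct one to summarize. The sketch below shows one possible way to derive a small profile from CDX lines. It assumes space-separated records whose first field is a SURT-style URL key (for example `com,example)/about`), and the `depth` parameter and sample records are made up for illustration.

```python
from collections import Counter

def profile_from_cdx(lines, depth=2):
    """Count index records per host-key prefix of at most `depth` segments."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("CDX"):  # skip blank lines and the header line
            continue
        urlkey = line.split(" ", 1)[0]          # assumed: first field is the URL key
        host = urlkey.split(")", 1)[0]          # 'com,example)/about' -> 'com,example'
        counts[",".join(host.split(",")[:depth])] += 1
    return counts

# Made-up sample records, not taken from any real archive.
sample = [
    " CDX N b a",
    "com,example)/ 20130101000000 http://example.com/",
    "com,example)/about 20130102000000 http://example.com/about",
    "uk,co,bbc)/news 20130103000000 http://www.bbc.co.uk/news",
]

profile = profile_from_cdx(sample)
print(dict(profile))  # {'com,example': 2, 'uk,co': 1}
```

The counts here are per index record (capture), not per distinct URI-R. Raising `depth` gives a more precise but larger profile.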