TPDL 2015 - Profiling Web Archives

Post on 16-Feb-2017


Profiling Web Archives

Sawood Alam and Michael L. Nelson
Computer Science Department, Old Dominion University
Norfolk, Virginia - 23529

Herbert Van de Sompel, Lyudmila L. Balakireva, and Harihar Shankar
Los Alamos National Laboratory, Los Alamos, NM

David S. H. Rosenthal
Stanford University Libraries, Stanford, CA

Supported in part by the International Internet Preservation Consortium (IIPC)

Memento Aggregator


Long Tail of Archives

● 400B+ web pages at IA do not cover everything

● Top three archives after IA produce a full TimeMap 52% of the time (AlSum et al., TPDL 2013)

● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives

Archive Profile

● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
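The query-routing use case can be illustrated with a minimal sketch. It assumes a profile is just a set of URI keys an archive is known to hold, and that an aggregator forwards a lookup only to archives whose profile contains the key of the requested URI-R. The archive names, the sample keys, and the `route` function are hypothetical and are not taken from the authors' code.

```python
# Hypothetical profiles: archive name -> set of URI keys it is known to hold.
profiles = {
    "archive_a": {"uk,co,bbc)/", "com,cnn)/"},
    "archive_b": {"uk,co,bbc)/"},
    "archive_c": {"edu,odu)/"},
}


def route(uri_key, profiles):
    """Return the names of archives whose profile contains uri_key, sorted."""
    return sorted(name for name, keys in profiles.items() if uri_key in keys)


# Only archives that list the key are worth querying.
print(route("uk,co,bbc)/", profiles))
# A key that no profile contains yields no candidate archives.
print(route("org,example)/", profiles))
```

Under this assumption, a coarser key policy makes profiles smaller but sends more requests to archives that turn out not to hold the URI-R.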

Available Profiling Resources

● Client request
● Archive response
● Archive index (CDX files)

A Client Request

An Archive Response

A CDX Snippet

Profiling Strategies

● Complete URI-R Profiling (1 URI-R = 1 Profile Key)
  ○ bbc.co.uk/images/logo.png?w=90
  ○ cnn.com/2014/03/15/?id=128734
● TLD-only Profiling (1 TLD = 1 Profile Key)
  ○ com)/
  ○ uk)/
● Middle Ground
  ○ uk,co)/
  ○ uk,co,bbc)/images
  ○ uk,co,bbc)/0/2/1
  ○ com,cnn)/ 201309 ar
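The key shapes listed above can be reproduced with a small sketch. It assumes a key is built from the reversed host labels joined by commas, followed by `)/` and a limited number of path segments. The function name `uri_key` and its two parameters are illustrative assumptions, not the authors' implementation, and the sketch ignores the query string and the digit-count and datetime/language variants shown in the last two examples.

```python
from urllib.parse import urlsplit


def uri_key(uri, host_segments=None, path_segments=None):
    """Build a SURT-like key from a URI.

    host_segments / path_segments cap how many host labels and path
    segments are kept (None keeps all of them). Hypothetical helper.
    """
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to locate the host
    parts = urlsplit(uri)
    # Reverse the host labels: bbc.co.uk -> ["uk", "co", "bbc"]
    labels = (parts.hostname or "").split(".")[::-1]
    # Drop empty segments produced by leading or trailing slashes.
    path = [segment for segment in parts.path.split("/") if segment]
    return ",".join(labels[:host_segments]) + ")/" + "/".join(path[:path_segments])


print(uri_key("bbc.co.uk/images/logo.png?w=90", 1, 0))  # TLD-only key
print(uri_key("bbc.co.uk/images/logo.png?w=90", 2, 0))  # two host labels
print(uri_key("bbc.co.uk/images/logo.png?w=90", 3, 1))  # host plus one path segment
```

Keeping fewer labels and segments maps more URI-Rs onto the same key, which is the trade-off between profile size and detail that the strategies above describe.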

Frequency Measurements

CDXJ Serialization

URI-Key Generation

Profile Merging

Base profile

New profile

Merged profile
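One way to picture the base, new, and merged profiles is the sketch below. It assumes a profile is a mapping from URI key to a frequency count and that merging means summing the counts per key; the slides do not spell out the merge rule, so this is an assumption. The name `merge_profiles` and the sample numbers are hypothetical.

```python
from collections import Counter


def merge_profiles(base, new):
    """Sum per-key frequencies of two profiles into a new Counter."""
    merged = Counter(base)
    merged.update(new)  # Counter.update adds counts instead of replacing them
    return merged


# Hypothetical sample data, not taken from the evaluated archives.
base = {"uk,co,bbc)/": 10, "com,cnn)/": 4}
new = {"com,cnn)/": 3, "uk,co)/": 1}

merged = merge_profiles(base, new)
for key in sorted(merged):
    print(key, merged[key])
```

Because the inputs are left untouched, a small "new" profile built from recent captures could be applied as a patch to a published base profile without regenerating it.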

Dataset

● Three archives
● Four sample query sets
● 23 profiles for each archive and sample set

Archives

Archive      URI-Rs   URI-Ms   Size
Archive-It   1.9B     5.3B     1.8TB
UKWA         0.7B     1.7B     0.5TB
Stanford     12M      25M      8.3GB

Sample Query Sets

Sample         In Archive-It   In UKWA   In Stanford
DMOZ           4.097%          1.912%    0.034%
MementoProxy   4.182%          0.179%    0.046%
IAWayback      3.716%          0.231%    0.039%
UKWayback      0.108%          0.034%    0.002%

Sample Size: 1M URIs Each

Evaluation

● Relate CDX Size, URI-M, URI-R, and URI-Key
● Analyze profile growth
● Estimate Relative Cost
● Evaluate Routing Precision vs. Relative Cost

CDX Size vs URI-M (UKWA 10 Years)

Alpha: 175 bytes per CDX line

URI-M vs URI-R (UKWA 10 Years)

Gamma: 2.46
K: 2.686
Beta: 0.911

Space Cost (UKWA 7 Years)

Phi: 8.5e-07 -- 0.70583

Time Cost (UKWA 7 Years)

Tau: 5.7e-05 -- 6.2e-05
CDX: 45GB
URI-Ms: 181M
URI-Rs: 96M
Time: 3 hours

Resource Requirement

Archive-It

UKWA

Stanford

Cost vs Precision

Group                               Cost                 Precision
G1 (H1P0/TLD)                       Bound by # of TLDs   < 0.05
G2 (H3P0, DDom, DSub, DPth, DQry)   < 0.01               ≈ 2 * G1
G3 (DIni)                           ≈ 2 * G2             ≈ (3--4) * G1
G4 (HxP1)                           ≈ 5 * G3             ≈ (5--7) * G1
G5 (Higher HmPn)                    0.4 -- 0.7           Not Explored
G6 (URIR)                           1.0                  1.0

Future Work

● Generating sample URI sets
● Profiling via sampling
● Language profiles
● Evaluation of combination profiles such as URI-Key along with Datetime
● Profiles for usage other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)

Conclusions

● Generated profiles with different policies for two archives

● Examined cost-precision tradeoffs of various policies

● Related CDX Size, URI-M, URI-R, and URI-Key

● Gained up to 22% routing precision with <5% relative cost without any false negatives

● Code @ GitHub:/oduwsdl/archive_profiler
