Using the Web Infrastructure for Real Time Recovery of Missing Web Pages
Dissertation Defense
Martin Klein
[email protected]
Old Dominion University, Norfolk, VA
07/18/2011
Committee: Michael L. Nelson (Advisor), Yaohang Li, Michele C. Weigle, Mohammad Zubair, Robert Sanderson, Herbert Van de Sompel
Rooted: overlap between the LS from the year a URI was first observed in the IA and the LS of each subsequent year in which that URI was observed
Sliding: overlap between the LSs of each pair of consecutive years, starting with the first year and ending with the last
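The two overlap measures can be sketched as follows. This is a minimal illustration, assuming each yearly LS is a list of terms and that overlap is the share of common terms normalized by the longer signature; the slides do not state the exact overlap formula, and all function names and sample terms are hypothetical.

```python
def overlap(ls_a, ls_b):
    """Share of common terms between two lexical signatures (assumed measure)."""
    a, b = set(ls_a), set(ls_b)
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def rooted_overlap(ls_by_year):
    """Overlap of the first observed year's LS with the LS of every later year."""
    years = sorted(ls_by_year)
    root = ls_by_year[years[0]]
    return {y: overlap(root, ls_by_year[y]) for y in years[1:]}


def sliding_overlap(ls_by_year):
    """Overlap of each pair of LSs from consecutive observed years."""
    years = sorted(ls_by_year)
    return {(a, b): overlap(ls_by_year[a], ls_by_year[b])
            for a, b in zip(years, years[1:])}


# Hypothetical signatures for one URI, one per observed year.
signatures = {
    2000: ["datacity", "manassas", "computers", "removable", "drives"],
    2001: ["datacity", "manassas", "computers", "custom", "drives"],
    2002: ["iomega", "jaz", "computers", "custom", "drives"],
}
print(rooted_overlap(signatures))
print(sliding_overlap(signatures))
```

In this toy data the rooted overlap drops as the page drifts away from its first signature, while the sliding overlap only compares neighboring years.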
Evolution of LSs over Time
Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that: once terms are gone, they do not return
Rooted
Evolution of LSs over Time
Results:
• Overlap increases over time
• Seems to reach a steady state around 2003
Sliding
Performance of LSs
Idea:
• Query LSs against the Google search API
• Identify the URI in the result set
• For each URI it is possible that:
1. URI is returned as the top ranked result
2. URI is ranked somewhere between 2 and 10
3. URI is ranked somewhere between 11 and 100
4. URI is ranked somewhere beyond rank 100 (considered as not returned)
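The four retrieval cases can be expressed as a small classifier. This is a sketch only: the function name, the case labels and the example URIs are hypothetical, and the result list stands in for whatever a search API returns.

```python
def retrieval_case(uri, results):
    """Map a URI's position in a ranked result list to one of the four cases."""
    try:
        rank = results.index(uri) + 1  # ranks are 1-based
    except ValueError:
        return "not returned"
    if rank == 1:
        return "top"
    if rank <= 10:
        return "top 10"
    if rank <= 100:
        return "top 100"
    return "not returned"  # beyond rank 100 counts as not returned


# Hypothetical result list of 200 URIs.
results = [f"http://example.org/{i}" for i in range(1, 201)]
for page in (1, 7, 42, 150):
    print(page, retrieval_case(f"http://example.org/{page}", results))
```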
Performance of LSs wrt Length
Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
• Top mean rank (MR) value with 5 terms
• Most top ranked with 7 terms
• Binary pattern: either in top 10 or undiscovered
• 8 terms and beyond do not show improvement
Performance of LSs wrt Length
nDCG for LSs consisting of 2-15 terms (mean over all years)
Performance of LSs over Time
nDCG for LSs consisting of 2, 5, 7 and 10 terms
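For readers unfamiliar with nDCG, the following sketch shows how a score could be computed when the only relevant result is the sought URI. It assumes binary relevance and the common log2(rank + 1) discount; the slides do not specify which discount variant was used, so treat the formula and all names here as assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    best = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / best if best > 0 else 0.0


def ndcg_for_uri(uri, results):
    """Binary relevance: 1 for the sought URI, 0 for every other result."""
    return ndcg([1 if r == uri else 0 for r in results])


print(ndcg_for_uri("a", ["a", "b", "c"]))  # sought URI ranked first
print(ndcg_for_uri("c", ["a", "b", "c"]))  # sought URI ranked third
```

Under these assumptions the score is 1 when the URI is ranked first and falls off logarithmically with rank; a URI that is not returned scores 0.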
Conclusions
• LSs decay over time
  • Rooted: quickly after generation
  • Sliding: seem to stabilize
• LSs older than 5 years perform poorly
• 5-, 6- and 7-term LSs seem to perform best
  • 7 – most top ranked
  • 5 – lowest mean rank
• 2..4 as well as 8+ term LSs are insufficient
Contribution: Determined age and length limits for LSs.
Agenda
Synchronicity
Link Neighborhood LSs
Book of the Dead
Web Page Tags
Web Page Titles
LSs for Web Pages
DF Estimation Techniques
TC-DF Correlation
Motivation
Background
Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)
Title Evolution - Example II
2000-10-12: DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
2001-08-21: DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
2002-10-16: computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
2006-03-14: Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
• Experimental evaluation of tag based query length cf. 5- or 7-term LSs
• Test combination of methods to improve retrieval performance
• Investigate “descriptive” power of tags
The Experiment
• Tags queried against the Yahoo! BOSS API
• Same four retrieval cases introduced earlier
• nDCG w/ binary relevance scoring
• Mean Average Precision
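Mean Average Precision, one of the measures listed above, can be sketched as follows. This uses the textbook definition; how it was applied to the tag queries in the experiment is not stated on the slides, and the function names and sample data are hypothetical.

```python
def average_precision(results, relevant):
    """Mean of the precision values at each rank where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(results, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """queries: list of (ranked results, set of relevant items) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(res, rel) for res, rel in queries) / len(queries)


# Hypothetical queries: the sought item is ranked 1st, ranked 2nd, and missing.
queries = [
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"y"}),
    (["p", "q"], {"r"}),
]
print(mean_average_precision(queries))
```

With a single relevant URI per query, average precision reduces to the reciprocal of the rank at which that URI is returned.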
The Experiment
Combining methods
The Experiment
• Fact:
  • ~50% of tags do not occur in page [Bischoff2008]
• “Secret”:
  • ~50% of tags do not occur in current version of page
  • ergo: How about previous versions?
Ghost Tags
• 3,306 URIs w/ older copies
• 66.3% of our tags do not occur in page
• 4.9% of tags occur in previous version of page: Ghost Tags
  • represent a previous version better than the current one
• What kind of tags are these?
  • Important to the document, to the Delicious user?
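The ghost tag notion can be illustrated with a short sketch: a tag is a ghost tag if it is absent from the current version of a page but present in an earlier copy. The code assumes single-word tags and exact, case-insensitive token matching, which may differ from the matching used in the study; all names and sample texts are hypothetical.

```python
import re


def terms(text):
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify_tags(tags, current_text, previous_texts):
    """Split tags into those in the current page, ghost tags, and absent tags."""
    current = terms(current_text)
    previous = set()
    for text in previous_texts:
        previous |= terms(text)
    in_page, ghost, absent = [], [], []
    for tag in tags:
        token = tag.lower()
        if token in current:
            in_page.append(tag)
        elif token in previous:
            ghost.append(tag)  # only found in an older copy
        else:
            absent.append(tag)
    return in_page, ghost, absent


# Hypothetical page versions and tags.
current = "Computer company in Stafford Virginia sells secure computers"
older = ["DataCity of Manassas Park sells custom built computers"]
print(classify_tags(["computers", "manassas", "hardware"], current, older))
```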
Ghost Tags
Document importance: TF rank
User importance: Delicious rank
Normalized rank: 0 - top, 1 - bottom
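One way to obtain a normalized rank with 0 at the top and 1 at the bottom is linear scaling of the 1-based rank. The slide only gives the two endpoints, so the linear form below is an assumption, as is the function name.

```python
def normalized_rank(rank, total):
    """Map a 1-based rank to [0, 1]: 0 is the top, 1 is the bottom (linear, assumed)."""
    if total < 1 or not 1 <= rank <= total:
        raise ValueError("rank must lie in 1..total")
    if total == 1:
        return 0.0
    return (rank - 1) / (total - 1)


# A term ranked 3rd of 5 by TF sits halfway down the list.
print(normalized_rank(3, 5))
```

Normalizing both the TF rank and the Delicious rank this way would put document importance and user importance on the same scale, regardless of how many terms or tags a page has.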
Conclusions
• Tags can be used for search (if available)
• Combining tags with titles and LSs gains URIs
• Ghost Tags exist!
  • 1/3 of them are important to the page and user
Contributions: Added tags to web page discovery framework and introduced notion of Ghost Tags.
Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures (JCDL 2011)
The Problem
We have seen that we have a good chance to rediscover missing pages with
• Lexical signatures
• Titles
BUT
What if no archived/cached copy can be found?
Plan A: Tags
Plan B
Link neighborhood Lexical Signatures (LNLSs)
[Figure: terms such as Computer, Dominion, Norfolk, Monarch are extracted from linking pages to indicate what the missing page is about]
The Idea
• Determine for well performing LNLS:
  • Length
  • Number of backlinks
  • Backlink levels
  • Radius of terms on backlink page
The Radius on a Backlink Page
• Paragraph
• Entire page
• Anchor text
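The three radii can be illustrated with Python's standard `html.parser` module. This is a simplified sketch, assuming the link to the missing page is matched by an exact `href` comparison and that "paragraph" means the enclosing `<p>` element; the class name, function name and sample HTML are hypothetical, and real backlink pages would need more robust parsing.

```python
from html.parser import HTMLParser


class RadiusExtractor(HTMLParser):
    """Collect words at three radii around links to a target URI."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.in_link = False
        self.in_para = False
        self.para_has_link = False
        self.para_buf = []
        self.anchor, self.paragraph, self.page = [], [], []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_para, self.para_has_link, self.para_buf = True, False, []
        elif tag == "a" and dict(attrs).get("href") == self.target:
            self.in_link = True
            self.para_has_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "p":
            # Keep the paragraph only if it contained a link to the target.
            if self.in_para and self.para_has_link:
                self.paragraph.extend(self.para_buf)
            self.in_para = False

    def handle_data(self, data):
        words = data.split()
        self.page.extend(words)
        if self.in_para:
            self.para_buf.extend(words)
        if self.in_link:
            self.anchor.extend(words)


def radii(html, target):
    """Return the words found at the anchor, paragraph and page radius."""
    parser = RadiusExtractor(target)
    parser.feed(html)
    parser.close()
    return {"anchor": parser.anchor, "paragraph": parser.paragraph, "page": parser.page}


# Hypothetical backlink page.
html = ('<html><body><h1>Campus news</h1>'
        '<p>Visit the <a href="http://example.org/missing">computer science department</a>'
        ' in Norfolk.</p><p>Unrelated footer text.</p></body></html>')
print(radii(html, "http://example.org/missing"))
```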
The Dataset
• 309 URIs
• 28,325 first level backlinks
• 306,700 second level backlinks
• Filter for language, file type, etc.
  • 12% discarded
• Lexical signature generation
  • IDF values from Yahoo!
  • 1..7 and 10 terms
• Query Yahoo! API
• Compute “goodness” (nDCG)
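The signature generation step can be sketched as a TF-IDF ranking that keeps the top k terms. In the experiment the IDF values came from Yahoo!; here a small hypothetical document-frequency table and corpus size stand in for that source, and stop-word removal and stemming are omitted.

```python
import math
from collections import Counter


def lexical_signature(terms, doc_freq, corpus_size, length):
    """Return the `length` highest-scoring terms by TF-IDF."""
    tf = Counter(terms)

    def score(term):
        df = doc_freq.get(term, 1)  # assumption: unseen terms are treated as rare
        return tf[term] * math.log(corpus_size / df)

    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:length]


# Hypothetical terms harvested from backlink pages and hypothetical DF values.
terms = "norfolk computer science computer dominion monarch the the the".split()
doc_freq = {"the": 1_000_000, "computer": 50_000, "science": 40_000,
            "norfolk": 2_000, "dominion": 1_500, "monarch": 1_000}

# Signature lengths used in the dataset description: 1..7 and 10 terms.
for k in list(range(1, 8)) + [10]:
    print(k, lexical_signature(terms, doc_freq, 1_000_000, k))
```

Because the ranking is computed once per length, a shorter signature is always a prefix of a longer one in this sketch.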
The Results
[Figure: results by level-radius-rank configuration for 1st and 2nd level backlinks; axis annotation: "better"]
The Results – Radius
[Figure: results by level-radius-rank configuration, all radii]
The Results – Backlink Rank
[Figure: results by level-radius-rank configuration for backlink ranks 10, 100 and 1000]
The Results – In Numbers
[Figure: result numbers for the 1-anchor-1000 and 1-anchor-10 configurations, annotated WINNER and GOOD]
Conclusions
• Optimal link neighborhood lexical signatures:
  • Contain 4 terms
  • Parsed from top 10 backlink pages
  • Include first backlink level only
  • Consider anchor text only
Contributions: Added LNLS to web page discovery framework.
Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)
Synchronicity
• Firefox add-on
• Triggers on 404 error
• Rediscover page via:
  • Memento
  • Title
  • Lexical signature
  • Tags
  • Link neighborhood lexical signature
  • URI modification
• http://bit.ly/no-more-404
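The idea of offering several rediscovery methods for one missing URI can be sketched as a chain of strategies tried in order. This is not the add-on's implementation: the fixed ordering, the function name and the stub strategies are all hypothetical, and a real tool might instead let the user pick a method.

```python
def rediscover(uri, strategies):
    """Try each strategy in turn; return the first one that yields candidates."""
    for name, strategy in strategies:
        candidates = strategy(uri)
        if candidates:
            return name, candidates
    return None, []


# Hypothetical stand-ins for the methods listed above; each maps a URI to candidates.
strategies = [
    ("memento", lambda uri: []),  # no archived copy found
    ("title", lambda uri: ["http://example.org/new-home"]),
    ("tags", lambda uri: ["http://example.org/other"]),
]
print(rediscover("http://example.org/missing", strategies))
```

Keeping each method behind the same callable interface means further methods, such as link neighborhood lexical signatures or URI modification, could be appended to the list without changing the loop.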
Contributions
1. Introduce reliable real-time approach to estimate IDF values
2. Workflow for generation of well performing lexical signatures
3. Performance evaluation of web page titles
4. Investigation of tags for web page discovery
5. Analysis of link neighborhood lexical signatures and their optimal parameters
6. Introduce Synchronicity implementing the entire framework
Next Stop… New Mexico
List of my Relevant Publications
1. M.Klein, M.L.Nelson, “A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web”, WIDM 2008, pp. 39-46
2. M.Klein, M.L.Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008, pp. 371-382
3. M.Klein, M.L.Nelson, “Correlation of Term Count and Document Frequency for Google N-Grams”, ECIR 2009, pp. 620-627
5. M.Klein, M.L.Nelson, “Investigating the Change of Web Pages Titles Over Time”, InDP 2009
6. M.Klein, J.Shipman, M.L.Nelson, “Is This a Good Title”, Hypertext 2010, pp. 3-12
7. M.Klein, M.L.Nelson, “Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure”, JCDL 2010, pp. 59-68
8. M.Klein, J.Ware, M.L.Nelson, “Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures”, JCDL 2011
9. […] “Synchronicity – Automatically Rediscover Missing Web Pages in Real Time”, JCDL 2011
10. M.Klein, M.L.Nelson, “Find, New, Copy, Web, Page – Tagging for the (Re-)Discovery of Web Pages”, TPDL 2011 to appear
References
[Bischoff2008] K.Bischoff, C.Firan, W.Nejdl, R.Paiu, “Can All Tags Be Used for Search?” In: Proceedings of CIKM '08, pp. 193-202, 2008
[Dellavalle2003] R.P.Dellavalle, E.J.Hester, L.F.Heilig, A.L.Drake, J.W.Kuntzman, M.Graber, L.M.Schilling, “Information Science: Going, Going, Gone: Lost Internet References”, Science 302(5646), pp. 787-788, 2003
[Jones1973] K.Spärck Jones, “Index Term Weighting”, Information Storage and Retrieval, pp. 619-633, 1973
[Kahle1997] B.Kahle, “Preserving the Internet”, Scientific American 276, pp. 82-83, 1997
[Koehler2002] W.C.Koehler, “Web Page Change and Persistence - A Four-Year Longitudinal Study”, JASIST 53(2), pp. 162-171, 2002
[Lawrence2001] S.Lawrence, D.M.Pennock, G.W.Flake, R.Krovetz, F.M.Coetzee, E.Glover, F.A.Nielsen, A.Kruger, C.L.Giles, “Persistence of Web References in Scientific Research”, Computer 34(2), pp. 26-31, 2001
[McCown2005] F.McCown, S.Chan, M.L.Nelson, J.Bollen, “The Availability and Persistence of Web References in D-Lib Magazine”, Proceedings of IWAW '05, 2005
[Nelson2002] M.L.Nelson, B.D.Allen, “Object Persistence and Availability in Digital Libraries”, D-Lib Magazine 8(1), 2002
[Ntoulas2006] A.Ntoulas, M.Najork, M.Manasse, D.Fetterly, “Detecting Spam Web Pages Through Content Analysis”, Proceedings of WWW '06, pp. 83-92, 2006
[Park2004] S.T.Park, D.M.Pennock, C.L.Giles, R.Krovetz, “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web”, TOIS 22(4), pp. 540-572, 2004
[Phelps2000] T.A.Phelps, R.Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, technical report, UC Berkeley, 2000
[Sanderson2011] R.Sanderson, M.Phillips, H.Van de Sompel, “Analyzing the Persistence of Referenced Web Resources with Memento”, Proceedings of OR '11, 2011