Scripts in a Frame: A Two-Tiered Approach for Archiving Deferred Representations Justin F. Brunelle Dissertation Defense February 5, 2016 Committee Members: Michael L. Nelson Michele C. Weigle Elizabeth J. Vincelette Irwin B. Levinstein
Scripts in a Frame: A Two-Tiered Approach for Archiving
Deferred Representations
Justin F. Brunelle
Dissertation Defense
February 5, 2016
Committee Members:
Michael L. Nelson
Michele C. Weigle
Elizabeth J. Vincelette
Irwin B. Levinstein
A simpler time…
Mass hysteria. Human sacrifices. Dogs and cats living together.
<iframe><script>…</script></iframe>
http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
Missing resources (bad)
2008
2008 2012
Missing resources (bad) and Temporal violations (worse)
Old ads are interesting
New ones are annoying…for now.
“Why are your parents wrestling?”
Today’s ads are missing from the archives
http://adserver.adtechus.com/addyn/3.0/5399.1/2394397/0/-1/QUANTCAST;;size=300x250;target=_blank;alias=p36-17b4f9us2qmzc8bn;kvp36=p36-17b4f9us2qmzc8bn;sub1=p-4UZr_j7rCm_Aj;kvl=172802;kvc=794676;kvs=300x250;kvi=c052a803d0b5476f0bd2f2043ef237e27cd48019;kva=p-4UZr_j7rCm_Aj;rdclick=http://exch.quantserve.com/r?a=p-4UZr_j7rCm_Aj;labels=_qc.clk,_click.adserver.rtb,_click.rand.85854;rtbip=192.184.64.144;rtbdata2=EAQaFUhSQmxvY2tfMjAxNlRheFNlYXNvbiCZiRcogsYKMLTAMDoSaHR0cDovL3d3dy5jbm4uY29tWihUUEhwYlUzM3ZqeFU5LTA1SGZEMk1SXzE0anBVcGU0d0dxTG10STFUdUs2IECAAb_JicoFoAEBqAGhy7YCugEoVFBIcGJVMzN2anhVOS0wNUhmRDJNUl8xNGpwVXBlNHdHcUxtdEkxVMAB3ed3yAGUp7GUqSraAShjMDUyYTgwM2QwYjU0NzZmMGJkMmYyMDQzZWYyMzdlMjdjZDQ4MDE55QHvEWs-6AFkmAK2wQqoAgWoAgawAgi6AgTAuECQwAICyAIA0ALe9baMj4Cos-oB
JavaScript is hard to replay
What happens when things are completely lost?
http://ws-dl.blogspot.com/2013/11/2013-11-28-replaying-sopa-protest.html
Remember SOPA? And the protest?
https://en.wikipedia.org/wiki/Stop_Online_Piracy_Act
https://en.wikipedia.org/wiki/Protests_against_SOPA_and_PIPA
http://en.wikipedia.org/wiki/Main_Page January 18th, 2012
http://web.archive.org/web/20120118110520/http://en.wikipedia.org/wiki/Main_Page January 18th, 2012
Problem!
The archives contain the Web as seen by crawlers
Why archive?
The Internet Archive has everything!
Why didn’t you back it up?
Participating institutions can hand over their databases.
Crimean Conflict
Russian troops captured the Crimean Center for Investigative Journalism
Gunman: “We will try to agree on the correct truthful coverage of events.”
http://gijn.org/2014/03/02/masked-gunmen-seize-crimean-investigative-journalism-center/
Archive-It to the rescue!
How?
Masked gunmen have your servers
Where are your backups?
Transactional archive? Too late!
Preservation over HTTP
Any future discussion of the 21st century will involve the web and the web archives
But JavaScript is hard to archive, resulting in archives of content as seen by crawlers rather than as seen by users
Goal: Mitigate the impact of JavaScript on the archives by making crawlers behave like users
Motivating Examples
Background Information
Research Questions
Measuring the Impact of JavaScript
Measuring Memento Quality
Crawling Deferred Representations
Future Work
Conclusions
Some Institutional Archives
Some Page-at-a-time Archivers
Some Archival Tools
1: http://warcreate.com/
2: http://matkelly.com/wail/
Memento Framework
http://mementoweb.org/guide/rfc/
Machine readable bidirectional link between the past and present web
URI-R: Original Resource Identifier (page on the live web)
URI-M: Memento Identifier (archived version of a page)
URI-T: TimeMap Identifier (list of archived pages)
Web Architecture
Dereference a URI, get a representation
JavaScript makes requests for new resources after the initial page load
http://maps.google.com
Identifies
Represents
Deferred Representation
http://maps.google.com
Identifies
Represents
JavaScript != Deferred
[Figure: request timelines. Deferred: several HTTP GETs, including requests issued after onload. Nondeferred: a single HTTP GET.]
Web Browsing Process
User-controlled interaction
Environment variables → content negotiation
Client-controlled representation changes
HTTP GET Request for Resource R
HTTP 200 OK Response: R Content
Browser renders and displays R
JavaScript requests embedded resources
Server returns embedded resources
R updates its representation
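The steps above suggest a rough test for a deferred representation: did any embedded resource get requested after the initial page load? The sketch below is illustrative only. It assumes a hypothetical request log of (URI, seconds since navigation start) pairs and a known onload time; the function name and sample data are not taken from the dissertation, whose classifier is described in Section 7.4.

```python
from typing import List, Tuple

def is_deferred(requests: List[Tuple[str, float]], onload_time: float) -> bool:
    """Return True if any resource was requested after the onload event.

    This is a simplified proxy, assumed for illustration: a page whose
    requests all happen before onload is treated as nondeferred.
    """
    return any(timestamp > onload_time for _, timestamp in requests)

# Hypothetical request logs: (URI, seconds since navigation started).
static_page = [
    ("http://example.com/", 0.0),
    ("http://example.com/logo.png", 0.2),
]
map_page = [
    ("http://example.com/map", 0.0),
    ("http://example.com/tile/1.png", 1.7),  # requested by script after onload
]

print(is_deferred(static_page, onload_time=1.0))
print(is_deferred(map_page, onload_time=1.0))
```

A page can contain JavaScript and still be nondeferred under this test, which matches the "JavaScript != Deferred" point.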
Web Browsing Process
There is no longer “the” representation.
At any given time, users get “a” representation.
GeoIP: Washington, D.C.; URI-R: http://www.wunderground.com/
GeoIP: Suffolk, VA; URI-R: http://www.wunderground.com/
The Internet Archive got everything, right?
Missing tiles, not interactive
HTTP GET Request for Resource R
HTTP 200 OK Response: R Content
Browser renders and displays R
JavaScript requests embedded resources
Server returns embedded resources
R updates its representation
Web Browsing Process
Archival Tools stop here
Still not solved!
Research Questions
RQ1. To what extent does JavaScript impact archival tools?
RQ2. How do we measure memento quality?
RQ3. How can we crawl, archive, and play back deferred representations?
2015 2013
Zombies!
2008
2012
Measuring JavaScript
1,000 URIs from Twitter
1,000 URIs from Archive-It
Dataset available at http://www.cs.odu.edu/~jbrunelle/jsDataSet.txt
Capture with tools
Study the archivability
“The impact of JavaScript on archivability”, 2015, International Journal of Digital Libraries
[Screenshots: example captures rated Good (3), Meh (2), and Bad (5)]
Leakage by archival tool
Twitter has more leakage than Archive-It
Leakage by archival tool
Wayback reduces leakage the most
Leakage -> Zombies
12% increase in embedded mementos loaded via JavaScript
Leakage increasing over time
Increased JavaScript -> increases in missing embedded resources
• 73.1% of all missing embedded mementos are loaded via JavaScript
• 33% increase in missing embedded mementos from JavaScript between 2005-2012
2015 2014
“Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014, International Journal of Digital Libraries, 2015
VS.
“Live” XKCD
• Missing 17% of embedded resources
• Looks complete
“Live” XKCD
• Take three resources:
  – Logo
  – Main Comic
  – Navigation Strip
• Relative importance?
• All present in “Live” XKCD
Damaging XKCD
• Created a local memento
• Removed the logo and navigation strip
• Now missing 29% of embedded resources
• Human assessment: looks OK
Damaging XKCD
• From our local memento
• Removed the Main Comic
• Now missing 24% of embedded resources
• Human assessment: Not a usable memento
• Percent of missing embedded resources is not a suitable metric for memento quality
Image Importance
• Size (as percentage of all pixels)
• Position (in viewport?)
• Centrality (in the vertical or horizontal center?)
Missing CSS
• More important than previously thought
• Calculated the amount of content in each vertical third
• If >=80% in left column and missing CSS, CSS is important
• Only performed if stylesheets are missing
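The left-column heuristic above can be sketched as follows. This is a minimal illustration, assuming content is available as hypothetical (x, width, height) boxes and that each box counts toward the vertical third containing its horizontal center; the function names and the way content is measured are assumptions, not the dissertation's implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float]  # (x offset, width, height) in pixels

def content_by_third(boxes: List[Box], page_width: float) -> List[float]:
    """Share of content area falling in the left, middle, and right thirds."""
    thirds = [0.0, 0.0, 0.0]
    for x, width, height in boxes:
        center = x + width / 2
        index = min(int(3 * center / page_width), 2)
        thirds[index] += width * height
    total = sum(thirds)
    return [area / total for area in thirds] if total else thirds

def css_is_important(boxes: List[Box], page_width: float,
                     css_missing: bool, threshold: float = 0.8) -> bool:
    """Apply the >=80% left-column rule, only when a stylesheet is missing."""
    if not css_missing:
        return False
    return content_by_third(boxes, page_width)[0] >= threshold

# Hypothetical layouts on a 1200px-wide page.
collapsed = [(0, 300, 1000), (0, 300, 500)]                 # all content on the left
styled = [(0, 300, 100), (450, 300, 100), (900, 300, 100)]  # spread across the page

print(css_is_important(collapsed, 1200, css_missing=True))
print(css_is_important(styled, 1200, css_missing=True))
```

The intuition is that a page that has lost its stylesheet tends to stack its content down the left edge.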
Methodology
• Defined Dm and Mm metrics
\[ M_m = \frac{\text{Missing Embedded Resources}}{\text{All Embedded Resources}} \]
\[ D_m = \frac{\sum_{i=1}^{n_{\text{missing resources}}} w_i}{\sum_{j=1}^{n_{\text{all resources}}} w_j} \]
• Used Amazon Mechanical Turk workers to assess web users' perception of quality
• Assessed Dm versus Mm in manually damaged pages
• Assessed Dm versus Mm in the archives
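To make the difference between the two metrics concrete, here is a small sketch that computes both for the three XKCD resources discussed earlier. The weights are hypothetical values chosen for illustration; the dissertation derives weights from size, position, and centrality, which are not reproduced here.

```python
from typing import Dict, List

def percent_missing(resources: List[Dict]) -> float:
    """Mm: fraction of embedded resources that are missing (unweighted)."""
    missing = sum(1 for r in resources if r["missing"])
    return missing / len(resources)

def damage(resources: List[Dict]) -> float:
    """Dm: weight of missing resources over the weight of all resources."""
    total = sum(r["weight"] for r in resources)
    lost = sum(r["weight"] for r in resources if r["missing"])
    return lost / total

def page(missing_names):
    # Hypothetical importance weights: the comic dominates the page.
    weights = {"logo": 1, "main comic": 8, "navigation strip": 1}
    return [{"name": n, "weight": w, "missing": n in missing_names}
            for n, w in weights.items()]

no_chrome = page({"logo", "navigation strip"})
no_comic = page({"main comic"})

print(percent_missing(no_chrome), damage(no_chrome))
print(percent_missing(no_comic), damage(no_comic))
```

With these assumed weights, losing the comic alone yields a lower Mm but a much higher Dm than losing the logo and navigation strip, which is the mismatch the human assessment exposed.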
Turk Results
[Chart: Turker agreement with Dm for live vs. manually damaged pages, agreement with Dm for mementos from the Internet Archive, and agreement with Mm for mementos from the Internet Archive, each compared against a 50/50 chance line]
Damage in the Archives
[Charts: damage in Internet Archive and WebCite mementos]
Mementos with deferred representations have 13.5% higher damage rating
2015 2016
Current Workflow
• Dereference URI-Rs
• Archive representation
• Extract embedded URI-Rs
• Repeat
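The four-step loop can be sketched as a breadth-first crawl. To stay self-contained, this example replaces HTTP with a hypothetical in-memory web (the `WEB` dictionary), and URI extraction is a simple regular expression over `src` and `href` attributes; neither reflects how Heritrix is actually implemented.

```python
import re
from collections import deque
from typing import Dict, List

# Hypothetical in-memory "web": URI-R -> HTML body.
WEB: Dict[str, str] = {
    "http://example.com/": '<img src="http://example.com/a.png">'
                           '<a href="http://example.com/b">b</a>',
    "http://example.com/b": '<img src="http://example.com/a.png">',
    "http://example.com/a.png": "",
}

def dereference(uri_r: str) -> str:
    """Stand-in for an HTTP GET; returns the representation."""
    return WEB.get(uri_r, "")

def extract_uris(html: str) -> List[str]:
    """Pull embedded and linked URI-Rs out of the markup."""
    return re.findall(r'(?:src|href)="([^"]+)"', html)

def crawl(seed: str) -> Dict[str, str]:
    archive: Dict[str, str] = {}
    frontier = deque([seed])
    while frontier:                                   # Repeat
        uri_r = frontier.popleft()
        if uri_r in archive:
            continue
        archive[uri_r] = dereference(uri_r)           # Dereference + archive
        frontier.extend(extract_uris(archive[uri_r])) # Extract embedded URI-Rs
    return archive

print(sorted(crawl("http://example.com/")))
```

A URI that JavaScript builds at run time never appears in the markup, so a loop like this never adds it to the frontier.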
Two-Tiered Crawling
“Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015
“Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
<script> tags alone are not indicative of a deferred representation. JavaScript can be played back in the archives!
Current workflow not suitable for deferred representations
Use PhantomJS to run JavaScript, interact with the representation
Two-tiered crawling approach to optimize performance
More URI-Rs in the crawl frontier
Runs more slowly but more deeply
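The routing idea behind two-tiered crawling can be sketched as below. The classifier here is a hypothetical stand-in (a lookup in a hand-made set); the real classifier is covered in Section 7.4 of the dissertation and is not reproduced. The tier names are also illustrative.

```python
from typing import Callable, Dict, Iterable, List

def two_tiered_plan(uri_rs: Iterable[str],
                    is_deferred: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Send nondeferred URI-Rs to the fast tier and deferred ones to the
    slower tier that executes JavaScript."""
    plan: Dict[str, List[str]] = {"tier1_fast": [], "tier2_headless": []}
    for uri_r in uri_rs:
        tier = "tier2_headless" if is_deferred(uri_r) else "tier1_fast"
        plan[tier].append(uri_r)
    return plan

# Hypothetical classifier: a fixed set of URI-Rs assumed to be deferred.
KNOWN_DEFERRED = {"http://maps.google.com"}

plan = two_tiered_plan(
    ["http://maps.google.com", "http://www.cs.odu.edu/"],
    lambda uri_r: uri_r in KNOWN_DEFERRED,
)
print(plan)
```

Only the URI-Rs that need script execution pay the cost of the slower tier.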
Comparing Performance
• Crawled 10,000 URI-Rs
  – Dataset available at http://www.cs.odu.edu/~jbrunelle/wsdl/10kuris.txt
• Compare crawl speed & discovered frontier size
  – With and without classifier
• Code available at https://github.com/jbrunelle/classifyDeferred/
Performance: Frontier Size
PhantomJS creates a 1.5x larger crawl frontier than Heritrix
Performance: Crawl Speed
Heritrix: ~2 URIs/second
PhantomJS: ~4 seconds/URI
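A back-of-the-envelope estimate shows what these two rates mean for a 10,000 URI-R crawl. The sketch assumes each URI-R is crawled by exactly one tool and ignores classification overhead; the 25% deferred fraction is a hypothetical input, not a measured result.

```python
HERITRIX_SECONDS_PER_URI = 1 / 2   # ~2 URIs/second
PHANTOMJS_SECONDS_PER_URI = 4.0    # ~4 seconds/URI

def crawl_seconds(n_uris: int, deferred_fraction: float) -> float:
    """Estimated crawl time when only the deferred share goes to PhantomJS."""
    deferred = n_uris * deferred_fraction
    nondeferred = n_uris - deferred
    return (nondeferred * HERITRIX_SECONDS_PER_URI
            + deferred * PHANTOMJS_SECONDS_PER_URI)

slowdown = PHANTOMJS_SECONDS_PER_URI / HERITRIX_SECONDS_PER_URI
print(f"PhantomJS is roughly {slowdown:.0f}x slower per URI")

for fraction in (0.0, 0.25, 1.0):  # 0.25 is an assumed example value
    hours = crawl_seconds(10_000, fraction) / 3600
    print(f"{fraction:.0%} deferred: {hours:.1f} hours")
```

The gap between the all-Heritrix and all-PhantomJS cases is what motivates sending only deferred representations to the slower tier.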
Classifier
We are omitting a discussion about the classifier for deferred vs. nondeferred representations
Please see Section 7.4 in the dissertation for a detailed discussion
Descendants = States of deferred representations reached through client-side events
[Figure: descendants reached through click, pan, and zoom events]
Crawling descendants
• Interactions represented as N-ary tree G
• FSM: M = (S, s0, Σ, δ)
  ‒ S is the finite set of client states
  ‒ s0 ϵ S is the initial state reached by dereferencing the URI-R and executing the initial on-load events
  ‒ e ϵ Σ defines the client-side event e as a member of the set of all events Σ
  ‒ δ : S × Σ → S is the transition function in which a client-side event is executed and leads to a new state
δ(si, e) = sj, where si, sj ϵ S, e is a client-side event, and j = i + 1
“Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
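The state machine can be explored with a bounded breadth-first walk. In this sketch the transition function δ is a hypothetical lookup table and the state names are invented; a real crawler would fire each event in a headless browser and compare the resulting client states. The depth limit of 2 is a parameter here, chosen to mirror the finding that interaction trees are 2 levels deep.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical transition table delta: (state, event) -> next state.
DELTA: Dict[Tuple[str, str], str] = {
    ("s0", "click"): "s1a",
    ("s0", "pan"): "s1b",
    ("s1a", "click"): "s2a",
    ("s2a", "click"): "s3a",   # deeper than the crawl will go
}
EVENTS: List[str] = ["click", "pan", "zoom"]

def descendants(s0: str, delta: Dict[Tuple[str, str], str],
                events: List[str], max_depth: int = 2) -> Dict[str, int]:
    """Map each reachable state to its level in the interaction tree."""
    levels = {s0: 0}
    queue = deque([s0])
    while queue:
        state = queue.popleft()
        if levels[state] == max_depth:
            continue  # do not fire events beyond the depth limit
        for event in events:
            nxt = delta.get((state, event))
            if nxt is not None and nxt not in levels:
                levels[nxt] = levels[state] + 1
                queue.append(nxt)
    return levels

print(descendants("s0", DELTA, EVENTS))
```

States already seen are skipped, so events that lead back to a known state do not grow the tree.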
Interaction Trees are 2 Levels Deep
http://www.bloomberg.com/bw/articles/2014-06-16/open-plan-offices-for-people-who-hate-open-plan-offices
s0
s1
s2
Expanding the Crawl Frontier
Level s1 provides the greatest benefit to the crawl frontier
[Chart series: Nondeferred, Deferred]
Crawling Descendants
New embedded resources at level s1 are largely unarchived
Crawling Descendants
Level s1 has the highest cost-benefit Return on Investment
Storage Impact of Two-Tiered Crawling
IIPC-proposed JSON metadata of interactions, resulting descendants
–Potentially used to resolve URI-M collisions
–16.5KB WARC metadata
–143MB for total dataset
11.4 times larger for deferred vs nondeferred
Totals 5.12 times more storage per URI-R for total dataset
Future Work
• Modeling user interactions, tendencies, and simulation
  – Form filling
  – Click and navigation likelihood
• Evaluating success of crawling deferred representations
  – Random walks through the archives
  – Dm vs Mm of mementos of deferred representations
• Archival Halting Problem: How much is enough?
  – Mapping applications: How many pans and zooms get all the Norfolk, VA Google map tiles?
  – How many CNN.com pages get all the Google Ads?
• Playing back WARCs with IIPC metadata of deferred representations and descendants
RQ1. To what extent does JavaScript impact archival tools?
Contributions:
• Defined and identified zombie resources
• Adoption of JavaScript correlates with missing embedded resources in mementos
• Defined deferred representations
• Showed that deferred representations have reduced archivability
2012: ws-dl.blogspot.com
2013: TPDL2013
2015: iPRES2015
2015: IJDL
2015: IJDL
Section 4.3
Ch. 5
Ch. 2
Ch. 5
For more information, reference:
RQ2. How do we measure memento quality?
Contributions:
• Mm is not accurate (worse than coin-flip)
• Created Dm metric
• Dm is closer to user perception than Mm
• Mementos of deferred representations have higher Dm than nondeferred representations
2015: JCDL2015
2015: IJDL Special Issue
Ch. 6
Section 6.6
For more information, reference:
RQ3. How can we crawl, archive, and play back deferred representations?
Contributions:
• Defined a framework for archiving deferred representations
• Showed that the framework will crawl more slowly but more thoroughly
• Defined descendants, showed that they are 2-levels deep
• Showed the storage impact of crawling descendants and deferred representations
2015: iPRES2015
2016: arXiv:1601.05142
Ch. 7
Ch. 7
For more information, reference:
Summary
• Measured the impact of JavaScript on the archives
• Quantified damage caused by JavaScript
• Measured the cost in time and space to archive JavaScript
Provides policy makers information to make decisions regarding JavaScript handling in crawling and archiving
Quantified an intuitive understanding of crawling deferred representations at web scale
Backups
Publications
Year | RQ | Venue | Abbreviated Title | Notes
2012 | - | JCDL2012 Doctoral Consortium | Capturing Dynamic Web | -
2013 | - | JCDL2013 | TimeMap Caching | -
2013 | RQ1 | TPDL2013 | Archivability Over Time | -
2013 | - | TPDL2013 | Transactional Archiving | -
2013 | RQ1 | DLib Magazine 19(11/12) | Identifying Mementos | -
2014 | RQ2 | JCDL2014 | Measuring Memento Damage | Best Student Paper
2015 | RQ1 | International Journal of Digital Libraries | Measuring Impact of JavaScript | -
2015 | RQ2 | International Journal of Digital Libraries | Measuring Memento Damage | JCDL2015 Special Issue
2015 | - | JCDL2015 | Merging Mobile and Desktop | Best Poster
2015 | RQ3 | iPRES2015 | Two-Tiered Crawling | -
2016 | RQ3 | Technical Report, arXiv:1601.05142 | Hypercube Model for Archiving | -
2016 | - | DLib Magazine 22(1/2) | Archiving Corporate Intranets | -
Publications
• Justin F. Brunelle “Filling in the Blanks: Capturing the Dynamic Web”, JCDL 2012 Doctoral Consortium
• Justin F. Brunelle, Michael L. Nelson “An Evaluation of Caching Policies for Memento TimeMaps”, JCDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the Change in Archivability of Websites Over Time”, TPDL 2013
• Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool”, TPDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 19(11/12), 2013.
• Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014
• Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael L. Nelson “The impact of JavaScript on archivability”, 2015, IJDL
• Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson “Mobile Mink: Merging Mobile and Desktop Archived Webs”, JCDL 2015
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
• Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, and Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives”, DLib Magazine, 22(1/2) 2016
Mobile Mink: Merging Mobile and Desktop Archived Webs
Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, Michael L. Nelson
Acknowledgements
This work was supported in part by the NEH HK-50181. This work was performed as part of Wesley Jordan’s mentorship at The MITRE Corporation. The author’s affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the author.
More about Mobile Mink
http://bitly.com/MobileMink/
Desktop URIs are much more prevalent than their mobile counterparts in the archives because crawlers use desktop user-agent strings.
Corresponding Mobile URIs are archived less frequently even though the representations are different than their desktop counterparts.
http://espn.go.com/ http://m.espn.go.com/
Same ESPN, different URIs, different HTML, different TimeMaps.
Browse to a URI-R
Potential content negotiation from user-agent
Access tool from the “Share” menu
MobileMink merges TimeMaps of http://espn.go.com & http://m.espn.go.com/
Desktop and mobile webs differ and the linkage between them is lost in the archives
Discovers mobile and desktop URI-Rs
Uses Memento to get all available TimeMaps
Provides integrated TimeMap
Offers users ability to submit mobile and desktop URI-Rs to archives
Increases coverage of mobile URI-Rs in the archives
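The TimeMap merge can be pictured as interleaving two date-ordered lists. This sketch assumes each TimeMap has already been reduced to (14-digit datetime, URI-M) pairs sorted by datetime; the two sample entries are the URI-Ms shown elsewhere in this deck, and the function name is illustrative rather than taken from the Mobile Mink code.

```python
from heapq import merge
from typing import List, Tuple

Entry = Tuple[str, str]  # (14-digit datetime, URI-M)

def merge_timemaps(desktop: List[Entry], mobile: List[Entry]) -> List[Entry]:
    """Interleave two datetime-sorted TimeMaps into one integrated TimeMap.

    Fixed-width YYYYMMDDhhmmss strings sort chronologically as plain text.
    """
    return list(merge(desktop, mobile))

desktop_timemap = [
    ("20140330125315",
     "http://web.archive.org/web/20140330125315/http://espn.go.com/"),
]
mobile_timemap = [
    ("20140330125414",
     "http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/"),
]

for stamp, uri_m in merge_timemaps(desktop_timemap, mobile_timemap):
    print(stamp, uri_m)
```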
HTTP Request
$ curl -i -v http://www.cs.odu.edu/
> GET / HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
> Host: www.cs.odu.edu
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx
< Date: Tue, 25 Mar 2014 23:42:38 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
<
HTTP Response
HTTP/1.1 200 OK
Server: nginx
Date: Tue, 25 Mar 2014 23:40:09 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<!-- saved from url=(0036)http://www.cs.odu.edu/newcssite/new/ -->
<!-- saved from url=(0019)http://sci.odu.edu/ -->
<HTML xmlns:st1 = "urn:schemas-microsoft-com:office:smarttags">
<HEAD>
<meta name="verify-v1" content="CXMn8RoyhZpl9fsKpbgxtiFw3kIdHD51r/ntbf1Rrcw=" >
<TITLE>Department Of Computer Science</TITLE>
Client-side code modifies the DOM
Internet Archive URI-M
http://web.archive.org/web/20140314130018/http://espn.go.com/
Archive Prefix: http://web.archive.org/web/
Memento-DateTime: 20140314130018
URI-R: http://espn.go.com/
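The three parts of an Internet Archive URI-M can be pulled apart with a short parser. This is a sketch that assumes the prefix, 14-digit datetime, and URI-R layout shown on this slide; other archives and other Wayback URL variants are not handled.

```python
import re
from datetime import datetime
from typing import Tuple

URI_M_PATTERN = re.compile(r"^(https?://web\.archive\.org/web/)(\d{14})/(.+)$")

def parse_uri_m(uri_m: str) -> Tuple[str, datetime, str]:
    """Split a URI-M into (archive prefix, Memento-Datetime, URI-R)."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a recognized Internet Archive URI-M: {uri_m}")
    prefix, stamp, uri_r = match.groups()
    return prefix, datetime.strptime(stamp, "%Y%m%d%H%M%S"), uri_r

prefix, memento_datetime, uri_r = parse_uri_m(
    "http://web.archive.org/web/20140314130018/http://espn.go.com/"
)
print(prefix)
print(memento_datetime.isoformat())
print(uri_r)
```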
Deferred Representations
Representation is incomplete
Client-side code execution completes the build of the representation
Web Browsing Process
Deferred representations
Percent Missing vs. Weighted Damage
• M_M = Percent of embedded resources missing
\[ M_M = \frac{\text{Embedded Resources Missing}}{\text{Total Embedded Resources}} \]
• D_M = Damage rating of missing embedded resources
\[ D_M = \frac{D_{M_{Actual}}}{D_{M_{Potential}}} \]
\[ D_{M_{Potential}} = \sum_{i=1}^{n_{[I|MM]}} \frac{D_{[I|MM]}(i)}{n_{[I|MM]}} + \sum_{i=1}^{n_{[C]}} \frac{D_{[C]}(i)}{n_{[C]}} \]
where I = Image, MM = MultiMedia, C = CSS
• Measured Internet Archive mementos
• Damage generally improves over time
• Despite missing more resources over time
Damage in the Internet Archive
Expanding the crawl frontier
Click events lead to the most descendants
Related Work
Deep Web
• Deferred=Deep (Bergman, 2001)
• Mobile requires context (Schneider, 2013)
• Static → Dynamic Web (Rosenthal, 2011)(IIPC, 2012)
• Crawlers & deep Web (Ast, 2008) (B. He, 2007) (Y. He, 2013)
• Google’s deep Web crawler (Madhavan, 2008)
• Forms (Ntoulas, 2005)
Archive Quality
• SHARC, Quality Conscious Archiving (Spaniol, 2009)
• Quality of archives (Spaniol, 2009, 2009)
• Archiveready (Banos, 2013, 2015)
• Acid test (Kelly, 2014)
• Block Importance (Ye, 2003) (Fersini, 2008) (Kohlschutter, 2010)
Monitoring for Security
• Ripley (Vikram, 2009)
• Mugshot (Mickens, 2010)
• ActionShot (Li, 2010)
• Ajax testing and states (Mesbah, 2007, 2008, 2009, 2009, 2012)
• Crawling Ajax (Dincturk, 2013, 2014)
Publications
Master’s:
• Kyle Dempsey, Justin Brunelle, G. Tanner Jackson, Chutima Boonthum, Irwin Levinstein, Danielle McNamara. “MiBoard: Multiplayer Interactive Board Game”, AIED2009
• Justin F. Brunelle, Irwin B. Levinstein, Chutima Boonthum. “MiBoard: Metacognitive Training Through Gaming in iSTART”, 2009 VMASC Capstone Conference
• Best paper in track
• Justin F. Brunelle, Kyle B Dempsey, G. Tanner Jackson, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara. “MiBoard: Metacognitive Training Through Gaming”, SCiP2009
• Justin F. Brunelle, G. Tanner Jackson, Kyle Dempsey, Chutima Boonthum, Irwin B. Levinstein, Danielle S. McNamara. “Analysis of MiBoard as an iSTART Practice Tool”, FLAIRS-24, 2010
• Kyle Dempsey, G. Tanner Jackson, Justin Brunelle, Michael Rowe, Danielle McNamara. “MiBoard: Assessing Collaborative Learning Through Game-Based Practice”, FLAIRS-24, 2010
PhD:
• Justin F. Brunelle “Filling in the Blanks: Capturing the Dynamic Web”, JCDL 2012 Doctoral Consortium
• Justin F. Brunelle, Michael L. Nelson “An Evaluation of Caching Policies for Memento TimeMaps”, JCDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, Michael L. Nelson, “On the Change in Archivability of Websites Over Time”, TPDL 2013
• Justin F. Brunelle, Michael L. Nelson, Lyudmila Balakireva, Robert Sanderson, Herbert Van de Sompel, “Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool”, TPDL 2013
• Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 19(11/12), 2013.
• Justin F. Brunelle, Mat Kelly, Hany SalahEldeen, Michele C. Weigle, and Michael L. Nelson “Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources”, JCDL 2014
• Best Student Paper, International Journal of Digital Libraries: JCDL2015 Special Issue
• Justin F. Brunelle, Mat Kelly, Michele C. Weigle, and Michael L. Nelson “The impact of JavaScript on archivability”, 2015, IJDL
• Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson “Mobile Mink: Merging Mobile and Desktop Archived Webs”, JCDL 2015
• Best Poster
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Archiving Deferred Representations Using a Two-Tiered Crawling Approach”, iPRES2015
• Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson, “Adapting the Hypercube Model to Archive Deferred Representations at Web-Scale”, Technical Report, arXiv:1601.05142, 2016
• Justin F. Brunelle, Krista Ferrante, Eliot Wilczek, Michele C. Weigle, and Michael L. Nelson, “Leveraging Heritrix and the Wayback Machine on a corporate intranet: A case study on improving corporate archives”, DLib Magazine, 2016
Performance with classifier
Mobile Sites in the Archives
http://m.espn.go.com/wireless/
http://espn.go.com/
“A Method for Identifying Personalized Representations in Web Archives”, D-Lib Magazine, 2013
URI-M:
http://web.archive.org/web/20140330125315/http://espn.go.com/
URI-M:
http://web.archive.org/web/20140330125414/http://m.espn.go.com/wireless/
Collisions in the Archives
http://www.cnn.com/
URI-M? URI-T?
http://web.archive.org/web/[DATETIME]/http://www.cnn.com/
Need a better way to index mementos
• URI-R is no longer enough
• Environmental factors:
‒ Content negotiation
‒ Interaction
‒ Personalization
‒ GeoIP
Content Negotiation
Server-side interpretation of client-provided parameters
Multiple representations, single resource
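As a rough illustration of server-side negotiation on the user-agent, the sketch below returns one of two representations for the same resource. The substring hints, the placeholder HTML, and the function name are all assumptions made for the example; real servers use far more detailed device detection.

```python
# Hypothetical representations of a single resource.
REPRESENTATIONS = {
    "mobile": "<html><body>mobile layout</body></html>",
    "desktop": "<html><body>desktop layout</body></html>",
}

# Assumed substrings that mark a mobile client; illustrative only.
MOBILE_HINTS = ("Mobile", "Android", "iPhone")

def negotiate(user_agent: str) -> str:
    """Choose a representation based on the client's User-Agent header."""
    is_mobile = any(hint in user_agent for hint in MOBILE_HINTS)
    return REPRESENTATIONS["mobile" if is_mobile else "desktop"]

crawler_like = "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7"
phone_like = "Mozilla/5.0 (iPhone; CPU iPhone OS 9_0 like Mac OS X) Mobile/13A344"

print(negotiate(crawler_like))
print(negotiate(phone_like))
```

Because the URI-R is the same in both requests, an archive indexed only by URI-R and datetime cannot tell which representation it holds.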
[Diagram: a URI identifies a Resource; through content negotiation on the user-agent (Mobile or Desktop), Representation 1 or Representation 2 represents the Resource]