Web Archiving: A Brief Introduction
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
About Me
Sawood Alam
Lexical Signature: Web, Digital Library, Web Archiving, Ruby on Rails, PHP, XHTML, CSS, JavaScript, ExtJS, Urdu, RTL and Linux.
● BTech, Jamia Millia Islamia, India, 2008
● MSc, Old Dominion University, USA, 2013
● PhD, Old Dominion University, USA, Current
She Calls Me Dad!
Agenda
● Archiving and Web archiving
● Purpose and importance
● Scope of Web archiving
● Issues and challenges
● Tools and techniques
● Memento: Time Travel for the Web
● Archive X-Ray
● Research opportunities in Web archiving
● Our WSDL Research Group
What is an Archive?
● Accumulation of historical records
● Long term storage and preservation
● Less frequently used
● Physical or digital
What is Web Archiving?
● Periodic snapshots of web pages
● Preserving important events on the Web
● Making archived content accessible
Why Do We Care About Archiving?
Web content decays rapidly!
● To preserve the history
● To tell a story
● For evidence
● For backup
● For personal satisfaction
Issues and Challenges
● Crawling
● Storage
● Retrieval
● Replay
● Accessibility
● Completeness
● Accuracy
● Credibility
Web Archiving Efforts
● Internet Archive
● Archive-It
● Wikipedia
● UK Web Archive
● Various national and non-profit archives
● Film, music and other multimedia archives
● Scholarly archives
● Personal archiving
Tools and Techniques
● Heritrix, PhantomJS, Wget, cURL
● OpenWayback, PyWB
● TimeTravel, MemGator
● CarbonDate, Warrick, Synchronicity
● Preserve Me!
● WARCreate, WAIL, Mink
● Browsertrix
● And many more...
Memento
<http://example.com>; rel="original",
<http://web.archive.org/web/20020120142510/http://example.com/>;
rel="memento";
datetime="Sun, 20 Jan 2002 14:25:10 GMT",
<http://web.archive.org/web/20020328012821/http://www.example.com/>;
rel="memento";
datetime="Thu, 28 Mar 2002 01:28:21 GMT",
<http://webarchive.loc.gov/all/20020803080544/http://www.example.com/>;
rel="memento";
datetime="Sat, 03 Aug 2002 08:05:44 GMT",
<http://wayback.archive-it.org/all/20091213015014/http://www.example.com/>;
rel="memento";
datetime="Sun, 13 Dec 2009 01:50:14 GMT",
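The listing above is in the link format that Memento TimeMaps use: each entry is a URI in angle brackets followed by attributes such as rel and datetime. As a minimal sketch, assuming entries are well formed and all attribute values are double-quoted, the following Python reads such a listing and picks the memento closest to a requested time. The function names parse_timemap and closest_memento are hypothetical helpers written for this illustration, not part of any Memento library.

```python
import re
from email.utils import parsedate_to_datetime

# One link-value: <URI> followed by zero or more ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return a list of dicts, one per link, with 'uri' plus its attributes."""
    links = []
    for uri, attrs in LINK_RE.findall(text):
        entry = {"uri": uri}
        entry.update(dict(ATTR_RE.findall(attrs)))
        links.append(entry)
    return links


def closest_memento(links, when):
    """Pick the memento whose datetime is nearest to `when` (timezone-aware)."""
    mementos = [
        link for link in links
        if "memento" in link.get("rel", "").split() and "datetime" in link
    ]
    return min(
        mementos,
        key=lambda link: abs(parsedate_to_datetime(link["datetime"]) - when),
        default=None,
    )
```

For example, calling closest_memento(parse_timemap(text), when) with a timezone-aware datetime returns the dictionary for the nearest capture, or None when the listing contains no mementos.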
Archive X-Ray!
● How much of the Web is archived?
● Profiling various archive services
● Predicting what they contain
● Routing Memento aggregator queries
Memento Aggregator
Long Tail of Archives
Archive Profile
● High-level summary of an archive
● Predicts presence of mementos
● Provides statistics about the holdings
● Small in size and publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things
com,cnn)/ {"frequency": 40, "spread": 2}
uk,co,bbc)/ {"frequency": 20, "spread": 1}
com,usatoday)/ {"frequency": 5, "spread": 1}
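The profile lines above pair a reversed-host key with a small JSON record. As a rough sketch of how an aggregator might use such profiles for query routing, the Python below loads profiles and returns only the archives whose profile contains the key for a requested URI. Several things here are assumptions made for illustration: the key is built from the host alone with a leading "www." dropped, the archive names are made up, and ordering candidates by the "frequency" field is a guess at a sensible policy rather than something the slides specify.

```python
import json
from urllib.parse import urlsplit


def host_key(uri):
    """Build a reversed-host key such as 'com,cnn)/' (simplified assumption)."""
    host = urlsplit(uri).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return ",".join(reversed(host.split("."))) + ")/"


def load_profile(text):
    """Parse lines of the form '<key> <JSON object>' into a dict."""
    profile = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, stats = line.partition(" ")
        profile[key] = json.loads(stats)
    return profile


def route(uri, profiles):
    """Return names of archives whose profile lists the URI's key,
    highest 'frequency' first (ordering policy is an assumption)."""
    key = host_key(uri)
    hits = [
        (name, profile[key]["frequency"])
        for name, profile in profiles.items()
        if key in profile
    ]
    return [name for name, _ in sorted(hits, key=lambda hit: -hit[1])]
```

With profiles loaded for each archive, route("http://www.cnn.com/world", profiles) would list only the archives that report holdings under com,cnn)/, so the aggregator can skip the rest.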
Research Opportunities
● Information retrieval
● Information visualization
● Client and server side archiving
● Archiving dynamic content
● Distributed archiving
● Discovering alternative long-term archiving techniques
● Predicting “important” events on the Web and archiving them in a timely manner
Web Science and Digital Libraries Research Group
ws-dl.cs.odu.edu
ws-dl.blogspot.com
@WebSciDL
github.com/oduwsdl
flickr.com/photos/124419986@N07
WSDL Research Group
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk, Virginia - 23529 (USA)
[email protected]
@ibnesayeed
www.cs.odu.edu/~salam