VT Web Archiving Anthony Rinaldi and Dev Mehta CS 4624 Clients: Mohamed Magdy and Tarek Kanan Blacksburg, VA 5/6/2014

Sep 23, 2020


VT Web Archiving
Anthony Rinaldi and Dev Mehta

CS 4624
Clients: Mohamed Magdy and Tarek Kanan

Blacksburg, VA
5/6/2014


Project Goals

● Set up a web crawler with Heritrix

● Archive files from vt.edu

● Integrate with Wayback

● Set up search with Solr (stretch goal)


Problems Encountered

● Older version of software.

● Finding documentation to configure Heritrix.
o Only crawl vt.edu pages.
o Crawl all vt.edu pages.

● Issues with CentOS firewalling.


Work Accomplished

● Working setup of Heritrix that successfully crawls vt.edu web pages.
o Customized configuration to increase crawl depth.
o Rejects URLs outside the vt.edu domain.

● Working setup of the Wayback Machine:
o Processes WARC files from Heritrix.
o Front end for Heritrix-based crawls.
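The domain restriction described above can be illustrated with a small sketch. This is not the Heritrix configuration used in the project; it is a standalone Python illustration of the idea, assuming that "in scope" means the host is vt.edu itself or any subdomain of it. The function name `in_scope` and the constant `ALLOWED_DOMAIN` are hypothetical.

```python
from urllib.parse import urlsplit

# "vt.edu" comes from the slides; the names below are hypothetical.
ALLOWED_DOMAIN = "vt.edu"


def in_scope(url: str, domain: str = ALLOWED_DOMAIN) -> bool:
    """Return True if the URL's host is the domain or one of its subdomains."""
    host = urlsplit(url).hostname  # None for URLs without a network location
    if host is None:
        return False
    host = host.lower().rstrip(".")
    # Require an exact match or a dot boundary, so "notvt.edu" is rejected.
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    for candidate in (
        "http://www.vt.edu/about",
        "http://webarchive.cc.vt.edu/",
        "http://example.com/vt.edu",
        "http://notvt.edu/",
    ):
        print(candidate, in_scope(candidate))
```

Matching on the parsed host, not on the raw URL string, avoids accepting off-domain pages that merely mention vt.edu in their path.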


Lessons Learned

● Sometimes, documentation leaves much to be desired.

● Crawls can be extremely large if not configured properly.
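To make the point about crawl size concrete, here is a minimal sketch of a breadth-first traversal capped by hop depth and page count. It runs over a toy in-memory link graph, not real web pages, and it is not how Heritrix is implemented; `bounded_crawl`, `TOY_LINKS`, and both limits are hypothetical names chosen for illustration.

```python
from collections import deque


def bounded_crawl(links, seed, max_depth, max_pages):
    """Breadth-first traversal of a link graph, capped by depth and page count."""
    seen = {seed}
    order = []
    queue = deque([(seed, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


# Toy link graph standing in for pages and their outgoing links.
TOY_LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "e"],
    "d": ["f"],
    "e": [],
    "f": [],
}

if __name__ == "__main__":
    print(bounded_crawl(TOY_LINKS, "a", max_depth=1, max_pages=100))
    print(bounded_crawl(TOY_LINKS, "a", max_depth=10, max_pages=4))
```

Without either cap, the traversal visits every reachable page, which is how an unconfigured crawl grows far beyond what was intended.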


Demo

Heritrix:
● https://administrator:[email protected]:12222/

Wayback:
● http://webarchive.cc.vt.edu/


Questions?