Archive-It Architecture Introduction April 18, 2006 Dan Avery Internet Archive

Archive-It Architecture Introduction

April 18, 2006

Dan Avery

Internet Archive

Archive-It Components

•Crawling

•User Interface

•Storage

•Playback

•Text Indexing

•Integration

Component Integration

Crawling

•Heritrix ( http://crawler.archive.org/ )

•Java application

•Open source (LGPL)

•Crawls for completeness/depth

•Highly configurable

Crawling - Distributed Crawling

•Heritrix Cluster Controller

•Java component - open source - developed by IA

•http://crawler.archive.org/hcc

•Provides proxy access to pool of Heritrix instances through JMX interface

•Provides crawler control and status

•Currently controlling 33 crawler instances on three commodity dual-Opteron machines; upper bound unknown
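
As an illustration of the pooling idea only, here is a minimal Python sketch of a controller that hands each new crawl job to the least-loaded crawler in a pool. The names CrawlerPool, submit, and status are hypothetical and are not the HCC or Heritrix JMX API; the real controller reaches Heritrix instances over JMX. Splitting the 33 instances evenly as 11 per host is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Crawler:
    """Stand-in for one Heritrix instance (hypothetical, not the JMX interface)."""
    name: str
    jobs: list = field(default_factory=list)


class CrawlerPool:
    """Hypothetical controller: one entry point in front of many crawlers."""

    def __init__(self, crawlers):
        self.crawlers = list(crawlers)

    def submit(self, job_name):
        # Pick the instance currently running the fewest jobs.
        target = min(self.crawlers, key=lambda c: len(c.jobs))
        target.jobs.append(job_name)
        return target.name

    def status(self):
        # Aggregate view across the pool: crawler name -> number of jobs.
        return {c.name: len(c.jobs) for c in self.crawlers}


# 33 instances on three hosts, as on the slide (11 per host is an assumption).
pool = CrawlerPool(Crawler(f"host{h}-crawler{i}") for h in range(3) for i in range(11))
first = pool.submit("partner-a-weekly")
second = pool.submit("partner-b-daily")
```

The caller only ever talks to the pool, which is the point of the proxy arrangement: instances can be added without changing the calling code.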

Archive-It Web Application

• User Interface and Crawl Scheduling

• Gets seed URLs and crawl parameters from users

• Schedules new periodic crawls

• Talks to crawler pool through HCC

• Provides access, search, and crawl history UI
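
The periodic scheduling step could look roughly like the sketch below. The frequency names and the next_crawl function are hypothetical; the slide only says that crawls are periodic and does not describe how the web application computes them.

```python
from datetime import datetime, timedelta

# Hypothetical frequency names; the slide only says crawls are periodic.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_crawl(last_run, frequency, now):
    """Return the first scheduled time strictly after `now`."""
    step = INTERVALS[frequency]
    due = last_run + step
    # Skip over any slots that were missed while the scheduler was idle.
    while due <= now:
        due += step
    return due
```

For example, a weekly crawl last run on April 1, 2006 and checked on April 18 would next be due on April 22.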

Storage

•archive.org ARC repository

•custom Perl system

•simple storage on primary/backup pairs

•monthly MD5 digest verification

•robust, non-proprietary file format

•Alexandria (Egypt)/Amsterdam
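
A minimal Python sketch of digest verification across a primary/backup pair. This is not the Perl system described on the slide; the function names are hypothetical, and comparing the two copies directly (rather than against a stored digest) is an assumption.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks so large ARC files are not loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pair(primary_dir, backup_dir):
    """Return names of files that are missing on the backup or whose digests differ."""
    bad = []
    for primary in sorted(Path(primary_dir).iterdir()):
        if not primary.is_file():
            continue
        backup = Path(backup_dir) / primary.name
        if not backup.is_file() or md5_of(primary) != md5_of(backup):
            bad.append(primary.name)
    return bad
```

Run on a schedule, a check like this flags silent corruption on either copy before both are affected.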

Access

• Internet Archive Wayback Machine

• Replaying archived web pages since 2001

• Current IA version written in Perl and C, with components distributed across various machines

• Not open source, but open source beta (in Java) available now
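
One step in replaying an archived page is choosing which capture to serve for a requested date. The sketch below picks the nearest capture from a sorted list of 14-digit timestamps. It is an illustration only, not the Wayback Machine's code; the function name and the distance measure are assumptions.

```python
from bisect import bisect_left


def closest_capture(captures, requested):
    """Pick the capture timestamp nearest to `requested`.

    `captures` is a sorted list of 14-digit YYYYMMDDhhmmss strings.
    Distance is measured on the integer value of the timestamp, which is
    a rough proxy for elapsed time, not an exact one.
    """
    if not captures:
        return None
    i = bisect_left(captures, requested)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = captures[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda ts: abs(int(ts) - int(requested)))
```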

Full-Text Indexing

•Nutch (http://nutch.org)

•NutchWAX (http://archive-access.sf.net) additions create and search indexes of stored ARC files

•Standard text search plus link analysis

•Can search by date instead of relevance, useful for individual archives
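
The difference between the two orderings can be sketched as follows. The Hit record and order_hits function are hypothetical and are not the Nutch or NutchWAX API; showing the newest capture first is an assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    url: str
    score: float      # relevance score from the text index
    captured: str     # capture date, YYYYMMDD


def order_hits(hits, by="relevance"):
    """Order search hits by relevance (best first) or by capture date (newest first)."""
    if by == "date":
        return sorted(hits, key=lambda h: h.captured, reverse=True)
    return sorted(hits, key=lambda h: h.score, reverse=True)
```

Within a single archive, where many hits are captures of the same pages over time, the date ordering shows how a page evolved instead of which capture scored best.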

Text Indexing Challenges

•Some parts are distributable, some are not

•Incremental indexing - goal of new crawls in index within 72 hours

•Working on an Archive-It-usable map/reduce version - July

•In the meantime, a lot of workarounds

Integration

•Group of Perl and bash scripts - planning more complex than the execution

•Most components available individually

•Decentralized control, centralized monitoring

•Each component operates almost entirely independently

The Big Picture

Future Challenges

•Crawler trap detection

•Scalability

•Current setup can accommodate 300 partners at current crawling rates

•During pilot we crawled/indexed/stored just over 100,000,000 documents (~4TB) in eight weeks

•More machines can be easily added to storage and crawling clusters
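
For a sense of scale, the pilot figures above work out to an average of roughly 40 KB per document, about 1.8 million documents per day, and about 20 documents per second. The arithmetic below takes 1 TB as 10^12 bytes and treats the eight weeks as continuous operation; both are assumptions.

```python
DOCUMENTS = 100_000_000        # "just over 100,000,000" on the slide
TOTAL_BYTES = 4 * 10**12       # ~4 TB, taking 1 TB = 10^12 bytes (assumption)
DAYS = 8 * 7                   # eight weeks, assumed continuous
SECONDS = DAYS * 86_400

avg_doc_kb = TOTAL_BYTES / DOCUMENTS / 1000
docs_per_day = DOCUMENTS / DAYS
docs_per_second = DOCUMENTS / SECONDS
mb_per_second = TOTAL_BYTES / SECONDS / 10**6

print(f"average document size: {avg_doc_kb:.0f} KB")
print(f"documents per day:     {docs_per_day:,.0f}")
print(f"documents per second:  {docs_per_second:.1f}")
print(f"sustained throughput:  {mb_per_second:.2f} MB/s")
```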

Scalability

•Current Nutch is between versions

•Old version has some non-distributable pieces

•New version is much more distributable and scalable (map/reduce - Hadoop), but not ready for incremental indexing

Looking ahead

•After basic UI/archiving/indexing...

•Time-based search UI

•Analyzing archives for research and ongoing collection improvement

•Content classification

•Rate of change

•New site suggestions
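
One way the "rate of change" analysis listed above might be measured is to compare content digests across successive captures of the same URL. The slides do not say how the analysis would be done, so the function below is a hypothetical sketch.

```python
def change_rate(digests):
    """Fraction of consecutive capture pairs whose content digest differs.

    `digests` lists one content digest per capture of a URL, oldest first.
    """
    if len(digests) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(digests, digests[1:]) if prev != cur)
    return changes / (len(digests) - 1)
```

A page whose five captures have digests a, a, b, b, c changed in two of four intervals, giving a rate of 0.5; a figure like this could inform how often a seed is recrawled.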

http://www.archive-it.org