Web Crawling Tools and Services from the Internet Archive: Archive-It and Contract Crawling
Courtney C. Mumma, Internet Archive
November 17, 2016 - Dutch Institute for Sound and Vision

Internet Archive: Archive-It and Contract Crawling, C. Mumma

Apr 16, 2017


NCDD
Transcript
Page 1

Web Crawling Tools and Services from the Internet Archive: Archive-It and Contract Crawling

Courtney C. Mumma, Internet Archive
November 17, 2016 - Dutch Institute for Sound and Vision

Page 2

Talk overview
● Archiving the web at IA
● Partnerships and services
  ○ Contract crawls
  ○ Archive-It
    ■ Research Services
  ○ Interoperability & Distributed Preservation
● New technology for new challenges

Page 3

The Internet Archive

Non-Profit Library

Founded in 1996 by Brewster Kahle

Universal Access to All Knowledge

Page 4

30,000,000,000,000,000 Bytes Archived (30 PetaBytes)

20 Years of Archiving the Web

500,000,000,000+ URLs
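
The byte count on this slide can be checked with a couple of lines of arithmetic. The sketch below assumes the slide uses the decimal (SI) petabyte of 10^15 bytes; the binary pebibyte is shown only for comparison and does not appear in the talk. The function names are illustrative.

```python
# Check the slide's figure: 30,000,000,000,000,000 bytes expressed in petabytes.
BYTES_ARCHIVED = 30_000_000_000_000_000  # figure quoted on the slide

PETABYTE = 10**15  # decimal (SI) petabyte, assumed to be the slide's unit
PEBIBYTE = 2**50   # binary pebibyte, for comparison only


def to_petabytes(n_bytes: int) -> float:
    """Convert a byte count to decimal petabytes."""
    return n_bytes / PETABYTE


def to_pebibytes(n_bytes: int) -> float:
    """Convert a byte count to binary pebibytes."""
    return n_bytes / PEBIBYTE


print(to_petabytes(BYTES_ARCHIVED))            # 30.0
print(round(to_pebibytes(BYTES_ARCHIVED), 1))  # roughly 26.6
```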

Page 5

1996 US Presidential Campaigns with Smithsonian

218,342,520 Web Captures

Page 6

1997 First Full Crawl

525,362,846 Web Captures

Page 7

1998 Donation of Crawl to the Library of Congress

1,166,891,826 Web Captures

Page 8

2000 US Presidential Campaigns with the Library of Congress

6,153,042,235 Web Captures

Page 9

2001 Launch of the Wayback Machine

12,082,859,018 Web Captures

Page 10

2003 International Internet Preservation Consortium Founded

38,868,116,181 Web Captures

Page 11

2006 Archive-It Started

103,943,903,726 Web Captures

Page 12

2007 Ireland

184,277,909,308 Web Captures

Page 13

2008 National Archive Government Crawls

209,160,715,829 Web Captures

Page 14

2009 Archive-It Adds its 100th Partner

7 National Library Partners

225,658,093,516 Web Captures

Page 15

2010 Broad and Survey Web-Scale Crawls

246,744,306,660 Web Captures

Page 16

2015 Archive-It Adds its 400th Partner

467,195,419,069 Web Captures

Page 18

Global Wayback

● Broad snapshot
● Deep crawl on popular sites
● Broad crawl on known domains
● No more 404s
● On-demand
● Donated and targeted crawls
● https://web-beta.archive.org/ with KEYWORD SEARCH and more!
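
The slide does not say how "No more 404s" or on-demand lookups work behind the scenes. As one possible illustration, the sketch below builds a query against the public Wayback availability endpoint and reads the closest capture out of a response body. The helper names and the sample response are made up for this example, and the response is hand-written rather than taken from a real capture.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public availability endpoint; its use here is an assumption, not from the talk.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build a lookup URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text: str) -> Optional[str]:
    """Return the URL of the closest archived capture, or None if there is none."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hand-written response body in the shape the endpoint returns (illustrative only).
SAMPLE = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20161117000000",
            "url": "http://web.archive.org/web/20161117000000/http://example.com/",
        }
    }
})

print(availability_query("example.com", "20161117"))
print(closest_snapshot(SAMPLE))
print(closest_snapshot('{"archived_snapshots": {}}'))  # no capture found
```

A client that hits a dead link could call something like `closest_snapshot` on the fetched response and redirect the reader to the archived copy when one exists.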

Page 19

Support Open Source Software

Page 20

Web Archiving Partnerships and Services

Page 21

Domain Scale Web Preservation

Page 22

Contract Crawling
• Domain-scale
• Run by Internet Archive
• Average 300 million URLs per collection

Partial List of Partners
• National Libraries of Australia and New Zealand
• U.S. National Archives and Library of Congress
• Luxembourg National Library
• Israel National Library

Partial List of Collections
• Iraq War (2003-2011)
• 2005 US Supreme Court Nominations

Page 23

Archive-It

Curated, Selective Web Archiving

Page 24

Archive-It

Web based - nothing to install

Fully hosted service with unlimited support

Simple to select, manage, scope and catalog with metadata

10 different crawl frequencies

Includes quick access and storage

HTML, videos, audio, social media, PDFs, images, news

Full text search

Restricted access options

Page 28

How our partners use Archive-It

● Enhance and supplement traditional offline collections
  ○ archives, topical collections
● Support records retention and archival policies
● Capture event-based content
  ○ Spontaneous
  ○ Planned
● Individual organizations and Consortial collaboration

Page 29

Research Services

Page 30

Goals of Archive-It Research Services

● Expand access models for web archives
● Enable new insights into collections
● Leverage Internet Archive infrastructure for large-scale processing to produce datasets for research
● Facilitate computational analysis and new use cases
● Increase use, visibility, and value of Archive-It partner collections

Page 31

Web Archives Datasets

Archive-It Research Services
http://bit.ly/ait_ars

Page 32

Exploring the Canadian Political Interest Group and Political Parties Web Sphere via WAT files

Page 33

Named Entities in the Human Rights Collection

Page 34

Systems Interoperability and Distributed Preservation

Page 35

Lost in the maze in Labyrinth (1986, Lucasfilm, screen capture)

WARCs, CDXs and derivatives

Access
Storage
Preservation
Content Mgmt
Web Archiving Tools
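
For readers unfamiliar with the CDX index files named on this slide, here is a minimal sketch of reading one index line. CDX files come in several field layouts; this assumes the common 11-field, space-separated layout, and the sample line (including its digest and filename) is invented for illustration.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """One line of a CDX index, assuming the common 11-field layout."""
    urlkey: str      # canonicalized, SURT-style URL key
    timestamp: str   # capture time, YYYYMMDDhhmmss
    original: str    # URL as it was requested
    mimetype: str
    status: str      # HTTP status code
    digest: str      # content digest
    redirect: str    # "-" when absent
    meta: str        # "-" when absent
    length: str      # compressed record size in bytes
    offset: str      # byte offset of the record in the WARC file
    filename: str    # WARC file that holds the record


def parse_cdx_line(line: str) -> CdxRecord:
    """Split a CDX line into named fields; raise if the layout differs."""
    fields = line.split()
    if len(fields) != len(CdxRecord._fields):
        raise ValueError(f"expected {len(CdxRecord._fields)} fields, got {len(fields)}")
    return CdxRecord(*fields)


# Invented sample line; the digest and filename are placeholders.
SAMPLE_LINE = (
    "com,example)/ 20161117120000 http://example.com/ text/html 200 "
    "FAKEDIGEST - - 1234 5678 EXAMPLE-20161117.warc.gz"
)

record = parse_cdx_line(SAMPLE_LINE)
print(record.original, record.timestamp, record.filename, record.offset)
```

The `filename` and `offset` fields are what let an access tool jump straight to the matching record inside a WARC file without scanning it.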

Page 37

APIs (application programming interfaces)

● Interoperability
● Flexibility and modularity
● Loose coupling of services (so we can improve pieces as needed)
● Scalability - Bulk data upload and download
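
The slide names no specific API, so the following is only a sketch of what loose coupling and bulk download can look like in client code. The listing logic depends on an injected `fetch` function rather than on any particular HTTP library, and it follows `next` links until the listing is exhausted. The response shape (`files`, `next`) and the example.org URLs are hypothetical.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]
Fetch = Callable[[str], Page]  # any function that maps a URL to a decoded page


def iter_files(first_url: str, fetch: Fetch) -> Iterator[Dict[str, Any]]:
    """Yield every file entry from a paginated listing, one page at a time."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        for item in page.get("files", []):
            yield item
        url = page.get("next")  # hypothetical field: URL of the next page, or None


# In-memory stand-in for a remote service, so the sketch runs without a network.
FIRST = "https://example.org/api/files?page=1"
PAGES: Dict[str, Page] = {
    FIRST: {
        "files": [{"filename": "A.warc.gz"}, {"filename": "B.warc.gz"}],
        "next": "https://example.org/api/files?page=2",
    },
    "https://example.org/api/files?page=2": {
        "files": [{"filename": "C.warc.gz"}],
        "next": None,
    },
}

names = [f["filename"] for f in iter_files(FIRST, PAGES.__getitem__)]
print(names)
```

Because `fetch` is passed in, the transport can be swapped (a real HTTP client, a cache, a test double) without touching the listing logic, which is the kind of piecewise improvement the slide describes.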

Page 38

New technology to face new challenges

Page 40

Ongoing efforts

• OpenWayback

• Social media / Dynamic content

– Brozzler and Umbra (Archive-It)

– Social Feed Manager (GWU)

• URL nomination tools (UNT)

• Capture tools (GWU, IA, Rhizome)

• WASAPI - Community building and API

• Memento
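
Memento, the last item in the list above, is a protocol (RFC 7089) in which a client asks a TimeGate for the capture of a URL closest to a given date by sending an `Accept-Datetime` header. The sketch below only formats that header value; the function name is illustrative and no particular TimeGate is assumed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(when: datetime) -> str:
    """Format a timezone-aware datetime as an Accept-Datetime header value (GMT)."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    # Memento datetimes are expressed in GMT using the HTTP date format.
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


# Date of this talk, used purely as an example value.
talk_date = datetime(2016, 11, 17, tzinfo=timezone.utc)
headers = {"Accept-Datetime": accept_datetime(talk_date)}
print(headers)  # {'Accept-Datetime': 'Thu, 17 Nov 2016 00:00:00 GMT'}
```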

BROZZLER!

Page 41

????s

THANK YOU

[email protected]