Archiving and Preserving the Web. Dan Avery, Kristine Hanna, Merrilee Proffitt. Internet Archive, RLG. April 2006.

Jan 16, 2016


Stewart Austin
Transcript
Page 1

Archiving and Preserving the Web

Dan Avery
Kristine Hanna
Merrilee Proffitt

Internet Archive
RLG

April 2006

Page 2

Agenda

• RLG
• Internet Archive
• Archive-It
• Challenges
• The Future
• Q&A

Page 3

The importance of archiving the web

• The web contains much of what will be the basis of scholarship in the future
– record of events
– official publications
– personal viewpoints
– ephemeral material

Page 4

RLG’s interest

• RLG’s mission includes working with its member organizations to enhance their ability to provide research resources

• RLG members have long been participating in web archiving, but so far, this has been an activity restricted to large organizations

Page 5

Members active in web archiving

• Bibliothèque Nationale de France
• British National Library
• California Digital Library
• Library of Congress
• National Library of Australia
• National Library of New Zealand

Page 6

Archive-It pilot partners

• Indiana University
• International Institute of Social History
• University of Toronto
• Swarthmore/Haverford College

Page 7

About Internet Archive

• Founded in 1996
• Largest public web archive
• 60 billion pages, 55 million sites
• Have expanded to include texts, audio, moving images, and software: 2.6 million downloads a day

Page 8

What do we collect? Web Archive

• Take a broad snapshot of the web every 2 months
• 2 billion pages a month
• Websites from every domain (.org, .com, .edu, etc.)
• Content in 21 languages

Page 9

Policy

• We follow the Oakland Archive Policy (2002)
• Founded by commercial and non-commercial organizations
• Opt-out policy
• We collect it all, and make it inaccessible if requested by site owner

• Site owner directly blocks harvester on website
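
The last bullet, a site owner blocking the harvester on their own website, is commonly done with a robots.txt file. The slides do not say which mechanism or crawler name is used, so the following is only a minimal sketch under that assumption: the user-agent string `example-archiver` and the sample rules are hypothetical, and the check uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt a site owner might publish to opt out of harvesting.
# "example-archiver" is an illustrative crawler name, not one from the slides.
EXAMPLE_ROBOTS = """\
User-agent: example-archiver
Disallow: /
"""

if __name__ == "__main__":
    page = "http://example.org/page.html"
    print(is_blocked(EXAMPLE_ROBOTS, "example-archiver", page))
    print(is_blocked(EXAMPLE_ROBOTS, "other-bot", page))
```

A harvester that honors the file would run a check like this before fetching each page and skip anything that is disallowed for its own user agent.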

Page 10

Access to Web Archive

• Entire archive accessible for free to the public via the website at www.archive.org

• Receive 100 hits/second
• 60k unique users per day
• Evolving/Fluid: through public use we hope to find out what is important and to continuously improve

Page 11

Why try to collect and preserve it all?

• Web has no boundaries, no limits
• What will be important?
• What is there today may be gone tomorrow
– “Capture now, ask why later”
– “Grab it while you can, work it out later”
– “Lose as little as possible”

Page 12

How do we collect it?

Open Source Technology primarily developed by Internet Archive and IIPC

• Heritrix: web crawler
• Wayback Machine: access tool for rendering and viewing files
• NutchWAX: search engine
• ARC file: archival record format (ISO work item)

Page 13

Wayback Machine
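
This slide survives in the transcript as a title only. As a rough illustration of how an archived capture is addressed, the public Wayback Machine uses URLs of the form `web.archive.org/web/<14-digit timestamp>/<original URL>`. That pattern comes from the public service rather than from the slides, and the helper name `wayback_url` and the sample site below are hypothetical.

```python
from datetime import datetime

# Base address of the public Wayback Machine (an assumption, not stated in the slides).
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a Wayback-style address for a capture of original_url near `when`.

    `when` is treated as UTC and formatted as YYYYMMDDhhmmss.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(wayback_url("http://www.example.org/", datetime(2006, 4, 1, 12, 0, 0)))
```

The sketch only constructs the address. It does not check whether a capture exists for that date.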

Page 14

Preservation

• Store multiple copies of each Archive

• 1300 machines/servers
• Multiple copies at different geographical locations (U.S., Alexandria, Amsterdam)

• Standard storage boxes, open source design
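
Keeping multiple copies is only useful if the copies can be shown to match. The slides do not describe how this is verified, so the following is a minimal sketch of one common approach, a checksum comparison. The function names and file paths are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_agree(paths: Iterable[Path]) -> bool:
    """Return True if every copy has the same digest (or no copies were given)."""
    digests = {sha256_of(path) for path in paths}
    return len(digests) <= 1


if __name__ == "__main__":
    # Hypothetical locations of two copies of the same archive file.
    replicas = [Path("site_a/sample.arc"), Path("site_b/sample.arc")]
    if all(path.exists() for path in replicas):
        print(copies_agree(replicas))
```

When the digests differ, a comparison like this shows only that the copies diverge. Deciding which copy is intact needs a digest recorded at ingest time or a third copy.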

Page 15

Next Steps

Institutions:
• need to create collections around primary source web material
• want to do more than broad crawling with specific and complete web archives
• want a technology partner that could harvest, index, access, store and preserve their collections for them

Page 16

1. Partner Contract Crawls

• In 2002, began to form partnerships with Library of Congress, NARA and other National Libraries, including Australia and France.
– Library of Congress collections:
  • Iraq War: 450,000,000 documents and growing
  • U.S. National Elections
    – 2000: 131,331,973 documents
    – 2004: 87,481,265 documents
  • Supreme Court Nomination 2005: 100 million documents

Page 17

2. Archive-It

• In early 2005, we had requests from state archivists, university librarians and other memory institutions to expand our archiving services and develop an application that acknowledges resource constraints
• Developed Archive-It, a web-based service that allows partners to create, manage, search and store their web archives through an easy-to-use web interface
• Does not require technical expertise or infrastructure
• Pilot launched in September 2005
• 1.0 Release in February
• 1.5 Release in April
• 2.0 Release in July

Page 18

Pilot Partners

• Center for Research Libraries
• Research Libraries Group (U of Toronto, U of Indiana, Haverford and Swarthmore Colleges, IISH)
• University of Texas
• Library of Virginia
• State Archives South Dakota
• State Archives North Carolina
• State Archives Alabama
• Minnesota Historical Society
• Institut d'Etude Politique de Grenoble

Page 19

Archive-It

• 1.0 Release in February
• 1.5 Release in April
• 2.0 Release in July

Page 20

Archive-It Collections

• Some samples:
– Virginia’s political landscape, 2005 (Gov. Mark Warner)
– Hurricane Katrina
– Jamestown 2007 Commemoration

Page 21

Archive-It Access

• All collections are accessible for free to the general public, with text search, at:
– www.archive-it.org
– Partners’ websites with links
• Plus, member web application with login

Page 22

Demo

Page 23

Dan’s slides

Tech

Page 24

Challenges we face

• Making the collections useful for a variety of end users (e.g. general public, researchers)

• Making sure we capture the best and most relevant content

• Continuing to develop our tools for access and harvesting (crawler.archive.org)

Page 25

Internet Archive’s priorities

• Collaboration and Partnerships
– Continue to act as a technology partner in providing web archiving services to government and memory institutions
– Continue to develop Open Source software
– Develop common tools, storage formats and standards through the IIPC (International Internet Preservation Consortium)
– Open Content Alliance (OCA) digital books project
• Multiple copies across the world
– Within IA’s own facilities and with partners such as LC, BnF, Library of Alexandria

Page 26

RLG’s web archiving program

• Collaborative collection development
• Descriptive metadata for web archives
• Usability/user studies
• Intellectual property concerns
• Web Archiving 101
• Web archiving services and software