Managing the Rhizome: METS for Web Archiving Leslie Myrick, NYU CNI Fall 2004 Task Force Mtg Portland, OR 6-7 December, 2004

Managing the Rhizome:

METS for Web Archiving

Leslie Myrick, NYU
CNI Fall 2004 Task Force Mtg
Portland, OR
6-7 December, 2004


Outline for Today

• Peculiarities of websites as complex digital objects

• How METS is particularly suited to address the challenges of encapsulating website objects

• How METS could be used to manage and even navigate website objects in an archive


The problem:

• WWW an increasingly important vehicle for dissemination of information

• Tools and infrastructure perhaps as volatile as material we’re collecting

• Ensure that fugitive materials will be captured, preserved, made accessible


Early Implementers of Large-Scale Web Archiving Repositories

• Internet Archive Wayback Machine

• National Library of Australia

– PANDORA

• National Library of Sweden

– Kulturarw3


Major Web Archiving Initiatives

• Nordic Web Archive (NWA)

• UK Web Archiving Consortium (UKWAC)

• International Internet Preservation Consortium (IIPC)

• National Digital Information Infrastructure and Preservation Program (NDIIPP)

– Half of Partnerships involve web content

– “The Web at Risk” CDL, UNT, NYU


Political Communications Web Archive Project (PCWA)

• Under auspices of CRL and Mellon

• Participants: Cornell University, Stanford University, UT Austin, NYU

• Focus: SE Asia, Sub-Saharan Africa, Latin America, Western Europe

• Radical or NGO political born-digital “ephemera”

• Content: Internet Archive (.arc files of website snapshots culled from Alexa crawls)


Websites as Moving Target

• Volatility

– BBC Site Banner: Updated Every Minute of Every Day

• Ephemerality

– Peter Lyman: Average lifespan of webpage is 44 days.

– Entire websites disappear at alarming rate as well …


Political Websites: ephemerality + import = urgency

• Candidates in the Nigerian Elections of April 2003 (PCWA) 8/37

• 135 Candidates’ websites in California Recall campaign of 2003 (CDL)

• Defunct Federal Agencies (CyberCemetery at UNT)


An enterprise fraught with questions

• How do we collect it before it disappears or radically changes?

• How do we define what we are collecting?

• How do we manage the curiosity cabinet of MIME Types we ingest?

• How do we supply effective metadata for management, preservation and access?


Basic METS Recipe

• [metsHdr]

• fileSec

• structMap

• structLink

• dmdSec

• amdSec

• [behaviorSec]


Web Archiving Challenges I: Definition and Taxonomy

• Definition of the object “website” and its boundaries

• Complexities of website structure(s)

• Complex “symphonic” nature of a webpage itself


Definition and boundaries

• Website as “a structured aggregate of files”

– related by hyperlinking or embedding

• Capture and treatment of “near files”

– .css, .js, icons that may live on another server or domain but are necessary to render the page

• Treatment of external links

– and of files at the end of those links


Which website structure?

• Physical file structure from host server as represented by captured mirror?

• Logical tree structure?

– entry page as parent; other pages as children

• Hyperlink structure?

• All of the above? or some combination?


Webpage as “symphonic”

• HTML wrapper around embedded data streams + hyperlinks all rendered in parallel

– embedded multimedia or Flash

– image SRCs

– javascript, .css

– HREFs
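The split between embedded streams and hyperlinks can be sketched in a few lines of Python. This is only an illustration using the standard library's `html.parser`; the class name `PageComponents` and the stylesheet name `site.css` are assumptions made for the example, not part of the project's tooling.

```python
from html.parser import HTMLParser


class PageComponents(HTMLParser):
    """Collect embedded resources (src) and hyperlinks (href) from one page."""

    def __init__(self):
        super().__init__()
        self.embedded = []    # rendered in parallel with the HTML wrapper
        self.hyperlinks = []  # navigational links to other pages

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "embed", "script") and attrs.get("src"):
            self.embedded.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.embedded.append(attrs["href"])  # e.g. a .css stylesheet
        elif tag == "a" and attrs.get("href"):
            self.hyperlinks.append(attrs["href"])


# Hypothetical sample page; site.css is an invented file name.
sample = (
    '<html><head><link rel="stylesheet" href="site.css"></head><body>'
    '<embed src="notjust.swf"><a href="home.htm">'
    '<img src="enterarrow.gif"></a></body></html>'
)
parser = PageComponents()
parser.feed(sample)
print(parser.embedded)
print(parser.hyperlinks)
```

Links buried inside Flash or generated by javascript are not visible to a parser of this kind.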


<METS:fileSec> (the easy bit)


How to inventory captured files?

• All the files harvested in a single snapshot

– Single <fileGrp>

– Sorted according to resource type?

• Incremental harvest (trickier)

– Handled by multiple fileGrps?

• Separate fileGrps for initial snapshot, migrated or refreshed files?
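One way to picture "sorted according to resource type" is to bucket the harvested files by top-level MIME type before writing them out. The sketch below is an assumption about how such sorting might look, not a description of the project's ingest code; the function name and the MIME type given for the .gif file are illustrative.

```python
from collections import defaultdict


def group_by_resource_type(files):
    """Group (file_id, mime_type) pairs by top-level MIME type."""
    groups = defaultdict(list)
    for file_id, mime_type in files:
        groups[mime_type.split("/", 1)[0]].append(file_id)
    return dict(groups)


# File IDs borrowed from the APGA Women example; MIME types are assumed.
snapshot = [
    ("FID18", "text/html"),
    ("FID1036", "application/x-shockwave-flash"),
    ("FID1043", "application/x-shockwave-flash"),
    ("FID1075", "image/gif"),
]
groups = group_by_resource_type(snapshot)
print(groups)
```

Each key could then become its own fileGrp, or an ordering within a single fileGrp.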


METS File inventory

<METS:fileSec>
  <METS:fileGrp>
    <METS:file ID="FID18" MIMETYPE="text/html" ADMID="ADM1">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/" />
    </METS:file>
    <METS:file ID="FID113" MIMETYPE="text/html" ADMID="ADM2">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/officers.htm" />
    </METS:file>
    <METS:file ID="FID120" MIMETYPE="text/html" ADMID="ADM3">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/calender.htm" />
    </METS:file>
    <METS:file ID="FID154" MIMETYPE="text/html" ADMID="ADM4">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/newsarchives.htm" />
    </METS:file>
    <METS:file ID="FID1059" MIMETYPE="text/html" ADMID="ADM5">
      <METS:FLocat LOCTYPE="URL" xlink:href="www.apgawomen.org/home.htm" />
    </METS:file>
    ...
  </METS:fileGrp>
</METS:fileSec>
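An inventory like this is easy to generate from a crawl listing. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the function name `build_file_sec` and the tuple layout are assumptions for illustration, and only two of the files are shown.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("METS", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_file_sec(files):
    """Build a METS fileSec holding a single fileGrp.

    `files` is an iterable of (file_id, mime_type, adm_id, url) tuples.
    """
    file_sec = ET.Element(f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    for file_id, mime_type, adm_id, url in files:
        file_el = ET.SubElement(
            file_grp,
            f"{{{METS_NS}}}file",
            {"ID": file_id, "MIMETYPE": mime_type, "ADMID": adm_id},
        )
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url},
        )
    return file_sec


harvest = [
    ("FID18", "text/html", "ADM1", "www.apgawomen.org/"),
    ("FID113", "text/html", "ADM2", "www.apgawomen.org/officers.htm"),
]
file_sec = build_file_sec(harvest)
print(ET.tostring(file_sec, encoding="unicode"))
```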


<METS:structMap>


METS <structMap> View of a Website in a Nutshell

Flattened logical tree hierarchy of three levels:

<div> root entry page, “index.html”

<div> each HTML page

<div> each hyperlink on that page to a page internal to the site


METS <structMap> view of an HTML page

HTML wrapper around parallel elements: page itself, embedded files and hyperlinks:

<div> for the HTML page
  <fptr>
    <par>
      <area> for HTML page + each embedded “parallel” element -- .css, .js, images etc. (with ID-IDREF to file ID in fileSec)
  <div> for each (internal) hyperlinked page


Simple <structMap> Example

• Nigerian Election (April 2003) Testbed

• APGA Women’s Website

• Entry HTML page with:

– two embedded flash files

– one href around an embedded image to non-flash “home”.


<html>
<head>
<title>index</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#000000">
<table width="100%">
  <tr><td>
    <div align="center">
      <object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=4,0,2,0" width="700" height="150">
        <embed src="notjust.swf" quality=high pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash" type="application/x-shockwave-flash" width="700" height="150">
        </embed>
      </object>
    </div>
  </td></tr>
</table>
<table width="100%">
  <tr>
    <td>
      <div align="center">
        <object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=4,0,2,0" width="600" height="64">
          <embed src="apgawnew.swf" quality=high pluginspage="http://www.macromedia.com/shockwave/download/index.cgi?P1_Prod_Version=ShockwaveFlash" type="application/x-shockwave-flash" width="600" height="64">
          </embed>
        </object>
      </div>
    </td>
  </tr>
</table>
<p>&nbsp;</p>
<div align="center">
  <table width="85%">
    <tr><td>
      <div align="right"><a href="home.htm"><img src="enterarrow.gif" width="80" height="27" border="0"></a></div>
    </td></tr>
  </table>
</div>
</body>
</html>


<METS:div DMDID="DM1" TYPE="web page" ID="page18" LABEL="http://dlibdev.nyu.edu/webarchive/metstest/www.apgawomen.org/index.html">
  <METS:fptr>
    <METS:par>
      <METS:area FILEID="FID18"/>    [index.html]
      <METS:area FILEID="FID1036"/>  [notjust.swf]
      <METS:area FILEID="FID1043"/>  [apgawnew.swf]
      <METS:area FILEID="FID1075"/>  [enterarrow.gif]
    </METS:par>
  </METS:fptr>
  <METS:div TYPE="hyperlink" ID="LINK1" LABEL="home">
    <METS:fptr>
      <METS:area BEGIN="000" BETYPE="BYTE" END="111" FILEID="FID18"/>
    </METS:fptr>
  </METS:div>
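The page-level pattern (one `<par>` of parallel areas, then one child `<div>` per internal hyperlink) can be generated mechanically. This is a hedged sketch with `xml.etree.ElementTree`; the helper names `mets` and `page_div` are invented for the example, and the byte-range `<area>` inside each hyperlink `<div>` is left out for brevity.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("METS", METS_NS)


def mets(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def page_div(page_id, label, file_ids, link_labels):
    """Build a <div> for one page: a <par> of areas plus one <div> per link."""
    div = ET.Element(mets("div"), {"TYPE": "web page", "ID": page_id, "LABEL": label})
    par = ET.SubElement(ET.SubElement(div, mets("fptr")), mets("par"))
    for file_id in file_ids:
        # One area per file rendered in parallel: the page and its embeds.
        ET.SubElement(par, mets("area"), {"FILEID": file_id})
    for n, link_label in enumerate(link_labels, start=1):
        ET.SubElement(
            div,
            mets("div"),
            {"TYPE": "hyperlink", "ID": f"LINK{n}", "LABEL": link_label},
        )
    return div


div = page_div(
    "page18",
    "index.html",
    ["FID18", "FID1036", "FID1043", "FID1075"],
    ["home"],
)
print(ET.tostring(div, encoding="unicode"))
```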


<METS:structLink>


LC Flattened Structure: structMap and structLink

<METS:div DMDID="DM01" TYPE="wc:webpage" ID="page18" LABEL="http://dlibdev.nyu.edu/webarchive/metstest/www.apgawomen.org/index.html">
  <METS:fptr>
    <METS:par>
      <METS:area FILEID="FID18"/>
      <METS:area FILEID="FID1036"/>
      <METS:area FILEID="FID1043"/>
      <METS:area FILEID="FID1075"/>
    </METS:par>
  </METS:fptr>
</METS:div>

<METS:structLink>
  <METS:smLink from="page18" to="page1059"/>
  …
  <METS:smLink from="page1059" to="page154"/>
  <METS:smLink from="page1059" to="page237"/>
  <METS:smLink from="page1059" to="page398"/>


Mapping Hyperlink Structure, Redux

<div> in structMap (obliquely) cross-referenced to <smLink> in structLink:

<METS:structLink>
  <METS:smLink from="LINK1" to="page1059" xlink:title="home"/>
  <METS:smLink from="LINK2" to="page113" xlink:title="officers"/>
  <METS:smLink from="LINK3" to="page102" xlink:title="calendar"/>
</METS:structLink>
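A viewer that follows the cross-reference only needs a lookup from the hyperlink `<div>` ID to the target page `<div>` ID. The sketch below reads a structLink shaped like the one on the slide; it assumes unqualified `from` and `to` attributes as written there, and the function name `link_targets` is illustrative.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

struct_link_xml = """
<METS:structLink xmlns:METS="http://www.loc.gov/METS/"
                 xmlns:xlink="http://www.w3.org/1999/xlink">
  <METS:smLink from="LINK1" to="page1059" xlink:title="home"/>
  <METS:smLink from="LINK2" to="page113" xlink:title="officers"/>
  <METS:smLink from="LINK3" to="page102" xlink:title="calendar"/>
</METS:structLink>
"""


def link_targets(xml_text):
    """Map each smLink 'from' ID to its 'to' ID."""
    root = ET.fromstring(xml_text)
    return {
        sm.get("from"): sm.get("to")
        for sm in root.iter(f"{{{METS_NS}}}smLink")
    }


targets = link_targets(struct_link_xml)
print(targets["LINK1"])
```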


Web Archiving Challenges II: Extracted vs Human-Catalogued Metadata

• Lack of influence over content production

– More importantly: embedded metadata

• Technical metadata seen as “safe” because it can be programmatically extracted from the file itself

• Metadata embedded by producers of web pages, e.g. <title> <meta> tags, questionable at best

– Do we want to take descriptive metadata wholesale from <title>, <meta> tags?

– Really?


The Case of the Purloined Metadata


The Case of the Purloined Metadata, continued

<snip>
<HTML>
<!-- saved from url=(0041)http://www.sport.de/spart/sk1/ski006.php3 -->
<HEAD>
<TITLE>Bienvenue sur le site de Front Social</TITLE>
<META CONTENT="text/html; charset=windows-1252" HTTP-EQUIV="Content-Type">
<META CONTENT="Sport sports Baseball Basketball Beach-Volleyball Bob Boxen Bundesliga Bundesligavereine Championsleague DEL DFB DFB-Pokal Eishockey Ergebnisse Europameisterschaft Europapokal Fernsehen Football Formel1 Formel3 Fußball Golf Hallenmasters Handball Hockey Inline-Skating Leichtathletik Motorbike Motorrad Motorsport Nationalmannschaft NBA NFL NHL Reiten Rodeln Schwimmen Skifahren Skispringen Snowboard Sportarten Sportnachrichten Surfen Tennis Tischtennis Turniere Uefa-Cup US Open Vereine Volleyball Wassersport WBA WBC WBO Weltmeisterschaft Weltrangliste Wimbledon Fußball Motorsport Radsport Volleyball Sport Eishockey Skisport Boxen Handball Leichtathletik Pferdesport Schwimmen" NAME="keywords">
<META CONTENT="Sport Sportnachrichten Sportvereine Ergebnisse Tabellen Ranglisten Bundesliga DEL Formel 1 Tennis" NAME="description">
<META CONTENT="thu, 30 mar 2000 12:00:00 GMT" HTTP-EQUIV="date">
<SCRIPT language="JavaScript" SRC="sport_fichiers/sidiscript.js">
<SCRIPT language="JavaScript"><!--
var on = "/ima/pfeil_weiss2.gif";
var off = "/ima/pfeil_weiss.gif";
</snip>
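Pulling the producer's own metadata out of a page is trivial, which is exactly why the result needs review before it goes into a dmdSec. A minimal sketch with Python's `html.parser` follows; the class name `HeadMetadata` is an assumption, and the sample is an abbreviated version of the page above.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Pull <title> text and named <meta> content out of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = attrs.get("name")
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and name:
            self.meta[name.lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Abbreviated from the slide: a French title over German sports keywords.
page = (
    "<HTML><HEAD><TITLE>Bienvenue sur le site de Front Social</TITLE>"
    '<META CONTENT="Sport Sportnachrichten Sportvereine" NAME="description">'
    "</HEAD></HTML>"
)
extractor = HeadMetadata()
extractor.feed(page)
print(extractor.title)
print(extractor.meta["description"])
```

The extraction succeeds, yet the title and the description plainly describe different sites, so the values cannot be trusted as descriptive metadata without a human check.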


<METS:dmdSec>


Case study: Metadata from an Alexa .arc

• Typical Alexa / IA SIP = .arc and .dat files along with byte-offset .ndx file

– IA .arc = 100 MB .gz archive packed with files from web crawl along with server’s HTTP response headers for each file.


Typical Internet Archive .arc snippet

<snip>
[crawler’s file header]
http://www.apgawomen.org:80/calender.htm 63.241.136.203 20030417223125 text/html 2570

[http headers]
HTTP/1.1 200 OK
Date: Thu, 17 Apr 2003 21:35:43 GMT
Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510
Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT
ETag: "3b01d2-8fb-3e335e91"
Accept-Ranges: bytes
Content-Length: 2299
Connection: close
Content-Type: text/html

[file itself]
<html><head><title>calender</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head><body bgcolor="#FFFFFF">
</snip>
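Both the crawler's record header and the HTTP response headers are regular enough to parse directly. The sketch below assumes the five space-separated fields shown in the snippet (URL, IP address, capture timestamp, MIME type, length) and one header per line; the function names are illustrative.

```python
from datetime import datetime


def parse_arc_record_header(line):
    """Split the crawler's one-line record header into named fields."""
    url, ip, timestamp, mime_type, length = line.split()
    return {
        "url": url,
        "ip": ip,
        "captured": datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        "mime_type": mime_type,
        "length": int(length),
    }


def parse_http_headers(block):
    """Parse a status line followed by 'Name: value' header lines."""
    lines = block.strip().splitlines()
    status = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


record = parse_arc_record_header(
    "http://www.apgawomen.org:80/calender.htm 63.241.136.203 "
    "20030417223125 text/html 2570"
)
status, headers = parse_http_headers(
    """
HTTP/1.1 200 OK
Date: Thu, 17 Apr 2003 21:35:43 GMT
Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT
Content-Length: 2299
Content-Type: text/html
"""
)
print(record["captured"].isoformat(), headers["last-modified"])
```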


What is extractable (dmdSec)?

HTTP/1.1 200 OK
Date: Thu, 17 Apr 2003 21:35:43 GMT
Server: Apache/1.3.27 (Unix) FrontPage/5.0.2.2510
Last-Modified: Sun, 26 Jan 2003 04:05:37 GMT
ETag: "3b01d2-8fb-3e335e91"
Accept-Ranges: bytes
Content-Length: 2299
Connection: close
Content-Type: text/html

<html><head><title>calender</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head><body bgcolor="#FFFFFF">
</snip>


Website-Level MODS

<mods:mods>
  <mods:titleInfo>
    <mods:title>Website of the APGA Women</mods:title>
  </mods:titleInfo>
  <mods:genre>Web site</mods:genre>
  <mods:originInfo>
    <mods:dateCaptured encoding="iso8601">20030417</mods:dateCaptured>
  </mods:originInfo>
  <mods:language authority="iso639-2b">eng</mods:language>
  <mods:physicalDescription>
    <mods:internetMediaType>text/html</mods:internetMediaType>
    <mods:internetMediaType>image/jpg</mods:internetMediaType>
    <mods:internetMediaType>image/gif</mods:internetMediaType>
    <mods:internetMediaType>application/msword</mods:internetMediaType>
    <mods:internetMediaType>application/x-shockwave-flash</mods:internetMediaType>
  </mods:physicalDescription>
  <mods:abstract>Supports the All Progressive Grand Alliance political party (APGA). Information on the APGA presidential candidate, Chief Chukwuemeka Odumegwu-Ojukwu. Based in Kennesaw, Georgia.</mods:abstract>
  <mods:subject>
    <mods:topic>Political Parties</mods:topic>
    <mods:geographic>Africa</mods:geographic>
    <mods:geographic>Nigeria</mods:geographic>
  </mods:subject>
  <mods:relatedItem type="host">
    <mods:titleInfo>
      <mods:title>CRL Political Web Archiving Project</mods:title>
    </mods:titleInfo>
    <mods:identifier type="uri">http://www.crl.edu/content/PolitWeb.htm</mods:identifier>
  </mods:relatedItem>
  <mods:identifier displayLabel="Archived site" type="uri">http://dlibdev.nyu.edu/webarchive/metstest/apgawomen/20030417/www.agpawomen.org/</mods:identifier>
</mods:mods>


MINERVA MODS Display


<METS:amdSec> <METS:techMD>


Technical Metadata Sources ( .arc)

• Alexa, Heritrix crawler frontier application

– writes metadata about the harvest itself, the .arc file

• Host server’s HTTP response headers

– metadata about the host server, files recorded

• Captured files themselves

– file headers; IPTC headers -- human input

– Post-processing with JHOVE, ImageMagick etc.


ImageMagick dump for Mao1925.jpg

Image: Mao1925.jpg
Format: JPEG (Joint Photographic Experts Group JFIF format)
Geometry: 142x185
Class: DirectClass
Type: true color
Depth: 8 bits-per-pixel component
Colors: 11423
Resolution: 300x300 pixels
Filesize: 8115b
Interlace: Plane
Background Color: grey100
Border Color: #DFDFDF
Matte Color: grey74
Iterations: 0
Compression: JPEG
signature: 8c173bd33c3e5667d27e51aee539afcd58ccbc8d4a11ab76b127408905f598fd
Tainted: False


<mix:mix>
  <mix:BasicImageParameters>
    <mix:Format>
      <mix:MIMEType>image/jpeg</mix:MIMEType>
      <mix:ByteOrder>little-endian</mix:ByteOrder>
      <mix:Compression>
        <mix:CompressionScheme>5</mix:CompressionScheme>
        <mix:CompressionLevel>0</mix:CompressionLevel>
      </mix:Compression>
      <mix:PhotometricInterpretation>
        <mix:ColorSpace/>
      </mix:PhotometricInterpretation>
    </mix:Format>
    <mix:File>
      <mix:ImageIdentifier>perso.magic.fr/images/Mao1925.jpg</mix:ImageIdentifier>
      <mix:FileSize>8115</mix:FileSize>
    </mix:File>
    <mix:PreferredPresentation/>
  </mix:BasicImageParameters>
  <mix:ImageCreation/>
  <mix:ImagingPerformanceAssessment>
    <mix:SpatialMetrics>
      <mix:ImageWidth>142</mix:ImageWidth>
      <mix:ImageLength>185</mix:ImageLength>
    </mix:SpatialMetrics>
    <mix:Energetics>
      <mix:BitsPerSample>8</mix:BitsPerSample>
    </mix:Energetics>
  </mix:ImagingPerformanceAssessment>
  <mix:ChangeHistory/>
</mix:mix>
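The step from an ImageMagick dump to the MIX values is a simple key-to-element mapping. The sketch below assumes the dump arrives as one "Key: value" pair per line; the function names and the choice of four fields are illustrative, not the project's actual post-processing.

```python
def parse_identify_dump(text):
    """Turn 'Key: value' lines from an ImageMagick dump into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def to_mix_values(fields):
    """Select values that correspond to elements of the MIX record."""
    width, height = fields["Geometry"].split("x")
    return {
        "ImageWidth": int(width),
        "ImageLength": int(height),
        "BitsPerSample": int(fields["Depth"].split()[0]),
        "FileSize": int(fields["Filesize"].rstrip("b")),
    }


dump = """
Image: Mao1925.jpg
Geometry: 142x185
Depth: 8 bits-per-pixel component
Filesize: 8115b
"""
mix = to_mix_values(parse_identify_dump(dump))
print(mix)
```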


Web Archiving Challenges III: Structuring and Managing Versions

Version control-related storage and access issues in a continuous archive:

• Creator-driven changes: successive harvests and versions

– Especially tricky with incremental harvest

• Repository-driven changes: refreshing, migration


Modeling Website Objects with METS in a Continuous Archive

One possibility:

• Root level METS (web site X as intellectual object) with <mptr>s down to

• Intermediary METS (web site X as harvested on April 17, 2003) with <mptr>s down to

• Leaf node METS (single web page in web site X harvested on April 17, 2003)


APGA Women Websites

[Diagram: three harvests of the site (April 17, 2003; December 12, 2003; February 2, 2004). Page labels shown: home.html, about.html and officers.html three times each, news.html once.]


APGA Women Websites

[Diagram: harvests of April 17, 2003; December 12, 2003; February 2, 2004]


Aggregator / Single Capture Model

• METS for top level aggregation that uses <mptr>s to point either to another intermediary aggregator or to one or more captured versions of a web site.

• METS for single standalone captured site, whether part of successive harvests or a one-off capture.


METS Website Aggregator

• Contains single MODS record describing the aggregation as an intellectual object

– e.g. Election 2004; JohnKerry.com (Oct 1-Nov 3)

• Contains no fileSec or structLink

• Contains TBD digiProv, rights in amdSec

• Consists of a root <div> for the aggregation

– nesting <div>s with <mptr>s to each subsidiary aggregation or captured version
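The aggregator's structMap is little more than a root `<div>` with one nested `<div>` and `<mptr>` per subsidiary METS document. A minimal sketch, assuming `xml.etree.ElementTree`; the function name and the three METS file names are hypothetical, while the harvest dates come from the APGA Women example.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("METS", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def aggregator_struct_map(label, children):
    """Build a structMap whose root <div> nests one <div>/<mptr> per child.

    `children` is a list of (label, href) pairs; each href points at another
    METS document (a subsidiary aggregation or a single captured version).
    """
    struct_map = ET.Element(f"{{{METS_NS}}}structMap")
    root_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": label})
    for child_label, href in children:
        child = ET.SubElement(root_div, f"{{{METS_NS}}}div", {"LABEL": child_label})
        ET.SubElement(
            child,
            f"{{{METS_NS}}}mptr",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )
    return struct_map


# File names below are invented for illustration.
captures = [
    ("April 17, 2003", "apgawomen-20030417.xml"),
    ("December 12, 2003", "apgawomen-20031212.xml"),
    ("February 2, 2004", "apgawomen-20040202.xml"),
]
struct_map = aggregator_struct_map("APGA Women Websites", captures)
print(ET.tostring(struct_map, encoding="unicode"))
```

Because the aggregator holds only pointers, adding a new harvest means appending one `<div>` rather than touching any captured version.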


MINERVA Election 2004

[Diagram: aggregation by candidate. Kerry, Nader and Bush, each with captures for Nov 1, Nov 2 and Nov 3.]


MINERVA Election 2004

[Diagram: aggregation by date. November 1, November 2 and November 3, 2004, each with captures for Kerry, Bush and Nader.]


How websites escape from archives

• External links left live

• Internal links not parsed out of FLASH

• Internal links not parsed out of javascript

• .php (etc) files not converted to static HTML

• .js runners or applets with date() functions not disabled


Sealing the archive

• What Crawlers Can Do:

– leave external links live? Or create custom 404s?

– rewrite internal links to relative links

– repair producer-generated relative links

– rewrite dynamic extensions e.g. .php to .html

– successfully parse out javascript, FLASH URLs


Sealing the Archive

• What Viewer Applications can do:

– PANDAS

– METS Viewer


PANDORA Treatment of External Links

<h1>External Links to African Websites</h1>
<p><b>African News links:</b>
<a href="/external.html?link=www-sul.stanford.edu/depts/ssrg/africa/news.html"><br> Latest African news</a><br>
<a href="/external.html?link=kahn.interaccess.com/intelweb/africa.html">More African news sources</a></p>
<p><b>General comprehensive resource links on Africa: </b>
<a href="/external.html?link=www.columbia.edu/cu/libraries/indiv/area/Africa/"><br> Columbia University - African Studies Internet Resources</a>
<a href="/external.html?link=www-sul.stanford.edu/depts/ssrg/africa/guide.html"><br> African South of the Sahara internet resources</a><br>
<a href="/external.html?link=www.sas.upenn.edu/African_Studies/Home_Page/AFR_GIDE.html"> Electronic Guide for African Resources on the Internet - University of Pennsylvania</a><br>
<a href="/external.html?link=www.africa.com/">Africa.com</a><br>
<a href="/external.html?link=www.sourceafrica.com/">Source Africa</a><br>
<a href="/external.html?link=www.africapolicy.org/">African Policy Information Centre</a><br>
<a href="/external.html?link=www.cc.utah.edu/~pks1019">University of Utah - Africa Homepage</a> <br>
<a href="/external.html?link=www.fordham.edu/halsall/africa/africasbook.html">African History Internet Sourcebook</a><br>
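A rewrite of this shape can be approximated with a regular expression over absolute hrefs. The sketch below only imitates the URL pattern visible in the markup above and is not PANDORA's actual implementation; the function name `seal_external_links` and the host `www.example.org` are assumptions.

```python
import re
from urllib.parse import urlsplit

HREF = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)


def seal_external_links(html, site_host):
    """Route absolute links that leave `site_host` through /external.html."""

    def rewrite(match):
        url = match.group(1)
        if urlsplit(url).netloc.lower() == site_host.lower():
            return match.group(0)  # internal link: leave untouched
        target = url.split("://", 1)[1]  # drop the scheme, as in the markup
        return f'href="/external.html?link={target}"'

    return HREF.sub(rewrite, html)


# Hypothetical archived site at www.example.org with one outbound link.
page = (
    '<a href="http://www.africa.com/">Africa.com</a> '
    '<a href="http://www.example.org/about.html">About</a>'
)
sealed = seal_external_links(page, "www.example.org")
print(sealed)
```

A regular expression will miss links assembled by javascript or stored inside Flash, which is why those remain escape routes from the archive.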


METS Viewer


METS Viewer External Links


METS Strengths / Websites

• Suitability as SIP, AIP, DIP

– ease of conversion between Packages (XSLT)

• Cross-referencing of structMap and structLink

– easy implementation using bottom line (XSLT)

• Open Source Community Support

• Implementations from XSLT to Java Apps

• Emergence of Profiles for Interoperability

• Harmonization with LO Metadata schemas


Repositories examining use of METS to manage web materials

• OCLC Digital Archive

• MIT CWSpace (DSpace, OCW)

– Ultimately decided upon IMS-CP

• NDIIPP “The Web at Risk” Partnership

– CDL Digital Preservation Repository

– NYU may use DSpace

– UNT may use CDL repository instance


CDL Digital Preservation Repository

• METS-ready at HTML web page level

• In process of defining full Web Archive Data Model (WADO)

– And the metadata to facilitate ingest, retention and interchange

• Will support METS at website level


DSpace

• Can ingest web material at HTML page level

• Can bundle all the resources for a page

• METS Exporter structMap-less as yet …

• Will support METS import, archiving and export at website level (?)


Links

• http://dlibdev.nyu.edu:8083/xmldev/servlet/SaxonServlet?source=nigerian-root.xml&style=modspage4.xsl

• http://dlibdev.nyu.edu:8083/xmldev/servlet/SaxonServlet?source=apgawomen-root.xml&style=modspage3.xsl

• http://dlibdev.nyu.edu:8083/xmldev/servlet/frames/apgawomen20040202-dspace.xml


For More Information

• Political Communications Web Archive Project

– http://www.crl.edu/content/PolitWeb.htm

• NDIIPP “The Web at Risk” Partnership

– http://www.digitalpreservation.gov/about/pr_093004.html

• IIPC

– http://netpreserve.org/about/index.php

• Heritrix Crawler

– http://crawler.archive.org/


Contact to Chat More

[email protected]

Leslie Myrick

NYU Digital Library Team