Collecting Government Web Content at
the National Library of Australia
AGLIN Forum 2 May 2012
Paul Koerbin
Manager Web Archiving
National Library of Australia
Web Archiving at the NLA
• Background
• Scale of collections
• Archival collections (selective, bulk, govt)
• Objectives, selection and scope
• Retention and preservation
• Finding government content in PANDORA
• Began web archiving activity in 1996
– http://pandora.nla.gov.au/
• Government content is included in all NLA web collections
– ‘PANDORA Archive’ collection, 1996 to now
• Selective
– The ‘auscrawl’ whole .au domain harvest collections
• Annual since 2005
– The ‘whole-of-government’ collections
• Seed list
• 2011, 2012
• Scale of collecting
– PANDORA (as at April 2012, i.e. 15 years of collecting)
• 31,000 titles
– All govt ~55% of titles
– Commonwealth Govt ~12% of titles
• 75,000 instances
• 145 million files
• 6.5 TB
– Australian .au domain harvests 2005-2011
• 3.5 billion files
• 140 TB
– ‘Whole-of-government’ seed list crawl 2011
• 7.4 million files
• 538 GB
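As a rough sense of scale, the figures above imply an average file size for each collection. The following is a minimal sketch only. It assumes the sizes on the slide are decimal terabytes and gigabytes (the slide does not say) and uses the rounded counts as given.

```python
# Rounded figures from the slide: (number of files, total bytes).
# Assumption: sizes are decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes).
collections = {
    "PANDORA (1996 to April 2012)": (145e6, 6.5e12),
    ".au domain harvests 2005-2011": (3.5e9, 140e12),
    "Whole-of-government crawl 2011": (7.4e6, 538e9),
}

def average_kb(files, total_bytes):
    """Average file size in kilobytes (1 kB = 1000 bytes)."""
    return total_bytes / files / 1000

averages = {
    name: round(average_kb(files, size), 1)
    for name, (files, size) in collections.items()
}

for name, kb in averages.items():
    print(f"{name}: ~{kb} kB per file")
```

On these assumptions each collection averages a few tens of kilobytes per file.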
• PANDORA Archive
– Strong representation of govt content including Commonwealth, State and Territory, and local govt (> 50% of titles)
– Generally does not include whole departmental websites
– Prominent ministerial micro-sites (speeches, press releases)
– Government initiatives websites (e.g. Firearms buyback, 2000)
– Major reports, enquiries, documents (e.g. Gershon Review, 2008)
– Discrete ‘titles’ and ‘instances’ – no links between instances
– Quality checked
– Catalogued and full text indexed
– Accessible through the Trove and PANDORA discovery services
• Whole .au domain harvests (‘auscrawl’)
– Crawls of the entire .au domain (plus some)
– Averages over 1 million hosts crawled each year (av. 650m files)
– Includes gov.au second level domain
– Relies on crawler capabilities and subject to crawler limitations and constraints
– Obeys robots.txt (except for inline image and style elements)
– No quality checking for completeness of harvest or functionality (e.g. look and style)
– Retains linkages between content that is in scope for the crawl
– Full-text and URL indexes
– But not accessible to the public
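The robots.txt behaviour noted above can be sketched with Python's standard `urllib.robotparser`. This is an illustration only, not the NLA crawler's implementation: the user-agent name, the example rules and URLs, and the `is_inline_resource` flag are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, is_inline_resource=False, agent="example-archive-bot"):
    """Obey robots.txt, except for inline images and style elements
    needed to render a page that is already being collected."""
    if is_inline_resource:
        # Assumed exception: inline resources are fetched regardless.
        return True
    return parser.can_fetch(agent, url)

print(may_fetch("http://example.gov.au/index.html"))
print(may_fetch("http://example.gov.au/private/report.html"))
print(may_fetch("http://example.gov.au/images/logo.png", is_inline_resource=True))
```

The exception matters for the look of archived pages: a page that is allowed by robots.txt may still depend on images or stylesheets under a disallowed path.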
• Collecting Commonwealth Govt websites
– Whole-of-government arrangements
• Whole-of-government ICT policy
• Secretaries’ ICT Governance Board, 7 May 2010
• AGIMO circular 2010/01
• http://www.finance.gov.au/e-government/strategy-and-governance/Whole-of-Government-ICT-Policies.html
• Covers FMA Act agencies
– CAC Act agencies still require individual permissions
• Subject to opt-out arrangements
• Replaced the need for individual copyright licence arrangements coordinated through the CCA
• NLA now permitted to collect, preserve and make accessible freely available govt web content
• Whole-of-government collection
– Based on list of specified URLs (most at domain level)
– Around 800 seed URLs
– Only includes FMA Act agency sites
– No QA and fixing
– Obeys robots.txt (except for inline images and style elements)
– Full-text and URL indexes
– No public access yet (but perhaps soon)
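A seed-list crawl of this kind needs a rule for deciding which discovered URLs are in scope. The following is a minimal sketch, assuming a URL is in scope when its host equals a seed's host or is a subdomain of it. The seed URLs and the `in_scope` helper are hypothetical, and the actual crawl's scoping rules are not described in these slides.

```python
from urllib.parse import urlsplit

# Hypothetical seed list; the real list held around 800 URLs.
SEEDS = [
    "http://www.example-agency.gov.au/",
    "http://another-agency.gov.au/",
]

def host_of(url):
    """Lower-cased host name with any leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

SEED_HOSTS = {host_of(seed) for seed in SEEDS}

def in_scope(url):
    """True if the URL's host is a seed host or a subdomain of one."""
    host = host_of(url)
    return any(host == s or host.endswith("." + s) for s in SEED_HOSTS)

print(in_scope("http://www.example-agency.gov.au/reports/2011.pdf"))
print(in_scope("http://www.example.com.au/"))
```

Matching on a leading dot keeps look-alike hosts out: `notanother-agency.gov.au` does not match the seed `another-agency.gov.au`.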
• Collecting mandate and objective
– The National Library Act 1960 mandate to build and maintain a national comprehensive collection of material relating to Australia and Australians
– ... and to make the collection available in the national interest
– Objective is about ensuring future and ongoing access to materials of interest to Australia’s social, cultural and publishing heritage
– It is not the function of the NLA web collecting (archiving) program to satisfy agencies’ requirements under the Archives Act 1983
• Government ‘Web Guide’ recordkeeping advice:
– “Archiving websites”
• Mandatory requirement (Archives Act 1983 and Evidence Act 1995)
• Seek advice from NAA
– “Retaining access to outdated content”
• Not a mandatory requirement
• Recommends nominating content for inclusion in PANDORA
• Does not ensure safeguarding of content
• Selective
– Create own publicly accessible archive
– Publish advice on how people can access out-of-date content
• New ‘whole-of-government’ web collection
– More inclusive and larger scale than PANDORA
– FMA Act agencies requirement (with ‘opt-out’ provisions)
– CAC Act agencies: opt-in!
• PANDORA selection
– Commonwealth Government publications a priority collecting area
– Methodical approaches have been attempted but ...
– Curator expertise and current awareness
– Stakeholders as nominators (e.g. indexing agencies, other collecting areas in NLA, Parl Library, depts)
– Selecting and scoping
• Whole site, part site, specific documents
• Substance and research value
• Scheduling (when to harvest and how frequently)
• Resources to undertake work
• Technical constraints
• PANDORA collecting
– Websites and web ‘documents’
• documents (discrete files), whole sites, parts of sites
• text, images, video, style elements, client side scripts
– Content is harvested using a crawl robot
• efficient (no work for publisher), automated process
• deposit of complex objects is harder to deal with
– Dynamic content becomes static HTML
• an artefact of the original
• the published version as you would view it from a web browser, not from the content management system
• loses dynamic functionality
• ‘normalising’ process
– Persistent URIs
• Retention of collected web content
– Archiving means preservation
– Long term access
– Collections developed and maintained in perpetuity for future generations
– What is the preservation reality?
• Is access in perpetuity achievable?
– Investing in systems to manage for preservation
• More than preserving the bit stream
• Establishing preservation intent
• Collecting and managing preservation metadata
• Understanding formats and their risks (... and actions?)
• ‘DIY’ archive of your published web content
– Use a subscription service
• Archive-It (Internet Archive) www.archive-it.org
• CDL Web Archiving Service webarchives.cdlib.org
– Build your own with open-source tools
• Heritrix archival crawler crawler.archive.org
• WARC packages
• Wayback interface
– Lightweight approach
• HTTrack (free) offline browser for website snapshots www.httrack.com
– Citation service
• on-demand archiving of web resources webcitation.org
• Current and future developments at NLA
– Digital Library Infrastructure Replacement (DLIR) project
• Replacing infrastructure that manages our digital assets
• Will require new web collecting infrastructure and processes
• Already taking steps such as the gov.au seed list crawl
– Some testing of new tools underway (Heritrix, Wayback)
– Opening access to domain harvest content (gov.au)
• Extension of ‘legal deposit’ to digital content
– Attorney-General’s consultation paper
• Submissions closed 14 April
– Proposed model covers:
• physical format digital (mandatory delivery)
• online electronic publications (mandatory delivery on demand)
– May put pressure on NLA resources & priorities
– Already have ‘whole-of-government’ arrangements
• Bulk harvesting of FMA Act agencies’ domains
• Seek ‘opt-in’ from CAC Act agencies
• Finding government content in PANDORA
– Full text search through Trove
• Trove ‘Archived websites 1996 - now’ silo
• All Trove (results in ‘Books’ and ‘Archived websites’)
• PANDORA portal
– Browse lists on PANDORA portal site
• ‘Commonwealth Government’ (263 titles)
– Catalogue (MARC record search)
• NLA online catalogue
• Libraries Australia
• Trove (books silo)
• Search e.g.: innovation industry pandora
– Advanced search options for best results
– ‘Pandora electronic collection’ (MARC 830 series field)
• Government Web Guide and NAA links
– Archiving websites
• http://webguide.gov.au/recordkeeping/archiving-a-website/
– Retaining access to outdated content
• http://webguide.gov.au/recordkeeping/retaining-access-to-outdated-content/
– NAA Archiving Websites advice
• http://www.naa.gov.au/records-management/publications/index.aspx#Archiving-Websites:-Advice-and-Policy-Statement