Aglin

Collecting Government Web Content at the National Library of Australia

AGLIN Forum 2 May 2012

Paul Koerbin

Manager Web Archiving

National Library of Australia

Web Archiving at the NLA

• Background

• Scale of collections

• Archival collections (selective, bulk, govt)

• Objectives, selection and scope

• Retention and preservation

• Finding government content in PANDORA


• Began web archiving activity in 1996

– http://pandora.nla.gov.au/

• Government content is included in all NLA web collections

– ‘PANDORA Archive’ collection, 1996 to now

• Selective

– The ‘auscrawl’ whole .au domain harvest collections

• Annual since 2005

– The ‘whole-of-government’ collections

• Seed list

• 2011, 2012


• Scale of collecting

– PANDORA (as at April 2012, i.e. 15 years of collecting)

• 31,000 titles

– All govt ~ 55 % of titles

– Commonwealth Govt ~ 12 % of titles

• 75,000 instances

• 145 million files

• 6.5 TB

– Australian .au domain harvests 2005-2011

• 3.5 billion files

• 140 TB

– ‘Whole-of-government’ seed list crawl 2011

• 7.4 million files

• 538 GB
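
To put these figures side by side, the short sketch below derives the average size per harvested file and the approximate number of government titles. It assumes the storage figures on the slide are terabytes and gigabytes in decimal units; the function names are illustrative only and are not part of any NLA system.

```python
# Figures as quoted on the slide (April 2012).
# Assumption: storage is in decimal units (1 TB = 10**12 bytes).
COLLECTIONS = {
    "PANDORA": {"files": 145_000_000, "bytes": 6.5e12},
    ".au domain harvests 2005-2011": {"files": 3_500_000_000, "bytes": 140e12},
    "Whole-of-government crawl 2011": {"files": 7_400_000, "bytes": 538e9},
}


def average_file_kb(files: int, total_bytes: float) -> float:
    """Mean size per harvested file in kilobytes (1 kB = 1000 bytes)."""
    return total_bytes / files / 1000


def titles_for_share(total_titles: int, percent: float) -> int:
    """Number of titles that a percentage share corresponds to."""
    return round(total_titles * percent / 100)


if __name__ == "__main__":
    for name, figures in COLLECTIONS.items():
        size = average_file_kb(figures["files"], figures["bytes"])
        print(f"{name}: about {size:.0f} kB per file")
    print("All govt titles:", titles_for_share(31_000, 55))
    print("Commonwealth titles:", titles_for_share(31_000, 12))
```

On these assumptions the averages come out at roughly 45 kB, 40 kB and 73 kB per file, and the government shares at about 17,050 and 3,720 titles.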


• PANDORA Archive

– Strong representation of govt content including Commonwealth, State and Territory, and local govt (> 50 % of titles)

– Generally does not include whole departmental websites

– Prominent ministerial micro-sites (speeches, press releases)

– Government initiatives websites (e.g. Firearms buyback, 2000)

– Major reports, enquiries, documents (e.g. Gershon Review, 2008)

– Discrete ‘titles’ and ‘instances’ – no links between instances

– Quality checked

– Catalogued and full text indexed

– Accessible through the Trove and PANDORA discovery services


• Whole .au domain harvests (‘auscrawl’)

– Crawls of the entire .au domain (plus some)

– Averages over 1 million hosts crawled each year (av. 650m files)

– Includes gov.au second level domain

– Relies on crawler capabilities and subject to crawler limitations and constraints

– Obeys robots.txt (except for inline image and style elements)

– No quality checking for completeness of harvest or functionality (e.g. look and style)

– Retains linkages between content that is in scope for the crawl

– Full-text and URL indexes

– But not accessible to the public
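
The robots.txt rule above can be illustrated with Python's standard library. This is a minimal sketch, not the crawler's actual configuration: the robots.txt content, the host name, the user agent string and the idea of recognising inline elements by file extension are all assumptions made for the example.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example host.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /images/
"""

# Assumption: inline images and style elements are recognised by extension.
INLINE_EXTENSIONS = (".css", ".gif", ".jpg", ".jpeg", ".png")


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str,
              embedded: bool = False, agent: str = "example-crawler") -> bool:
    """Obey robots.txt, except for images and stylesheets that are
    embedded in a page the crawler was already allowed to collect."""
    path = urlparse(url).path.lower()
    if embedded and path.endswith(INLINE_EXTENSIONS):
        return True
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    logo = "http://www.example.gov.au/images/logo.png"
    print(may_fetch(parser, logo))                 # requested on its own
    print(may_fetch(parser, logo, embedded=True))  # requested as part of a page
```

With this exception the archived page keeps its look and style even where a site blocks its image or stylesheet directories, while disallowed pages themselves are still skipped.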


• Collecting Commonwealth Govt websites

– Whole-of-government arrangements

• Whole-of-government ICT policy

• Secretaries’ ICT Governance Board, 7 May 2010

• AGIMO circular 2010/01

• http://www.finance.gov.au/e-government/strategy-and-governance/Whole-of-Government-ICT-Policies.html

• Covers FMA Act agencies

– CAC Act agencies – still require individual permissions

• Subject to opt-out arrangements

• Replaced the need for individual copyright licence arrangements coordinated through the CCA

• NLA now permitted to collect, preserve and make accessible freely available govt web content

• Whole-of-government collection

– Based on list of specified URLs (most at domain level)

– Around 800 seed URLs

– Only includes FMA Act agency sites

– No QA and fixing

– Obeys robots.txt (except for inline images and style elements)

– Full-text and URL indexes

– No public access yet (but perhaps soon)
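
A seed list crawl of this kind needs a rule for deciding which discovered URLs belong to the collection. The sketch below shows one simple way to express that, assuming a URL is in scope when it sits on a seed's host and under the seed's path. The seed URLs are hypothetical (the real list is not reproduced here), and the scoping rules actually used for the crawl may differ.

```python
from urllib.parse import urlparse

# Hypothetical seeds: one at domain level, one limited to part of a site.
SEEDS = [
    "http://www.example.gov.au/",
    "http://agency.example.gov.au/publications/",
]


def seed_scope(seeds):
    """Reduce each seed URL to a (host, path prefix) pair."""
    scope = []
    for seed in seeds:
        parts = urlparse(seed)
        scope.append((parts.hostname, parts.path or "/"))
    return scope


def in_scope(url, scope):
    """True when the URL is on a seed host and under that seed's path."""
    parts = urlparse(url)
    path = parts.path or "/"
    return any(
        parts.hostname == host and path.startswith(prefix)
        for host, prefix in scope
    )


if __name__ == "__main__":
    scope = seed_scope(SEEDS)
    for url in (
        "http://www.example.gov.au/about/contact.html",
        "http://agency.example.gov.au/news/",
    ):
        print(url, in_scope(url, scope))
```

A seed at domain level has the path prefix "/", so everything on that host is collected, which matches the note that most seeds are specified at domain level.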


• Collecting mandate and objective

– The National Library Act 1960 mandate to build and maintain a national comprehensive collection of material relating to Australia and Australians

– ... and to make the collection available in the national interest

– Objective is about ensuring future and ongoing access to materials of interest to Australia’s social, cultural and publishing heritage

– Not the function of NLA web collecting (archiving) program to satisfy requirements for agencies under the Archives Act 1983


• Government ‘Web Guide’ recordkeeping advice:

– “Archiving websites”

• Mandatory requirement (Archives Act 1983 and Evidence Act 1995)

• seek advice from NAA

– “Retaining access to outdated content”

• Not a mandatory requirement

• Recommends nominating content for inclusion in PANDORA

• Does not ensure safeguarding of content

• Selective

– Create own publicly accessible archive

– Publish advice on how people can access out-of-date content

• New ‘whole-of-government’ web collection

• More inclusive and larger scale than PANDORA

• FMA Act agencies requirement (with ‘opt-out’ provisions)

• CAC Act agencies – opt-in!


• PANDORA selection

– Commonwealth Government publications a priority collecting area

– Methodical approaches have been attempted but ...

– Curator expertise and current awareness

– Stakeholders as nominators (e.g. indexing agencies, other collecting areas in NLA, Parl Library, depts)

– Selecting and scoping

• Whole site, part site, specific documents

• Substance and research value

• Scheduling (when to harvest and how frequently)

• Resources to undertake work

• Technical constraints


• PANDORA collecting

– Websites and web ‘documents’

• documents (discrete files), whole sites, parts of sites

• text, images, video, style elements, client side scripts

– Content is harvested using a crawl robot

• efficient (no work for publisher), automated process

• deposit of complex objects is harder to deal with

– Dynamic content becomes static HTML

• an artefact of the original

• the published version as you would view it from a web browser, not from the content management system

• loses dynamic functionality

• ‘normalising’ process

– Persistent URIs


• Retention of collected web content

– Archiving means preservation

– Long term access

– Collections developed and maintained in perpetuity for future generations

– What is the preservation reality?

• Is access in perpetuity achievable?

– Investing in systems to manage for preservation

• More than preserving the bit stream

• Establishing preservation intent

• Collecting and managing preservation metadata

• Understanding formats and their risks (... and actions?)


• ‘DIY’ archive of your published web content

– Use a subscription service

• Archive-It (Internet Archive) www.archive-it.org

• CDL Web Archiving Service webarchives.cdlib.org

– Build your own with open-source tools

• Heritrix archival crawler crawler.archive.org

• WARC packages

• Wayback interface

– Lightweight approach

• HTTrack (free) offline browser for website snapshots www.httrack.com

– Citation service

• on demand archiving of web resources webcitation.org
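
For anyone building their own archive, it may help to see what a WARC package contains. The sketch below writes a single WARC/1.0 'resource' record using only the Python standard library. It is a simplified illustration: the target URL and payload are made up, the function name is hypothetical, and archival crawlers such as Heritrix normally write fuller 'response' records that also capture the HTTP headers.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri: str, payload: bytes,
                         content_type: str) -> bytes:
    """Serialise one WARC/1.0 'resource' record:
    version line, named header fields, a blank line, the payload,
    then two CRLFs that terminate the record."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    # Hypothetical page content standing in for a harvested file.
    record = warc_resource_record(
        "http://www.example.gov.au/",
        b"<html>hello</html>",
        "text/html",
    )
    print(record.decode("utf-8"))
```

Because each record carries its own URL, capture date and length, many records can be appended to one file and later indexed for replay through an interface such as Wayback.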



• Current and future developments at NLA

– Digital Library Infrastructure Replacement (DLIR) project

• Replacing infrastructure that manages our digital assets

• Will require new web collecting infrastructure and processes

• Already taking steps such as the gov.au seed list crawl

– Some testing of new tools underway (Heritrix, Wayback)

– Opening access to domain harvest content (gov.au)


• Extension of ‘legal deposit’ to digital content

– Attorney-General’s consultation paper

• Submissions closed 14 April

– Proposed model covers:

• physical format digital (mandatory delivery)

• online electronic publications (mandatory delivery on demand)

– May put pressure on NLA resources & priorities

– Already have ‘whole-of-government’ arrangements

• Bulk harvesting of FMA Act agencies’ domains

• Seek ‘opt-in’ from CAC Act agencies


• Finding government content in PANDORA

– Full text search through Trove

• Trove ‘Archived websites 1996 - now’ silo

• All Trove (results in ‘Books’ and ‘Archived websites’)

• PANDORA portal

– Browse lists on PANDORA portal site

• ‘Commonwealth Government’ (263 titles)

– Catalogue (MARC record search)

• NLA online catalogue

• Libraries Australia

• Trove (books silo)

• Search e.g.: innovation industry pandora

– Advanced search options for best results

– ‘Pandora electronic collection’ (MARC 830 series field)

http://www.flickr.com/photos/ricksmit/15671245/


• Government Web Guide and NAA links

– Archiving websites

• http://webguide.gov.au/recordkeeping/archiving-a-website/

– Retaining access to outdated content

• http://webguide.gov.au/recordkeeping/retaining-access-to-outdated-content/

– NAA Archiving Websites advice

• http://www.naa.gov.au/records-management/publications/index.aspx#Archiving-Websites:-Advice-and-Policy-Statement

