www.kb.se Kulturarw³ Capturing the web The Swedish experience www.kb.se/kw3
Mar 26, 2015
Content
• Background
• Kulturarw3
– goals
– strategy
– Sweden on the net?
• Harvesting
– Software
– Finding links
– Problems
• Statistics
– What have we got?
• The Archive
– priorities
– storage
– what we save
• Development
– IIPC
– Tools, format
• Conclusion
Background
• Legal deposit, 1661
• Latest revision 1993
– Only electronic documents in fixed form
– CD-ROM, diskettes
• New law
– July 1, 2002: exemption from the personal privacy law
• First Swedish web newspaper lost
– Printed newspapers since 1645
• Kulturarw3 started 1996
• Still waiting for new legal deposit law
Goals
• All web pages in Sweden
– pictures, video etc.
– .se and other top-level domains
– Electronic journals
Strategy: two choices
• Select what is important
– How to know what will be considered important in the future?
– Labour-intensive
• Everything, using automatic software
– Gets everything (well, not really)
– Less labour-intensive
Strategy
• Take snapshots of the Swedish web a few times each year
– Gets “all”
– Needs less labour
– Computer memory is cheap
– However, large volumes make quality control difficult
• Selective harvesting
– about 150 newspapers every day
• In the future: events, e.g. elections
With as little human intervention as possible.
Sweden on the web?
http://www.kb.se/kbstart.htm
Only the domain part is relevant
• .se
• .nu (Niue), popular in Sweden: ”nu” means ”now” in Swedish
• Others, if the server is geographically located in Sweden
• Language?
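The scoping rule on this slide can be sketched in a few lines. This is only an illustration, not the Kulturarw3 implementation: the function name `in_swedish_scope`, the `SWEDISH_TLDS` set and the `server_in_sweden` callback (a stand-in for whatever geographic lookup is actually used) are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumption: TLDs treated as Swedish regardless of where the server is.
SWEDISH_TLDS = {"se", "nu"}


def in_swedish_scope(url, server_in_sweden=lambda host: False):
    """Decide from the host part of a URL whether it is in scope.

    `server_in_sweden` is a hypothetical callback standing in for a
    geographic lookup of the server; by default no other host qualifies.
    """
    host = urlsplit(url).hostname  # only the domain part is relevant
    if not host:
        return False
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld in SWEDISH_TLDS:
        return True
    return server_in_sweden(host)
```

With the default callback, `in_swedish_scope("http://www.kb.se/kbstart.htm")` is true because of the .se ending, while a .com host is in scope only if the supplied lookup says its server is in Sweden. The language question from the slide is left out of the sketch.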
Harvesting software
• A harvester (crawler, spider) collects web pages by automatically following links and saving pages
• Open-source harvester: Heritrix
- Main developer: Internet Archive (IA)
- Written in Java. Active community.
- Designed for archiving, not indexing.
• Earlier: modified version of Combine
- From NetLab, Lund University.
• Important! Indexing isn't archiving and archiving isn't indexing!
• Also collects pictures, sound etc.
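The core step of a harvester, finding the links on a page, might look roughly like the sketch below. It uses only the Python standard library and is not how Heritrix or Combine do it; the names `LinkExtractor` and `extract` are made up for the example, and a real harvester adds scoping, politeness rules, queueing and much more.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect links to follow and inline resources to save."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []    # pages to visit next (<a href>)
        self.inlines = []  # pictures, sound etc. embedded in the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("img", "audio", "video", "embed") and attrs.get("src"):
            self.inlines.append(urljoin(self.base_url, attrs["src"]))


def extract(base_url, html):
    """Return (links, inlines) found in one HTML page, as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links, parser.inlines
```

Relative addresses are resolved against the page's own URL, so each collected link can be queued directly for the next fetch.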
Problems
…or challenges if you are an optimist…
• Scripts
• Interactive pages
• Password protected
• Video/streaming material
• Social sites
Statistics – what did we get?
Bulk crawls (everything Swedish)
• First sweep, 1997, only .se
- 6.8 million files
- 160 GB data
• A sweep 2007-2008, .se and other TLDs
- 270 million files
- 11 500 GB data
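As a rough reading of these figures, the average file size and the growth between the two sweeps can be worked out as below. The only assumption is that GB means 10**9 bytes; the variable and function names are just for the example.

```python
# Figures from the two bulk crawls; GB is assumed to mean 10**9 bytes.
sweeps = {
    "1997": {"files": 6.8e6, "bytes": 160e9},
    "2007-2008": {"files": 270e6, "bytes": 11500e9},
}


def average_file_size_kb(sweep):
    """Average size of one harvested file, in kB (10**3 bytes)."""
    return sweep["bytes"] / sweep["files"] / 1e3


for name, sweep in sweeps.items():
    print(f"{name}: about {average_file_size_kb(sweep):.1f} kB per file")

file_growth = sweeps["2007-2008"]["files"] / sweeps["1997"]["files"]
data_growth = sweeps["2007-2008"]["bytes"] / sweeps["1997"]["bytes"]
print(f"files grew about {file_growth:.0f}x, data about {data_growth:.0f}x")
```

Under that assumption the average file goes from roughly 23.5 kB to roughly 42.6 kB, while the number of files grows about 40 times and the data volume about 72 times.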
Statistics – what did we get?
Periodika (newspapers)
• Started June 2002
• 88 million URLs
• 4.0 TB
• About 40 000 URLs every day
More statistics
Bulk (everything Swedish)
• 823 100 web servers (including inlines)
• 651 700 “Swedish”
- .se 50%
- .nu 21%
- others 29%
• 1549 different MIME types found
– HTML about 50%
– text/html + image/gif + image/jpeg + application/pdf + text/plain: about 97% of the documents
– A lot of garbage, misspellings etc.
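Shares like these can be computed by counting the Content-Type values reported for each document. The sketch below is an assumption about how one might do it, not the project's actual tooling; `normalize_mime` and `mime_shares` are illustrative names. It merges only case and parameter variants and makes no attempt to repair misspelled types.

```python
from collections import Counter


def normalize_mime(raw):
    """Lower-case a Content-Type value and drop parameters such as charset."""
    return raw.split(";", 1)[0].strip().lower()


def mime_shares(content_types):
    """Return each MIME type's share of the documents, most common first."""
    counts = Counter(normalize_mime(ct) for ct in content_types)
    total = sum(counts.values())
    return [(mime, n / total) for mime, n in counts.most_common()]
```

For example, `mime_shares(["text/html; charset=utf-8", "TEXT/HTML", "image/gif", "image/jpeg"])` counts the first two values as the same type, giving text/html a share of one half.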
Trends
• HTML: stable, 50-60%. Increasing lately
• JPEG: increasing, 11% (1997), 27% (2005)
• GIF: decreasing, 23% (1997), 11% (2005)
• PDF: increasing, from 9th to 4th position
Accessing the archive
First priority is to access the archive using traditional web technologies.
Surf, in “space” and time
Free text search
N.B. Not using traditional library methods (cataloguing etc.)
Development
• International Internet Preservation Consortium (IIPC)
– Started by the Internet Archive and the national libraries of Sweden, Norway, Finland, Denmark, Iceland, the UK, France, Italy, Canada, Australia and the USA (LoC). Now many more.
– Develop common standards, tools and methods for web archiving.
– Raise awareness
Development, standards
• Archiving formats
– Earlier formats
• MIME (Multipurpose Internet Mail Extensions)
• ARC
• NedLib
– WARC (Web ARChive file format)
• File format for saving web material
• Each web page is one record in a WARC file
• A record contains metadata and content
• ISO 28500
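To make "a record contains metadata and content" concrete, here is a minimal sketch that builds a single WARC/1.0 record by hand. It is an illustration only and should be checked against ISO 28500 before any real use; the function name `warc_resource_record` is made up, and the simple 'resource' record type is chosen for brevity, whereas a harvester would typically also store the HTTP response in 'response' records.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri, content_type, payload):
    """Build one WARC/1.0 'resource' record as bytes."""
    # Metadata: named header fields describing the harvested content.
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # A record is its header block, then the content, then two CRLFs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"
```

A WARC file is then a sequence of such records, one after another, usually compressed.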
Development, Tools
• Tools
– Harvesting: Heritrix
• Designed for archiving (NOT a modified indexer)
• Open source: Java, Linux etc.
• Supported by IIPC
• Mainly developed by Internet Archive, with contributions
• Will support (now supports) WARC. Supports ARC and MIME
– Surfing tools
• New Wayback Machine
• WERA: surf with time line
• WAXToolbar: support when using the new Wayback Machine
– NutchWAX
• Free text search (with time line)
– Curator tool
• Makes it possible for a non-technician to do collection and quality control
Advice
• Use open standards, open source → IIPC
• Get users of the archive
• Think big. Hundreds of terabytes, billions of files
• Accept that what you do is a best effort
Conclusion
• The web is constantly changing: continuous development
• Possible to get a reasonable picture of the web. But never complete!
• Do something now
Questions? Comments?
Links
• IIPC: www.netpreserve.org
• Kulturarw3: www.kb.se/kw3
• Internet Archive: www.archive.org