www.kb.se Kulturarw³ Capturing the web The Swedish experience www.kb.se/kw3

Mar 26, 2015

Michelle Reyes
Transcript
Page 1: Www.kb.se Kulturarw³ Capturing the web The Swedish experience .

www.kb.se

Kulturarw³: Capturing the web

The Swedish experience

www.kb.se/kw3

Page 2

Content

• Background
• Kulturarw3
  – Goals
  – Strategy
  – Sweden on the net?
• Harvesting
  – Software
  – Finding links
  – Problems
• Statistics
  – What have we got?
• The Archive
  – Priorities
  – Storage
  – What we save
• Development
  – IIPC
  – Tools, format
• Conclusion

Page 3

Background

• Legal deposit since 1661
• Latest revision 1993
  – Covers only electronic documents in fixed form
  – CD-ROM, diskettes
• New law, 1 July 2002
  – Exception from the personal privacy law
• First Swedish web newspaper lost
  – Printed newspapers since 1645
• Kulturarw3 started in 1996
• Still waiting for a new legal deposit law

Page 4

Goals

• All web pages in Sweden
  – Pictures, video etc.
  – .se and other top-level domains
  – Electronic journals

Page 5

Strategy: two choices

• Select what is important
  – How to know what will be considered important in the future?
  – Labour-intensive
• Collect everything using automatic software
  – Gets everything (well, not really)
  – Less labour-intensive

Page 6

Strategy

• Take snapshots of the Swedish web a few times each year
  – Gets “all”
  – Needs less labour
  – Computer memory is cheap
  – However, large volumes make quality control difficult
• Selective harvesting: about 150 newspapers every day
• In the future: events, e.g. elections

With as little human intervention as possible.

Page 7

Sweden on the web?

• .se
• .nu (Niue), popular in Sweden: “nu” means “now” in Swedish
• Others, if the server is geographically located in Sweden
• Language?

Example: http://www.kb.se/kbstart.htm (only the domain part is relevant)
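
The domain-based rule on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's actual scoping code: the function name `in_scope` and the set `HOSTS_LOCATED_IN_SWEDEN` (a stand-in for a real server-location lookup) are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Top-level domains treated as Swedish on the slide.
SWEDISH_TLDS = {"se", "nu"}

# Hypothetical stand-in for a geolocation lookup: hosts under other
# top-level domains that are assumed to be served from Sweden.
HOSTS_LOCATED_IN_SWEDEN = {"www.example.com"}


def in_scope(url: str) -> bool:
    """Return True if the URL's host counts as part of the Swedish web.

    Only the domain part of the URL is inspected; path and query are ignored.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    tld = host.rsplit(".", 1)[-1]
    if tld in SWEDISH_TLDS:
        return True
    return host in HOSTS_LOCATED_IN_SWEDEN


print(in_scope("http://www.kb.se/kbstart.htm"))
print(in_scope("http://www.example.org/"))
```

The language question from the slide is left out here, since detecting language needs the page content rather than the URL.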

Page 8

Harvesting software

• A harvester (crawler, spider) collects web pages by automatically following links and saving pages

• Open-source harvester: Heritrix
  - Main developer: Internet Archive (IA)
  - Written in Java. Active community.
  - Designed for archiving, not indexing.
• Earlier: modified version of Combine
  - From NetLab, Lund University.
• Important! Indexing isn't archiving and archiving isn't indexing!
• Also collects pictures, sound etc.
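
To make "following links and saving pages" concrete, here is a small sketch of the general idea. It is not how Heritrix or Combine are implemented; the names `LinkExtractor`, `harvest` and `FAKE_WEB` are invented for the example, and an in-memory dictionary replaces real network access so the code runs as is.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect URLs from href and src attributes (pages, pictures, sound)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def harvest(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns text or None."""
    archive = {}
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        archive[url] = body  # "saving" the page
        parser = LinkExtractor(url)
        parser.feed(body)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archive


# Tiny made-up "web"; the image body is an empty placeholder string.
FAKE_WEB = {
    "http://a.se/": '<a href="/b.html">b</a><img src="logo.gif">',
    "http://a.se/b.html": '<a href="http://a.se/">home</a>',
    "http://a.se/logo.gif": "",
}

archive = harvest("http://a.se/", FAKE_WEB.get)
print(sorted(archive))
```

A real harvester would also need politeness delays, scope rules and binary-safe storage, none of which are shown here.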

Page 9

Problems

…or challenges if you are an optimist…

• Scripts

• Interactive pages

• Password protected

• Video/streaming material

• Social sites

Page 10

Statistics – what did we get?

Bulk crawls (everything Swedish)

• First sweep, 1997, .se only
  - 6.8 million files
  - 160 GB of data
• A sweep in 2007-2008, .se and other TLDs
  - 270 million files
  - 11,500 GB of data

Page 11

Statistics – what did we get?

Periodika (newspapers)

• Started June 2002
• 88 million URLs
• 4.0 TB
• About 40 000 URLs every day

Page 12

More statistics

Bulk (everything Swedish)

• 823 100 web servers (including inlines)

• 651 700 “Swedish”
  - .se 50%
  - .nu 21%
  - others 29%
• 1549 different MIME types found
  – HTML about 50%
  – text/html + image/gif + image/jpeg + application/pdf + text/plain make up about 97% of the documents
  – A lot of garbage, misspellings etc.
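
Shares like the ones above can be computed by tallying the MIME type reported for each harvested file. The sketch below assumes the types are available as a list of strings; the function name `mime_shares` and the sample values are made up for illustration and are not archive data.

```python
from collections import Counter


def mime_shares(mime_types):
    """Return {mime_type: fraction of files}, most common first."""
    cleaned = []
    for raw in mime_types:
        # Drop parameters such as "; charset=..." and normalise case.
        cleaned.append(raw.split(";", 1)[0].strip().lower())
    counts = Counter(cleaned)
    total = sum(counts.values())
    return {mime: n / total for mime, n in counts.most_common()}


# Invented sample, including one misspelt type ("text/htm").
sample = [
    "text/html",
    "TEXT/HTML; charset=iso-8859-1",
    "image/gif",
    "image/jpeg",
    "text/htm",
]
shares = mime_shares(sample)
for mime, share in shares.items():
    print(f"{mime}: {share:.0%}")
```

Misspelt types such as `text/htm` are counted as their own category here, which is one way the "garbage" shows up in the statistics.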

Page 13

Trends

• HTML: stable at 50-60%, increasing lately
• JPEG: increasing, 11% (-97), 27% (-05)
• GIF: decreasing, 23% (-97), 11% (-05)
• PDF: increasing, from 9th to 4th position

Page 14

Accessing the archive

First priority is to access the archive using traditional web technologies.

Surf, in “space” and time

Free text search

NB: not using traditional library methods (cataloguing etc.)
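
Surfing "in time" means that a request names both a URL and a date, and the archive answers with the capture closest to that date. The following is a minimal sketch of such a lookup, assuming a per-URL list of sorted capture timestamps; the index `CAPTURES`, its timestamps and the function `closest_capture` are invented for the example and do not describe any particular access tool.

```python
from bisect import bisect_left
from datetime import datetime

FMT = "%Y%m%d%H%M%S"

# Hypothetical index: URL -> sorted capture timestamps (YYYYMMDDhhmmss).
CAPTURES = {
    "http://www.kb.se/": ["19970601120000", "20020615080000", "20071101093000"],
}


def closest_capture(url, wanted):
    """Return the capture timestamp nearest in time to `wanted`, or None."""
    stamps = CAPTURES.get(url)
    if not stamps:
        return None
    # Fixed-width timestamps sort the same way as the dates they encode.
    i = bisect_left(stamps, wanted)
    candidates = stamps[max(i - 1, 0):i + 1]
    target = datetime.strptime(wanted, FMT)
    return min(candidates, key=lambda s: abs(datetime.strptime(s, FMT) - target))


print(closest_capture("http://www.kb.se/", "20030101000000"))
```

Only the neighbours on either side of the requested date need to be compared, which keeps the lookup fast even with many captures per URL.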

Page 15

Development

• International Internet Preservation Consortium (IIPC)
  – Started by the Internet Archive and the national libraries of Sweden, Norway, Finland, Denmark, Iceland, the UK, France, Italy, Canada, Australia and the USA (LoC)
  – Now many more members

– Develop common standards, tools and methods for web archiving.

– Raise awareness

Page 16

Development, standards

• Archiving formats
  – Earlier formats
    • MIME (Multipurpose Internet Mail Extensions)
    • ARC
    • NedLib
  – WARC (Web ARChive file format)
    • File format for saving web material
    • Each web page is one record in a WARC file
    • A record contains metadata and content
    • ISO 28500
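
As an illustration of "a record contains metadata and content", the sketch below builds one simplified WARC 1.0 record of type `resource` by hand. It is a teaching example under stated assumptions, not a validated WARC writer: the function name `build_warc_record` is invented, and real harvests would normally store full HTTP responses with more header fields.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, content_type: str, payload: bytes) -> bytes:
    """Build one simplified WARC 'resource' record as bytes."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",                                   # version line
        "WARC-Type: resource",                        # kind of record
        f"WARC-Target-URI: {target_uri}",             # what was captured
        f"WARC-Date: {now}",                          # when it was captured
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique record id
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",            # size of the content block
    ]
    # Metadata lines end with CRLF; an empty line separates them from content.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLF pairs after the content block.
    return head + payload + b"\r\n\r\n"


record = build_warc_record("http://www.kb.se/", "text/html", b"<html></html>")
print(record.decode("utf-8"))
```

A WARC file is then a concatenation of such records, one per captured web resource.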

Page 17

Development, Tools

• Tools
  – Harvesting: Heritrix
    • Designed for archiving (NOT a modified indexer)
    • Open source: Java, Linux etc.
    • Supported by IIPC
    • Mainly developed by the Internet Archive, with contributions from others
    • Supports ARC and MIME; WARC support is planned (or already available)
  – Surfing tools
    • New Wayback Machine
    • WERA: surf with a time line
    • WAXToolbar: support when using the new Wayback Machine
  – NutchWax
    • Free text search (with time line)
  – Curator tool
    • Makes it possible for a non-technician to do collection and quality control

Page 18

Advice

• Use open standards and open source → IIPC
• Get users for the archive
• Think big: hundreds of terabytes, billions of files
• Accept that what you do is a best effort

Page 19

Conclusion

• The web is constantly changing, so continuous development is needed.
• It is possible to get a reasonable picture of the web, but never a complete one!
• Do something now

Page 20

Questions? Comments?

Page 21

Links

• IIPC: www.netpreserve.org

• Kulturarw3: www.kb.se/kw3

• Internet Archive: www.archive.org