ARCOMEM System Overview (Beginner Level) Thomas Risse L3S Research Center Hannover, Germany [email protected]

Arcomem training system-overview_beginner

Nov 29, 2014

This presentation on the ARCOMEM system is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
Transcript
Page 1: Arcomem training system-overview_beginner

ARCOMEM System Overview (Beginner Level)

Thomas Risse

L3S Research Center

Hannover, Germany

[email protected]

Page 2: Arcomem training system-overview_beginner

Overview

Beginner Level
• Approach of current crawlers
• What’s new in ARCOMEM?
• The ARCOMEM Approach
  – Overview of the phases
  – Overview of the processing levels
• Handling of Preservation in ARCOMEM

Advanced Level
• Overview of the system architecture
• Possible ARCOMEM System Configurations

Page 3: Arcomem training system-overview_beginner

Standard Crawlers

Seedlist
http://www.economist.com/node/21534849
http://www.ekathimerini.com/ekathi/comment
http://www.bbc.co.uk/news/world-europe-15589568
http://www.bbc.co.uk/search/news/?q=Greek%20crisis
http://www.guardian.co.uk/business/blog
http://www.kathimerini.gr/
http://twitter.com/#!/EU_Commission

Web Crawler
e.g. Heritrix, HTTrack

1. A seedlist is specified as input for the crawler. This specification may also contain a limited set of crawling parameters, such as the crawl depth or the maximum crawl time. Blacklists of domains can also be given to reduce spam.
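As a minimal sketch, the input described in step 1 could be modelled as a small record. The class and field names (`CrawlSpec`, `max_depth`, `max_seconds`, `blacklist`) and the domain `spam.example` are hypothetical; they are not taken from Heritrix, HTTrack, or ARCOMEM.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class CrawlSpec:
    """Hypothetical input for a standard crawler (all names are illustrative)."""
    seeds: list[str]
    max_depth: int = 2          # assumed crawl-depth limit
    max_seconds: int = 3600     # assumed maximum crawl time
    blacklist: set[str] = field(default_factory=set)  # domains to skip

    def allowed(self, url: str) -> bool:
        """Return True if the URL's host is not in a blacklisted domain."""
        host = urlparse(url).hostname or ""
        return not any(host == d or host.endswith("." + d) for d in self.blacklist)


spec = CrawlSpec(
    seeds=[
        "http://www.economist.com/node/21534849",
        "http://www.bbc.co.uk/news/world-europe-15589568",
    ],
    blacklist={"spam.example"},  # made-up domain for illustration
)
```

Here `spec.allowed(url)` is the check a crawler could run before fetching a link.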

Page 4: Arcomem training system-overview_beginner

Standard Crawlers

[Diagram: Seedlist (as on Page 3), Web Crawler (e.g. Heritrix, HTTrack); step labels: 1, 2 Crawling]

2. The Web crawler collects content from the Web and follows links up to the specified crawl depth.
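Step 2 can be sketched as a depth-limited breadth-first traversal. To stay runnable without network access, the sketch walks a toy in-memory link graph instead of fetching pages; the function name `crawl` and all URLs are assumptions made for illustration, not part of any real crawler.

```python
from collections import deque


def crawl(seeds, links, max_depth):
    """Breadth-first traversal of `links` from `seeds`, at most `max_depth` hops."""
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)      # a real crawler would fetch and store the page here
        if depth == max_depth:
            continue           # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Toy link graph standing in for the Web (hypothetical URLs).
toy_web = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/3"],
}
visited = crawl(["http://a.example/"], toy_web, max_depth=2)
```

With `max_depth=2`, the page three hops away from the seed is never visited.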

Page 5: Arcomem training system-overview_beginner

Standard Crawlers

[Diagram: Seedlist (as on Page 3), Web Crawler (e.g. Heritrix, HTTrack), Storage, Archive; step labels: 1, 2 Crawling, 3]

3. The results of the crawl are stored directly in the Web archive, typically in WARC or ARC format.

Page 6: Arcomem training system-overview_beginner

Standard Crawlers

[Diagram: Seedlist (as on Page 3), Web Crawler (e.g. Heritrix, HTTrack), Storage, Archive, Quality Assurance; step labels: 1, 2 Crawling, 3, 4]

4. Quality assurance is applied as the last step to ensure that all information has been collected and that the pages are fully stored in the archive. Missing URLs are given to the Web crawler for re-crawling.
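One part of step 4, finding URLs that should be in the archive but are not, can be sketched as a simple comparison. The function name `missing_urls` and the two lists are assumptions for illustration; real quality assurance also checks whether each page was stored completely, which this sketch does not cover.

```python
def missing_urls(expected, archived):
    """Return URLs that were expected but are absent from the archive, in input order."""
    archived_set = set(archived)
    return [url for url in expected if url not in archived_set]


expected = [
    "http://www.kathimerini.gr/",
    "http://www.guardian.co.uk/business/blog",
]
archived = ["http://www.kathimerini.gr/"]

# URLs in this list would be handed back to the crawler for re-crawling.
recrawl_queue = missing_urls(expected, archived)
```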

Page 7: Arcomem training system-overview_beginner

What’s new in ARCOMEM?

• Intelligent Crawler
  – Semantically Enhanced Crawl Specification
  – “Understands” the crawl intention
  – Crawler guidance by using social and semantic information
  – Stops crawling at irrelevant pages
  – Two-stage crawling strategy: Web → ARCOMEM Storage → Archive

• Advanced Web Archive Enrichment
  – Semantic Information: Entities, Topics, Opinions, Events (ETOE)
  – Social Context: Interlinking Web and Social Web, Trustworthiness of information and users

• Archivist and End User Support
  – Archivist Tool
  – Searching and browsing Web archives with different facets

Page 8: Arcomem training system-overview_beginner

ARCOMEM Phases: Crawl Specification

1. Intelligent Crawl Specification (ICS)

The ICS describes the intended crawl by specifying keywords, entities, topics, etc., together with reference pages and starting points. Reference pages match the intended crawl content 100% and are used by the crawler to learn more about the crawl.

Entities
Obama, Romney, Biden, Ryan, Republicans, Democrats, …

Keywords
US Election, CommitToMitt, Teaparty, Budget deficit, …

Reference Seedlist
https://twitter.com/whitehouse , https://twitter.com/blog44 , https://twitter.com/BarackObama, ...

Seedlist
http://news.bbc.co.uk/, http://telegraph.co.uk/, ...
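To make the parts of the ICS concrete, the example on this slide could be held in a small data structure. This is only an illustrative sketch: the class `IntelligentCrawlSpec`, its field names, and the `terms` helper are assumptions, not ARCOMEM's actual ICS format.

```python
from dataclasses import dataclass, field


@dataclass
class IntelligentCrawlSpec:
    """Illustrative container for the ICS parts named on this slide."""
    entities: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    reference_seeds: list[str] = field(default_factory=list)  # pages that fully match the crawl intention
    seeds: list[str] = field(default_factory=list)            # starting points

    def terms(self) -> set[str]:
        """Lower-cased entities and keywords, usable for simple matching."""
        return {t.lower() for t in self.entities + self.keywords}


ics = IntelligentCrawlSpec(
    entities=["Obama", "Romney", "Biden", "Ryan", "Republicans", "Democrats"],
    keywords=["US Election", "CommitToMitt", "Teaparty", "Budget deficit"],
    reference_seeds=["https://twitter.com/whitehouse", "https://twitter.com/BarackObama"],
    seeds=["http://news.bbc.co.uk/", "http://telegraph.co.uk/"],
)
```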

Page 9: Arcomem training system-overview_beginner

ARCOMEM Phases: Crawling & Online Processing

2. Crawling & Online Processing
In this phase, web pages and social web content are collected and a first semantic analysis is applied. The analysis result is used to guide the crawler by ranking extracted links by their importance.

All information is stored in the ARCOMEM Storage.

[Diagram: Internet, Crawling, Online Processing, ARCOMEM Storage; crawl specification (Entities, Keywords, Reference Seedlist, Seedlist) as on Page 8]
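The idea of guiding the crawler by ranking extracted links can be sketched with a priority queue. The relevance measure used here (counting specification terms in a link's anchor text) is a deliberately simple stand-in, not ARCOMEM's actual analysis; the function names and URLs are hypothetical.

```python
import heapq


def score(anchor_text, terms):
    """Toy relevance: number of specification terms found in the link's anchor text."""
    text = anchor_text.lower()
    return sum(1 for term in terms if term in text)


def rank_links(links, terms):
    """Return URLs ordered from most to least relevant (ties keep input order)."""
    heap = [(-score(text, terms), i, url) for i, (url, text) in enumerate(links)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


terms = {"obama", "romney", "us election"}
links = [
    ("http://news.example/sports", "Football results"),
    ("http://news.example/debate", "Obama and Romney clash in US election debate"),
    ("http://news.example/obama", "Obama speech"),
]
ranked = rank_links(links, terms)
```

A crawler following this order fetches the most relevant links first; links with a score of zero could be dropped entirely to stop crawling at irrelevant pages.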

Page 10: Arcomem training system-overview_beginner

ARCOMEM Phases: Offline Processing

3. Offline Processing
Offline processing runs after the collection of content has finished. The aim of this phase is to enrich the crawled pages with meta-information extracted from the content. The enrichment helps in selecting content for the final Web archive and eases searching and browsing within it.

[Diagram: Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage; crawl specification (Entities, Keywords, Reference Seedlist, Seedlist) as on Page 8]
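As a rough illustration of enrichment, the sketch below attaches entity counts to a page as meta-information. It uses plain dictionary matching against a list of known names, which is an assumption made to keep the example small; the extraction in ARCOMEM is more sophisticated, and the function name `enrich` is hypothetical.

```python
import re


def enrich(page_text, known_entities):
    """Return metadata for one page: which known entities it mentions and how often."""
    counts = {}
    for entity in known_entities:
        pattern = r"\b" + re.escape(entity) + r"\b"
        hits = re.findall(pattern, page_text, flags=re.IGNORECASE)
        if hits:
            counts[entity] = len(hits)
    return {"entities": counts}


meta = enrich(
    "Obama met Biden. Later Obama spoke about the budget deficit.",
    ["Obama", "Biden", "Romney"],
)
```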

Page 11: Arcomem training system-overview_beginner

ARCOMEM Phases: Appraisal & Selection

4. Based on the information given in the Intelligent Crawl Specification (ICS) and the enrichment of the content, the most interesting content items are selected to be stored in the final Web archive. The final Web archive consists of WARC files, which include the crawled pages and, in RDF format, all enrichments made during offline processing.

[Diagram: Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection, Archive; crawl specification (Entities, Keywords, Reference Seedlist, Seedlist) as on Page 8]
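Selection based on the ICS and the enrichment can be sketched as a filter over enriched items. The overlap-count criterion, the `min_score` threshold, and the item layout are assumptions chosen for illustration; they do not describe ARCOMEM's actual appraisal logic.

```python
def select_for_archive(items, ics_terms, min_score=1):
    """Keep items whose extracted entities share at least `min_score` terms with the ICS."""
    selected = []
    for item in items:
        overlap = ics_terms & {e.lower() for e in item["entities"]}
        if len(overlap) >= min_score:
            selected.append(item["url"])
    return selected


# Hypothetical enriched items as they might look after offline processing.
items = [
    {"url": "http://news.example/debate", "entities": ["Obama", "Romney"]},
    {"url": "http://news.example/sports", "entities": ["FC Example"]},
]
selected = select_for_archive(items, {"obama", "romney"})
```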

Page 12: Arcomem training system-overview_beginner

ARCOMEM Phases: Applications

[Diagram: Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection, Archive, SARA for Broadcaster, Parliaments; crawl specification (Entities, Keywords, Reference Seedlist, Seedlist) as on Page 8]

5. The Search and Retrieval Application (SARA) allows end users to search and browse the archive in different ways, e.g. based on keywords, entities, topics, opinions.
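Searching an archive by different facets can be illustrated with a small inverted index. This is a sketch under stated assumptions: the facet names (`entity`, `topic`), the function names, and the documents are invented for the example and say nothing about how SARA is implemented.

```python
from collections import defaultdict


def build_facet_index(documents):
    """Map (facet, value) pairs to the set of document URLs carrying that value."""
    index = defaultdict(set)
    for doc in documents:
        for facet, values in doc["facets"].items():
            for value in values:
                index[(facet, value.lower())].add(doc["url"])
    return index


def search(index, **filters):
    """Return URLs matching every given facet=value filter (logical AND)."""
    sets = [index.get((facet, value.lower()), set()) for facet, value in filters.items()]
    return set.intersection(*sets) if sets else set()


docs = [
    {"url": "http://news.example/1", "facets": {"entity": ["Obama"], "topic": ["Budget deficit"]}},
    {"url": "http://news.example/2", "facets": {"entity": ["Obama"], "topic": ["US Election"]}},
]
index = build_facet_index(docs)
hits = search(index, entity="Obama", topic="US Election")
```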

Page 13: Arcomem training system-overview_beginner

ARCOMEM Phases: Cross Crawl Analytics

[Diagram: Internet, Crawling, Online Processing, Offline Processing, Cross Crawl Processing, ARCOMEM Storage, Appraisal, Selection, Archive, SARA for Broadcaster, Parliaments; crawl specification (Entities, Keywords, Reference Seedlist, Seedlist) as on Page 8]

6. Cross-crawl analysis allows content analytics across archives. This makes it possible to combine Web archives into a larger collection of documents or to study changes over time, for example the evolution of language or opinions.
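A very small example of studying change over time across crawls is counting how often a term appears in each crawl. The crawl labels, documents, and the function `term_trend` are made up for illustration; real cross-crawl analytics would work on the enriched archives rather than raw strings.

```python
from collections import Counter


def term_trend(crawls, term):
    """Count occurrences of `term` per crawl, returned in chronological (sorted-label) order."""
    trend = []
    for label in sorted(crawls):
        words = Counter(
            w.strip(".,!?").lower() for doc in crawls[label] for w in doc.split()
        )
        trend.append((label, words[term.lower()]))
    return trend


# Hypothetical crawls keyed by year and month.
crawls = {
    "2012-10": ["Budget deficit dominates the debate.", "Deficit talks stall."],
    "2012-09": ["Election campaign starts."],
}
trend = term_trend(crawls, "deficit")
```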

Page 14: Arcomem training system-overview_beginner

Preservation in ARCOMEM

Content Preservation in ARCOMEM
• Selection and appraisal of Web and Social Web content
• Preparation of WARC files for preservation
• Provides access to preserved Web content
• Not part of ARCOMEM are
  – Long-term preservation of WARC files
  – Format handling, etc.

Semantic Preservation in ARCOMEM
• Extraction of Entities, Events, Topics, Opinions
• Enrichment with Linked Data
• Created WARC files contain
  – Raw Web Data
  – RDF triples of enrichment
• Preservation of Linked Data
  – Not part of ARCOMEM
  – See EU Projects: DIACHRON (IP), PRELIDA (CA)

Page 15: Arcomem training system-overview_beginner

THANK YOU

CONTACT DETAILS

Dr. Thomas Risse
+49 511 762 17764

[email protected]