ARCOMEM System Overview (Beginner Level)
Thomas Risse, L3S Research Center, Hannover, Germany
Nov 29, 2014
Overview
Beginner Level
• Approach of current crawlers
• What’s new in ARCOMEM?
• The ARCOMEM Approach
  – Overview of the phases
  – Overview of the processing levels
• Handling of Preservation in ARCOMEM
Advanced Level
• Overview of the system architecture
• Possible ARCOMEM System Configurations
Standard Crawlers
Seedlist
• http://www.economist.com/node/21534849
• http://www.ekathimerini.com/ekathi/comment
• http://www.bbc.co.uk/news/world-europe-15589568
• http://www.bbc.co.uk/search/news/?q=Greek%20crisis
• http://www.guardian.co.uk/business/blog
• http://www.kathimerini.gr/
• http://twitter.com/#!/EU_Commission
Web Crawler, e.g. Heritrix, HTTrack
1. A seedlist is specified as input for the crawler. The specification may also contain a few crawling parameters, such as the crawl depth or the maximum crawl time. Blacklists of domains can also be given to reduce spam.
2. The Web crawler collects the content from the Web and follows links up to the specified crawl depth.
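Steps 1 and 2 can be illustrated with a minimal sketch. It is not how Heritrix or HTTrack work internally: the `LINKS` table is a hypothetical in-memory stand-in for the Web, and `crawl`, `max_depth` and `blacklist` are names chosen for this example only.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "Web": page URL -> outgoing links (assumption).
LINKS = {
    "http://example.org/a": ["http://example.org/b", "http://spam.example/x"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": ["http://example.org/d"],
    "http://example.org/d": [],
    "http://spam.example/x": [],
}

def crawl(seedlist, max_depth, blacklist=()):
    """Breadth-first crawl from the seeds, following links up to max_depth."""
    seen = set()
    collected = []
    queue = deque((url, 0) for url in seedlist)
    while queue:
        url, depth = queue.popleft()
        # Skip pages already visited and pages on blacklisted domains.
        if url in seen or urlparse(url).hostname in blacklist:
            continue
        seen.add(url)
        collected.append(url)
        # Only follow links while the depth limit has not been reached.
        if depth < max_depth:
            for link in LINKS.get(url, []):
                queue.append((link, depth + 1))
    return collected

pages = crawl(["http://example.org/a"], max_depth=2, blacklist=["spam.example"])
```

With a crawl depth of 2, page `d` (three links away from the seed) is not collected, and the blacklisted domain is skipped.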
3. The results of the crawl are stored directly in the Web archive, typically in WARC or ARC format.
4. Quality assurance is applied as the last step to ensure that all information has been collected and that the pages are fully stored in the archive. Missing URLs are passed back to the Web crawler for re-crawling.
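The core of the quality assurance step can be sketched as a comparison between the URLs that were expected and the URLs that ended up in the archive. The function name `missing_urls` and the example URLs are assumptions for illustration; real quality assurance also checks page completeness, which this sketch does not cover.

```python
def missing_urls(expected, archived):
    """Return the expected URLs that are absent from the archive, in input order."""
    archived = set(archived)
    return [url for url in expected if url not in archived]

expected = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
archived = ["http://example.org/a", "http://example.org/c"]

# These URLs would be handed back to the crawler for re-crawling.
recrawl = missing_urls(expected, archived)
```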
What’s new in ARCOMEM?
• Intelligent Crawler
  – Semantically Enhanced Crawl Specification
  – “Understands” the crawl intention
  – Crawler guidance by using social and semantic information
  – Stops crawling at irrelevant pages
  – Two-stage crawling strategy: Web → ARCOMEM Storage → Archive
• Advanced Web Archive Enrichment
  – Semantic Information: Entities, Topics, Opinions, Events (ETOE)
  – Social Context: Interlinking of Web and Social Web, Trustworthiness of information and users
• Archivist and End User Support
  – Archivist Tool
  – Searching and browsing Web archives with different facets
ARCOMEM Phases: Crawl Specification
1. Intelligent Crawl Specification (ICS)
The ICS describes the intended crawl by specifying keywords, entities, topics, etc., together with reference pages and starting points. Reference pages match the intended crawl content completely (100%) and are used by the crawler to learn more about the crawl.
Entities: Obama, Romney, Biden, Ryan, Republicans, Democrats, …
Keywords: US Election, CommitToMitt, Teaparty, Budget deficit, …
Reference Seedlist: https://twitter.com/whitehouse, https://twitter.com/blog44, https://twitter.com/BarackObama, ...
Seedlist: http://news.bbc.co.uk/, http://telegraph.co.uk/, ...
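As a rough illustration, the four parts of the ICS shown above could be held in a simple data structure. The class `CrawlSpecification`, its field names and the `start_urls` helper are hypothetical and do not reflect the actual ARCOMEM format. Only the values listed on the slide are used; the slide's lists are truncated.

```python
from dataclasses import dataclass, field

@dataclass
class CrawlSpecification:
    """Hypothetical container for the four ICS parts shown on the slide."""
    entities: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    reference_seedlist: list = field(default_factory=list)
    seedlist: list = field(default_factory=list)

    def start_urls(self):
        """Reference pages first, then ordinary seeds, without duplicates."""
        return list(dict.fromkeys(self.reference_seedlist + self.seedlist))

ics = CrawlSpecification(
    entities=["Obama", "Romney", "Biden", "Ryan", "Republicans", "Democrats"],
    keywords=["US Election", "CommitToMitt", "Teaparty", "Budget deficit"],
    reference_seedlist=[
        "https://twitter.com/whitehouse",
        "https://twitter.com/blog44",
        "https://twitter.com/BarackObama",
    ],
    seedlist=["http://news.bbc.co.uk/", "http://telegraph.co.uk/"],
)
```

Keeping the reference seedlist separate from the ordinary seedlist reflects the distinction made above: reference pages are known to match the crawl intention, while ordinary seeds are only starting points.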
ARCOMEM Phases: Crawling & Online Processing
2. Crawling & Online Processing
In this phase, Web pages and Social Web content are collected and a first semantic analysis is applied. The analysis result is used to guide the crawler by ranking the extracted links by their importance.
All information is stored in the ARCOMEM Storage.
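The idea of guiding the crawler by ranking links can be sketched with a priority queue. The scoring used here, the fraction of ICS terms found in the text around a link, is an assumption for illustration and not the ARCOMEM ranking method. The URLs and anchor texts are made up.

```python
import heapq

def relevance(anchor_text, terms):
    """Fraction of ICS terms that occur in the text around a link."""
    text = anchor_text.lower()
    return sum(term.lower() in text for term in terms) / len(terms)

def rank_links(links, terms):
    """Return URLs ordered from most to least relevant."""
    heap = []
    for order, (url, anchor_text) in enumerate(links):
        # heapq is a min-heap, so the score is negated; 'order' keeps ties stable.
        heapq.heappush(heap, (-relevance(anchor_text, terms), order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

terms = ["Obama", "Romney", "US Election", "Budget deficit"]
links = [
    ("http://example.org/sports", "Football results"),
    ("http://example.org/debate", "Obama and Romney debate the budget deficit"),
    ("http://example.org/poll", "New poll on the US election"),
]
ranked = rank_links(links, terms)
```

The crawler would fetch `ranked[0]` first. A link with a score of zero could be dropped entirely, which corresponds to stopping the crawl at irrelevant pages.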
[Figure: the crawl specification (entities, keywords, reference seedlist, seedlist) with the components Internet, Crawling, Online Processing and ARCOMEM Storage]
ARCOMEM Phases: Offline Processing
3. Offline Processing
Offline processing runs after the collection of content has finished. The aim of this phase is to enrich the crawled pages with meta-information extracted from the content. The enrichment helps in selecting content for the final Web archive. Furthermore, it eases searching and browsing within the final Web archive.
[Figure: the crawl specification (entities, keywords, reference seedlist, seedlist) with the components Internet, Crawling, Online Processing, Offline Processing and ARCOMEM Storage]
ARCOMEM Phases: Appraisal & Selection
4. Based on the information given in the Intelligent Crawl Specification (ICS) and the enrichment of the content, the most interesting content items are selected to be stored in the final Web archive. The final Web archive consists of WARC files, which include the crawled pages and, in RDF format, all enrichments produced during offline processing.
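A minimal sketch of the selection step is given below, assuming that each crawled item already carries the entities and topics found during offline processing. The overlap score and the `threshold` parameter are assumptions made for this example; the actual ARCOMEM appraisal criteria are not described on the slide.

```python
def select_for_archive(items, ics_terms, threshold=0.5):
    """Keep items whose annotations cover at least 'threshold' of the ICS terms."""
    selected = []
    for item in items:
        annotations = {a.lower() for a in item["entities"] + item["topics"]}
        hits = sum(term.lower() in annotations for term in ics_terms)
        if hits / len(ics_terms) >= threshold:
            selected.append(item["url"])
    return selected

ics_terms = ["Obama", "Romney"]
items = [
    {"url": "http://example.org/debate",
     "entities": ["Obama", "Romney"], "topics": ["US Election"]},
    {"url": "http://example.org/sports",
     "entities": ["FC Example"], "topics": ["Football"]},
]
selection = select_for_archive(items, ics_terms)
```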
[Figure: the crawl specification (entities, keywords, reference seedlist, seedlist) with the components Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection and Archive]
ARCOMEM Phases: Applications
[Figure: the crawl specification (entities, keywords, reference seedlist, seedlist) with the components Internet, Crawling, Online Processing, Offline Processing, ARCOMEM Storage, Appraisal, Selection, Archive and SARA for Broadcasters and Parliaments]
5. The Search and Retrieval Application (SARA) allows end users to search and browse the archive in different ways, e.g. based on keywords, entities, topics, opinions.
ARCOMEM Phases: Cross Crawl Analytics
[Figure: the crawl specification (entities, keywords, reference seedlist, seedlist) with the components Internet, Crawling, Online Processing, Offline Processing, Cross Crawl Processing, ARCOMEM Storage, Appraisal, Selection, Archive and SARA for Broadcasters and Parliaments]
6. Cross-crawl analysis enables content analytics across archives. Web archives can be combined to obtain a larger collection of documents or to study evolution over time, for example the evolution of language or opinions.
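As a small illustration of studying evolution over time, the sketch below counts, for each crawl, how many documents mention a given entity. The layout of `archives` (crawl label mapped to a list of annotated documents) and the example data are assumptions, not the ARCOMEM storage model.

```python
def entity_trend(archives, entity):
    """Map each crawl label to the number of documents mentioning the entity."""
    trend = {}
    for label, documents in sorted(archives.items()):
        trend[label] = sum(entity in doc["entities"] for doc in documents)
    return trend

# Hypothetical archives from two crawls, with entities already extracted.
archives = {
    "2012-11": [{"entities": ["Obama"]}, {"entities": ["Obama", "Biden"]}],
    "2012-10": [{"entities": ["Obama", "Romney"]}, {"entities": ["Romney"]}],
}
trend = entity_trend(archives, "Obama")
```

The same loop could aggregate topics or opinions instead of entities, provided the enrichment data from offline processing is available for every archive.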
Preservation in ARCOMEM
Content Preservation in ARCOMEM
• Selection and appraisal of Web and Social Web content
• Preparation of WARC files for preservation
• Provides access to preserved Web content
• Not part of ARCOMEM are
  – Long-term preservation of WARC files
  – Format handling, etc.
Semantic Preservation in ARCOMEM
• Extraction of Entities, Events, Topics, Opinions
• Enrichment with Linked Data
• Created WARC files contain
  – Raw Web Data
  – RDF triples of enrichment
• Preservation of Linked Data
  – Not part of ARCOMEM
  – See EU Projects: DIACHRON (IP), PRELIDA (CA)