Blogs are a dynamic communication medium that has become widely established on the web. The BlogForever project has developed an innovative system to harvest, preserve, manage and reuse blog content. This paper presents a key component of the BlogForever platform, the web crawler. More precisely, our work concentrates on techniques to automatically extract content such as articles, authors, dates and comments from blog posts. To achieve this goal, we introduce a simple and robust algorithm that generates extraction rules based on string matching, using the blog’s web feed in conjunction with the blog hypertext. This approach leads to a scalable blog data extraction process. Furthermore, we show how we integrate a web browser into the web harvesting process in order to support data extraction from blogs with JavaScript-generated content.
Our Contributions
• A web crawler capable of extracting blog articles, authors, publication dates and comments.
• A new algorithm to build extraction rules from blog web feeds with linear time complexity.
• Applications of the algorithm to extract authors, publication dates and comments.
• A new web crawler architecture, including the use of a complete web browser to render JavaScript web pages before processing them.
• An extensive evaluation of the content extraction quality and execution time of our algorithm against three state-of-the-art web article extraction algorithms.
Motivation
• Extracting metadata and content from HTML documents is a challenging task:
– Web standards usage is low (<0.5% of websites).
– More than 95% of websites do not pass HTML validation.
• Having blogs as our target websites, we made the following observations, which play a central role in the extraction process:
a) Blogs provide web feeds: structured, standardized XML views of the latest posts of a blog.
b) Posts of the same blog share a similar HTML structure.
c) Web feeds usually reference only 10-20 posts, whereas blogs contain many more, so we have to access more posts than the ones referenced in the web feed.
Content Extraction Overview
1. Use blog web feeds and the referenced HTML pages as training data to build extraction rules.
2. Build extraction rules capable of locating, in any of the blog's HTML pages, all elements referenced in the web feed, such as titles, authors, publication dates and content.
3. Use the resulting extraction rules to process all blog pages.
Locate all RSS-referenced elements in the HTML page
Generic procedure to build extraction rules
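A minimal sketch of this procedure in Python, assuming lxml: for one feed value (e.g. a post title) and the HTML page it came from, score every text-bearing element against the value and keep the XPath of the best match. The names build_rule and score_fn are illustrative, not from the paper.

from lxml import html

def build_rule(page_source, feed_value, score_fn):
    """Return (xpath, score) of the element whose text best matches feed_value."""
    tree = html.fromstring(page_source)
    best_xpath, best_score = None, 0.0
    for element in tree.iter():
        if not isinstance(element.tag, str):  # skip comments and processing instructions
            continue
        text = element.text_content().strip()
        if not text:
            continue
        score = score_fn(text, feed_value)
        if score > best_score:
            best_xpath = tree.getroottree().getpath(element)
            best_score = score
    return best_xpath, best_score

Running this once per feed entry and keeping the rule that scores best across all entries yields one extraction rule per field (title, author, publication date, content).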
Extraction rules and string similarity
• Rules are XPath queries.
• For each candidate rule, we compute a score based on the string similarity between the element's text and the feed value.
• The choice of ScoreFunction greatly influences the running time and precision of the extraction process.
• Why we chose the Sørensen–Dice coefficient (sketched below):
1. Low sensitivity to word ordering and to length variations.
2. It runs in linear time.
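A minimal sketch of the score function, computed over character bigrams (the standard string variant of Sørensen–Dice; the paper's exact tokenization is an assumption here):

from collections import Counter

def dice_similarity(a: str, b: str) -> float:
    """Sørensen–Dice coefficient over character bigrams; runs in O(|a| + |b|)."""
    if len(a) < 2 or len(b) < 2:
        return 1.0 if a == b else 0.0
    bigrams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    bigrams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    overlap = sum((bigrams_a & bigrams_b).values())  # multiset intersection
    return 2.0 * overlap / ((len(a) - 1) + (len(b) - 1))

Because the bigram multisets ignore global order, reshuffled words still score highly. For instance, against a hypothetical page title "volumelaser.eim.gr | vbanos.gr", the feed title "volumelaser.eim.gr" scores about 0.74.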
Example: finding the best extraction rule for a blog post title
• RSS feed: http://vbanos.gr/en/feed/
• Find the RSS blog post title “volumelaser.eim.gr” in the HTML page http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/
• Candidate rules are compared in a table of XPath, HTML element value and similarity; the highest-scoring XPath is kept as the extraction rule.
Crawler workflow
1. Render HTML and JavaScript.
2. Extract content.
3. Extract comments.
4. Download multimedia files.
5. Propagate the resulting records to the back-end.
• Interesting areas:
– Blog post page identification.
– Handling blogs with a large number of pages.
– JavaScript rendering.
– Scalability.
Blog post identification
• The crawler visits all blog pages.
• For each URL, it needs to identify whether it points to a blog post or not.
• We construct a regular expression from the blog post URLs listed in the RSS feed to identify blog posts (see the sketch below).
• We assume that all posts of the same blog use the same URL pattern.
• This assumption has held for all blog platforms we have encountered.
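One way to build such a regular expression, sketched under the slide's assumption that all example URLs share one pattern (and hence the same path depth). Path segments identical across all examples stay literal; varying segments become wildcards. The real crawler may generalize further (e.g. digit runs for dates).

import re
from urllib.parse import urlsplit

def post_url_pattern(feed_post_urls):
    """Generalize the feed's post URLs into one regular expression."""
    paths = [urlsplit(u).path.strip("/").split("/") for u in feed_post_urls]
    parts = []
    for segments in zip(*paths):
        unique = set(segments)
        # keep shared segments literally, replace varying ones with a wildcard
        parts.append(re.escape(unique.pop()) if len(unique) == 1 else r"[^/]+")
    return re.compile("^/" + "/".join(parts) + "/?$")

A candidate URL is then classified as a post when post_url_pattern(urls).match(urlsplit(candidate).path) succeeds.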
Handle blogs with a large number of pages
• Avoid random walk of pages, depth first search or breadth first search.
• Use a priority queue with machine learning defined priorities.
• Pages with a lot of blog post URLs have a higher priority.• Use Distance-Weighted kNN classifier to predict.
– Whenever a new page is downloaded, it is given to the machine learning system as training data.
– When the crawler encounters a new URL, it will ask the machine learning system for the potential number of blog posts and use the value as the download priority of the URL.
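A hypothetical sketch of this scheduler. Since the predicted quantity is a count, the sketch uses scikit-learn's distance-weighted kNN regressor rather than a classifier; the feature vector for a URL (e.g. path depth, token counts) is left abstract.

import heapq
from sklearn.neighbors import KNeighborsRegressor

class CrawlScheduler:
    """Priority queue over frontier URLs, prioritized by the predicted
    number of blog-post links on the page behind each URL."""

    def __init__(self, k=3):
        self.model = KNeighborsRegressor(n_neighbors=k, weights="distance")
        self.features, self.targets = [], []  # accumulated training data
        self.frontier = []                    # heapq min-heap of (-priority, url)

    def observe(self, page_features, post_link_count):
        # Every downloaded page becomes a new training example.
        self.features.append(page_features)
        self.targets.append(post_link_count)
        self.model.fit(self.features, self.targets)  # kNN "fit" just stores the data

    def enqueue(self, url, page_features):
        if len(self.features) >= self.model.n_neighbors:
            priority = float(self.model.predict([page_features])[0])
        else:
            priority = 0.0  # cold start: no prediction available yet
        heapq.heappush(self.frontier, (-priority, url))

    def next_url(self):
        return heapq.heappop(self.frontier)[1]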
JavaScript rendering
• JavaScript is a widely used client-side language.
• Traditional HTML-based crawlers do not see web page content that is generated with JavaScript.
• We embed PhantomJS, a headless web browser with good performance and scripting capabilities.
• We instruct the PhantomJS browser to click dynamic JavaScript pagination buttons on pages to retrieve more content (e.g. the Disqus “Show More” button that reveals additional comments); a sketch follows below.
• This crawler functionality is non-generic and requires human intervention to maintain and to extend to other cases.
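A minimal sketch of the idea, assuming a Selenium 3.x release that still ships the (since-deprecated) PhantomJS driver; the CSS selector is illustrative, not the real Disqus markup.

import time
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://vbanos.gr/blog/2014/03/09/volumelaser-eim-gr-2/")

# Keep clicking the pagination button until it disappears, so the
# rendered DOM eventually contains all comments.
while True:
    buttons = driver.find_elements_by_css_selector("a.load-more")  # hypothetical selector
    if not buttons:
        break
    buttons[0].click()
    time.sleep(1)  # crude wait for the newly loaded content

rendered_html = driver.page_source  # the DOM the extraction rules will see
driver.quit()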
Scalability
• When aiming to work with a large amount of input, it is crucial to build every system layer with scalability in mind.
• The two core crawler procedures, NewCrawl and UpdateCrawl, are stateless and purely functional (see the sketch below).
• All shared mutable state is delegated to the back-end.
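A sketch of what stateless and purely functional means here: both procedures take everything they need as arguments and return records instead of mutating shared state, so persistence stays in the back-end. Only the two procedure names come from the slides; the signatures are hypothetical.

from typing import NamedTuple

class PostRecord(NamedTuple):
    url: str
    title: str
    author: str
    published: str
    content: str
    comments: tuple

def new_crawl(blog_url: str, feed_url: str) -> list:
    """Full crawl of a new blog: build rules from the feed, process all
    pages and return the extracted records; nothing is stored locally."""
    ...

def update_crawl(blog_url: str, rules: dict, newest_known_post: str) -> list:
    """Incremental re-crawl: reuse the stored rules and return records
    only for posts newer than the ones the back-end already holds."""
    ...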
Evaluation
• Task: extract articles and titles from web pages.
• Comparison against three open-source projects: