Automatically Annotating Web Pages Using Google Rich Snippets
11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)
February 4, 2011
Frederik Hogenboom, Flavius Frasincar, Damir Vandic, Jeroen van der Meer, Ferry Boon, Uzay Kaymak
Erasmus University Rotterdam
PO Box 1738, NL-3000 DR Rotterdam, the Netherlands
This talk is based on the paper: Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM.
Introduction (1)
• Semantically annotating Web pages enhances machine interpretation
• Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages
• The vocabulary enables interesting applications
Introduction (2)
• Automating annotation is deemed necessary for static and third-party Web sites
• Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework
Framework (1)
• Four main stages:
  – Hotspot identification
  – Subjectivity analysis
  – Information extraction
  – Page annotation
• Web pages are converted to DOM trees in order to enable easy processing
Framework (2)
[Figure: framework overview (label: RDFa)]
Framework (3): Hotspots
• Reviews are characterized by large blocks of text: hotspots
• Headers, navigation elements, footers, etc., do not contain these blocks
• Text blocks have few HTML elements
• For each element in the DOM tree, we compute the text-to-content-ratio (TTCR):
  $\mathrm{TTCR} = \dfrac{L_{\text{text}}}{L_{\text{DOM}}}$, with $L_{\text{text}}$ = # textual characters, and $L_{\text{DOM}}$ = total # characters in the DOM element
Framework (4): Hotspots
• Illustrative example:
    <h1> Intel Core i7-975 Extreme And i7-950 Processors Reviewed</h1>
    <div>
      <p>
        Page <span class="page-number">1</span> of <span class="num-pages">15</span>
      </p>
    </div>
• The h1 element contains 64/73 × 100% ≈ 88% text
• However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements
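The TTCR computation above can be sketched in a few lines. This is a minimal illustration, not the ARROW implementation: it uses Python's standard `html.parser`, and the names `TTCRParser`, `count_chars`, and `ttcr` are hypothetical. It assumes well-nested markup, counts whitespace as text, and counts the start and end tags of an element as part of its total length.

```python
from html.parser import HTMLParser


class TTCRParser(HTMLParser):
    """Collects (tag, text_chars, total_chars) for every closed element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open elements as [tag, text_chars, total_chars]
        self.results = []  # closed elements, in closing order

    def handle_starttag(self, tag, attrs):
        # The raw start tag is markup belonging to the element it opens.
        self.stack.append([tag, 0, len(self.get_starttag_text())])

    def handle_data(self, data):
        # Assumption: every character of text, whitespace included, counts.
        if self.stack:
            self.stack[-1][1] += len(data)
            self.stack[-1][2] += len(data)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1][0] != tag:
            return  # sketch only: mismatched markup is ignored
        name, text_chars, total_chars = self.stack.pop()
        total_chars += len(tag) + 3  # "</" + tag + ">"
        self.results.append((name, text_chars, total_chars))
        if self.stack:
            # A child's text and markup also belong to its parent.
            self.stack[-1][1] += text_chars
            self.stack[-1][2] += total_chars


def count_chars(html):
    """Return (tag, text_chars, total_chars) per element."""
    parser = TTCRParser()
    parser.feed(html)
    parser.close()
    return parser.results


def ttcr(html):
    """Return (tag, text-to-content ratio) per element."""
    return [(tag, text / total) for tag, text, total in count_chars(html)]
```

Elements with a high ratio and a large amount of text would then be hotspot candidates; the slides do not state the threshold used, so none is assumed here.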
Framework (5): Subjectivity
• Hotspots are verified as reviews whenever they are subjective enough
• We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009):
  – Original: check if the document has ≥ k sentences that contain ≥ n subjectivity words each
  – Modification: check if ≥ m percent of all sentences in the document contain ≥ n subjectivity words each
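The difference between the two rules can be sketched as follows. This is an illustration under stated assumptions, not the LWD implementation: the word list is a toy lexicon invented for the example, sentences are split naively on `.`, `!`, and `?`, and all function names are hypothetical.

```python
import re

# Hypothetical toy lexicon; the slides do not list the subjectivity words used.
SUBJECTIVE_WORDS = {"great", "terrible", "love", "hate", "excellent", "poor"}


def subjective_sentences(text, n):
    """Return (subjective, total) sentence counts.

    A sentence counts as subjective when it contains at least n lexicon words.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    hits = 0
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        if sum(w in SUBJECTIVE_WORDS for w in words) >= n:
            hits += 1
    return hits, len(sentences)


def is_subjective_original(text, k, n):
    """Original rule: at least k sentences with >= n subjectivity words."""
    hits, _ = subjective_sentences(text, n)
    return hits >= k


def is_subjective_modified(text, m, n):
    """Modified rule: at least m percent of all sentences qualify."""
    hits, total = subjective_sentences(text, n)
    return total > 0 and 100.0 * hits / total >= m
```

The relative threshold m scales with document length, whereas a fixed k is easier to reach in a long text than in a short one.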
Framework (6): IE
• Various information is extracted:
  – Authors:
    • Named entities are detected in the vicinity of hotspots
    • Named Entity Recognizer (NER)
  – Dates:
    • Many different date formats are easily parsed
    • Regular expressions, e.g., (\w)\s(\d{1,2})(th|,)?\s(\d{2,4}) for "MM dd yyyy"
  – Products:
    • Name often found in title and h1 elements
    • Overlapping words
  – Ratings:
    • Many formats, e.g., images (90%), which can be numerical (80%), descriptors (15%), or letters (5%)
    • We focus on numerical ratings
    • Regular expressions on plain text or alt text of images, e.g., ([0-9.,]+)\s?/\s?([0-9.,]+) for "4/5"
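The date and rating patterns above can be applied as in the following sketch. The function names are hypothetical. One deviation is an assumption made here: the slide's date pattern starts with `(\w)`, which captures a single character, so the sketch uses `(\w+)` to capture the whole month name. The rating pattern is taken from the slide unchanged.

```python
import re

# DATE_RE is a variant of the slide's pattern: (\w+) instead of (\w), an
# assumption made here so that the whole month name is captured.
DATE_RE = re.compile(r"(\w+)\s(\d{1,2})(th|,)?\s(\d{2,4})")
# RATING_RE is the slide's pattern as given.
RATING_RE = re.compile(r"([0-9.,]+)\s?/\s?([0-9.,]+)")


def extract_date(text):
    """Return (month, day, year) strings for the first matching date."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day, _suffix, year = match.groups()
    return month, day, year


def extract_rating(text):
    """Return (value, scale) floats for the first "4/5"-style rating."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    try:
        # Accept a decimal comma as well as a decimal point.
        value, scale = (float(g.replace(",", ".")) for g in match.groups())
    except ValueError:
        return None  # e.g. a stray "." or "," matched without digits
    return value, scale
```

The same `extract_rating` call works on the `alt` text of a rating image once that attribute has been read from the DOM. Note that the rating pattern would also match a slash-separated date such as 12/25, so a real extractor would need extra checks.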
Framework (7): Annotation
• Key elements are tagged using Google Rich Snippets
• A new annotated Web page is returned
<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
  <span property="v:itemreviewed">
    Tango Hotel Taichung
  </span>
  <span property="v:reviewer">Sarah Lee</span>
  <span property="v:rating">4 stars</span>
  <span property="v:dtreviewed">18th December 2008</span>
  <p property="v:summary">
    Boutique like hotel without the boutique price
  </p>
</div>
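A minimal sketch of producing markup like the example above from extracted fields. It is a simplification and not the ARROW annotator: ARROW tags elements of the existing page, while this sketch renders a standalone block and uses `span` for every property. The function name and the input dictionary are hypothetical; the namespace and property names are those shown in the example.

```python
from html import escape

NAMESPACE = "http://rdf.data-vocabulary.org/#"
# Property names as they appear in the example markup.
PROPERTIES = ("itemreviewed", "reviewer", "rating", "dtreviewed", "summary")


def annotate_review(fields):
    """Render extracted fields as an RDFa review block.

    `fields` is a hypothetical dict keyed by property name; properties
    without a value are skipped.
    """
    lines = ['<div xmlns:v="%s" typeof="v:Review">' % NAMESPACE]
    for name in PROPERTIES:
        if name in fields:
            # Escape the value so extracted text cannot break the markup.
            lines.append('  <span property="v:%s">%s</span>'
                         % (name, escape(fields[name])))
    lines.append("</div>")
    return "\n".join(lines)
```

For example, `annotate_review({"itemreviewed": "Tango Hotel Taichung", "rating": "4 stars"})` yields a `div` with two annotated `span` elements.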
Implementation (1)
• We have implemented the ARROW framework as a Web application:
  – Java-based
  – Apache Tomcat server
• Input:
  – URL
  – Preferred output:
    • Visualizer
    • Annotated document
Implementation (2)
Evaluation
• Test set: 100 review, 100 non-review Web pages
• Sub-second performance
• Precision and specificity are good (both approximately 90%), while accuracy and recall vary (roughly 40% to 60%)
• The main problems relate to author detection, likely because reviewers use nicknames
• Performance depends on Web site structure
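For reference, the four measures named above follow from the counts of true and false positives and negatives. The sketch below uses the standard definitions; the function name is hypothetical and no counts from the evaluation are assumed.

```python
def metrics(tp, fp, tn, fn):
    """Standard measures from confusion-matrix counts.

    tp/fp: pages correctly/incorrectly flagged as reviews,
    tn/fn: pages correctly/incorrectly flagged as non-reviews.
    """
    def ratio(numerator, denominator):
        # Report None rather than divide by zero on an empty class.
        return numerator / denominator if denominator else None

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```

High precision and specificity together with low recall mean that few non-reviews are flagged as reviews, but a sizeable share of true reviews is missed.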
Conclusions
• We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets