Automatically Annotating Web Pages Using Google Rich Snippets

11th Dutch-Belgian Information Retrieval Workshop (DIR 2011)

February 4, 2011

Frederik [email protected]

Flavius [email protected]

Damir [email protected]

Jeroen van der [email protected]

Ferry [email protected]

Uzay [email protected]

Erasmus University Rotterdam, PO Box 1738, NL-3000 DR Rotterdam, the Netherlands

This talk is based on the paper A Framework for Automatic Annotation of Web Pages Using the Google Rich Snippets Vocabulary. Meer, J. van der, Boon, F., Hogenboom, F.P., Frasincar, F. & Kaymak, U. (2011). In 26th Symposium on Applied Computing (SAC 2011) (pp. 763-770). ACM.






Introduction (1)

• Semantically annotating Web pages enhances machine interpretation

• Google Rich Snippets (RDFa) enable Web page owners to add semantics to their pages

• The vocabulary enables interesting applications


Introduction (2)

• Automating annotation is necessary for static and third-party Web sites

• Hence, we propose the Automatic Review Recognition and annOtation of Web pages (ARROW) framework


Framework (1)


• Four main stages:
  – Hotspot identification
  – Subjectivity analysis
  – Information extraction
  – Page annotation

• Web pages are converted to DOM trees in order to enable easy processing

Framework (2)


[Figure: framework overview diagram, labeled RDFa]

Framework (3): Hotspots

• Reviews are characterized by large blocks of text: hotspots

• Headers, navigation elements, footers, etc., do not contain these blocks

• Text blocks have few HTML elements

• For each element in the DOM tree, we compute the text-to-content-ratio (TTCR):

$$TTCR = \frac{L_{text}}{L_{DOM}}$$

with $L_{text}$ = # textual characters, and $L_{DOM}$ = total # characters in the DOM element

Framework (4): Hotspots

• Illustrative example:

• The h1 element contains 64/73 × 100% ≈ 88% text

• However, the div element merely contains 34/116 × 100% ≈ 29% text due to its span elements


<h1> Intel Core i7-975 Extreme And i7-950 Processors Reviewed</h1>
<div>
  <p>
    Page <span class="page-number">1</span> of
    <span class="num-pages">15</span>
  </p>
</div>
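The ratio can be sketched in a few lines of Python. This is only an illustration, not the ARROW implementation (which is Java-based): the helper name `ttcr` is hypothetical, and the function takes the serialized HTML of a single element rather than walking a full DOM tree. Exact values depend on how whitespace is counted, so they may differ slightly from the figures on the slide.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts the characters that appear as text (outside of tags)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_chars = 0

    def handle_data(self, data):
        self.text_chars += len(data)


def ttcr(outer_html: str) -> float:
    """Text-to-content ratio: textual characters / total characters.

    `outer_html` is assumed to be the serialized markup of one element,
    including its own opening and closing tags.
    """
    if not outer_html:
        return 0.0
    counter = _TextCounter()
    counter.feed(outer_html)
    counter.close()  # flush any buffered text
    return counter.text_chars / len(outer_html)


h1 = "<h1>Intel Core i7-975 Extreme And i7-950 Processors Reviewed</h1>"
div = ('<div><p>Page <span class="page-number">1</span> of '
       '<span class="num-pages">15</span></p></div>')

# The heading is mostly text, the pager div is mostly markup.
print(f"h1:  {ttcr(h1):.0%}")
print(f"div: {ttcr(div):.0%}")
```

An element would then be treated as a hotspot candidate when its ratio is high and its text block is large; the thresholds for both are not given on the slides.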

Framework (5): Subjectivity

• Hotspots are verified as reviews whenever they are subjective enough

• We utilize an updated version of the LightWeight subjectivity Detection mechanism (LWD) of Barbosa et al. (2009):
  – Original: check if document has ≥ k sentences that contain ≥ n subjectivity words each
  – Modification: check if document has ≥ m percent of all sentences that contain ≥ n subjectivity words each
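The difference between the two checks can be sketched as follows. This is a minimal illustration under stated assumptions: the lexicon `SUBJECTIVITY_WORDS` is a tiny hypothetical stand-in for a real subjectivity lexicon, the sentence splitter is deliberately naive, and the function names are not taken from the paper.

```python
import re

# Hypothetical mini-lexicon; a real system would load a full subjectivity lexicon.
SUBJECTIVITY_WORDS = {"love", "hate", "great", "terrible", "excellent", "poor"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def count_subjective_sentences(text, n, lexicon=SUBJECTIVITY_WORDS):
    """Return (sentences with at least n subjectivity words, total sentences)."""
    sentences = split_sentences(text)
    hits = 0
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        if sum(word in lexicon for word in words) >= n:
            hits += 1
    return hits, len(sentences)


def is_subjective_original(text, k, n):
    """Original LWD rule: at least k sentences are subjective."""
    hits, _ = count_subjective_sentences(text, n)
    return hits >= k


def is_subjective_modified(text, m, n):
    """Modified rule: at least m percent of all sentences are subjective."""
    hits, total = count_subjective_sentences(text, n)
    return total > 0 and 100.0 * hits / total >= m


sample = ("I love this great phone. It has a 5 inch screen. "
          "The battery is terrible and the camera is poor.")
print(is_subjective_original(sample, k=2, n=2))
print(is_subjective_modified(sample, m=50, n=2))
```

The absolute threshold k is independent of document length, whereas the percentage m scales with the number of sentences in the hotspot.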


Framework (6): IE

• Various information is extracted:
  – Authors:
    • Named entities are detected in the vicinity of hotspots
    • Named Entity Recognizer (NER)
  – Dates:
    • Many different date formats are easily parsed
    • Regular expressions, e.g., (\w)\s(\d{1,2})(th|,)?\s(\d{2,4}) for the format MM dd yyyy
  – Products:
    • Name often found in title and h1 elements
    • Overlapping words
  – Ratings:
    • Many formats, e.g., images (90%), which can be numerical (80%), descriptors (15%), or letters (5%)
    • We focus on numerical ratings
    • Regular expressions on plain text or alt text of images, e.g., ([0-9.,]+)\s?/\s?([0-9.,]+) for ratings such as 4/5
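The two regular expressions on this slide can be tried out with Python's `re` module. The sketch below is illustrative only. As transcribed, `(\w)` in the date pattern captures a single character, so the variant `DATE_RE` with `(\w+)` is an assumption made here to capture the whole month name. The helper names and the treatment of a comma as a decimal separator are also assumptions.

```python
import re

# Patterns as they appear on the slide.
DATE_RE_SLIDE = re.compile(r"(\w)\s(\d{1,2})(th|,)?\s(\d{2,4})")
RATING_RE = re.compile(r"([0-9.,]+)\s?/\s?([0-9.,]+)")

# Assumed variant: \w+ instead of \w, so the full month name is captured.
DATE_RE = re.compile(r"(\w+)\s(\d{1,2})(th|,)?\s(\d{2,4})")


def extract_date(text):
    """Return (month, day, year) for the first 'Month dd yyyy' match, else None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month, day, _suffix, year = match.groups()
    return month, int(day), int(year)


def _to_number(token):
    """Parse '4', '8.5' or '8,5'; return None for tokens such as '.' or ','."""
    try:
        return float(token.replace(",", "."))
    except ValueError:
        return None


def extract_rating(text):
    """Return the first 'value/maximum' rating as a fraction in [0, 1], else None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    value = _to_number(match.group(1))
    maximum = _to_number(match.group(2))
    if value is None or not maximum:  # unparsable token or zero maximum
        return None
    return value / maximum


print(extract_date("Posted on December 18th 2008"))
print(extract_rating("Rating: 4/5"))
print(extract_rating("Overall score: 8.5 / 10"))
```

The same rating function could be applied to the alt text of rating images once that text has been read from the DOM.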

Framework (7): Annotation

• Key elements are tagged using Google Rich Snippets

• A new annotated Web page is returned


<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">
  <span property="v:itemreviewed">Tango Hotel Taichung</span>
  <span property="v:reviewer">Sarah Lee</span>
  <span property="v:rating">4 stars</span>
  <span property="v:dtreviewed">18th December 2008</span>
  <p property="v:summary">Boutique like hotel without the boutique price</p>
</div>
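As a rough sketch of the annotation step, the snippet below renders extracted fields into markup of the shape shown above. It is a simplification: the framework tags existing elements of the page in place, whereas this hypothetical helper `render_review` builds a standalone block from already extracted values.

```python
from html import escape


def render_review(item, reviewer, rating, date_reviewed, summary):
    """Build a Rich Snippets (RDFa) review block from extracted field values."""
    fields = [
        ("span", "v:itemreviewed", item),
        ("span", "v:reviewer", reviewer),
        ("span", "v:rating", rating),
        ("span", "v:dtreviewed", date_reviewed),
        ("p", "v:summary", summary),
    ]
    lines = ['<div xmlns:v="http://rdf.data-vocabulary.org/#" typeof="v:Review">']
    for tag, prop, value in fields:
        # Escape values so extracted text cannot break the markup.
        lines.append(f'  <{tag} property="{prop}">{escape(value)}</{tag}>')
    lines.append("</div>")
    return "\n".join(lines)


snippet = render_review(
    "Tango Hotel Taichung",
    "Sarah Lee",
    "4 stars",
    "18th December 2008",
    "Boutique like hotel without the boutique price",
)
print(snippet)
```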

Implementation (1)

• We have implemented the ARROW framework as a Web application:
  – Java-based
  – Apache Tomcat server

• Input:
  – URL
  – Preferred output:
    • Visualizer
    • Annotated document


Implementation (2)


Evaluation

• Test set: 100 review, 100 non-review Web pages

• Sub-second performance

• Precision and specificity are good (both approximately 90%), while accuracy and recall vary (approximately 40% to 60%)

• Main problems relate to detecting authors, likely caused by the use of nicknames

• Dependency on Web site structures
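For reference, the four reported measures follow from the counts of true/false positives and negatives. The sketch below uses the standard definitions; the counts in the usage example are toy values chosen for illustration and are not results from the evaluation.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),      # detected reviews that are real reviews
        "recall": ratio(tp, tp + fn),         # real reviews that were detected
        "specificity": ratio(tn, tn + fp),    # non-reviews correctly rejected
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Toy counts for illustration only (not taken from the evaluation).
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```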


Conclusions

• We presented ARROW, a framework for automatically annotating reviews with Google Rich Snippets

• The framework is not bound to this vocabulary

• Proof-of-concept implementation shows promising results

• Future work:
  – Improve heuristics
  – Add intelligent (semantically enabled) text parsers
  – Extend to other domains, e.g., recipes and videos


Questions

http://www.arrow-project.com/

