Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma Web Search & Mining Group Microsoft Research Asia 2009-04
28


Mar 27, 2015

Page 1:

Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums

Jiang-Ming Yang, Rui Cai, Yida Wang, Jun Zhu, Lei Zhang, and Wei-Ying Ma

Web Search & Mining Group

Microsoft Research Asia

2009-04

Page 2:

Web Forum Data

• An important information resource containing a great deal of human knowledge.

• Topics include recreation, sports, games, computers, art, society, science, home, and health.

• 20% of the pages in search results are from forums.

Page 3:

Understanding Forum

Crawling

• WWW’08: iRobot: An Intelligent Crawler for Web Forums

• SIGIR’08: Exploring Traversal Strategy

• KDD’09: Incremental Crawling

Data Extraction

• WWW’09: Automation Data Extraction

Quality Assessment

• SIGIR’09: Quality Assessment

Page 4:

Challenge

• Leverage more site-level knowledge

Page 5:
Page 6:

Sitemap recovering

Page 7:

Forum Sitemap

• A sitemap is a directed graph consisting of a set of vertices and the corresponding links.

[Figure: example forum sitemap with the vertices Home Page, Login Portal, List-of-Board, List-of-Thread, Post-of-Thread, Digest, Browse-by-Tag, and Search Result; legend: Vertex, Link]

• Rui Cai, Jiangming Yang, Wei Lai, Yida Wang and Lei Zhang. iRobot: An Intelligent Crawler for Web Forums. In Proceedings of WWW 2008 Conference
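
A minimal Python sketch of the sitemap as a directed graph is given here. The vertex names come from the slide, but the specific links between them are illustrative assumptions, since the edges of the figure are not preserved in this transcript. The `Sitemap` class and its method names are hypothetical.

```python
from collections import defaultdict


class Sitemap:
    """Directed graph: vertices are groups of pages, edges are links between groups."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_link(self, source, target):
        self.edges[source].add(target)
        self.edges.setdefault(target, set())  # make sure the target vertex exists

    def reachable_from(self, start):
        """Return every vertex reachable from `start` by following links."""
        seen, stack = set(), [start]
        while stack:
            vertex = stack.pop()
            if vertex in seen:
                continue
            seen.add(vertex)
            stack.extend(self.edges[vertex] - seen)
        return seen


# Assumed links, for illustration only.
sitemap = Sitemap()
sitemap.add_link("Home Page", "List-of-Board")
sitemap.add_link("List-of-Board", "List-of-Thread")
sitemap.add_link("List-of-Thread", "Post-of-Thread")
sitemap.add_link("Home Page", "Login Portal")

print(sorted(sitemap.reachable_from("List-of-Board")))
```

Storing adjacency as sets keeps duplicate links between the same two vertices from being counted twice.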

Page 8:

Page Clustering

• Forum pages are generated from a database and templates.

• Layout is a robust way to describe a template.

– Layout can be characterized by the HTML elements in different DOM paths.

[Figure: panels (a), (b), (c), (d)]
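
One way to turn "HTML elements in different DOM paths" into a layout feature is to count how many elements occur under each root-to-element tag path. The sketch below does this with Python's standard `html.parser`. The helper names (`DomPathCounter`, `dom_path_features`) and the choice of raw counts as the feature are assumptions for illustration, not the authors' exact representation.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DomPathCounter(HTMLParser):
    """Count the elements found under each DOM path, e.g. 'html/body/table/tr/td'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_path_features(html):
    parser = DomPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


page = "<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>"
features = dom_path_features(page)
print(features["html/body/table/tr/td"])
```

Two pages rendered from the same template should yield similar path counts even when their text differs, which is what makes such a feature usable for clustering.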

Page 9:

Page Clustering

DOM Path Feature Discovery

Clustering by Virtual Tables

Page 10:

Link Analysis

1. Login

4. Thread List

5. Thread

A Link = URL Pattern + Location
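
The definition above can be sketched as a small data structure in which two links are the same when both their URL pattern and their location on the page agree. Replacing digit runs with a placeholder is an assumed, simplified way to generalize URLs, and the `example.com` URLs, DOM paths, and names (`Link`, `url_pattern`) are hypothetical.

```python
import re
from typing import NamedTuple
from urllib.parse import urlsplit


class Link(NamedTuple):
    url_pattern: str  # URL with its variable parts generalized
    location: str     # DOM path of the anchor element on the page


def url_pattern(url):
    """Generalize a URL by replacing every run of digits with a placeholder."""
    parts = urlsplit(url)
    generalized = parts.path
    if parts.query:
        generalized += "?" + parts.query
    return re.sub(r"\d+", "<num>", generalized)


a = Link(url_pattern("http://forum.example.com/thread.php?id=123"), "html/body/table/tr/td/a")
b = Link(url_pattern("http://forum.example.com/thread.php?id=4567"), "html/body/table/tr/td/a")
c = Link(url_pattern("http://forum.example.com/thread.php?id=89"), "html/body/div/a")

print(a == b)  # same pattern, same location
print(a == c)  # same pattern, different location
```

Keeping the location in the key separates, for example, a thread link inside the thread table from a link with the same URL shape in a sidebar.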

Page 11:

[Figure: post pages and profile pages, in panels I, II, and III, with the relation labels Inter-page, Inter-vertex, and Inner-page]

Page 12:

Inner-Page Features

• The inclusion relation. Data records usually have inclusion relations.

• The alignment relation. Since data is generated from a database and rendered via templates, data records with the same label may appear repeatedly in a page.

• The time order. Since post records are generated sequentially along a timeline, post times should be sorted in ascending or descending order.
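
The time-order feature can be illustrated with a simple monotonicity check over the timestamps extracted from one candidate field. This is only a sketch under assumptions: the timestamp format, the sample values, and the idea of contrasting post times with user join dates are illustrative and not taken from the slides.

```python
from datetime import datetime


def is_time_ordered(timestamps):
    """True if the timestamps are sorted ascending or descending (ties allowed)."""
    pairs = list(zip(timestamps, timestamps[1:]))
    ascending = all(a <= b for a, b in pairs)
    descending = all(a >= b for a, b in pairs)
    return ascending or descending


def parse_times(strings, fmt="%Y-%m-%d %H:%M"):
    return [datetime.strptime(s, fmt) for s in strings]


# Hypothetical values read from two candidate date fields on one post page.
post_times = parse_times(["2009-04-01 09:00", "2009-04-01 09:30", "2009-04-02 08:15"])
join_dates = parse_times(["2007-05-01 00:00", "2005-01-15 00:00", "2008-11-30 00:00"])

print(is_time_ordered(post_times))
print(is_time_ordered(join_dates))
```

A field whose values are monotonic across the records of a page is a stronger candidate for the post time than one whose values jump back and forth.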

Page 13:

Inner-vertex Features

Page 14:

Inter-vertex Features

Page 15:

[Figure: list pages annotated with List Record and List Title elements and the formula numbers (4) to (10), with links to Post pages]

Page 16:

Problem Setting

Author Title Content

Page 17:

Formulas of list page

[Figure: list pages annotated with List Record and List Title elements and the formula numbers (4) to (10), with links to Post pages]

• Formulas for identifying list record

• Formulas for identifying list title

Page 18:

Formulas of post page

• Formulas for identifying post record

• Formulas for identifying post author

[Figure: post pages annotated with Post Record, Time, and Author elements and the formula numbers (11) to (16)]

Page 19:

Formulas of post page

• Formulas for identifying post time

• Formulas for identifying post content

[Figure: post pages annotated with Post Record, Time, and Content elements and the formula numbers (17) to (22)]

Page 20:

Joint inference

Page 21:

Markov Logic Networks

• An MLN can be viewed as a template for constructing Markov Random Fields.

• With a set of formulas and constants, MLNs define a Markov network with one node per ground atom and one feature per ground formula. The probability of a state x in such a network is given by:
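
The equation is not present in this transcript. For reference, the standard MLN distribution from the literature (Richardson and Domingos) is written out here, on the assumption that the slide showed this form. The symbols are the usual ones: w_i is the weight of formula F_i, n_i(x) is the number of true groundings of F_i in state x, and Z is the normalizing constant.

```latex
P(X = x) = \frac{1}{Z} \exp\left( \sum_{i} w_i \, n_i(x) \right),
\qquad
Z = \sum_{x'} \exp\left( \sum_{i} w_i \, n_i(x') \right)
```

A state that satisfies more groundings of a positively weighted formula is therefore exponentially more probable, with all else equal.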

Page 22:

Markov Logic Networks

• Divide DOM tree elements into three categories:

– Text element

– Hyperlink element

– Inner element

• Benefits

– Reduce the number of possible groundings in inference.

– Reduce ambiguity and achieve better performance.
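
The three-way split above could be sketched as follows. The slide does not define the categories, so the rules used here are assumptions: a non-empty text node is a text element, an `<a>` tag is a hyperlink element, and every other tag is an inner element. The names `ElementCategorizer` and `categorize` are hypothetical.

```python
from html.parser import HTMLParser


class ElementCategorizer(HTMLParser):
    """Label DOM elements as 'text', 'hyperlink', or 'inner' (assumed definitions)."""

    def __init__(self):
        super().__init__()
        self.elements = []  # (category, label) pairs in document order

    def handle_starttag(self, tag, attrs):
        category = "hyperlink" if tag == "a" else "inner"
        self.elements.append((category, tag))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.elements.append(("text", text))


def categorize(html):
    parser = ElementCategorizer()
    parser.feed(html)
    parser.close()
    return parser.elements


elements = categorize('<div><a href="/user/1">alice</a><span>2009-04-01</span></div>')
for category, label in elements:
    print(category, label)
```

With typed elements, a formula only needs to be grounded over the category it can apply to, which is one plausible reading of how the split reduces the number of groundings.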

Page 23:

Experiments

List Pages Post Pages

Page 24:

Experiments

Page 25:

Experiments

[Figure: average F1 (y-axis, 0.8 to 1) versus number of sites in the training set (x-axis, 1 to 5 sites), with one curve each for List Record, List Title, Post Record, Post Author, Post Time, and Post Content]

Page 26:

Experiments

[Figure: average F1 (y-axis, 0.6 to 1) versus number of pages per site in the training set (x-axis, 10 to 50), with one curve each for List Record, List Title, Post Record, Post Author, Post Time, and Post Content]

Page 27:

Future work

http://discussions.apple.com/

Page 28:

Conclusion

• A template-independent approach to extracting structured data from web forum sites.

• We can leverage the power of site-level information, such as the mutual information among pages within the same vertex or across vertices of the sitemap.

• http://research.microsoft.com/people/jmyang/