iRobot: An Intelligent Crawler for Web Forums Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang Microsoft Research, Asia June 22, 2022

IRobot: An Intelligent Crawler for Web Forums Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang Microsoft Research, Asia November 3, 2013.

Mar 26, 2015

Page 1: IRobot: An Intelligent Crawler for Web Forums Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang Microsoft Research, Asia November 3, 2013.

iRobot: An Intelligent Crawler for Web Forums

Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang

Microsoft Research, Asia

April 10, 2023

Page 2: IRobot: An Intelligent Crawler for Web Forums Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang Microsoft Research, Asia November 3, 2013.

Outline

• Motivation & Challenge
• iRobot – Our Solution
  – System Overview
  – Module Details
• Evaluation


Why Web Forums Are Important

• Forums are a huge resource of human knowledge
  – Popular all over the world
  – Cover any conceivable topic or issue
• Forum data can benefit many applications
  – Improve the quality of search results
  – Enable various kinds of data mining on forum data
• Collecting forum data
  – Is the basis of all forum-related research
  – Is not a trivial task


Why Forum Crawling Is Difficult

• Duplicate pages
  – Forums have a complex in-site structure
  – Many shortcuts for browsing
• Invalid pages
  – Most forums have access control
  – Some pages can only be visited after registration
• Page-flipping
  – A long thread is shown across multiple pages
  – Deep navigation levels


The Limitations of Generic Crawlers

• In general crawling, each page is treated independently
  – Fixed crawling depth
  – Cannot avoid duplicates before downloading
  – Fetches lots of invalid pages, such as login prompts
  – Ignores the relationships between pages from the same thread
• Forum crawling needs a site-level perspective!


Statistics on Some Forums

• Around 50% of crawled pages are useless
• Waste of both bandwidth and storage


Outline

• Motivation & Challenge
• Our Solution – iRobot
  – System Overview
  – Module Details
• Evaluation


What is Site-Level Perspective?

• Understand the organization structure
• Find an optimal crawling strategy

[Figure: the site-level perspective of "forums.asp.net". Nodes: Entry, List-of-Board, List-of-Thread, Post-of-Thread, Browse-by-Tag, Search Result, Digest, Login Portal]


iRobot: An Intelligent Forum Crawler

[Figure: system overview. Labels: Crawler, General Web Crawling, Sitemap Construction, Traversal Path Selection, Forum Crawling, Segmentation & Archiving, Raw Pages, Meta, Restart]


Outline

• Motivation & Challenge
• Our Solution – iRobot
  – System Overview
  – Module Details
    • How many kinds of pages? (Sitemap Construction)
    • How do these pages link with each other? (Sitemap Construction)
    • Which pages are valuable? (Traversal Path Selection)
    • Which links should be followed? (Traversal Path Selection)
• Evaluation


Page Clustering

• Forum pages are generated from a database and a template
• Layout is a robust way to describe a template
  – Repetitive regions are everywhere on forum pages
  – Layout can be characterized by repetitive regions
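The idea of characterizing layout by repetitive regions can be sketched as follows. This is only an illustration, not the method from the talk: it assumes a "repetitive region" can be approximated by a tag path that occurs at least `min_repeat` times in a page, and it groups pages greedily by Jaccard similarity of those paths. The names `PathCounter`, `layout_signature`, `cluster_pages`, and both thresholds are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCounter(HTMLParser):
    """Counts how often each root-to-element tag path occurs in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched closers are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def layout_signature(html, min_repeat=3):
    """Assumed proxy for repetitive regions: tag paths seen min_repeat+ times."""
    parser = PathCounter()
    parser.feed(html)
    return frozenset(p for p, n in parser.counts.items() if n >= min_repeat)


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5):
    """Greedy clustering: a page joins the first cluster with a similar layout."""
    clusters = []  # (representative signature, member page names)
    for name, html in pages.items():
        sig = layout_signature(html)
        for rep, members in clusters:
            if jaccard(sig, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((sig, [name]))
    return [members for _, members in clusters]


# Made-up pages: two thread lists, one thread, one login form.
pages = {
    "board_1": "<html><body><ul>" + "<li><a>t</a></li>" * 5 + "</ul></body></html>",
    "board_2": "<html><body><ul>" + "<li><a>t</a></li>" * 8 + "</ul></body></html>",
    "thread_1": "<html><body><table>" + "<tr><td>p</td></tr>" * 4 + "</table></body></html>",
    "login": "<html><body><form><input><input></form></body></html>",
}
groups = cluster_pages(pages)
```

With these made-up pages, the two list pages share a signature even though they hold different numbers of items, which is the property that makes layout a more stable clue than page content.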

[Figure: panels (a) to (d)]


Page Clustering

[Figure: sitemap nodes: List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, Digest]


Link Analysis

• URL pattern can distinguish links, but is not reliable on all sites
• Location can also distinguish links

[Figure: example page with labeled link locations: 1. Login, 4. Thread List, 5. Thread]

A Link = URL Pattern + Location
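A minimal sketch of keying links by URL pattern plus location, under stated assumptions: the URL pattern is approximated by replacing digit runs in the path and wildcarding query values, and the location is a caller-supplied label for the page region that holds the link. The function names, the `{n}` placeholder, and the region labels are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def url_pattern(url):
    """Generalize a URL: digit runs in the path become {n}, query values become *."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    query = "&".join(f"{k}=*" for k in keys)
    return path + ("?" + query if query else "")


def group_links(links):
    """Group (url, location) pairs by the combined key (URL pattern, location)."""
    groups = defaultdict(list)
    for url, location in links:
        groups[(url_pattern(url), location)].append(url)
    return dict(groups)


# Made-up links: the same URL pattern appears in two different page regions.
links = [
    ("/t/1.aspx", "thread-list"),
    ("/t/2.aspx", "thread-list"),
    ("/t/1.aspx", "sidebar"),
]
link_groups = group_links(links)
```

In this example the URL pattern alone would merge all three links into one group, while adding the location keeps the sidebar shortcut separate from the thread-list entries.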

[Figure: sitemap nodes: List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, Entry, Digest]

Informativeness Evaluation

• Which kinds of pages (nodes) are valuable?
• Some heuristic criteria
  – A larger node is more likely to be valuable
  – Pages with a large size are more likely to be valuable
  – A diverse node is more likely to be valuable
    • Based on content de-dup
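The three heuristics can be turned into per-node features roughly as below. The slides do not say how the criteria are combined, so the product used in `rank_nodes` is purely an assumption for illustration, as are the function names and the use of exact hashing as a stand-in for content de-dup.

```python
import hashlib


def node_features(pages):
    """pages: texts of the sampled pages that belong to one sitemap node."""
    digests = {hashlib.sha256(p.encode("utf-8")).hexdigest() for p in pages}
    return {
        "count": len(pages),                                # node size
        "avg_size": sum(len(p) for p in pages) / len(pages),  # page size
        "diversity": len(digests) / len(pages),             # share of distinct pages
    }


def rank_nodes(nodes):
    """Order node names by an assumed score: share of pages * relative size * diversity."""
    feats = {name: node_features(pages) for name, pages in nodes.items()}
    total = sum(f["count"] for f in feats.values())
    max_size = max(f["avg_size"] for f in feats.values())

    def score(f):
        return (f["count"] / total) * (f["avg_size"] / max_size) * f["diversity"]

    return sorted(feats, key=lambda name: score(feats[name]), reverse=True)


# Made-up sample: six distinct long thread pages and four identical login prompts.
nodes = {
    "Post-of-Thread": [f"post {i} " * 20 for i in range(6)],
    "Login Portal": ["please log in"] * 4,
}
ranking = rank_nodes(nodes)
```

The login node scores low on all three criteria here: it has fewer pages, shorter pages, and only one distinct page after de-dup.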

[Figure: sitemap nodes: List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, Entry, Digest]

Traversal Path Selection

• Clean sitemap
  – Remove valueless nodes
  – Remove duplicate nodes
  – Remove links to valueless / duplicate nodes
• Find an optimal path
  – Construct a spanning tree
  – Use depth as cost
• User browsing behaviors
  – Identify page-flipping links
    • Number, Prev/Next
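The cleaning and spanning-tree steps can be sketched as a breadth-first search, which yields a minimum-depth tree when every link costs one level. The node names come from the sitemap figure, but the edges and the set of dropped nodes below are invented for the example, and `clean_sitemap` and `min_depth_tree` are hypothetical names.

```python
from collections import deque


def clean_sitemap(links, drop):
    """Remove the nodes in `drop` and every link that points to them."""
    return {
        src: [dst for dst in dsts if dst not in drop]
        for src, dsts in links.items()
        if src not in drop
    }


def min_depth_tree(links, entry):
    """Breadth-first search from the entry page; returns parent and depth per node."""
    parent = {entry: None}
    depth = {entry: 0}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, []):
            if nxt not in parent:  # first visit is the shallowest one
                parent[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return parent, depth


# Invented edges between the node types shown in the sitemap figure.
sitemap = {
    "Entry": ["List-of-Board", "Login Portal", "Search Result", "Digest"],
    "List-of-Board": ["List-of-Thread", "Browse-by-Tag"],
    "Browse-by-Tag": ["List-of-Thread"],
    "List-of-Thread": ["Post-of-Thread", "Login Portal"],
    "Search Result": ["Post-of-Thread"],
    "Digest": ["Post-of-Thread"],
}
# Assumed for the example: which nodes were judged valueless or duplicate.
dropped = {"Login Portal", "Search Result", "Digest", "Browse-by-Tag"}

cleaned = clean_sitemap(sitemap, dropped)
parent, depth = min_depth_tree(cleaned, "Entry")
```

Following `parent` back from "Post-of-Thread" gives the traversal path a crawler would keep; page-flipping links (numbered or Prev/Next) would be handled separately and are not modeled here.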

[Figure: sitemap nodes: List-of-Board, List-of-Thread, Browse-by-Tag, Search Result, Post-of-Thread, Login Portal, Entry, Digest]

Outline

• Motivation & Challenge
• iRobot – Our Solution
  – System Overview
  – Module Details
• Evaluation


Evaluation Criteria

• Duplicate ratio

• Invalid ratio

• Coverage ratio
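The slides name the three criteria without defining them. One plausible reading, stated here as an assumption, is: duplicate and invalid ratios are shares of the crawled pages, and coverage is the share of the valuable pages in the mirrored site that the crawl retrieved. The function name, its parameters, and the numbers in the example are hypothetical.

```python
def crawl_ratios(crawled, duplicate, invalid, valuable_in_mirror):
    """Assumed definitions of the three evaluation criteria (not from the slides)."""
    valuable = crawled - duplicate - invalid
    return {
        "duplicate_ratio": duplicate / crawled,
        "invalid_ratio": invalid / crawled,
        "coverage_ratio": valuable / valuable_in_mirror,
    }


# Made-up counts for a single forum.
ratios = crawl_ratios(crawled=1000, duplicate=100, invalid=50, valuable_in_mirror=1000)
```

Under these definitions a good crawler pushes the first two ratios toward zero while keeping coverage close to one.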

[Figure: three bar charts over the forums Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, Hoopchina. The first two compare Mirrored Pages with iRobot (y-axes 0% to 70% and 0% to 25%); the third shows Coverage ratio (y-axis 0% to 100%)]


Effectiveness and Efficiency

• Effectiveness
• Efficiency

[Figure: two pairs of bar charts per forum (Biketo, Asp, Baidu, Douban, CQZG, Tripadvisor, Hoopchina; one panel's axis lists Gentoo in place of Hoopchina). Each pair compares (a) A Generic Crawler with (b) iRobot, with pages split into Invalid, Duplicate, and Valuable. Y-axes run 0 to 6000 for the first pair and 0 to 20000 for the second]


Performance vs. Sampled Page#

[Figure: Coverage ratio, Duplicate ratio, and Invalid ratio (y-axis 0% to 90%) versus Number of Sampled Pages (10, 20, 50, 100, 500, 1000)]


Preserved Discussion Threads

Forums        Mirrored   Crawled by iRobot   Correctly Recovered
Biketo        1584       1313                1293
Asp           600        536                 536
Baidu         −          −                   −
Douban        62         60                  37
CQZG          1393       1384                1311
Tripadvisor   326        272                 272
Hoopchina     2935       2829                2593
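To make the table easier to compare across forums, the counts can be turned into two fractions per forum: threads crawled out of threads mirrored, and threads correctly recovered out of threads crawled. These derived ratios are not on the slide; they are simple arithmetic on its numbers, and the function name is hypothetical. Baidu is skipped because the slide gives no counts for it.

```python
# (mirrored, crawled by iRobot, correctly recovered), copied from the table.
THREADS = {
    "Biketo": (1584, 1313, 1293),
    "Asp": (600, 536, 536),
    "Douban": (62, 60, 37),
    "CQZG": (1393, 1384, 1311),
    "Tripadvisor": (326, 272, 272),
    "Hoopchina": (2935, 2829, 2593),
}


def thread_ratios(table):
    """Return {forum: (crawled / mirrored, recovered / crawled)}."""
    return {
        forum: (crawled / mirrored, recovered / crawled)
        for forum, (mirrored, crawled, recovered) in table.items()
    }


rates = thread_ratios(THREADS)
for forum, (crawl_rate, recover_rate) in rates.items():
    print(f"{forum:12s} crawled {crawl_rate:6.1%}  recovered {recover_rate:6.1%}")
```

For example, Asp and Tripadvisor recover every crawled thread (536 of 536 and 272 of 272), while Douban recovers 37 of 60, about 61.7%.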


Conclusions

• An intelligent forum crawler based on site-level structure analysis
  – Identify page templates / valuable pages / link analysis / traversal path selection
• Some modules can still be improved
  – More automated & mature algorithms in SIGIR’08
• More future work directions
  – Queue management
  – Refresh strategies


Thanks!
