Page 1: SIGIR 2004: Web-page Classification through Summarization

Dou Shen, Zheng Chen*, Qiang Yang

Presentation: Yao-Min Huang
Date: 09/15/2004

Page 2: Outline

Motivation
Related Work
Architecture Overview
Summarizer Methods (1-4)
Experiments
Conclusion
Future Work

Page 3: Motivation

To facilitate Web users in finding the desired information:
  Browse: navigate through hierarchical collections.
  Search: submit a query to a search engine.
Much work has been done on Web-page classification (e.g., exploiting hyperlinks).
Summarization is a good method to filter the noise from a Web page.

Page 4: Related Work

Overview of summarization

Goal of summarization: summary generation methods seek to identify the document contents that convey the most "important" information within the document.

Types of summarization:
  Indicative vs. informative
  Extraction vs. abstraction
  Generic vs. query-oriented
  Unsupervised vs. supervised
  Single-document vs. multi-document

Page 5: Related Work (Cont.) -- Summarization in IR

Methods:
  Unsupervised methods: cluster and select
  Supervised methods

Applications:
  Generic summaries for indexing in information retrieval (Tetsuya Sakai, SIGIR 2001)
  Term selection in relevance feedback for IR (A. M. Lam-Adesina, SIGIR 2001)

Page 6: Architecture Overview

[Architecture diagram: training and testing Web pages go through a summarization step (Luhn, LSA, supervised, page-layout analysis, or human, plus an ensemble of these). The resulting training and testing summaries are fed to a classifier (NB/SVM), evaluated with 10-fold cross validation, which produces the classification result.]

Page 7: Summarizer 1: Adapted Luhn's Method (IBM Journal 1958)

Assumption:

The more “Significant Words” there are in a sentence and the closer they are, the more meaningful the sentence is.

Approach: the sentences with the highest significance factor are selected to form the summary.

An example:

  — — — [ # — — # — # # # ] — — — #

  significance factor = 5 * 5 / 8 = 3.125

  # marks a significant word, i.e. one whose frequency lies between the high-frequency cutoff and the low-frequency cutoff.
  Limit L = 2: two significant words are considered significantly related if they are separated by at most L non-significant words.
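
To make the example concrete, here is a minimal sketch in Python of the significance-factor computation (not the authors' code); it assumes the set of significant words and the limit L are already given, and it simplifies Luhn's clustering of significant words into bracketed spans.

```python
# Minimal sketch of Luhn's significance factor for one sentence.
# `significant` is the set of significant words (frequency between the
# low- and high-frequency cutoffs); L is the distance limit from the slide.

def significance_factor(sentence_words, significant, L=2):
    positions = [i for i, w in enumerate(sentence_words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1                      # significant words in the current bracket
    for pos in positions[1:]:
        if pos - prev - 1 <= L:    # gap of at most L non-significant words
            prev, count = pos, count + 1
        else:                      # close the bracket and start a new one
            best = max(best, count * count / (prev - start + 1))
            start = prev = pos
            count = 1
    return max(best, count * count / (prev - start + 1))

# The example above: 8 words inside the brackets, 5 of them significant.
words = list("xxx#xx#x###xxx#")   # '#' marks a significant word
print(significance_factor(words, {"#"}, L=2))   # 5 * 5 / 8 = 3.125
```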

Page 8: Summarizer 1: Adapted Luhn's Method (Cont.)

[Diagram: original method vs. adapted method]

Original method: for a single Web page, a significant word pool is built from the page itself, the sentences of the page are scored against it, and the top sentences form the summary.

Adapted method:
  For training pages: a significant word pool is built for each category (Cat1, Cat2, ..., Catm); the sentences of a training page from category m are scored against the pool of category m to produce its summary.
  For testing pages: the sentences of a testing page are scored against the pool of every category and the scores are averaged to produce its summary (a small sketch of this scheme follows).
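
A simplified sketch of the adapted scheme (my own illustration, not the paper's code): per-category pools are built from training pages, and a test page's sentences are scored against every pool and averaged. For brevity the sentence score is just a count of significant words rather than the full significance factor, and the frequency cutoffs are invented.

```python
from collections import Counter

def significant_pool(docs, low=2, high=50):
    # One pool per category: words whose frequency across the category's
    # training pages lies between the (invented) cutoffs.
    freq = Counter(w for doc in docs for w in doc.split())
    return {w for w, c in freq.items() if low <= c <= high}

def score_sentence(sentence, pool):
    # Simplified stand-in for the significance factor.
    return sum(1 for w in sentence.split() if w in pool)

def summarize_test_page(sentences, pools, n=2):
    # Adapted scheme for a testing page: average the score over all pools.
    avg = [sum(score_sentence(s, p) for p in pools) / len(pools) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: -avg[i])
    return [sentences[i] for i in sorted(ranked[:n])]

pools = [significant_pool(["storm hits coast", "coast storm damage"]),
         significant_pool(["team wins final", "final score team"])]
page = ["the storm reached the coast", "ads and links", "team news tonight"]
print(summarize_test_page(page, pools, n=1))
```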

Page 9: Summarizer 2: Latent Semantic Analysis (SIGIR 2001)

A fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse.

Overview: given an $m \times n$ term-by-sentence matrix $A$, compute the singular value decomposition $A = U \Sigma V^T$.

  $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0)$: the sorted singular values, with $r = \mathrm{rank}(A)$.

  $U$ ($m \times n$): its column vectors are the left singular vectors (salient patterns among the terms).

  $V$ ($n \times n$): its column vectors are the right singular vectors (salient patterns among the sentences).

Page 10: Summarizer 2: Latent Semantic Analysis (Cont.)

[Diagram: the term-by-sentence matrix $A$, whose rows correspond to terms $W_1, \ldots, W_m$ and whose columns correspond to sentences $S_1, \ldots, S_n$, is decomposed as $A = U \times \Sigma \times V^T$; the columns of $U$ are the left singular vectors and the rows of $V^T$ (columns of $V$) are the right singular vectors.]

For each of the top right singular vectors (column vectors of $V$), select the sentence that has the largest index value; the sketch below follows this rule.
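
A minimal sketch of this selection rule with NumPy's SVD (not the authors' implementation). The raw-count term-by-sentence matrix and the use of absolute component values (to sidestep the sign ambiguity of singular vectors) are my assumptions.

```python
import numpy as np

def lsa_select(A, k):
    """Select k sentences from an m x n term-by-sentence matrix A: for each
    of the top-k right singular vectors, take the sentence (column) with
    the largest absolute component that has not been chosen yet."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # singular values sorted
    chosen = []
    for i in range(min(k, Vt.shape[0])):
        order = np.argsort(-np.abs(Vt[i]))        # sentences ranked by |V| entry
        pick = next(j for j in order if j not in chosen)
        chosen.append(int(pick))
    return chosen

# Toy example: 5 terms x 4 sentences (raw term counts).
A = np.array([[2, 0, 1, 0],
              [1, 0, 1, 0],
              [0, 3, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 2, 1]], dtype=float)
print(lsa_select(A, k=2))   # indices of the two selected sentences
```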

Page 11: Summarizer 3: Summarization by Page Layout Analysis (WWW10, 2001)

In HTML content, a BO (Basic Object) is a non-breakable element within two tags or an embedded Object.

Page 12: Summarizer 3: Summarization by Page Layout Analysis

Analyze the structure of Web pages.
Compute the similarity graph between objects (nodes = objects, edge weight = similarity).
Get the core object.
Extract the content body (CB).

[Diagram: an example page layout with header, search box, main body, navigation list, and copyright blocks.]

Page 13: Summarizer 3: Summarization by Page Layout Analysis (Cont.)

Algorithm to detect the content body (CB):

Consider each selected object as a single document and build the TF*IDF index for the object.

Calculate the similarity between any two objects using the cosine measure, and add an edge between them if their similarity is greater than a threshold.

A core object is defined as the object having the most edges.

Extract the CB as the combination of all objects that have edges linked to the core object.

Summary: all sentences that are included in the content body give rise to the summary of the Web page (a sketch of this procedure follows).
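
A rough sketch of these four steps using scikit-learn's TF-IDF and cosine similarity (assuming scikit-learn is available); the threshold value and the example objects are invented, not taken from the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_body(objects, threshold=0.1):
    """objects: list of text blocks (basic objects) from one page.
    Returns the indices of the objects forming the content body."""
    tfidf = TfidfVectorizer().fit_transform(objects)   # each object = a document
    sim = cosine_similarity(tfidf)
    n = len(objects)
    # Edge between two objects whose similarity exceeds the threshold.
    edges = [[j for j in range(n) if j != i and sim[i, j] > threshold]
             for i in range(n)]
    core = max(range(n), key=lambda i: len(edges[i]))  # object with most edges
    return sorted({core, *edges[core]})                # core plus its neighbours

objs = ["home news sports weather",                          # navigation list
        "breaking news: the storm hit the coast overnight",  # main body
        "full storm coverage and weather updates for the coast",
        "copyright 2004 all rights reserved"]
print(content_body(objs, threshold=0.1))
```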

Page 14: Summarizer 3: Summarization by Page Layout Analysis (Cont.)

Pairwise object similarities:

        Obj1   Obj2   Obj3   Obj4
Obj1    1.00   0.03   0.08   0.00
Obj2           1.00   0.15   0.00
Obj3                  1.00   0.02
Obj4                         1.00

[Diagram: the four objects drawn as a similarity graph with the edge weights above; the core object together with the objects linked to it forms the content body.]

Page 15: Naïve Bayes Classifier (lecture on ML & DM, Berlin 2004)

Assume a target function $f: X \to V$, where each instance $x$ is described by attributes $a_1, a_2, \ldots, a_n$; the goal is to predict the target value (classification) $f(x)$.

The most probable value of $f(x)$ is

  $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$
           $= \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)}$
           $= \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$

Naïve Bayes assumption:

  $P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$

Naïve Bayes classifier:

  $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$

Page 16: Naïve Bayes: Example

Given a data set Z of 3-dimensional Boolean examples, train a naïve Bayes classifier to predict the classification D. What is the predicted probability $P(D=T \mid A=T, B=F, C=T)$?

Attribute A   Attribute B   Attribute C   Classification D
    F             T             F               T
    F             F             T               T
    T             F             F               T
    T             F             F               F
    F             T             T               F
    F             F             T               F

Probabilities estimated from Z:

  $P(D=T) = 1/2$,  $P(D=F) = 1/2$
  $P(A=T \mid D=T) = 1/3$,  $P(A=F \mid D=T) = 2/3$
  $P(B=T \mid D=T) = 1/3$,  $P(B=F \mid D=T) = 2/3$
  $P(C=T \mid D=T) = 1/3$,  $P(C=F \mid D=T) = 2/3$
  $P(A=T \mid D=F) = 1/3$,  $P(A=F \mid D=F) = 2/3$
  $P(B=T \mid D=F) = 1/3$,  $P(B=F \mid D=F) = 2/3$
  $P(C=T \mid D=F) = 2/3$,  $P(C=F \mid D=F) = 1/3$

Page 17: Naïve Bayes: Example

  $P(D=T \mid A=T, B=F, C=T)$
  $= \frac{P(A=T, B=F, C=T \mid D=T)\, P(D=T)}{P(A=T, B=F, C=T)}$
  $= \frac{P(A=T, B=F, C=T \mid D=T)\, P(D=T)}{P(A=T, B=F, C=T \mid D=T)\, P(D=T) + P(A=T, B=F, C=T \mid D=F)\, P(D=F)}$
  $= \frac{\frac{1}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} \cdot \frac{1}{2}}{\frac{1}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} \cdot \frac{1}{2} + \frac{1}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{1}{2}}$
  $= \frac{2}{2 + 4} = \frac{1}{3}$
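
The result can be checked with a few lines of Python (illustrative only): the script below re-estimates the probabilities from the six training rows and applies the naïve Bayes formula, printing 1/3.

```python
from collections import Counter

# The six Boolean training examples (A, B, C, D) from the table above.
data = [(False, True,  False, True),
        (False, False, True,  True),
        (True,  False, False, True),
        (True,  False, False, False),
        (False, True,  True,  False),
        (False, False, True,  False)]

def cond(attr, value, d):
    """Estimate P(attribute = value | D = d) from the data."""
    rows = [r for r in data if r[3] == d]
    return sum(r[attr] == value for r in rows) / len(rows)

def posterior_true(a, b, c):
    """P(D=T | A=a, B=b, C=c) under the naive Bayes assumption."""
    prior = Counter(r[3] for r in data)
    score = {d: (prior[d] / len(data)) * cond(0, a, d) * cond(1, b, d) * cond(2, c, d)
             for d in (True, False)}
    return score[True] / (score[True] + score[False])

print(posterior_true(True, False, True))   # 0.333... as computed above
```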

Page 18: Summarizer 4: Supervised Summarization (SIGIR 1995)

Features, given a sentence $S_i$:
  $F_{i1}$: the position of sentence $S_i$ within its paragraph
  $F_{i2}$: the length of sentence $S_i$
  $F_{i3}$: $\sum_w TF_w \cdot SF_w$ (term frequency times sentence frequency)
  $F_{i4}$: the cosine similarity between $S_i$ and the title
  $F_{i5}$: the similarity between $S_i$ and all text in the page
  $F_{i6}$: the similarity between $S_i$ and the meta-data of the page
  $F_{i7}$: the number of occurrences in $S_i$ of words from a special word set (italic, bold, or underlined words)
  $F_{i8}$: the average font size of the words in $S_i$

Page 19: Summarizer 4: Supervised Summarization (Cont.)

Classifier (naïve Bayes over the eight features):

  $p(s \in S \mid f_1, f_2, \ldots, f_8) = \frac{\prod_{j=1}^{8} p(f_j \mid s \in S)\; p(s \in S)}{\prod_{j=1}^{8} p(f_j)}$

where $p(s \in S)$ stands for the compression rate, $p(f_i)$ is the probability of each feature $i$, and $p(f_i \mid s \in S)$ is the conditional probability of each feature $i$ given that the sentence is selected into the summary.

Each sentence is then assigned a score by the above equation (see the sketch below).
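
A sketch of how a sentence score could be computed from this formula in log space, once the feature probabilities have been estimated from labeled training data; the feature names, discretized values, and probabilities below are invented for illustration.

```python
import math

def sentence_score(features, p_f_given_s, p_f, compression_rate):
    """log p(s in S | f1..f8), up to a constant:
    log p(s in S) + sum_j [log p(fj | s in S) - log p(fj)].

    features:         dict feature name -> discretized feature value
    p_f_given_s, p_f: dicts (feature name, value) -> probability
    compression_rate: p(s in S), the fraction of sentences kept."""
    score = math.log(compression_rate)
    for name, value in features.items():
        score += math.log(p_f_given_s.get((name, value), 1e-6))  # avoid log(0)
        score -= math.log(p_f.get((name, value), 1e-6))
    return score

# Toy usage with two of the eight features (position bucket, length bucket).
feats   = {"F1": "begin", "F2": "long"}
p_given = {("F1", "begin"): 0.5, ("F2", "long"): 0.4}
p_any   = {("F1", "begin"): 0.3, ("F2", "long"): 0.35}
print(sentence_score(feats, p_given, p_any, compression_rate=0.2))
```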

Page 20: Ensemble Summarizers

The final score for each sentence is calculated by summing the individual score factors obtained for each method used.

Schema 1: we assign the weight of each summarization method in proportion to that method's performance (its micro-F1 value).

Schemas 2-5: we increase the value of $w_i$ (i = 1, 2, 3, 4) to 2 in Schema 2 through Schema 5 respectively and keep the other weights at one.

  $S = w_1 S_{luhn} + w_2 S_{lsa} + w_3 S_{cb} + w_4 S_{sup}$
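
A minimal sketch of the weighted combination; it assumes the four per-method sentence scores have already been normalized to a comparable range, and the example scores are invented (the weight vector illustrates Schema 2, where $w_1$ is doubled).

```python
def ensemble_score(s_luhn, s_lsa, s_cb, s_sup, w=(1.0, 1.0, 1.0, 1.0)):
    """Final sentence score: S = w1*S_luhn + w2*S_lsa + w3*S_cb + w4*S_sup."""
    w1, w2, w3, w4 = w
    return w1 * s_luhn + w2 * s_lsa + w3 * s_cb + w4 * s_sup

schema2 = (2.0, 1.0, 1.0, 1.0)                 # double w1, keep the others at one
print(ensemble_score(0.8, 0.5, 0.6, 0.4, w=schema2))
```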

Page 21: Experiment Setup

Dataset:
  2 million Web pages from the LookSmart Web directory.
  500 thousand pages with manually created descriptions; 30% of these (153,019 pages) were randomly sampled.
  Distributed among 64 categories (only the top two levels of the LookSmart hierarchy).

Classifier:
  NB (naïve Bayes) and SVM (support vector machine).

Evaluation:
  10-fold cross validation.
  Precision, recall, and $F_1 = \frac{2PR}{P + R}$.
  Micro-averaging (gives equal weight to every document) vs. macro-averaging (see the sketch below).
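
To make the micro vs. macro distinction concrete, here is a small sketch that computes both averages from per-category true-positive/false-positive/false-negative counts; the counts are hypothetical.

```python
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro(counts):
    """counts: list of (tp, fp, fn), one tuple per category."""
    # Micro: pool the counts, so every document carries the same weight.
    micro = prf(sum(c[0] for c in counts),
                sum(c[1] for c in counts),
                sum(c[2] for c in counts))
    # Macro: average per-category scores, so every category carries the same weight.
    per_cat = [prf(*c) for c in counts]
    macro = tuple(sum(x) / len(per_cat) for x in zip(*per_cat))
    return micro, macro

print(micro_macro([(80, 20, 30), (10, 5, 5), (3, 1, 9)]))  # hypothetical counts
```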

Page 22: Experiment 1: Feasibility Study

Baseline: the text that remains after removing the HTML tags.

The human-authored summary (description) is used as the ideal summary for the page.

Conclusion: a good summary clearly improves classification performance.

[Bar charts: precision, recall, and micro-F1 for the baseline (plain text) vs. the human summary (description), with NB (left) and SVM (right). The human summary improves micro-F1 by 14.8% with NB and 13.2% with SVM.]

Page 23: Experiment 2: Evaluation on Automatic Summarizers

The unsupervised methods show similar improvements.
The unsupervised methods are better than the supervised method.
None of the automatic methods is as good as the human summary.

NB:
            microP   microR   micro-F1
Baseline    70.7     57.7     63.6
Human       81.5     66.2     73.0
Summ1       77.9     63.3     69.8
Summ2       77.2     62.7     69.2
Summ3       75.9     61.7     68.1
Summ4       75.2     60.9     67.3

SVM:
            microP   microR   micro-F1
Baseline    72.4     59.3     65.1
Human       82.1     66.9     73.7
Summ1       77.3     62.8     69.3
Summ2       78.6     63.7     70.3
Summ3       79.2     64.3     71.0
Supervised  76.3     61.8     68.3

Summ1 = Luhn; Summ2 = Content Body; Summ3 = LSA; Summ4 = Supervised

Page 24: Experiment 2: Evaluation on Automatic Summarizers (Cont.)

The ensemble of summarizers achieves an improvement similar to that of the human summary.

[Bar charts: microP, microR, and micro-F1 for the baseline, the human summary, and the ensemble of summarizers, with NB (left) and SVM (right). Relative to the baseline micro-F1, the human summary and the ensemble improve by 14.8% and 12.9% with NB, and by 13.2% and 11.5% with SVM.]

Page 25: Experiment 3: Parameter Tuning -- Compression Rate

The compression rate is the main parameter under consideration.

Most automatic methods achieve their best results at a compression rate of 20% or 30%.

Performance of CB with different thresholds, with NB:

Threshold   0.20       0.15       0.10       0.05
CB          65.0±0.5   67.0±0.4   69.2±0.4   66.7±0.3

Performance at different compression rates, with NB:

Compression rate   10%        20%        30%        50%
Luhn               66.1±0.5   69.8±0.5   67.4±0.4   64.5±0.3
LSA                66.3±0.6   67.0±0.5   68.1±0.5   63.4±0.3
Supervised         66.1±0.5   67.3±0.4   64.8±0.4   62.9±0.3
Hybrid             66.9±0.4   69.3±0.4   71.8±0.3   67.1±0.3

Page 26: Experiment 3: Parameter Tuning -- Weight Schema

Schema 1: the weight of each summarization method is set in proportion to that method's performance (its micro-F1).

Schemas 2-5: increase the value of $w_i$ (i = 1, 2, 3, 4) to 2 in Schema 2 through Schema 5 respectively and keep the other weights at one.

           microP     microR     micro-F1
Origin     80.2±0.3   65.0±0.3   71.8±0.3
Schema1    81.0±0.3   65.6±0.3   72.5±0.3
Schema2    81.3±0.4   66.1±0.4   72.9±0.4
Schema3    79.5±0.4   64.4±0.4   71.2±0.4
Schema4    81.1±0.3   65.5±0.3   72.5±0.3
Schema5    79.7±0.4   64.7±0.4   71.4±0.4

Page 27: Analysis

Why does summarization help?

Summarization can extract the main topic of a Web page while removing noise.

     # of pages   Total size (KB)   Average size/page (KB)
A    100          31210             31.2
B    500          54500             10.9

A: 100 Web pages that are correctly labeled by all of our summarization-based approaches but wrongly labeled by the baseline system.
B: 500 pages randomly selected from the testing pages.

Conclusion: summarization is especially helpful for large Web pages.

Page 28: Conclusion

Summarization techniques can be helpful for classification.
A new summarizer based on Web-page structure analysis.
A modification to Luhn's method.
New features for Web-page supervised summarization.

Page 29: Future Work

Improve summarization performance:
  Take hypertext/anchor text into consideration.
  Use the hyperlink structure.
  Use query logs.
Apply summarization to other applications, e.g. clustering.

Page 30: Thanks