1
SIGIR 2004: Web-page Classification through Summarization
Dou Shen, Zheng Chen*, Qiang Yang
Presentation: Yao-Min Huang
Date: 09/15/2004
2
Outline
- Motivation
- Related Work
- Architecture Overview
- Summarizer Methods (1-4)
- Experiments
- Conclusion
- Future Work
3
Motivation
To facilitate web users in finding the desired information:
- Browse: navigate through hierarchical collections
- Search: submit a query to a search engine
Much work has been done on Web-page classification (e.g., hyperlink-based approaches).
Summarization is a good method to filter the noise from a web page.
4
Related Work
Overview of summarization:
- Goal: summary generation methods seek to identify the document contents that convey the most "important" information within the document.
Types of summarization:
- Indicative vs. informative
- Extraction vs. abstraction
- Generic vs. query-oriented
- Unsupervised vs. supervised
- Single-document vs. multi-document
5
Related Work (Cont.) -- Summarization in IR
Methods:
- Unsupervised methods: cluster and select
- Supervised methods
Applications:
- Generic summaries for indexing in information retrieval (Tetsuya Sakai, SIGIR 2001)
- Term selection in relevance feedback for IR (A. M. Lam-Adesina, SIGIR 2001)
6
Architecture Overview
(Diagram: the training and testing sets are fed to the summarization module -- Luhn, LSA, supervised, page-layout analysis, or human-written descriptions, optionally combined in an ensemble; the resulting train and testing summaries are passed to a classifier (NB/SVM, evaluated with 10-fold cross validation), which produces the classification result.)
7
Summarizer 1: Adapted Luhn's Method (IBM Journal, 1958)
Assumption: the more "significant words" there are in a sentence, and the closer together they are, the more meaningful the sentence is.
Approach: the sentences with the highest significance factor are selected to form the summary.
Example: — — — [ # — — # — # # # ] — — — #
- significance factor = 5^2 / 8 = 3.125 (5 significant words in a bracketed span of 8 words)
- A word is significant when its frequency lies between the high-frequency cutoff and the low-frequency cutoff.
- With limit L = 2, two significant words separated by at most two other words are considered significantly related.
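The scoring rule above can be sketched in a few lines; this is a minimal illustration (hypothetical helper name, assuming the sentence is already tokenized and the significant-word set is given):

```python
def significance_factor(words, significant, limit=2):
    """Luhn's sentence score: find each bracketed span in which
    consecutive significant words are separated by at most `limit`
    other words, and return the best value of
    (number of significant words in span)^2 / span length."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    clusters, current = [], [positions[0]]
    for p in positions[1:]:
        if p - current[-1] - 1 <= limit:   # still within the span
            current.append(p)
        else:                              # gap too large: start a new span
            clusters.append(current)
            current = [p]
    clusters.append(current)
    return max(len(c) ** 2 / (c[-1] - c[0] + 1) for c in clusters)
```

On the slide's example sequence, the bracketed span holds 5 significant words over 8 positions, giving 25/8 = 3.125; the trailing lone significant word starts its own span because it is more than L = 2 words away.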
8
Summarizer 1: Adapted Luhn's Method (Cont.)
Original method: for a single web page, build the significant-word pool from the page itself, score the sentences in the page against that pool, and select the best ones as the summary.
Adapted method (diagram):
- For training pages: the sentences of the training pages in each category Cat1 ... Catm build one significant-word pool per category; a training page from category m is summarized using that category's pool.
- For testing pages: the sentences of a testing page are scored against every category's pool, and the scores are averaged to produce the summary.
9
Summarizer 2: Latent Semantic Analysis (SIGIR 2001)
A fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse.
Overview: given an m×n term-by-sentence matrix A, compute its singular value decomposition A = U Σ V^T, where:
- Σ = diag(σ1, ..., σr, 0, ..., 0): the sorted singular values, with r = rank(A)
- U (m×n): its column vectors are the left singular vectors (salient patterns among the terms)
- V (n×n): its column vectors are the right singular vectors (salient patterns among the sentences)
10
Summarizer 2: Latent Semantic Analysis (Cont.)
(Diagram: the term-by-sentence matrix A, with rows W1 ... Wm for terms and columns S1 ... Sn for sentences, factored as A = U × Σ × V^T.)
Select the sentence with the largest index value in the right singular vector (column vector of V).
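The selection step can be sketched with NumPy (a minimal sketch, assuming a dense term-by-sentence matrix; for each of the top-k right singular vectors it picks the sentence with the largest index value):

```python
import numpy as np

def lsa_select(A, k=1):
    """A: m x n term-by-sentence matrix.
    For each of the top-k right singular vectors, return the index
    of the sentence with the largest (absolute) index value."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Vt's rows are the right singular vectors, sorted by singular value
    return [int(np.argmax(np.abs(Vt[i]))) for i in range(k)]
```

With k greater than 1 this yields one sentence per salient pattern, which matches the slide's per-singular-vector selection rule.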
11
Summarizer 3: Summarization by Page Layout Analysis (WWW10, 2001)
In HTML content, a BO (Basic Object) is a non-breakable element between two tags, or an embedded object.
12
Summarizer 3: Summarization by Page Layout Analysis
- Analyze the structure of the web page.
- Build a similarity graph between objects: nodes = objects, edge weight = similarity.
- Find the core object.
- Extract the content body (CB).
(Diagram: a typical page layout with Header, Search Box, Main Body, Navigation List, and Copyright regions.)
13
Summarizer 3: Summarization by Page Layout Analysis (Cont.)
Content body (CB) detection algorithm:
- Treat each selected object as a single document and build its TF*IDF index.
- Compute the cosine similarity between every pair of objects, and add a link between two objects if their similarity is greater than a threshold.
- The core object is defined as the object with the most edges.
- Extract the CB as the combination of all objects that have edges linked to the core object.
Summary: all sentences contained in the content body form the summary of the web page.
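The graph-based CB detection steps can be sketched as follows (hypothetical helper name; assumes the pairwise cosine similarities have already been computed from the TF*IDF vectors):

```python
import itertools

def content_body(sim, threshold=0.05):
    """sim: symmetric pairwise-similarity matrix over the page objects.
    Link two objects when their similarity exceeds `threshold`; the
    core object is the one with the most edges, and the content body
    is the core object together with its neighbours."""
    n = len(sim)
    neighbours = {i: set() for i in range(n)}
    for i, j in itertools.combinations(range(n), 2):
        if sim[i][j] > threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)
    core = max(range(n), key=lambda i: len(neighbours[i]))
    return sorted({core} | neighbours[core])
```

With the four-object similarities shown on the next slide and a threshold of 0.05, object 3 gets the most edges and the content body is objects 1-3.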
14
Summarizer 3: Summarization by Page Layout Analysis (Cont.)
Pairwise object similarities:

        Obj1  Obj2  Obj3  Obj4
Obj1    1.00  0.03  0.08  0.00
Obj2          1.00  0.15  0.00
Obj3                1.00  0.02
Obj4                      1.00

(Diagram: the similarity graph over the four objects, with the content body circled.)
15
Naïve Bayes Classifier (ML&DM lecture, Berlin, 2004)
Assume a target function f: X → V, where each instance x is described by attributes a1, ..., an.
The most probable value of f(x) is:
  v_MAP = argmax_{vj ∈ V} P(vj | a1, ..., an)
        = argmax_{vj ∈ V} P(a1, ..., an | vj) P(vj) / P(a1, ..., an)
        = argmax_{vj ∈ V} P(a1, ..., an | vj) P(vj)
Naïve Bayes assumption:
  P(a1, ..., an | vj) = ∏i P(ai | vj)
Naïve Bayes classifier (predicts the target value/classification):
  v_NB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)
16
Naïve Bayes: Example
Given a data set Z of 3-dimensional Boolean examples, train a naïve Bayes classifier to predict the classification D. What is the predicted probability P(D=T | A=T, B=F, C=T)?

A  B  C  D
F  T  F  T
F  F  T  T
T  F  F  T
T  F  F  F
F  T  T  F
F  F  T  F

Estimates from the data:
  P(D=T) = 1/2,      P(D=F) = 1/2
  P(A=T|D=T) = 1/3,  P(A=F|D=T) = 2/3
  P(B=T|D=T) = 1/3,  P(B=F|D=T) = 2/3
  P(C=T|D=T) = 1/3,  P(C=F|D=T) = 2/3
  P(A=T|D=F) = 1/3,  P(A=F|D=F) = 2/3
  P(B=T|D=F) = 1/3,  P(B=F|D=F) = 2/3
  P(C=T|D=F) = 2/3,  P(C=F|D=F) = 1/3
17
Naïve Bayes: Example (Cont.)

  P(D=T | A=T, B=F, C=T)
    = P(A=T, B=F, C=T | D=T) P(D=T)
      / [ P(A=T, B=F, C=T | D=T) P(D=T) + P(A=T, B=F, C=T | D=F) P(D=F) ]
    = (1/3 · 2/3 · 1/3 · 1/2) / (1/3 · 2/3 · 1/3 · 1/2 + 1/3 · 2/3 · 2/3 · 1/2)
    = (1/27) / (1/27 + 2/27)
    = 1/3
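The worked example can be checked mechanically; here is a small sketch using exact fractions, with the slide's dataset encoded as T=1, F=0 (hypothetical function name):

```python
from fractions import Fraction

# The six training examples from the slide: (A, B, C, D)
data = [
    (0, 1, 0, 1), (0, 0, 1, 1), (1, 0, 0, 1),
    (1, 0, 0, 0), (0, 1, 1, 0), (0, 0, 1, 0),
]

def posterior(query):
    """query = (A, B, C) values; returns P(D=1 | query) under the
    naive Bayes independence assumption, with counts from `data`."""
    def joint(d):
        rows = [r for r in data if r[3] == d]
        p = Fraction(len(rows), len(data))            # prior P(D=d)
        for i, v in enumerate(query):                 # ∏ P(attr_i = v | D=d)
            p *= Fraction(sum(r[i] == v for r in rows), len(rows))
        return p
    return joint(1) / (joint(1) + joint(0))
```

Calling `posterior((1, 0, 1))` reproduces the 1/3 derived on this slide.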
18
Summarizer 4: Supervised Summarization (SIGIR 1995)
Features, given a sentence Si:
- Fi1: the position of Si in its paragraph
- Fi2: the length of Si
- Fi3: ∑ TFw · SFw (term frequency × sentence frequency)
- Fi4: the cosine similarity between Si and the title
- Fi5: the similarity between Si and all text in the page
- Fi6: the similarity between Si and the meta-data in the page
- Fi7: the number of occurrences in Si of words from a special word set (italic, bold, or underlined words)
- Fi8: the average font size of the words in Si
19
Summarizer 4: Supervised Summarization (Cont.)
Classifier:
  P(s ∈ S | f1, f2, ..., f8) = [ ∏_{j=1..8} P(fj | s ∈ S) ] · P(s ∈ S) / ∏_{j=1..8} P(fj)
where:
- P(s ∈ S) stands for the compression rate,
- P(fj) is the probability of each feature j,
- P(fj | s ∈ S) is the conditional probability of each feature j.
Each sentence is then assigned a score by the above equation.
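The scoring rule can be sketched as follows (hypothetical helper; the probabilities are assumed to have been estimated from training data beforehand, and features are treated as binary indicators that either fire for a sentence or not):

```python
from math import prod

def sentence_score(features, p_in_summary, p_f_given_s, p_f):
    """Score one sentence: P(s in S | f1..fn) is proportional to
    P(s in S) * prod(P(fj | s in S)) / prod(P(fj)),
    taken over the features that fire for this sentence."""
    return p_in_summary * prod(p_f_given_s[f] / p_f[f] for f in features)
```

Sentences are then ranked by this score and the top fraction (the compression rate) is kept as the summary.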
20
Ensemble Summarizers
The final score of each sentence is the weighted sum of the scores from the individual methods:
  S = w1 · S_luhn + w2 · S_lsa + w3 · S_cb + w4 · S_sup
Schema 1: the weight of each summarization method is set in proportion to its performance (its micro-F1 value).
Schemas 2-5: increase wi (i = 1, 2, 3, 4) to 2 in Schemas 2-5 respectively, keeping the other weights at 1.
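The weighted combination is a one-liner per sentence; a minimal sketch (hypothetical names; assumes each method has already produced one score per sentence):

```python
def ensemble_scores(method_scores, weights):
    """method_scores: {method_name: [score per sentence]};
    weights: {method_name: w}. Returns the combined score list,
    i.e. S = sum_i w_i * S_i for every sentence."""
    n = len(next(iter(method_scores.values())))
    return [
        sum(weights[m] * scores[i] for m, scores in method_scores.items())
        for i in range(n)
    ]
```

Schema 1 would set each weight from the method's micro-F1; Schemas 2-5 simply double one weight at a time.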
21
Experiment Setup
Dataset:
- 2 million web pages from the LookSmart Web directory
- 500 thousand pages with manually created descriptions; 30% of these were randomly sampled (153,019 pages)
- Distributed among 64 categories (only the top two levels of the LookSmart directory)
Classifiers:
- NB (naïve Bayes) and SVM (support vector machine)
Evaluation:
- 10-fold cross validation
- Precision (P), recall (R), and F1 = 2PR / (P + R)
- Micro-averaging (gives equal weight to every document) vs. macro-averaging
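The two averaging schemes can be made concrete (a minimal sketch; per-category true-positive/false-positive/false-negative counts are assumed given):

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    # Micro: pool the counts over all categories first, so every
    # document weighs equally.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp / (tp + fp), tp / (tp + fn))
    # Macro: compute F1 per category, then average, so every
    # category weighs equally regardless of its size.
    macro = sum(
        f1(t / (t + p) if t + p else 0.0,
           t / (t + n) if t + n else 0.0)
        for t, p, n in counts
    ) / len(counts)
    return micro, macro
```

Micro-averaging favours large categories, which is why the paper's results are reported as micro-P/R/F1.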
22
Experiment 1: Feasibility Study
- Baseline: the text remaining after removing the HTML tags
- Human-authored summary (description): treated as the ideal summary for the page
Conclusion: a good summary clearly improves classification performance.
(Charts: precision, recall, and micro-F1 of baseline (plain text) vs. human summary (description); micro-F1 improves by 14.8% with NB and 13.2% with SVM.)
23
Experiment 2: Evaluation of the Automatic Summarizers
- The unsupervised methods give similar improvements.
- The unsupervised methods are better than the supervised method.
- None of the automatic methods is as good as the human summary.

NB:
            microP  microR  micro-F1
Baseline    70.7    57.7    63.6
Human       81.5    66.2    73.0
Summ1       77.9    63.3    69.8
Summ2       77.2    62.7    69.2
Summ3       75.9    61.7    68.1
Summ4       75.2    60.9    67.3

SVM:
            microP  microR  micro-F1
Baseline    72.4    59.3    65.1
Human       82.1    66.9    73.7
Summ1       77.3    62.8    69.3
Summ2       78.6    63.7    70.3
Summ3       79.2    64.3    71.0
Supervised  76.3    61.8    68.3

Summ1 = Luhn; Summ2 = Content Body; Summ3 = LSA; Summ4 = Supervised
24
Experiment 2: Evaluation of the Automatic Summarizers (Cont.)
The ensemble of summarizers achieves an improvement similar to the human summary.
(Charts: microP, microR, and micro-F1 for baseline vs. human summary vs. ensemble of summarizers; the micro-F1 improvement over the baseline is 14.8% (human) vs. 12.9% (ensemble) with NB, and 13.2% (human) vs. 11.5% (ensemble) with SVM.)
25
Experiment 3: Parameter Tuning -- Compression Rate
- The compression rate is the most important parameter under consideration.
- Most of the automatic methods achieve their best result at a compression rate of 20% or 30%.

Performance of CB with different thresholds (NB):
Threshold   0.20      0.15      0.10      0.05
CB          65.0±0.5  67.0±0.4  69.2±0.4  66.7±0.3

Performance at different compression rates (NB):
Rate        10%       20%       30%       50%
Luhn        66.1±0.5  69.8±0.5  67.4±0.4  64.5±0.3
LSA         66.3±0.6  67.0±0.5  68.1±0.5  63.4±0.3
Supervised  66.1±0.5  67.3±0.4  64.8±0.4  62.9±0.3
Hybrid      66.9±0.4  69.3±0.4  71.8±0.3  67.1±0.3
26
Experiment 3: Parameter Tuning -- Weight Schema
- Schema 1: the weight of each summarization method is set in proportion to its performance.
- Schemas 2-5: increase wi (i = 1, 2, 3, 4) to 2 respectively, keeping the other weights at 1.

            microP    microR    micro-F1
Origin      80.2±0.3  65.0±0.3  71.8±0.3
Schema1     81.0±0.3  65.6±0.3  72.5±0.3
Schema2     81.3±0.4  66.1±0.4  72.9±0.4
Schema3     79.5±0.4  64.4±0.4  71.2±0.4
Schema4     81.1±0.3  65.5±0.3  72.5±0.3
Schema5     79.7±0.4  64.7±0.4  71.4±0.4
27
Analysis: Why Does Summarization Help?
Summarization can extract the main topic of a web page while removing noise.

    # of pages  Total size (K)  Average size/page (K)
A   100         31210           31.2
B   500         54500           10.9

A: 100 web pages that are correctly labeled by all our summarization-based approaches but wrongly labeled by the baseline system.
B: 500 pages drawn randomly from the testing pages.
Conclusion: summarization is especially helpful for large web pages.
28
Conclusion
- Summarization techniques can be helpful for classification.
- A new summarizer based on Web-page structure analysis.
- A modification of Luhn's method.
- New features for supervised Web-page summarization.
29
Future Work
- Improve summarization performance:
  - Take hypertext/anchor text into consideration.
  - Use the hyperlink structure.
  - Use query logs.
- Apply summarization to other applications, such as clustering.