1
SIGIR 2004: Web-page Classification through Summarization
Dou Shen, Zheng Chen*, Qiang Yang
Presentation: Yao-Min Huang
Date: 09/15/2004
2
Outline
- Motivation
- Related Work
- Architecture Overview
- Summarizer Methods (1-4)
- Experiments
- Conclusion
- Future Work
3
Motivation
To facilitate web users in finding the desired information:
- Browse: navigate through hierarchical collections
- Search: submit a query to a search engine
Much work has been done on Web-page classification (e.g., hyperlink-based approaches).
Summarization is a good method to filter the noise from a web page.
4
Related Work
Overview of summarization:
- Goal: summary generation methods seek to identify the document contents that convey the most "important" information within the document.
Types of summarization:
- Indicative vs. informative
- Extraction vs. abstraction
- Generic vs. query-oriented
- Unsupervised vs. supervised
- Single-document vs. multi-document
5
Related Work (Cont.) -- Summarization in IR
Methods:
- Unsupervised methods: cluster and select
- Supervised methods
Applications:
- Generic summaries for indexing in information retrieval (Tetsuya Sakai, SIGIR 2001)
- Term selection in relevance feedback for IR (A. M. Lam-Adesina, SIGIR 2001)
6
Architecture Overview
(Diagram: the training and testing sets are fed to the summarization module -- Luhn, LSA, supervised, page-layout analysis, or human-written descriptions, optionally combined in an ensemble; the resulting train and testing summaries are passed to a classifier (NB/SVM, evaluated with 10-fold cross validation), which produces the classification result.)
7
Summarizer 1: Adapted Luhn's Method (IBM Journal, 1958)
Assumption: the more "significant words" there are in a sentence, and the closer together they are, the more meaningful the sentence is.
Approach: the sentences with the highest significance factor are selected to form the summary.
Example: — — — [ # — — # — # # # ] — — — #
- significance factor = 5^2 / 8 = 3.125 (5 significant words in a bracketed span of 8 words)
- A word is significant when its frequency lies between the high-frequency cutoff and the low-frequency cutoff.
- With limit L = 2, two significant words separated by at most two other words are considered significantly related.
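The scoring rule above can be sketched in a few lines; this is a minimal illustration (hypothetical helper name, assuming the sentence is already tokenized and the significant-word set is given):

```python
def significance_factor(words, significant, limit=2):
    """Luhn's sentence score: find each bracketed span in which
    consecutive significant words are separated by at most `limit`
    other words, and return the best value of
    (number of significant words in span)^2 / span length."""
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    clusters, current = [], [positions[0]]
    for p in positions[1:]:
        if p - current[-1] - 1 <= limit:   # still within the span
            current.append(p)
        else:                              # gap too large: start a new span
            clusters.append(current)
            current = [p]
    clusters.append(current)
    return max(len(c) ** 2 / (c[-1] - c[0] + 1) for c in clusters)
```

On the slide's example sequence, the bracketed span holds 5 significant words over 8 positions, giving 25/8 = 3.125; the trailing lone significant word starts its own span because it is more than L = 2 words away.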
8
Summarizer 1: Adapted Luhn's Method (Cont.)
Original method: for a single web page, build the significant-word pool from the page itself, score the sentences in the page against that pool, and select the best ones as the summary.
Adapted method (diagram):
- For training pages: the sentences of the training pages in each category Cat1 ... Catm build one significant-word pool per category; a training page from category m is summarized using that category's pool.
- For testing pages: the sentences of a testing page are scored against every category's pool, and the scores are averaged to produce the summary.
9
Summarizer 2: Latent Semantic Analysis (SIGIR 2001)
A fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in passages of discourse.
Overview: given an m×n term-by-sentence matrix A, compute its singular value decomposition A = U Σ V^T, where:
- Σ = diag(σ1, ..., σr, 0, ..., 0): the sorted singular values, with r = rank(A)
- U (m×n): its column vectors are the left singular vectors (salient patterns among the terms)
- V (n×n): its column vectors are the right singular vectors (salient patterns among the sentences)
10
Summarizer 2: Latent Semantic Analysis (Cont.)
(Diagram: the term-by-sentence matrix A, with rows W1 ... Wm for terms and columns S1 ... Sn for sentences, factored as A = U × Σ × V^T.)
Select the sentence with the largest index value in the right singular vector (column vector of V).
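The selection step can be sketched with NumPy (a minimal sketch, assuming a dense term-by-sentence matrix; for each of the top-k right singular vectors it picks the sentence with the largest index value):

```python
import numpy as np

def lsa_select(A, k=1):
    """A: m x n term-by-sentence matrix.
    For each of the top-k right singular vectors, return the index
    of the sentence with the largest (absolute) index value."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Vt's rows are the right singular vectors, sorted by singular value
    return [int(np.argmax(np.abs(Vt[i]))) for i in range(k)]
```

With k greater than 1 this yields one sentence per salient pattern, which matches the slide's per-singular-vector selection rule.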
11
Summarizer 3: Summarization by Page Layout Analysis (WWW10, 2001)
In HTML content, a BO (Basic Object) is a non-breakable element between two tags, or an embedded object.
12
Summarizer 3: Summarization by Page Layout Analysis
- Analyze the structure of the web page.
- Build a similarity graph between objects: nodes = objects, edge weight = similarity.
- Find the core object.
- Extract the content body (CB).
(Diagram: a typical page layout with Header, Search Box, Main Body, Navigation List, and Copyright regions.)
13
Summarizer 3: Summarization by Page Layout Analysis (Cont.)
Content body (CB) detection algorithm:
- Treat each selected object as a single document and build its TF*IDF index.
- Compute the cosine similarity between every pair of objects, and add a link between two objects if their similarity is greater than a threshold.
- The core object is defined as the object with the most edges.
- Extract the CB as the combination of all objects that have edges linked to the core object.
Summary: all sentences contained in the content body form the summary of the web page.
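The graph-based CB detection steps can be sketched as follows (hypothetical helper name; assumes the pairwise cosine similarities have already been computed from the TF*IDF vectors):

```python
import itertools

def content_body(sim, threshold=0.05):
    """sim: symmetric pairwise-similarity matrix over the page objects.
    Link two objects when their similarity exceeds `threshold`; the
    core object is the one with the most edges, and the content body
    is the core object together with its neighbours."""
    n = len(sim)
    neighbours = {i: set() for i in range(n)}
    for i, j in itertools.combinations(range(n), 2):
        if sim[i][j] > threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)
    core = max(range(n), key=lambda i: len(neighbours[i]))
    return sorted({core} | neighbours[core])
```

With the four-object similarities shown on the next slide and a threshold of 0.05, object 3 gets the most edges and the content body is objects 1-3.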
14
Summarizer 3: Summarization by Page Layout Analysis (Cont.)
Pairwise object similarities:

        Obj1  Obj2  Obj3  Obj4
Obj1    1.00  0.03  0.08  0.00
Obj2          1.00  0.15  0.00
Obj3                1.00  0.02
Obj4                      1.00

(Diagram: the similarity graph over the four objects, with the content body circled.)
15
Naïve Bayes Classifier (ML&DM lecture, Berlin, 2004)
Assume a target function f: X → V, where each instance x is described by attributes a1, ..., an.
The most probable value of f(x) is:
  v_MAP = argmax_{vj ∈ V} P(vj | a1, ..., an)
        = argmax_{vj ∈ V} P(a1, ..., an | vj) P(vj) / P(a1, ..., an)
        = argmax_{vj ∈ V} P(a1, ..., an | vj) P(vj)
Naïve Bayes assumption:
  P(a1, ..., an | vj) = ∏i P(ai | vj)
Naïve Bayes classifier (predicts the target value/classification):
  v_NB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)
16
Naïve Bayes: Example
Given a data set Z of 3-dimensional Boolean examples, train a naïve Bayes classifier to predict the classification D. What is the predicted probability P(D=T | A=T, B=F, C=T)?

A  B  C  D
F  T  F  T
F  F  T  T
T  F  F  T
T  F  F  F
F  T  T  F
F  F  T  F

Estimates from the data:
  P(D=T) = 1/2,      P(D=F) = 1/2
  P(A=T|D=T) = 1/3,  P(A=F|D=T) = 2/3
  P(B=T|D=T) = 1/3,  P(B=F|D=T) = 2/3
  P(C=T|D=T) = 1/3,  P(C=F|D=T) = 2/3
  P(A=T|D=F) = 1/3,  P(A=F|D=F) = 2/3
  P(B=T|D=F) = 1/3,  P(B=F|D=F) = 2/3
  P(C=T|D=F) = 2/3,  P(C=F|D=F) = 1/3
17
Naïve Bayes: Example (Cont.)

  P(D=T | A=T, B=F, C=T)
    = P(A=T, B=F, C=T | D=T) P(D=T)
      / [ P(A=T, B=F, C=T | D=T) P(D=T) + P(A=T, B=F, C=T | D=F) P(D=F) ]
    = (1/3 · 2/3 · 1/3 · 1/2) / (1/3 · 2/3 · 1/3 · 1/2 + 1/3 · 2/3 · 2/3 · 1/2)
    = (1/27) / (1/27 + 2/27)
    = 1/3
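The worked example can be checked mechanically; here is a small sketch using exact fractions, with the slide's dataset encoded as T=1, F=0 (hypothetical function name):

```python
from fractions import Fraction

# The six training examples from the slide: (A, B, C, D)
data = [
    (0, 1, 0, 1), (0, 0, 1, 1), (1, 0, 0, 1),
    (1, 0, 0, 0), (0, 1, 1, 0), (0, 0, 1, 0),
]

def posterior(query):
    """query = (A, B, C) values; returns P(D=1 | query) under the
    naive Bayes independence assumption, with counts from `data`."""
    def joint(d):
        rows = [r for r in data if r[3] == d]
        p = Fraction(len(rows), len(data))            # prior P(D=d)
        for i, v in enumerate(query):                 # ∏ P(attr_i = v | D=d)
            p *= Fraction(sum(r[i] == v for r in rows), len(rows))
        return p
    return joint(1) / (joint(1) + joint(0))
```

Calling `posterior((1, 0, 1))` reproduces the 1/3 derived on this slide.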
18
Summarizer 4: Supervised Summarization (SIGIR 1995)
Features, given a sentence Si:
- Fi1: the position of Si in its paragraph
- Fi2: the length of Si
- Fi3: ∑ TFw · SFw (term frequency × sentence frequency)
- Fi4: the cosine similarity between Si and the title
- Fi5: the similarity between Si and all text in the page
- Fi6: the similarity between Si and the meta-data in the page
- Fi7: the number of occurrences in Si of words from a special word set (italic, bold, or underlined words)
- Fi8: the average font size of the words in Si
19
Summarizer 4: Supervised Summarization (Cont.)
Classifier:
  P(s ∈ S | f1, f2, ..., f8) = [ ∏_{j=1..8} P(fj | s ∈ S) ] · P(s ∈ S) / ∏_{j=1..8} P(fj)
where:
- P(s ∈ S) stands for the compression rate,
- P(fj) is the probability of each feature j,
- P(fj | s ∈ S) is the conditional probability of each feature j.
Each sentence is then assigned a score by the above equation.
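The scoring rule can be sketched as follows (hypothetical helper; the probabilities are assumed to have been estimated from training data beforehand, and features are treated as binary indicators that either fire for a sentence or not):

```python
from math import prod

def sentence_score(features, p_in_summary, p_f_given_s, p_f):
    """Score one sentence: P(s in S | f1..fn) is proportional to
    P(s in S) * prod(P(fj | s in S)) / prod(P(fj)),
    taken over the features that fire for this sentence."""
    return p_in_summary * prod(p_f_given_s[f] / p_f[f] for f in features)
```

Sentences are then ranked by this score and the top fraction (the compression rate) is kept as the summary.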
20
Ensemble Summarizers
The final score of each sentence is the weighted sum of the scores from the individual methods:
  S = w1 · S_luhn + w2 · S_lsa + w3 · S_cb + w4 · S_sup
Schema 1: the weight of each summarization method is set in proportion to its performance (its micro-F1 value).
Schemas 2-5: increase wi (i = 1, 2, 3, 4) to 2 in Schemas 2-5 respectively, keeping the other weights at 1.
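The weighted combination is a one-liner per sentence; a minimal sketch (hypothetical names; assumes each method has already produced one score per sentence):

```python
def ensemble_scores(method_scores, weights):
    """method_scores: {method_name: [score per sentence]};
    weights: {method_name: w}. Returns the combined score list,
    i.e. S = sum_i w_i * S_i for every sentence."""
    n = len(next(iter(method_scores.values())))
    return [
        sum(weights[m] * scores[i] for m, scores in method_scores.items())
        for i in range(n)
    ]
```

Schema 1 would set each weight from the method's micro-F1; Schemas 2-5 simply double one weight at a time.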
21
Experiment Setup
Dataset:
- 2 million web pages from the LookSmart Web directory
- 500 thousand pages with manually created descriptions; 30% of these were randomly sampled (153,019 pages)
- Distributed among 64 categories (only the top two levels of the LookSmart directory)
Classifiers:
- NB (naïve Bayes) and SVM (support vector machine)
Evaluation:
- 10-fold cross validation
- Precision (P), recall (R), and F1 = 2PR / (P + R)
- Micro-averaging (gives equal weight to every document) vs. macro-averaging
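The two averaging schemes can be made concrete (a minimal sketch; per-category true-positive/false-positive/false-negative counts are assumed given):

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per category."""
    # Micro: pool the counts over all categories first, so every
    # document weighs equally.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1(tp / (tp + fp), tp / (tp + fn))
    # Macro: compute F1 per category, then average, so every
    # category weighs equally regardless of its size.
    macro = sum(
        f1(t / (t + p) if t + p else 0.0,
           t / (t + n) if t + n else 0.0)
        for t, p, n in counts
    ) / len(counts)
    return micro, macro
```

Micro-averaging favours large categories, which is why the paper's results are reported as micro-P/R/F1.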
22
Experiment 1: Feasibility Study
- Baseline: the text remaining after removing the HTML tags
- Human-authored summary (description): treated as the ideal summary for the page
Conclusion: a good summary clearly improves classification performance.
(Charts: precision, recall, and micro-F1 of baseline (plain text) vs. human summary (description); micro-F1 improves by 14.8% with NB and 13.2% with SVM.)
23
Experiment 2: Evaluation of the Automatic Summarizers
- The unsupervised methods give similar improvements.
- The unsupervised methods are better than the supervised method.
- None of the automatic methods is as good as the human summary.

NB:
            microP  microR  micro-F1
Baseline    70.7    57.7    63.6
Human       81.5    66.2    73.0
Summ1       77.9    63.3    69.8
Summ2       77.2    62.7    69.2
Summ3       75.9    61.7    68.1
Summ4       75.2    60.9    67.3

SVM:
            microP  microR  micro-F1
Baseline    72.4    59.3    65.1
Human       82.1    66.9    73.7
Summ1       77.3    62.8    69.3
Summ2       78.6    63.7    70.3
Summ3       79.2    64.3    71.0
Supervised  76.3    61.8    68.3

Summ1 = Luhn; Summ2 = Content Body; Summ3 = LSA; Summ4 = Supervised
24
Experiment 2: Evaluation of the Automatic Summarizers (Cont.)
The ensemble of summarizers achieves an improvement similar to the human summary.
(Charts: microP, microR, and micro-F1 for baseline vs. human summary vs. ensemble of summarizers; the micro-F1 improvement over the baseline is 14.8% (human) vs. 12.9% (ensemble) with NB, and 13.2% (human) vs. 11.5% (ensemble) with SVM.)
25
Experiment 3: Parameter Tuning -- Compression Rate
- The compression rate is the most important parameter under consideration.
- Most of the automatic methods achieve their best result at a compression rate of 20% or 30%.

Performance of CB with different thresholds (NB):
Threshold   0.20      0.15      0.10      0.05
CB          65.0±0.5  67.0±0.4  69.2±0.4  66.7±0.3

Performance at different compression rates (NB):
Rate        10%       20%       30%       50%
Luhn        66.1±0.5  69.8±0.5  67.4±0.4  64.5±0.3
LSA         66.3±0.6  67.0±0.5  68.1±0.5  63.4±0.3
Supervised  66.1±0.5  67.3±0.4  64.8±0.4  62.9±0.3
Hybrid      66.9±0.4  69.3±0.4  71.8±0.3  67.1±0.3
26
Experiment 3: Parameter Tuning -- Weight Schema
- Schema 1: the weight of each summarization method is set in proportion to its performance.
- Schemas 2-5: increase wi (i = 1, 2, 3, 4) to 2 respectively, keeping the other weights at 1.

            microP    microR    micro-F1
Origin      80.2±0.3  65.0±0.3  71.8±0.3
Schema1     81.0±0.3  65.6±0.3  72.5±0.3
Schema2     81.3±0.4  66.1±0.4  72.9±0.4
Schema3     79.5±0.4  64.4±0.4  71.2±0.4
Schema4     81.1±0.3  65.5±0.3  72.5±0.3
Schema5     79.7±0.4  64.7±0.4  71.4±0.4
27
Analysis: Why Does Summarization Help?
Summarization can extract the main topic of a web page while removing noise.

    # of pages  Total size (K)  Average size/page (K)
A   100         31210           31.2
B   500         54500           10.9

A: 100 web pages that are correctly labeled by all our summarization-based approaches but wrongly labeled by the baseline system.
B: 500 pages drawn randomly from the testing pages.
Conclusion: summarization is especially helpful for large web pages.
28
Conclusion
- Summarization techniques can be helpful for classification.
- A new summarizer based on Web-page structure analysis.
- A modification of Luhn's method.
- New features for supervised Web-page summarization.
29
Future Work
- Improve summarization performance:
  - Take hypertext/anchor text into consideration.
  - Use the hyperlink structure.
  - Use query logs.
- Apply summarization to other applications, such as clustering.