Page 1
Intelligent Database Systems Lab
Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Authors: Satoshi Oyama, Takashi Kokubo, Toru Ishida
National Yunlin University of Science and Technology
Domain-Specific Web Search with Keyword Spices
IEEE Transactions on Knowledge and Data Engineering, Jan. 2004
Page 2
Outline
  Motivation
  Objective
  Introduction
  Domain-specific web search with keyword spices
  Algorithm for extracting keyword spices
  Experiments
  Conclusions
  Opinion
N.Y.U.S.T.
I.M.
Page 3
Motivation
Naïve queries may return many irrelevant pages; the goal is to obtain more relevant pages.
Writing effective queries depends on considerable experience and skill.
Previous domain-specific search engines collect and index relevant pages.
Manually constructed collections are costly and do not scale.
Page 4
Objective
Domain-specific search engines should:
  return pages relevant to a certain domain
  filter out irrelevant web pages
Page 5
1-1. Introduction
Domain-specific web search engines
Example: looking for a recipe
  Entering only "beef" finds few recipes
  Entering "beef pepper" finds other recipes
Page 6
Example queries: 牛肉 (beef); 牛肉、胡椒 (beef, pepper)
Page 7
1-2. Introduction
Page 8
1-3. Introduction
Page 9
1-4. Introduction
Domain-specific search engines should return pages relevant to a certain domain and filter out irrelevant web pages.
Approach: download both relevant and irrelevant pages, classify them, and use a decision tree.
Page 10
2-1. Domain-specific web search with keyword spices
Domain-specific web search is treated as a text classification problem.
Sample web pages are collected according to assumptions about the user's input keywords.
Page 11
2-1. Domain-specific web search as a text classification
D: the set of all web documents
Dt: the set of documents relevant to a certain domain
Page 12
2-1. Domain-specific web search as a text classification
Take the set of all keywords in the domain, and let the hypothesis space be composed of all Boolean expressions in which each keyword is regarded as a Boolean variable.
A Boolean expression of keywords can be regarded as a function from D to {0, 1}.
Each keyword takes the value 1 if it is contained in the document and 0 otherwise.
Document   Words in the domain   Output
1          1 1 0 0 0             1
2          0 1 0 1 1             0
3          0 1 1 0 0             1
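As a minimal sketch of the encoding above, a document can be held as the set of words it contains and a Boolean expression as a function that returns 1 or 0. The keyword names in `KEYWORDS`, the three documents, and the expression `h` are hypothetical (the slide does not name its columns); they are chosen only so that the vectors and outputs reproduce the three rows of the table.

```python
# Hypothetical keyword vocabulary; the slide does not name its columns.
KEYWORDS = ["recipe", "tablespoon", "home", "top", "ingredients"]


def to_vector(document_words):
    """Encode a document as 0/1 flags: 1 if the keyword occurs, else 0."""
    return [1 if keyword in document_words else 0 for keyword in KEYWORDS]


def h(document_words):
    """Hypothetical Boolean expression: recipe OR (home AND NOT top)."""
    has = lambda keyword: keyword in document_words
    return 1 if has("recipe") or (has("home") and not has("top")) else 0


# Hypothetical documents, each given as the set of words it contains.
docs = {
    "d1": {"recipe", "tablespoon"},
    "d2": {"tablespoon", "top", "ingredients"},
    "d3": {"tablespoon", "home"},
}

vectors = {name: to_vector(words) for name, words in docs.items()}
outputs = {name: h(words) for name, words in docs.items()}

for name in docs:
    print(name, vectors[name], outputs[name])
```

Any other Boolean expression over the same keywords is another point in the hypothesis space; `h` is just one candidate.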
Page 13
2-1. Domain-specific web search as a text classification
Finding hypothesis h that minimizes the error rate:
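One common reading of the error rate, assumed here rather than taken from the slide, is the fraction of labeled sample documents on which h disagrees with the human label. The sketch below picks the candidate with the lowest such error; the sample documents and the two candidate hypotheses are hypothetical.

```python
def error_rate(h, labeled_docs):
    """Fraction of labeled documents on which h disagrees with the label."""
    if not labeled_docs:
        raise ValueError("need at least one labeled document")
    wrong = sum(1 for words, label in labeled_docs if h(words) != label)
    return wrong / len(labeled_docs)


def best_hypothesis(hypotheses, labeled_docs):
    """Return the candidate with the lowest error rate on the sample."""
    return min(hypotheses, key=lambda h: error_rate(h, labeled_docs))


# Hypothetical labeled sample: (set of words in the page, human label 1/0).
sample = [
    ({"recipe", "tablespoon"}, 1),
    ({"tablespoon", "top", "ingredients"}, 0),
    ({"tablespoon", "home"}, 1),
    ({"home", "top"}, 0),
]


def h_recipe(words):
    """Hypothetical candidate: the page contains 'recipe'."""
    return 1 if "recipe" in words else 0


def h_not_top(words):
    """Hypothetical candidate: the page does not contain 'top'."""
    return 0 if "top" in words else 1


chosen = best_hypothesis([h_recipe, h_not_top], sample)
print(chosen.__name__, error_rate(chosen, sample))
```

Enumerating all Boolean expressions is not feasible in practice, which is why a learning algorithm is used to search the hypothesis space.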
Page 14
2-2. Collecting sample web pages by user's input
Random sampling of the web is difficult.
Assume all candidate keywords have the same probability of occurrence.
In the "recipe domain", input "beef", "salmon" (鮭魚), "potato", etc. as sample keywords, and download the same number of web pages for each keyword.
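The collection step can be sketched as follows, assuming a `search` callable that returns result URLs for a keyword. `fake_search` is a hypothetical stand-in for a general-purpose search engine, and skipping pages already collected for another keyword is an added assumption, not something the slide states.

```python
def collect_samples(keywords, search, pages_per_keyword):
    """Collect the same number of pages for every sample keyword."""
    samples = []
    seen = set()
    for keyword in keywords:
        taken = 0
        for url in search(keyword):
            if taken == pages_per_keyword:
                break
            if url in seen:  # skip pages already collected for another keyword
                continue
            seen.add(url)
            samples.append((keyword, url))
            taken += 1
    return samples


def fake_search(keyword):
    """Hypothetical stand-in for a general-purpose search engine."""
    return [f"http://example.com/{keyword}/{i}" for i in range(10)]


pages = collect_samples(["beef", "salmon", "potato"], fake_search, 3)
print(len(pages))
```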
Page 15
2-2. Collecting sample web pages by user's input
Page 16
3-1. Identifying keyword spices
Classify the sample pages into two classes, T or F, by hand.
Apply a decision tree learning algorithm to discover keyword spices:
  each node is an attribute
  each branch indicates the value of the attribute
  each leaf is a class
Example path: no "tablespoon", has "recipe", no "home", no "top" gives class T
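A decision tree of this kind can be read as a Boolean expression by collecting every root-to-leaf path that ends in class T. The sketch below does this for a small hand-built tree; only the path "no tablespoon, has recipe, no home, no top" comes from the slide, and the remaining branches and the tuple representation are hypothetical.

```python
# A node is (keyword, subtree_if_absent, subtree_if_present); a leaf is "T" or "F".
# Only the path ending in the first "T" is taken from the slide.
tree = (
    "tablespoon",
    (
        "recipe",
        "F",                        # no "recipe"
        (
            "home",
            ("top", "T", "F"),      # no "home": no "top" gives T
            "F",                    # has "home"
        ),
    ),
    "T",                            # has "tablespoon" (hypothetical branch)
)


def paths_to_true(node, path=()):
    """Return one conjunction (list of (keyword, present)) per T leaf."""
    if node == "T":
        return [list(path)]
    if node == "F":
        return []
    keyword, absent, present = node
    return (paths_to_true(absent, path + ((keyword, False),))
            + paths_to_true(present, path + ((keyword, True),)))


def format_dnf(dnf):
    """Render the conjunctions as a readable disjunctive normal form."""
    parts = []
    for conjunction in dnf:
        literals = [k if present else f"NOT {k}" for k, present in conjunction]
        parts.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(parts)


dnf = paths_to_true(tree)
print(format_dnf(dnf))
```

Each conjunction is one path through the tree, so a large tree produces a long expression.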
Page 17
3-1. Extracting keyword spices
Rows are web pages collected by the user's input keywords; the output column is classified by humans.
Page   Words in the domain   Output
d1     1 1 0 0 0             1
d2     0 1 0 1 1             0
d3     0 1 1 0 0             1
Page 18
3-1. Identifying keyword spices
Page 19
3-2. Simplifying keyword spices
Induced decision trees are very large.
Search engines do not accept overly complex queries.
Large trees also suffer from the overfitting problem.
Page 20
3-2. Simplifying keyword spices
Simplify the induced Boolean expression h:
1. For each conjunction c in h, remove keywords (Boolean literals) from c to simplify it.
2. Remove conjunctions from the disjunctive normal form h to simplify it.
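The two steps can be sketched as a greedy procedure. The acceptance rule used here (keep a removal only if a score on validation data does not decrease) and the default score are assumptions; the slide states only what is removed, not when. The expression and validation pages are hypothetical.

```python
def matches(conjunction, words):
    """True if every literal (keyword, present) holds for the page."""
    return all((k in words) == present for k, present in conjunction)


def predict(h, words):
    """h is a list of conjunctions, read as a disjunctive normal form."""
    return any(matches(c, words) for c in h)


def f_measure(h, validation):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for w, label in validation if predict(h, w) and label)
    fp = sum(1 for w, label in validation if predict(h, w) and not label)
    fn = sum(1 for w, label in validation if not predict(h, w) and label)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def simplify(h, validation, score=f_measure):
    h = [list(c) for c in h]  # work on a copy
    # Step 1: remove literals from each conjunction.
    for c in h:
        for literal in list(c):
            if len(c) == 1:
                break
            before = score(h, validation)
            c.remove(literal)
            if score(h, validation) < before:
                c.append(literal)  # removal hurt the score: restore it
    # Step 2: remove whole conjunctions.
    for c in list(h):
        if len(h) == 1:
            break
        before = score(h, validation)
        h.remove(c)
        if score(h, validation) < before:
            h.append(c)
    return h


# Hypothetical expression and validation pages (set of words, human label).
h = [
    [("recipe", True), ("home", False), ("top", False)],
    [("tablespoon", True), ("rare", True)],
]
validation = [
    ({"recipe", "beef"}, True),
    ({"recipe", "tablespoon"}, True),
    ({"home", "top"}, False),
    ({"beef", "top"}, False),
    ({"beef"}, False),
]
simplified = simplify(h, validation)
print(simplified)
```

The result depends on the order in which literals are tried, which is a known property of greedy pruning rather than something specific to this sketch.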
Page 21
3-2. Simplifying keyword spices
Precision P and recall R are defined over the validation data.
F is the harmonic mean of P and R.
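For reference, the standard definitions can be computed as below, assuming `predicted` and `actual` are parallel lists of booleans over the validation pages. The data is hypothetical.

```python
def precision_recall_f(predicted, actual):
    """Return (P, R, F) where F is the harmonic mean of P and R."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Hypothetical validation results: what the expression returned vs. human labels.
predicted = [True, True, True, False, False]
actual = [True, True, False, True, False]
p, r, f = precision_recall_f(predicted, actual)
print(p, r, f)
```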
Page 22
3-2. Simplifying keyword spices
A weight can give precision or recall a greater contribution to F.
F then becomes the weighted harmonic mean of P and R.
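One widely used weighted harmonic mean is the F-beta form, assumed here because the slide's own weighting formula is not shown. With beta greater than 1 recall contributes more; with beta less than 1 precision contributes more.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta form, assumed)."""
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values: high precision, low recall.
print(f_beta(0.9, 0.3, beta=0.5))  # precision weighted more heavily
print(f_beta(0.9, 0.3, beta=2.0))  # recall weighted more heavily
```

With beta equal to 1 this reduces to the plain harmonic mean of P and R.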
Page 23
4. Experiments
Page 24
4-1. Experiments: extracting keyword spices
Page 25
4-1. Experiments: extracting keyword spices
Page 26
4-1. Extracting keyword spices
In the recipe domain, the sample pages were split randomly.
Page 27
4-1. Extracting keyword spices
Keyword spices discovered for a recipe search engine
Page 28
4-1. Extracting keyword spices
Trade-off between precision and recall
Page 29
4-1. Extracting keyword spices
When , keyword spices extracted for the domain of …
Page 30
4-2. Evaluation using a general-purpose search engine
Page 31
4-2. Evaluation using a general-purpose search engine
Queries used to test each domain
Page 32
4-2. Evaluation using a general-purpose search engine
Page 33
4-2. Evaluation using a general-purpose search engine
Precision values of the sample queries conjoined with "recipe"
A query that only adds the keyword "recipe" (for example, "beef recipe") finds fewer relevant pages than the same query with keyword spices.
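The difference between the two kinds of query can be sketched by building both strings. The spice expression below and the AND/OR/NOT syntax are hypothetical; a real search engine's Boolean syntax may differ.

```python
def with_spice(user_query, spice_dnf):
    """Conjoin the user's query with a keyword spice given in DNF."""
    conjunctions = []
    for conjunction in spice_dnf:
        literals = [k if present else f"NOT {k}" for k, present in conjunction]
        conjunctions.append("(" + " AND ".join(literals) + ")")
    return f"{user_query} AND (" + " OR ".join(conjunctions) + ")"


# Hypothetical keyword spice for the recipe domain.
spice = [
    [("ingredients", True), ("home", False)],
    [("tablespoon", True)],
]

naive_query = "beef AND recipe"
spiced_query = with_spice("beef", spice)
print(naive_query)
print(spiced_query)
```

The user still types only "beef"; the spice is appended before the query is forwarded to the general-purpose engine.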
Page 34
4-3. Comparison to the filtering model
Page 35
4-3. Comparison to the filtering model
Precision values of the sample queries in the filtering model
Page 36
4-3. Comparison to the filtering model
Numbers of relevant pages returned by the …
Page 37
4-3. Comparison to the filtering model
For example, for "shrimp" the filtering model must download 5 pages to obtain one relevant result, so it is quite inefficient.
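The arithmetic behind this point: if a fraction p of downloaded pages is relevant, about 1/p pages must be downloaded per relevant result. Downloading 5 pages per result corresponds to p = 0.2. The helper names below are hypothetical.

```python
def pages_per_relevant_result(precision):
    """Expected downloads needed to obtain one relevant page."""
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    return 1 / precision


def expected_downloads(relevant_wanted, precision):
    """Expected downloads needed to obtain the given number of relevant pages."""
    return relevant_wanted * pages_per_relevant_result(precision)


print(pages_per_relevant_result(0.2))
print(expected_downloads(10, 0.2))
```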
Page 38
5. Future Work
Training examples must be classified by humans, which is costly.
Page 39
5. Future Work
1. Using a web directory as a source of training examples
Web directories such as Yahoo, Open Directory, …
Estimate bias
Page 40
5. Future Work
2. Learning classifiers from partially labeled data
A proposed algorithm augments a small set of labeled examples with a huge set of unlabeled ones.
Page 41
6. Conclusion
keyword spices human
Cost, effective
Page 42
Opinion
The method depends heavily on human effort.
It assumes all candidate keywords have the same probability of occurrence.
……
Page 43
Opinion
Pr(TL)? Pr(TL')?
\[
\Pr(TL \mid W_i) = \frac{\Pr(W_i \mid TL)\,\Pr(TL)}{\Pr(W_i)}, \qquad
\Pr(TL' \mid W_i) = \frac{\Pr(W_i \mid TL')\,\Pr(TL')}{\Pr(W_i)}
\]
\[
\frac{\Pr(TL \mid W_i)}{\Pr(TL' \mid W_i)} = \frac{\Pr(W_i \mid TL)\,\Pr(TL)}{\Pr(W_i \mid TL')\,\Pr(TL')}
\]
Page 44
Opinion
Posterior probability rule
\[
\frac{\Pr(W_i \mid TL')}{\Pr(W_i \mid TL)} = \frac{\Pr(TL' \mid W_i)}{\Pr(TL \mid W_i)} \cdot \frac{\Pr(TL)}{\Pr(TL')}
\]
\[
\lim_{x \to 0} f(x), \qquad \lim_{x \to 1} f(x)
\]
Assume all candidate keywords have the same probability of occurrence.
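A small numeric check can illustrate why the priors Pr(TL) and Pr(TL') matter. The sketch assumes TL' is the complement of TL, so Pr(TL') = 1 - Pr(TL); all probability values are hypothetical.

```python
def posterior(prior_tl, p_w_given_tl, p_w_given_not_tl):
    """Pr(TL | Wi) by Bayes' rule, assuming TL' is the complement of TL."""
    prior_not_tl = 1 - prior_tl
    evidence = p_w_given_tl * prior_tl + p_w_given_not_tl * prior_not_tl
    return p_w_given_tl * prior_tl / evidence


# Same likelihoods Pr(Wi | TL) and Pr(Wi | TL'), different priors Pr(TL).
balanced = posterior(0.5, 0.8, 0.2)
skewed = posterior(0.1, 0.8, 0.2)
print(balanced, skewed)
```

With identical likelihoods, the posterior drops sharply when the prior is small, which is why the choice of priors deserves attention.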
Page 55
Keyword Spices Modified
Page 56
Information Retrieval
Page 57
Machine Learning (clustering, classification)
Page 58
Content Web Mining
Page 59
A dictionary that can represent the distance between words