XClean: Providing Valid Spelling Suggestions for XML Keyword Queries Yifei Lu 1 , Wei Wang 1 , Jianxin Li 2 and Chengfei Liu 2 1 University of New South Wales 2 Swinburne University of Technology
Feb 22, 2016
XML Keyword Search
Query: jiawi minning
User: I want to find data mining papers coauthored by Jiawei Han
[Figure: DBLP XML tree. The root DBLP has paper, paper and book children with title and author nodes; titles include “Mining concept” and “link mining”; authors include Jiawei Han, Jian Pei, Eric and Manning.]
Challenges
Must offer highly plausible suggestions
The suggested query should have non-empty results
Must be highly efficient
Poor Suggestion
Empty Result
Pu and Yu [PVLDB08] will suggest “jian manning”
Worse than “jiawei mining”
No meaningful connection between the two keywords, so the result is empty
[Figure: the DBLP XML tree shown earlier, for the query “jiawi minning”.]
Problem Definition
Data: a set of XML document trees, combined into a single tree $T$ by adding a virtual root node.
Query: $Q$ = {jiawi, minning}
Confusion set: for each query keyword, the valid words in the vocabulary whose edit distance to the keyword is ≤ a threshold. Here jiawi → {jiawei, jian} and minning → {mining, manning}.
Candidate query space $S$: the queries formed by picking one word from each keyword's confusion set.
Query cleaning: find the top-k queries $C \in S$, ranked by $\Pr(C \mid Q, T)$.
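The confusion sets and the candidate query space can be illustrated with a minimal Python sketch. The vocabulary, the threshold of 2 and the function names are assumptions made for this example; the slides do not specify them.

```python
from itertools import product

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def confusion_set(keyword, vocabulary, threshold):
    """Vocabulary words within `threshold` edits of `keyword`."""
    return sorted(w for w in vocabulary if edit_distance(keyword, w) <= threshold)

def candidate_space(query, vocabulary, threshold):
    """All queries built by choosing one confusion-set word per keyword."""
    sets = [confusion_set(q, vocabulary, threshold) for q in query]
    return [tuple(c) for c in product(*sets)]

# Hypothetical toy vocabulary, loosely based on the words in the example tree.
VOCABULARY = {"jiawei", "jian", "han", "pei", "eric",
              "mining", "manning", "link", "concept"}

candidates = candidate_space(["jiawi", "minning"], VOCABULARY, threshold=2)
```

With this toy vocabulary the space contains four candidates, the same four combinations of jiawei/jian and mining/manning used in the example.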
Ranking Candidate Queries
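How to model $\Pr(C \mid Q, T)$? By Bayes' theorem:
$$\Pr(C \mid Q, T) = \frac{\Pr(Q \mid C, T)\,\Pr(C \mid T)}{\Pr(Q \mid T)}$$
$\Pr(Q \mid T)$ is the same for every candidate, so rank by
$$\Pr(Q \mid C, T) \cdot \Pr(C \mid T)$$
where $\Pr(Q \mid C, T)$ is the Error Model and $\Pr(C \mid T)$ is the Query Likelihood Model.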
Error Model
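Models typographical errors, $\Pr(Q \mid C, T)$: the more similar two words are, the more likely one is a misspelling of the other.
Similarity is measured by edit distance:
$$\Pr(q \mid w) = \frac{1}{z} \cdot \exp(-\beta \cdot ed(q, w))$$
Independence assumption:
$$\Pr(Q \mid C, T) = \Pr(Q \mid C) = \prod_{1 \le j \le l} \Pr(Q[j] \mid C[j])$$
[Figure: words around “minning” by edit distance: mining and manning at ed=1; linking, finding, running and binding at ed=2.]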
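A minimal Python sketch of the error model follows. The values of `beta` and `z` are placeholders chosen for illustration (the slides give no values), and `z` is simply treated as a constant normalizer.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def word_error_prob(q: str, w: str, beta: float = 1.0, z: float = 1.0) -> float:
    """Pr(q | w) = (1/z) * exp(-beta * ed(q, w)); beta and z are assumed values."""
    return math.exp(-beta * edit_distance(q, w)) / z

def query_error_prob(query, candidate, beta: float = 1.0, z: float = 1.0) -> float:
    """Pr(Q | C): product of per-keyword probabilities (independence assumption)."""
    return math.prod(word_error_prob(q, w, beta, z)
                     for q, w in zip(query, candidate))

p_good = query_error_prob(["jiawi", "minning"], ["jiawei", "mining"])
p_poor = query_error_prob(["jiawi", "minning"], ["jian", "manning"])
```

Under these assumed parameters, “jiawei mining” (total edit distance 2) receives a higher error-model probability than “jian manning” (total edit distance 3).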
Query Likelihood Model
Models the query generation probability $\Pr(C \mid T)$: a good query finds good results.
$T$ is a set of disjoint entities (sub-trees).
Measure the query likelihood on each entity, then aggregate over all entities:
$$\Pr(C \mid T) = \sum_{r \in \text{entities}} \Pr(C \mid r) \cdot \Pr(r \mid T)$$
$\Pr(r \mid T)$ is the entity prior (assumed uniform).
[Figure: the DBLP XML tree $T$ with three entities r1, r2 and r3 marked.]
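The aggregation over entities can be sketched in a few lines of Python. The per-entity likelihood values passed in are hypothetical inputs; the function name and signature are likewise assumptions for illustration.

```python
def query_likelihood(entity_likelihoods, priors=None):
    """Pr(C | T) = sum over entities r of Pr(C | r) * Pr(r | T).

    entity_likelihoods: one Pr(C | r) value per entity.
    priors: one Pr(r | T) value per entity; uniform when omitted.
    """
    n = len(entity_likelihoods)
    if priors is None:
        priors = [1.0 / n] * n  # uniform entity prior, as assumed on the slide
    return sum(l * p for l, p in zip(entity_likelihoods, priors))

# Hypothetical values: the candidate matches only the first of three entities.
pr_c_given_t = query_likelihood([0.02, 0.0, 0.0])
```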
Language Modeling
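Models the query likelihood $\Pr(C \mid r)$ on each entity: extract the text in the sub-tree and build a language model.
[Figure: entity r1, a paper under DBLP with title “Mining concept drifting data”, author “Jiawei Han” and booktitle “Data mining and knowledge discovery”; other children are elided.]
Language model of r1:

| Word | Freq | Pr(w) |
|---|---|---|
| mining | 2 | 0.2 |
| data | 2 | 0.2 |
| jiawei | 1 | 0.1 |
| concept | 1 | 0.1 |
| drifting | 1 | 0.1 |
| han | 1 | 0.1 |
| knowledge | 1 | 0.1 |
| discovery | 1 | 0.1 |

$$\Pr(C \mid r_1) = \Pr(\text{jiawei} \mid r_1) \cdot \Pr(\text{mining} \mid r_1) = 0.1 \times 0.2 = 0.02$$
Smoothing is used to avoid zero probabilities.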
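The language model of r1 can be reproduced with a short Python sketch. The word list mirrors the frequency table (the word “and” is left out, as in the table). The slides do not say which smoothing method is used, so the additive smoothing below, with its `alpha` and `vocab_size` parameters, is only an assumed stand-in.

```python
from collections import Counter

# Words of entity r1, matching the frequency table (10 tokens in total).
R1_WORDS = ("mining concept drifting data jiawei han "
            "data mining knowledge discovery").split()

def word_prob(word, counts, total, alpha=0.0, vocab_size=0):
    """Pr(w | r) with optional additive smoothing (an assumed smoothing choice)."""
    return (counts[word] + alpha) / (total + alpha * vocab_size)

def candidate_likelihood(candidate, words, alpha=0.0, vocab_size=0):
    """Pr(C | r): product of the unigram probabilities of the candidate's words."""
    counts = Counter(words)
    total = len(words)
    prob = 1.0
    for w in candidate:
        prob *= word_prob(w, counts, total, alpha, vocab_size)
    return prob

unsmoothed = candidate_likelihood(["jiawei", "mining"], R1_WORDS)
smoothed = candidate_likelihood(["jian", "mining"], R1_WORDS,
                                alpha=1.0, vocab_size=20)
```

Without smoothing, a candidate containing a word absent from r1 (such as “jian”) gets probability zero on that entity; with the assumed additive smoothing it stays small but positive.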
Finding the entities
How to find the entities? Each entity is a potential search result.
Different semantics can be applied:
SLCA, ELCA, etc.
Specific return type: one for each query; a popular type, but not too deep.
Example: p = /DBLP/paper
[Figure: the DBLP XML tree, with the return type p = /DBLP/paper.]
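As an illustration of the return-type approach, the following Python sketch collects the entities for a given path with the standard library's `xml.etree.ElementTree`. The XML document is a made-up toy loosely modeled on the figure, and the helper names are assumptions; choosing the return type automatically (popular but not too deep) is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document, loosely modeled on the DBLP example tree.
DOC = """<DBLP>
<paper><title>Mining concept drifting data</title><author>Jiawei Han</author></paper>
<paper><title>link mining</title><author>Eric</author></paper>
<book><author>Manning</author></book>
</DBLP>"""

def find_entities(root, return_type):
    """Elements matching an absolute path such as /DBLP/paper."""
    steps = return_type.strip("/").split("/")
    if steps[0] != root.tag:
        return []
    rest = "/".join(steps[1:])
    return root.findall(rest) if rest else [root]

def entity_words(element):
    """Lower-cased words of all text inside the entity's sub-tree."""
    return " ".join(element.itertext()).lower().split()

root = ET.fromstring(DOC)
papers = find_entities(root, "/DBLP/paper")
words_per_entity = [entity_words(p) for p in papers]
```

Each returned element is one entity; its word list is the input for the per-entity language model.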
Summary: Ranking Framework
$$\underbrace{\Pr(Q \mid C, T)}_{\text{Error Model}} \cdot \sum_{r} \underbrace{\Pr(C \mid r)}_{\text{query likelihood on each entity}} \, \underbrace{\Pr(r \mid T)}_{\text{Entity Prior}}$$
Algorithm
Naïve Algorithm
Enumerate all possible candidate queries.
Find the entities and compute the score for each candidate query.
Candidates for the example: 1. Jiawei mining 2. Jian mining 3. Jiawei Manning 4. Jian Manning
Problems:
Multiple passes over the data
Not all candidates are needed
[Figure: the DBLP XML tree showing where the variants Jiawei, Jian, link and Manning occur.]
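The naïve algorithm can be sketched in Python as follows. This is only an illustration under stated assumptions: entities are given as plain word lists (taken from the three sub-trees of the running example), the likelihood is an unsmoothed unigram model, the entity prior is uniform, and `beta` is a placeholder with the normalizer z fixed to 1.

```python
import math
from itertools import product

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def score(query, candidate, entities, beta=1.0):
    """Pr(Q | C) * sum_r Pr(C | r) Pr(r | T), unsmoothed, uniform prior, z = 1."""
    error = math.prod(math.exp(-beta * edit_distance(q, w))
                      for q, w in zip(query, candidate))
    prior = 1.0 / len(entities)
    likelihood = sum(
        prior * math.prod(words.count(w) / len(words) for w in candidate)
        for words in entities)
    return error * likelihood

def naive_clean(query, confusion_sets, entities, k, beta=1.0):
    """Score every candidate; each one needs its own pass over all entities."""
    scored = [(c, score(query, c, entities, beta))
              for c in product(*confusion_sets)]
    scored.sort(key=lambda cs: cs[1], reverse=True)
    return scored[:k]

QUERY = ["jiawi", "minning"]
CONFUSION_SETS = [["jiawei", "jian"], ["mining", "manning"]]
ENTITIES = [["jiawei", "jian"], ["mining", "jiawei"], ["jian", "manning"]]

ranked = naive_clean(QUERY, CONFUSION_SETS, ENTITIES, k=4)
```

In this toy setting two of the four candidates score zero because no single entity contains both of their words, which is the wasted work the slide points out.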
XClean Example
Query: jiawi minning
Inverted lists (node labels of the nodes containing each variant), with one pointer per list:
p1 jiawei: 1.1.1.1.1, 1.2.2.1
p2 jian: 1.1.1.2.1, 1.3.1.1.1
p3 mining: 1.2.1.1
p4 manning: 1.3.2.1.1
[Figure: the DBLP tree with Dewey-style node labels. DBLP is node 1; its children are paper 1.1, paper 1.2 and book 1.3. Under 1.1 are the authors jiawei (1.1.1.1.1) and jian (1.1.1.2.1); under 1.2 are the title word mining (1.2.1.1) and the author jiawei (1.2.2.1); under 1.3 are jian (1.3.1.1.1) and manning (1.3.2.1.1).]
Step 1: “Jiawei mining” is generated; “Jian mining” is skipped.
Step 2: with p2 and p4 at their current positions, “jian manning” is generated.
Experiment Settings
Algorithms: XClean; PY08 (Pu and Yu [PVLDB08]); SE1 (Search Engine 1); SE2 (Search Engine 2)
Measures: Mean Reciprocal Rank, Precision@N, Time
Experiment Settings
Datasets

| Dataset | Size (MB) | #nodes | Max depth | Avg depth | Queries |
|---|---|---|---|---|---|
| INEX | 5,878 | 52M | 50 | 5.58 | 285 |
| DBLP | 526 | 12M | 7 | 3.8 | 49 |

Queries
Clean: original clean queries (INEX: 285, DBLP: 49)
Random: random edit operations on each keyword
Rule: replace each word with a common misspelling
Experiment Results
Mean Reciprocal Rank (MRR)
$$\text{MRR} = \frac{1}{N} \sum_{1 \le i \le N} \frac{1}{rank(Q_i)}$$
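A small Python sketch of the MRR computation. The convention that a query whose correct suggestion is never returned contributes 0 is an assumption of this sketch, and the ranks used are made-up values, not results from the experiments.

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/N) * sum of 1/rank(Q_i).

    ranks: rank of the correct suggestion for each query (1 = top),
    or None when it was not suggested at all (assumed to contribute 0).
    """
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Hypothetical ranks for four queries.
mrr = mean_reciprocal_rank([1, 2, None, 4])
```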
Experiment Results
Precision@N: percentage of queries for which the correct suggestion is in the top-N suggestions
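Precision@N can be computed in the same style. As before, the ranks are made-up values and `None` marks a query whose correct suggestion was not returned (an assumed convention).

```python
def precision_at_n(ranks, n):
    """Fraction of queries whose correct suggestion appears in the top n."""
    hits = sum(1 for r in ranks if r is not None and r <= n)
    return hits / len(ranks)

# Hypothetical ranks for four queries.
p_at_1 = precision_at_n([1, 2, None, 4], n=1)
p_at_3 = precision_at_n([1, 2, None, 4], n=3)
```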
Experiment Results
Time: query processing time
Conclusion
Contributions
A probabilistic framework for keyword query cleaning on XML databases.
An Error Model based on edit distance.
A Query Likelihood Model that exploits XML tree structures and keyword search semantics.
Future work
Concatenation/splitting of words
Cognitive errors
Thank you! Questions?
XClean Algorithm
1) Find variants $w_{ij}$ for each query keyword $q_i$, and compute the error probability $\Pr(q_i \mid w_{ij})$
2) Retrieve the XML nodes containing each variant through an inverted index
3) The nodes of all variants of $q_i$ form a virtual list
4) Find the entity nodes that have at least one child node from each virtual list
   a) Compute the $\Pr(C \mid r)$ for each candidate query $C$ found in each entity $r$
   b) Accumulate the scores in a global hash table
5) Output top-k candidate queries
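The steps above can be sketched in Python on the inverted lists of the running example. This is a simplified illustration, not the authors' implementation: entities are taken to be the label prefix at depth 2, nodes are grouped per entity with dictionaries instead of a pointer-based merge over the lists, the likelihood is unsmoothed, and `ENTITY_LEN`, `beta` and the precomputed edit distances in `VARIANTS` are assumed inputs.

```python
import heapq
import math
from collections import defaultdict
from itertools import product

# Inverted index: variant -> labels of the nodes that contain it.
INDEX = {"jiawei": ["1.1.1.1.1", "1.2.2.1"],
         "jian": ["1.1.1.2.1", "1.3.1.1.1"],
         "mining": ["1.2.1.1"],
         "manning": ["1.3.2.1.1"]}
# Step 1 output: per query keyword, variant -> edit distance to the keyword.
VARIANTS = [{"jiawei": 1, "jian": 2}, {"mining": 1, "manning": 1}]
# Assumed number of words in each entity (two words each in the toy tree).
ENTITY_LEN = {"1.1": 2, "1.2": 2, "1.3": 2}

def entity_of(label, depth=2):
    """Entity containing a node: its label prefix at the entity depth."""
    return ".".join(label.split(".")[:depth])

def xclean(variants, index, entity_len, k, beta=1.0, depth=2):
    # Steps 2-3: per keyword, entity -> {variant: occurrence count}.
    per_keyword = []
    for keyword_variants in variants:
        hits = defaultdict(lambda: defaultdict(int))
        for w in keyword_variants:
            for label in index.get(w, []):
                hits[entity_of(label, depth)][w] += 1
        per_keyword.append(hits)
    # Step 4: keep entities that contain a node from every virtual list.
    shared = set(per_keyword[0])
    for hits in per_keyword[1:]:
        shared &= set(hits)
    scores = defaultdict(float)  # global hash table of candidate scores
    prior = 1.0 / len(entity_len)
    for r in shared:
        # Only candidates whose words all occur in r are generated.
        for cand in product(*(hits[r] for hits in per_keyword)):
            likelihood = math.prod(per_keyword[i][r][w] / entity_len[r]
                                   for i, w in enumerate(cand))
            error = math.prod(math.exp(-beta * variants[i][w])
                              for i, w in enumerate(cand))
            scores[cand] += error * likelihood * prior
    # Step 5: top-k candidates by accumulated score.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

top = xclean(VARIANTS, INDEX, ENTITY_LEN, k=2)
```

On this input only “jiawei mining” (from entity 1.2) and “jian manning” (from entity 1.3) are ever generated; “jian mining” and “jiawei manning” are never scored because no entity contains both words.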