XClean: Providing Valid Spelling Suggestions for XML Keyword Queries

Yifei Lu1, Wei Wang1, Jianxin Li2 and Chengfei Liu2

1 University of New South Wales, 2 Swinburne University of Technology

XML Keyword Search


Query: jiawi minning

User: I want to find data mining papers coauthored by Jiawei Han

[Figure: example XML tree rooted at DBLP with two paper sub-trees and one book sub-tree; titles include "Mining concept" and "link mining"; authors include Eric, Jiawei Han, Manning and Jian Pei]

Challenges

Must offer highly plausible suggestions

The suggested query should have non-empty results

Must be highly efficient

Poor Suggestion

Empty Result

Pu and Yu [PVLDB08] will suggest “jian manning”
  Worse than “jiawei mining”
  No meaningful connection

Empty Result

[Figure: the example DBLP tree from the XML Keyword Search slide, repeated]

Problem Definition

Data
  A set of XML document trees $T$
  Form a single tree by adding a virtual root node

Query
  $Q$ = {jiawi, minning}

Confusion Set: valid words in the vocabulary with edit distance ≤ threshold
  jiawi: jiawei, jian
  minning: mining, manning

Candidate Query Space $S$

Query Cleaning
  Find top-k queries from the Candidate Query Space
  Rank by $\Pr(C \mid Q, T),\ C \in S$
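The confusion sets and the candidate query space can be sketched in a few lines of Python. This is only an illustration: the toy `vocabulary`, the threshold of 2 and the function names are assumptions, not part of the slides.

```python
from itertools import product

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]

def confusion_set(keyword, vocabulary, threshold):
    """Valid vocabulary words within the edit distance threshold of the keyword."""
    return sorted(w for w in vocabulary if edit_distance(keyword, w) <= threshold)

def candidate_space(query, vocabulary, threshold):
    """Cartesian product of the per-keyword confusion sets."""
    sets = [confusion_set(q, vocabulary, threshold) for q in query]
    return list(product(*sets))

# Hypothetical toy vocabulary built from the words in the example tree.
vocabulary = {"jiawei", "jian", "mining", "manning", "han", "pei",
              "eric", "concept", "link"}
candidates = candidate_space(["jiawi", "minning"], vocabulary, threshold=2)
for c in candidates:
    print(" ".join(c))
```

With this toy vocabulary the space contains the same four candidates as the running example: every pairing of {jiawei, jian} with {mining, manning}.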

Ranking Candidate Queries

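How to model $\Pr(C \mid Q, T)$?
  By Bayes’ Theorem

$$\Pr(C \mid Q, T) = \frac{\Pr(Q \mid C, T)\,\Pr(C \mid T)}{\Pr(Q \mid T)}$$

  $\Pr(Q \mid C, T)$: Error Model
  $\Pr(C \mid T)$: Query Likelihood Model

Rank by $\Pr(Q \mid C, T) \cdot \Pr(C \mid T)$, since $\Pr(Q \mid T)$ is the same for every candidate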

Error Model

Modeling Typographical Errors: $\Pr(Q \mid C, T)$
  The more similar, the more likely
  Similarity measured by Edit Distance

$$\Pr(q \mid w) = \frac{1}{z} \cdot \exp(-\beta \cdot ed(q, w))$$

Independence Assumption

$$\Pr(Q \mid C, T) = \Pr(Q \mid C) = \prod_{1 \le j \le l} \Pr(Q[j] \mid C[j])$$

[Figure: words around "minning" grouped by edit distance. ed=1: mining, manning; ed=2: linking, finding, running, binding]
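A minimal sketch of the error model, assuming β = 1 and treating the normaliser z as a caller-supplied constant (the slides do not say how z is computed; a constant shared by all candidates does not change the ranking). The function names are illustrative.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=None)
def ed(a: str, b: str) -> int:
    """Levenshtein distance, written recursively with memoisation."""
    if not a or not b:
        return len(a) + len(b)
    return min(ed(a[1:], b) + 1,                       # delete from a
               ed(a, b[1:]) + 1,                       # insert into a
               ed(a[1:], b[1:]) + (a[0] != b[0]))      # substitute (free if equal)

def pr_keyword(q, w, beta=1.0, z=1.0):
    """Pr(q | w) = exp(-beta * ed(q, w)) / z; beta and z are assumed values."""
    return math.exp(-beta * ed(q, w)) / z

def pr_query(query, candidate, beta=1.0, z=1.0):
    """Pr(Q | C) as a product over keyword positions (independence assumption)."""
    prob = 1.0
    for q, w in zip(query, candidate):
        prob *= pr_keyword(q, w, beta, z)
    return prob

query = ("jiawi", "minning")
for candidate in [("jiawei", "mining"), ("jian", "manning")]:
    print(candidate, pr_query(query, candidate))
```

Under these assumptions "jiawei mining" (total edit distance 2) receives a higher error-model probability than "jian manning" (total edit distance 3).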

Query Likelihood Model

Modeling Query Generation Probability: $\Pr(C \mid T)$
  A good query finds good results
  $T$ is a set of disjoint entities (sub-trees)
  Measure the query likelihood on each entity
  Aggregate through all entities

$$\Pr(C \mid T) = \sum_{r \in \text{entities}} \Pr(C \mid r) \cdot \Pr(r \mid T)$$

  $\Pr(r \mid T)$: Entity Prior (assume uniform)

[Figure: the example DBLP tree $T$ divided into three entities r1, r2 and r3]
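The aggregation over entities is a weighted sum, which the following sketch spells out. The per-entity likelihood values are made up for illustration, and the function name `query_likelihood` is an assumption.

```python
def query_likelihood(entity_likelihoods, prior=None):
    """Pr(C | T) = sum over entities r of Pr(C | r) * Pr(r | T).

    entity_likelihoods maps an entity id to Pr(C | r).
    prior maps an entity id to Pr(r | T); if omitted, a uniform prior is used.
    """
    n = len(entity_likelihoods)
    total = 0.0
    for r, pr_c_given_r in entity_likelihoods.items():
        pr_r = prior[r] if prior is not None else 1.0 / n
        total += pr_c_given_r * pr_r
    return total

# Hypothetical likelihoods of one candidate query on three entities.
likelihoods = {"r1": 0.02, "r2": 0.0, "r3": 0.0}
print(query_likelihood(likelihoods))
```

With a uniform prior the result is simply the average of the per-entity likelihoods.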

Language Modeling

Modeling query likelihood on entities: $\Pr(C \mid r)$
  Extract text in the sub-tree
  Build a Language Model

[Figure: entity r1, a paper under DBLP with title "Mining concept drifting data ……", author "Jiawei Han" and booktitle "Data mining and knowledge discovery"]

Word       Freq  Pr(w)
mining     2     0.2
data       2     0.2
jiawei     1     0.1
concept    1     0.1
drifting   1     0.1
han        1     0.1
knowledge  1     0.1
discovery  1     0.1

$$\Pr(C \mid r_1) = \Pr(\text{jiawei} \mid r_1) \cdot \Pr(\text{mining} \mid r_1) = 0.1 \times 0.2 = 0.02$$

Smoothing is used to avoid zero probability
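The word table for r1 can be reproduced with a unigram language model. The sketch below is an assumption-laden illustration: the word list omits "and" so that the counts match the table, and the interpolation with a background model is one common smoothing choice, not necessarily the one used by XClean (the slides do not name a method).

```python
from collections import Counter

def unigram_model(words):
    """Maximum-likelihood unigram model: Pr(w) = freq(w) / total words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def candidate_likelihood(candidate, entity_model, background=None, lam=0.0):
    """Pr(C | r) as a product of per-word probabilities.

    If a background model is given, each word probability is interpolated
    with it using weight lam (an assumed smoothing scheme).
    """
    prob = 1.0
    for w in candidate:
        p = entity_model.get(w, 0.0)
        if background is not None:
            p = (1 - lam) * p + lam * background.get(w, 0.0)
        prob *= p
    return prob

# Text of entity r1, with the word "and" left out to match the slide's table.
r1_words = ("mining concept drifting data jiawei han "
            "data mining knowledge discovery").split()
r1_model = unigram_model(r1_words)

print(candidate_likelihood(("jiawei", "mining"), r1_model))   # about 0.02
print(candidate_likelihood(("jiawei", "manning"), r1_model))  # 0.0 without smoothing

# Hypothetical background probabilities, for illustration only.
background = {"jiawei": 0.01, "manning": 0.01}
print(candidate_likelihood(("jiawei", "manning"), r1_model, background, lam=0.1))
```

Without smoothing, any candidate containing a word absent from the entity gets probability zero; the interpolated version keeps it small but positive.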

Finding the entities

How to find the entities
  Each entity is a potential search result
  Different semantics can be applied
    SLCA, ELCA, etc.
    Specific Return Type
      One for each query
      Popular type
      But not too deep

Example: p = /DBLP/paper

[Figure: the example DBLP tree with paper, paper and book sub-trees]

Summary: Ranking Framework

$$\Pr(Q \mid C, T) \cdot \sum_{r} \Pr(C \mid r)\,\Pr(r \mid T)$$

  $\Pr(Q \mid C, T)$: Error Model
  $\Pr(C \mid r)$: Query likelihood on each entity
  $\Pr(r \mid T)$: Entity Prior

Algorithm

Naïve Algorithm
  Enumerate all possible candidate queries
  Find the entities and compute the score for each candidate query

Problems:
  Multiple passes of data
  Not all candidates are needed

[Figure: the example DBLP tree with paper, paper and book sub-trees and leaf values Jiawei, Jian and Manning]

Candidate queries:
  1. Jiawei mining
  2. Jian mining
  3. Jiawei Manning
  4. Jian Manning
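A rough sketch of the naïve algorithm makes both problems visible. The `entities` dictionary (entity id to the words in its sub-tree) is read off the Dewey-labeled example tree, and the constant `score` function is a placeholder standing in for the full ranking formula; both are assumptions for illustration.

```python
from itertools import product

def naive_clean(confusion_sets, entities, score):
    """Enumerate every candidate query and scan all entities for each one."""
    scores = {}
    passes = 0
    for candidate in product(*confusion_sets):
        passes += 1  # one full scan of the data per candidate
        hits = [words for words in entities.values()
                if all(w in words for w in candidate)]
        if hits:  # candidates with no matching entity have empty results
            scores[candidate] = sum(score(candidate, words) for words in hits)
    return scores, passes

# Words found under each entity of the toy tree (hypothetical encoding).
entities = {
    "1.1": {"jiawei", "jian"},
    "1.2": {"mining", "jiawei"},
    "1.3": {"jian", "manning"},
}
confusion_sets = [["jiawei", "jian"], ["mining", "manning"]]

scores, passes = naive_clean(confusion_sets, entities, lambda c, words: 1.0)
print(passes, scores)
```

On this toy data the data is scanned four times, yet only "jiawei mining" and "jian manning" have any matching entity, so half of the enumerated candidates were not needed.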

XClean Example

[Figure: Dewey-labeled example tree rooted at DBLP (node 1) with sub-trees 1.1 (paper), 1.2 (paper) and 1.3 (book), and one inverted list per keyword variant, scanned with pointers p1 to p4]

Inverted lists
  jiawei: 1.1.1.1.1, 1.2.2.1
  jian: 1.1.1.2.1, 1.3.1.1.1
  mining: 1.2.1.1
  manning: 1.3.2.1.1

XClean Example

[Figure: the same Dewey-labeled tree and inverted lists for jiawei, jian, mining and manning, with pointers p1 to p4]

“Jiawei mining” is generated
“Jian mining” is skipped

XClean Example

[Figure: the same Dewey-labeled tree and inverted lists, with pointers p2 and p4 marked on the tree]

“jian manning” is generated

Experiment Settings

Algorithms
  XClean
  PY08: Pu and Yu [PVLDB08]
  SE1: Search Engine 1
  SE2: Search Engine 2

Measures
  Mean Reciprocal Rank
  Precision@N
  Time

Experiment Settings

Datasets

Dataset  Size (MB)  #nodes  Max depth  Avg depth  Queries
INEX     5,878      52M     50         5.58       285
DBLP     526        12M     7          3.8        49

Queries
  Clean: original clean queries (INEX: 285, DBLP: 49)
  Random: random edit operations on each keyword
  Rule: replace each word with a common misspelling

Experiment Results

Mean Reciprocal Rank (MRR)

$$\mathrm{MRR} = \frac{1}{N} \sum_{1 \le i \le N} \frac{1}{rank(Q_i)}$$

Experiment Results

Precision@N
  Percentage of queries for which the correct suggestion is in the top-N suggestions
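Both effectiveness measures are easy to compute from the rank at which the correct suggestion appears for each query. The sketch below assumes that a query whose correct suggestion never appears is recorded as `None` and contributes zero to both measures; the sample ranks are made up.

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/N) * sum of 1/rank(Q_i); missing suggestions contribute 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def precision_at_n(ranks, n):
    """Fraction of queries whose correct suggestion is within the top n."""
    return sum(1 for r in ranks if r is not None and r <= n) / len(ranks)

# Hypothetical 1-based ranks of the correct suggestion for four queries.
ranks = [1, 2, None, 1]
print(mean_reciprocal_rank(ranks))  # (1 + 0.5 + 0 + 1) / 4
print(precision_at_n(ranks, 1))
print(precision_at_n(ranks, 2))
```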


Experiment Results

Time
  Query processing time

Conclusion

Contributions
  A probabilistic framework for keyword query cleaning on XML databases
  An Error Model based on edit distance
  A Query Likelihood Model that exploits XML tree structures and keyword search semantics

Future work
  Concatenation/Splitting of words
  Cognitive Errors

Thank you! Questions?

XClean Algorithm

1) Find variants $w_{ij}$ for each query keyword $q_i$, and compute the error probability $\Pr(q_i \mid w_{ij})$

2) Retrieve the XML nodes containing each variant through an inverted index

3) The nodes of all variants of $q_i$ form a virtual list

4) Find the entity nodes that have at least one child node from each virtual list
   a) Compute the $\Pr(C \mid r)$ for each candidate query $C$ found in each entity $r$
   b) Accumulate the scores in a global hash table

5) Output top-k candidate queries
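The five steps can be sketched compactly in Python. This is a simplified illustration, not the authors' implementation: it groups postings by entity with a hash map instead of merging sorted lists with skipping, it assumes entities are the Dewey prefixes of a fixed depth, and it takes the error probabilities and the per-entity score as inputs. All names here are hypothetical.

```python
import heapq
import math
from collections import defaultdict
from itertools import product

def xclean_sketch(variants, inverted_index, entity_depth, entity_score, k):
    """variants: one dict per query keyword, mapping variant -> Pr(q_i | w_ij)."""
    # Steps 2-3: for each entity, collect which variants of keyword i occur in it.
    per_entity = defaultdict(lambda: [set() for _ in variants])
    for i, keyword_variants in enumerate(variants):
        for w in keyword_variants:
            for dewey in inverted_index.get(w, []):
                entity = ".".join(dewey.split(".")[:entity_depth])
                per_entity[entity][i].add(w)

    # Step 4: keep entities covering every keyword and accumulate scores.
    scores = defaultdict(float)
    for entity, found in per_entity.items():
        if not all(found):
            continue  # some keyword has no variant inside this entity
        for candidate in product(*found):
            error = 1.0
            for i, w in enumerate(candidate):
                error *= variants[i][w]
            scores[candidate] += error * entity_score(candidate, entity)

    # Step 5: output the top-k candidate queries.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Inverted lists from the XClean example; error probabilities assume beta = 1, z = 1.
inverted_index = {
    "jiawei": ["1.1.1.1.1", "1.2.2.1"],
    "jian": ["1.1.1.2.1", "1.3.1.1.1"],
    "mining": ["1.2.1.1"],
    "manning": ["1.3.2.1.1"],
}
variants = [
    {"jiawei": math.exp(-1), "jian": math.exp(-2)},
    {"mining": math.exp(-1), "manning": math.exp(-1)},
]
# Placeholder for Pr(C | r) * Pr(r | T): a constant, for illustration only.
top = xclean_sketch(variants, inverted_index, entity_depth=2,
                    entity_score=lambda c, r: 1.0, k=2)
for candidate, score in top:
    print(" ".join(candidate), score)
```

Only candidates that actually occur together inside some entity are ever scored, so "jian mining" and "jiawei manning" are never generated, matching the behaviour described in the XClean example.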
