XML data management and approximate string matching
Jiaheng Lu
Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China
November 22, 2010
Presentation at Telecom ParisTech
Mar 27, 2015
Research experience
Associate Professor, Renmin University of China: XML data management, cloud data management, approximate search
Post-doc, University of California, Irvine: data integration, approximate string matching
PhD, National University of Singapore: XML data management
Outline
XML data management
  XML twig query processing
  XML keyword search
  Graphical and interactive XML query processing
Approximate string matching
  Approximate string search
  Approximate member extraction
XML twig query processing
XPath: Section[Title]/Paragraph//Figure
Twig pattern: Section has the children Title and Paragraph; Paragraph has the descendant Figure
XML twig query processing (Cont.) Problem Statement
Given a query twig pattern Q, and an XML database D, we need to compute ALL the answers to Q in D.
E.g., consider the following query and document:
Query: twig with root Section and branches title and figure
Document elements: s1, s2, p1, t1, t2, f1
Query solutions: (s1, t1, f1), (s2, t2, f1), (s1, t2, f1)
Previous work: TwigStack
TwigStack [1] is a holistic algorithm for XML twig matching based on the containment labeling scheme. TwigStack has two steps:
(1) intermediate path solutions are output to match each root-to-leaf path of the query; and
(2) these intermediate path solutions are merged to get the final results.
[1] N. Bruno, D. Srivastava, and N. Koudas. Holistic twig joins: optimal XML pattern matching. In Proceedings of ACM SIGMOD, 2002.
Running example: TwigStack algorithm
Query: s with branches t and f (paths s//t and s//f)
Data streams (containment labels):
s: (1,12,1), (4,11,2)
t: (2,3,2), (5,6,3)
f: (8,9,4)
Output path intermediate solutions:
s//t: (1,12,1) (2,3,2); (1,12,1) (5,6,3); (4,11,2) (5,6,3)
s//f: (1,12,1) (8,9,4); (4,11,2) (8,9,4)
Final results:
(1,12,1) (2,3,2) (8,9,4)
(1,12,1) (5,6,3) (8,9,4)
(4,11,2) (5,6,3) (8,9,4)
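The two steps of the running example can be reproduced with a small sketch. It is not TwigStack itself: the stacks are replaced by a naive nested loop, and the labels are assumed to be (start, end, level) triples, where an ancestor's interval strictly contains the descendant's. The function names are illustrative only.

```python
from collections import defaultdict

# Containment labels (start, end, level) from the running example.
S = [(1, 12, 1), (4, 11, 2)]
T = [(2, 3, 2), (5, 6, 3)]
F = [(8, 9, 4)]

def is_ancestor(a, d):
    """a is an ancestor of d iff a's interval strictly contains d's."""
    return a[0] < d[0] and d[1] < a[1]

def path_solutions(ancestors, descendants):
    """Step 1, done naively: all (ancestor, descendant) pairs of one query path."""
    return [(a, d) for d in descendants for a in ancestors if is_ancestor(a, d)]

def merge(st, sf):
    """Step 2: join the path solutions that share the same s element."""
    by_s = defaultdict(list)
    for s, f in sf:
        by_s[s].append(f)
    return [(s, t, f) for s, t in st for f in by_s[s]]

st = path_solutions(S, T)   # solutions of s//t
sf = path_solutions(S, F)   # solutions of s//f
results = merge(st, sf)
for r in results:
    print(r)
```

The nested loop makes the two-phase structure visible; avoiding useless intermediate pairs is exactly what the holistic stack-based processing is for.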
Limitations of TwigStack
(1) TwigStack may output many useless intermediate results for queries with parent-child relationships
(2) TwigStack cannot process XML twig queries with ordered predicates, like "preceding" and "following" in XPath
(3) TwigStack cannot answer queries with wildcards in branching nodes.
E.g., a twig whose branching node is the wildcard * with branches B and C: the parent of B should be an ancestor of C
XML twig query processing (Cont.)
Several efficient pattern matching algorithms:
TJFast (VLDB 05) (citation: 173)
iTwigJoin (SIGMOD 05)
TwigStackList (CIKM 04)
TreeMatch (TKDE 10)
Motivation: new labeling scheme
TwigStackList and iTwigJoin are both based on the containment labeling scheme
Why not try the Dewey labeling scheme for XML twig pattern queries?
Oh, it is really a novel idea!
Original Dewey Labeling Scheme
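In the Dewey labeling scheme, each element is represented by an integer sequence:
(i) the root is labeled by the empty string ε
(ii) for a non-root element u, label(u) = label(s).x, where u is the x-th child of s.
For example:
[Figure: document tree with elements s1, s2, t1, t2, f1, f2; the root is labeled ε, its children are labeled 1, 2, 3, and the children of node 2 are labeled 2.1 and 2.2]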
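The labeling rule can be sketched in a few lines. The tree shape and the node names below are hypothetical (the exact shape of the slide's tree is not recoverable here); labels are tuples, with the empty tuple standing for ε.

```python
def dewey_labels(tree, label=()):
    """Yield (name, label) pairs; tree is (name, [children]).
    The root gets the empty label; the x-th child of s gets label(s) + (x,)."""
    name, children = tree
    yield name, label
    for x, child in enumerate(children, start=1):
        yield from dewey_labels(child, label + (x,))

def is_ancestor(a, d):
    """In Dewey labeling, a is an ancestor of d iff a is a proper prefix of d."""
    return len(a) < len(d) and d[:len(a)] == a

def show(label):
    return ".".join(map(str, label)) or "ε"

# Hypothetical tree: a root with three children, the second having two children.
doc = ("root", [("x", []), ("y", [("y1", []), ("y2", [])]), ("z", [])])
labels = dict(dewey_labels(doc))
for name, lab in labels.items():
    print(name, show(lab))
```

The prefix test is what makes ancestor labels derivable from a single label; tag names, however, are not recoverable from original Dewey labels.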
Main problem of the original Dewey
If we use the original Dewey labeling scheme to answer a twig query, we need to read the labels for all query nodes. Thus, this is not a better solution than previous algorithms.
Our idea: extend the original Dewey labeling scheme so that, given the label of any element e, we can derive the path of e from this label alone
Modular function
We need some schema information: a DTD (Document Type Definition) or an XML Schema
Given the DTD rule: book → author, title, chapter*
Our solution: using a modular function, we create a mapping between an element tag and an integer.
We define X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2, where X_t is the last integer of the label of an element with tag t.
[Figure: book (ε) with children author (0), title (1), chapter (2), chapter (5)]
Why is the second chapter not labeled 3, as in the original Dewey? Because 3 mod 3 = 0 would denote author; 5 is the next integer with remainder 2.
The modulus 3 is the number of distinct tags under book.
Derive element tag
From a label, we can derive its tag name.
book → author, title, chapter*
Recall that we define X_author mod 3 = 0, X_title mod 3 = 1, X_chapter mod 3 = 2.
[Figure: book (ε) with children labeled 0, 1, 2, 5; the tags are derived from the labels: 0 → author, 1 → title, 2 → chapter, 5 → chapter]
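More examples for assigning labels
Let us consider a more complicated DTD: a → (b | c)*, d?, c+
We define X_b mod 3 = 0, X_c mod 3 = 1, X_d mod 3 = 2
(Why do we use mod 3 instead of 4? Because there are only three distinct tags under a; c appears twice in the rule.)
[Figure: a (ε) with children b (0), d (2), c (4), c (7)]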
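A minimal sketch of the label assignment for the children of one element, under an assumption that is consistent with both examples above but not stated on the slides: each child receives the smallest integer larger than its left sibling's integer with the required remainder (the first child simply receives its remainder). The function names are illustrative.

```python
def next_component(prev, k, n):
    """Smallest x > prev with x % n == k; prev is None for the first child."""
    if prev is None:
        return k
    x = prev + 1
    return x + (k - x) % n  # Python's % is non-negative for a positive modulus

def label_children(child_tags, tag_order):
    """Last label components for a sequence of sibling tags.
    tag_order lists the distinct child tags of the parent in DTD order."""
    n = len(tag_order)
    prev, out = None, []
    for tag in child_tags:
        prev = next_component(prev, tag_order.index(tag), n)
        out.append(prev)
    return out

# book -> author, title, chapter*
print(label_children(["author", "title", "chapter", "chapter"],
                     ["author", "title", "chapter"]))
# a -> (b | c)*, d?, c+   with b = 0, c = 1, d = 2
print(label_children(["b", "d", "c", "c"], ["b", "c", "d"]))
```

The two calls reproduce the integers shown in the figures (0, 1, 2, 5 and 0, 2, 4, 7).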
Derive the path from a label
By following a finite state transducer (FST), we may recursively derive the whole path from any extended Dewey label.
For example, DTD:
book → author, title, chapter*
chapter → (paragraph | section)*
section → (paragraph | section)*
FST transitions:
from book: mod 3 = 0 → author, mod 3 = 1 → title, mod 3 = 2 → chapter
from chapter: mod 2 = 0 → paragraph, mod 2 = 1 → section
from section: mod 2 = 0 → paragraph, mod 2 = 1 → section
[Figure: document tree rooted at book, with children author, title, chapter; below chapter, a section with children paragraph and section]
Question: given the label 5.1.0, what is the corresponding path?
Derive the path from a label (cont.)
Following the FST from book: 5 mod 3 = 2 gives chapter, 1 mod 2 = 1 gives section, and 0 mod 2 = 0 gives paragraph.
So 5.1.0 denotes: book/chapter/section/paragraph
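The derivation can be sketched as a table-driven walk. The dictionary below encodes the FST of the example DTD (child tags listed in DTD order, so the position of a tag is its remainder); the function name `derive_path` is illustrative.

```python
# FST of the example DTD: state -> child tags in DTD order.
FST = {
    "book": ["author", "title", "chapter"],
    "chapter": ["paragraph", "section"],
    "section": ["paragraph", "section"],
}

def derive_path(label, root="book"):
    """Derive the root-to-element tag path from an extended Dewey label.
    label holds the integer components below the root (the root itself is ε)."""
    path = [root]
    for x in label:
        tags = FST[path[-1]]          # KeyError if the tag has no child elements
        path.append(tags[x % len(tags)])
    return "/".join(path)

print(derive_path((5, 1, 0)))
```

Each component is reduced modulo the number of distinct child tags of the current state, which is why the label alone suffices once the DTD is known.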
Two properties of extended Dewey
Find ancestor label: from the label of any element, we can derive the labels of all its ancestors.
Find ancestor name: from the label of any element, we can derive the tag names of all its ancestors.
These two properties enable us to design a new and efficient algorithm for XML twig pattern matching.
A new algorithm: TJFast
For each leaf node n in the query, there exists a corresponding input stream Tn.
Tn contains the extended Dewey labels of the elements of tag n. Those labels are arranged in document order.
For each branching node b of the twig pattern, there is a corresponding set Sb, which contains the elements that possibly contribute to query answers. (What is the difference compared to TwigStackList?)
At any point of the computation, the size of the set Sb is bounded by the depth of the XML document.
An example for TJFast algorithm
Query: A with descendant D and child B; B has descendant C (root-to-leaf paths A//D and A/B//C)
DTD:
a -> a*, d*, b*
b -> d*, c*
d -> c*
Document (extended Dewey labels): a1 = 0 (root), a2 = 0.0, d1 = 0.0.1, a3 = 0.3, d2 = 0.3.1, b1 = 0.3.2, c1 = 0.3.2.1, b2 = 0.5, d3 = 0.5.0, c2 = 0.5.0.0
TD: 0.0.1, 0.3.1, 0.5.0
TC: 0.3.2.1, 0.5.0.0
Why are there only two streams?
Set for the branching node A: { }
An example for TJFast algorithm (cont.)
By the finite state transducer of the extended Dewey labeling scheme:
0.0.1 derives a1/a2/d1
0.3.2.1 derives a1/a3/b1/c1
Set for the branching node A: { }
An example for TJFast algorithm (cont.)
Both a1 and a3 are possibly involved in query answers. (Why not a2?)
Set for the branching node A: { }
An example for TJFast algorithm (cont.)
Then we insert a1 and a3 into the set: {a1, a3}
Output path solutions:
A//D: (a1, d1)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm (cont.)
Move the cursor of TD from d1 to d2
Set for the branching node A: {a1, a3}
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm (cont.)
Move the cursor of stream TD from d2 to d3
Set for the branching node A: {a1, a3}
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1)
An example for TJFast algorithm (cont.)
Move the cursor of stream TC from c1 to c2
Set for the branching node A: {a1, a3}
Output path solutions:
A//D: (a1, d1), (a1, d2), (a3, d2), (a1, d3)
A/B//C: (a3, b1, c1), (a1, b2, c2)
Sort and merge-join in TJFast
Phase 1. Intermediate paths
A//D: <a1, d1>, <a1, d2>, <a1, d3>, <a3, d2>
A/B//C: <a1, b2, c2>, <a3, b1, c1>
Phase 2. Final solutions (join), as <A, D, B, C>:
<a1, d1, b2, c2>, <a1, d2, b2, c2>, <a1, d3, b2, c2>, <a3, d2, b1, c1>
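The whole example can be replayed with a simplified sketch. It is not the TJFast algorithm: it reads only the two leaf streams and decodes each label with the FST, as TJFast does, but it matches every root-to-leaf path independently and does not maintain the set for the branching node. As a consequence it also emits the path solution (a2, d1), which is then dropped by the join. The `NAMES` table only maps labels back to the element names used on the slides; all function names are illustrative.

```python
from collections import defaultdict

# DTD of the example: a -> a*, d*, b* ; b -> d*, c* ; d -> c*
CHILD_TAGS = {"a": ["a", "d", "b"], "b": ["d", "c"], "d": ["c"]}
NAMES = {(0,): "a1", (0, 0): "a2", (0, 0, 1): "d1", (0, 3): "a3",
         (0, 3, 1): "d2", (0, 3, 2): "b1", (0, 3, 2, 1): "c1",
         (0, 5): "b2", (0, 5, 0): "d3", (0, 5, 0, 0): "c2"}
T_D = [(0, 0, 1), (0, 3, 1), (0, 5, 0)]   # stream of D labels
T_C = [(0, 3, 2, 1), (0, 5, 0, 0)]        # stream of C labels

def decode(label):
    """[(tag, label prefix), ...] from the root (tag a, label 0) to the element."""
    tag = "a"
    path = [(tag, label[:1])]
    for i in range(1, len(label)):
        tags = CHILD_TAGS[tag]
        tag = tags[label[i] % len(tags)]
        path.append((tag, label[:i + 1]))
    return path

def match(path, steps):
    """All embeddings of the query path steps [(axis, tag), ...] into a decoded
    path; the last step must bind the last element of the path."""
    def bind(si, pi):
        tag, label = path[pi]
        if tag != steps[si][1]:
            return []
        if si == 0:
            return [(NAMES[label],)]
        prev = [pi - 1] if steps[si][0] == "/" else range(pi)
        return [m + (NAMES[label],) for p in prev if p >= 0 for m in bind(si - 1, p)]
    return bind(len(steps) - 1, len(path) - 1)

# Phase 1: path solutions for A//D and A/B//C.
path_D = [m for lab in T_D for m in match(decode(lab), [("", "a"), ("//", "d")])]
path_C = [m for lab in T_C
          for m in match(decode(lab), [("", "a"), ("/", "b"), ("//", "c")])]

# Phase 2: join on the branching node A.
by_a = defaultdict(list)
for a, b, c in path_C:
    by_a[a].append((b, c))
final = [(a, d, b, c) for a, d in path_D for b, c in by_a[a]]
for f in final:
    print(f)
```

Keeping the set for the branching node, as TJFast does, avoids producing pairs such as (a2, d1) in the first place, because a2 has no child b.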
TJFast+L
Applying the extended Dewey labeling scheme to the tag+level streaming scheme, we propose the TJFast+L algorithm as an extension of TJFast
Two benefits of TJFast+L over TJFast:
reduces I/O cost by reading fewer elements
enlarges the optimal query classes
Optimal query classes
[Figure: example twigs for two query classes, "only P-C in all edges" and "only A-D in branching edges", with regions marking the optimal class of TJFast and the optimal class of TJFast+L]
XML twig query processing
Jiaheng Lu, Ting Chen, Tok Wang Ling: Efficient processing of XML twig patterns with parent child edges: a look-ahead approach. CIKM 2004:533-542
Jiaheng Lu, Tok Wang Ling, Chee Yong Chan, Ting Chen: From Region Encoding To Extended Dewey: On Efficient Processing of XML Twig Pattern Matching. VLDB 2005:193-204
Jiaheng Lu, Tok Wang Ling: Labeling and Querying Dynamic XML Trees. APWeb 2004:180-189
Jiaheng Lu, Ting Chen, Tok Wang Ling: TJFast: effective processing of XML twig pattern matching. WWW (Special interest tracks and posters) 2005:1118-1119
Jiaheng Lu, Tok Wang Ling, Tian Yu, Changqing Li, Wei Ni: Efficient Processing of Ordered XML Twig Pattern. DEXA 2005:300-309
Jiaheng Lu: Benchmarking Holistic Approaches to XML Tree Pattern Query Processing - (Extended Abstract of Invited Talk). DASFAA Workshops 2010:170-178
Tian Yu, Tok Wang Ling, Jiaheng Lu: TwigStackList-: A Holistic Twig Join Algorithm for Twig Query with Not-Predicates on XML Data. DASFAA 2006:249-263
Zhifeng Bao, Tok Wang Ling, Jiaheng Lu, Bo Chen: SemanticTwig: A Semantic Approach to Optimize XML Query Processing. DASFAA 2008:282-298
Ting Chen, Jiaheng Lu, Tok Wang Ling: On Boosting Holism in XML Twig Pattern Matching using Structural Indexing Techniques. SIGMOD 2005:455-466
……
Outline
XML data management
  XML twig query processing
  XML keyword search
  Graphical and interactive XML query processing
Background: XQuery vs. keyword search
Query: papers by "Mike"
XQuery (complicated): for $a in doc("bib.xml")//author, $n in $a/name where $n = "Mike" return $a//inproceedings
Keyword search: Mike, inproceedings
The proposed keyword search returns the set of smallest trees containing all keywords.
[Figure: bib document tree with two author subtrees. Each author has name, publications (inproceedings and article entries, each with title and year), and hobby. Titles shown: "Base line of XML key", "Information Retrieval", "Data Mining", "Keyword Search in XML"; years shown: 2002, 2002, 2009, 2007. Example keyword queries: {Mike, hobby}, {article, 2009}, {Paper}]
XML keyword search
– Search intention identification
– Query result retrieval
– Result ranking
– Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
– Detailed paper: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528 (one of the best papers, invited to the TKDE journal)
XML keyword search
Inspired by IR-style keyword search on the web
Enables users to access information in XML databases
XML data is modeled as a rooted, labeled tree
Recent research efforts: efficiency and effectiveness
Effectiveness
Capture the user's search intention
  Identify the target that the user intends to search for
  Infer the predicate constraints that the user intends to search via
Result ranking
  Rank the query results according to their objective relevance to the user's search intention
State of the Art
Search semantics design
LCA (Lowest Common Ancestor): node v is an LCA of the keyword set K = {w1, w2, …, wk} if the sub-tree rooted at v contains at least one occurrence of every keyword in K, after excluding the sub-elements that already contain all keywords in K
SLCA (Smallest LCA): node v is an SLCA of the keyword set K = {w1, w2, …, wk} if (1) v is an LCA of K and (2) no proper descendant of v is an LCA of K
XSeek: infers the search intention based on the concept of objects and an analysis of the matching between keywords and data nodes
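The SLCA semantics can be illustrated with a brute-force sketch over Dewey labels (tuples). It enumerates one match per keyword, takes the longest common prefix as the lowest common ancestor, and keeps the results that have no proper descendant among them. It is meant to show the definition only; it is not one of the efficient algorithms named on these slides, and the labels in the example are hypothetical.

```python
from itertools import product

def lca(labels):
    """Lowest common ancestor = longest common prefix of the Dewey labels."""
    prefix = labels[0]
    for lab in labels[1:]:
        n = 0
        while n < min(len(prefix), len(lab)) and prefix[n] == lab[n]:
            n += 1
        prefix = prefix[:n]
    return prefix

def slca(matches):
    """matches: one list of Dewey labels per keyword (nodes containing it)."""
    lcas = {lca(combo) for combo in product(*matches)}

    def has_descendant(v):
        return any(len(u) > len(v) and u[:len(v)] == v for u in lcas)

    return sorted(v for v in lcas if not has_descendant(v))

# Hypothetical labels: "Mike" occurs at 1.1; "hobby" occurs at 1.3 and 2.3.
result = slca([[(1, 1)], [(1, 3), (2, 3)]])
print(result)
```

Here the root (empty label) also contains both keywords, but it is discarded because its descendant 1 already does.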
State of the Art (cont.)
Efficient result retrieval
  Designed based on a certain search semantics
  XKSearch, Multiway SLCA, etc.
Result ranking
  XRANK, XKSearch, EASE
  They only consider: structural compactness of matching results, keyword proximity, similarity at node level
Problems Unaddressed
The user's search intention is not adequately addressed!
Meaningfulness of query results: SLCA is less meaningful in many cases
Keyword ambiguity problems:
1. A keyword can appear both as an XML node type and as the text value of some other nodes
2. A keyword can appear in the text values of different XML node types and carry different meanings
Neither SLCA nor XSeek addresses keyword ambiguity well
Problems: Keyword Ambiguity
Q = "customer, interest, art"
Ambiguity 1: customer, interest; Ambiguity 2: art
Intention: find customers whose interest is art
Less relevant or irrelevant results are also returned: C1, C3, and B1's title
[Figure: storeDB document with customers and books. Customers: C1 (name "Mary Smith", interest "fashion", address with street "Art Street"), C2 (name "John Martin", interest "street art"), C3 (name "Art Smith", interest "rock music"), C4 (name "Rock Davis", interest "art"); customers also have purchases. Books: B1 (title "Art of Customer Interest Care") and B2, each with ID, title, authors, and publisher]
Problems: Keyword Ambiguity (cont.)
Q = "customer, interest, art"
"art" can be the value of an interest node (C2, C4), a name node (C3), a street node of a customer (C1), or a title node of a book (B1)
"customer" can be the tag name of a customer node, or (part of) the value of the title of B1
How to rank C1 to C4 and B1?
Objectives & Challenges
Challenges
I. How to decide which sub-tree(s) with appropriate node types can capture the information the user desires
II. How to return sub-trees of an appropriate size (i.e., containing enough but not overwhelming information)
III. How to rank those sub-trees by their relevance
Address the following as a single problem:
– Search intention identification
– Query result retrieval
– Result ranking
– Extend the original TF*IDF from text databases to XML databases, while capturing the hierarchical structure of XML data
Challenges
Difficulty in applying TF*IDF to XML
An XML database carries semantic information, while a text database contains pure text. XML TF*IDF must be aware of the underlying semantics.
All contents of XML data are stored in leaf nodes only
What is the analogy of a "flat document" in XML?
o Sub-trees are classified according to their prefix paths
The normalization factor is not simply the size of the sub-tree
o The structure of sub-trees may also affect the ranks
Our Approach
Extend IR-style keyword search techniques (like TF*IDF) from text databases to XML databases, capturing the hierarchical structure of the XML document by analyzing statistics of the underlying XML data
Major Contributions
1. Identify the user's desired search-for node and search-via node(s) in a heuristic way
   Define XML TF (term frequency) and XML DF (document frequency)
   Confidence formulas for search-for/search-via candidates
2. Define XML TF*IDF similarity
   Propose 3 guidelines specifically for XML keyword search
   Take keyword ambiguity problems into account
3. Design a keyword search engine, XReal
XML keyword search
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling: XReal: an interactive XML keyword searching. CIKM 2010:1933-1934
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Liang Xu, Huayu Wu: An Effective Object-Level XML Keyword Search. DASFAA 2010:93-109
Zhifeng Bao, Jiaheng Lu, Tok Wang Ling, Bo Chen: Towards an Effective XML Keyword Search. TKDE, 22(8):1077-1092 (2010)
Zhifeng Bao, Bo Chen, Tok Wang Ling, Jiaheng Lu: Demonstrating Effective Ranked XML Keyword Search with Meaningful Result Display. DASFAA 2009:750-754
Zhifeng Bao, Tok Wang Ling, Bo Chen, Jiaheng Lu: Effective XML Keyword Search with Relevance Oriented Ranking. ICDE 2009:517-528
Bo Chen, Jiaheng Lu, Tok Wang Ling: Exploiting ID References for Effective Keyword Search in XML Documents. DASFAA 2008:529-537
Jianjun Xu, Jiaheng Lu, Wei Wang, Baile Shi: Effective Keyword Search in XML Documents Based on MIU. DASFAA 2006:702-716
……
Outline
XML data management
  XML twig query processing
  XML keyword search
  Graphical and interactive XML query processing
Graphical and interactive XML search
Auto-completion XML search
Order-sensitive XML twig query
XML query suggestion
Demo online: http://datasearch.ruc.edu.cn:8080/LotusX/
Outline
XML data management
  XML twig query processing
  XML keyword search
  XML keyword refinement
  Graphical and interactive XML query processing
Approximate string matching
  Approximate string search
  Approximate member extraction
Motivation: Data Cleaning
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Real-world data is dirty:
Typos
Inconsistent representations (PO Box vs. P.O. Box)
Approximately check against a clean dictionary
[Figure: screenshot of the cited page containing a misspelled name that should clearly be "Niels Bohr"]
Motivation: Record Linkage
Table 1:
Name | Hobbies | Address
Brad Pitt | … | …
Forest Whittacker | … | …
George Bush | … | …
Angelina Jolie | … | …
Arnold Schwarzenegger | … | …
Table 2:
Phone | Age | Name
… | … | Brad Pitt
… | … | Arnold Schwarzeneger
… | … | George Bush
… | … | Angelina Jolie
… | … | Forrest Whittaker
We want to link records belonging to the same entity
No exact match!
The same entity may have similar representations:
Arnold Schwarzeneger versus Arnold Schwarzenegger
Forrest Whittaker versus Forest Whittacker
Motivation: Query Relaxation
http://www.google.com/jobs/britney.html
Errors in queries
Errors in data
Bring query and meaningful results closer together
Actual queries gathered by Google
What is Approximate String Search?
String collection (people):
Brad Pitt
Forest Whittacker
George Bush
Angelina Jolie
Arnold Schwarzeneger
…
Queries against the collection:
Find all entries similar to "Forrest Whitaker"
Find all entries similar to "Arnold Schwarzenegger"
Find all entries similar to "Brittany Spears"
What do we mean by "similar to"?
- Edit distance
- Jaccard similarity
- Cosine similarity
- Dice
- Etc.
The similar to predicate can help our described applications!
How can we support these types of queries efficiently?
Approximate Query Answering
Main idea: use q-grams as signatures for a string
Example: irvine has the 2-grams {ir, rv, vi, in, ne}, obtained with a sliding window
Intuition: similar strings share a certain number of grams
An inverted index on grams supports finding all data strings that share enough grams with a query
Approximate Query Example
Query: "irvine", edit distance 1
2-grams: {ir, rv, vi, in, ne}
Look up the grams in the inverted lists (lists of string IDs, one per gram: tf, vi, ir, ef, rv, ne, un, in, …)
Each edit operation can "destroy" at most q grams
Answers must share at least T = 5 - 1 * 2 = 3 grams
T-occurrence problem: find the elements occurring at least T = 3 times among the inverted lists. This is called list merging; T is called the merging threshold.
Candidates = {1, 5, 9}
The candidates may contain false positives, so the real similarity still needs to be computed
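The filter-and-verify idea can be sketched as follows. The string collection is hypothetical (it is not the collection behind the inverted lists on the slide), the list merging is done by plain counting rather than by an efficient merging algorithm, and the query is assumed to contain no repeated gram.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """Grams produced by sliding a window of length q over s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def edit_distance(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(strings, query, k, q=2):
    # Inverted index: gram -> set of string ids.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    grams = qgrams(query, q)          # assumed to contain no repeated gram
    T = len(grams) - k * q            # merging threshold
    if T <= 0:                        # the count filter cannot prune anything
        candidates = list(range(len(strings)))
    else:
        counts = Counter(sid for g in set(grams) for sid in index.get(g, ()))
        candidates = sorted(sid for sid, c in counts.items() if c >= T)
    # Verification step: remove the false positives.
    answers = [strings[sid] for sid in candidates
               if edit_distance(strings[sid], query) <= k]
    return T, [strings[sid] for sid in candidates], answers

collection = ["irvine", "irving", "ervine", "divine", "marvin", "boston"]
T, candidates, answers = search(collection, "irvine", k=1)
print(T, candidates, answers)
```

In this run "divine" and "marvin" share three grams with the query and therefore survive the count filter, but verification rejects them.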
Outline
XML data management
  XML twig query processing
  XML keyword search
  XML keyword refinement
  Graphical and interactive XML query processing
Approximate string matching
  Approximate string search
  Approximate member extraction
Introduction: An Example
We have a dictionary of strings we are interested in, e.g. product names or postal addresses
We want to locate their "approximate occurrences" in documents.
See the meaning of "approximate occurrence" in the following example:
Problem Definition
Given a dictionary R and a threshold δ, extract all proper substrings m from input documents S such that there exists r ∈ R with Similarity(r, m) ≥ δ (or Distance(r, m) ≤ k). Here we call r a piece of evidence for m.
Similarity() is a function measuring the similarity of two strings
Strings are viewed as sets of tokens (words)
An example for Sim(): Jaccard similarity:
$$J(r, m) = \frac{wt(r \cap m)}{wt(r \cup m)}$$
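A small sketch of the weighted Jaccard similarity and of the naive extraction that compares every substring against every dictionary string. The token weights are the ones used in the camera example later in this talk; giving weight 1 to unknown tokens, the `max_len` bound on substring length, and the sample document are assumptions made for illustration.

```python
# Token weights from the camera example; unknown tokens get weight 1 (assumption).
WT = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2, "slr": 2, "eos": 7, "5d": 9}

def wt(tokens):
    """Total weight of a set of tokens."""
    return sum(WT.get(t, 1) for t in tokens)

def jaccard(r, m):
    """Weighted Jaccard similarity of two strings viewed as token sets."""
    r_set, m_set = set(r.split()), set(m.split())
    return wt(r_set & m_set) / wt(r_set | m_set)

def extract(doc, dictionary, delta, max_len=5):
    """Naive extraction: verify every substring of up to max_len tokens
    against every dictionary string (no pre-pruning)."""
    toks = doc.split()
    found = []
    for i in range(len(toks)):
        for j in range(i + 1, min(len(toks), i + max_len) + 1):
            m = " ".join(toks[i:j])
            for r in dictionary:
                if jaccard(r, m) >= delta:
                    found.append((m, r))
    return found

R = ["canon eos 5d digital camera", "nikon digital slr camera", "canon slr camera"]
print(jaccard("canon slr camera", "canon camera"))
print(extract("buy a canon slr camera today", R, delta=0.8))
```

The nested loops make the cost of verifying everything visible, which is the motivation for a pruning stage before verification.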
Why pre-pruning is needed
We need evidence to decide whether a substring m should be extracted
Simple verification on all dictionary strings may be inefficient
Pre-pruning and post-verifying is beneficial
But should it be running-time-specific or filtering-power-specific? Less time or fewer survivors?
The issue of compromise comes again
A balance between the two stages should be reached:
More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time
Overall performance = Tf + Tv?
State-of-the-art techniques: K-signature scheme
Proposed by Chakrabarti et al. (SIGMOD 2008)
Choose several top-weighted tokens in a string as signatures to represent it: s => Sig(s)
Observation: if r cannot match m, r is likely to have insufficient signature overlap with m
K is a parameter for tuning the filtration power
Potential evidence loss: a counter-example was found for k = 3; we could only prove that the scheme works for k = 1 and k = ∞
State-of-the-art techniques: Inverted Signature-based Hashtable
Proposed by Chakrabarti et al. (SIGMOD 2008)
Each dictionary string is encoded into a solid 0-1 matrix, with a '1' for each occurrence of a <token, sig-token> tuple ('1'-rectangle)
Bitwise-OR all solid matrices to get the matrix of R
Observation: if m is an approximate member of R, the matrix of m must have enough intersections with that of R.
Formalized into an NP-complete problem; the solution causes too weak filtering power
Our proposed theorem
If Sim(m, r) ≥ δ, what do we have?
wt(Sig(m)∩Sig(r)) ≥ τ(m)
wt(Sig(m)∩Sig(r)) ≥ min{τ(m), τ(r)}
So the threshold does not remain constant and involves unknown evidence
Our solution: use inverted lists to count sig-token overlaps. Note that sig-tokens usually have low document frequency (e.g. with IDF as weights)
Signature-based Inverted Lists (SIL)
Lists are indexed by sig-tokens
Each sig-token of a string creates a node (containing the string's id) in the corresponding list.
E.g. R = {r1 = "canon eos 5d digital camera", r2 = "nikon digital slr camera", r3 = "canon slr camera"}, with wt(digital, camera, canon, nikon, slr, eos, 5d) = (1, 1, 2, 2, 2, 7, 9).
[Figure: inverted lists keyed by the sig-tokens 5d (9.0), canon (2.0), camera (1.0), eos (7.0), nikon (2.0), slr (2.0); each list holds nodes with the ids of the strings r1, r2, r3 that have this sig-token]
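A sketch of how such lists could be built and probed for the example dictionary. The signature rule used here (the k highest-weighted tokens, ties broken alphabetically) is an assumption for illustration and may differ from the rule behind the figure; the thresholds τ are not defined on these slides, so the sketch only accumulates the weighted overlap per dictionary string and leaves thresholding out.

```python
from collections import defaultdict

WT = {"digital": 1, "camera": 1, "canon": 2, "nikon": 2, "slr": 2, "eos": 7, "5d": 9}
R = {1: "canon eos 5d digital camera",
     2: "nikon digital slr camera",
     3: "canon slr camera"}

def signature(s, k):
    """Assumed rule: the k highest-weighted tokens, ties broken alphabetically."""
    return sorted(set(s.split()), key=lambda t: (-WT[t], t))[:k]

def build_sil(dictionary, k):
    """One list per sig-token; each list holds the ids of the strings having it."""
    sil = defaultdict(list)
    for rid in sorted(dictionary):
        for tok in signature(dictionary[rid], k):
            sil[tok].append(rid)
    return dict(sil)

def sig_overlap(sil, m):
    """Weighted sig-token overlap between substring m and each dictionary string."""
    overlap = defaultdict(int)
    for tok in set(m.split()):
        for rid in sil.get(tok, []):
            overlap[rid] += WT[tok]
    return dict(overlap)

sil = build_sil(R, k=3)
overlap = sig_overlap(sil, "canon slr camera")
print(sil)
print(overlap)
```

Only the lists of the tokens that occur in the substring are touched, so dictionary strings sharing no sig-token with it are never visited.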
Our algorithms and evaluations ——EvSCAN:Filtration by SIL
Approximate string matching
Jiaheng Lu, Jialong Han, Xiaofeng Meng: Efficient algorithms for approximate member extraction using signature-based inverted lists. CIKM 2009:315-324
Alexander Behm, Shengyue Ji, Chen Li, Jiaheng Lu: Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. ICDE 2009:604-615
Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008:257-266
Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du: Efficient Algorithm for Computing Link-Based Similarity in Real World Networks. ICDM 2009:734-739
……
Thank you
Q&A