Presented by: Aneeta Kolhe

Efficient Approximate Entity Extraction with Edit Distance Constraints

Jan 01, 2016

Transcript
Page 1: Efficient Approximate Entity Extraction with Edit Distance Constraints

Presented by: Aneeta Kolhe

Page 2: Efficient Approximate Entity Extraction with Edit Distance Constraints

• Named Entity Recognition finds approximate matches in text.

• Important task for information extraction and integration, text mining and also for web search.

Page 3: Efficient Approximate Entity Extraction with Edit Distance Constraints

Approximate dictionary matching.

Previous solution: token-based similarity constraints

Proposed solution: neighborhood generation method

Page 4: Efficient Approximate Entity Extraction with Edit Distance Constraints

The token-based solution uses Jaccard coefficient similarity.

It may miss some matches.

It may result in too many matches.

Page 5: Efficient Approximate Entity Extraction with Edit Distance Constraints

For example, given the entity “al-qaida”:

“al-qaeda” or “al-qa’ida” won’t be matched unless we use a low Jaccard similarity threshold of 0.33.

At that threshold, “al-qaeda” will also match “al gore” as well as “al pacino”.

Hence we use edit distance
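
The Jaccard figures above can be checked with a small sketch. The tokenizer here (split on anything that is not a letter, digit, or apostrophe) is an assumption made for illustration; the slides do not say how tokens are formed.

```python
import re

def tokens(s):
    """Lower-case tokens; hyphens and spaces act as separators (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", s.lower()))

def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

# Every candidate shares only the token "al" with the entity,
# so true variants and false matches get the same score.
for candidate in ["al-qaeda", "al-qa'ida", "al gore", "al pacino"]:
    print(candidate, round(jaccard("al-qaida", candidate), 2))
```

Because the variants and the unrelated names score identically, no Jaccard threshold can separate them in this example.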

Page 6: Efficient Approximate Entity Extraction with Edit Distance Constraints

Problem Definition:

Given: a document D, a dictionary E of entities, and an edit distance threshold τ.

To find: all substrings in D that are within edit distance τ of one of the entities in E.

Solution: Iterate through all the valid substrings of the document D, considering each substring as a query segment.

For each query segment, issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.
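
A minimal brute-force sketch of this formulation follows. It is only a reference baseline, not the method proposed in the slides; the function names and the choice of candidate lengths (entity length plus or minus tau) are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def naive_extract(doc, entities, tau):
    """Return (start, substring, entity) for every substring within tau edits."""
    results = set()
    for e in entities:
        # A match can differ in length from the entity by at most tau.
        for length in range(max(1, len(e) - tau), len(e) + tau + 1):
            for start in range(len(doc) - length + 1):
                seg = doc[start:start + length]
                if edit_distance(seg, e) <= tau:
                    results.add((start, seg, e))
    return results

print(naive_extract("the al-qaeda cell", ["al-qaida"], 1))
```

Every substring is verified against every entity, which is the cost the indexing scheme on the following pages is meant to avoid.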

Page 7: Efficient Approximate Entity Extraction with Edit Distance Constraints

If s is split into kτ partitions and ed(s, s’) ≤ τ, there is at least one partition with at most one edit error.

Select kτ = ⌈(τ + 1)/2⌉

Example: s = [ abcdefghijkl ], s’ = [ axxbcdefghxijkl ], τ = 3, kτ = 2

s = [ abcdef ], [ ghijkl ]

s’ = [ axxbcde ], [ fghxijkl ]
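
The partitioning step can be sketched as below. Two assumptions are made here: the partition count is rounded up, and when the length does not divide evenly the extra characters go to the trailing partitions, a choice made only because it reproduces the example above.

```python
from math import ceil

def num_partitions(tau):
    """k_tau: with this many partitions, tau errors cannot put 2+ errors in every one."""
    return ceil((tau + 1) / 2)

def partition(s, k):
    """Split s into k contiguous parts; trailing parts absorb the remainder (assumed)."""
    base, extra = divmod(len(s), k)
    parts, start = [], 0
    for i in range(k):
        end = start + base + (1 if i >= k - extra else 0)
        parts.append(s[start:end])
        start = end
    return parts

k = num_partitions(3)
print(partition("abcdefghijkl", k))
print(partition("axxbcdefghxijkl", k))
```

The guarantee is a pigeonhole argument: if every one of the k partitions contained two or more errors, the total would be at least 2k ≥ τ + 1, which exceeds τ.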

Page 8: Efficient Approximate Entity Extraction with Edit Distance Constraints

Shifting the first partition of s by 2 => [ cdefgh ]

Scaling it further by -1 => [ cdefg ]

Transformation rules:

First partition: we only need to consider scaling within the range [−2, 2].

Last partition: we only need to consider the combination of the same amount of shifting and scaling within the range [−τ, τ] (so that the last character is always included in the resulting substring).

For the rest of the partitions, we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].
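
The three rules can be sketched as a single function. The interpretation used here is an assumption: shifting moves the start of the partition, scaling changes its length, and for the last partition a shift of d is paired with a scale of -d so the end stays fixed. Variations that would fall outside the string are dropped.

```python
def transform_partition(s, start, end, pid, k, tau):
    """Variations of partition s[start:end] (0-based, end exclusive); pid is 1-based."""
    if pid == 1:
        # First partition: scaling only, within [-2, 2].
        moves = [(0, scale) for scale in range(-2, 3)]
    elif pid == k:
        # Last partition: shift by d and scale by -d, so the last character stays.
        moves = [(d, -d) for d in range(-tau, tau + 1)]
    else:
        # Intermediate partitions: every shift in [-tau, tau] with every scale in [-2, 2].
        moves = [(shift, scale)
                 for shift in range(-tau, tau + 1)
                 for scale in range(-2, 3)]
    variations = []
    for shift, scale in moves:
        lo, hi = start + shift, end + shift + scale
        if 0 <= lo < hi <= len(s):          # keep only spans inside s
            variations.append(s[lo:hi])
    return variations

s = "abcdefghijkl"
print(transform_partition(s, 0, 6, 1, 2, 3))    # first partition [abcdef]
print(transform_partition(s, 6, 12, 2, 2, 3))   # last partition [ghijkl]
```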

Page 9: Efficient Approximate Entity Extraction with Edit Distance Constraints

1st partition: 5 variations

intermediate partitions: 5*(2τ + 1) variations each

last partition: (2τ + 1) variations

Total number of 1-variants generated = O(mτ²).
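
As a quick arithmetic check, the per-partition counts can be summed for a given τ. This sketch assumes kτ = ⌈(τ + 1)/2⌉ and at least two partitions, and it ignores variations clipped at the string boundaries.

```python
from math import ceil

def count_partition_variations(tau):
    """5 (first) + 5*(2*tau+1) per intermediate partition + (2*tau+1) (last)."""
    k = ceil((tau + 1) / 2)
    if k < 2:
        raise ValueError("this count assumes at least two partitions (tau >= 2)")
    return 5 + 5 * (2 * tau + 1) * (k - 2) + (2 * tau + 1)

for tau in (2, 3, 5):
    print(tau, count_partition_variations(tau))
```

For τ = 3 there are no intermediate partitions, so the count is 5 + 7 = 12. Since kτ grows linearly with τ, the number of partition variations grows as O(τ²).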

Page 10: Efficient Approximate Entity Extraction with Edit Distance Constraints
Page 11: Efficient Approximate Entity Extraction with Edit Distance Constraints

s = [ abcdef ], [ ghijkl ]

s’ = [ axxbcde ], [ fghxijkl ]

Partition variations of s, as <variation, partition id> pairs:

<[ abcd ], 1> <[ abcde ], 1> <[ abcdef ], 1> <[ abcdefg ], 1> <[ abcdefgh ], 1>

<[ jkl ], 2> <[ ijkl ], 2> <[ hijkl ], 2> <[ ghijkl ], 2> <[ fghijkl ], 2> <[ efghijkl ], 2> <[ defghijkl ], 2>

The second partition of s’, [ fghxijkl ], has a 1-variant match with the partition variation [ fghijkl ] generated from the second partition of s.
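
The 1-variant match can be illustrated with a small sketch. It assumes that the k-variant family of a string is its deletion neighborhood, that is, every string obtained by deleting at most k characters; the slides name the family but do not spell out its definition.

```python
from itertools import combinations

def gen_variants(s, k):
    """All strings obtained from s by deleting at most k characters (assumed definition)."""
    variants = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            dropped = set(drop)
            variants.add("".join(c for i, c in enumerate(s) if i not in dropped))
    return variants

query_side = gen_variants("fghxijkl", 1)   # second partition of s'
index_side = gen_variants("fghijkl", 1)    # partition variation generated from s
print(query_side & index_side)
```

Deleting the x from [ fghxijkl ] yields [ fghijkl ], which is also in the family of the indexed variation, so the two families intersect.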

Page 12: Efficient Approximate Entity Extraction with Edit Distance Constraints

If a partition (variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants.

Assume lp is set to 3. Then 1-variants are generated from only the following prefixes:

<[ abc ], 1> <[ ghi ], 2> <[ hij ], 2> <[ fgh ], 2>

By setting lp ≤ m/kτ − 2, the total number of 1-variants generated is further reduced to O(lp τ²).

Page 13: Efficient Approximate Entity Extraction with Edit Distance Constraints

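Short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong.

For each entity whose length is smaller than kτ · lp + τ, the τ-variant family of its lp-prefix is generated and indexed in Ishort.

For each entity whose length is at least kτ · lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which will be indexed in Ilong.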

Page 14: Efficient Approximate Entity Extraction with Edit Distance Constraints

Algorithm : BuildIndex (E, τ, lp)
  for each e ∈ E do
    if |e| < kτ · lp + τ then
      V <- GenVariants(e[1 .. min(lp, |e|)], τ);  /* The GenVariants (s, k) function generates the k-variant family of string s */
      for each v ∈ V do
        Ishort[v] <- Ishort[v] U { e };
    if |e| ≥ kτ · lp then
      P <- the set of kτ partitions of e;
      for each i-th partition p ∈ P do
        PT <- TransformPartition(p);  /* according to the three transformation rules in Section 3.1 */
        for each partition variation pT ∈ PT do
          V <- GenVariants(pT[1 .. lp], 1);
          for each v ∈ V do
            Ilong[v] <- Ilong[v] U { <e, i> };
  return (Ishort, Ilong)
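
A minimal Python sketch of BuildIndex is given below. It rests on several assumptions not stated in the slides: GenVariants is taken to be the deletion neighborhood, kτ is rounded up, partitions have near-equal length with the remainder in the trailing parts, and the transformation rules are read as "shift moves the start, scale changes the length".

```python
from collections import defaultdict
from itertools import combinations
from math import ceil

def gen_variants(s, k):
    """Strings obtained from s by deleting at most k characters (assumed definition)."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            gone = set(drop)
            out.add("".join(c for i, c in enumerate(s) if i not in gone))
    return out

def partition_spans(n, k):
    """k contiguous (start, end) spans covering range(n), remainder in trailing spans."""
    base, extra = divmod(n, k)
    spans, start = [], 0
    for i in range(k):
        end = start + base + (1 if i >= k - extra else 0)
        spans.append((start, end))
        start = end
    return spans

def transform_partition(s, start, end, pid, k, tau):
    """Partition variations under the three transformation rules (pid is 1-based)."""
    if pid == 1:
        moves = [(0, sc) for sc in range(-2, 3)]
    elif pid == k:
        moves = [(d, -d) for d in range(-tau, tau + 1)]
    else:
        moves = [(sh, sc) for sh in range(-tau, tau + 1) for sc in range(-2, 3)]
    spans = [(start + sh, end + sh + sc) for sh, sc in moves]
    return [s[lo:hi] for lo, hi in spans if 0 <= lo < hi <= len(s)]

def build_index(entities, tau, lp):
    k = ceil((tau + 1) / 2)
    i_short, i_long = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:
            # Short entity: index the tau-variant family of its lp-prefix.
            for v in gen_variants(e[:lp], tau):
                i_short[v].add(e)
        if len(e) >= k * lp:
            # Long entity: index 1-variants of the lp-prefix of each partition variation.
            for pid, (a, b) in enumerate(partition_spans(len(e), k), start=1):
                for pt in transform_partition(e, a, b, pid, k, tau):
                    for v in gen_variants(pt[:lp], 1):
                        i_long[v].add((e, pid))
    return i_short, i_long

i_short, i_long = build_index(["abcdefghijkl", "abc"], tau=3, lp=3)
print(sorted(i_long["fgh"]), sorted(i_short["ab"]))
```

Note that an entity whose length falls in [kτ · lp, kτ · lp + τ) satisfies both conditions and is stored in both indexes, as in the pseudocode.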

Page 15: Efficient Approximate Entity Extraction with Edit Distance Constraints

Algorithm : MatchDocument (D, E, τ)
  for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong (D[p .. p + lp − 1], E, τ);   /* matching entities no shorter than kτ · lp */
    SearchShort (D[p .. p + lp − 1], E, τ);  /* matching entities of length in [Lmin, kτ · lp) */

Page 16: Efficient Approximate Entity Extraction with Edit Distance Constraints

Algorithm : SearchLong (s, E, τ)
  R <- ∅;  /* holds results */
  C <- ∅;  /* holds candidates */
  V <- GenVariants(s, 1);  /* gen 1-variant family */
  for each v ∈ V do
    for each <e, pid> ∈ Ilong[v] do
      C <- C U { <e, pid> };  /* duplicates removed */
  for each <e, pid> ∈ C do
    S <- QuerySegmentInstantiation(e, pid);  /* returns the set of query segment candidates for e */
    for each seg ∈ S do
      if Verify(seg, e) = true then
        R <- R U { <seg, e> };
  return R
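
The candidate-then-verify flow of SearchLong can be sketched as follows. QuerySegmentInstantiation is not specified in the slides, so this sketch substitutes a brute-force stand-in that tries every start within τ of the expected position and every length within τ of the entity length. The `part_offset` callback and the hand-built `i_long` in the usage lines are hypothetical.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def gen_variants(s, k):
    """Strings obtained from s by deleting at most k characters (assumed definition)."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            gone = set(drop)
            out.add("".join(c for i, c in enumerate(s) if i not in gone))
    return out

def search_long(doc, p, lp, i_long, part_offset, tau):
    """Probe i_long with the 1-variants of doc[p:p+lp], then verify each candidate."""
    candidates = set()
    for v in gen_variants(doc[p:p + lp], 1):
        candidates |= i_long.get(v, set())          # set union removes duplicates
    results = set()
    for e, pid in candidates:
        # Where e would start in doc if the partition were neither shifted nor edited.
        anchor = p - part_offset(e, pid)
        for start in range(max(0, anchor - tau), anchor + tau + 1):
            for length in range(max(1, len(e) - tau), len(e) + tau + 1):
                seg = doc[start:start + length]
                if len(seg) == length and edit_distance(seg, e) <= tau:
                    results.add((seg, e))
    return results

entity = "abcdefghijkl"
i_long = {"fgh": {(entity, 2)}}                      # hand-built index entry
doc = "xx axxbcdefghxijkl yy"
offset = lambda e, pid: (pid - 1) * len(e) // 2      # two equal partitions assumed
print(search_long(doc, doc.index("fgh"), 3, i_long, offset, 3))
```

The verification loop is deliberately exhaustive; a real instantiation step would enumerate far fewer segments.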

Page 17: Efficient Approximate Entity Extraction with Edit Distance Constraints

SearchShort (s)

We need to generate the τ-variant families for each possible length l between Lmin − τ and lp.

If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified.

Otherwise, we need to perform verification for 2τ + 1 possible query segments.

Page 18: Efficient Approximate Entity Extraction with Edit Distance Constraints

For example, enumerate the 1-variants of the string [ abcdef ] from left to right.

Suppose no variant in the index starts with abc. The algorithm still enumerates the other three 1-variants containing the prefix abc.

To avoid this, introduce a parameter lpp, set to lp/2.

Page 19: Efficient Approximate Entity Extraction with Edit Distance Constraints

Consider 4 possible cases:

Prefix Match    Suffix Match    Action
True            True            enumerate all 1-variants of q[1 .. lp]
False           False           discard q as there is no match
False           True            enumerate all 1-variants of q[1 .. lpp]
True            False           enumerate all 1-variants of q[(lpp + 1) .. lp]
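
The four cases can be written as a small decision function. This sketch assumes that "prefix match" means q[1 .. lpp] appears among the indexed first halves and "suffix match" means q[(lpp + 1) .. lp] appears among the indexed second halves; the two lookup sets are hypothetical stand-ins for whatever structure the index provides.

```python
def part_to_enumerate(q, lp, lpp, indexed_prefixes, indexed_suffixes):
    """Return the part of q whose 1-variants must be enumerated, or None to discard q."""
    prefix, suffix = q[:lpp], q[lpp:lp]
    prefix_match = prefix in indexed_prefixes
    suffix_match = suffix in indexed_suffixes
    if prefix_match and suffix_match:
        return q[:lp]      # both halves seen: enumerate variants of the whole lp-prefix
    if suffix_match:
        return prefix      # the single edit must lie in the first half
    if prefix_match:
        return suffix      # the single edit must lie in the second half
    return None            # neither half matches: more than one edit, discard

prefixes, suffixes = {"abc"}, {"def"}
for q in ["abcdef", "xyzdef", "abcxyz", "uvwxyz"]:
    print(q, part_to_enumerate(q, 6, 3, prefixes, suffixes))
```

With at most one edit in q[1 .. lp], at least one of the two halves is untouched, which is why the case with no match on either side can be discarded.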

Page 20: Efficient Approximate Entity Extraction with Edit Distance Constraints

Reduced the size of the neighborhood

Proposed an efficient query processing algorithm

Optimized the algorithm to share computation

Avoided unnecessary variant enumeration
