Distance functions and IE – 4? William W. Cohen CALD

Jan 05, 2016

Page 1: Distance functions and IE – 4? William W. Cohen CALD.

Distance functions and IE – 4?

William W. Cohen

CALD

Page 2: Distance functions and IE – 4? William W. Cohen CALD.

Announcements

• Current statistics:
  – days with unscheduled student talks: 6
  – students with unscheduled student talks: 4
  – Projects are due: 4/28 (last day of class)
  – Additional requirement: draft (for comments) no later than 4/21

Page 3: Distance functions and IE – 4? William W. Cohen CALD.

The data integration problem

Page 4: Distance functions and IE – 4? William W. Cohen CALD.

String distance metrics so far...

• Term-based (e.g. TF/IDF as in WHIRL)
  – Distance depends on set of words contained in both s and t, so sensitive to spelling errors.
  – Usually weight words to account for “importance”
  – Fast comparison: O(n log n) for |s|+|t|=n
• Edit-distance metrics
  – Distance is shortest sequence of edit commands that transform s to t.
  – No notion of word importance
  – More expensive: O(n^2)
• Other metrics
  – Jaro metric & variants
  – Monge-Elkan’s recursive string matching
  – etc?
• Which metrics work best, for which problems?
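To make the edit-distance bullet concrete, here is a minimal Python sketch of unit-cost Levenshtein distance. It is an illustration rather than code from the lecture; the function name is an assumption. The nested loop over both strings is where the quadratic cost comes from.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance: fewest inserts, deletes, and substitutions turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            substitute = prev[j - 1] + (0 if cs == ct else 1)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory is linear in |t| even though time is O(|s|·|t|).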

Page 5: Distance functions and IE – 4? William W. Cohen CALD.

Jaro metric

Page 6: Distance functions and IE – 4? William W. Cohen CALD.

Winkler-Jaro metric
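The slide bodies for the Jaro and Winkler-Jaro metrics did not survive in this transcript. As a stand-in, the sketch below follows the commonly cited textbook definitions, not necessarily the exact form shown in the lecture. The prefix weight `p=0.1` and the prefix cap of 4 are the usual defaults and are assumptions here.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1], using the standard matching window."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up out of order count as half-transpositions.
    s_common = [c for c, hit in zip(s, s_hit) if hit]
    t_common = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for a shared prefix (Winkler's adjustment)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


print(round(jaro("MARTHA", "MARHTA"), 4), round(jaro_winkler("MARTHA", "MARHTA"), 4))
```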

Page 7: Distance functions and IE – 4? William W. Cohen CALD.
Page 9: Distance functions and IE – 4? William W. Cohen CALD.

So which metric should you use?

SecondString (Cohen, Ravikumar, Fienberg):

• Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
• Tools for evaluating performance on test data
• Exploratory tool for adding, testing, combining string distances
  – e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
• URL – http://secondstring.sourceforge.net
• Distribution also includes several sample matching problems.
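The “Winkler rescorer” idea can be sketched as a wrapper around any similarity function with range [0,1]. This is an illustrative Python sketch, not SecondString's Java API. The names `winkler_rescore` and `char_jaccard`, and the defaults `p=0.1` and `max_prefix=4`, are assumptions.

```python
from typing import Callable

Similarity = Callable[[str, str], float]


def winkler_rescore(sim: Similarity, p: float = 0.1, max_prefix: int = 4) -> Similarity:
    """Wrap a [0,1] similarity so that strings sharing a prefix score higher."""
    def rescored(s: str, t: str) -> float:
        base = sim(s, t)
        prefix = 0
        for a, b in zip(s, t):
            if a != b or prefix == max_prefix:
                break
            prefix += 1
        # Move the score toward 1 in proportion to the shared prefix length.
        return base + prefix * p * (1.0 - base)
    return rescored


def char_jaccard(s: str, t: str) -> float:
    """Toy base similarity (hypothetical): Jaccard overlap of character sets."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if a | b else 1.0


boosted = winkler_rescore(char_jaccard)
print(char_jaccard("cohen", "cohn"), boosted("cohen", "cohn"))
```

Because `p * max_prefix` is below 1, the rescored value stays within [0,1] whenever the base similarity does.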

Page 10: Distance functions and IE – 4? William W. Cohen CALD.

SecondString distance functions

• Edit-distance like:
  – Levenshtein (unit costs)
  – untuned Smith-Waterman
  – Monge-Elkan (tuned Smith-Waterman)
  – Jaro and Jaro-Winkler

Page 11: Distance functions and IE – 4? William W. Cohen CALD.

Results - Edit Distances

Monge-Elkan is the best on average....

Page 12: Distance functions and IE – 4? William W. Cohen CALD.

Edit distances

Page 13: Distance functions and IE – 4? William W. Cohen CALD.

SecondString distance functions

• Term-based, for sets of terms S and T:
  – TFIDF distance
  – Jaccard distance:

$$\mathrm{sim}(S,T) = \frac{|S \cap T|}{|S \cup T|}$$

  – Language models: construct $P_S$ and $P_T$ and use
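A minimal Python sketch of the Jaccard similarity over token sets. Whitespace tokenization and the convention that two empty sets score 1.0 are assumptions made for the example.

```python
def jaccard(s_tokens, t_tokens) -> float:
    """|S ∩ T| / |S ∪ T| over the distinct tokens of the two strings."""
    S, T = set(s_tokens), set(t_tokens)
    if not S and not T:
        return 1.0  # convention chosen here for two empty strings
    return len(S & T) / len(S | T)


print(jaccard("william w cohen".split(), "william cohen".split()))
```

Every token counts equally here, which is the main contrast with TFIDF weighting.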

Page 14: Distance functions and IE – 4? William W. Cohen CALD.

SecondString distance functions

• Term-based, for sets of terms S and T:
  – TFIDF distance
  – Jaccard distance
  – Jensen-Shannon distance
    • smoothing toward union of S,T reduces cost of disagreeing on common terms
    • unsmoothed PS, Dirichlet smoothing, Jelinek-Mercer
  – “Simplified Fellegi-Sunter”
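As a rough illustration of the Jensen-Shannon option, the sketch below builds unsmoothed unigram models for S and T and compares them. The smoothing variants listed above are not implemented. The function names and the base-2 logarithm (which bounds the value by 1) are assumptions.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Unsmoothed maximum-likelihood token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def jensen_shannon(p, q) -> float:
    """JS divergence: average KL divergence of p and q to their midpoint."""
    vocab = set(p) | set(q)

    def kl_to_midpoint(a):
        total = 0.0
        for w in vocab:
            aw = a.get(w, 0.0)
            if aw > 0.0:
                mid = 0.5 * (p.get(w, 0.0) + q.get(w, 0.0))
                total += aw * math.log2(aw / mid)
        return total

    return 0.5 * kl_to_midpoint(p) + 0.5 * kl_to_midpoint(q)


P_S = unigram_model("william w cohen".split())
P_T = unigram_model("william cohen".split())
print(jensen_shannon(P_S, P_T))
```

Identical distributions give 0 and distributions with no shared tokens give 1, so lower means more similar.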

Page 15: Distance functions and IE – 4? William W. Cohen CALD.
Page 16: Distance functions and IE – 4? William W. Cohen CALD.

Results – Token Distances

Page 17: Distance functions and IE – 4? William W. Cohen CALD.
Page 18: Distance functions and IE – 4? William W. Cohen CALD.

SecondString distance functions

• Hybrid term-based & edit-distance based:
  – Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
  – SoftTFIDF
    • Like TFIDF but consider not just tokens in both S and T, but tokens in S “close to” something in T (“close to” relative to some distance metric)
    • Downweight close tokens slightly
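The SoftTFIDF description can be sketched as follows. This is a simplified Python illustration, not SecondString's implementation: the TF-IDF weighting formula, the threshold `theta=0.9`, and the use of `difflib.SequenceMatcher` as the token-level “close to” metric are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def unit_tfidf(tokens, doc_freq, n_docs):
    """Length-normalized TF-IDF weights; unseen tokens are treated as df = 1."""
    tf = Counter(tokens)
    raw = {w: math.log(1 + c) * math.log(1 + n_docs / doc_freq.get(w, 1))
           for w, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else {}


def soft_tfidf(s_tokens, t_tokens, doc_freq, n_docs, token_sim, theta=0.9):
    """Cosine-like score where a token in S may pair with a merely close token in T."""
    vs = unit_tfidf(s_tokens, doc_freq, n_docs)
    vt = unit_tfidf(t_tokens, doc_freq, n_docs)
    score = 0.0
    for w, weight_w in vs.items():
        best_v, best_sim = None, 0.0
        for v in vt:
            sim = token_sim(w, v)
            if sim > best_sim:
                best_v, best_sim = v, sim
        if best_v is not None and best_sim >= theta:
            # Multiplying by best_sim downweights close-but-not-equal tokens.
            score += weight_w * vt[best_v] * best_sim
    return score


def exact(a, b):
    return 1.0 if a == b else 0.0


def fuzzy(a, b):
    return SequenceMatcher(None, a, b).ratio()


corpus = [["william", "cohen"], ["william", "smith"], ["john", "cohen"], ["mary", "jones"]]
doc_freq = Counter(w for doc in corpus for w in set(doc))
s, t = ["william", "cohen"], ["willliam", "cohen"]
print(soft_tfidf(s, t, doc_freq, len(corpus), exact))  # plain TFIDF: only "cohen" counts
print(soft_tfidf(s, t, doc_freq, len(corpus), fuzzy))  # the misspelled token also counts
```

With `exact` as the token metric the function reduces to ordinary TFIDF cosine, which makes the effect of the soft matching easy to compare.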

Page 19: Distance functions and IE – 4? William W. Cohen CALD.
Page 20: Distance functions and IE – 4? William W. Cohen CALD.

Results – Hybrid distances

Page 21: Distance functions and IE – 4? William W. Cohen CALD.

Results - Overall

Page 22: Distance functions and IE – 4? William W. Cohen CALD.
Page 23: Distance functions and IE – 4? William W. Cohen CALD.

Prospective test on two clustering tasks

Page 24: Distance functions and IE – 4? William W. Cohen CALD.

An anomalous dataset

Page 25: Distance functions and IE – 4? William W. Cohen CALD.

An anomalous dataset: census

Page 26: Distance functions and IE – 4? William W. Cohen CALD.

An anomalous dataset: census

Why?

Page 27: Distance functions and IE – 4? William W. Cohen CALD.

Other results with SecondString

• Distance functions over structured data records (first name, last name, street, house number)

• Learning to combine distance functions

• Unsupervised/semi-supervised training for distance functions over structured data

Page 28: Distance functions and IE – 4? William W. Cohen CALD.

Combining Information Extraction and Similarity Computations

1) Bunescu et al

2) Krauthammer et al

Page 29: Distance functions and IE – 4? William W. Cohen CALD.

Experiments

• Hand-tagged 50 abstracts for gene/protein entities (pre-selected to be about human genes)
• Collected dictionary of 40,000+ protein names from on-line sources
  – not complete
  – example matching is not sufficient
• Approach: use hand-coded heuristics to propose likely generalizations of existing dictionary entries.
  – not hand-coded or off-the-shelf similarity metrics

Page 30: Distance functions and IE – 4? William W. Cohen CALD.

Example name generalizations

Page 31: Distance functions and IE – 4? William W. Cohen CALD.

Basic idea behind the algorithm

original dictionary

carefully-tuned heuristics (aka hacks)

similar (but not identical) process applied to word n-grams from text to do IE: extract if n-gram -> CD

Page 32: Distance functions and IE – 4? William W. Cohen CALD.

Example: canonicalizing “short names” (different procedure for “full names” and “one-word” names)

Page 33: Distance functions and IE – 4? William W. Cohen CALD.

Example: canonicalizing “short names” (different procedure for “full names” and “one-word” names)

[Figure: worked example for a short name. Labels recoverable from the diagram: “NF-25 in OD”; “NF<n>”; “Nf<n>”; “NF => CD (from <x><n>)”; “Recognize: ‘... NF-kappa B...’”; “NF<g><l>”; “NF in CD? (<x><g><l>)”.]
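The placeholder tokens in this example suggest that numbers map to `<n>`, Greek letter names to `<g>`, and single letters to `<l>`. The Python sketch below shows one hypothetical way to produce such generalized forms. It is not the authors' heuristics: the tokenization rule, the Greek-letter list, and the function name `generalize` are all assumptions.

```python
import re

# Hypothetical list of Greek letter names as they might be spelled out in text.
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta",
         "iota", "kappa", "lambda", "mu", "nu", "xi", "omicron", "pi", "rho",
         "sigma", "tau", "upsilon", "phi", "chi", "psi", "omega"}


def generalize(name: str) -> str:
    """Replace numbers, Greek letter names, and single letters by class tokens."""
    out = []
    for tok in re.findall(r"[A-Za-z]+|\d+", name):
        if tok.isdigit():
            out.append("<n>")
        elif tok.lower() in GREEK:
            out.append("<g>")
        elif len(tok) == 1:
            out.append("<l>")
        else:
            out.append(tok)
    return "".join(out)


print(generalize("NF-25"))       # NF<n>
print(generalize("NF-kappa B"))  # NF<g><l>
```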

Page 34: Distance functions and IE – 4? William W. Cohen CALD.

Results

• Why is precision less than 100%?

• When should you use “similarity by normalization”?

• Could a simpler algorithm do as well?

• Is there overfitting? (50 abstracts, <750 proteins)

Page 35: Distance functions and IE – 4? William W. Cohen CALD.

...

Page 36: Distance functions and IE – 4? William W. Cohen CALD.

Combining Information Extraction and Similarity Computations

1) Bunescu et al

2) Krauthammer et al

Page 37: Distance functions and IE – 4? William W. Cohen CALD.

Background

• Common task in proteomics/genomics:
  – look for (soft) matches to a query sequence in a large “database” of sequences.
  – want to find subsequences (genes) that are highly similar (and hence probably related)
  – want to ignore “accidental” matches
  – possible technique is Smith-Waterman (local alignment)
    • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance

Page 39: Distance functions and IE – 4? William W. Cohen CALD.

Smith-Waterman distance

    c o h e n d o r f
m   0 0 0 0 0 0 0 0 0
c   1 0 0 0 0 0 0 0 0
c   0 0 0 0 0 0 0 0 0
o   0 2 1 0 0 0 2 1 0
h   0 1 4 3 2 1 1 1 0
n   0 0 3 3 5 4 3 2 1
s   0 0 2 2 4 4 3 2 1
k   0 0 1 1 3 3 3 2 1
i   0 0 0 0 2 2 2 2 1

dist=5
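A matrix of this kind can be filled in with the standard Smith-Waterman recurrence. The Python sketch below is illustrative only: the scores (match +2, mismatch -1, gap -1) are assumptions, and it is not tuned to reproduce the exact numbers in the table above.

```python
def smith_waterman(s: str, t: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local-alignment score between any substring of s and any substring of t."""
    best = 0
    prev = [0] * (len(t) + 1)
    for cs in s:
        curr = [0]
        for j, ct in enumerate(t, start=1):
            diagonal = prev[j - 1] + (match if cs == ct else mismatch)
            # Flooring at 0 lets an alignment restart anywhere, which makes it local.
            curr.append(max(0, diagonal, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman("mccohnski", "cohendorf"))
```

The returned value is the highest cell in the matrix, the same quantity the slide reports as the score of the best matching region.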

Page 40: Distance functions and IE – 4? William W. Cohen CALD.

In general “peaks” in the matrix scores indicate highly similar substrings.

Page 41: Distance functions and IE – 4? William W. Cohen CALD.

Background

• Common task in proteomics/genomics:
  – look for (soft) matches to a query sequence in a large “database” of sequences.
  – possible technique is Smith-Waterman (local alignment)
    • want char-char “reward” for alignment to reflect confidence that the alignment is not due to chance
    • based on substitutability theory for amino acids
  – doesn’t scale well
    • BLAST and FASTA: fast approximate S-W

Page 42: Distance functions and IE – 4? William W. Cohen CALD.

BLAST/FASTA ideas

• Find all char n-grams (“words”) in the query string.
• FASTA:
  – Use inverted indices to find out where these words appear in the DB sequence
  – Use S-W only near DB sections that contain some of these words
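The seeding step can be sketched in a few lines of Python. This is a toy illustration of the inverted-index idea, not the FASTA program. The n-gram length, the `margin` parameter, and the function names are assumptions. A local-alignment routine such as Smith-Waterman would then be run only inside the returned regions.

```python
from collections import defaultdict


def ngram_index(db: str, n: int):
    """Inverted index: each character n-gram maps to its start positions in db."""
    index = defaultdict(list)
    for i in range(len(db) - n + 1):
        index[db[i:i + n]].append(i)
    return index


def candidate_regions(query: str, db: str, n: int = 3, margin: int = 5):
    """Merged (start, end) spans of db that share at least one n-gram with the query."""
    index = ngram_index(db, n)
    spans = []
    for q in range(len(query) - n + 1):
        for pos in index.get(query[q:q + n], ()):
            start = max(0, pos - q - margin)
            end = min(len(db), pos - q + len(query) + margin)
            spans.append((start, end))
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


db = "x" * 50 + "cohen" + "y" * 50
print(candidate_regions("cohen", db))
```

In a real system the index would be built once per database and reused across queries; it is rebuilt per call here only to keep the sketch short.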

Page 43: Distance functions and IE – 4? William W. Cohen CALD.

BLAST/FASTA ideas

• Find all char n-grams (“words”) in the query string.
• BLAST:
  – Generate variations of these words by looking for changes that would lead to strong similarities
  – Discard “low IDF” words (where accidental matches are likely)
  – Use expanded set of n-grams to focus search

Page 44: Distance functions and IE – 4? William W. Cohen CALD.

query string

words and expansions

Page 45: Distance functions and IE – 4? William W. Cohen CALD.

BLAST/FASTA ideas

• Find all char n-grams (“words”) in the query string.
• BLAST:
  – Generate variations of these words by looking for changes that would lead to strong similarities
  – Discard “low IDF” words (where accidental matches are likely)
  – Use expanded set of n-grams to focus search
• The BLAST program:
  – Widely used,
  – Fast implementation,
  – Supports asking multiple queries against a database at once...
  – Can one use it to find soft matches of protein names (from a dictionary) in text?

Page 46: Distance functions and IE – 4? William W. Cohen CALD.

Basic idea:

• Protein database
• Query strings
• Proposed alignment (query->database)
• Query algorithm: BLAST

• Biomedical paper
• Protein name dictionary
• Extracted protein name (dict. entry->text)
• IE system: dictionaries+BLAST (optimized for this problem)

Page 47: Distance functions and IE – 4? William W. Cohen CALD.

1) Mapping text to DNA sequences (Q: what sort of char similarity is this?)

Page 48: Distance functions and IE – 4? William W. Cohen CALD.

2) Optimizing blast

• Split protein-name database into several parts (for short, medium-length, long protein names)

• Require space chars before and after “short” protein names.

• Manually search (grid search?) for better settings for certain key parameters for each protein-name subdatabase
  – With what data?

• Evaluate on one review article, 1162 protein names
  – inter-annotator agreement not great (70-85%)

Page 49: Distance functions and IE – 4? William W. Cohen CALD.

2) Optimizing blast

Page 50: Distance functions and IE – 4? William W. Cohen CALD.

2) Optimizing blast

Page 51: Distance functions and IE – 4? William W. Cohen CALD.

Results

Page 52: Distance functions and IE – 4? William W. Cohen CALD.

Results

Overall: precision 71.1%, recall 78.8% (opt)