1 Interactive Information Extraction and Social Network Analysis Andrew McCallum Information Extraction and Synthesis Laboratory UMass Amherst

Dec 31, 2015


Nicholas Bailey
Transcript
Page 1

1

Interactive Information Extraction and Social Network Analysis

Andrew McCallum Information Extraction and Synthesis Laboratory

UMass Amherst

Page 2

9

Motivation

• Capture confidence of records in extracted database

• Alerts data mining to possible errors in database

First Name   Last Name   Confidence
Bill         Gates       0.96
Bill         banks       0.43

Page 3

10

Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]

Finite State Lattice

[Figure: lattice of FSM states (OTHER, TITLE, ORG, PERSON). Output sequence: y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, ... Input sequence (observations): x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3} = "said Arden Bement NSF Director …"]

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)$$

Page 4

11

Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]

Constrained Forward-Backward

[Figure: lattice of FSM states (OTHER, TITLE, ORG, PERSON). Output sequence: y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, ... Input sequence (observations): x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3} = "said Arden Bement NSF Director …"]

$$p(\text{Arden Bement} = \text{PERSON} \mid x) = \frac{1}{Z(x)} \sum_{y \in C} \prod_{t=1}^{T} \Phi_y(y_t, y_{t-1}) \, \Phi_{xy}(x_t, y_t)$$
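The constrained sum can be sketched as two forward passes: one over all label paths (giving Z(x)) and one restricted to the paths in C. This is a minimal illustration, not the authors' implementation; the function names, the toy scores, and the omission of a start-transition potential are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == NEG_INF:
        return NEG_INF
    return top + math.log(sum(math.exp(v - top) for v in values))

def log_path_sum(n_steps, labels, log_trans, log_obs, allowed=None):
    """Forward recursion: log of the summed potential of every label path.

    `allowed` maps a position t to the set of labels permitted there;
    paths violating it are excluded, restricting the sum to the set C.
    """
    def permitted(t, y):
        return allowed is None or t not in allowed or y in allowed[t]

    alpha = {y: log_obs(0, y) if permitted(0, y) else NEG_INF for y in labels}
    for t in range(1, n_steps):
        alpha = {
            y: (log_sum_exp([alpha[p] + log_trans(p, y) for p in labels])
                + log_obs(t, y)) if permitted(t, y) else NEG_INF
            for y in labels
        }
    return log_sum_exp(list(alpha.values()))

def field_confidence(n_steps, labels, log_trans, log_obs, allowed):
    """Constrained path mass divided by total path mass."""
    log_z = log_path_sum(n_steps, labels, log_trans, log_obs)
    log_z_c = log_path_sum(n_steps, labels, log_trans, log_obs, allowed)
    return math.exp(log_z_c - log_z)

# Toy log-potentials (made up) for the tokens "said", "Arden", "Bement".
LABELS = ["OTHER", "PERSON"]
OBS = [
    {"OTHER": 2.0, "PERSON": 0.0},
    {"OTHER": 0.0, "PERSON": 1.5},
    {"OTHER": 0.5, "PERSON": 1.0},
]

def toy_obs(t, y):
    return OBS[t][y]

def toy_trans(prev, cur):
    return 0.5 if prev == cur else 0.0

# Confidence that tokens 1 and 2 ("Arden Bement") are both PERSON.
confidence = field_confidence(
    3, LABELS, toy_trans, toy_obs, {1: {"PERSON"}, 2: {"PERSON"}}
)
print(round(confidence, 3))
```

Working in log space keeps the recursion stable for long sequences; the ratio of the two sums is the field's confidence.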

Page 5

12

Forward-Backward Confidence Estimation improves accuracy/coverage

[Figure legend: optimal; our forward-backward confidence; traditional token-wise confidence; no use of confidence]

Page 6

13

Application of Confidence Estimation

Interactive Information Extraction:
– To correct predictions, direct user to least confident field

Page 7

14

Interactive Information Extraction

• IE algorithm calculates confidence scores
• UI uses confidence scores to alert user to possible errors
• IE algorithm takes corrections into account and propagates correction to other fields

Page 8

15

User Correction

User corrects a field, e.g. dragging Stanley to the First Name field

[Figure: lattice over labels y1 ... y5 (First Name, Last Name, Address Line) and tokens x1 ... x5 = "Charles Stanley 100 Charles Street"]

Page 9

16

Remove Paths

User corrects a field, e.g. dragging Stanley to the First Name field

[Figure: lattice over labels y1 ... y5 (First Name, Last Name, Address Line) and tokens x1 ... x5 = "Charles Stanley 100 Charles Street"]

Page 10

17

Constrained Viterbi

Viterbi algorithm is constrained to pass through the designated state.

Adjacent field changed: Correction Propagation

[Figure: lattice over labels y1 ... y5 (First Name, Last Name, Address Line) and tokens x1 ... x5 = "Charles Stanley 100 Charles Street"]
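Constrained Viterbi can be sketched by giving every disallowed (position, label) pair a score of minus infinity, so the best path is forced through the user's correction. The scores below are invented so that the toy example reproduces the correction-propagation effect on the slide; none of the names or numbers come from the original system.

```python
NEG_INF = float("-inf")

def constrained_viterbi(n_steps, labels, log_trans, log_obs, fixed=None):
    """Best label path, forced through fixed[t] wherever t is in `fixed`."""
    fixed = fixed or {}

    def permitted(t, y):
        return t not in fixed or fixed[t] == y

    delta = {y: log_obs(0, y) if permitted(0, y) else NEG_INF for y in labels}
    backpointers = []
    for t in range(1, n_steps):
        new_delta, pointers = {}, {}
        for y in labels:
            best_prev = max(labels, key=lambda p: delta[p] + log_trans(p, y))
            pointers[y] = best_prev
            if permitted(t, y):
                new_delta[y] = delta[best_prev] + log_trans(best_prev, y) + log_obs(t, y)
            else:
                new_delta[y] = NEG_INF
        delta = new_delta
        backpointers.append(pointers)

    # Trace back from the best final label.
    path = [max(labels, key=lambda y: delta[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy scores (hypothetical) for "Charles Stanley 100 Charles Street".
LABELS = ["FIRST", "LAST", "ADDRESS"]
OBS = [
    {"FIRST": 1.0, "LAST": 0.8, "ADDRESS": -2.0},   # Charles
    {"FIRST": 0.8, "LAST": 1.0, "ADDRESS": -2.0},   # Stanley
    {"FIRST": -3.0, "LAST": -3.0, "ADDRESS": 3.0},  # 100
    {"FIRST": 1.0, "LAST": 0.8, "ADDRESS": 0.5},    # Charles
    {"FIRST": -3.0, "LAST": -3.0, "ADDRESS": 3.0},  # Street
]

def toy_obs(t, y):
    return OBS[t][y]

def toy_trans(prev, cur):
    if prev == cur:
        return 1.0 if cur == "ADDRESS" else -5.0  # name fields are one token
    if prev == "ADDRESS":
        return -5.0                               # no name after the address
    return 0.0

unconstrained = constrained_viterbi(5, LABELS, toy_trans, toy_obs)
# The user drags "Stanley" (position 1) to the First Name field.
corrected = constrained_viterbi(5, LABELS, toy_trans, toy_obs, fixed={1: "FIRST"})
print(unconstrained)
print(corrected)
```

In this toy run the label of the neighbouring token "Charles" changes as a side effect of fixing "Stanley", which is the propagation the slide describes.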

Page 11

18

Constrained Viterbi

After fixing least confident field, constrained Viterbi automatically reduces error by another 23%.

Recent work reduces annotation effort further
– simplifies annotation to multiple-choice

First Name Last Name City

A) Bill Gates Redmond WA

B) Bill Gates Redmond

Page 12

19

User feedback “in the wild” as labeling

Labeling for Classification

Easy: Often found in user interfaces, e.g. CALO IRIS, Apple Mail

Seminar: How to Organize your Life
by Jane Smith, Stevenson & Smith
Mezzanine Level, Papadopoulos Sq
3:30 pm Thursday March 31
In this seminar we will learn how to use CALO to...

Choices: Seminar announcement / Todo request / Other

Labeling for Extraction

Painful: Difficult even for paid labelers

Complex tools

Seminar: How to Organize your Life
by Jane Smith, Stevenson & Smith
Mezzanine Level, Papadopoulos Sq
3:30 pm Thursday March 31
In this seminar we will learn how to use CALO to...

Click, drag, adjust, label, Click, drag, adjust, label, ...

Page 13

20

Multiple-choice Annotation for Learning Extractors “in the wild”

[Culotta, McCallum 2005]

Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.

Task: Information Extraction. Fields: NAME COMPANY ADDRESS (and others)

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

user corrects labels, not segmentations

Interface presents top hypothesized segmentations

Page 14

21

Multiple-choice Annotation for Learning Extractors “in the wild”

[Culotta, McCallum 2005]

Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

user corrects labels, not segmentations

Interface presents top hypothesized segmentations

Task: Information extraction. Fields: NAME COMPANY ADDRESS (and others)

Page 15

22

Multiple-choice Annotation for Learning Extractors “in the wild”

[Culotta, McCallum 2005]

Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

29% reduction in user actions needed to train

Interface presents top hypothesized segmentations

Task: Information extraction. Fields: NAME COMPANY ADDRESS (and others)

Page 16

24

Piecewise Training in Factorial CRFs for Transfer Learning

Emailed seminar announcement entities

Email English words

[Sutton, McCallum, 2005]

Too little labeled training data.

60k words training.

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Page 17

25

Piecewise Training in Factorial CRFs for Transfer Learning

Newswire named entities

Newswire English words

[Sutton, McCallum, 2005]

Train on “related” task with more data.

200k words training.

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Page 18

26

Piecewise Training in Factorial CRFs for Transfer Learning

Newswire named entities

Email English words

[Sutton, McCallum, 2005]

At test time, label email with newswire NEs...

Page 19

27

Piecewise Training in Factorial CRFs for Transfer Learning

Newswire named entities

Emailed seminar announcement entities

Email English words

[Sutton, McCallum, 2005]

…then use these labels as features for final task

Page 20

28

Piecewise Training in Factorial CRFs for Transfer Learning

Newswire named entities

Seminar Announcement entities

English words

[Sutton, McCallum, 2005]

Use joint inference at test time.

An alternative to hierarchical Bayes. Needn’t know anything about parameterization of subtask.

Accuracy: No transfer < Cascaded Transfer < Joint Inference Transfer

Page 21

A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

Andrew McCallum

Kedar Bellare

Fernando Pereira

Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.


Page 22

31

String Edit Distance

Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.

Page 23

32

String Edit Distance

Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.

Applications
– Database Record Deduplication

Apex International Hotel Grassmarket Street

Apex Internat’l Grasmarket Street

Records are duplicates of the same hotel?

Page 24

33

String Edit Distance

Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.

Applications
– Database Record Deduplication
– Biological Sequences

AGCTCTTACGATAGAGGACTCCAGA

AGGTCTTACCAAAGAGGACTTCAGA

Page 25

34

String Edit Distance

Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.

Applications
– Database Record Deduplication
– Biological Sequences
– Machine Translation

Il a acheté une pomme

He bought an apple

Page 26

35

String Edit Distance

Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.

Applications
– Database Record Deduplication
– Biological Sequences
– Machine Translation
– Textual Entailment

He bought a new car last night

He purchased a brand new automobile yesterday evening

Page 27

36

Levenshtein Distance [1966]

Edit operations
copy    Copy a character from x to y (cost 0)
insert  Insert a character into y (cost 1)
delete  Delete a character from y (cost 1)
subst   Substitute one character for another (cost 1)

Align two strings
x1 = William W. Cohon
x2 = Willleam Cohen

Lowest cost alignment (c = copy, i = insert, d = delete, s = subst; "_" is a space, "-" is a gap)
x1         W  i  l  l  -  i  a  m  _  W  .  _  C  o  h  o  n
x2         W  i  l  l  l  e  a  m  -  -  -  _  C  o  h  e  n
operation  c  c  c  c  i  s  c  c  d  d  d  c  c  c  c  s  c
cost       0  0  0  0  1  1  0  0  1  1  1  0  0  0  0  1  0

Total cost = 6 = Levenshtein Distance

Page 28

37

Levenshtein Distance

Edit operations
copy    Copy a character from x to y (cost 0)
insert  Insert a character into y (cost 1)
delete  Delete a character from y (cost 1)
subst   Substitute one character for another (cost 1)

Dynamic program

D(i,j) = score of best alignment from x1 ... xi to y1 ... yj.

$$D(i,j) = \min \begin{cases} D(i-1,j-1) + \delta(x_i \neq y_j) \\ D(i-1,j) + 1 \\ D(i,j-1) + 1 \end{cases}$$

         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  3
   m  7  6  5  4  3  3  3  3  2

[Figure labels on the table: insert, subst]

total cost = distance (bottom-right cell)
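The recurrence above translates directly into a table-filling routine. The sketch below is a plain textbook implementation with unit costs; the function name is an illustrative choice.

```python
def levenshtein(x, y):
    """Fill D(i, j) row by row and return the bottom-right cell."""
    rows, cols = len(x) + 1, len(y) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of x[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of y[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy or subst
                d[i - 1][j] + 1,                           # delete
                d[i][j - 1] + 1,                           # insert
            )
    return d[-1][-1]

print(levenshtein("William", "Willleam"))
print(levenshtein("William W. Cohon", "Willleam Cohen"))
```

The table has (|x|+1)(|y|+1) cells and each cell looks at three neighbours, so the running time is proportional to the product of the string lengths.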

Page 29

38

Levenshtein Distance with Markov Dependencies

Edit operations, with cost after a previous copy (c), insert (i), delete (d) or subst (s). Learn these costs from training data.

                                                c    i    d    s
copy    Copy a character from x to y            0    0    0    0
insert  Insert a character into y               1    1/2  1    1
delete  Delete a character from y               1    1    1/2  1
subst   Substitute one character for another    1    1    1    1

repeated delete is cheaper

3D DP table

         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  3
   m  7  6  5  4  3  3  3  3  2

[Figure labels on the table: subst, insert, delete, copy]
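One way to picture the 3D table is to add the last edit operation as a third index, so each cell can charge a cost that depends on the operation before it. The sketch below assumes a user-supplied `cost(op, prev_op)` function; the 0.5 discount for a repeated insert or delete is an illustrative value, and forcing "copy" whenever the characters agree is a simplification.

```python
OPS = ("copy", "insert", "delete", "subst")
INF = float("inf")

def markov_edit_distance(x, y, cost):
    """Min-cost alignment where cost(op, prev_op) may depend on the previous op.

    d[i][j][op] is the cheapest way to align x[:i] with y[:j] ending in op.
    prev_op is None for the first operation.
    """
    n, m = len(x), len(y)
    if n == 0 and m == 0:
        return 0
    d = [[{op: INF for op in OPS} for _ in range(m + 1)] for _ in range(n + 1)]

    def best_into(i, j, op):
        # Cheapest cost of arriving at cell (i, j) and then applying op.
        if i == 0 and j == 0:
            return cost(op, None)
        return min(d[i][j][prev] + cost(op, prev) for prev in OPS)

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                op = "copy" if x[i - 1] == y[j - 1] else "subst"
                d[i][j][op] = best_into(i - 1, j - 1, op)
            if j > 0:
                d[i][j]["insert"] = best_into(i, j - 1, "insert")
            if i > 0:
                d[i][j]["delete"] = best_into(i - 1, j, "delete")
    return min(d[n][m].values())

def plain_cost(op, prev_op):
    """Ordinary Levenshtein costs: no dependence on the previous operation."""
    return 0 if op == "copy" else 1

def repeat_discount_cost(op, prev_op):
    """Hypothetical costs: a repeated insert or delete is half price."""
    if op == "copy":
        return 0.0
    if op == prev_op and op in ("insert", "delete"):
        return 0.5
    return 1.0

print(markov_edit_distance("William W. Cohon", "Willleam Cohen", plain_cost))
print(markov_edit_distance("William W. Cohon", "Willleam Cohen", repeat_discount_cost))
```

With `plain_cost` the routine reduces to ordinary Levenshtein distance; with the discount, runs of deletions such as dropping a middle initial become cheaper than the same number of scattered deletions.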

Page 30

39

Ristad & Yianilos (1997)

Essentially a Pair-HMM, generating an edit/state/alignment sequence and two strings

Complete data likelihood:

$$p(a, x_1, x_2) = \prod_t p(a_t \mid a_{t-1}) \, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t)$$

Match score = incomplete data likelihood (sum over all alignments consistent with x1 and x2):

$$p(x_1, x_2) = \sum_{a : x_1, x_2} \prod_t p(a_t \mid a_{t-1}) \, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t)$$

Given training set of matching string pairs, objective fn is

$$O = \prod_j p(x_1^{(j)}, x_2^{(j)})$$

Learn via EM:
Expectation step: Calculate likelihood of alignment paths
Maximization step: Make those paths more likely.

Alignment example (c = copy, i = insert, d = delete, s = subst; "_" is a space, "-" is a gap)
string 1   x1     W  i  l  l  -  i  a  m  _  W  .  _  C  o  h  o  n
alignment  a.i1   1  2  3  4  4  5  6  7  8  9 10 11 12 13 14 15 16
           a.e    c  c  c  c  i  s  c  c  d  d  d  c  c  c  c  s  c
           a.i2   1  2  3  4  5  6  7  8  8  8  8  9 10 11 12 13 14
string 2   x2     W  i  l  l  l  e  a  m  -  -  -  _  C  o  h  e  n

Page 31

40

Ristad & Yianilos Regrets

• Limited features of input strings
– Examine only single character pair at a time
– Difficult to use upcoming string context, lexicons, ...
– Example: “Senator John Green” “John Green”

• Limited edit operations
– Difficult to generate arbitrary jumps in both strings
– Example: “UMass” “University of Massachusetts”.

• Trained only on positive match data
– Doesn’t include information-rich “near misses”
– Example: “ACM SIGIR” ≠ “ACM SIGCHI”

So, consider model trained by conditional probability

Page 32

41

Conditional Probability (Sequence) Models

We prefer a model that is trained to maximize a conditional probability rather than joint probability, P(y|x) instead of P(y,x):

– Can examine features, but not responsible for generating them.

– Don’t have to explicitly model their dependencies.

Page 33

42

From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]

$$\vec{s} = s_1, s_2, \ldots, s_n \qquad \vec{o} = o_1, o_2, \ldots, o_n$$

Joint

[Figure: directed chain ... y_{t-1}, y_t, y_{t+1} ... over observations ... x_{t-1}, x_t, x_{t+1} ...]

$$P(y, x) = \prod_{t=1}^{|x|} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)$$

Conditional

[Figure: chain ... y_{t-1}, y_t, y_{t+1} ... over observations ... x_{t-1}, x_t, x_{t+1} ...]

$$P(y \mid x) = \frac{1}{P(x)} \prod_{t=1}^{|x|} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t) = \frac{1}{Z(x)} \prod_{t=1}^{|x|} \Phi_s(y_t, y_{t-1}) \, \Phi_o(x_t, y_t)$$

where

$$\Phi_o(x_t, y_t) = \exp\left( \sum_k \lambda_k f_k(y_t, x_t) \right)$$

(A super-special case of Conditional Random Fields.)

Set parameters by maximum likelihood, using optimization method on L.

Wide-spread interest, positive experimental results in many applications.

Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ‘04],…
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ‘04]

Linear-chain ^

Page 34

44

CRF String Edit Distance

Alignment example (c = copy, i = insert, d = delete, s = subst; "_" is a space, "-" is a gap)
string 1   x1     W  i  l  l  -  i  a  m  _  W  .  _  C  o  h  o  n
alignment  a.i1   1  2  3  4  4  5  6  7  8  9 10 11 12 13 14 15 16
           a.e    c  c  c  c  i  s  c  c  d  d  d  c  c  c  c  s  c
           a.i2   1  2  3  4  5  6  7  8  8  8  8  9 10 11 12 13 14
string 2   x2     W  i  l  l  l  e  a  m  -  -  -  _  C  o  h  e  n

Joint complete data likelihood:

$$p(a, x_1, x_2) = \prod_t p(a_t \mid a_{t-1}) \, p(x_{1,a_t.i_1}, x_{2,a_t.i_2} \mid a_t)$$

Conditional complete data likelihood:

$$p(a \mid x_1, x_2) = \frac{1}{Z_{x_1,x_2}} \prod_t \Phi(a_t, a_{t-1}, x_1, x_2)$$

Want to train from set of string pairs, each labeled one of {match, non-match}

match      “William W. Cohon”   “Willleam Cohen”
non-match  “Bruce D’Ambrosio”   “Bruce Croft”
match      “Tommi Jaakkola”     “Tommi Jakola”
match      “Stuart Russell”     “Stuart Russel”
non-match  “Tom Dietterich”     “Tom Dean”

Page 35

45

CRF String Edit Distance FSM

[Figure: finite-state machine with states subst, insert, delete, copy]

Page 36

46

CRF String Edit Distance FSM

[Figure: Start state leading to two copies of the edit FSM (states subst, insert, delete, copy): match, m = 1, and non-match, m = 0]

Conditional incomplete data likelihood:

$$p(m \mid x_1, x_2) = \frac{1}{Z_{x_1,x_2}} \sum_{a \in S_m} \prod_t \Phi(a_t, a_{t-1}, x_1, x_2)$$
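The sum over alignments in each branch can be computed with a forward-style dynamic program, one pass per branch. The sketch below is heavily simplified: each operation gets a single hand-set weight per branch (not learned, and not the paper's feature set), and the dependence on the previous operation is dropped, so the only Markov structure kept is that an alignment stays in the branch it started in.

```python
import math

NEG_INF = float("-inf")

def log_add(a, b):
    """log(exp(a) + exp(b)) with -inf handled explicitly."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def log_alignment_sum(x1, x2, weights):
    """log of the sum, over all alignments, of exp(total operation weight)."""
    n, m = len(x1), len(x2)
    alpha = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                op = "copy" if x1[i - 1] == x2[j - 1] else "subst"
                alpha[i][j] = log_add(alpha[i][j], alpha[i - 1][j - 1] + weights[op])
            if i > 0:
                alpha[i][j] = log_add(alpha[i][j], alpha[i - 1][j] + weights["delete"])
            if j > 0:
                alpha[i][j] = log_add(alpha[i][j], alpha[i][j - 1] + weights["insert"])
    return alpha[n][m]

def match_probability(x1, x2, match_weights, nonmatch_weights):
    """p(m = 1 | x1, x2): match-branch mass over the mass of both branches."""
    diff = (log_alignment_sum(x1, x2, match_weights)
            - log_alignment_sum(x1, x2, nonmatch_weights))
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

# Hypothetical weights: the match branch rewards copies and penalizes edits.
MATCH_W = {"copy": 1.0, "subst": -2.0, "insert": -2.0, "delete": -2.0}
NONMATCH_W = {"copy": -1.0, "subst": -1.0, "insert": -1.0, "delete": -1.0}

print(match_probability("Tommi Jaakkola", "Tommi Jakola", MATCH_W, NONMATCH_W))
print(match_probability("Tom Dietterich", "Tom Dean", MATCH_W, NONMATCH_W))
```

The normalizer is simply the sum of the two branch totals, so no separate pass is needed to compute it.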

Page 37

47

CRF String Edit Distance FSM

[Figure: Start state leading to two copies of the edit FSM (states subst, insert, delete, copy): match, m = 1, and non-match, m = 0]

x1 = “Tommi Jaakkola”, x2 = “Tommi Jakola”

Probability summed over all alignments in match states: 0.8

Probability summed over all alignments in non-match states: 0.2

Page 38

48

CRF String Edit Distance FSM

[Figure: Start state leading to two copies of the edit FSM (states subst, insert, delete, copy): match, m = 1, and non-match, m = 0]

x1 = “Tom Dietterich”, x2 = “Tom Dean”

Probability summed over all alignments in match states: 0.1

Probability summed over all alignments in non-match states: 0.9

Page 39

49

Parameter Estimation

Given training set of string pairs and match/non-match labels, objective fn is the incomplete log likelihood

$$O = \sum_j \log p(m^{(j)} \mid x_1^{(j)}, x_2^{(j)})$$

The complete log likelihood

$$\sum_j \sum_a \log\left( p(m^{(j)} \mid a, x_1^{(j)}, x_2^{(j)}) \, p(a \mid x_1^{(j)}, x_2^{(j)}) \right)$$

Expectation Maximization
E-step: Estimate distribution over alignments, $p(a \mid x_1^{(j)}, x_2^{(j)})$, using current parameters
M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS)

This is “conditional EM”, but avoids complexities of [Jebara 1998], because no need to solve M-step in closed form.

Page 40

50

Efficient Training

Dynamic programming table is 3D; |x1| = |x2| = 100, |S| = 12, ... 120,000 entries

Use beam search during E-step [Pal, Sutton, McCallum 2005]

Unlike completely observed CRFs, objective function is not convex.

Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

Page 41

51

What Alignments are Learned?

x1 = “Tommi Jaakkola”, x2 = “Tommi Jakola”

[Figure: alignment grid of “Tommi Jaakkola” against “Tommi Jakola”, beside the two-branch FSM (Start; match, m = 1; non-match, m = 0; states subst, insert, delete, copy)]

Page 42

52

What Alignments are Learned?

x1 = “Bruce Croft”, x2 = “Tom Dean”

[Figure: alignment grid of “Bruce Croft” against “Tom Dean”, beside the two-branch FSM (Start; match, m = 1; non-match, m = 0; states subst, insert, delete, copy)]

Page 43

53

What Alignments are Learned?

x1 = “Jaime Carbonell”, x2 = “Jamie Callan”

[Figure: alignment grid of “Jaime Carbonell” against “Jamie Callan”, beside the two-branch FSM (Start; match, m = 1; non-match, m = 0; states subst, insert, delete, copy)]

Page 44

55

Summary of Advantages

• Arbitrary features of the input strings
– Examine past, future context
– Use lexicons, WordNet

• Extremely flexible edit operations
– Single operation may make arbitrary jumps in both strings, of size determined by input features

• Discriminative Training
– Maximize ability to predict match vs non-match

Page 45

56

Experimental Results: Data Sets

• Restaurant name, Restaurant address
– 864 records, 112 matches
– E.g. “Abe’s Bar & Grill, E. Main St” “Abe’s Grill, East Main Street”

• People names, UIS DB generator
– synthetic noise
– E.g. “John Smith” vs “Snith, John”

• CiteSeer Citations
– In four sections: Reason, Face, Reinforce, Constraint
– E.g. “Rusell & Norvig, “Artificial Intelligence: A Modern...” “Russell & Norvig, “Artificial Intelligence: An Intro...”

Page 46

57

Experimental Results: Features

• same, different
• same-alphabetic, different-alphabetic
• same-numeric, different-numeric
• punctuation1, punctuation2
• alphabet-mismatch, numeric-mismatch
• end-of-1, end-of-2
• same-next-character, different-next-character

Page 47

58

Experimental Results: Edit Operations

• insert, delete, substitute/copy
• swap-two-characters
• skip-word-if-in-lexicon
• skip-parenthesized-words
• skip-any-word
• substitute-word-pairs-in-translation-lexicon
• skip-word-if-present-in-other-string

Page 48

59

Experimental Results

F1 (harmonic mean of precision and recall)

Distance metric    Restaurant   Restaurant   CiteSeer
                   name         address      Reason   Face    Reinf   Constraint
Levenshtein        0.290        0.686        0.927    0.952   0.893   0.924
Learned Leven.     0.354        0.712        0.938    0.966   0.907   0.941
Vector             0.365        0.380        0.897    0.922   0.903   0.923
Learned Vector     0.433        0.532        0.924    0.875   0.808   0.913

[Bilenko & Mooney 2003]

Page 49

60

Experimental Results

F1 (harmonic mean of precision and recall)

Distance metric     Restaurant   Restaurant   CiteSeer
                    name         address      Reason   Face    Reinf   Constraint
Levenshtein         0.290        0.686        0.927    0.952   0.893   0.924
Learned Leven.      0.354        0.712        0.938    0.966   0.907   0.941
Vector              0.365        0.380        0.897    0.922   0.903   0.923
Learned Vector      0.433        0.532        0.924    0.875   0.808   0.913
CRF Edit Distance   0.448        0.783        0.964    0.918   0.917   0.976

[Bilenko & Mooney 2003] (first four rows)
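For reference, F1 is conventionally the harmonic mean of precision and recall, which is pulled toward the smaller of the two. A minimal helper, with an illustrative function name and made-up inputs:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.6))
```

Unlike an arithmetic average, this score stays low when either precision or recall is low.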

Page 50

61

Experimental Results

Data set: person names, with word-order noise added

F1      Edit operations
0.856   Without skip-if-present-in-other-string
0.981   With skip-if-present-in-other-string

Page 51

63

Joint Co-reference Decisions, Discriminative Model

[Culotta & McCallum 2005]

[Figure: People mentions “Stuart Russell”, “Stuart Russell”, “S. Russel”, with Y/N co-reference decisions between mention pairs]

Page 52

64

Co-reference for Multiple Entity Types

[Culotta & McCallum 2005]

[Figure: People mentions “Stuart Russell”, “Stuart Russell”, “S. Russel”; Organizations mentions “University of California at Berkeley”, “Berkeley”, “Berkeley”; Y/N co-reference decisions between mention pairs]

Page 53

65

Joint Co-reference of Multiple Entity Types

[Culotta & McCallum 2005]

[Figure: People mentions “Stuart Russell”, “Stuart Russell”, “S. Russel”; Organizations mentions “University of California at Berkeley”, “Berkeley”, “Berkeley”; Y/N co-reference decisions between mention pairs]

Reduces error by 22%

Page 54

68

Social network from my email

Page 55

69

Clustering words into topics with Latent Dirichlet Allocation

[Blei, Ng, Jordan 2003]

Generative Process:

For each document:
  Sample a distribution over topics, θ
  For each word in doc:
    Sample a topic, z
    Sample a word from the topic, w

Example:
  Distribution over topics: 70% Iraq war, 30% US election
  Topic: Iraq war
  Word: “bombing”
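The generative process for a single document can be sketched with the standard library alone. The two topics, the four-word vocabulary, and the Dirichlet parameters below are invented for illustration; a Dirichlet draw is obtained by normalizing independent gamma draws.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    theta = sample_dirichlet(alpha, rng)         # distribution over topics
    topics, words = [], []
    for _ in range(n_words):
        z = sample_index(theta, rng)             # topic for this word
        w = sample_index(topic_word_probs[z], rng)
        topics.append(z)
        words.append(vocab[w])                   # word drawn from topic z
    return theta, topics, words

# Hypothetical vocabulary and topics: 0 = "Iraq war", 1 = "US election".
VOCAB = ["bombing", "troops", "ballot", "campaign"]
TOPIC_WORD_PROBS = [
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

rng = random.Random(0)
theta, topics, words = generate_document(
    TOPIC_WORD_PROBS, VOCAB, alpha=[1.0, 1.0], n_words=20, rng=rng
)
print([round(p, 2) for p in theta])
print(words)
```

Inference runs this process in reverse: given only the words, it recovers plausible values of θ and of the per-word topics z.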

Page 56

70

Example topics induced from a large collection of text

[Tenenbaum et al]

STORY, STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL

MIND, WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE

WATER, FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER

DISEASE, BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN

FIELD, MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED

SCIENCE, STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES

BALL, GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY

JOB, WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE


Page 58

72

From LDA to Author-Recipient-Topic (ART)

Page 59

73

Inference and Estimation

Gibbs Sampling:
- Easy to implement
- Reasonably fast

Page 60

74

Outline

Email, motivation

ART Graphical Model.

Experimental Results

– Enron Email (corpus)

– Academic Email (one person)

RART: Roles for ART

Group-Topic Model

– Experiments on voting data

– Voting data from U.S. Senate and the U.N.

75

Enron Email Corpus

250k email messages
23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas
[email protected]

76

Topics, and prominent senders / receivers discovered by ART

Topic names, by hand

77

Topics, and prominent senders / receivers discovered by ART

Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”

78

Comparing Role Discovery

connection strength (A,B) =

distribution over authored topics

Traditional SNA

distribution over recipients

distribution over authored topics

Author-Topic

ART
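One way to turn per-person distributions like these into a role-similarity score is to measure a divergence between two people's distributions. The sketch below uses Jensen-Shannon divergence, which is an assumption here since the slide does not name the measure; the function name `js_divergence` and the toy distributions are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    p and q are equal-length lists of probabilities that each sum to 1.
    The result is 0 for identical distributions and at most log(2).
    """
    # Midpoint distribution between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        # KL divergence, skipping zero-probability terms of u.
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy usage: hypothetical distributions over three topics for two people.
person_a = [0.7, 0.2, 0.1]
person_b = [0.1, 0.2, 0.7]
role_distance = js_divergence(person_a, person_b)
```

A smaller value means the two people look more alike under the chosen representation. The same function applies whether the inputs are distributions over recipients or over topics.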

79

Comparing Role Discovery: Tracy Geaconne and Dan McCarty

Traditional SNA, Author-Topic, ART

Similar roles, Different roles, Different roles

Geaconne = “Secretary”
McCarty = “Vice President”

80

Comparing Role Discovery: Tracy Geaconne and Rod Hayslett

Traditional SNA, Author-Topic, ART

Different roles, Very similar, Not very similar

Geaconne = “Secretary”
Hayslett = “Vice President & CTO”

81

Comparing Role Discovery: Lynn Blair and Kimberly Watson

Traditional SNA, Author-Topic, ART

Different roles, Very different, Very similar

Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”

82

McCallum Email Corpus 2004

January - October 2004
23k email messages
825 people

From: [email protected]
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: [email protected]

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.
CALO registration receipt.

Thanks,
Kate

83

McCallum Email Blockstructure

84

Four most prominent topics in discussions with ____?

85

86

Two most prominent topics in discussions with ____?

Words Prob
love 0.030514
house 0.015402
0.013659
time 0.012351
great 0.011334
hope 0.011043
dinner 0.00959
saturday 0.009154
left 0.009154
ll 0.009009
0.008282
visit 0.008137
evening 0.008137
stay 0.007847
bring 0.007701
weekend 0.007411
road 0.00712
sunday 0.006829
kids 0.006539
flight 0.006539

Words Prob
today 0.051152
tomorrow 0.045393
time 0.041289
ll 0.039145
meeting 0.033877
week 0.025484
talk 0.024626
meet 0.023279
morning 0.022789
monday 0.020767
back 0.019358
call 0.016418
free 0.015621
home 0.013967
won 0.013783
day 0.01311
hope 0.012987
leave 0.012987
office 0.012742
tuesday 0.012558

89

90

Outline

Email, motivation

ART Graphical Model.

Experimental Results

– Enron Email (corpus)

– Academic Email (one person)

RART: Roles for ART

Group-Topic Model

– Experiments on voting data

– Voting data from U.S. Senate and the U.N.

91

Role-Author-Recipient-Topic Models

92

Results with RART: People in “Role #3” in Academic Email

olc: lead Linux sysadmin
gauthier: sysadmin for CIIR group
irsystem: mailing list CIIR sysadmins
system: mailing list for dept. sysadmins
allan: Prof., chair of “computing committee”
valerie: second Linux sysadmin
tech: mailing list for dept. hardware
steve: head of dept. I.T. support

93

Roles for allan (James Allan)

Role #3 I.T. support
Role #2 Natural Language researcher

Roles for pereira (Fernando Pereira)

Role #2 Natural Language researcher
Role #4 SRI CALO project participant
Role #6 Grant proposal writer
Role #10 Grant proposal coordinator
Role #8 Guests at McCallum’s house

94

Outline

Email, motivation

ART Graphical Model.

Experimental Results

– Enron Email (corpus)

– Academic Email (one person)

RART: Roles for ART

Group-Topic Model

– Experiments on voting data

– Voting data from U.S. Senate and the U.N.

95

ART & RART: Roles but not Groups

Traditional SNA, Author-Topic, ART

Block structured, Not, Not

Enron TransWestern Division

96

A Group Model: “Stochastic Blockstructures Model”

97

Group-Topic Model

[Wang, Mohanty, McCallum 2005]

98

U.S. Senate Data sets

3426 bills from 16 years of voting records from the U.S. Senate

Yea / Nay / Abstain (absent)

Each bill comes with an abstract (text describing the contents of the bill).

99

Topics Discovered

Traditional “Mixtures of Unigrams”

Group-Topic Model

100

Groups Discovered

Groups from topic Education + Domestic

Agreement Index

101

Senators who change Coalition Dependent on Topic

e.g. Senator Shelby (D-AL) votes
- with the Republicans on Economic
- with the Democrats on Education + Domestic
- with a small group of maverick Republicans on Social Security + Medicaid

102

U.N. Data Set

931 U.N. Resolutions, voted on by 192 countries, from 1990-2003.

Yes / No / Abstain votes

List of keywords summarizes the content of the resolution.

Also experiments later with resolutions from 1960-2003

103

Topics Discovered

Traditional mixture of unigrams

Group-Topic Model

104

Groups Discovered

105

Groups and Topics, Trends over Time