Interactive Information Extraction and Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory, UMass Amherst
Dec 31, 2015
Motivation
• Capture confidence of records in extracted database
• Alert data mining to possible errors in the database

  First Name   Last Name   Confidence
  Bill         Gates       0.96
  Bill         banks       0.43
Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]
[Figure: lattice of FSM states over the observations; output sequence y_{t-1} ... y_{t+3} above input sequence x_{t-1} ... x_{t+3}, with states OTHER, TITLE, ORG, PERSON and the input "said Arden Bement NSF Director ...".]

p(y | x) = (1/Z(x)) ∏_{t=1}^{T} Φ_y(y_t, y_{t-1}) Φ_{xy}(x_t, y_t)
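The linear-chain CRF probability above can be sketched in code. This is a toy model, not the talk's trained extractor: the labels, potential functions, and numbers below are illustrative assumptions, and Z(x) is computed by brute-force enumeration rather than the forward algorithm.

```python
import math
import itertools

# Toy linear-chain CRF: p(y|x) = (1/Z(x)) * prod_t Phi_y(y_t, y_{t-1}) * Phi_xy(x_t, y_t)
LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]

def phi_y(y_t, y_prev):
    # Transition potential: mild preference for staying in the same label (toy values).
    return 2.0 if y_t == y_prev else 1.0

def phi_xy(x_t, y_t):
    # Observation potential: capitalized tokens prefer PERSON/ORG (toy rule).
    if x_t[0].isupper():
        return 3.0 if y_t in ("PERSON", "ORG") else 1.0
    return 2.0 if y_t == "OTHER" else 1.0

def score(x, y):
    # Unnormalized product of potentials along one labeling y.
    s = phi_xy(x[0], y[0])
    for t in range(1, len(x)):
        s *= phi_y(y[t], y[t - 1]) * phi_xy(x[t], y[t])
    return s

def partition(x):
    # Z(x) by brute force (fine at toy sizes; the forward algorithm in general).
    return sum(score(x, y) for y in itertools.product(LABELS, repeat=len(x)))

def prob(x, y):
    return score(x, y) / partition(x)

x = ["said", "Arden", "Bement", "NSF", "Director"]
p = prob(x, ["OTHER", "PERSON", "PERSON", "ORG", "TITLE"])
```

Because Z(x) sums the same product over all labelings, the probabilities of all 4^5 labelings sum to one.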
Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]

Constrained Forward-Backward:

[Figure: the same finite-state lattice, with the paths for "Arden Bement" constrained to PERSON states.]

p(Arden Bement = PERSON | x) = (1/Z(x)) Σ_{y∈C} ∏_{t=1}^{T} Φ_y(y_t, y_{t-1}) Φ_{xy}(x_t, y_t)

where C is the set of label sequences that assign PERSON to "Arden Bement".
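Constrained forward-backward can be sketched by running the forward recursion twice: once over the full lattice (giving Z(x)) and once over a lattice restricted to the constraint set C. All potentials below are illustrative stand-ins, not learned values.

```python
# Confidence of a field as a constrained/unconstrained ratio of partition functions.
LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]

def phi_y(y_t, y_prev):
    return 2.0 if y_t == y_prev else 1.0   # toy transition potential

def phi_xy(x_t, y_t):
    if x_t[0].isupper():                   # toy observation potential
        return 3.0 if y_t in ("PERSON", "ORG") else 1.0
    return 2.0 if y_t == "OTHER" else 1.0

def constrained_z(x, allowed):
    # Forward pass summing over only the labels permitted at each position.
    alpha = {y: phi_xy(x[0], y) for y in allowed[0]}
    for t in range(1, len(x)):
        alpha = {y: sum(a * phi_y(y, yp) for yp, a in alpha.items()) * phi_xy(x[t], y)
                 for y in allowed[t]}
    return sum(alpha.values())

def field_confidence(x, span, label):
    # p(all tokens in `span` have `label` | x) = Z_constrained / Z.
    free = [set(LABELS)] * len(x)
    constrained = [({label} if i in span else set(LABELS)) for i in range(len(x))]
    return constrained_z(x, constrained) / constrained_z(x, free)

x = ["said", "Arden", "Bement", "NSF", "Director"]
conf = field_confidence(x, {1, 2}, "PERSON")   # p(Arden Bement = PERSON | x)
```

A sanity check: constraining a single position to each label in turn must give marginals that sum to one.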
Forward-Backward Confidence Estimation improves accuracy/coverage.

[Figure: accuracy-coverage curves comparing (a) optimal, (b) our forward-backward confidence, (c) traditional token-wise confidence, (d) no use of confidence.]
Application of Confidence Estimation

Interactive Information Extraction:
– To correct predictions, direct the user to the least confident field.
Interactive Information Extraction
• IE algorithm calculates confidence scores.
• UI uses confidence scores to alert the user to possible errors.
• IE algorithm takes corrections into account and propagates corrections to other fields.
User Correction

User corrects a field, e.g. dragging "Stanley" to the First Name field.

[Figure: fields First Name / Last Name / Address Line over the tokens "Charles Stanley 100 Charles Street", with labels y1...y5 and observations x1...x5.]
Remove Paths

User corrects a field, e.g. dragging "Stanley" to the First Name field; lattice paths inconsistent with the correction are removed.

[Figure: the same lattice over "Charles Stanley 100 Charles Street" (y1...y5, x1...x5), with disallowed paths pruned.]
Constrained Viterbi

The Viterbi algorithm is constrained to pass through the designated state.

[Figure: the lattice over "Charles Stanley 100 Charles Street" (y1...y5, x1...x5); an adjacent field changes as well: correction propagation.]
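A minimal sketch of constrained Viterbi, assuming toy hand-set potentials rather than the trained model: the user's correction becomes a per-position label constraint, and decoding re-runs over the constrained lattice, which is how a correction can propagate to adjacent fields.

```python
# Constrained Viterbi: decode the best label sequence subject to user-fixed labels.
LABELS = ["FirstName", "LastName", "AddressLine"]

def phi_y(y_t, y_prev):
    # Toy transitions: names precede addresses (illustrative numbers, not learned).
    good = {("FirstName", "LastName"), ("LastName", "AddressLine"),
            ("AddressLine", "AddressLine"), ("FirstName", "FirstName")}
    return 2.0 if (y_prev, y_t) in good else 0.5

def phi_xy(x_t, y_t):
    # Toy observation potential: digits look like an address line.
    if x_t[0].isdigit():
        return 4.0 if y_t == "AddressLine" else 0.5
    return 1.0

def constrained_viterbi(x, fixed):
    # fixed: {position: required_label}, e.g. the user dragging a token to a field.
    allowed = [[fixed[t]] if t in fixed else LABELS for t in range(len(x))]
    delta = {y: (phi_xy(x[0], y), [y]) for y in allowed[0]}
    for t in range(1, len(x)):
        new = {}
        for y in allowed[t]:
            # Best predecessor for label y at position t.
            yp, (s, path) = max(delta.items(),
                                key=lambda kv: kv[1][0] * phi_y(y, kv[0]))
            new[y] = (s * phi_y(y, yp) * phi_xy(x[t], y), path + [y])
        delta = new
    return max(delta.values())[1]

x = ["Charles", "Stanley", "100", "Charles", "Street"]
free = constrained_viterbi(x, {})
corrected = constrained_viterbi(x, {1: "FirstName"})  # user: "Stanley" is a first name
```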
Constrained Viterbi

After fixing the least confident field, constrained Viterbi automatically reduces error by another 23%.

Recent work reduces annotation effort further: it simplifies annotation to multiple-choice, e.g.

     First Name   Last Name   City
  A) Bill         Gates       Redmond WA
  B) Bill         Gates       Redmond
User feedback “in the wild” as labeling

Labeling for Classification
  Easy: often found in user interfaces, e.g. CALO IRIS, Apple Mail.
  Example document:
    Seminar: How to Organize your Life
    by Jane Smith, Stevenson & Smith
    Mezzanine Level, Papadapoulos Sq
    3:30 pm Thursday March 31
    In this seminar we will learn how to use CALO to...
  Label as one of: Seminar announcement / Todo request / Other

Labeling for Extraction
  Painful: difficult even for paid labelers; complex tools.
  The same announcement must be labeled by: click, drag, adjust, label; click, drag, adjust, label; ...
Multiple-choice Annotation for Learning Extractors “in the wild” [Culotta, McCallum 2005]

Task: Information extraction. Fields: NAME, COMPANY, ADDRESS (and others).

  Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.

Interface presents the top hypothesized segmentations of this string.
[Figure: three alternative segmentations of the line above, with field boundaries highlighted.]
The user corrects labels, not segmentations.
Multiple-choice Annotation for Learning Extractors “in the wild” [Culotta, McCallum 2005]

Result: 29% reduction in user actions needed to train.
Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum 2005]

Too little labeled training data: 60k words of training for emailed seminar announcement entities (over email English words).

Example announcement:
  GRAND CHALLENGES FOR MACHINE LEARNING
  Jaime Carbonell, School of Computer Science, Carnegie Mellon University
  3:30 pm, 7500 Wean Hall
  Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum 2005]

Train on a “related” task with more data: 200k words of training for newswire named entities (over newswire English words).

Example:
  CRICKET - MILLNS SIGNS FOR BOLAND
  CAPE TOWN 1996-08-22
  South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum 2005]

At test time, label email with newswire named entities (the newswire NE layer over email English words)...
Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum 2005]

...then use these labels as features for the final task (emailed seminar announcement entities on top of newswire named entities, over email English words).
Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum 2005]

Use joint inference at test time (newswire named entities + seminar announcement entities over English words).

An alternative to hierarchical Bayes; needn't know anything about the parameterization of the subtask.

Accuracy: No Transfer < Cascaded Transfer < Joint Inference Transfer
A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

Andrew McCallum, Kedar Bellare, Fernando Pereira
Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.
String Edit Distance

Distance between sequences x and y: the “cost” of the lowest-cost sequence of edit operations that transforms string x into y.

Applications
– Database record deduplication:
    Apex International Hotel, Grassmarket Street
    Apex Internat’l, Grasmarket Street
    Are these records duplicates of the same hotel?
– Biological sequences:
    AGCTCTTACGATAGAGGACTCCAGA
    AGGTCTTACCAAAGAGGACTTCAGA
– Machine translation:
    Il a acheté une pomme
    He bought an apple
– Textual entailment:
    He bought a new car last night
    He purchased a brand new automobile yesterday evening
Levenshtein Distance [Levenshtein 1966]

Edit operations
  copy    Copy a character from x to y           (cost 0)
  insert  Insert a character into y              (cost 1)
  delete  Delete a character from y              (cost 1)
  subst   Substitute one character for another   (cost 1)

Align two strings: x1 = "William W. Cohon", x2 = "Willleam Cohen".

[Figure: lowest-cost alignment of the two strings, one copy/insert/delete/subst operation per position, with per-operation costs 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0.]

Total cost = 6 = Levenshtein distance.
Levenshtein Distance

Edit operations
  copy    Copy a character from x to y           (cost 0)
  insert  Insert a character into y              (cost 1)
  delete  Delete a character from y              (cost 1)
  subst   Substitute one character for another   (cost 1)

Dynamic program: D(i,j) = score of the best alignment of x1...xi with y1...yj.

  D(i,j) = min of:
    D(i-1, j-1) + 1[xi ≠ yj]   (subst/copy)
    D(i-1, j)   + 1            (delete)
    D(i, j-1)   + 1            (insert)

The bottom-right entry is the total cost = distance.

[Figure: DP table for x = "William" (rows) against y = "Willleam" (columns); the final entry is 2.]
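The dynamic program above is short enough to state exactly; this is the standard Levenshtein recurrence, run on the slide's example pair.

```python
def levenshtein(x, y):
    # D(i,j) = min(D(i-1,j-1) + [x_i != y_j], D(i-1,j) + 1, D(i,j-1) + 1)
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        D[i][0] = i          # i deletions to reach the empty string
    for j in range(1, len(y) + 1):
        D[0][j] = j          # j insertions from the empty string
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = min(D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # subst/copy
                          D[i - 1][j] + 1,                            # delete
                          D[i][j - 1] + 1)                            # insert
    return D[len(x)][len(y)]
```

On the slide's pair the total cost is 6, matching the alignment shown earlier.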
Levenshtein Distance with Markov Dependencies

Edit operation costs now depend on the previous operation (cost after copy / insert / delete / subst); e.g. a repeated delete is cheaper. The DP table becomes 3D. Learn these costs from training data.

[Figure: per-operation cost table conditioned on the previous operation (copy cost 0 after every operation; insert/delete/subst costs vary), alongside the DP table for "William" vs. "Willleam".]
Ristad & Yianilos (1997)

Essentially a pair-HMM, generating an edit/state/alignment sequence and two strings.

Complete data likelihood:
  p(a, x1, x2) = ∏_t p(a_t | a_{t-1}) p(x1_{a_t.i1}, x2_{a_t.i2} | a_t)

[Figure: alignment of x1 = "William W. Cohon" (string 1) with x2 = "Willleam Cohen" (string 2) through an operation sequence (the alignment a), with index pointers
  a.i1 = 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
  a.i2 = 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14]

Incomplete data likelihood (sum over all alignments consistent with x1 and x2):
  p(x1, x2) = Σ_{a:x1,x2} ∏_t p(a_t | a_{t-1}) p(x1_{a_t.i1}, x2_{a_t.i2} | a_t)

Learn via EM:
  E-step: calculate the likelihood of alignment paths.
  M-step: make those paths more likely.

Given a training set of matching string pairs, the objective function is the match score
  O = ∏_j p(x1^(j), x2^(j))
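A sketch of the incomplete-data likelihood p(x1, x2) as a sum over alignments, computed by dynamic programming. For brevity this drops the Markov dependence p(a_t | a_{t-1}) of the actual Ristad & Yianilos model (making it memoryless), and the operation probabilities are made-up illustrative values.

```python
import string

# Memoryless simplification of a stochastic edit distance:
# p(x1, x2) = sum over alignments of the product of per-operation probabilities.
ALPHABET = string.ascii_lowercase + " "
P_COPY, P_SUB, P_INS, P_DEL, P_STOP = 0.80, 0.05, 0.05, 0.05, 0.05  # toy values

def p_op(c1, c2):
    if c1 is None:        # insertion into x2
        return P_INS / len(ALPHABET)
    if c2 is None:        # deletion from x1
        return P_DEL / len(ALPHABET)
    if c1 == c2:          # copy
        return P_COPY / len(ALPHABET)
    return P_SUB / (len(ALPHABET) * (len(ALPHABET) - 1))  # substitution

def likelihood(x1, x2):
    # alpha[i][j] = total probability of all alignments of x1[:i] with x2[:j].
    alpha = [[0.0] * (len(x2) + 1) for _ in range(len(x1) + 1)]
    alpha[0][0] = 1.0
    for i in range(len(x1) + 1):
        for j in range(len(x2) + 1):
            if i > 0 and j > 0:
                alpha[i][j] += alpha[i - 1][j - 1] * p_op(x1[i - 1], x2[j - 1])
            if i > 0:
                alpha[i][j] += alpha[i - 1][j] * p_op(x1[i - 1], None)
            if j > 0:
                alpha[i][j] += alpha[i][j - 1] * p_op(None, x2[j - 1])
    return alpha[len(x1)][len(x2)] * P_STOP

# Near-duplicates should receive higher likelihood than unrelated strings:
m = likelihood("tommi jaakkola", "tommi jakola")
n = likelihood("tommi jaakkola", "bruce croft")
```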
Ristad & Yianilos Regrets

• Limited features of input strings
  – Examine only a single character pair at a time
  – Difficult to use upcoming string context, lexicons, ...
  – Example: “Senator John Green” / “John Green”
• Limited edit operations
  – Difficult to generate arbitrary jumps in both strings
  – Example: “UMass” / “University of Massachusetts”
• Trained only on positive match data
  – Doesn't include information-rich “near misses”
  – Example: “ACM SIGIR” ≠ “ACM SIGCHI”

So, consider a model trained by conditional probability.
Conditional Probability (Sequence) Models

We prefer a model trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x).
– It can examine features, but is not responsible for generating them.
– We don't have to explicitly model their dependencies.
From HMMs to Linear-chain Conditional Random Fields [Lafferty, McCallum, Pereira 2001]

Joint (HMM), over states s = s1, s2, ... sn and observations o = o1, o2, ... on:

  P(y, x) = ∏_{t=1}^{|x|} P(y_t | y_{t-1}) P(x_t | y_t)

Conditional:

  P(y | x) = (1/P(x)) ∏_{t=1}^{|x|} P(y_t | y_{t-1}) P(x_t | y_t)
           = (1/Z(x)) ∏_{t=1}^{|x|} Φ_s(y_t, y_{t-1}) Φ_o(x_t, y_t)

  where Φ_o(x_t, y_t) = exp( Σ_k λ_k f_k(y_t, x_t) )

(A super-special case of Conditional Random Fields.)

Set parameters by maximum likelihood, using an optimization method on L.

Wide-spread interest, positive experimental results in many applications:
  Noun phrase, Named entity [HLT'03], [CoNLL'03]
  Protein structure prediction [ICML'04]
  IE from Bioinformatics text [Bioinformatics '04], ...
  Asian word segmentation [COLING'04], [ACL'04]
  IE from Research papers [HLT'04]
  Object classification in images [CVPR '04]
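The log-linear potential Φ_o above can be sketched directly; the feature templates and weights here are invented for illustration, not a trained model.

```python
import math

# Phi_o(x_t, y_t) = exp( sum_k lambda_k f_k(y_t, x_t) ), with toy binary features.
def features(y_t, x_t):
    # Hypothetical feature templates, conjoined with the label.
    return {
        f"capitalized&{y_t}": float(x_t[0].isupper()),
        f"word={x_t.lower()}&{y_t}": 1.0,
        f"suffix=-or&{y_t}": float(x_t.lower().endswith("or")),
    }

# Hand-set weights; any feature not listed has weight 0.
weights = {"capitalized&PERSON": 1.5,
           "word=director&TITLE": 2.0,
           "suffix=-or&TITLE": 0.7}

def phi_o(x_t, y_t):
    return math.exp(sum(weights.get(k, 0.0) * v
                        for k, v in features(y_t, x_t).items()))
```

With these toy weights, "Director" scores higher under TITLE than under a label with no active weighted features.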
CRF String Edit Distance

[Figure: alignment of x1 = "William W. Cohon" (string 1) with x2 = "Willleam Cohen" (string 2) through an operation sequence (the alignment), with index pointers
  a.i1 = 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
  a.i2 = 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14]

Joint complete data likelihood (Ristad & Yianilos):
  p(a, x1, x2) = ∏_t p(a_t | a_{t-1}) p(x1_{a_t.i1}, x2_{a_t.i2} | a_t)

Conditional complete data likelihood (CRF):
  p(a | x1, x2) = (1/Z_{x1,x2}) ∏_t Φ(a_t, a_{t-1}, x1, x2)

Want to train from a set of string pairs, each labeled one of {match, non-match}:
  match      “William W. Cohon”   “Willlleam Cohen”
  non-match  “Bruce D'Ambrosio”   “Bruce Croft”
  match      “Tommi Jaakkola”     “Tommi Jakola”
  match      “Stuart Russell”     “Stuart Russel”
  non-match  “Tom Dietterich”     “Tom Dean”
CRF String Edit Distance FSM

[Figure: FSM with a Start state branching into a “match” sub-machine (m = 1) and a “non-match” sub-machine (m = 0), each containing copy, subst, insert, and delete states.]

Conditional incomplete data likelihood:
  p(m | x1, x2) = (1/Z_{x1,x2}) Σ_{a∈S_m} ∏_t Φ(a_t, a_{t-1}, x1, x2)

where S_m is the set of alignment paths through the sub-machine for label m.
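The match/non-match decision can be sketched as two alignment lattices whose summed (unnormalized) scores are compared: p(match) = Z_match / (Z_match + Z_non-match). This simplification drops the per-state Markov structure of the FSM above, and the potential weights are hand-set for illustration rather than learned by EM.

```python
import math

# p(m | x1, x2) proportional to the summed alignment scores in sub-lattice m.
def edit_z(x1, x2, w_copy, w_sub, w_indel):
    # Sum of exp(score) over all alignments, by dynamic programming.
    alpha = [[0.0] * (len(x2) + 1) for _ in range(len(x1) + 1)]
    alpha[0][0] = 1.0
    for i in range(len(x1) + 1):
        for j in range(len(x2) + 1):
            if i > 0 and j > 0:
                w = w_copy if x1[i - 1] == x2[j - 1] else w_sub
                alpha[i][j] += alpha[i - 1][j - 1] * math.exp(w)
            if i > 0:
                alpha[i][j] += alpha[i - 1][j] * math.exp(w_indel)
            if j > 0:
                alpha[i][j] += alpha[i][j - 1] * math.exp(w_indel)
    return alpha[len(x1)][len(x2)]

def p_match(x1, x2):
    # Match sub-lattice rewards copies; non-match sub-lattice is indifferent.
    z_match = edit_z(x1, x2, w_copy=1.0, w_sub=-2.0, w_indel=-2.0)
    z_nonmatch = edit_z(x1, x2, w_copy=-1.0, w_sub=-1.0, w_indel=-1.0)
    return z_match / (z_match + z_nonmatch)
```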
CRF String Edit Distance FSM

x1 = “Tommi Jaakkola”, x2 = “Tommi Jakola”
  Probability summed over all alignments in match states: 0.8
  Probability summed over all alignments in non-match states: 0.2
CRF String Edit Distance FSM

x1 = “Tom Dietterich”, x2 = “Tom Dean”
  Probability summed over all alignments in match states: 0.1
  Probability summed over all alignments in non-match states: 0.9
Parameter Estimation

Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log likelihood:
  O = Σ_j log p(m^(j) | x1^(j), x2^(j))

Expectation Maximization:
  E-step: estimate the distribution over alignments, p(a | x1^(j), x2^(j)), using current parameters.
  M-step: change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS).

The complete log likelihood:
  Σ_j Σ_a log( p(m^(j) | a, x1^(j), x2^(j)) p(a | x1^(j), x2^(j)) )

This is “conditional EM”, but it avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.
Efficient Training

• The dynamic programming table is 3D; with |x1| = |x2| = 100 and |S| = 12, roughly 120,000 entries.
• Use beam search during the E-step [Pal, Sutton, McCallum 2005].
• Unlike completely observed CRFs, the objective function is not convex.
• Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.
What Alignments are Learned?

x1 = “Tommi Jaakkola”, x2 = “Tommi Jakola”
[Figure: the learned alignment of the two strings through the match/non-match FSM.]
What Alignments are Learned?

x1 = “Bruce Croft”, x2 = “Tom Dean”
[Figure: the learned alignment of the two strings.]
What Alignments are Learned?

x1 = “Jaime Carbonell”, x2 = “Jamie Callan”
[Figure: the learned alignment of the two strings.]
Summary of Advantages

• Arbitrary features of the input strings
  – Examine past and future context
  – Use lexicons, WordNet
• Extremely flexible edit operations
  – A single operation may make arbitrary jumps in both strings, of size determined by input features
• Discriminative training
  – Maximize ability to predict match vs. non-match
Experimental Results: Data Sets

• Restaurant name, Restaurant address
  – 864 records, 112 matches
  – E.g. “Abe's Bar & Grill, E. Main St” / “Abe's Grill, East Main Street”
• People names, UIS DB generator
  – Synthetic noise
  – E.g. “John Smith” vs. “Snith, John”
• CiteSeer citations
  – In four sections: Reason, Face, Reinforce, Constraint
  – E.g. “Rusell & Norvig, ‘Artificial Intelligence: A Modern...’” / “Russell & Norvig, ‘Artificial Intelligence: An Intro...’”
Experimental Results: Features

same, different; same-alphabetic, different-alphabetic; same-numeric, different-numeric; punctuation1, punctuation2; alphabet-mismatch, numeric-mismatch; end-of-1, end-of-2; same-next-character, different-next-character
Experimental Results: Edit Operations

insert, delete, substitute/copy; swap-two-characters; skip-word-if-in-lexicon; skip-parenthesized-words; skip-any-word; substitute-word-pairs-in-translation-lexicon; skip-word-if-present-in-other-string
Experimental Results

F1 (harmonic mean of precision and recall); the first four rows are from [Bilenko & Mooney 2003]:

  Distance metric     CiteSeer                             Restaurant   Restaurant
                      Reason  Face   Reinf  Constraint     name         address
  Levenshtein         0.927   0.952  0.893  0.924          0.290        0.686
  Learned Leven.      0.938   0.966  0.907  0.941          0.354        0.712
  Vector              0.897   0.922  0.903  0.923          0.365        0.380
  Learned Vector      0.924   0.875  0.808  0.913          0.433        0.532
  CRF Edit Distance   0.964   0.918  0.917  0.976          0.448        0.783
Experimental Results

Data set: person names, with word-order noise added.

  Without skip-if-present-in-other-string   F1 = 0.856
  With skip-if-present-in-other-string      F1 = 0.981
Joint Co-reference Decisions, Discriminative Model [Culotta & McCallum 2005]

[Figure: pairwise Y/N coreference decisions among the People mentions “Stuart Russell”, “Stuart Russell”, and “S. Russel”.]
Co-reference for Multiple Entity Types [Culotta & McCallum 2005]

[Figure: pairwise Y/N decisions among People mentions (“Stuart Russell”, “Stuart Russell”, “S. Russel”) and Organization mentions (“University of California at Berkeley”, “Berkeley”, “Berkeley”).]
Joint Co-reference of Multiple Entity Types [Culotta & McCallum 2005]

[Figure: the same People and Organization mentions, with coreference decisions made jointly across the two entity types.]

Reduces error by 22%.
Social network from my email

[Figure: social network graph extracted from the author's email.]
Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]

Generative process:
  For each document:
    Sample a distribution over topics.
    For each word in the document:
      Sample a topic, z.
      Sample a word from that topic, w.

Example:
  Document topic mixture: 70% Iraq war, 30% US election
  Topic “Iraq war” generates, e.g., the word “bombing”
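The generative process above can be sketched as follows; the topics, vocabularies, and the fixed topic mixture are toy stand-ins for the slide's "Iraq war" / "US election" example (full LDA would draw the per-document mixture from a Dirichlet prior).

```python
import random

# Generative sketch of LDA: per-document topic mixture, then per-word topic and word.
random.seed(0)
TOPICS = {
    "iraq war":    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},
    "us election": {"ballot": 0.4, "senate": 0.3, "campaign": 0.3},
}

def sample(dist):
    # Draw one item from a {item: probability} dict.
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point rounding

def generate_document(topic_mixture, n_words):
    # For each word: sample a topic z from the mixture, then a word w from topic z.
    doc = []
    for _ in range(n_words):
        z = sample(topic_mixture)
        w = sample(TOPICS[z])
        doc.append(w)
    return doc

doc = generate_document({"iraq war": 0.7, "us election": 0.3}, 10)
```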
Example topics induced from a large collection of text [Tenenbaum et al]:

  STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL

  MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE

  WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER

  DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN

  FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED

  SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES

  BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY

  JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
Outline

• Email, motivation
• ART graphical model
• Experimental results
  – Enron email (corpus)
  – Academic email (one person)
• RART: Roles for ART
• Group-Topic model
  – Experiments on voting data
  – Voting data from U.S. Senate and the U.N.
Enron Email Corpus

• 250k email messages
• 23k people

  Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
  From: [email protected]
  To: [email protected]
  Subject: Enron/TransAlta Contract dated Jan 1, 2001

  Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

  DP

  Debra Perlingiere
  Enron North America Corp.
  Legal Department
  1400 Smith Street, EB 3885
  Houston, Texas
  [email protected]
Topics, and prominent senders/receivers discovered by ART

  Beck = “Chief Operations Officer”
  Dasovich = “Government Relations Executive”
  Shapiro = “Vice President of Regulatory Affairs”
  Steffes = “Vice President of Government Affairs”
Comparing Role Discovery

connection strength (A, B) computed from:
  Traditional SNA: distribution over recipients
  Author-Topic: distribution over authored topics
  ART: distribution over authored topics (recipient-conditioned)
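The connection-strength comparison can be sketched with any distribution similarity; Jensen-Shannon divergence is an assumed choice here (the slides do not specify the measure), and the example distributions are invented.

```python
import math

# Connection strength as similarity of per-person distributions: traditional SNA
# compares distributions over recipients; ART compares distributions over
# authored topics. All numbers below are illustrative.
def jensen_shannon(p, q):
    # Symmetric divergence between two discrete distributions given as dicts.
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a.get(k, 0.0) * math.log(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def connection_strength(p, q):
    # Rescale so 1 = identical distributions, 0 = disjoint supports.
    return 1.0 - jensen_shannon(p, q) / math.log(2)

# SNA view: who A and B send mail to.
recip_a = {"carol": 0.6, "dave": 0.4}
recip_b = {"carol": 0.5, "dave": 0.5}
# ART view: what topics A and B write about.
topics_a = {"legal": 0.8, "logistics": 0.2}
topics_b = {"sports": 0.9, "legal": 0.1}

s_sna = connection_strength(recip_a, recip_b)
s_art = connection_strength(topics_a, topics_b)
```

Two people who mail the same recipients can look strongly connected under SNA while writing about very different topics, which is the distinction the following slides illustrate.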
Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
  Geaconne = “Secretary”; McCarty = “Vice President”
  Traditional SNA: similar roles. Author-Topic: different roles. ART: different roles.
Comparing Role Discovery: Tracy Geaconne vs. Rod Hayslett
  Geaconne = “Secretary”; Hayslett = “Vice President & CTO”
  Traditional SNA: different roles. Author-Topic: very similar. ART: not very similar.
Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
  Blair = “Gas pipeline logistics”; Watson = “Pipeline facilities planning”
  Traditional SNA: different roles. Author-Topic: very different. ART: very similar.
McCallum Email Corpus 2004

• January - October 2004
• 23k email messages
• 825 people

  From: [email protected]
  Subject: NIPS and ....
  Date: June 14, 2004 2:27:41 PM EDT
  To: [email protected]

  There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

  NIPS registration receipt.
  CALO registration receipt.

  Thanks,
  Kate
Two most prominent topics in discussions with ____?

  Topic 1              Topic 2
  Words     Prob       Words      Prob
  love      0.030514   today      0.051152
  house     0.015402   tomorrow   0.045393
            0.013659   time       0.041289
  time      0.012351   ll         0.039145
  great     0.011334   meeting    0.033877
  hope      0.011043   week       0.025484
  dinner    0.00959    talk       0.024626
  saturday  0.009154   meet       0.023279
  left      0.009154   morning    0.022789
  ll        0.009009   monday     0.020767
            0.008282   back       0.019358
  visit     0.008137   call       0.016418
  evening   0.008137   free       0.015621
  stay      0.007847   home       0.013967
  bring     0.007701   won        0.013783
  weekend   0.007411   day        0.01311
  road      0.00712    hope       0.012987
  sunday    0.006829   leave      0.012987
  kids      0.006539   office     0.012742
  flight    0.006539   tuesday    0.012558
Results with RART: People in “Role #3” in Academic Email

  olc       lead Linux sysadmin
  gauthier  sysadmin for CIIR group
  irsystem  mailing list, CIIR sysadmins
  system    mailing list for dept. sysadmins
  allan     Prof., chair of “computing committee”
  valerie   second Linux sysadmin
  tech      mailing list for dept. hardware
  steve     head of dept. I.T. support
Roles for allan (James Allan):
  Role #3   I.T. support
  Role #2   Natural Language researcher

Roles for pereira (Fernando Pereira):
  Role #2   Natural Language researcher
  Role #4   SRI CALO project participant
  Role #6   Grant proposal writer
  Role #10  Grant proposal coordinator
  Role #8   Guests at McCallum's house
ART & RART: Roles but not Groups

  Traditional SNA: block structured. Author-Topic: not. ART: not.

[Figure: Enron TransWestern Division]
U.S. Senate Data Sets

• 3426 bills from 16 years of voting records from the U.S. Senate
• Yea / Nay / Abstain (absent) votes
• Each bill comes with an abstract (text describing the contents of the bill)

Senators who change coalition dependent on topic: e.g. Senator Shelby (D-AL) votes
  with the Republicans on Economic,
  with the Democrats on Education + Domestic,
  with a small group of maverick Republicans on Social Security + Medicaid.
U.N. Data Set

• 931 U.N. resolutions, voted on by 192 countries, from 1990-2003
• Yes / No / Abstain votes
• A list of keywords summarizes the content of each resolution
• Also experiments later with resolutions from 1960-2003