Information Extraction, Conditional Random Fields, and Social Network Analysis Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Aron Culotta, Charles Sutton, Ben Wellner, Khashayar Rohanimanesh, Wei Li, Andres Corrada, Xuerui Wang
Information Extraction, Conditional Random Fields, and Social Network Analysis
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with
Aron Culotta, Charles Sutton, Ben Wellner, Khashayar Rohanimanesh, Wei Li, Andres Corrada, Xuerui Wang
Goal:
Mine actionable knowledge from unstructured text.
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
A Portal for Job Openings
Job Openings: Category = High Tech, Keyword = Java, Location = U.S.
Data Mining the Extracted Job Information
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
Information Extraction = segmentation + classification + clustering + association
As a family of techniques:
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“
Richard Stallman, founder of the Free Software Foundation, countered saying…
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
What is “Information Extraction”
– Joint Segmentation and Co-ref (Iterated Conditional Samples.)
• Two example projects
– Email, contact address book, and Social Network Analysis
– Research Paper search and analysis
Hidden Markov Models
[Figure: an HMM drawn as a finite state model and as a graphical model, with states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}]
Parameters: for all states S = {s1, s2, …}
• Start state probabilities: P(s_t)
• Transition probabilities: P(s_t | s_{t-1})
• Observation (emission) probabilities: P(o_t | s_t)
Training: Maximize probability of training observations (w/ prior)
$$P(\vec{s}, \vec{o}) \propto \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …
Generates:
• State sequence (through transitions)
• Observation sequence o1 o2 o3 o4 o5 o6 o7 o8 (through observations)
Observations: usually a multinomial over an atomic, fixed alphabet
IE with Hidden Markov Models
Given a sequence of observations:
Yesterday Rich Caruana spoke this example sentence.
and a trained HMM (states: person name, location name, background):
Find the most likely state sequence: (Viterbi)
Yesterday Rich Caruana spoke this example sentence.
Any words said to be generated by the designated “person name” state extract as a person name:
Person name: Rich Caruana
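The extraction step above can be sketched in a few lines of Python. This is a toy illustration only: the three state names come from the slide, but every probability below is a hand-set assumption rather than a trained parameter, and `FLOOR`, `emit`, `viterbi`, and `extract` are hypothetical helper names.

```python
import math

STATES = ["background", "person name", "location name"]

# Toy, hand-set parameters (assumptions, not trained values).
START = {"background": 0.8, "person name": 0.1, "location name": 0.1}
TRANS = {
    "background":    {"background": 0.8,  "person name": 0.1,  "location name": 0.1},
    "person name":   {"background": 0.45, "person name": 0.5,  "location name": 0.05},
    "location name": {"background": 0.45, "person name": 0.05, "location name": 0.5},
}
EMIT = {
    "background": {"Yesterday": 0.15, "spoke": 0.15, "this": 0.15,
                   "example": 0.15, "sentence": 0.15},
    "person name": {"Rich": 0.4, "Caruana": 0.4},
    "location name": {"Amherst": 0.8},
}
FLOOR = 1e-4  # probability given to words a state is not listed as emitting


def emit(state, word):
    return EMIT[state].get(word, FLOOR)


def viterbi(tokens):
    """Return the most likely state sequence, computed in log space."""
    best = [{s: math.log(START[s]) + math.log(emit(s, tokens[0])) for s in STATES}]
    back = []
    for word in tokens[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = best[-1][prev] + math.log(TRANS[prev][s]) + math.log(emit(s, word))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def extract(tokens, path, label):
    """Join maximal runs of tokens whose state equals `label`."""
    spans, current = [], []
    for token, state in zip(tokens, path):
        if state == label:
            current.append(token)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = "Yesterday Rich Caruana spoke this example sentence".split()
path = viterbi(tokens)
people = extract(tokens, path, "person name")
print(people)
```

With these toy numbers the decoded path labels only "Rich" and "Caruana" as person name; a trained HMM would estimate the tables from data instead.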
We want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words.
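[Figure: graphical model over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}; the observation is annotated with features such as: is “Wisniewski”, ends in “-ski”, part of noun phrase]
• identity of word
• ends in “-ski”
• is capitalized
• is part of a noun phrase
• is in a list of city names
• is under node X in WordNet
• is in bold font
• is indented
• is in hyperlink anchor
• last person name was female
• next two words are “and Associates”
• …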
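A minimal sketch of how a few of these overlapping features might be computed for one token. The feature names, the `CITY_NAMES` lexicon, and the `word_features` helper are all illustrative assumptions; features that need formatting or syntactic information (bold font, noun phrase, WordNet) are left out.

```python
# Hypothetical lexicon standing in for "a list of city names".
CITY_NAMES = {"amherst", "boston", "houston"}


def word_features(tokens, i):
    """Return the set of binary features that fire for tokens[i]."""
    word = tokens[i]
    lower = word.lower()
    feats = {"word=" + lower}                  # identity of word
    if lower.endswith("ski"):
        feats.add("ends_in_ski")               # ends in "-ski"
    if word[:1].isupper():
        feats.add("is_capitalized")
    if lower in CITY_NAMES:
        feats.add("in_city_list")
    if [t.lower() for t in tokens[i + 1:i + 3]] == ["and", "associates"]:
        feats.add("next_two=and_associates")   # looks ahead in the sequence
    return feats


tokens = ["Wisniewski", "and", "Associates", "moved", "to", "Boston"]
print(word_features(tokens, 0))
print(word_features(tokens, 5))
```

Several features fire at once for the same token, which is exactly the kind of non-independent evidence the slide asks for.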
Problems with Richer Representation and a Joint Model
These arbitrary features are not independent.
– Multiple levels of granularity (chars, words, phrases)
Support Vector Machines (SVMs): 89.7 [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs): 93.9 [Peng, McCallum, 2004]
Error reduction: 40%
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.

          Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
      :            :           Production of Milk and Milkfat 2/
      :   Number   :-------------------------------------------------------
 Year :     of     :   Per Milk Cow    :  Percentage   :       Total
      :Milk Cows 1/:-------------------: of Fat in All :------------------
      :            :  Milk  : Milkfat  : Milk Produced :  Milk  : Milkfat
--------------------------------------------------------------------------------
      : 1,000 Head    --- Pounds ---       Percent        Million Pounds
      :
 1993 :    9,589     15,704     575         3.66        150,582   5,514.4
 1994 :    9,500     16,175     592         3.66        153,664   5,623.7
 1995 :    9,461     16,451     602         3.66        155,644   5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
CRF Labels:
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ... (12 in all)
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
100+ documents from www.fedstats.gov
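The per-line features listed above could be computed roughly as follows. This is a sketch under stated assumptions: percentages are taken over non-whitespace characters, and "whitespace aligns with prev." is approximated as the two lines sharing a column where an interior run of two or more spaces begins. Neither definition is confirmed by the slide, and `gap_starts` and `line_features` are hypothetical names. Conjunctions across time offsets are not shown.

```python
import re


def gap_starts(line):
    """Columns where an interior run of 2+ spaces begins (assumed alignment cue)."""
    return {m.start() for m in re.finditer(r"(?<=\S) {2,}", line)}


def line_features(line, prev_line=""):
    """Layout features for one text line, given the line before it."""
    chars = [c for c in line if not c.isspace()]
    total = len(chars) or 1  # avoid dividing by zero on blank lines
    return {
        "pct_digit": sum(c.isdigit() for c in chars) / total,
        "pct_alpha": sum(c.isalpha() for c in chars) / total,
        "indented": line[:1].isspace(),
        "has_5_spaces": re.search(r" {5,}", line) is not None,
        "aligns_with_prev": bool(gap_starts(line) & gap_starts(prev_line)),
    }


row_1993 = "1993 :     9,589     15,704"
row_1994 = "1994 :     9,500     16,175"
prose = "Marketings include whole milk sold to plants"

data_row = line_features(row_1994, prev_line=row_1993)
text_row = line_features(prose)
print(data_row)
print(text_row)
```

A table data row scores high on digits, contains long space runs, and aligns with the row above it, while a prose line does none of these; a CRF would weigh such cues jointly with the labels of neighboring lines.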
Table Extraction Experimental Results
                          Line labels,       Table segments,
                          percent correct    F1
HMM                       65 %               64 %
Stateless MaxEnt          85 %               -
CRF w/out conjunctions    52 %               68 %
CRF                       95 %               92 %
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Feature Induction for CRFs
1. Begin with knowledge of atomic features, but no features yet in the model.
2. Consider many candidate features, including atomic and conjunctions.
3. Evaluate each candidate feature.
4. Add to the model some that are ranked highest.
5. Train the model.
[McCallum, 2003, UAI]
Candidate Feature Evaluation
Common method: Information Gain
$$\mathrm{InfoGain}(C, F) = H(C) - \sum_{f \in F} P(f) \, H(C \mid f)$$
True optimization criterion: Likelihood of training data
$$\mathrm{Likelihood\ Gain}(f, \lambda) = L_{\Lambda + f\lambda} - L_{\Lambda}$$
Technical meat is in how to calculate this efficiently for CRFs
• Mean field approximation
• Emphasize error instances (related to Boosting)
• Newton's method to set λ
[McCallum, 2003, UAI]
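As a small worked example of candidate evaluation, the sketch below ranks atomic features and their pairwise conjunctions by information gain, the "common method" named on the slide. It does not implement the likelihood-gain criterion or its approximations. The data set, the feature names `cap` and `after_mr`, and the helper names are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """H(C) in bits for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(examples, candidate):
    """H(C) minus the weighted entropy after splitting on whether `candidate` fires.

    `candidate` is a frozenset of atomic features; it fires when all are present.
    """
    gain = entropy([label for _, label in examples])
    for fired in (True, False):
        subset = [label for feats, label in examples if (candidate <= feats) == fired]
        if subset:
            gain -= len(subset) / len(examples) * entropy(subset)
    return gain


# Hypothetical training tokens: (active atomic features, label).
EXAMPLES = [
    ({"cap", "after_mr"}, "PER"),
    ({"cap"}, "O"),
    ({"after_mr"}, "O"),
    (set(), "O"),
]

atoms = sorted({f for feats, _ in EXAMPLES for f in feats})
candidates = [frozenset([a]) for a in atoms]
candidates += [frozenset(pair) for pair in combinations(atoms, 2)]

ranked = sorted(candidates, key=lambda c: info_gain(EXAMPLES, c), reverse=True)
for c in ranked:
    print(sorted(c), round(info_gain(EXAMPLES, c), 3))
```

In this toy data neither atomic feature separates the classes on its own, so the conjunction ranks first, which is the situation feature induction is meant to detect.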
Named Entity Recognition
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.
Y/N  Feature                                          Weight
N    Two words in common                                29
Y    One word in common                                 13
Y    "Normalized" mentions are string identical         39
Y    Capitalized word in common                         17
Y    > 50% character tri-gram overlap                   19
N    < 25% character tri-gram overlap                  -34
Y    In same sentence                                    9
Y    Within two sentences                                8
N    Further than 3 sentences apart                     -1
Y    "Hobbs Distance" < 3                               11
N    Number of entities in between two mentions = 0     12
N    Number of entities in between two mentions > 4     -3
Y    Font matches                                        1
Y    Default                                           -19
OVERALL SCORE = 98 > threshold=0
Pair-wise Affinity Metric
Y/N?
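The overall score on this slide is the sum of the weights of the features that fire (those marked Y). A short check, using the weights and Y/N values exactly as listed; the variable names are illustrative:

```python
# (feature, weight, fired?) as listed on the slide.
FEATURES = [
    ("Two words in common", 29, False),
    ("One word in common", 13, True),
    ('"Normalized" mentions are string identical', 39, True),
    ("Capitalized word in common", 17, True),
    ("> 50% character tri-gram overlap", 19, True),
    ("< 25% character tri-gram overlap", -34, False),
    ("In same sentence", 9, True),
    ("Within two sentences", 8, True),
    ("Further than 3 sentences apart", -1, False),
    ('"Hobbs Distance" < 3', 11, True),
    ("Number of entities in between two mentions = 0", 12, False),
    ("Number of entities in between two mentions > 4", -3, False),
    ("Font matches", 1, True),
    ("Default", -19, True),
]
THRESHOLD = 0

score = sum(weight for _, weight, fired in FEATURES if fired)
merge = score > THRESHOLD
print(score, merge)
```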
The Problem
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
affinity = 98
affinity = 11
affinity = 104
Pair-wise merging decisions (Y, Y, N) are being made independently from each other.
Affinity measures are noisy and imperfect.
They should be made in relational dependence with each other.
A Markov Random Field for Co-reference (MRF)
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
Edge weights: 45, 30, 11; each edge carries a Y/N decision.
[McCallum & Wellner, 2003, ICML]
Make pair-wise merging decisions in dependent relation to each other by
- calculating a joint prob.
- including all edge weights
- adding dependence on consistent triangles.
$$P(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big( \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)$$
A Markov Random Field for Co-reference (MRF)
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
Pair-wise decisions: Y, Y, N
Edge weights: 45, 30, 11
Score for this configuration: −∞
[McCallum & Wellner, 2003]
$$P(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big( \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)$$
A Markov Random Field for Co-reference (MRF)
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
Pair-wise decisions: N, Y, N
Edge weights: 45, 30, 11
Score for this configuration: 64
[McCallum & Wellner, 2003]
$$P(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big( \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)$$
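The scoring of joint Y/N configurations can be sketched as below. Two things are assumptions: which pair of mentions each weight belongs to, and the sign of the 30 edge. The transcript lost the figure and the signs, and −30 on the Mr Powell/she edge is simply the choice that reproduces the 64 shown on the slide. The score is the exponent of the model (an edge adds its weight when labeled Y and subtracts it when labeled N), and an inconsistent triangle is given −∞.

```python
import math
from itertools import product

MENTIONS = ["Mr Powell", "Powell", "she"]

# Edge weights: magnitudes from the slide; signs and edge assignment assumed.
W = {(0, 1): 45, (0, 2): -30, (1, 2): 11}


def consistent(y):
    """A triangle is inconsistent when exactly two of its three edges say Y."""
    return sum(y.values()) != 2


def score(y):
    """Unnormalized log-score of a joint labeling y: edge -> True (Y) / False (N)."""
    if not consistent(y):
        return -math.inf
    return sum(w if y[edge] else -w for edge, w in W.items())


labelings = [dict(zip(W, bits)) for bits in product([True, False], repeat=len(W))]
best = max(labelings, key=score)

for y in labelings:
    print(["Y" if y[e] else "N" for e in W], score(y))
```

Under these assumed weights the best configuration merges Mr Powell with Powell and keeps "she" separate, and any labeling that violates transitivity is ruled out.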
Inference in these MRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
. . . Condoleezza Rice . . .
Edge weights shown: 45, 11, 30, 134, 10, 106
$$\log P(\vec{y} \mid \vec{x}) \propto \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \text{ w/in partitions}} w_{ij} - \sum_{i,j \text{ across partitions}} w_{ij}$$
Inference in these MRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
. . . Condoleezza Rice . . .
Edge weights shown: 45, 11, 30, 134, 10, 106
$$\log P(\vec{y} \mid \vec{x}) \propto \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \text{ w/in partitions}} w_{ij} - \sum_{i,j \text{ across partitions}} w_{ij} = 22$$
Inference in these MRFs = Graph Partitioning
[Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
. . . Condoleezza Rice . . .
Edge weights shown: 45, 11, 30, 134, 10, 106
$$\log P(\vec{y} \mid \vec{x}) \propto \sum_{i,j} \sum_{l} \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \text{ w/in partitions}} w_{ij} + \sum_{i,j \text{ across partitions}} w'_{ij} = 314$$
Co-reference Experimental Results
Proper noun co-reference
DARPA ACE broadcast news transcripts, 117 stories
                             Partition F1    Pair F1
Single-link threshold        16 %            18 %
Best prev match [Morton]     83 %            89 %
MRFs                         88 %            92 %
Error reduction              30 %            28 %
DARPA MUC-6 newswire article corpus, 30 stories
                             Partition F1    Pair F1
Single-link threshold        11 %            7 %
Best prev match [Morton]     70 %            76 %
MRFs                         74 %            80 %
Error reduction              13 %            17 %
[McCallum & Wellner, 2003]
Joint co-reference among all pairs: Affinity Matrix CRF
. . . Mr Powell . . .
. . . Powell . . .
. . . she . . .
[Figure: the three mentions joined by edges with weights 45, 99, 11, each carrying a Y/N decision]
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
~25% reduction in error on co-reference of proper nouns in newswire.
[Wellner, McCallum, Peng, Hay, UAI 2004]
Inference: Variant of Iterated Conditional Modes
Co-reference decisions
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
[Besag, 1986]
World Knowledge
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Extraction from and matching of research paper citations.
see also [Marthi, Milch, Russell, 2003]
Joint IE and Coreference from Research Paper Citations
Textual citation mentions (noisy, with duplicates)
Paper database, with fields, clean, duplicates collapsed
AUTHORS               TITLE        VENUE
Cowell, Dawid…        Probab…      Springer
Montemerlo, Thrun…    FastSLAM…    AAAI…
Kjaerulff             Approxi…     Technic…
4. Joint segmentation and co-reference
Laurel, B. Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
Citation Segmentation and Coreference
1) Segment citation fields
2) Resolve coreferent citations (Y?N)
3) Form canonical database record, resolving conflicts:
AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990
Perform jointly.
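As a rough illustration of step 3 only, the sketch below merges two already-segmented, already-matched citation mentions into one record. The conflict rule (keep the longest value seen for each field) is an assumption chosen for simplicity, not the method from the talk. The field assignments for the two mentions are a hand segmentation of the citations shown on the slide, and `canonical_record` is a hypothetical helper name.

```python
MENTION_1 = {
    "AUTHOR": "Laurel, B.",
    "TITLE": "Interface Agents: Metaphors with Character",
    "BOOKTITLE": "The Art of Human-Computer Interface Design",
    "EDITOR": "T. Smith",
    "PUBLISHER": "Addison-Wesley",
    "YEAR": "1990",
}
MENTION_2 = {
    "AUTHOR": "Brenda Laurel",
    "TITLE": "Interface Agents: Metaphors with Character",
    "EDITOR": "Smith",
    "BOOKTITLE": "The Art of Human-Computr Interface Design",
    "PAGES": "355-366",
    "YEAR": "1990",
}


def canonical_record(mentions):
    """Merge coreferent mentions; on conflict keep the longest field value."""
    record = {}
    for mention in mentions:
        for field, value in mention.items():
            if len(value) > len(record.get(field, "")):
                record[field] = value
    return record


record = canonical_record([MENTION_1, MENTION_2])
for field, value in record.items():
    print(field, "=", value)
```

On this pair the longest-value rule happens to recover the record shown on the slide; a joint model would instead let segmentation and coreference uncertainty inform each other.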
IE + Coreference Model
x: Observed citation, e.g. J Besag 1986 On the…
s: CRF Segmentation, e.g. AUT AUT YR TITL TITL
c: Citation mention attributes, e.g. AUTHOR = “J Besag”, YEAR = “1986”, TITLE = “On the…”
Structure (x, s, c) for each citation mention:
J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…
Binary coreference variables for each pair of mentions (y, n, n in this example)
Contact info and name extraction performance (25 fields)
Example keywords extracted
1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large org’s by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.
Clustering words into topics with Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Example topics induced from a large collection of text
• STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
• MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
• WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
• DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
• FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
• SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
• BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
• JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
[Tenenbaum et al]
From LDA to Author-Recipient-Topic (ART)
Inference and Estimation
Gibbs Sampling:
- Easy to implement
- Reasonably fast
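To give a sense of why Gibbs sampling is easy to implement, here is a minimal collapsed Gibbs sampler for plain LDA. It is not the ART model: ART additionally conditions topics on author-recipient pairs, which this sketch leaves out. The function name, hyperparameter values, and toy documents are assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of token lists. Returns (z, topic_word) where z[d][i] is the
    topic of token i in document d and topic_word[k][w] is a count.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, topic_word


docs = [
    ["fish", "sea", "swim"],
    ["virus", "disease", "fever"],
    ["fish", "swim", "sea", "water"],
]
z, topic_word = lda_gibbs(docs, num_topics=2, iters=50)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print("topic", k, top)
```

Each sweep only decrements counts, samples from a small discrete distribution, and increments counts again, which is why the method is simple to code and reasonably fast.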
Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas [email protected]
Topics, and prominent sender/receivers discovered by ART
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
Comparing Role Discovery
connection strength (A,B) =
Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
Traditional SNA: Similar roles
Author-Topic: Different roles
ART: Different roles
Geaconne = “Secretary”
McCarty = “Vice President”
Comparing Role Discovery: Tracy Geaconne vs. Rod Hayslett
Traditional SNA: Different roles
Author-Topic: Very similar
ART: Not very similar
Geaconne = “Secretary”
Hayslett = “Vice President & CTO”
There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:
Pairs with highest rank difference between ART & SNA
5 other professors
3 other ML researchers
Role-Author-Recipient-Topic Models
Main Application Project:
[Diagram of linked entities: Research Paper (Cites), Person, University, Venue, Grant, Groups, Expertise]
Summary
• Conditional Random Fields combine the benefits of
– Conditional probability models (arbitrary features)
– Markov models (for sequences or other relations)