From Entity Recognition to Entity Linking
Bianca Pereira
07/05/2014
Based on the paper “From Entity Recognition to Entity Linking: a Survey of Advanced Entity Linking Techniques” by Dai et al. (2012)
Jul 05, 2015
Outline
• Motivation
• Overview of Entity Linking
• Instance-based Entity Linking Approach
• Experiments
• Conclusion
• Analysis of the Paper
• Relation with my PhD
THE BATTLE OF THE BOOGIE
Named Entity Recognition
Source: http://en.wikipedia.org/wiki/Mick_Jackson_(singer) (visited on 06/05/2014)
Overview of Entity Linking
[Figure: fields shown: Databases, Biomedical, Natural Language Processing (NLP), AI]
Entity Linking
Source: http://en.wikipedia.org/wiki/Mick_Jackson_(singer) (visited on 20/11/2013)
http://www.discogs.com/artist/87624-Mick-Jackson
http://www.discogs.com/artist/643265-Elmar-Krohn?noanv=1
http://www.discogs.com/artist/492396-Dave-Jackson-2?noanv=1
http://www.discogs.com/artist/148391-Sylvester-Levay
http://www.discogs.com/artist/169154-Jacksons-The
Tasks That Inspired Entity Linking
• Link-The-Wiki Track in INEX
• Web People Search Task in SemEval
[Figure labels: URL1, URL2, …, URLn; Person 1, Person 2, Person 3, Person 4]
Entity Linking Tasks
• Entity Linking in TAC-KBP
• Gene Normalization in BioCreative
[Figure labels: http://www.discogs.com/artist/87624-Mick-Jackson, NIL; Syntenin-1, mda-9, ID:100754014, ID:6386, …]
Problem Definition
• Article-wide Salient Entity Linking Problem
• Article-wide Entity Linking Problem
• Instance-based Entity Linking Problem
Article-wide Salient EL Problem
Source: http://en.wikipedia.org/wiki/Michael_jordan (visited on 20/11/2013)
Article-wide EL Problem
Source: http://en.wikipedia.org/wiki/Blame_It_on_the_Boogie (visited on 06/05/2014)
Instance-based EL Problem
Source: http://en.wikipedia.org/wiki/Blame_It_on_the_Boogie (visited on 06/05/2014)
Instance-based Entity Linking Approach
Challenges
1. Lack of a suitable corpus for developing instance-based EL systems.
2. Lack of context information for disambiguating each individual instance.
The synthetic replicate of urocortin was found to bind with high affinity to type 1 and type 2 CRF receptors and, based upon its anatomic localization within the brain, was proposed to be a natural ligand for the type 2 CRF receptors.
Classification
• Local Classification
• Relational Classification
• Collective Classification
[Figure: three panels over nodes labeled URL1, URL2, URL3, URL4; the relational and collective panels add further URL-labeled nodes]
Collective Entity Disambiguation
1. Discourse Salience: In a given discourse there is precisely one entity that is the center of attention.
2. Transitivity: If two mentions refer to the same entity, and one mention has been linked to a database entry, the other should also be linked to the same entry.
Markov Logic Network Formulation
• Observed Features
  Saliency: Precede(x,y) ^ LinkTo(x,id) ^ Candidate(y,id) => LinkTo(y,id)
• Observed Features of the Neighbors
  Transitivity: Coreference(x,y) ^ LinkTo(x,idi) => LinkTo(y,idi)
• Unobserved Features of the Neighbors
  Protein-protein interaction: LinkTo(x,idi) ^ Candidate(y,idj) ^ PPIPartner(idi,idj) => LinkTo(y,idj)
… Here, we demonstrate that rat syntenin-1, previously published as syntenin-1 (syntenin), mda-9, or TACIP18 in human, is a neurofascin-binding protein that exhibits a wide-spread tissue expression pattern with a relative maximum in brain. …
[Figure: mentions syntenin-1 and mda-9 with candidate entries labeled ID1, ID2, ID3, ID4, ID9]
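To see how the three formulas interact, here is a minimal Python sketch that applies them as hard rules by forward chaining until no new LinkTo fact appears. This is a simplification and not the method of the paper: in a Markov Logic Network each formula carries a weight and inference is probabilistic, which the sketch does not model. The function name infer_links, the mention names m1, m2, m3 and the toy facts are hypothetical; ID1, ID2, ID9 are just placeholder labels.

```python
def infer_links(link_to, candidate, precede, coreference, ppi_partner):
    """Forward-chain the three rules as hard implications (no MLN weights)."""
    links = set(link_to)
    while True:
        derived = set()
        for x, id_i in links:
            # Saliency: Precede(x,y) ^ LinkTo(x,id) ^ Candidate(y,id) => LinkTo(y,id)
            derived |= {(y, id_i) for a, y in precede
                        if a == x and (y, id_i) in candidate}
            # Transitivity: Coreference(x,y) ^ LinkTo(x,idi) => LinkTo(y,idi)
            derived |= {(y, id_i) for a, y in coreference if a == x}
            # PPI: LinkTo(x,idi) ^ Candidate(y,idj) ^ PPIPartner(idi,idj) => LinkTo(y,idj)
            derived |= {(y, id_j) for y, id_j in candidate
                        if (id_i, id_j) in ppi_partner}
        if derived <= links:
            # Fixed point: no rule produced a new fact.
            return links
        links |= derived


# Toy facts (hypothetical): m1 is already linked, m2 and m3 are not.
links = infer_links(
    link_to={("m1", "ID1")},
    candidate={("m2", "ID1"), ("m2", "ID2"), ("m3", "ID9")},
    precede={("m1", "m2")},
    coreference=set(),
    ppi_partner={("ID1", "ID9")},
)
print(sorted(links))
```

In this toy run, m2 inherits ID1 through the saliency rule and m3 receives ID9 through the protein-protein interaction rule, because ID1 and ID9 are declared interaction partners.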
Collective Inference
[Figure: graph of URL-labeled nodes, repeated from the Collective Classification panel]
Joint Inference
… Here, we demonstrate that rat syntenin-1, previously published as syntenin-1 (syntenin), mda-9, or TACIP18 in human, is a neurofascin-binding protein that exhibits a wide-spread tissue expression pattern with a relative maximum in brain. …
Transitivity: Coreference(x,y) ^ LinkTo(x,idi) => LinkTo(y,idi)
[Figure: mentions syntenin-1, syntenin-1, syntenin, mda-9, TACIP18]
Joint Inference
New Constraints
Transitivity2: Coreference(x,y) ^ LinkTo(x,idi) ^ ¬∃idj. LinkTo(y,idj) => LinkTo(y,idi)
Coreference(x,y) => SuitablyLink(x) ^ SuitablyLink(y)
LinkTo(x,id) => SuitablyLink(x)
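The difference between plain Transitivity and Transitivity2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule is applied as a hard constraint rather than a weighted MLN formula, coreference pairs are treated as ordered (x, y) exactly as written in the formula, the SuitablyLink constraints are not modelled, and the function name apply_transitivity2 and the IDs ID1 and ID2 are hypothetical.

```python
def apply_transitivity2(link_to, coreference):
    """Copy a link across a coreference pair only if the target has no link yet."""
    links = set(link_to)
    changed = True
    while changed:
        changed = False
        for x, y in sorted(coreference):
            # ¬∃ idj. LinkTo(y, idj): skip mentions that are already linked.
            if any(mention == y for mention, _ in links):
                continue
            inherited = {(y, id_i) for mention, id_i in links if mention == x}
            if inherited:
                links |= inherited
                changed = True
    return links


# Toy facts (hypothetical IDs): mda-9 already has its own link.
links = apply_transitivity2(
    link_to={("syntenin-1", "ID1"), ("mda-9", "ID2")},
    coreference={("syntenin-1", "syntenin"), ("syntenin-1", "mda-9")},
)
print(sorted(links))
```

Here the unlinked mention syntenin inherits ID1, while mda-9 keeps only ID2. The plain Transitivity rule would also have added ID1 to mda-9. Note that when an unlinked mention corefers with several differently linked mentions, the outcome of this sketch depends on the order in which the pairs are visited.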
Experiments
Corpus
IGML Corpus (Instance-based Gene Mention Linking)
                                    Training Set   Test Set
Number of articles                  282            262
Number of gene mentions             2,813          3,143
Number of linked Entrez Gene IDs    2,861          3,187
Number of words per article         215.86         228.91
Number of mentions per article      10.01          12.00
Number of words per mention         1.52           1.35
Number of IDs per mention           1.02           1.01
[Figure labels: “Syntenin-1” (URL5); “Human and rat syntenin-1”; “The mammalian syntenin-1”]
Corpus – Gene Mention Recognition
Set        Precision   Recall   F-Measure
Training   55.3        83.4     66.5
Test       66.2        82.7     65.1
Corpus – Training Set
[Bar chart: Precision, Recall and F-Measure (axis 0 to 90) for Optimal Linking, Best Linking and Worst Linking]
Corpus – Test Set
[Bar chart: Precision, Recall and F-Measure (axis 0 to 90) for Optimal Linking, Best Linking and Worst Linking]
Evaluation
                              Training            Test
Feature                       P     R     F       P     R     F
Discourse Salience            79.2  50.2  61.5    79.5  59.0  67.7
Protein-protein Interaction   79.4  51.1  62.2    80.1  59.8  68.5
Transitivity                  78.5  49.5  60.7    78.6  58.8  67.2
Evaluation
                                      Training            Test
                                      P     R     F       P     R     F
Random Baseline                       68.4  51.6  58.8    68.3  59.8  63.8
Collective                            79.1  52.0  62.8    78.4  61.0  68.6
Collective + Filtering                79.3  52.0  62.9    78.8  61.0  68.8
Individual                            74.9  54.3  62.9    75.7  61.7  68.0
Collective + Individual               74.5  55.7  63.7    74.9  64.8  69.5
Collective + Individual + Filtering   79.9  54.9  65.1    77.8  65.3  71.0
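The F column can be reproduced from P and R. The sketch below assumes that F is the balanced F-measure (the harmonic mean of precision and recall); the slides do not state this, but it agrees with the rows used here. The function name f_measure is my own.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) taken from the evaluation table.
rows = {
    "Random Baseline (training)": (68.4, 51.6, 58.8),
    "Collective + Individual + Filtering (training)": (79.9, 54.9, 65.1),
    "Collective + Individual + Filtering (test)": (77.8, 65.3, 71.0),
}
for name, (p, r, reported_f) in rows.items():
    print(f"{name}: computed F = {f_measure(p, r):.1f}, reported F = {reported_f}")
```

For example, the best test result gives 2 × 77.8 × 65.3 / (77.8 + 65.3) ≈ 71.0, which matches the table.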
Conclusion
- Overview of Entity Linking
- Why is Instance-based Entity Linking more challenging?
- Suggestion of a solution to the problem
Analysis
Cons
- The results do not lead to any conclusion.
- Too many abbreviations in the paper.
- Does the approach converge to an optimal solution?
- How long does it take to produce a solution?
- Are there cases that could not be disambiguated by human annotators?
Pros
- Outlier
- “The instance-based EL task requires deeper linguistic analysis and domain dependent knowledge to infer each instance’s identity.”
[Figure: fields shown: Databases, Biomedical, Natural Language Processing, AI, Semantic Web]
How is it Related to my PhD?
I am working on the topic of Entity Linking.
• Generic Approach
• Focus on Linguistic Features
• Linked Data as Knowledge Base
• Scalability
Thank you!