Entity-Centric Coreference
Resolution with Model Stacking
Kevin Clark and Christopher D. Manning
(ACL-IJCNLP 2015)
(Tables are taken from the above-mentioned paper)
Presented by Mamoru Komachi
ACL 2015 Reading Group @ Tokyo Institute of Technology
August 26th, 2015
Entity-level information allows early coreference
decisions to inform later ones
Entity-centric coreference systems build up
coreference clusters incrementally (Raghunathan et
al., 2010; Stoyanov and Eisner, 2012; Ma et al.,
2014)
Hillary Clinton files for divorce from Bill Clinton ahead
of her campaign for presidency for 2016.
….
Clinton is confident that her poll numbers will skyrocket
once the divorce is final.
?!?
Problem: How to build up clusters
effectively?
Model stacking
Two mention pair models: classification model
and ranking model
Generates cluster features for clusters of
mentions
Imitation learning
Assigns exact costs to actions based on
coreference evaluation metrics
Uses the scores of the pairwise models to reduce
the search space
Mention Pair Models
Previous approach using local information
Two models for predicting whether a given pair of
mentions belong to the same coreference cluster
Is the pair coreferent?
Classification model
Which antecedent best suits the mention?
Ranking model
Bill arrived, but nobody saw him.
I talked to him on the phone.
Logistic classifiers for classification model
M: set of all mentions in the training set
T(m): set of true antecedents of a mention m
F(m): set of false antecedents of m
Considers each pair of mentions independently
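Using the definitions above (T(m): true antecedents, F(m): false antecedents), the classification objective is a standard pairwise logistic loss. A minimal numpy sketch, with feature extraction left abstract (all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_logistic_loss(theta, true_pairs, false_pairs):
    """Negative log-likelihood over antecedent pairs.

    true_pairs / false_pairs: feature vectors for (antecedent, mention)
    pairs drawn from T(m) and F(m) for every mention m in M.
    """
    loss = 0.0
    for x in true_pairs:      # a in T(m): push p(coref) toward 1
        loss -= np.log(sigmoid(theta @ x))
    for x in false_pairs:     # a in F(m): push p(coref) toward 0
        loss -= np.log(1.0 - sigmoid(theta @ x))
    return loss
```

Each pair is scored independently, matching the bullet above.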
Logistic classifiers for ranking model
Considers candidate antecedents simultaneously
Max-margin training encourages the model to find
the single best antecedent for a mention, but the
resulting scores are less robust as inputs to a
downstream clustering model
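One common max-margin ranking formulation (a sketch, not necessarily the paper's exact loss): the best-scoring true antecedent should outscore every false antecedent by a margin.

```python
import numpy as np

def ranking_loss(theta, true_feats, false_feats, margin=1.0):
    """Hinge loss over candidate antecedents considered simultaneously:
    penalize when no true antecedent beats the best false one by `margin`."""
    best_true = max(theta @ x for x in true_feats)
    worst_violator = max(theta @ x for x in false_feats)
    return max(0.0, margin + worst_violator - best_true)
```

Unlike the classification loss, all candidates for a mention interact inside one max.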
Features for mention pair model
Distance features: the distance between the two mentions, in sentences or in intervening mentions
Syntactic features: number of embedded NPs under a mention, POS tags of the first, last, and head word
Semantic features: named entity type, speaker identification
Rule-based features: exact and partial string matching
Lexical features: the first, last, and head word of the current mention
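A toy subset of the feature classes listed above, as a sketch (the dictionary fields and feature names are illustrative; the real system uses many more features):

```python
def mention_pair_features(m1, m2):
    """Compute a few example mention-pair features from two mention dicts."""
    return {
        "sent_dist": m2["sent"] - m1["sent"],                     # distance
        "exact_match": m1["text"].lower() == m2["text"].lower(),  # rule-based
        "head_match": m1["head"] == m2["head"],                   # lexical
        "same_ne_type": m1["ne"] == m2["ne"],                     # semantic
    }
```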
Entity-Centric Coreference Model
Proposed approach using cluster features
The entity-centric model can exhibit high
coherence
Best first clustering (Ng and Cardie, 2002)
Assigns as the antecedent the most probable
preceding mention classified as coreferent with
the current mention
Only relies on local information
Entity-centric model (this work)
Operates between pairs of clusters instead of pairs
of mentions
Builds up coreference chains with agglomerative
clustering, merging two clusters when it predicts
they represent the same entity
Inference
Reduce the search space by thresholding with the
mention-pair model scores
Sort the candidate pairs P to perform
easy-first clustering
s is a scoring function that makes the
binary decision for each merge action
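The inference loop above can be sketched as follows. This is a hedged sketch: `pair_score`, `merge_fn`, and the threshold stand in for the trained pairwise model, the cluster-merge scorer s, and a tuned cutoff.

```python
def easy_first_clustering(mentions, pair_score, merge_fn, threshold=0.1):
    """Prune candidate pairs with mention-pair scores, sort so confident
    (easy) decisions come first, then greedily merge clusters."""
    clusters = {m: frozenset([m]) for m in mentions}
    # Search-space reduction: keep only pairs above the threshold.
    pairs = [(pair_score(m1, m2), m1, m2)
             for i, m1 in enumerate(mentions)
             for m2 in mentions[i + 1:]
             if pair_score(m1, m2) > threshold]
    for _, m1, m2 in sorted(pairs, reverse=True):   # easy-first order
        c1, c2 = clusters[m1], clusters[m2]
        if c1 is not c2 and merge_fn(c1, c2):       # binary merge decision
            merged = c1 | c2
            for m in merged:
                clusters[m] = merged
    return set(clusters.values())
```

Earlier confident merges change the clusters seen by later, harder decisions, which is exactly the entity-level information the model exploits.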
Learning entity-centric model by imitation learning
Sequential prediction problem: future observations
depend on previous actions
Imitation learning (in this work, DAgger (Ross et
al., 2011)) is useful for this problem (Argall et al., 2009)
Training the agent on gold labels alone assumes
that all previous decisions were correct, which is
problematic in coreference, where the error rate is
quite high
DAgger exposes the system at train time to states
similar to the ones it will face at test time
Learning cluster merging policy
by DAgger (Ross et al., 2011)
Iterative algorithm
aggregating a dataset D
consisting of states and the
actions performed by the
expert policy in those
states
β controls the probability of
following the expert policy versus the
current policy (decays
exponentially as the
iteration number increases)
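A schematic DAgger loop under the description above (all function names are placeholders; the real system's states are clustering configurations and the expert is derived from gold clusters):

```python
import random

def dagger(expert_policy, train_classifier, run_episode, n_iters=5, decay=0.5):
    """Roll in with a mixture of expert and learned policy, label every
    visited state with the expert's action, aggregate, and retrain."""
    dataset = []                  # aggregated (state, expert action) pairs
    policy = expert_policy
    for i in range(n_iters):
        beta = decay ** i         # P(follow expert) decays exponentially
        def mixed(state):
            return expert_policy(state) if random.random() < beta else policy(state)
        states = run_episode(mixed)            # states visited under the mixture
        dataset += [(s, expert_policy(s)) for s in states]
        policy = train_classifier(dataset)     # retrain on aggregated data
    return policy
```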
Adding cost to actions: Directly tune to
optimize coreference metrics
Merging clusters influences the score (the order of
merge operations also matters)
How will a particular local decision affect the final
score of the coreference system?
Problem: standard coreference metrics do not
decompose over individual clusters
Answer: roll out the actions from the current state
to a final clustering and score it
A(s): set of actions that can be taken from state s
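Because the metrics only score complete clusterings, the cost of each candidate action can be estimated by rolling out to a final clustering and scoring it. A sketch, where `rollout` and `metric` are assumed helpers (not the paper's exact interfaces):

```python
def action_costs(state, actions, rollout, metric):
    """Cost of an action = drop in the final metric relative to the best
    action in A(state), a common imitation-learning cost definition.

    rollout(state, a) -> final clustering after taking action a and
                         following the current policy to completion
    metric(clusters)  -> e.g. the B3 F1 of that final clustering
    """
    scores = {a: metric(rollout(state, a)) for a in actions}
    best = max(scores.values())
    return {a: best - s for a, s in scores.items()}   # best action costs 0
```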
Cluster features for classification model
and ranking model
Between-cluster features
Minimum and maximum probability of coreference
Average probability and average log probability of coreference
Average probability and log probability of coreference for
particular pairs of grammatical mention types (pronominal or not)
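This model stacking step can be sketched directly from the list above: aggregate the pairwise model's probabilities over all mention pairs spanning the two clusters (a minimal sketch; `pair_prob` stands in for the trained pairwise model):

```python
import math

def between_cluster_features(c1, c2, pair_prob):
    """Cluster-pair features stacked from mention-pair probabilities:
    minimum, maximum, average, and average log probability."""
    probs = [pair_prob(m1, m2) for m1 in c1 for m2 in c2]
    return {
        "min_prob": min(probs),
        "max_prob": max(probs),
        "avg_prob": sum(probs) / len(probs),
        "avg_log_prob": sum(math.log(p) for p in probs) / len(probs),
    }
```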
Only 56 features for entity-centric model
State features
Whether a preceding mention pair in the list of
mention pairs has the same candidate anaphor as
the current one
The index of the current mention pair in the list
divided by the size of the list (what percentage of the
list have we seen so far?)
…
The entity-centric model doesn’t rely on sparse lexical
features. Instead, it employs model stacking to
exploit strong features (scores learned from the
pairwise models)
Results and discussion
CoNLL 2012 English coreference task
Experimental setup:
CoNLL 2012 Shared Task
English portion of OntoNotes
Training: 2802, development: 343, test: 345 documents
Use the provided pre-processing (parse trees, NE, etc)
Common evaluation metrics
MUC, B3, CEAFE
CoNLL F1 (the average F1 score of the three metrics)
CoNLL scorer version 8.01
Rule-based mention detection (Raghunathan et al., 2010)
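The CoNLL F1 in the list above is just the unweighted mean of the three metric F1 scores:

```python
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """CoNLL F1 = unweighted average of the MUC, B3, and CEAFE F1 scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0
```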
Results: the entity-centric model outperforms best-
first clustering with both the classification and ranking models
Entity-centric model beats other state-of-
the-art coreference models
This work primarily optimizes for the B3 metric during training
State-of-the-art systems use latent antecedents to learn
scoring functions over mention pairs, but are trained to
maximize global objective functions
Entity-centric model directly learns a coreference
model that maximizes an evaluation metric
Post-processing of mention pair and ranking models
Closest-first clustering (Soon et al., 2001)
Best-first clustering (Ng and Cardie, 2002)
Global inference models
Global inference with integer linear programming
(Denis and Baldridge, 2007; Finkel and Manning,
2008)
Graph partitioning (McCallum and Wellner, 2005;
Nicolae and Nicolae, 2006)
Correlation clustering (McCallum and Wellner,
2003; Finley and Joachims, 2005)
Previous approaches do not directly tune
against coreference metrics
Non-local entity-level information
Cluster model (Luo et al., 2004; Yang et al., 2008;
Rahman and Ng, 2011)
Joint inference (McCallum and Wellner, 2003;
Culotta et al., 2006; Poon and Domingos, 2008;
Haghighi and Klein, 2010)
Learning trajectories of decisions
Imitation learning (Daumé et al., 2005; Ma et al.,
2014)
Structured perceptron (Stoyanov and Eisner, 2012;
Fernandes et al., 2012; Björkelund and Kuhn, 2014)
Summary
Proposed an entity-centric coreference model that uses
the scores produced by mention-pair models as
features
Merge costs are assigned using standard
coreference metrics
Imitation learning is used to learn how to build
up coreference chains incrementally
The proposed model outperforms the commonly used
best-first method and the current state of the art