Learning with Noise: Enhance Distantly Supervised Relation
Extraction with Dynamic Transition Matrix
Bingfeng Luo1, Yansong Feng∗1, Zheng Wang2, Zhanxing Zhu3,
Songfang Huang4, Rui Yan1 and Dongyan Zhao1
1ICST, Peking University, China
2School of Computing and Communications, Lancaster University, UK
3Peking University, China
4IBM China Research Lab, China
{bf
luo,fengyansong,zhanxing.zhu,ruiyan,zhaody}@[email protected]@cn.ibm.com
Abstract
Distant supervision significantly reduces human efforts in
building training data for many classification tasks. While
promising, this technique often introduces noise to the generated
training data, which can severely affect the model performance.
In this paper, we take a deep look at the application of distant
supervision in relation extraction. We show that the dynamic
transition matrix can effectively characterize the noise in the
training data built by distant supervision. The transition matrix
can be effectively trained using a novel curriculum learning based
method without any direct supervision about the noise. We
thoroughly evaluate our approach under a wide range of extraction
scenarios. Experimental results show that our approach
consistently improves the extraction results and outperforms the
state-of-the-art in various evaluation scenarios.
1 Introduction
Distant supervision (DS) is rapidly emerging as a viable means
for supporting various classification tasks, from relation
extraction (Mintz et al., 2009) and sentiment classification (Go et
al., 2009) to cross-lingual semantic analysis (Fang and Cohn, 2016).
By using knowledge learned from seed examples to label data, DS
automatically prepares large scale training data for these
tasks.
While promising, DS does not guarantee perfect results and
often introduces noise to the generated data. In the context of
relation extraction, DS works by considering sentences containing
both the subject and object of a triple
as its supports. However, the generated data are not always
perfect. For instance, DS could match a knowledge base (KB)
triple in false positive contexts like Donald Trump worked in New
York City. Prior works (Takamatsu et al., 2012; Ritter et al.,
2013) show that DS often mistakenly labels real positive instances
as negative (false negative) or vice versa (false positive), and
there could be confusions among positive labels as well. Such
noise can severely affect training and lead to poorly-performing
models.
Tackling the noisy data problem of DS is non-trivial, since
explicit supervision to capture the noise is usually lacking.
Previous works have tried to remove sentences containing
unreliable syntactic patterns (Takamatsu et al., 2012), design new
models to capture certain types of noise, or aggregate multiple
predictions under the at-least-one assumption that at least one of
the aligned sentences supports the triple in KB (Riedel et al.,
2010; Surdeanu et al., 2012; Ritter et al., 2013; Min et al., 2013).
These approaches represent a substantial leap forward towards
making DS more practical. However, they are either tightly coupled
to certain types of noise or have to rely on manual rules to filter
noise, and are thus unable to scale. Recent breakthroughs in neural
networks provide a new way to reduce the influence of incorrectly
labeled data by aggregating multiple training instances
attentively for relation classification, without explicitly
characterizing the inherent noise (Lin et al., 2016; Zeng et al.,
2015). Although promising, modeling noise within neural network
architectures is still in its early stage and much remains to be
done.
In this paper, we aim to enhance DS noise modeling by providing
the capability to explicitly characterize the noise in the
DS-style training data within neural network architectures. We show
that while noise is inevitable, it is possible to characterize the
noise pattern in a unified framework along with its original
classification objective. Our key insight is that the DS-style
training data typically contain useful clues about the noise
pattern. For example, we can infer that since some people work in
their birthplaces, DS could wrongly label a training sentence
describing a working place as a born-in relation. Our novel
approach to noise modeling is to use a dynamically-generated
transition matrix for each training instance to (1) characterize
the possibility that the DS labeled relation is confused and (2)
indicate its noise pattern. To tackle the challenge of no direct
guidance over the noise pattern, we employ a curriculum learning
based training method to gradually model the noise pattern over
time, and utilize trace regularization to control the behavior of
the transition matrix during training. Our approach is flexible:
while it does not make any assumptions about the data quality, the
algorithm can make effective use of the data-quality prior
knowledge to guide the learning procedure when such clues are
available.
We apply our method to the relation extraction task and evaluate
under various scenarios on two benchmark datasets. Experimental
results show that our approach consistently improves both
extraction settings, outperforming the state-of-the-art models in
different settings.
Our work offers an effective way of tackling the noisy data
problem of DS, making DS more practical at scale. Our main
contributions are to (1) design a dynamic transition matrix
structure to characterize the noise introduced by DS, and (2)
design a curriculum learning based framework to adaptively guide
the training procedure to learn with noise.
2 Problem Definition
The task of distantly supervised relation extraction is to
extract knowledge triples from free text, with the training data
constructed by aligning existing KB triples with a large corpus.
Specifically, given a triple in KB, DS works by first retrieving all
the sentences containing both subj and obj of the triple, and then
constructing the training data by considering these sentences as
support for the existence of the triple. This task can be conducted
at both the sentence and the bag levels. The former takes a
sentence s containing both subj and obj as input, and outputs the
relation expressed by the sentence between subj and obj. The latter
setting alleviates the noisy data problem by using the at-least-one
assumption that at least one of the retrieved sentences containing
both subj and obj supports the triple. It takes a bag of sentences S
as input where each sentence s ∈ S contains both subj and obj, and
outputs the relation between subj and obj expressed by this bag.

Figure 1: Overview of our approach (an encoder produces sentence
embeddings; a prediction branch outputs the predicted distribution;
a noise modeling branch outputs the transition matrix; a
transformation combines the two into the observed distribution)
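The alignment step described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `build_bags`, the naive substring matching, and the example triple and sentences are assumptions made for the sketch, not the authors' pipeline (which would rely on proper entity linking).

```python
def build_bags(triples, sentences):
    """Group sentences into bags keyed by KB triple (hypothetical helper).

    A sentence is assigned to a triple's bag when it mentions both the
    subject and the object. Substring matching is an assumption made for
    brevity.
    """
    bags = {}
    for subj, rel, obj in triples:
        bag = [s for s in sentences if subj in s and obj in s]
        if bag:
            bags[(subj, rel, obj)] = bag
    return bags


# Illustrative inputs (assumed, not taken from the datasets in the paper).
triples = [("Donald Trump", "born-in", "New York")]
sentences = [
    "Donald Trump was born in New York.",
    "Donald Trump worked in New York City.",  # false positive for born-in
    "Alphabet was founded in 2015.",
]
bags = build_bags(triples, sentences)
```

Both sentences mentioning the two entities end up in the born-in bag, including the one that only describes a workplace, which is exactly the kind of noise the sentence-level and bag-level settings have to cope with.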
3 Our approach
In order to deal with the noisy training data obtained through
DS, our approach follows four steps as depicted in Figure 1. First,
each input sentence is fed to a sentence encoder to generate an
embedding vector. Our model then takes the sentence embeddings as
input and produces a predicted relation distribution, p, for the
input sentence (or the input sentence bag). At the same time, our
model dynamically produces a transition matrix, T, which is used
to characterize the noise pattern of the sentence (or the bag).
Finally, the predicted distribution is multiplied by the transition
matrix to produce the observed relation distribution, o, which is
used to match the noisy relation labels assigned by DS, while the
predicted relation distribution p serves as the output of our model
during testing. One of the key challenges of our approach is
determining the element values of the transition matrix, which
will be described in Section 4.
3.1 Sentence-level Modeling
Sentence Embedding and Prediction. In this work, we use a
piecewise convolutional neural network (Zeng et al., 2015) for
sentence encoding, but other sentence embedding models can also be
used. We feed the sentence embedding to a fully connected layer,
and use softmax to generate the predicted relation distribution,
p.
Noise Modeling. First, each sentence embedding x, generated by the
sentence encoder, is passed to a fully connected layer, which acts
as a non-linearity, to obtain the sentence embedding $x_n$ used
specifically for noise modeling. We then use softmax to calculate
the transition matrix T for each sentence:

\[
T_{ij} = \frac{\exp(w_{ij}^{\top} x_n + b)}{\sum_{j=1}^{|C|} \exp(w_{ij}^{\top} x_n + b)}
\tag{1}
\]

where $T_{ij}$ is the conditional probability for the input sentence
to be labeled as relation j by DS, given i as the true relation, b
is a scalar bias, |C| is the number of relations, and $w_{ij}$ is the
weight vector characterizing the confusion between i and j.
Here, we dynamically produce a transition matrix, T,
specifically for each sentence, but with the parameters ($w_{ij}$)
shared across the dataset. By doing so, we are able to adaptively
characterize the noise pattern for each sentence, with a few
parameters only. In contrast, one could also produce a global
transition matrix for all sentences, with much less computation,
where one need not compute T on the fly (see Section 6.1).
Observed Distribution. When we characterize the noise in a
sentence with a transition matrix T, if its true relation is i, we
can assume that i might be erroneously labeled as relation j by DS
with probability $T_{ij}$. We can therefore capture the observed
relation distribution, o, by multiplying T and the predicted
relation distribution, p:

\[
o = T^{\top} \cdot p
\tag{2}
\]

where o is then normalized to ensure $\sum_i o_i = 1$.
Rather than using the predicted distribution p
to directly match the relation labeled by DS (Zeng et al., 2015;
Lin et al., 2016), here we utilize o to match the noisy labels
during training and still use p as output during testing, which
actually captures the procedure of how the noisy label is produced
and thus protects p from the noise.
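A small worked example may make Equation 2 concrete. The numbers below are invented for illustration: relation 0 is assumed to be mislabeled as relation 1 or 2 with probability 0.1 each, while the other two relations are assumed noise-free.

```python
import numpy as np


def observed_distribution(T, p):
    """Equation 2: o_j = sum_i T[i, j] * p[i], followed by normalization."""
    o = T.T @ p
    return o / o.sum()


p = np.array([0.7, 0.2, 0.1])        # predicted (true-relation) distribution
T = np.array([[0.8, 0.1, 0.1],       # row i: how true relation i gets labeled
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
o = observed_distribution(T, p)      # -> [0.56, 0.27, 0.17]
```

Part of the mass predicted for relation 0 is moved to relations 1 and 2 in o, so a noisy label of 1 or 2 can be explained by the transition matrix instead of forcing p itself to change.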
3.2 Bag Level Modeling
Bag Embedding and Prediction. One of the key challenges for the bag
level model is how to aggregate the embeddings of individual
sentences into the bag level. In this work, we experiment with two
methods, namely average and attention aggregation (Lin et al.,
2016). The former calculates the bag embedding, s, by averaging the
embeddings of each sentence, and then feeds it to a softmax
classifier for relation classification.
The attention aggregation calculates an attention value, $a_{ij}$,
for each sentence i in the bag with respect to each relation j, and
aggregates to the bag level as $s_j$ by the following equations¹:

\[
s_j = \sum_{i}^{n} a_{ij} x_i ; \qquad
a_{ij} = \frac{\exp(x_i^{\top} r_j)}{\sum_{i'}^{n} \exp(x_{i'}^{\top} r_j)}
\tag{3}
\]

where $x_i$ is the embedding of sentence i, n the number of
sentences in the bag, and $r_j$ is the randomly initialized embedding
for relation j. In similar spirit to (Lin et al., 2016), the
resulting bag embedding $s_j$ is fed to a softmax classifier to
predict the probability of relation j for the given bag.
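The attention aggregation of Equation 3 can be sketched with NumPy as follows. The array shapes, the function name `attention_bag_embeddings`, and the random toy inputs are assumptions for illustration only.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_bag_embeddings(X, R):
    """Equation 3 with dot-product attention.

    X: (n, d) sentence embeddings of one bag.
    R: (C, d) relation embeddings r_j.
    Returns S of shape (C, d), where S[j] = sum_i a_ij * x_i, and the
    attention weights A of shape (n, C).
    """
    scores = X @ R.T             # scores[i, j] = x_i . r_j
    A = softmax(scores, axis=0)  # normalize over the sentences in the bag
    S = A.T @ X                  # one bag embedding per relation
    return S, A


rng = np.random.default_rng(1)
n_sentences, dim, num_relations = 5, 6, 3   # assumed toy sizes
X = rng.normal(size=(n_sentences, dim))
R = rng.normal(size=(num_relations, dim))
S, A = attention_bag_embeddings(X, R)
```

With a single-sentence bag the attention weights are all 1, so every relation-specific bag embedding reduces to that sentence's embedding.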
Noise Modeling. Since the transition matrix addresses the
transition probability with respect to each true relation, the
attention mechanism appears to be a natural fit for calculating
the transition matrix at the bag level. Similar to the attention
aggregation above, we calculate the bag embedding with respect to
each relation using Equation 3, but with a separate set of relation
embeddings $r'_j$. We then calculate the transition matrix, T, by:

\[
T_{ij} = \frac{\exp(s_i^{\top} r'_j + b_i)}{\sum_{j=1}^{|C|} \exp(s_i^{\top} r'_j + b_i)}
\tag{4}
\]

where $s_i$ is the bag embedding regarding relation i, and $r'_j$ is
the embedding for relation j.
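A possible NumPy rendering of Equation 4 is given below. It recomputes the attention aggregation with a separate set of relation embeddings (called `R_noise` here, standing in for r'), then scores each relation-specific bag embedding against those same embeddings. Shapes, names, and random inputs are assumptions for the sketch.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def bag_transition_matrix(X, R_noise, b):
    """Equation 4 for one bag.

    X:       (n, d) sentence embeddings.
    R_noise: (C, d) separate relation embeddings r'_j for noise modeling.
    b:       (C,) per-row bias b_i.
    """
    A = softmax(X @ R_noise.T, axis=0)    # attention over sentences (Eq. 3)
    S = A.T @ X                           # (C, d): s_i for each relation i
    logits = S @ R_noise.T + b[:, None]   # logits[i, j] = s_i . r'_j + b_i
    # Note: a constant added to a whole row cancels in the row-wise
    # softmax; b is kept only to mirror the form of Equation 4.
    return softmax(logits, axis=1)


rng = np.random.default_rng(2)
X = rng.normal(size=(5, 6))               # assumed toy bag of 5 sentences
R_noise = rng.normal(size=(3, 6))         # assumed 3 relations
b = np.zeros(3)
T_bag = bag_transition_matrix(X, R_noise, b)
```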
4 Curriculum Learning based Training
One of the key challenges of this work is how to train and
produce the transition matrix to model the noise in the training
data without any direct guidance or human involvement. A
straightforward solution is to directly align the observed
distribution, o, with the noisy labels by minimizing the sum of two
terms: CrossEntropy(o) + Regularization. However, doing so does not
guarantee that the prediction distribution, p, will match the true
relation distribution. The problem is that at the beginning of
training, we have no prior knowledge about the noise pattern; thus,
both T and p are less reliable, making the training procedure likely
to get trapped in a poor local optimum. Therefore, we require a
technique to guide our model to gradually adapt to the noisy
training data, e.g., learning something simple first, and then
trying to deal with noise.
¹ While (Lin et al., 2016) use a bilinear function to calculate
$a_{ij}$, we simply use the dot product since we find these two
functions perform similarly in our experiments.
Fortunately, this is exactly what curriculum learning can do. The
idea of curriculum learning (Bengio et al., 2009) is simple: start
with the easiest aspect of a task and level up the difficulty
gradually, which fits our problem well. We thus employ a curriculum
learning framework to guide our model to gradually learn how to
characterize the noise. Another advantage is that it helps avoid
falling into a poor local optimum.
With curriculum learning, our approach provides the flexibility
to incorporate prior knowledge of noise, e.g., splitting a dataset
into reliable and less reliable subsets, to improve the
effectiveness of the transition matrix and better model the noise.
4.1 Trace Regularization
Before proceeding to training details, we first discuss how we
characterize the noise level of the data by controlling the trace
of its transition matrix. Intuitively, if the noise is small, the
transition matrix T will tend to become an identity matrix, i.e.,
given a set of annotated training sentences, the observed relations
and their true relations are almost identical. Since each row of T
sums to 1, the similarity between the transition matrix and the
identity matrix can be represented by its trace, trace(T). The
larger trace(T) is, the larger the diagonal elements are, and the
more similar the transition matrix T is to the identity matrix,
indicating a lower level of noise. Therefore, we can characterize
the noise pattern by controlling the expected value of trace(T) in
the form of regularization. For example, we will expect a larger
trace(T) for reliable data, but a smaller trace(T) for less
reliable data. Another advantage of employing trace regularization
is that it could help reduce the model complexity and avoid
overfitting.
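The link between trace(T) and the noise level can be checked on two extreme cases. In this sketch the number of relations and the sign convention of the helper `trace_penalty` (a term added to a loss that is minimized) are assumptions chosen to match the description above.

```python
import numpy as np


def trace_penalty(T, beta):
    """Regularization term -beta * trace(T), to be added to a minimized loss.

    A positive beta rewards a large trace (T close to the identity); a
    negative beta rewards a small trace (T far from the identity).
    """
    return -beta * np.trace(T)


num_relations = 4                                   # assumed toy size
noise_free = np.eye(num_relations)                  # labels never flipped
fully_noisy = np.full((num_relations, num_relations),
                      1.0 / num_relations)          # labels uniformly random

trace_clean = np.trace(noise_free)    # 4.0, the maximum for a 4x4 row-stochastic T
trace_noisy = np.trace(fully_noisy)   # 1.0
```

For a row-stochastic matrix with |C| rows, the trace ranges from 0 to |C|, with |C| reached only by the identity matrix.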
4.2 Training
To tackle the challenge of no direct guidance over the noise
patterns, we implement a curriculum learning based training method
that first trains the model without considering noise. In other
words, we first focus on the loss from the prediction distribution
p, and then take the noise modeling into account gradually along
the training process, i.e., gradually increasing the importance of
the loss from the observed distribution o while decreasing the
importance of p. In this way, the prediction branch is roughly
trained before the model manages to characterize the noise, and
thus avoids getting stuck in a poor local optimum. We thus minimize
the following loss function:
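\[
L = \sum_{i=1}^{N} \Big[ -\big( (1-\alpha) \log(o_{i y_i}) + \alpha \log(p_{i y_i}) \big)
    - \beta \, \mathrm{trace}(T_i) \Big]
\tag{5}
\]

where 0 < α ≤ 1 and β > 0 are two weighting parameters, $y_i$ is the
relation assigned by DS for the i-th instance, N the total number of
training instances, $T_i$ the transition matrix of the i-th instance,
$o_{i y_i}$ is the probability that the observed relation for the
i-th instance is $y_i$, and $p_{i y_i}$ is the probability to
predict relation $y_i$ for the i-th instance.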
Initially, we set α=1, and train our model completely by
minimizing the loss from the prediction distribution p. That is, we
do not expect to model the noise, but focus on the prediction branch
at this time. As the training progresses, the prediction branch
gradually learns the basic prediction ability. We then decrease α
and β by 0
-
where $\beta_m$ is the regularization weight for the m-th data
subset, M is the total number of subsets, $N_m$ the number of
instances in the m-th subset, and $T_{mi}$, $y_{mi}$ and
$o_{mi, y_{mi}}$ are the transition matrix, the relation labeled by
DS and the observed probability of this relation for the i-th
training instance in the m-th subset, respectively. Note that,
different from Equation 5, this loss function does not need to
initiate training by minimizing the loss regarding the prediction
distribution p, since one can easily start by learning from the
most reliable split first.
We also use trace regularization for the most reliable subset,
since some noisy annotations inevitably appear in this split.
Specifically, we expect its trace(T) to be large (using a
positive β) so that the elements of T will be concentrated on the
diagonal and T will be more similar to the identity matrix. As for
the less reliable subset, we expect trace(T) to be small (using a
negative β) so that the elements of the transition matrix will be
diffuse and T will be less similar to the identity matrix. In
other words, the transition matrix is encouraged to characterize
the noise.
Note that this loss function only works for sentence level
models. For bag level models, since reliable and less reliable
sentences are all aggregated into a sentence bag, we cannot
determine which bag is reliable and which is not. However, bag
level models can still build a curriculum by changing the content
of a bag, e.g., keeping reliable sentences in the bag first, then
gradually adding less reliable ones, and training with Equation 5,
which could benefit from the prior knowledge of data quality as
well.
5 Evaluation Methodology
Our experiments aim to answer two main questions: (1) is it
possible to model the noise in the training data generated through
DS, even when there is no prior knowledge to guide us? and (2) can
the prior knowledge of data quality help our approach better
handle the noise?
We apply our approach to both sentence level and bag level
extraction models, and evaluate in the situations where we do not
have prior knowledge of the data quality as well as where such
prior knowledge is available.
5.1 Datasets
We evaluate our approach on two datasets.
TIMERE. We build TIMERE by using DS to align time-related Wikidata
(Vrandečić and Krötzsch, 2014) KB triples to Wikipedia text. It
contains 278,141 sentences with 12 types of relations between an
entity mention and a time expression. We choose to use time-related
relations because time expressions speak for themselves in terms of
reliability. That is, given a KB triple and its aligned sentences,
the finer-grained the time expression t that appears in the
sentence, the more likely the sentence supports the existence of
this triple. For example, a sentence containing both Alphabet and
October-2-2015 is very likely to express the inception-time of
Alphabet, while a sentence containing both Alphabet and 2015 could
instead talk about many events, e.g., releasing the financial report
of 2015, hiring a new CEO, etc. Using this heuristic, we can split
the dataset into 3 subsets according to different granularities of
the time expressions involved, indicating different levels of
reliability. Our criteria for determining the reliability are as
follows. Instances with full date expressions, i.e.,
Year-Month-Day, can be seen as the most reliable data, while those
with partial date expressions, e.g., Month-Year and Year-Only, are
considered as less reliable. Negative data are constructed
heuristically: any entity-time pairs in a sentence without
corresponding triples in Wikidata are treated as negative data.
During training, we can access 184,579 negative and 77,777 positive
sentences, the latter including 22,214 reliable ones and 2,094 and
53,469 less reliable ones. The validation set and test set are
randomly sampled from the reliable (full-date) data for relatively
fair evaluations and contain 2,776 and 2,771 positive sentences and
5,143 and 5,095 negative sentences, respectively.
ENTITYRE is a widely-used entity relation extraction dataset,
built by aligning triples in Freebase to the New York Times (NYT)
corpus (Riedel et al., 2010). It contains 52 relations, 136,947
positive and 385,664 negative sentences for training, and 6,444
positive and 166,004 negative sentences for testing. Unlike TIMERE,
this dataset does not contain any prior knowledge about the data
quality. Since the sentence level annotations in ENTITYRE are too
noisy to serve as gold standard, we only evaluate bag-level models
on ENTITYRE, a standard practice in previous works (Surdeanu et al.,
2012; Zeng et al., 2015; Lin et al., 2016).
5.2 Experimental Setup
Hyper-parameters. We use 200 convolution kernels with window size
3. During training, we use stochastic gradient descent (SGD) with
batch size 20. The learning rates for sentence-level and bag-level
models are 0.1 and 0.01, respectively.
Sentence level experiments are performed on TIMERE, using 100-d
word embeddings pre-trained using GloVe (Pennington et al., 2014)
on Wikipedia and Gigaword (Parker et al., 2011), and 20-d vectors
for distance embeddings. Each of the three subsets of TIMERE is
added after the previous phase has run for 15 epochs. The trace
regularization weights are β1 = 0.01, β2 = −0.01 and β3 = −0.1,
respectively, from the reliable to the most unreliable, with the
ratio of β3 and β2 fixed to 10 or 5 when tuning.
Bag level experiments are performed on both TIMERE and ENTITYRE.
For TIMERE, we use the same parameters as above. For ENTITYRE, we
use 50-d word embeddings pre-trained on the NYT corpus using
word2vec (Mikolov et al., 2013), and 5-d vectors for distance
embedding. For both datasets, α and β in Eq. 5 are initialized to 1
and 0.1, respectively. We tried various decay rates, {0.95, 0.9,
0.8}, and steps, {3, 5, 8}. We found that using a decay rate of 0.9
with a step of 5 gives the best performance in most cases.
Evaluation Metric. The performance is reported using the
precision-recall (PR) curve, which is a standard evaluation metric
in relation extraction. Specifically, the extraction results are
first ranked in decreasing order of their confidence scores, then
the precision and recall are calculated by setting the threshold to
be the score of each extraction result one by one.
Naming Conventions. We evaluate our approach under a wide range
of settings for sentence level (sent_) and bag level (bag_) models:
(1) mix: trained on all three subsets of TIMERE mixed together;
(2) reliable: trained using the reliable subset of TIMERE only;
(3) PR: trained with prior knowledge of annotation quality, i.e.,
starting from the reliable data and then adding the unreliable
data; (4) TM: trained with the dynamic transition matrix; (5) GTM:
trained with a global transition matrix. At the bag level, we also
investigate the performance of average aggregation (_avg) and
attention aggregation (_att).
Figure 2: Sentence Level Results on TIMERE (precision-recall
curves; x-axis: Recall, y-axis: Precision; curves: sent_mix,
sent_reliable, sent_PR, sent_mix_TM, sent_PR_seg2_TM, sent_PR_TM)
6 Experimental Results
6.1 Performance on TIMERE
Sentence Level Models. The results of sentence level models on
TIMERE are shown in Figure 2. We can see that mixing all subsets
together (sent_mix) gives the worst performance, significantly
worse than using the reliable subset only (sent_reliable). This
confirms the noisy nature of the training data obtained through DS,
and suggests that properly dealing with the noise is key to
applying DS to a wider range of applications. When getting help
from our dynamic transition matrix, the model (sent_mix_TM)
significantly improves over sent_mix, delivering the same level of
performance as sent_reliable in most cases. This suggests that our
transition matrix can help to mitigate the bad influence of noisy
training instances.
Now let us consider the PR scenario where one can build a
curriculum by first training on the reliable subset, then
gradually moving to both reliable and less reliable data. We can
see that this simple curriculum learning based model (sent_PR)
further outperforms sent_reliable significantly, indicating that
the curriculum learning framework not only reduces the effect of
noise, but also helps the model learn from noisy data. When
applying the transition matrix approach to this curriculum
learning framework using one reliable subset and one unreliable
subset generated by mixing our two less reliable subsets, our model
(sent_PR_seg2_TM) further improves over sent_PR by utilizing the
dynamic transition matrix to model the noise. It is not surprising
that when we use all three subsets separately, our model
(sent_PR_TM) outperforms all other models by a large margin.
Figure 3: Bag Level Results on TIMERE (precision-recall curves;
x-axis: Recall, y-axis: Precision). (a) Attention Aggregation:
bag_att_mix, bag_att_reliable, bag_att_PR, bag_att_mix_TM,
bag_att_PR_TM. (b) Average Aggregation: bag_avg_mix,
bag_avg_reliable, bag_avg_PR, bag_avg_mix_TM, bag_avg_PR_TM.
Bag Level Models. In this setting, we first look at the
performance of the bag level models with attention aggregation. The
results are shown in Figure 3(a). Consider the comparison between
the model trained on the reliable subset only (bag_att_reliable)
and the one trained on the mixed dataset (bag_att_mix). In contrast
to the sentence level, bag_att_mix outperforms bag_att_reliable by
a large margin, because bag_att_mix has taken the at-least-one
assumption into consideration through the attention aggregation
mechanism (Eq. 3), which can be seen as a denoising step within the
bag. This may also be the reason that when we introduce either our
dynamic transition matrix (bag_att_mix_TM) or the curriculum of
using prior knowledge of data quality (bag_att_PR) into the bag
level models, the improvement over bag_att_mix is not as
significant as at the sentence level.
However, when we apply our dynamic transition matrix to the
curriculum built upon prior knowledge of data quality
(bag_att_PR_TM), the performance is further improved, especially in
the high precision part compared to bag_att_PR. We also note that
the bag level's at-least-one assumption does not always hold, and
there are still false negative and false positive problems.
Therefore, using our transition matrix approach with or without
prior knowledge of data quality, i.e., bag_att_mix_TM and
bag_att_PR_TM, improves the performance in both cases, and
bag_att_PR_TM performs slightly better.
The results of bag level models with average aggregation are
shown in Figure 3(b), where the relative ranking of various
settings is similar to that with attention aggregation. A notable
difference is that both bag_avg_PR and bag_avg_mix_TM improve over
bag_avg_mix by a larger margin compared to that in the attention
aggregation setting. The reason may be that the average aggregation
mechanism is not as good as the attention aggregation in denoising
within the bag, which leaves more room for our transition matrix
approach or curriculum learning with prior knowledge to improve.
Also note that bag_avg_reliable performs best in the
very-low-recall region but worst in general. This is because it
ranks higher the sentences expressing either birth-date or
death-date, the simplest but the most common relations in the
dataset, but fails to learn other relations with limited or noisy
training instances, given its relatively simple aggregation
strategy.

Figure 4: Global TM v.s. Dynamic TM (precision-recall curves;
x-axis: Recall, y-axis: Precision; curves: sent_PR, sent_PR_GTM,
sent_PR_TM, bag_att_PR, bag_att_PR_GTM, bag_att_PR_TM)
Global v.s. Dynamic Transition Matrix. We also compare our dynamic
transition matrix method with the global transition matrix method,
which maintains only one transition matrix for all training
instances. Specifically, instead of dynamically generating a
transition matrix for each datum, we first initialize an identity
matrix $T' \in \mathbb{R}^{|C| \times |C|}$, where |C| is the number
of relations (including no-relation). Then the global transition
matrix T is built by applying softmax to each row of T′ so that
$\sum_j T_{ij} = 1$:

\[
T_{ij} = \frac{e^{T'_{ij}}}{\sum_{j=1}^{|C|} e^{T'_{ij}}}
\tag{7}
\]

where $T_{ij}$ and $T'_{ij}$ are the elements in the i-th row
and j-th column of T and T′. The element values of matrix T′ are
also updated via backpropagation during training. As shown in
Figure 4, using one global transition matrix (_GTM) is also
beneficial and improves both the sentence level (sent_PR) and bag
level (bag_att_PR) models. However, since the global transition
matrix only captures the global noise pattern, it fails to
characterize individual instances with subtle differences,
resulting in a performance drop compared to the dynamic one (_TM).
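The construction of the global transition matrix in Equation 7 can be sketched as below; only the forward computation at initialization is shown, and the toy number of relations is an assumption. In training, T′ would be a learnable parameter updated by backpropagation.

```python
import numpy as np


def global_transition_matrix(T_prime):
    """Equation 7: row-wise softmax of the parameter matrix T'."""
    z = T_prime - T_prime.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


num_relations = 3                      # assumed toy size
T_prime = np.eye(num_relations)        # identity initialization
T_global = global_transition_matrix(T_prime)
diag_at_init = T_global[0, 0]          # e / (e + 2) for three relations
```

Note that the identity initialization of T′ does not make T itself an identity matrix: after the softmax, each diagonal entry equals e / (e + |C| − 1), so the initial T is only biased towards the diagonal.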
Case Study. We find that our transition matrix method tends to
obtain more significant improvement on noisier relations. For
example, time of spacecraft landing is noisier than time of
spacecraft launch since, compared to the launching of a spacecraft,
fewer of the sentences containing the landing time of a spacecraft
talk directly about the landing. Instead, many of these sentences
tend to talk about the activities of the crew. Our sent_PR_TM model
improves the F1 of time of spacecraft landing and time of
spacecraft launch over sent_PR by 9.09% and 2.78%, respectively.
The transition matrix makes a more significant improvement on time
of spacecraft landing since there are more noisy sentences for our
method to handle, which results in a more significant improvement
in the quality of the training data.
6.2 Performance on ENTITYRE
We evaluate our bag level models on ENTITYRE. As shown in Figure 5,
it is not surprising that the basic model with attention
aggregation (att) significantly outperforms the average one (avg),
where att in our bag embedding is similar in spirit to (Lin et al.,
2016), which has reported state-of-the-art performance on
ENTITYRE. When injected with our transition matrix approach, both
att_TM and avg_TM clearly outperform their basic versions.
Figure 5: Results on ENTITYRE (precision-recall curves; x-axis:
Recall, y-axis: Precision; curves: avg, att, avg_TM, att_TM)
Method    P@R 10   P@R 20   P@R 30
Mintz     39.88    28.55    16.81
MultiR    60.94    36.41    -
MIML      60.75    33.82    -
avg       58.04    51.25    42.45
avg_TM    58.56    52.35    43.59
att       61.51    56.36    45.63
att_TM    67.24    57.61    44.90

Table 1: Comparison with feature-based methods. P@R 10/20/30
refers to the precision when recall equals 10%, 20% and 30%.
Similar to the situations in TIMERE, since att has taken the
at-least-one assumption into account through its attention-based
bag embedding mechanism, the improvement made by att_TM is not as
large as that made by avg_TM.
We also include the comparison with three feature-based methods:
Mintz (Mintz et al., 2009) is a multiclass logistic regression
model; MultiR (Hoffmann et al., 2011) is a probabilistic graphical
model that can handle overlapping relations; MIML (Surdeanu et
al., 2012) is also a probabilistic graphical model but operates in
the multi-instance multi-label paradigm. As shown in Table 1,
although traditional feature-based methods have reasonable results
in the low recall region, their performance drops quickly as the
recall goes up, and MultiR and MIML did not even reach 30% recall.
This indicates that, while human-designed features can effectively
capture certain relation patterns, their coverage is relatively
low. On the other hand, neural network models have more stable
performance across different recall levels, and att_TM performs
generally better than other models, indicating again the
effectiveness of our transition matrix method.
7 Related Work
In addition to relation extraction, distant supervision (DS) has
been shown to be effective in generating training data for various
NLP tasks, e.g., tweet sentiment classification (Go et al., 2009),
tweet named entity classification (Ritter et al., 2011), etc.
However, these early applications of DS do not address the issue of
data noise well.
In relation extraction (RE), recent works have been proposed to
reduce the influence of wrongly labeled data. The work presented by
(Takamatsu et al., 2012) removes potentially noisy sentences by
identifying bad syntactic patterns at the pre-processing stage. (Xu
et al., 2013) use pseudo-relevance feedback to find possible false
negative data. (Riedel et al., 2010) make the at-least-one
assumption and propose to alleviate the noise problem by
considering RE as a multi-instance classification problem.
Following this assumption, researchers further improve the original
paradigm using probabilistic graphical models (Hoffmann et al.,
2011; Surdeanu et al., 2012) and neural network methods (Zeng et
al., 2015). Recently, (Lin et al., 2016) propose to use an
attention mechanism to reduce the noise within a sentence bag.
Instead of characterizing the noise, these approaches only aim to
alleviate the effect of noise.
The at-least-one assumption is often too strong in practice, and
there is still a chance that a sentence bag is a false
positive or a false negative. Thus it is important to model the noise
pattern to guide the learning procedure. Ritter et al. (2013) and
Min et al. (2013) try to employ a set of latent variables to
represent the true relation. Our approach differs from theirs in two
aspects. We target noise modeling in neural networks, while
they target probabilistic graphical models. We further advance their
models by providing the capability to model the fine-grained
transition from the true relation to the observed one, and the
flexibility to incorporate indirect guidance.
Outside of NLP, various methods have been proposed in computer
vision to model data noise using neural networks. Sukhbaatar et
al. (2015) utilize a global transition matrix with weight decay to
transform the true label distribution to the observed one. Reed et al.
(2014) use a hidden layer to represent the true label distribution
but try to force it to predict both the noisy label and the input.
Chen and Gupta (2015) and Xiao et al. (2015) first estimate the
transition matrix on a clean dataset
and then apply it to the noisy data. Our model shares a similar spirit with
Misra et al. (2016) in that both dynamically generate a
transition matrix for each training instance, but, instead of using
vanilla SGD, we train our model with a novel curriculum learning
based training framework with trace regularization to control the
behavior of the transition matrix. In NLP, the only work on
neural-network-based noise modeling uses one single global
transition matrix to model the noise introduced by cross-lingual
projection of training data (Fang and Cohn, 2016). Our work advances
this approach by generating a transition matrix dynamically for each
instance, which avoids using one single component to characterize both
reliable and unreliable data.
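
The role of a transition matrix in the methods above can be illustrated with a small sketch. It assumes a row-stochastic convention, where entry (i, j) is the probability of observing label j when the true relation is i. The function names and the toy matrices are hypothetical, and the exact form and weight of the trace regularization term are not given in this section, so only the trace itself is computed.

```python
def observed_distribution(p_true, transition):
    """Map a true-relation distribution to an observed-label distribution.

    transition[i][j] is read as P(observed label j | true relation i),
    so every row is assumed to sum to 1.
    """
    n = len(p_true)
    return [sum(p_true[i] * transition[i][j] for i in range(n))
            for j in range(n)]


def trace(matrix):
    """Sum of the diagonal; equals n for the identity (noise-free) matrix."""
    return sum(matrix[i][i] for i in range(len(matrix)))


# Hypothetical 3-relation example.
p_true = [0.7, 0.2, 0.1]
reliable = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]

print(observed_distribution(p_true, reliable), trace(reliable))
print(observed_distribution(p_true, noisy), trace(noisy))
```

A global approach would share one such matrix across all training data, whereas a dynamic approach would produce a separate matrix per instance, so that reliable instances can stay close to the identity while unreliable ones shift probability mass off the diagonal. The trace gives a simple scalar handle on how close a matrix is to the identity.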
8 Conclusions
In this paper, we investigate the noise problem inherent in
DS-style training data. We argue that the data speak for themselves
by providing useful clues that reveal their noise patterns. We
thus propose a novel transition matrix based method
to dynamically characterize the noise underlying
such training data in a unified framework along with the
original prediction objective. One of our key innovations
is to exploit a curriculum learning based training
method to gradually learn to model the
underlying noise pattern without direct guidance,
and to provide the flexibility to exploit any prior
knowledge of the data quality to further improve
the effectiveness of the transition matrix. We evaluate our approach
in two learning settings of
distantly supervised relation extraction. The experimental
results show that the proposed method
can better characterize the underlying noise and
consistently outperforms state-of-the-art extraction
models under various scenarios.
Acknowledgement
This work is supported by the National High Technology R&D
Program of China (2015AA015403); the National Natural Science
Foundation of China (61672057, 61672058); KLSTSPI Key Lab. of
Intelligent Press Media Technology; the UK Engineering and
Physical Sciences Research Council under grants EP/M01567X/1
(SANDeRs) and EP/M015793/1 (DIVIDEND); and the Royal
Society International Collaboration Grant (IE161012).
References

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston.
2009. Curriculum learning. In ICML. ACM, pages 41–48.

Xinlei Chen and Abhinav Gupta. 2015. Webly supervised learning
of convolutional networks. In ICCV. pages 1431–1439.

Meng Fang and Trevor Cohn. 2016. Learning when to trust distant
supervision: An application to low-resource POS tagging using
cross-lingual projection. In CONLL. pages 178–186.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment
classification using distant supervision. CS224N Project Report,
Stanford 1(12).

Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and
Daniel S Weld. 2011. Knowledge-based weak supervision for
information extraction of overlapping relations. In Proceedings of
ACL. pages 541–550.

Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong
Sun. 2016. Neural relation extraction with selective attention over
instances. In ACL. volume 1, pages 2124–2133.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and
Jeff Dean. 2013. Distributed representations of words and phrases
and their compositionality. In NIPS. pages 3111–3119.

Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek.
2013. Distant supervision for relation extraction with an incomplete
knowledge base. In HLT-NAACL. pages 777–782.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009.
Distant supervision for relation extraction without labeled data.
In ACL. pages 1003–1011.

Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, and Ross
Girshick. 2016. Seeing through the human reporting bias: Visual
classifiers from noisy human-centric labels. In CVPR. pages
2930–2939.

Robert Parker, David Graff, Junbo Kong, Ke Chen, and Kazuaki
Maeda. 2011. English gigaword fifth edition, linguistic data
consortium. Technical report, Linguistic Data Consortium,
Philadelphia.

Jeffrey Pennington, Richard Socher, and Christopher D Manning.
2014. Glove: Global vectors for word representation. In EMNLP.
volume 14, pages 1532–1543.

Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy,
Dumitru Erhan, and Andrew Rabinovich. 2014. Training deep neural
networks on noisy labels with bootstrapping. arXiv preprint
arXiv:1412.6596.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling
relations and their mentions without labeled text. In Joint
European Conference on Machine Learning and Knowledge Discovery
in Databases. Springer, pages 148–163.

Alan Ritter, Sam Clark, Oren Etzioni, et al. 2011.
Named entity recognition in tweets: an experimental study. In
EMNLP. Association for Computational Linguistics, pages
1524–1534.

Alan Ritter, Luke Zettlemoyer, Mausam, and Oren Etzioni. 2013.
Modeling missing data in distant supervision for information
extraction. TACL 1:367–378.

Sainbayar Sukhbaatar, Joan Bruna, Manohar Paluri, Lubomir
Bourdev, and Rob Fergus. 2015. Training convolutional networks with
noisy labels. In ICLR.

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and
Christopher D Manning. 2012. Multi-instance multi-label learning for
relation extraction. In EMNLP-CoNLL. pages 455–465.

Shingo Takamatsu, Issei Sato, and Hiroshi Nakagawa. 2012.
Reducing wrong labels in distant supervision for relation
extraction. In ACL. pages 721–729.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free
collaborative knowledgebase. Communications of the ACM
57(10):78–85.

Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang.
2015. Learning from massive noisy labeled data for image
classification. In CVPR. pages 2691–2699.

Wei Xu, Raphael Hoffmann, Le Zhao, and Ralph Grishman. 2013.
Filling knowledge base gaps for distant supervision of relation
extraction. In ACL. pages 665–670.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant
supervision for relation extraction via piecewise convolutional
neural networks. In EMNLP. pages 1753–1762.