Transcript
Low-Resource Semantic Role Labeling
Matthew R. Gormley Margaret Mitchell
Benjamin Van Durme Mark Dredze
CLSP Seminar ; September 12, 2014
Low-Resource Semantic Role Labeling
Matthew R. Gormley1 Margaret Mitchell2 Benjamin Van Durme1 Mark Dredze1
1 Human Language Technology Center of Excellence
Johns Hopkins University, Baltimore, MD 21211
2 Microsoft Research
Redmond, WA 98052
[email protected] | [email protected] | [email protected] | [email protected]
Abstract
We explore the extent to which high-
resource manual annotations such as tree-
banks are necessary for the task of se-
mantic role labeling (SRL). We examine
how performance changes without syntac-
tic supervision, comparing both joint and
pipelined methods to induce latent syn-
tax. This work highlights a new applica-
tion of unsupervised grammar induction
and demonstrates several approaches to
SRL in the absence of supervised syntax.
Our best models obtain competitive results
in the high-resource setting and state-of-
the-art results in the low resource setting,
reaching 72.48% F1 averaged across lan-
guages. We release our code for this work
along with a larger toolkit for specifying
arbitrary graphical structure.1
1 Introduction
The goal of semantic role labeling (SRL) is to
identify predicates and arguments and label their
semantic contribution in a sentence. Such labeling
defines who did what to whom, when, where and
how. For example, in the sentence “The kids ran
the marathon”, ran assigns a role to kids to denote
that they are the runners; and a role to marathon to
denote that it is the race course.
Models for SRL have increasingly come to rely
on an array of NLP tools (e.g., parsers, lem-
matizers) in order to obtain state-of-the-art re-
sults (Björkelund et al., 2009; Zhao et al., 2009).
Each tool is typically trained on hand-annotated
data, thus placing SRL at the end of a very high-
resource NLP pipeline. However, richly annotated
data such as that provided in parsing treebanks is
expensive to produce, and may be tied to specific
domains (e.g., newswire). Many languages do
1 http://www.cs.jhu.edu/~mrg/software/
not have such supervised resources (low-resource
languages), which makes exploring SRL cross-
linguistically difficult.
The problem of SRL for low-resource lan-
guages is an important one to solve, as solutions
pave the way for a wide range of applications: Ac-
curate identification of the semantic roles of enti-
ties is a critical step for any application sensitive to
semantics, from information retrieval to machine
translation to question answering.
In this work, we explore models that minimize
the need for high-resource supervision. We ex-
amine approaches in a joint setting where we
marginalize over latent syntax to find the optimal
semantic role assignment; and a pipeline setting
where we first induce an unsupervised grammar.
We find that the joint approach is a viable alterna-
tive for making reasonable semantic role predic-
tions, outperforming the pipeline models. These
models can be effectively trained with access to
only SRL annotations, and mark a state-of-the-art
contribution for low-resource SRL.
To better understand the effect of the low-
resource grammars and features used in these
models, we further include comparisons with (1)
models that use higher-resource versions of the
same features; (2) state-of-the-art high resource
models; and (3) previous work on low-resource
grammar induction. In sum, this paper makes
several experimental and modeling contributions,
summarized below.
Experimental contributions:
• Comparison of pipeline and joint models for SRL.
• Subtractive experiments that consider the removal of supervised data.
• Analysis of the induced grammars in unsupervised, distantly-supervised, and joint training settings.
See our paper from ACL '14
Our End Task: Shallow Semantics
– Representation: Semantic Role Labeling (SRL)
  • Intuitively captures who did what to whom, when and where
  • Similar to open-domain relation extraction
– Languages: Catalan, Czech, German, English, Spanish, Chinese
NNP VBD DT NNP
Morsi chaired the FJP
NN NNP VBZ NN
Pres. #morsi cre8s unrest
NNP RB VBD DT NN
Proyas tb dirigió la peli
Agent Theme Agent Theme Holder Agent Patient Role
Intermediate Tasks: Syntax
– Dependency parsing
  • Captures the structure of the sentence
  • Diverges from the shallow semantics representation
– Part-of-speech (POS) tagging
NNP VBD DT NNP
Morsi chaired the FJP
NN NNP VBZ NN
Pres. #morsi cre8s unrest
NNP RB VBD DT NN
Proyas tb dirigió la peli
Agent Theme Agent Theme Agent Patient Holder Role
[Figure: pipeline annotations for "President Morsi creates unrest": word forms, lemmas (president, morsi, create, unrest), POS tags (NN NNP VBZ NN), morphological features (num=s, per=3s, tense=p), and semantic roles (Agent, Theme, Holder, Role).]
Background: The Supervised SRL Pipeline
Pipelined Training: Train each component of the pipeline independently using the predictions of the previous stage(s) as features.
Tool
Semantic role labeler
Dependency parser
Part-of-speech tagger
Morphological feature extractor
Lemmatizer
• Costly
• Does not work well on informal text
• Resources not available across languages
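The pipelined training just described can be sketched as a chain of components, each consuming the raw tokens plus everything predicted so far. This is a minimal illustration, not the paper's actual system; the stage names, the `callable(tokens, annotations)` interface, and the toy constant-output stages are all assumptions for the sketch.

```python
# Minimal sketch of a supervised SRL pipeline: components run in a fixed
# order, and each later stage sees the predictions of the earlier ones.
# Stage names and interfaces are illustrative, not the paper's code.

PIPELINE = ["lemmatizer", "morphology", "pos_tagger", "dep_parser", "srl"]

def run_pipeline(tokens, models):
    """models maps stage name -> callable(tokens, annotations) -> labels."""
    annotations = {}
    for stage in PIPELINE:
        # Each stage gets the raw tokens plus all earlier predictions,
        # mirroring "predictions of the previous stage(s) as features".
        annotations[stage] = models[stage](tokens, dict(annotations))
    return annotations

# Toy stage factory: "predicts" one constant label per token.
def make_constant_stage(label):
    return lambda tokens, annos: [label for _ in tokens]

models = {s: make_constant_stage(s[:3].upper()) for s in PIPELINE}
out = run_pipeline(["President", "Morsi", "creates", "unrest"], models)
```

In a real pipeline each stage would be trained independently on gold annotations, which is exactly what makes the setup costly in a low-resource language.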
Our Emphasis
This Talk in a Nutshell
• We want to do SRL in a new language.
• Syntax helps, but is very expensive to annotate.
• Having annotated syntax would be nice, but we can make progress without it!
• Define millions of features using 100+ feature templates
• Incorporate feature ideas from:
  – Koo et al. (2008)
  – Björkelund et al. (2009)
  – Zhao et al. (2009)
  – Lluís et al. (2013)
Unigram Templates:
What about pairs of unigram templates?
Features and Feature Selection
Use Information Gain (IG) to find top unigram templates (Martins et al., 2011)
Then combine top unigram templates to find top bigram templates.
   Property                   Possible values
1  word form                  all word forms
2  lower-case word form       all lower-case forms
3  5-char word form prefix    all 5-char form prefixes
4  capitalization             True, False
5  top-800 word form          top-800 word forms
6  Brown cluster              000, 1100, 010110001, ...
7  Brown cluster, length 5    length-5 prefixes of Brown clusters
8  lemma                      all word lemmas
9  POS tag                    NNP, CD, JJ, DT, ...
10 morphological features     Gender, Case, Number, ... (different across languages)
11 dependency label           SBJ, NMOD, LOC, ...
12 edge direction             Up, Down

Table 1: Word and edge properties in templates.
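The word properties in Table 1 are simple surface attributes, so extracting them for a single token is straightforward. Below is a sketch under stated assumptions: the tiny `BROWN` lookup table stands in for a real trained Brown clustering, and the property key names are invented for illustration.

```python
# Sketch of extracting several Table 1 word properties for one token.
# The Brown-cluster table is a toy stand-in for a trained clustering.

BROWN = {"president": "1100", "morsi": "010110001", "creates": "000"}

def word_properties(form, lemma=None, pos=None):
    cluster = BROWN.get(form.lower(), "")
    return {
        "form": form,                       # property 1: word form
        "form.lc": form.lower(),            # property 2: lower-cased form
        "form.prefix5": form[:5],           # property 3: 5-char prefix
        "capitalized": form[:1].isupper(),  # property 4: capitalization
        "brown": cluster,                   # property 6: Brown cluster
        "brown.prefix5": cluster[:5],       # property 7: length-5 prefix
        "lemma": lemma,                     # property 8
        "pos": pos,                         # property 9
    }

props = word_properties("Morsi", lemma="morsi", pos="NNP")
```

Cluster prefixes (properties 6 and 7) give coarser, more robust groupings of words than full cluster IDs, which is what makes them useful with little training data.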
i, i-1, i+1          noFarChildren(w_i)   linePath(w_p, w_c)
parent(w_i)          rightNearSib(w_i)    depPath(w_p, w_c)
allChildren(w_i)     leftNearSib(w_i)     depPath(w_p, w_lca)
rightNearChild(w_i)  firstVSupp(w_i)      depPath(w_c, w_lca)
rightFarChild(w_i)   lastVSupp(w_i)       depPath(w_lca, w_root)
leftNearChild(w_i)   firstNSupp(w_i)
leftFarChild(w_i)    lastNSupp(w_i)

Table 2: Word positions used in templates. Based on current word position (i), positions related to current word w_i, possible parent, child (w_p, w_c), lowest common ancestor between parent/child (w_lca), and syntactic root (w_root).
train our CRF models by maximizing conditional log-likelihood using stochastic gradient descent with an adaptive learning rate (AdaGrad) (Duchi et al., 2011) over mini-batches.
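The AdaGrad update can be sketched in a few lines: each parameter gets its own learning rate, scaled down by the accumulated squared gradients for that parameter. This is a minimal sketch of the update rule only; the objective below is a stand-in quadratic, not the CRF conditional log-likelihood, and there is no mini-batching.

```python
import numpy as np

# Minimal AdaGrad (Duchi et al., 2011): per-parameter step sizes
# lr / sqrt(sum of squared past gradients), applied to a toy objective.

def adagrad(grad_fn, theta, lr=0.5, eps=1e-8, steps=200):
    g2 = np.zeros_like(theta)  # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        g2 += g * g
        theta = theta - lr * g / (np.sqrt(g2) + eps)
    return theta

# Toy objective: minimize ||theta - target||^2 via its gradient.
target = np.array([1.0, -2.0])
theta = adagrad(lambda t: 2.0 * (t - target), np.zeros(2))
```

The appeal for sparse NLP features is that rarely-fired features keep large effective step sizes while frequent features are damped, which plain SGD with a global rate does not do.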
The unary and binary factors are defined with exponential family potentials. In the next section, we consider binary features of the observations (the sentence and labels from previous pipeline stages) which are conjoined with the state of the variables in the factor.
3.3 Features for CRF Models
Our feature design stems from two key ideas. First, for SRL, it has been observed that feature bigrams (the concatenation of simple features such as a predicate's POS tag and an argument's word) are important for state-of-the-art results (Zhao et al., 2009; Björkelund et al., 2009). Second, for syntactic dependency parsing, combining Brown cluster features with word forms or POS tags yields high accuracy even with little training data (Koo et al., 2008).
We create binary indicator features for each model using feature templates. Our feature template definitions build from those used by the top performing systems in the CoNLL-2009 Shared Task, Zhao et al. (2009) and Björkelund et al. (2009), and from features in syntactic dependency parsing (McDonald et al., 2005; Koo et al., 2008).
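A feature-bigram indicator is just the conjunction of two unigram template instantiations, e.g. the predicate's POS tag concatenated with the argument's word form. A minimal sketch, with invented template and feature-name conventions:

```python
# Sketch of binary indicator features from templates. A unigram template
# extracts one named property; a bigram template conjoins two of them.
# The "name=value" / "&" naming scheme here is illustrative only.

def unigram(template, props):
    name, key = template
    return f"{name}={props[key]}"

def bigram(t1, t2, pred_props, arg_props):
    # E.g. predicate POS tag conjoined with argument word form.
    return unigram(t1, pred_props) + "&" + unigram(t2, arg_props)

pred = {"pos": "VBZ", "form": "creates"}
arg = {"pos": "NN", "form": "unrest"}
feat = bigram(("p.pos", "pos"), ("a.form", "form"), pred, arg)
```

Each such string names one binary feature that fires (takes value 1) exactly when both conjoined properties match, which is how the indicator features enter the exponential family potentials.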
Template                   Possible values
relative position          before, after, on
distance, continuity       Z+
binned distance            > 2, 5, 10, 20, 30, or 40
genealogical relationship  parent, child, ancestor, descendant
path-grams                 the NN went

Table 3: Additional standalone templates.
Template Creation Feature templates are defined over triples of ⟨property, positions, order⟩. Properties, listed in Table 1, are extracted from word positions within the sentence, shown in Table 2. Single positions for a word w_i include its syntactic parent, its leftmost farthest child (leftFarChild), its rightmost nearest sibling (rightNearSib), etc. Following Zhao et al. (2009), we include the notion of verb and noun supports and sections of the dependency path. Also following Zhao et al. (2009), properties from a set of positions can be put together in three possible orders: as the given sequence, as a sorted list of unique strings, and removing all duplicated neighbored strings. We consider both template unigrams and bigrams, combining two templates in sequence.
Additional templates we include are the relative position (Björkelund et al., 2009), genealogical relationship, distance (Zhao et al., 2009), and binned distance (Koo et al., 2008) between two words in the path. From Lluís et al. (2013), we use 1-, 2-, 3-gram path features of words/POS tags (path-grams), and the number of non-consecutive token pairs in a predicate-argument path (continuity).
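The three property orderings from Zhao et al. (2009) described above can be sketched directly; the function names are mine, but each implements exactly one of the stated orders.

```python
# The three orderings for a sequence of properties along a path
# (following Zhao et al., 2009): the given sequence, a sorted list of
# unique strings, and the sequence with duplicated neighbors collapsed.

def as_sequence(props):
    return list(props)

def sorted_unique(props):
    return sorted(set(props))

def dedup_neighbors(props):
    out = []
    for p in props:
        if not out or out[-1] != p:
            out.append(p)
    return out

# E.g. POS tags along a dependency path:
path = ["NN", "NN", "VBZ", "NN"]
```

Each ordering trades specificity for sparsity differently: the raw sequence is most informative but rarest, while the sorted-unique form fires for many distinct paths.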
3.4 Feature Selection
Constructing all feature template unigrams and bigrams would yield an unwieldy number of features. We therefore determine the top N template bigrams for a dataset and factor a according to an information gain measure (Martins et al., 2011):

IG_{a,m} = \sum_{f \in T_m} \sum_{x_a} p(f, x_a) \log_2 \frac{p(f, x_a)}{p(f)\, p(x_a)}

where T_m is the m-th feature template, f is a particular instantiation of that template, and x_a is an assignment to the variables in factor a. The probabilities are empirical estimates computed from the training data. This is simply the mutual information of the feature template instantiation with the variable assignment.
This filtering approach was treated as a simple baseline in Martins et al. (2011) to contrast with increasingly popular gradient-based regularization approaches. Unlike the gradient-based ap-
Gap between supervised and not-so-supervised parser is very large
Subtractive Experiments: Effectiveness of our joint models as the available supervision is decreased
[Bar chart: SRL F1 (0-80) for Catalan, Spanish, and German as supervision is removed, with the number of feature templates (unigrams+bigrams) in parentheses: none (127+32), Dep (40+32), Mor (30+32), POS (23+32), Lem (21+32).]
CoNLL-2009 Supervised Data
Semantic roles
Dependency parses
Morphology
Part-of-speech tags
Lemmas
Learning Curves: In the lowest resource setting, joint training yields higher SRL F1 than distant supervision.
heads, which runs counter to our head-based syntactic representation. This creates a mismatched train/test scenario: we must train our model to predict argument heads, but then test on our model's ability to predict argument spans.8 We therefore train our models on the CoNLL-2008 argument heads,9 and post-process and convert from heads to spans using the conversion algorithm available from Johansson and Nugues (2008).10 The heads are either from an MBR tree or an oracle tree. This gives Boxwell et al. (2011) the advantage, since our syntactic dependency parses are optimized to pick out semantic argument heads, not spans.
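The core idea of the head-to-span conversion is that an argument head's span is the yield of its subtree in the dependency parse. The sketch below illustrates only that core idea; it is not Johansson and Nugues' (2008) actual Algorithm 2, and their special handling of an argument head that dominates the predicate is omitted.

```python
# Simplified head-to-span conversion: take the span dominated by the
# argument head in the dependency tree. Omits the special case where
# the argument head dominates the predicate's head.

def subtree_span(head, parents):
    """parents[i] = index of token i's head (-1 for the root).
    Returns (lo, hi), the inclusive token range dominated by `head`."""
    children = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)
    stack, dominated = [head], []
    while stack:
        node = stack.pop()
        dominated.append(node)
        stack.extend(children.get(node, []))
    return min(dominated), max(dominated)

# "The kids ran the marathon": the->kids, kids->ran, ran->root,
# the->marathon, marathon->ran. Argument head "marathon" is index 4.
span = subtree_span(4, [1, 2, -1, 4, 2])
```

For projective trees the subtree yield is contiguous, so the (min, max) pair is a valid span; non-projective parses would need extra care.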
Table 5 presents our results. Boxwell et al. (2011) (B'11) uses additional supervision in the form of a CCG tag dictionary derived from supervised data with (tdc) and without (tc) a cutoff. Our model does very poorly on the '05 span-based evaluation because the constituent bracketing of the marginalized trees is inaccurate. This is elucidated by instead evaluating on the oracle spans, where our F1 scores are higher than Boxwell et al. (2011). We also contrast with relevant high-resource methods with span/head conversions from Johansson and Nugues (2008): Punyakanok et al. (2008) (PRY'08) and Johansson and Nugues (2008) (JN'08).
Subtractive Study In our subsequent experiments, we study the effectiveness of our models as the available supervision is decreased. We incrementally remove dependency syntax, morphological features, POS tags, then lemmas. For these experiments, we utilize the coarse-grained feature set (IG_C), which includes Brown clusters.
Across languages, we find the largest drop in F1 when we remove POS tags; and we find a gain in F1 when we remove lemmas. This indicates that lemmas, which are a high-resource annotation, may not provide a significant benefit for this task. The effect of removing morphological features is different across languages, with little change in performance for Catalan and Spanish,
8 We were unable to obtain the system output of Boxwell et al. (2011) in order to convert their spans to dependencies and evaluate the other mismatched train/test setting.
9 CoNLL-2005, -2008, and -2009 were derived from PropBank and share the same source text; -2008 and -2009 use argument heads.
10 Specifically, we use their Algorithm 2, which produces the span dominated by each argument, with special handling of the case when the argument head dominates that of the predicate. Also following Johansson and Nugues (2008), we recover the '05 sentences missing from the '08 evaluation set.
Table 6: Subtractive experiments. Each row contains the F1 for SRL only (without sense disambiguation) where the supervision type of that row and all above it have been removed. Removed supervision types (Rem) are: syntactic dependencies (Dep), morphology (Mor), POS tags (POS), and lemmas (Lem). #FT indicates the number of feature templates used (unigrams+bigrams).
Figure 3: Learning curve for semantic dependency supervision in Catalan and German. F1 of SRL only (without sense disambiguation) shown as the number of training sentences is increased.
but a drop in performance for German. This may reflect a difference between the languages, or may reflect the difference between the annotation of the languages: both the Catalan and Spanish data originated from the Ancora project,11 while the German data came from another source.
Figure 3 contains the learning curve for SRL supervision in our lowest resource setting for two example languages, Catalan and German. This shows how F1 of SRL changes as we adjust the number of training examples. We find that the joint training approach to grammar induction yields consistently higher SRL performance than its distantly supervised counterpart.
4.5 Analysis of Grammar Induction
Table 7 shows grammar induction accuracy in low-resource settings. We find that the gap between the supervised parser and the unsupervised methods is quite large, despite the reasonable accuracy both methods achieve for the SRL end task.
IG_C: Features from Information Gain template selection, Coarse-Grained properties
IG_B: Features from Information Gain template selection, Björkelund et al. (2009) properties
Low-Resource
High-Resource
Comparison with Work in Grammar Induction in Low-Resource Setting