The Leaf Projection Path View of Parse Trees: Exploring String Kernels for HPSG Parse Selection Kristina Toutanova, Penka Markova, Christopher Manning.
The Leaf Projection Path View of Parse Trees: Exploring
String Kernels for HPSG Parse Selection
Kristina Toutanova, Penka Markova, Christopher Manning
Computer Science Department, Stanford University
Motivation: the task
"I would like to meet with you again on Monday"
Input: a sentence
Classify it to one of its possible parses
Focus on discriminating among parses
Motivation: traditional representation of parse trees
Features are pieces of local rule productions with grand-parenting
When using plain context-free rules, most features make no reference to the input string – naive for a discriminative model!
Lexicalization with the head word introduces more connection to the input
[Figure: local tree fragments lexicalized with head words such as "meet", "to", "on"]
Motivation: traditional representation of parse trees
All-subtrees representation: features are (a restricted kind of) subtrees of the original tree
One must select features or discount larger trees
General idea: representation
Provides a broader view of tree contexts
Increases connection to the input string (words)
Captures examples of non-head dependencies, as in "more careful than his sister" (Bod 98)
Trees are lists of leaf projection paths
Non-head path is included in addition to the head path
Each node is lexicalized with all words dominated by it
Trees must be binarized
General idea: tree kernels
Often only a kernel (a similarity measure) between trees is necessary for ML algorithms.
Measure the similarity between trees by the similarity between projection paths of common words / POS tags in the trees.
General idea: tree kernels from string kernels
Measures of similarity between sequences (strings) have been developed for many domains.
Use string kernels between projection paths and combine them into a tree kernel via a convolution.
This gives rise to interesting features and more global modeling of the syntactic environment of words.
[Figure: two projection paths of "meet" (node labels S, VP, VP-NF, ...) taken from competing parse trees, compared with a string similarity measure SIM]
Overview
HPSG syntactic analyses representation
Illustration of the leaf projection paths representation
Comparison to the traditional rule representation: experimental results
Tree kernels from string kernels on projection paths
Experimental results
HPSG tree representation: derivation trees
[Figure: HPSG derivation tree for "let us plan on that"; internal nodes are rule names (IMPER, HCOMP) and preterminals are lexical item ids (LET_V1, US, PLAN_ON_V2, ON, THAT_DEIX)]
HPSG – Head-Driven Phrase Structure Grammar; a lexicalized, unification-based grammar
ERG grammar of English
Node labels are rule names such as head-complement and head-adjunct
The inventory of rules is larger than in traditional HPSG grammars
Full HPSG signs can be recovered from the derivation trees using the grammar
We use annotated derivation trees as the main representation for disambiguation
HPSG tree representation: annotation of nodes
[Figure: the derivation tree for "let us plan on that", with each node annotated with its synsem.local.cat.head value]
Annotation with the value of synsem.local.cat.head
Its values are a small set of part-of-speech tags, e.g. verb, prep*
HPSG tree representation: syntactic word classes
The word classes are around 500 types in the HPSG type hierarchy.
They show detailed syntactic information, including e.g. subcategorization.
Our representation heavily uses word classes to back off from words.
[Figure: lexical item ids and word classes for the example words: let (LET_V1, v_sorb), us (US, n_pers_pro), plan (PLAN_ON_V2, v_empty_prep_intrans), on (ON, p_reg), that (THAT_DEIX, n_deictic_pro)]
Leaf projection paths representation
[Figure: the annotated derivation tree, with the projection path of "let" highlighted]
• The tree is represented as a list of paths from the words to the top.
• The paths are keyed by words and corresponding word classes.
• The head and non-head paths are treated separately.
Projection paths of "let" (word class v_sorb):
  head path:     START LET_V1:verb HCOMP:verb HCOMP:verb IMPER:verb END
  non-head path: START END
Leaf projection paths representation (continued)
[Figure: the annotated derivation tree, with the projection path of "plan" highlighted]
Projection paths of "plan" (word class v_empty_prep_intrans):
  head path:     START PLAN_ON_V2:verb HCOMP:verb END
  non-head path: START HCOMP:verb IMPER:verb END
Leaf projection paths representation (continued)
[Figure: the annotated derivation tree with the leaf projection paths of all words]
Local rules can be recovered by annotating path nodes with sister and parent categories.
Now we extract features from this representation for discriminative models; a sketch of the path extraction follows.
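A minimal sketch of how head and non-head projection paths could be extracted from a binarized, head-marked derivation tree. The Node structure, the head_child convention, and the START/END wrapping are our assumptions; this is not the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str                        # rule name (e.g. "HCOMP") or lexical item id (e.g. "LET_V1")
    head_value: str                   # synsem.local.cat.head annotation, e.g. "verb"
    word: Optional[str] = None        # set on leaf (lexical) nodes
    word_class: Optional[str] = None  # e.g. "v_sorb"
    children: List["Node"] = field(default_factory=list)
    head_child: int = 0               # index of the head daughter

def lexical_head(node: Node) -> Node:
    """Follow head daughters down to the lexical head of a constituent."""
    while node.children:
        node = node.children[node.head_child]
    return node

def projection_paths(root: Node):
    """Map each (word, word_class) leaf to its (head_path, non_head_path).

    Paths are lists of "LABEL:head_value" symbols wrapped in START/END;
    the head path covers the nodes this word is the lexical head of,
    the non-head path covers the remaining nodes up to the root.
    """
    paths = {}

    def walk(node: Node, ancestors: List[Node]) -> None:
        if not node.children:
            chain = [node] + ancestors                     # leaf upward to the root
            labels = [f"{n.label}:{n.head_value}" for n in chain]
            k = 0
            while k < len(chain) and lexical_head(chain[k]) is node:
                k += 1
            paths[(node.word, node.word_class)] = (
                ["START"] + labels[:k] + ["END"],          # head path
                ["START"] + labels[k:] + ["END"],          # non-head path
            )
        else:
            for child in node.children:
                walk(child, [node] + ancestors)

    walk(root, [])
    return paths
```

On the example tree this would yield, e.g., the head path START LET_V1:verb HCOMP:verb HCOMP:verb IMPER:verb END and the empty non-head path START END for "let".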
Overview
HPSG syntactic analyses representation
Illustration of the leaf projection paths representation
Comparison to the traditional rule representation: experimental results
Tree kernels from string kernels on projection paths
Experimental results
Machine learning task setup
Given m training sentences $(s_i, \Phi(t_{i,1}), \ldots, \Phi(t_{i,p_i}))$
Sentence $s_i$ has $p_i$ possible analyses and $t_{i,1}$ is the correct analysis
Learn a parameter vector $w$ and choose for a test sentence the tree $t$ with the maximum score $w \cdot \Phi(t)$
Linear models, e.g. (Collins 00)
Choosing the parameter vector
Previous formulations: (Collins 01), (Shen and Joshi 03)
We solve this problem using SVMlight for ranking.
For all models we extract all features from the kernel's feature map and solve the problem with a linear kernel.

$$\min_{w,\,\xi} \ \frac{1}{2}\, w \cdot w \; + \; C \sum_{i,j} \xi_{i,j}$$
$$\text{s.t.}\quad \forall i,\ \forall j > 1:\quad w \cdot \bigl(\Phi(t_{i,1}) - \Phi(t_{i,j})\bigr) \geq 1 - \xi_{i,j}, \qquad \xi_{i,j} \geq 0$$
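As a rough illustration of this formulation (not the authors' actual training setup), the ranking constraints can be reduced to a standard binary SVM over pairwise difference vectors Φ(t_{i,1}) - Φ(t_{i,j}); the sketch below uses scikit-learn's LinearSVC in place of SVMlight, and all function and variable names are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_ranker(candidate_feats, C=1.0):
    """candidate_feats: list over sentences; each entry is a 2-D array whose
    row 0 is the feature vector of the correct parse and the remaining rows
    are the incorrect candidates."""
    deltas, labels = [], []
    for feats in candidate_feats:
        correct = feats[0]
        for wrong in feats[1:]:
            d = correct - wrong
            deltas.extend([d, -d])      # symmetric pair keeps the binary problem balanced
            labels.extend([1, -1])
    clf = LinearSVC(C=C, fit_intercept=False, loss="hinge")
    clf.fit(np.array(deltas), np.array(labels))
    return clf.coef_.ravel()            # the parameter vector w

def best_parse(w, feats):
    """Pick the candidate with the highest score w . Phi(t)."""
    return int(np.argmax(feats @ w))
```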
The leaf projection paths view versus the context free rule view
Goals:
Compare context-free rule models to projection path models
Evaluate the usefulness of non-head paths
Models
Projection paths:
Bi-gram model on projection paths (2PP)
Bi-gram model on head projection paths only (2HeadPP)
Context-free rules:
Joint rule model (J-Rule)
Independent rule model (I-Rule)
The leaf projection paths view versus the context free rule view
2PP has as features bi-grams from the projection paths.
[Figure: the annotated derivation tree, with one HCOMP node (spanning "plan on that") highlighted]
Features of 2PP including the node HCOMP:
plan (head path):
  [v_empty_prep_intrans, PLAN_ON_V2, HCOMP, head]
  [v_empty_prep_intrans, HCOMP, END, head]
on (non-head path):
  [p_reg, START, HCOMP, non-head]
  [p_reg, HCOMP, HCOMP, non-head]
that (non-head path):
  [n_deictic_pro, HCOMP, HCOMP, non-head]
  [n_deictic_pro, HCOMP, HCOMP, non-head]   (this bi-gram occurs twice on the path)
A sketch of this bi-gram feature extraction follows below.
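A small sketch of the 2PP feature extraction described above (our own illustration; path and class names follow the running example, and the :verb annotations are omitted for brevity).

```python
def bigram_features(word_class, path, head=True):
    """Yield 2PP features of the form (word_class, node_i, node_i+1, head|non-head)."""
    tag = "head" if head else "non-head"
    for a, b in zip(path, path[1:]):
        yield (word_class, a, b, tag)

# head path of "plan" from the example tree
plan_head = ["START", "PLAN_ON_V2", "HCOMP", "END"]
print(list(bigram_features("v_empty_prep_intrans", plan_head, head=True)))
# includes ('v_empty_prep_intrans', 'PLAN_ON_V2', 'HCOMP', 'head')
# and      ('v_empty_prep_intrans', 'HCOMP', 'END', 'head')
```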
The leaf projection paths view versus the context free rule view
I-Rule has as features edges of the tree, annotated with the word class of the child and head vs. non-head information.
Features of I-Rule including the node HCOMP:
  [v_empty_prep_intrans, PLAN_ON_V2, HCOMP, head]
  [p_reg, HCOMP, HCOMP, non-head]
  [v_empty_prep_intrans, HCOMP, HCOMP, non-head]
Comparison results: Redwoods corpus
3829 ambiguous sentences; average number of words 7.8; average ambiguity 10.8; 10-fold cross-validation; we report exact match accuracy.

Model     Accuracy (%)
2HeadPP   80.14
J-Rule    80.99
I-Rule    81.07
2PP       82.70

Non-head paths are useful (13% relative error reduction over head-only paths)
The bi-gram model on projection paths performs better than a very similar local rule-based model
Overview
HPSG syntactic analyses representation
Illustration of the leaf projection paths representation
Comparison to the traditional rule representation: experimental results
Tree kernels from string kernels on projection paths
Experimental results
String kernels on projection paths
We looked at a bi-gram model on projection paths (2PP).
This is a special case of a string kernel (n-gram kernel).
We could use more general string kernels on projection paths: existing ones that handle non-contiguous substrings or more complex matching of nodes.
It is straightforward to combine them into tree kernels.
Formal representation of parse trees
A parse tree t is represented as a list of keyed strings: $t = [(key_1, x_1), \ldots, (key_m, x_m)]$
  key1 = let (head),        x1 = "START LET_V1:verb HCOMP:verb HCOMP:verb IMPER:verb END"
  key2 = v_sorb (head),     x2 = x1
  key3 = let (non-head),    x3 = "START END"
  key4 = v_sorb (non-head), x4 = x3
Tree kernels using string kernels on projection paths
For trees $t = [(key_1, x_1), \ldots, (key_m, x_m)]$ and $t' = [(key'_1, x'_1), \ldots, (key'_n, x'_n)]$:
$$KT(t, t') = \sum_{i=1}^{m} \sum_{j=1}^{n} KP\bigl((key_i, x_i), (key'_j, x'_j)\bigr)$$
$$KP\bigl((key, x), (key', y)\bigr) = \begin{cases} K(x, y) & \text{if } key = key' \\ 0 & \text{otherwise} \end{cases}$$
where K is a string kernel on projection paths.
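A hedged sketch of this convolution: the tree is a list of (key, path) pairs and any string kernel K can be plugged in. The simple n-gram kernel below is only a stand-in, and all names are ours.

```python
from collections import Counter, defaultdict

def ngram_string_kernel(x, y, n=2):
    """A simple string kernel K(x, y): dot product of n-gram counts."""
    def grams(seq):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    gx, gy = grams(x), grams(y)
    return sum(c * gy[g] for g, c in gx.items())

def tree_kernel(t, t_prime, string_kernel=ngram_string_kernel):
    """KT(t, t') = sum of K(x, y) over pairs of keyed paths with equal keys."""
    by_key = defaultdict(list)
    for key, path in t_prime:
        by_key[key].append(path)
    total = 0.0
    for key, x in t:
        for y in by_key.get(key, []):
            total += string_kernel(x, y)
    return total
```

Here a tree would be passed as, e.g., [(("let", "head"), ["START", "LET_V1:verb", ..., "END"]), ...], keyed by word or word class plus the head / non-head tag.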
String kernels overview
Define string kernels by their feature map from strings to vectors indexed by feature indices
Example: 1-gram kernel
For the string "START LET_V1 HCOMP HCOMP IMPER END", the 1-gram feature map is:
  START: 1, LET_V1: 1, HCOMP: 2, IMPER: 1, END: 1 (all other features are 0)
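A minimal sketch of the feature-map view for the 1-gram kernel (our illustration, not the paper's implementation).

```python
from collections import Counter

def unigram_feature_map(path):
    """phi(x): counts of each symbol in the path."""
    return Counter(path)

def unigram_kernel(x, y):
    """K(x, y) = <phi(x), phi(y)>."""
    fx, fy = unigram_feature_map(x), unigram_feature_map(y)
    return sum(c * fy[s] for s, c in fx.items())

path = ["START", "LET_V1", "HCOMP", "HCOMP", "IMPER", "END"]
print(unigram_feature_map(path))   # HCOMP counted twice, all other symbols once
```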
Repetition kernel
General idea: improve on the 1-gram kernel by better handling repeated symbols.
"He eats chocolate from Belgium with fingers."
Head path of "eats" under high attachment: (NP PP PP NP)
Rather than the feature for PP having twice as much weight, there should be a separate feature indicating that there are two PPs.
The feature space is indexed by strings of repeated symbols (a, aa, aaa, ...), with two discount factors: λ1 for gaps and λ2 for letters.
Example: for (NP PP PP NP), in addition to the 1-gram features φ_NP and φ_PP, the repetition feature φ_PP PP fires, discounted by λ1 for any gaps and λ2 per symbol.
The Repetition kernel versus 1-gram and 2-gram

Kernel        Features    Accuracy (%)
1-gram        44,278      82.21
Repetition    52,994      83.59
2-gram        104,331     84.15

The Repetition kernel achieves a 7.8% error reduction over the 1-gram kernel.
Other string kernels
So far: 1-gram, 2-gram, repetition
Next: allow general discontinuous n-grams (restricted subsequence kernel)
Also: allow partial matching (wildcard kernel), allowing a wild-card character in the n-gram features; the wildcard matches any character
(Lodhi et al. 02; Leslie and Kuang 03)
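A rough sketch in the spirit of the (k, m) wildcard kernel of Leslie and Kuang (2003): each k-gram contributes to every pattern obtained by turning at most m of its positions into a wildcard, discounted per wildcard. The parameter names and the discount lam are our assumptions, not the paper's exact definition.

```python
from collections import Counter
from itertools import combinations

def wildcard_feature_map(path, k=3, m=1, lam=0.5):
    """phi(x): sparse map from wildcard patterns to weights."""
    phi = Counter()
    for i in range(len(path) - k + 1):
        gram = tuple(path[i:i + k])
        for n_wild in range(m + 1):
            for positions in combinations(range(k), n_wild):
                pattern = tuple('*' if j in positions else gram[j] for j in range(k))
                phi[pattern] += lam ** n_wild   # exact k-gram gets weight 1
    return phi

def wildcard_kernel(x, y, **kw):
    fx, fy = wildcard_feature_map(x, **kw), wildcard_feature_map(y, **kw)
    return sum(v * fy[p] for p, v in fx.items())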
Restricted subsequence kernel
Parameters: k – maximum size of the feature n-gram; g – maximum span in the string; λ1 – gap penalty; λ2 – letter penalty
Example with k = 2, g = 5, λ1 = .5, λ2 = 1, for the string "START LET_V1 HCOMP HCOMP IMPER END":
  1-gram features: START: 1, LET_V1: 1, HCOMP: 2, IMPER: 1, END: 1
  some 2-gram features: (HCOMP, IMPER): 1.5   (IMPER, END): 1   (START, IMPER): .125   (START, END): 0 (its span exceeds g)
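A hedged sketch of the restricted subsequence feature map, under the reading of the parameters that reproduces the example values above (a subsequence of n symbols spanning `span` positions gets weight lam2**n * lam1**(span - n), and is dropped if the span exceeds g). This is our reconstruction, not the authors' code.

```python
from collections import Counter
from itertools import combinations

def subseq_feature_map(path, k=2, g=5, lam1=0.5, lam2=1.0):
    """phi(x): weights of (possibly discontinuous) subsequences up to length k."""
    phi = Counter()
    for n in range(1, k + 1):
        for idx in combinations(range(len(path)), n):
            span = idx[-1] - idx[0] + 1
            if span > g:
                continue            # restricted: a feature may not span more than g symbols
            feature = tuple(path[i] for i in idx)
            phi[feature] += (lam2 ** n) * (lam1 ** (span - n))
    return phi

def subseq_kernel(x, y, **kw):
    fx, fy = subseq_feature_map(x, **kw), subseq_feature_map(y, **kw)
    return sum(v * fy[f] for f, v in fx.items())

path = ["START", "LET_V1", "HCOMP", "HCOMP", "IMPER", "END"]
phi = subseq_feature_map(path)
print(phi[("HCOMP", "IMPER")], phi[("START", "IMPER")])   # 1.5 0.125
```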
Varying the string kernels on word class keyed paths

Kernel                 Features   Accuracy (%)
1-gram                 13K        81.43
2-gram                 37K        82.70
subseq (2,3,.50,2)     81K        83.22
subseq (2,3,.25,2)     81K        83.48
subseq (2,4,.50,2)     102K       83.29
subseq (3,5,.25,2)     416K       83.06
Increasing the amount of discontinuity or adding larger n-grams did not help.
Adding word keyed paths

Kernel on word class keyed paths   Word classes   Word classes + words
subseq (2,3,.50,2)                 83.22          84.96
subseq (2,3,.25,2)                 83.48          84.75
subseq (2,4,.50,2)                 83.29          84.40

Best previous result from a single classifier: 82.7 (a mostly local rule-based model); the relative error reduction is 13%.
The kernel for word keyed paths was fixed to 2-gram + repetition.
Other models and model combination
Many features are available in the HPSG signs.
A single model is likely to over-fit when given too many features.
To better use the additional information, train several classifiers and combine them by voting.

Best single model: 84.96    Model combination: 85.4

Best previous result from voting classifiers: 84.23% (Osborne & Baldridge 04)
Conclusions and future work
Summary
We presented a new representation of parse trees leading to a tree kernel
It allows the modeling of more global tree contexts as well as greater lexicalization
We demonstrated gains from applying existing string kernels on projection paths, and new kernels useful for the domain (the Repetition kernel)
The major gains were due to the representation
Future work
Other sequence kernels better suited for the task
Feature selection: which words / word classes deserve better modeling of their leaf paths
Other corpora