ESSLLI 2006
Treebank-Based Acquisition of LFG, HPSG and CCG Resources
Advanced Course: Treebank-Based Acquisition of LFG, HPSG and CCG Resources
Josef van Genabith, Dublin City University
Yusuke Miyao, University of Tokyo
Julia Hockenmaier, University of Pennsylvania and University of Edinburgh
ESSLLI 2006, 18th European Summer School for Language, Logic and Information, University of Malaga, July – August 2006
• Josef van Genabith, National Centre for Language Technology (NCLT), School of Computing, Dublin City University, Dublin 9, Ireland, [email protected]

What do grammars do?
• Grammars define languages as sets of strings
• Grammars define which strings are grammatical and which are not
• Grammars tell us about the syntactic structure associated with strings

“Shallow” vs. “deep” grammars:
• Shallow grammars do all of the above
• Deep grammars, in addition, relate text to information/meaning representations
• Information: predicate-argument-adjunct structure, deep dependency relations, logical forms, …
• In natural languages, linguistic material is not always interpreted locally where you encounter it: long-distance dependencies (LDDs)
• Resolution of LDDs is crucial to construct accurate and complete information/meaning representations
• Deep grammars := (text <-> meaning) + (LDD resolution)

• Traditionally, deep constraint-based grammars are hand-crafted: LFG ParGram, HPSG LinGO/ERG, Core Language Engine (CLE), Alvey Tools, RASP, ALPINO, …
• Wide-coverage, deep unification (constraint-based) grammar development is knowledge-intensive and expensive!
• Very hard to scale hand-crafted grammars to unrestricted text!
• English XLE (Riezler et al. 2002); German XLE (Forst and Rohrer 2006); Japanese XLE (Masuichi and Okuma 2003); RASP (Carroll and Briscoe 2002); ALPINO (Bouma, van Noord and Malouf, 2000)
Motivation
• Instance of the “knowledge acquisition bottleneck” familiar from classical “rationalist” rule/knowledge-based AI/NLP
• Alternative to classical “rationalist” rule/knowledge-based AI/NLP: the “empiricist” research paradigm (AI/NLP):
  – Corpora, treebanks, …, machine-learning-based and statistical approaches, …
  – Treebank-based grammar acquisition, probabilistic parsing
  – Advantage: grammars can be induced (learned) automatically
  – Very low development cost, wide-coverage, robust, but …
• Most treebank-based grammar induction/parsing technology produces “shallow” grammars
• Shallow grammars don’t resolve LDDs (but see (Johnson 2002); …) and do not map strings to information/meaning representations …
Motivation
• This poses a research question: can we address the knowledge acquisition bottleneck for deep grammar development by combining insights from the rationalist and empiricist research paradigms?
• Specifically:
  – Can we automatically acquire wide-coverage, “deep”, probabilistic, constraint-based grammars from treebanks?
  – How do we use them in parsing?
  – Can we use them for generation?
  – Can we acquire resources for different languages and treebank encodings?
  – How do these resources compare with hand-crafted resources?
  – …
Course Overview
Monday: Motivation, Course Overview, Introductions to TAG, LFG, CCG, HPSG and the Penn-II Treebank, TAG Resources
Tuesday: Penn-II-Based Acquisition of LFG Resources
Wednesday: Penn-II-Based Acquisition of CCG Resources
Thursday: Penn-II-Based Acquisition of HPSG Resources
Friday: Multilingual Resources, Formal Semantics, Comparing LFG, CCG, HPSG and TAG-Based Approaches, Demos, Current and Future Work, Discussion
Course Overview
Tuesday/Wednesday/Thursday
Penn-II-Based Acquisition of XXG Resources:
• Treebank Preprocessing/Clean-Up
• Treebank Annotation/Conversion
• Grammar and Lexicon Extraction
• Parsing (Architectures, Probability Models, Evaluation)
• Generation (Architectures, Probability Models, Evaluation)
• Other (Semantics, Domain Variation, …)
Grammar Formalisms
Grammar formalisms and linguistic theories
• Linguistics aims to explain natural language:
  – What is universal grammar?
  – What are language-specific constraints?
• Formalisms are mathematical theories:
  – They provide a language in which linguistic theories can be expressed (like calculus for physics)
  – They define elementary objects (trees, strings, feature structures) and recursive operations which generate complex objects from simple objects
  – They do impose linguistic constraints (e.g. on the kinds of dependencies they can capture)
Lexicalised Grammar Formalisms:
TAG, CCG, LFG and HPSG
Lexicalised formalisms (TAG, CCG, LFG and HPSG)
• The lexicon:
  – pairs words with elementary objects
  – specifies all language-specific information (number and location of arguments, control and binding theory)
• The grammatical operations:
  – are universal
  – define (and impose constraints on) recursion
TAG, CCG, LFG and HPSG
• They describe different kinds of linguistic objects:
  – TAG is a theory of trees
  – CCG is a theory of (syntactic and semantic) types
  – LFG is a multi-level theory based on a projection architecture relating different types of linguistic objects (trees, AVMs, linear logic–based semantics)
  – HPSG uses a single, uniform formalism (typed feature structures) to describe phonological, morphological, syntactic and semantic representations (signs)
• They differ in details: treatment of wh-movement, coordination, etc.
TAG, CCG, LFG and HPSG
• TAG and CCG are weakly equivalent.
• Both are mildly context-sensitive:
  – can capture Dutch crossing dependencies
  – but are still efficiently parseable (in polynomial time)
• LFG is context-sensitive
Tree-Adjoining Grammar (TAG)
(Lexicalized) Tree-Adjoining Grammar
• TAG is a tree-rewriting formalism:
  – TAG defines operations (substitution and adjunction) on trees
  – The elementary objects in TAG are trees (not strings)
• TAG is lexicalized:
  – Each elementary tree is anchored to a lexical item (word)
  – “Extended domain of locality”: the elementary tree contains all arguments of the anchor
  – TAG requires a linguistic theory which specifies the shape of these elementary trees
• TAG is mildly context-sensitive:
  – can capture Dutch crossing dependencies
  – but is still efficiently parseable

A.K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rozenberg and A. Salomaa, Eds., Handbook of Formal Languages
TAG substitution (arguments)
[Figure: an initial tree rooted in Y is substituted at a Y substitution node of another tree; the derived tree and the derivation tree are shown]
TAG adjunction (modifiers)

[Figure: an auxiliary tree rooted in X, with foot node X*, is adjoined at an X node of the host tree; the derived tree and the derivation tree are shown]
A small TAG lexicon

eats:   (S NP↓ (VP (VBZ eats) NP↓))
John:   (NP John)
always: (VP (RB always) VP*)   [auxiliary tree]
tapas:  (NP tapas)
A TAG derivation
[Figure: the elementary trees for John, eats, tapas and always, with NP↓ substitution sites marked]
A TAG derivation
[Figure: John and tapas substituted at the NP↓ sites, yielding (S (NP John) (VP (VBZ eats) (NP tapas))); the auxiliary tree (VP (RB always) VP*) is adjoined at the VP node]
A TAG derivation
[Figure: the derived tree after adjunction: (S (NP John) (VP (RB always) (VP (VBZ eats) (NP tapas))))]
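The derivation above (two substitutions, one adjunction) can be sketched with a toy implementation. The tuple encoding and all helper names are illustrative choices of this write-up, not code from the course: a tree is (label, kids), where kids is a word string, "@" (a substitution site), "*" (a foot node), or a tuple of subtrees.

```python
def _subst(tree, site, arg):
    label, kids = tree
    if kids == "@" and label == site:
        return arg, True
    if isinstance(kids, tuple):
        out, done = [], False
        for k in kids:
            k2, done = (k, done) if done else _subst(k, site, arg)
            out.append(k2)
        return (label, tuple(out)), done
    return tree, False

def substitute(tree, site, arg):
    """Substitute initial tree `arg` at the leftmost `site` substitution node."""
    new, done = _subst(tree, site, arg)
    assert done, f"no substitution site {site}"
    return new

def adjoin(tree, aux):
    """Adjoin auxiliary tree `aux` at the leftmost internal node matching its root."""
    def plug_foot(t):
        label, kids = t
        if kids == "*":
            return node  # the foot node is replaced by the excised subtree
        if isinstance(kids, tuple):
            return (label, tuple(plug_foot(k) for k in kids))
        return t
    def walk(t):
        nonlocal node
        label, kids = t
        if label == aux[0] and isinstance(kids, tuple):
            node = t
            return plug_foot(aux), True
        if isinstance(kids, tuple):
            out, done = [], False
            for k in kids:
                k2, done = (k, done) if done else walk(k)
                out.append(k2)
            return (label, tuple(out)), done
        return t, False
    node = None
    new, done = walk(tree)
    assert done, "no adjunction site"
    return new

def leaves(t):
    label, kids = t
    if isinstance(kids, str):
        return [] if kids in ("@", "*") else [kids]
    return [w for k in kids for w in leaves(k)]

# The small lexicon from the slides:
eats   = ("S", (("NP", "@"), ("VP", (("VBZ", "eats"), ("NP", "@")))))
john   = ("NP", "John")
tapas  = ("NP", "tapas")
always = ("VP", (("RB", "always"), ("VP", "*")))  # auxiliary tree

derived = adjoin(substitute(substitute(eats, "NP", john), "NP", tapas), always)
print(" ".join(leaves(derived)))  # John always eats tapas
```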
Combinatory Categorial Grammar (CCG)
Combinatory Categorial Grammar
• CCG is a lexicalized grammar formalism (the “rules” of the grammar are completely general; all language-specific information is given in the lexicon)
• CCG is mildly context-sensitive (can capture Dutch crossing dependencies, but is still efficiently parseable)
• CCG has a flexible constituent structure
• CCG has a simple, unified treatment of extraction and coordination
• CCG has a transparent syntax-semantics interface (every syntactic category and operation has a semantic counterpart)
• CCG rules are monotonic (movement or traces don’t exist)
• CCG rules are type-driven, not structure-driven (this means e.g. that intransitive verbs and VPs are indistinguishable)
CCG: the machinery

• Categories: specify subcat lists of words/constituents.
• Combinatory rules: specify how constituents can combine.
• The lexicon: specifies which categories a word can have.
• Derivations: spell out the process of combining constituents.
CCG categories
• Simple categories: NP, S, PP
• Complex categories: functions which return a result when combined with an argument:
  – VP or intransitive verb: S\NP
  – Transitive verb: (S\NP)/NP
  – Adverb: (S\NP)\(S\NP)
  – PPs: ((S\NP)\(S\NP))/NP, (NP\NP)/NP
• Every category has a semantic interpretation
Function application
• Combines a function with its argument to yield a result:

  (S\NP)/NP  NP  ->  S\NP      (eats + tapas -> eats tapas)
  NP  S\NP   ->  S             (John + eats tapas -> John eats tapas)

• Used in all variants of categorial grammar
A (C)CG derivation
Type-raising and function composition
• Type-raising: turns an argument into a function. Corresponds to case.
• We will only be concerned with canonical “normal-form” derivations, which use function composition and type-raising only when syntactically necessary.
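The combinatory rules discussed so far can be sketched over a toy category encoding: an atomic category is a string, a complex category is a (result, slash, argument) tuple. The encoding and names are illustrative assumptions of this write-up, not the course's code.

```python
NP, S = "NP", "S"
TRANS = ((S, "\\", NP), "/", NP)   # (S\NP)/NP, a transitive verb like "eats"

def apply_fwd(f, a):
    # Forward application (>): X/Y  Y  =>  X
    return f[0] if isinstance(f, tuple) and f[1] == "/" and f[2] == a else None

def apply_bwd(a, f):
    # Backward application (<): Y  X\Y  =>  X
    return f[0] if isinstance(f, tuple) and f[1] == "\\" and f[2] == a else None

def compose_fwd(f, g):
    # Forward composition (>B): X/Y  Y/Z  =>  X/Z
    if (isinstance(f, tuple) and isinstance(g, tuple)
            and f[1] == "/" and g[1] == "/" and f[2] == g[0]):
        return (f[0], "/", g[2])
    return None

def type_raise(x, t):
    # Forward type-raising (>T): X  =>  T/(T\X)
    return (t, "/", (t, "\\", x))

# Canonical (normal-form) derivation of "John eats tapas":
vp = apply_fwd(TRANS, NP)        # eats + tapas        ->  S\NP
s1 = apply_bwd(NP, vp)           # John + [eats tapas] ->  S

# The same sentence via type-raising and composition (as used for extraction):
subj = type_raise(NP, S)         # John        ->  S/(S\NP)
s_np = compose_fwd(subj, TRANS)  # John + eats ->  S/NP
s2 = apply_fwd(s_np, NP)         # + tapas     ->  S
print(s1, s2)  # S S
```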
CCG: semantics
• Every syntactic category and rule has a semantic counterpart:
The CCG lexicon
• Pairs words with their syntactic categories (and semantic interpretation):

  eats ⊢ (S\NP)/NP : λx.λy.eats′xy
       ⊢ S\NP : λx.eats′x

• The main bottleneck for wide-coverage CCG parsing
Why use CCG for statistical parsing?
• CCG derivations are binary trees: we can use standard chart parsing techniques.
• CCG derivations represent long-range dependencies and complement-adjunct distinctions directly:
A comparison with Penn Treebank parsers
• Standard Treebank parsers do not recover the null elements and function tags that are necessary for semantic interpretation:
Lexical-Functional Grammar (LFG)
Lexical-Functional Grammar LFG
Lexical-Functional Grammar (LFG) (Bresnan & Kaplan 1981, Bresnan 2001, Dalrymple 2001) is a unification- (or constraint-) based theory of grammar.

Two (basic) levels of representation:
• C-structure: represents surface grammatical configurations such as word order; annotated CFG data structures
• F-structure: represents abstract syntactic functions such as SUBJ(ject), OBJ(ect), OBL(ique), PRED(icate), COMP(lement), ADJ(unct), …; AVMs (attribute-value matrices/structures)

F-structure approximates basic predicate-argument structure, dependency representation, logical form (van Genabith and Crouch, 1996; 1997)
LFG Grammar Rules and Lexical Entries
LFG Parse Tree (with Equations/Constraints)
LFG Constraint Resolution (1/3)
LFG Constraint Resolution (2/3)
LFG Constraint Resolution (3/3)
LFG Subcategorisation & Long Distance Dependencies
• Subcategorisation:
  – Semantic forms (subcat frames): sign<SUBJ, OBJ>
  – Completeness: all GFs in the semantic form are present at the local f-structure
  – Coherence: only the GFs in the semantic form are present at the local f-structure
• Long Distance Dependencies (LDDs): resolved at f-structure with functional uncertainty equations (regular expressions specifying paths in the f-structure).
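The completeness and coherence conditions above amount to set comparisons between the local governable grammatical functions and the semantic form. Below is a minimal sketch over a dict-encoded f-structure; the encoding, the GOVERNABLE set and the example frames are illustrative assumptions, not the course's code.

```python
GOVERNABLE = {"subj", "obj", "obj2", "obl", "comp", "xcomp"}

def check_subcat(fstr, frame):
    """`frame` is the set of GFs in the local PRED's semantic form."""
    local_gfs = {gf for gf in fstr if gf in GOVERNABLE}
    complete = frame <= local_gfs   # every subcategorised GF present locally
    coherent = local_gfs <= frame   # no governable GF outside the semantic form
    return complete, coherent

# sign<SUBJ, OBJ>: adjuncts are not governable, so they never hurt coherence.
f = {"pred": "sign", "subj": {"pred": "pro"}, "obj": {"pred": "treaty"},
     "adjunct": [{"pred": "today"}]}
print(check_subcat(f, {"subj", "obj"}))  # (True, True)
# Missing OBJ -> incomplete (but still coherent):
print(check_subcat({"pred": "sign", "subj": {}}, {"subj", "obj"}))  # (False, True)
```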
LFG LDDs: Complement Relative Clause
Head-Driven Phrase Structure Grammar (HPSG)
Head-Driven Phrase Structure Grammar HPSG
• HPSG (Pollard and Sag 1994, Sag et al. 2003) is a unification-/constraint-based theory of grammar
• HPSG is a lexicalized grammar formalism
• HPSG aims to explain generic regularities that underlie phrase structures, lexicons, and semantics, as well as language-specific/-independent constraints
• Syntactic/semantic constraints are uniformly denoted by signs, which are represented with feature structures
• Two components of HPSG:
  – Lexical entries represent word-specific constraints (corresponding to elementary objects)
  – Principles express generic grammatical regularities (corresponding to grammatical operations)
Sign

• A sign is a formal representation of combinations of phonological forms and syntactic and semantic constraints

The Penn Treebank

• Contains text from different domains:
  – Wall Street Journal (50,000 sentences, 1 million words)
  – Switchboard
  – Brown corpus
  – ATIS
• The annotation:
  – POS-tagged (Ratnaparkhi’s MXPOST)
  – Manually annotated with phrase-structure trees
  – Traces and other null elements used to represent non-local dependencies (movement, PRO, etc.)
  – Designed to facilitate extraction of predicate-argument structure
A Treebank tree
• Relatively flat structures:
  – There is no noun level
  – VP arguments and adjuncts appear at the same level
• Co-indexed null elements indicate long-range dependencies
• Function tags indicate the complement-adjunct distinction (?)
Penn-II Treebank
• Until Congress acts , the government hasn't any authority to issue new debt obligations of any kind , the Treasury said .

Evaluating Treebank parsers

• Standard evaluation metric for Treebank parsers; two components:
  – Precision: how many of the proposed NTs are correct?
  – Recall: how many of the correct NTs are proposed?
• Measures recovery of nonterminals (span + syntactic category)
• Ignores function tags and null elements
• Has biased research towards parsers that produce linguistically shallow output (Collins, Charniak)
Treebank-Based Acquisition of TAG Resources
Extracting a TAG from the Treebank
• Two different approaches:
  – F. Xia. Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania, 2001.
  – J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)
• This lecture: just the basic ideas!
Extracting a TAG from the Penn Treebank
• Input: a Treebank tree (= the TAG derived tree)
• Output: a set of elementary trees (= the TAG lexicon)
Extracting a TAG: the head
• Identify the head path (requires a head percolation table)
• Find the arguments of the head (requires an argument table)
• Ignore modifiers (requires an adjunct table)
• Merge unary productions (VP -> VP)

[Figure: a tree (S (NP-SBJ …) (VP (VBG making) (NP …))) with the head path S–VP–VBG highlighted]
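Head-path identification can be sketched with a miniature head percolation table in the spirit of Magerman-style rules. The table entries below are a tiny illustrative subset, and real implementations strip function tags such as -SBJ before matching; none of this is the course's actual code.

```python
HEAD_TABLE = {
    # parent label: (search direction, child labels in priority order)
    "S":  ("right", ["VP", "S"]),
    "VP": ("left",  ["VBZ", "VBD", "VBG", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
}

def head_child(label, children):
    """Pick the head child among (label, ...) tuples under a `label` node."""
    direction, priorities = HEAD_TABLE.get(label, ("left", []))
    kids = list(children) if direction == "left" else list(reversed(children))
    for want in priorities:
        for kid in kids:
            if kid[0] == want:
                return kid
    return kids[0]  # fallback: first child in the search direction

# (S (NP-SBJ payrolls) (VP (VBG making) (NP things)))
vp = ("VP", ("VBG", "making"), ("NP", "things"))
s_kids = (("NP-SBJ", "payrolls"), vp)
print(head_child("S", s_kids)[0])   # VP
print(head_child("VP", vp[1:])[1])  # making
```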
Extracting a TAG: the head
• This is the elementary tree for the head:
Extracting a TAG: arguments
• Arguments are combined via substitution
• Recurse on the arguments:
Extracting a TAG: adjuncts
• Adjuncts require auxiliary trees (combined with the head by adjunction)
• Auxiliary trees require a foot node (with the same label as the root)

[Figure: an auxiliary tree with foot node VP* extracted for the ADVP-MNR adjunct “officially”]
Special cases
• Coordination
• Null elements (e.g. traces for wh-movement): the trace has to be part of the elementary tree of the main verb
• Punctuation marks
Wh-movement: relative clauses
(NP (NP a charge)
    (SBAR (WHNP-2 (-NONE- 0))
          (S (NP-SBJ Mr. Coleman)
             (VP (VBZ denies)
                 (NP (-NONE- *T*-2))))))

[Figure: the corresponding tree, with the trace *T*-2 in object position co-indexed with WHNP-2]
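Bracketed examples like the one above can be processed mechanically. Below is a minimal sketch of reading a Penn bracketing and locating co-indexed traces; the helper names are this write-up's, not tooling from the course.

```python
import re

def parse_ptb(s):
    """Parse a Penn Treebank bracketing into nested (label, kid, ...) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def read(i):
        assert tokens[i] == "("
        label, i = tokens[i + 1], i + 2
        kids = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                kid, i = read(i)
            else:
                kid, i = tokens[i], i + 1
            kids.append(kid)
        return (label, *kids), i + 1
    tree, _ = read(0)
    return tree

def traces(tree):
    """Find co-indexed null elements like *T*-2 and their parent labels."""
    found = []
    def walk(t):
        if isinstance(t, tuple):
            label, *kids = t
            for k in kids:
                if isinstance(k, tuple) and k[0] == "-NONE-" and k[1].startswith("*T*"):
                    found.append((k[1].rsplit("-", 1)[-1], label))
                walk(k)
    walk(tree)
    return found

tree = parse_ptb(
    "(NP (NP a charge) (SBAR (WHNP-2 (-NONE- 0)) "
    "(S (NP-SBJ Mr. Coleman) (VP (VBZ denies) (NP (-NONE- *T*-2))))))"
)
print(traces(tree))  # [('2', 'NP')]: the trace in object position, index 2
```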
Evaluating an extracted grammar/lexicon
• Grammar/lexicon size?
  – Depends on the head table, argument/adjunct distinction, treatment of null elements, mapping of Treebank labels/POS tags to categories in the extracted grammar, etc.
  – For TAGs: between 3,000-8,500 elementary tree types, and 100,000-130,000 lexical entries
• Lexical coverage?
  – For TAGs: around 92-93%
• Distribution of tree types?
• Convergence?
• Quality?
  – Inspection, comparison with a manual grammar
References: TAG extraction

TAG:
A.K. Joshi and Y. Schabes (1996) Tree Adjoining Grammars. In G. Rozenberg and A. Salomaa, Eds., Handbook of Formal Languages

TAG extraction:
F. Xia. Automatic Grammar Generation From Two Different Perspectives. PhD thesis, University of Pennsylvania, 2001.
J. Chen, S. Bangalore, K. Vijay-Shanker. Automated Extraction of Tree-Adjoining Grammars from Treebanks, Natural Language Engineering (forthcoming)
Also: L. Shen and A.K. Joshi, Building an LTAG Treebank, Technical Report MS-CIS-05-15, CIS Department, University of Pennsylvania, 2005

Parsing with extracted TAGs:
D. Chiang. Statistical parsing with an automatically extracted tree adjoining grammar. In Data Oriented Parsing, CSLI Publications, pages 299-316.
L. Shen and A.K. Joshi. Incremental LTAG parsing, HLT/EMNLP 2005
Penn-II-Based Acquisition of LFG Resources
Penn-II-Based Acquisition of LFG Resources
• Introduction
• Treebank Preprocessing/Clean-Up
• Treebank Annotation/Conversion
• Grammar and Lexicon Extraction
• Parsing (Architectures, Probability Models, Evaluation)
• Generation (Architectures, Probability Models, Evaluation)
• Other (Semantics, Domain Variation, …)
Introduction: Penn-II & LFG
• If we had an f-structure-annotated version of Penn-II, we could use (standard) machine learning methods to extract probabilistic, wide-coverage LFG resources
• Penn-II is a 2nd-generation treebank: it contains lots of annotation to support the derivation of deep meaning representations (trees, Penn-II “functional” tags, traces & coindexation); the f-structure annotation algorithm can exploit these
Introduction: Penn-II & LFG
• What is the task?
• Given a Penn-II tree, the f-structure annotation algorithm has to traverse the tree and associate all tree nodes with f-structure equations (including lexical equations at the leaves of the tree).
• A simple example
Introduction: Penn-II & LFG

Factory payrolls fell in September.

(S  (NP-SBJ[↑SUBJ=↓]  (NN[↓∈↑ADJUNCT] Factory)  (NNS[↑=↓] payrolls))
    (VP[↑=↓]  (VBD[↑=↓] fell)
              (PP-TMP[↓∈↑ADJUNCT]  (IN[↑=↓] in)
                                   (NP[↑OBJ=↓]  (NNP[↑=↓] September)))))
Introduction: Penn-II & LFG
subj:    pred: payroll
         num: pl
         pers: 3
         adjunct: { 2: [ pred: factory, num: sg, pers: 3 ] }
adjunct: { 1: [ pred: in
                obj: [ pred: september, num: sg, pers: 3 ] ] }
pred:    fall
tense:   past
Treebank Preprocessing/Clean-Up: Penn-II & LFG
• Penn-II treebank: often flat analyses (coordination, NPs, …), a certain amount of noise: inconsistent annotations, errors, …
• No treebank preprocessing or clean-up in the LFG approach
• Take the Penn-II treebank as is, but:
  – Remove all trees with FRAG- or X-labelled constituents
  – FRAG = fragments; X = not known how to annotate
• This leaves a total of 48,424 trees, used as they are.
Treebank Annotation: Penn-II & LFG
• Annotation-based (rather than conversion-based)
• Automatic annotation of nodes in Penn-II treebank trees with f-structure equations
• F-structure Annotation Algorithm
• The Annotation Algorithm exploits:
  – Head information
  – Categorial information
  – Configurational information
  – Penn-II functional tags
  – Trace information
Treebank Annotation: Penn-II & LFG
• Architecture of the modular algorithm assigning LFG f-structure equations to trees in the Penn-II treebank:

  Head-Lexicalisation [Magerman, 1994]
    → Left-Right Context Annotation Principles
    → Coordination Annotation Principles
    → Catch-All and Clean-Up   (yields proto-f-structures)
    → Traces                   (yields proper f-structures)
Treebank Annotation: Penn-II & LFG
• Head Lexicalisation: modified rules based on (Magerman, 1994)
Treebank Annotation: Penn-II & LFG
Left-Right Context Annotation Principles:
• Head of an NP is likely to be the rightmost noun, …
• Mother → Left Context, Head, Right Context

Evaluation:
• F-structure quality is evaluated against the DCU 105, a manually annotated dependency gold standard of 105 sentences randomly extracted from WSJ section 23
• Triples are extracted from the gold standard and from the automatically produced f-structures using the evaluation software from (Crouch et al. 2002) and (Riezler et al. 2002):

  relation(predicate~0, argument~1)

• Results are calculated in terms of Precision and Recall
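Triple-based precision and recall reduce to set comparisons. A sketch with invented triples follows; the triples and scores below are illustrative, not the published DCU 105 results.

```python
def prf(gold, test):
    """Precision, recall and f-score of `test` triples against `gold`."""
    gold, test = set(gold), set(test)
    correct = len(gold & test)
    p = correct / len(test) if test else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {("subj", "fall", "payroll"),
        ("adjunct", "fall", "in"),
        ("obj", "in", "september")}
test = {("subj", "fall", "payroll"),
        ("obj", "in", "september"),
        ("obj", "fall", "payroll")}   # one wrong triple, one missed

p, r, f = prf(gold, test)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```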
Treebank Annotation: Penn-II & LFG
• Precision and Recall for the DCU 105 Dependency Bank are calculated for All Annotations and for Preds-Only
• Following (Kaplan et al. 2004), Precision and Recall for the PARC 700 Dependency Bank are calculated for: all annotations, PARC features, preds-only
• A mapping is required (Burke 2006)

PARC 700, PARC features:
  Precision: 88.31%
  Recall:    86.38%
Grammar and Lexicon Extraction: Penn-II & LFG

Lexical Resources:
• Lexical information is extremely important in modern lexicalised grammar formalisms: LFG, HPSG, CCG, TAG, …
• Lexicon development is time-consuming and extremely expensive
• Rarely if ever complete
• The familiar knowledge acquisition bottleneck …
• Subcategorisation frame induction (LFG semantic forms) from the f-structure-annotated version of Penn-II and Penn-III
• Evaluation against COMLEX
Grammar and Lexicon Extraction: Penn-II & LFG
• Lexicon Construction: Manual vs. Automated
• Our Approach:
  – F-Structure Annotation of Penn-II and Penn-III
  – Frames not predefined
  – Functional and categorial information
  – Parameterised for prepositions and particles
  – Active and passive
  – Long distance dependencies
  – Conditional probabilities

For each level of embedding in F:
  Determine the local predicate PRED
  Collect all subcategorisable grammatical functions GF1, …, GFn
  Return: PRED<GF1, GF2, …, GFn>
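The extraction loop above can be sketched as a recursion over a dict-encoded f-structure. The encoding and the governable-GF list are illustrative assumptions, and in this sketch only predicates with at least one governable GF yield a frame.

```python
GOVERNABLE = ("subj", "obj", "obj2", "obl", "comp", "xcomp")

def extract_frames(f, frames=None):
    """Collect PRED<GF1,...,GFn> at every level of embedding in f."""
    if frames is None:
        frames = []
    if isinstance(f, dict):
        gfs = [gf for gf in GOVERNABLE if gf in f]
        if "pred" in f and gfs:
            frames.append(f"{f['pred']}<{','.join(gfs)}>")
        for value in f.values():
            extract_frames(value, frames)
    elif isinstance(f, list):
        for item in f:
            extract_frames(item, frames)
    return frames

# Simplified f-structure for "..., the Treasury said" with a COMP clause:
f = {"pred": "say", "tense": "past",
     "subj": {"pred": "treasury"},
     "comp": {"pred": "have",
              "subj": {"pred": "government"},
              "obj": {"pred": "authority"}}}
print(extract_frames(f))  # ['say<subj,comp>', 'have<subj,obj>']
```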
Grammar and Lexicon Extraction: Penn-II & LFG
subj:    spec: det: pred: the
         pred: inquiry
         num: sg
         pers: 3
adjunct: { 1: [ pred: soon ] }
pred:    focus
tense:   past
obl:     pform: on
         obj: spec: det: pred: the
              pred: judge
              num: sg
              pers: 3
“The inquiry soon focused on the judge” (wsj_0267_72)
Prepositions and OBLs:
focus([subj,obl:on])
on([obj])
Grammar and Lexicon Extraction: Penn-II & LFG
topic: index: [1]
       subj: spec: det: pred: the
             num: sing
             pred: government
             pers: 3
       …
       pred: have
       tense: pres
subj:  spec: det: pred: the
       pers: 3
       pred: treasury
       num: sing
comp:  index: [1]
       subj: spec: det: pred: the
             num: sing
             pred: government
             pers: 3
       …
       pred: have
       tense: pres
pred:  say
tense: past
LDDs:
say([subj,comp])
“Until Congress acts , the government hasn't any authority to issue new debt obligations of any kind, the Treasury said.” (wsj_0008_2)
Grammar and Lexicon Extraction: Penn-II & LFG
subj:    pred: pro
         pron_form: it
passive: +
to_inf:  +
pred:    be
xcomp:   subj: pred: pro
               pron_form: it
         passive: +
         pred: consider
         tense: past
         obl: pform: as
              obj: spec: det: pred: a
                   …
                   pred: risk
                   num: sg
                   pers: 3
Passive:
consider([subj,obl:as],p)
“… to be considered as an additional risk for the investor…”(wsj_0018_14)
Grammar and Lexicon Extraction: Penn-II & LFG
subj:    spec: det: pred: the, cat: dt
         pred: inquiry
         num: sg
         pers: 3
         cat: nn
adjunct: { 1: [ pred: soon, cat: rb ] }
pred:    focus
tense:   past
cat:     vbd
obl:     pform: on
         obj: spec: det: pred: the, cat: dt
              pred: judge
              num: sg
              pers: 3
              cat: nn

CFG categories:
focus(v,[subj,obl:on])
focus(v,[subj(n),obl:on])

“The inquiry soon focused on the judge.” (wsj_0267_72)

Lexicon extracted from Penn-II (O’Donovan et al. 2005):

                 Without Prep/Part   With Prep/Part
Lemmas           3,586               3,586
Semantic Forms   10,969              14,348
Frame Types      38                  577
Grammar and Lexicon Extraction: Penn-II & LFG
• Evaluation for all active verbs (2,992) extracted from Penn-II against COMLEX
• Largest evaluation for an English subcat frame extraction system:
  – Carroll and Rooth (1998): 200 verbs
  – Schulte im Walde (2000): over 3,000 German verbs
• Directional prepositions (about, across, along, around, behind, below, beneath, between, beyond, by, down, from, …) are included in COMLEX by “default” for verbs that have at least one p-dir …

• Systematic differences between our f-structures and the PARC 700 and CBS 500 dependency representations
• Automatic conversion of our f-structures to PARC 700 / CBS 500-like structures (Burke et al. 2004, Burke 2006, Cahill et al. under review)
• Best XLE and RASP resources, with better results than those reported in the literature to date
• (Crouch et al. 2002) and (Carroll and Briscoe 2002) evaluation software
• (Noreen 1989) Approximate Randomisation Test to test for statistical significance of results
Parsing: Penn-II and LFG
• Result dependency f-scores:
  – PARC 700, XLE vs. BKR-LFG: 80.55% XLE; 83.08% BKR-LFG (+2.53%)
  – CBS 500, RASP vs. BKR-LFG: 76.57% RASP; 80.23% BKR-LFG (+3.66%)
• Results statistically significant at the 95% level ((Noreen 1989) Approximate Randomisation Test)
• BKR-LFG = treebank-induced Lexical-Functional Grammar resources with the retrained Bikel parser (BKR) as the c-structure engine in a pipeline architecture
Parsing: Penn-II and LFG
PARC 700 Evaluation:
Probability Models: Penn-II & LFG
Probability Models:
• Our approach does not constitute a proper probability model (Abney, 1996)
• Why? The probability model leaks:
  – The highest-ranking parse tree may feature f-structure equations that cannot be resolved into an f-structure
  – The probability associated with that parse tree is lost
• This doesn’t happen often in practice (coverage >99.5% on unseen data)
• Research on appropriate discriminative, log-linear or maximum entropy models is important (Miyao and Tsujii, 2002; Riezler et al. 2002)
Generation: Penn-II & LFG
Cahill and van Genabith, 2006
Generation: the Good, the Bad and the Ugly
• Orig: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process , and to preserve the safety and fitness of the industry .
• Gen: Supporters of the legislation view the bill as an effort to add stability and certainty to the airline-acquisition process , and to preserve the safety and fitness of the industry.
• Orig: The upshot of the downshoot is that the A 's go into San Francisco 's Candlestick Park tonight up two games to none in the best-of-seven fest .
• Gen: The upshot of the downshoot is that the A 's tonight go into San Francisco 's Candlestick Park up two games to none in the best-of-seven fest .
• Orig: By this time , it was 4:30 a.m. in New York , and Mr. Smith fielded a call from a New York customer wanting an opinion on the British stock market , which had been having troubles of its own even before Friday 's New York market break .
• Gen: Mr. Smith fielded a call from New a customer York wanting an opinion on the market British stock which had been having troubles of its own even before Friday 's New York market break by this time and in New York , it was 4:30 a.m. .
• Orig: Only half the usual lunchtime crowd gathered at the tony Corney & Barrow wine bar on Old Broad Street nearby .
• Gen: At wine tony Corney & Barrow the bar on Old Broad Street nearby gathered usual , lunchtime only half the crowd , .
Domain Variation, Multilingual LFG Resources, etc.
• Domain variation: ATIS (Judge et al 2005) and QuestionBank (Judge et al 2006)
• F-Str -> (Q)LF Quasi-Logical Forms (Cahill et al. 2003)
• Multilingual treebank-based LFG acquisition:
– German: TIGER treebank (Cahill et al 2003), (Cahill et al 2005)
– Chinese: Chinese Penn Treebank (Burke et al 2004)
– Spanish: Cast3LB (O’Donovan et al 2005), (Chrupala and van Genabith 2006)
• GramLab Project at DCU (2005-2008): Chinese, Japanese, Arabic, Spanish, French and German
A. Cahill and J. van Genabith, Robust PCFG-Based Generation using Automatically Acquired LFG-Approximations, COLING/ACL 2006, Sydney, Australia
J. Judge, A. Cahill and J. van Genabith, QuestionBank: Creating a Corpus of Parse-Annotated Questions, COLING/ACL 2006, Sydney, Australia
G. Chrupala and J. van Genabith, Using Machine-Learning to Assign Function Labels to Parser Output for Spanish, COLING/ACL 2006, Sydney, Australia
M. Burke, Automatic Treebank Annotation for the Acquisition of LFG Resources, Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland. 2005
R. O’Donovan, Automatic Extraction of Large-Scale Multilingual Lexical Resources, Ph.D. Thesis, School of Computing, Dublin City University, Dublin 9, Ireland. 2005
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks, Computational Linguistics, 2005
A. Cahill, M. Forst, M. Burke, M. McCarthy, R. O'Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Acquisition of Multilingual Unification Grammar Resources; Journal of Research on Language and Computation; Special Issue on "Shared Representations in Multilingual Grammar Engineering", (eds.) E. Bender, D. Flickinger, F. Fouvry and M. Siegel, Kluwer Academic Press, 2005
Publications
R. O'Donovan, A. Cahill, J. van Genabith, and A. Way. Automatic Acquisition of Spanish LFG Resources from the CAST3LB Treebank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
J. Judge, M. Burke, A. Cahill, R. O'Donovan, J. van Genabith, and A. Way. Strong Domain Variation and Treebank-Induced LFG Resources; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway,2005
M. Burke, A. Cahill, J. van Genabith, and A. Way. Evaluating Automatically Acquired F-Structures against PropBank; In Proceedings of the Tenth International Conference on LFG, Bergen, Norway, 2005
M. Burke, A. Cahill, M. McCarthy, R.O'Donovan, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank; Journal of Language and Computation; Special Issue on "Treebanks and Linguistic Theories", (eds.) E. Hinrichs and K.Simov, Kluwer Academic Press. 2005. pages 523-547
A. Cahill. Parsing with Automatically Acquired, Wide-Coverage, Robust, Probabilistic LFG Approximations. Ph.D. Thesis. School of Computing, Dublin City University, Dublin 9, Ireland. 2004
M. Burke, O. Lam, A. Cahill, R. Chan, R. O'Donovan, A. Bodomo, J. van Genabith and A. Way; Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar; Proceedings of the PACLIC-18 Conference, Waseda University, Tokyo, Japan, pages 161-172, 2004
Publications
M. Burke, A. Cahill, R. O'Donovan, J. van Genabith, and A. Way. The Evaluation of an Automatic Annotation Algorithm against the PARC 700 Dependency Bank, In Proceedings of the Ninth International Conference on LFG, Christchurch, New Zealand, pages 101-121, 2004
A. Cahill, M. Burke, R. O'Donovan, J. van Genabith, and A. Way. Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations, In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 320-327, Barcelona, Spain, 2004
R. O'Donovan, M. Burke, A. Cahill, J. van Genabith, and A. Way. Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank, In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), July 21-26 2004, pages 368-375, Barcelona, Spain, 2004
M. Burke, Cahill A., R. O' Donovan, J. van Genabith and A. Way. Treebank-Based Acquisition of Wide-Coverage, Probabilistic LFG Resources: Project Overview, Results and Evaluation, The First International Joint Conference on Natural Language Processing (IJCNLP-04), Workshop "Beyond shallow analyses - Formalisms and statistical modeling for deep analyses"; March 22-24, 2004 Sanya City, Hainan Island, China, 2004
Cahill A., M. Forst, M. McCarthy, R. O' Donovan, C. Rohrer, J. van Genabith and A. Way. Treebank-Based Multilingual Unification-Grammar Development. In the Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development, at the 15th European Summer School in Logic Language and Information, Vienna, Austria, 18th - 29th August 2003
Publications
Cahill A, M. McCarthy, J. van Genabith and A. Way. Quasi-Logical Forms for the Penn Treebank; In (eds.) Harry Bunt, Ielka van der Sluis and Roser Morante; Proceedings of the Fifth International Workshop on Computational Semantics, IWCS-05, January 15-17, 2003, Tilburg, The Netherlands, ISBN: 90-74029-24-8, pp.55-71, 2003
Cahill A, M. McCarthy, J. van Genabith and A. Way. Evaluating Automatic F-Structure Annotation for the Penn-II Treebank. TLT 2002, Treebanks and Linguistic Theories 2002, 20th and 21st September 2002, Sozopol, Bulgaria, (eds.) E. Hinrichs and K. Simov, Proceedings of the First Workshop on Treebanks and Linguistic Theories (TLT 2002), pp. 42-60, 2002
Cahill A, M. McCarthy, J. van Genabith and A. Way. Parsing with PCFGs and Automatic F-Structure Annotation, In M. Butt and T. Holloway-King (eds.): Proceedings of the Seventh International Conference on LFG CSLI Publications, Stanford, CA., pp.76--95. 2002
Cahill A, and J. van Genabith. TTS - A Treebank Tool; in LREC 2002, The Third International Conference on Language Resources and Evaluation, Las Palmas de Grand Canaria, Spain, May 27th--June 2nd, 2002, Proceedings of the Conference, Volume V, (eds.) M.G.Rodriguez and C.P. Suarez Arnajo, ISBN 2-9517408-0-8, pp. 1712-1717, 2002
Cahill A, M. McCarthy, J. van Genabith and A. Way. Automatic Annotation of the Penn-Treebank with LFG F-Structure Information; LREC 2002 workshop on Linguistic Knowledge Acquisition and Representation - Bootstrapping Annotated Language Data, LREC 2002, Third International Conference on Language Resources and Evaluation, post-conference workshop, June 1st, 2002, proceedings of the workshop, (eds.) A. Lenci, S. Montemagni and V. Pirelli, ELRA - European Language Resources Association, Paris France, pp. 8-15, 2002
Penn-II-Based Acquisition of CCG Resources
Combinatory Categorial Grammar
This lecture
• Recap: CCG
• Translating the Penn Treebank to CCG
  – The translation algorithm
  – CCGbank: the acquired grammar and lexicon
• Wide-coverage parsing with CCG
CCG: the machinery
• Categories: specify subcat lists of words/constituents.
• Combinatory rules: specify how constituents can combine.
• The lexicon: specifies which categories a word can have.
• Derivations: spell out the process of combining constituents.
CCG categories
• Simple categories: NP, S, PP
• Complex categories: functions which return a result when combined with an argument:
  VP or intransitive verb: S\NP
  Transitive verb: (S\NP)/NP
  Adverb: (S\NP)\(S\NP)
  PPs: ((S\NP)\(S\NP))/NP (modifying a VP), (NP\NP)/NP (modifying an NP)
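One way to make the category notation concrete is a tiny string parser (an assumed representation for illustration, not CCGbank's internal format):

```python
# A sketch of CCG categories as plain strings, with a small parser that
# splits a category at its outermost slash.

def split_category(cat):
    """Return (result, slash, argument) for a complex category, None for atoms."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):   # rightmost slash at bracket depth 0
        c = cat[i]
        if c == ")":
            depth += 1
        elif c == "(":
            depth -= 1
        elif c in "/\\" and depth == 0:
            return strip_parens(cat[:i]), c, strip_parens(cat[i + 1:])
    return None

def strip_parens(cat):
    """Drop one pair of outer brackets if they enclose the whole category."""
    if not (cat.startswith("(") and cat.endswith(")")):
        return cat
    depth = 0
    for i, c in enumerate(cat):
        depth += c == "("
        depth -= c == ")"
        if depth == 0:
            return cat[1:-1] if i == len(cat) - 1 else cat
    return cat

def arity(cat):
    """How many arguments a category consumes before reaching an atom."""
    n, parts = 0, split_category(cat)
    while parts:
        n, parts = n + 1, split_category(parts[0])
    return n

print(split_category(r"(S\NP)/NP"))   # transitive verb: result S\NP, slash /, argument NP
print(arity(r"((S\NP)\(S\NP))/NP"))   # 3
```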
The combinatory rules
• Function application: λx.f(x)  a  ⇒  f(a)
  Forward (>):  X/Y  Y  ⇒  X
  Backward (<): Y  X\Y  ⇒  X
• Canonical “normal-form” derivations (mostly function application)
• Alternative derivations
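The two application rules can be sketched over category strings (verbatim matching, no features — a toy, not the actual parser):

```python
# Forward (>) and backward (<) application over category strings.

def unparen(cat):
    """Drop one pair of outer brackets (naive: fine for these toy inputs)."""
    return cat[1:-1] if cat.startswith("(") and cat.endswith(")") else cat

def forward(left, right):
    """X/Y  Y  =>  X   (>)"""
    return unparen(left[: -(len(right) + 1)]) if left.endswith("/" + right) else None

def backward(left, right):
    """Y  X\\Y  =>  X   (<)"""
    return unparen(right[: -(len(left) + 1)]) if right.endswith("\\" + left) else None

# Deriving "he saw a girl" bottom-up:
np = forward("NP/N", "N")                  # a + girl  -> NP
vp = forward(r"(S\NP)/NP", np)             # saw + NP  -> S\NP
print(backward("NP", vp))                  # he + S\NP -> prints S
```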
Type-raising and Composition
• Wh-movement:
• Right-node raising:
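The rules behind these constructions — type-raising X ⇒ T/(T\X) and forward composition X/Y Y/Z ⇒ X/Z — can be sketched the same way (toy string matching; hypothetical helpers, not CCGbank code):

```python
# Toy sketches of type-raising and forward composition.

def type_raise_forward(x, t):
    """X  =>  T/(T\\X), e.g. a subject NP becomes S/(S\\NP)."""
    return "%s/(%s\\%s)" % (t, t, x)

def compose_forward(left, right):
    """X/Y  Y/Z  =>  X/Z   (forward composition, >B)."""
    if "/" not in left or "/" not in right:
        return None
    x, y1 = left.rsplit("/", 1)    # naive splits: assume the main slash is
    y2, z = right.split("/", 1)    # the one found here (true for these toys)
    return x + "/" + z if y1 == y2 else None

# A wh-dependency building block: a type-raised subject composed with a
# transitive verb yields S/NP, a sentence still missing its object.
tr = type_raise_forward("NP", "S")              # S/(S\NP)
print(compose_forward(tr, r"(S\NP)/NP"))        # prints S/NP
```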
CCG: semantics
• Every syntactic category and rule has a semantic counterpart:
From the Penn Treebank to CCG
• The basic translation algorithm
• Dealing with null elements
• Type-changing rules in the grammar
• Preprocessing
• CCGbank: the extracted lexicon/grammar
Input: Penn Treebank tree
• Flat phrase-structure tree
• Traces/null elements and indices
Lexicon coverage
• How well does our lexicon cover unseen data?
  “Training” data: sections 02-21
  Test data: section 00
• The lexicon contains the correct entries for 94.0% of the tokens in section 00.
• 3.8% of the tokens in section 00 do not appear in sections 02-21.
  35% of the unknown tokens are N
  29% of the unknown tokens are N/N
Statistical Parsing with CCG
• The data: CCGbank
• The algorithms: standard CKY chart parsing (and a supertagger)
• The models:
  – Generative: Hockenmaier and Steedman (2002)
  – Conditional: Clark and Curran (2004)
Parsing algorithms for CCG
• CCG derivations are binary trees.
• Standard chart parsing algorithms (e.g. CKY) can be used.
• Complexity: O(n^6) (or O(n^3) if the category set is fixed)
• Recovery of “deep” dependencies requires feature structures.
• Supertagging: assign the most likely categories to words before parsing. Significantly speeds up parsing!
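A minimal CKY recognizer over such binary derivations might look like this (function application only — composition and type-raising omitted; the toy lexicon and verbatim category matching are assumptions of the sketch):

```python
# Minimal CKY sketch for binary CCG derivations, application rules only.

def parse(words, lexicon):
    n = len(words)
    # chart[i][j]: set of categories spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for l in chart[i][k]:
                    for r in chart[k][j]:
                        for res in (apply_fwd(l, r), apply_bwd(l, r)):
                            if res:
                                chart[i][j].add(res)
    return chart[0][n]

def apply_fwd(x, y):
    """X/Y  Y  =>  X   (>)"""
    return strip(x[: -(len(y) + 1)]) if x.endswith("/" + y) else None

def apply_bwd(y, x):
    """Y  X\\Y  =>  X   (<)"""
    return strip(x[: -(len(y) + 1)]) if x.endswith("\\" + y) else None

def strip(c):
    return c[1:-1] if c.startswith("(") and c.endswith(")") else c

lexicon = {"he": ["NP"], "saw": [r"(S\NP)/NP"], "a": ["NP/N"], "girl": ["N"]}
print(parse("he saw a girl".split(), lexicon))   # {'S'}: the sentence derives S
```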
Parsing models
• Generative models: P(s, d)
  Model the process which generates the derivation d for sentence s
  – Advantage: easy to guarantee consistency
  – Disadvantage: requires good smoothing techniques; difficult to include complex features
  Good baseline
• Conditional models: P(d | s)
  Given a sentence s, predict the most likely derivation d
  – Advantage: more natural for parsing
  – Disadvantage: large model size, difficult to estimate
Evaluation: recovery of dependency structures

              Labelled   Unlabelled
Generative      83.3       90.3      (Hockenmaier and Steedman, 2002)
Conditional     84.6       91.2      (Clark and Curran, 2004)

This includes long-range dependencies.
ccg2sem: from CCG to DRT
• A Prolog package which translates CCGbank derivations into Discourse Representation Theory structures (Bos, 2005)
CCGbanks for other languages
• German (Hockenmaier, 2006):
  – Translation of the German TIGER corpus into CCG.
  – Many crossing dependencies, etc.: context-free approximations are inappropriate
  – Current coverage: 92.4% of all graphs (excluding headlines, fragments etc.)
• Turkish (Cakici, 2005):
  – Extracts a CCG lexicon from the METU Sabanci Treebank.
A few references

General CCG references:
M. Steedman (2000). The Syntactic Process, MIT Press.
M. Steedman (1996). Surface Structure and Interpretation, MIT Press.

CCGbank(s) and wide-coverage CCG parsing:
J. Hockenmaier and M. Steedman (2005). CCGbank: User’s Manual, MS-CIS-05-09, Dept. of Computer and Information Science, University of Pennsylvania.
J. Hockenmaier and M. Steedman (2002). Acquiring Compact Lexicalized Grammars from a Cleaner Treebank, LREC, Las Palmas, Spain.
J. Hockenmaier (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar. PhD thesis, Informatics, University of Edinburgh.
J. Hockenmaier and M. Steedman (2002). Generative Models for Statistical Parsing with Combinatory Categorial Grammar, ACL ’02, Philadelphia, PA, USA.
S. Clark and J. R. Curran (2004). Parsing the WSJ using CCG and Log-Linear Models, ACL ’04, Barcelona, Spain.
S. Clark and J. R. Curran (2004). The Importance of Supertagging for Wide-Coverage CCG Parsing, Coling ’04, Geneva, Switzerland.
J. Bos (2005). Towards Wide-Coverage Semantic Interpretation, IWCS-6.
R. Cakici (2005). Automatic Induction of a CCG Grammar for Turkish, ACL Student Research Workshop, Ann Arbor, MI, USA.
J. Hockenmaier (2006). Creating a CCGbank and a wide-coverage CCG lexicon for German, ACL/COLING ’06, Sydney, Australia.
More references
• The CCG website: http://groups.inf.ed.ac.uk/ccg
  with lots of general references about CCG (as well as CCGbank, CCG parsing, etc.)
• CCGbank is available from the Linguistic Data Consortium (LDC) at the University of Pennsylvania.
• RULE: name of applied rule
• DIST: distance between head words
• COMMA: whether the phrase includes commas
• SPAN: number of words the phrase dominates
• SYM: nonterminal symbol (e.g. S, VP, …)
• WORD: head word
• POS: part-of-speech
• LE: lexical entry
• ARG: argument label (ARG1, ARG2, ...)
Example: syntactic features
• Feature for the Head-Modifier construction for “saw a girl” and “with a telescope”
[Figure: HPSG derivation of “he saw a girl with a telescope”; the feature combines RULE, DIST and COMMA with the LE, POS, WORD, SYM and SPAN values of the head (saw: transitive, VBD, VP, span 3) and of the modifier (with: prep-mod-vp, IN, PP, span 3)]
Example: semantic features
• Feature for the predicate argument relation between “he” and “saw”
[Figure: predicate-argument structure of “he saw a girl” (saw: ARG1 = he, ARG2 = girl); the feature combines ARG and DIST with the WORD, POS and LE values of the predicate (saw, VBD, transitive) and of the argument (he, PRP, pronoun)]
Feature generation
• Features are generated by abstracting descriptions of probabilistic events
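A hypothetical illustration of such abstraction: from one fully specified event description, emit variants with some fields masked, so sparse events back off to more general features (the field names follow the templates above, but the masking scheme here is invented for illustration):

```python
# Sketch of feature generation by abstraction: each subset of the maskable
# fields yields one less-specific feature.
from itertools import combinations

def abstract_features(event, fields_to_mask=("WORD", "LE", "POS")):
    """event: dict like {'RULE': ..., 'WORD': ...} -> list of feature tuples."""
    feats = []
    for r in range(len(fields_to_mask) + 1):
        for masked in combinations(fields_to_mask, r):
            feats.append(tuple(sorted(
                (k, "*" if k in masked else v) for k, v in event.items())))
    return feats

event = {"RULE": "head-mod", "DIST": 3, "WORD": "saw", "POS": "VBD", "LE": "transitive"}
print(len(abstract_features(event)))   # 8: one feature per subset of the 3 maskable fields
```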
Evaluation
• Evaluation of the lexical entries extracted from the Penn Treebank
  – Investigation of obtained lexical entries
  – Coverage
• Evaluation of the disambiguation model
  – Parsing accuracy
Experimental settings
• Training data: Sections 2-21 of Penn Treebank II (39,832 sentences)
• Test data:
  – Development set: Section 22 (1,700 sentences)
  – Final test set: Section 23 (2,416 sentences)
Number of tree conversion rules

Target of conversion                  Number
Penn-II errors                           102
Category mapping                          85
Head annotation and binarization          63
Difference of phrase structures           15
Predicate argument structures             13
Long distance dependencies                13
Others                                    52
Total                                    343
Result of treebank conversion & lexicon extraction
• Treebank conversion and HPSG annotation succeeded for 37,886 sentences
• Extracted lexicon:

  # words                            34,765
  # lexical entries                   1,942
  Average # lexical entries/word       1.43
Sources of treebank conversion failures
• Classification of failures of treebank conversion in Section 02 (67 failures / 1,989 sentences)

  Shortcomings of tree conversion rules    18
  Errors in Penn Treebank                  16
  Constructions currently unsupported      20
  Constructions unsupported by HPSG        13
Breakdown of extracted lexical entries

              # words   # lexical entries   Avg. # lex. entries
noun           21,925          186                 1.14
verb            4,094          945                 1.94
adjective       8,078           62                 1.28
adverb          1,295           72                 2.75
preposition       159          193                 9.17
particle           58           10                 1.69
determiner         36           33                 3.86
conjunction        94          321                 9.46
punctuation        15          120                22.00
Total          34,765        1,942                 1.43
Example lexical entries

Common noun (e.g. review/NN), appeared 140,805 times:
  HEAD noun, MOD <>, VAL [SPR <HEAD det>, SUBJ <>, COMPS <>]

Transitive verb, appeared 12,244 times:
  HEAD verb, MOD <>, VFORM base, VAL [SPR <>, SUBJ <HEAD noun>, COMPS <HEAD noun>]

Pre-head adjective, appeared 55,049 times:
  HEAD adj, MOD <HEAD noun>, POSTHEAD -, VAL [SPR <>, SUBJ <>, COMPS <>]
Evaluation of coverage
• The ratio of lexical entries in the test data covered by the grammar is measured
• A sentence is covered when all of the lexical entries in the sentence are covered (strong coverage)

                              Lexical entry   Sentence
  w/o unknown word handling      96.52%         54.7%
  w/ unknown word handling       99.15%         84.8%
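The two figures can be computed as follows (a sketch of the definitions with toy data, not the evaluation code actually used):

```python
# Token-level lexical coverage vs. "strong" sentence coverage
# (every token in the sentence covered).

def coverage(sentences, lexicon):
    """sentences: list of lists of (word, lexical_entry) pairs."""
    tokens = [tok for sent in sentences for tok in sent]
    covered = lambda w, le: le in lexicon.get(w, ())
    lex_cov = sum(covered(w, le) for w, le in tokens) / len(tokens)
    sent_cov = sum(all(covered(w, le) for w, le in s) for s in sentences) / len(sentences)
    return lex_cov, sent_cov

lexicon = {"he": {"pronoun"}, "saw": {"transitive"}, "girl": {"noun"}}
sents = [[("he", "pronoun"), ("saw", "transitive")],
         [("saw", "intransitive"), ("girl", "noun")]]
print(coverage(sents, lexicon))   # (0.75, 0.5): 3 of 4 tokens, 1 of 2 sentences
```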
Treebank size vs. coverage
Sentence length vs. coverage
Error analysis
• Classification of randomly selected uncovered lexical entries

  Errors of Penn Treebank                  10
  Errors of treebank conversion            48
  Lack of lexical entries                  23
  Constructions currently unsupported       9
  Idioms                                    6
  Non-linguistic expressions (ex. list)     4
Examples of uncovered lexical entries
• Lack of mappings from words to lexical entries because of data sparseness
  – Post-noun adjectives (younger, crucial)
  – Coordination conjunctions of NP and S’
  – Verbs taking a present participle as a complement
• Incorrect lexical entries obtained because of idiomatic expressions
  – (ADVP in part) because …
Evaluation of parsing accuracy
• Empirical evaluation of the probabilistic models
  – Overall accuracy
  – Treebank size vs. accuracy
  – Sentence length vs. accuracy
  – Contribution of features
  – Coverage and accuracy
  – Error analysis
• Measure: precision/recall of <predicate word, argument position, argument word, predicate type> tuples
  – e.g. <saw, ARG1, he, transitive> for “he saw a girl” (saw: ARG1 = he, ARG2 = girl)
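The measure itself is straightforward set precision/recall over such tuples (a sketch with toy data):

```python
# Precision/recall over sets of
# <predicate word, argument position, argument word, predicate type> tuples.

def prec_rec(gold, predicted):
    correct = len(gold & predicted)
    return correct / len(predicted), correct / len(gold)

gold = {("saw", "ARG1", "he", "transitive"), ("saw", "ARG2", "girl", "transitive")}
pred = {("saw", "ARG1", "he", "transitive"), ("saw", "ARG2", "telescope", "transitive")}
print(prec_rec(gold, pred))   # (0.5, 0.5)
```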
Effect of feature forest models
• Accuracy for Section 23 (< 40 words)

                           Precision   Recall
  baseline                   78.10     77.39
  with syntactic features    86.92     86.28
  with semantic features     84.29     83.74
  with all features          86.54     86.02
Treebank size vs. accuracy
[Figure: precision/recall (%) against training-set size (0–40,000 sentences)]
Sentence length vs. accuracy
[Figure: per-sentence results (%) against sentence length (0–60 words)]
Contribution of features (1/2)

            precision   recall   # features
  All         87.12     85.45     623,173
  - RULE      86.98     85.37     620,511
  - DIST      86.74     85.09     603,748
  - COMMA     86.55     84.77     608,117
  - SPAN      86.53     84.98     583,638
  - SYM       86.90     85.47     614,975
  - WORD      86.67     84.98     116,044
  - POS       86.36     84.71     430,876
  - LE        87.03     85.37     412,290
  None        78.22     76.46      24,847
Contribution of features (2/2)

                           precision   recall   # features
  All                        87.12     85.45     623,173
  - DIST,SPAN                85.54     84.02     294,971
  - DIST,SPAN,COMMA          83.94     82.44     286,489
  - RULE,DIST,SPAN,COMMA     83.61     81.98     283,897
  - WORD,LE                  86.48     84.91      50,258
  - WORD,POS                 85.56     83.94      64,915
  - WORD,POS,LE              84.89     83.43      33,740
  - SYM,WORD,POS,LE          82.81     81.48      26,761
  None                       78.22     76.46      24,847
Coverage and accuracy
• Accuracies for strongly covered/uncovered sentences
• We can expect accuracy improvements by improving grammar coverage

                         Precision   Recall   # sentences
  Covered sentences        89.36     88.96       1,825
  Uncovered sentences      75.57     74.04         319
Error analysis
• Classification of errors in 100 randomly selected sentences

  PP-attachment ambiguity                  76
  Distinction of arguments/modifiers       49
  Ambiguity of lexical entries             44
  Errors in test data                      22
  Ambiguity of commas                      32
  Others                                   75
Examples of errors (1/2)
• Antecedent of a relative clause
  – It’s made only in years when the grapes ripen perfectly (the last was 1979) and comes from a single acre of [NP grapes [S' that yielded a mere 75 cases in 1987]].
• Argument/modifier distinction of to-phrases
  – More than a few CEOs say the red-carpet treatment tempts them [VP-modifier to return to a heartland city for future meetings].
Examples of errors (2/2)
• Preposition or verb phrase?
  – Mitsui Mining & Smelting Co. posted a 62% rise in pretax profit to 5.276 billion yen ($36.9 million) in its fiscal first half ended Sept. 30 [VP compared with 3.253 billion yen a year earlier].
• Corpus-oriented development of HPSG
  – Y. Miyao, T. Ninomiya, and J. Tsujii. (2003). Lexicalized Grammar Acquisition. In Proc. 10th EACL Companion Volume.
  – Y. Miyao, T. Ninomiya, and J. Tsujii. (2004). Corpus-oriented grammar development for acquiring a Head-Driven Phrase Structure Grammar from the Penn Treebank. In Proc. IJCNLP 2004.
  – H. Nakanishi, Y. Miyao, and J. Tsujii. (2004). Using Inverse Lexical Rules to Acquire a Wide-coverage Lexicalized Grammar. In the IJCNLP 2004 Workshop on “Beyond Shallow Analyses.”
  – H. Nakanishi, Y. Miyao and J. Tsujii. (2004). An Empirical Investigation of the Effect of Lexical Rules on Parsing with a Treebank Grammar. In Proc. TLT 2004.
  – K. Yoshida. (2005). Corpus-Oriented Development of Japanese HPSG Parsers. In 43rd ACL Student Research Workshop.
Publications
• Feature forest model
  – Y. Miyao and J. Tsujii. (2002). Maximum entropy estimation for feature forests. In Proc. HLT 2002.
• Probabilistic models for HPSG
  – Y. Miyao and J. Tsujii. (2003). A model of syntactic disambiguation based on lexicalized grammars. In Proc. 7th CoNLL.
  – Y. Miyao, T. Ninomiya and J. Tsujii. (2003). Probabilistic modeling of argument structures including non-local dependencies. In Proc. RANLP 2003.
  – Y. Miyao and J. Tsujii. (2005). Probabilistic disambiguation models for wide-coverage HPSG parsing. In Proc. ACL 2005.
  – T. Ninomiya, T. Matsuzaki, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2006). Extremely Lexicalized Models for Accurate and Fast HPSG Parsing. In Proc. EMNLP 2006.
Publications
• Parsing strategies for probabilistic HPSG
  – Y. Tsuruoka, Y. Miyao and J. Tsujii. (2004). Towards efficient probabilistic HPSG parsing: integrating semantic and syntactic preference to guide the parsing. In the IJCNLP-04 Workshop on “Beyond shallow analyses.”
  – T. Ninomiya, Y. Tsuruoka, Y. Miyao, and J. Tsujii. (2005). Efficacy of Beam Thresholding, Unification Filtering and Hybrid Parsing in Probabilistic HPSG Parsing. In Proc. IWPT 2005.
  – T. Ninomiya, Y. Tsuruoka, Y. Miyao, K. Taura, and J. Tsujii. (2006). Fast and Scalable HPSG Parsing. Traitement automatique des langues (TAL). 46(2).
• Domain adaptation
  – T. Hara, Y. Miyao, and J. Tsujii. (2005). Adapting a probabilistic disambiguation model of an HPSG parser to a new domain. In Proc. IJCNLP 2005.
Publications
• Generation
  – H. Nakanishi, Y. Miyao, and J. Tsujii. (2005). Probabilistic models for disambiguation of an HPSG-based chart generator. In Proc. IWPT 2005.
• Semantics construction
  – M. Sato, D. Bekki, Y. Miyao, and J. Tsujii. (2006). Translating HPSG-style Outputs of a Robust Parser into Typed Dynamic Logic. In Proc. COLING-ACL 2006 Poster Session.
• Applications
  – Y. Miyao, T. Ohta, K. Masuda, Y. Tsuruoka, K. Yoshida, T. Ninomiya, and J. Tsujii. (2006). Semantic Retrieval for the Accurate Identification of Relational Concepts. In Proc. COLING-ACL 2006.
  – A. Yakushiji, Y. Miyao, T. Ohta, Y. Tateisi, and J. Tsujii. (2006). Automatic Construction of Predicate-Argument Structure Patterns for Biomedical Information Extraction. In EMNLP 2006 Poster Session.