Robust rule-based parsing (quick overview) I. Robustness II. Three robust rule-based parsers of English III. Common features IV. Example : identification of subjects in Syntex
Nov 29, 2014
I. Robustness (Aït-Mokhtar et al. 1997)
« the ability to provide useful analyses for real-world input text. By useful analyses, we mean analyses that are (at least partially) correct and usable in some automatic task or application »
implies :
- 1 analysis (even partial) for any real-world input
- ability to process irregular input, to overcome errors
- analysis efficiency
I. Types of robust parsers (Aït-Mokhtar et al. 1997)
- based on traditional theoretical models, with rule-based and/or stochastic post-processing : Minipar (Lin 1995)
- stochastic parsers : Charniak's parser (2000)
- rule-based parsers : Non-Projective Dependency Parser (Tapanainen & Järvinen 1997), Syntex (Bourigault 2007), Cass (Abney 1990, 1995)
most parsers are hybrid
II.1 Non-Projective Dependency Parser (Tapanainen & Järvinen 1997)
[Diagram : Tagged Text → Syntactic Labeling → Selection of syntactic links → Pruning → OUTPUT, with valency / subcategorization information as an additional resource]
« all legitimate surface-syntactic labels are added to the set of morphological readings »
« syntactic rules discard contextually illegitimate alternatives or select legitimate ones »
Pruning : general heuristics disambiguate the last of the syntactic links
II.1 Non-Projective Dependency Parser (Tapanainen & Järvinen 1997)
Rules are contextual :
- If the preceding word is an unambiguous auxiliary, the current word is the subject of this auxiliary
- SELECT (@SUBJ) IF (1C AUXMOD HEAD);
- Example : How do [AUX] you [SUBJ] do ?
Rules use syntactic links established by preceding rules
Rules establish dependency links between words
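The contextual rule described above can be sketched in Python. This is only an illustration of the slide's prose statement (preceding word is an unambiguous auxiliary, so the current word keeps only its subject reading); it does not reproduce the parser's own rule formalism. The `Token` class, the function name and the labels `@ADVL`, `@AUX`, `@OBJ` and `@MAINV` are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str
    pos: set      # remaining part-of-speech readings (hypothetical tag names)
    labels: set   # remaining candidate syntactic labels

def select_subj_after_aux(tokens):
    """If the preceding word is unambiguously an auxiliary and the current
    word has @SUBJ among its candidates, keep only @SUBJ."""
    for prev, cur in zip(tokens, tokens[1:]):
        if prev.pos == {"AUX"} and "@SUBJ" in cur.labels:
            cur.labels = {"@SUBJ"}
    return tokens

# "How do you do ?" with illustrative, hand-written readings
sentence = [
    Token("How", {"ADV"}, {"@ADVL"}),
    Token("do", {"AUX"}, {"@AUX"}),
    Token("you", {"PRON"}, {"@SUBJ", "@OBJ"}),
    Token("do", {"V"}, {"@MAINV"}),
]
select_subj_after_aux(sentence)
print([(t.form, sorted(t.labels)) for t in sentence])
```

The rule only removes alternatives; words whose context does not match keep all their candidate labels for later rules.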
II.2 Syntex (Bourigault 2007)
[Diagram : Tagged Text → processing modules → OUTPUT]
Modules shown : Verb Chunk, non-recursive NP, non-recursive SP, Prep Attachment, Object, Subject
Endogenous and exogenous subcategorization information (shown as input to two of the modules)
Example fragments : he will leave / the man / from Paris / happy tree friends (??) / This is the man from Paris (?) / This is the man
II.2 Syntex (Bourigault 2007)
One module per syntactic relation
Each module processes the sentence from left to right.
Example : Those who think they are interested in water supply must vote
Like the Non-Projective Dependency Parser, the rules :
- establish dependency relations between words
- are contextual
- use syntactic links established by preceding rules
The identification of a dependency link is formulated as a «path» to be followed up through the existing links and grammatical categories, from governor to dependent or from dependent to governor
Ambiguous relations : selection of potential governors + disambiguation with probabilities
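The two-step treatment of ambiguous relations (collect candidate governors, then choose with probabilities) can be sketched as follows. The function name, the lookup table and above all the probability values are invented for illustration; the slides do not say how Syntex estimates or stores them.

```python
def choose_governor(prep, candidates, attach_prob):
    """Return the candidate governor with the highest attachment
    probability for the given preposition (0.0 if the pair is unknown)."""
    return max(candidates, key=lambda gov: attach_prob.get((gov, prep), 0.0))

# "This is the man from Paris": which word governs "from"?
candidates = ["is", "man"]          # step 1: potential governors
attach_prob = {                     # step 2: made-up probabilities
    ("is", "from"): 0.2,
    ("man", "from"): 0.8,
}
best = choose_governor("from", candidates, attach_prob)
print(best)
```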
II.3 Cass (Abney 1990,1995)
[Diagram : Tagged Text → CHUNK FILTER → CLAUSE FILTER → PARSE FILTER → OUTPUT]
CHUNK FILTER (NP filter, Chunk filter)
- Non-recursive chunks
- Internal structure remains ambiguous
- [NP the happy tree friends]
- [SP from [NP the happy tree friends]] [VP will leave]
CLAUSE FILTER (Raw Clause filter, Clause Repair filter)
- Subject-predicate relation
- Beginning and end of simplex clauses
- Repair if no subject-predicate relation
- [SUBJ This] [PRED is] [NP the man] [SP from Paris]
PARSE FILTER (uses subcategorization information)
- Assembles recursive structures
- [[This] [is] [NP the man] [SP from Paris]]
II.3 Cass (Abney 1990,1995)
Each filter uses transducers :
PP → (Prep|To)+ (NP|Vbg)
Each filter makes a decision (determinism), the safest one in case of ambiguity :
« ambiguity is not propagated downstream »
Use of repair (also used in Syntex and NPDP but less explicit) :
« when errors become apparent downstream, the parser attempts to repair them »
« repair consists in directly modifying erroneous structure without regard to the history of computation that produced the structure »
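The PP pattern above can be approximated with a regular expression applied to the sequence of tags. This is a minimal sketch in which Python's `re` module stands in for a finite-state transducer; the function name, the bracketed output format and the `Vbd` tag in the example are assumptions, not Cass's actual implementation.

```python
import re

# (Prep|To)+ (NP|Vbg), written over a space-terminated tag string
PP_RULE = re.compile(r"\b(?:(?:Prep|To) )+(?:NP|Vbg) ")

def bracket_pp(tags):
    """Rewrite each match of the PP pattern as one bracketed PP element.
    Matching is greedy, so a run of prepositions is absorbed whole."""
    text = "".join(tag + " " for tag in tags)
    out = PP_RULE.sub(lambda m: "PP[" + m.group(0).strip() + "] ", text)
    return out.strip()

print(bracket_pp(["NP", "Prep", "NP", "Vbd"]))
```

Each pattern commits to a single bracketing, which mirrors the deterministic behaviour of the filters: no alternative segmentation is kept for later stages.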
II.3 Cass (Abney 1990,1995)
Example of repair : In South Australia beds of boulders were deposited …
Erroneous structure output from the Chunk filter :
[SP In [NP South Australia beds]] [SP of [NP boulders]] [VP were deposited]
Raw Clause filter : no subject is found
Repair filter tries to find a subject by modifying the structure :
[SP In [NP South Australia]] [NP-SUBJ beds] [SP of boulders] [VP were deposited]
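The repair step in this example can be sketched in Python. The chunk representation and the heuristic (split the last word off the first multi-word NP inside an SP when no subject precedes the verb) are assumptions chosen so that the sketch reproduces the slide's example; they are not Abney's actual repair rules.

```python
def has_subject(chunks):
    """True if an NP chunk occurs before the first VP."""
    for chunk in chunks:
        if chunk["label"] == "VP":
            return False
        if chunk["label"] in ("NP", "NP-SUBJ"):
            return True
    return False

def repair(chunks):
    """If no subject is found, directly modify the structure: detach the
    last word of the first multi-word NP embedded in an SP and promote it
    to NP-SUBJ. Returns a new list; the input is left untouched."""
    if has_subject(chunks):
        return chunks
    for i, chunk in enumerate(chunks):
        if chunk["label"] == "VP":
            break
        if chunk["label"] == "SP" and len(chunk["np"]) > 1:
            shortened = {**chunk, "np": chunk["np"][:-1]}
            subject = {"label": "NP-SUBJ", "np": [chunk["np"][-1]]}
            return chunks[:i] + [shortened, subject] + chunks[i + 1:]
    return chunks

chunks = [
    {"label": "SP", "prep": "In", "np": ["South", "Australia", "beds"]},
    {"label": "SP", "prep": "of", "np": ["boulders"]},
    {"label": "VP", "words": ["were", "deposited"]},
]
repaired = repair(chunks)
print([c["label"] for c in repaired])
```

The function looks only at the current structure, never at how the chunk filter produced it, in line with the quoted description of repair.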
III. Common features : Incrementality
The parsing task is divided into subtasks
- reduces the overall complexity of the main task : « factoring the problem into a sequence of small, well defined questions » (Abney 1990).
- problem of circularity : difficult to choose in what order the relations should be identified (Bourigault 2007)
The sentence is parsed in several phases, each phase producing an intermediate structure
- allows each phase to use the syntactic information left by the preceding phase
- « the level of abstraction produced during the 1st phase (...) facilitates the description of deeper syntactic relations » (Aït-Mokhtar et al. 1997)
- ease of maintenance
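A phased architecture of this kind can be sketched as a chain of functions, each receiving the intermediate structure left by the previous one. The two toy phases and their rules (Det + Noun forms an NP; the NP immediately before a VP is the subject) are assumptions made for the example and are far simpler than any of the three parsers.

```python
def chunk_phase(analysis):
    """Phase 1: group tagged words into flat chunks."""
    tagged, chunks, i = analysis["tagged"], [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag == "Det" and i + 1 < len(tagged) and tagged[i + 1][1] == "Noun":
            chunks.append(("NP", [word, tagged[i + 1][0]]))
            i += 2
        else:
            chunks.append(("VP" if tag == "Verb" else tag, [word]))
            i += 1
    return {**analysis, "chunks": chunks}

def subject_phase(analysis):
    """Phase 2: reuse the chunks left by phase 1 to find a subject."""
    chunks, subject = analysis["chunks"], None
    for left, right in zip(chunks, chunks[1:]):
        if left[0] == "NP" and right[0] == "VP":
            subject = left[1][-1]   # last word of the NP, taken as its head
            break
    return {**analysis, "subject": subject}

def parse(tagged, phases=(chunk_phase, subject_phase)):
    """Run the phases in order; each one enriches the shared analysis."""
    analysis = {"tagged": tagged}
    for phase in phases:
        analysis = phase(analysis)
    return analysis

result = parse([("the", "Det"), ("man", "Noun"), ("leaves", "Verb")])
print(result["chunks"], result["subject"])
```

Because the order of phases is fixed in `parse`, the circularity problem shows up directly: `subject_phase` cannot run before `chunk_phase` has produced its chunks.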
III. Common features : determinism and repair
Each parsing phase yields one solution.
- In case of ambiguity, the safest choice is made, even if some higher-level information is needed
- ambiguity is not propagated downstream
- Most regular errors can be repaired later on
≠ parallelism, backtracking
« The salient performance is not errors vs no errors, but the tradeoff between speed and error rate » (Abney 1990)
III. Common features: no syntactic theory
Use of common grammatical knowledge
Hours of corpus observation to find clues for automatic identification
Difference between :
- the theoretical study of the syntactic structures of language
- automatic identification of grammatical relations in real-world texts
Difficulties in automatic syntactic analysis :
- lack of knowledge (semantics/pragmatics for disambiguation)
- deviation from the norm of the language
- errors of preceding processing steps
III. Common features : implicit grammatical knowledge
Bipartite architecture :
- Lexical information
- Recognition routines
No independent declaration of grammatical knowledge
Difficult / impossible to set apart :
- Grammatical knowledge
- Non grammar-based heuristics
No linguist/computer scientist job separation
- Need both linguistic and programming know-how
- A condition for scalability and robustness
IV. Example : the subject relation in Syntex
The identification of the subject relation is formulated as a «path» through the already identified grammatical relations :
the/Det cost/Noun of/Prep technology/Noun takes/Verb time/Noun to/Prep shrink/Noun
Links already identified : DET, PREP, NOMPREP, OBJ, NOMPREP
Link to identify : SUJ
Path :
- start from tensed verb (TENSED VERB : takes)
- move to the left
- stop when you encounter an ungoverned Noun (SUBJECT : cost)
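The three path steps can be sketched as a leftward scan over words that carry their already identified governor. The `Word` class and the governor indices below are an assumed encoding of the links listed on the slide (the exact attachment of each link is not shown there), and the sketch omits the island-by-island decisions made in Syntex.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Word:
    form: str
    tag: str
    governor: Optional[int] = None   # index of the governing word, if any

def find_subject(words, verb_index):
    """Start from the tensed verb, move left, and stop at the first
    ungoverned Noun. Returns its index, or None if there is none."""
    i = verb_index - 1
    while i >= 0:
        if words[i].tag == "Noun" and words[i].governor is None:
            return i
        i -= 1
    return None

words = [
    Word("the", "Det", governor=1),          # DET
    Word("cost", "Noun"),                    # not governed yet
    Word("of", "Prep", governor=1),          # PREP
    Word("technology", "Noun", governor=2),  # NOMPREP
    Word("takes", "Verb"),                   # tensed verb
    Word("time", "Noun", governor=4),        # OBJ
    Word("to", "Prep"),
    Word("shrink", "Noun", governor=6),      # NOMPREP
]
subject_index = find_subject(words, verb_index=4)
print(words[subject_index].form)
```

The scan skips «technology» because it is already governed by «of», which is how existing links keep the path from stopping too early.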
IV. Using existing links
The subject might be far from the tensed verb
Many configurations are possible :
- Initiatives leading to cessation of smoking in workplaces are adopted (island types shown : PP, PP, Gerund)
- Those who think they are interested in water supply must vote. (island types shown : PP, Clause, Clause)
- No reference to the war, or to the alliance, should remain (island types shown : PP, PP, Conj)
Existing links form dependency islands (~syntagms or isolated words)
Following up the islands until a reasonable subject is found makes it possible to find subjects without describing all possible configurations or doing too much computing
IV. Ambiguities
Many persons have died in Darfur since the conflict began
A person sitting on the death row since the age of 16 is not the same as before.
Many adults believe education equates intelligence.
Those who think they are interested in water supply must vote.
When to stop ? When to follow up ? When to repair ?
IV. Path decomposition
At each island, a decision is made by a dedicated sub-module (one type of island = one sub-module) :
- stop and identify a subject (without repair / with repair)
- follow up to the island on the left
- stop and return failure
- change path direction (to the right / to any other position in the sentence)
- call other module
Decisions are encoded as if-then rules that may test :
- local and non-local context : lemmas, morphosyntactic (ms) tags, links, presence of commas…
- specific information left by other modules : encountered tags, activated modules…
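The one-sub-module-per-island-type organisation can be sketched as a dispatch table. Only three of the decisions listed above are modelled (identify a subject, follow up to the left, fail), and the toy sub-modules decide unconditionally, whereas the real ones test context with if-then rules. All names, including the "##" key used here for the sentence start, are assumptions.

```python
from enum import Enum, auto

class Decision(Enum):
    SUBJECT = auto()      # stop and identify a subject
    FOLLOW_LEFT = auto()  # follow up to the island on the left
    FAIL = auto()         # stop and return failure

def np_module(island):
    return Decision.SUBJECT, island["head"]

def skip_module(island):
    return Decision.FOLLOW_LEFT, None

def wall_module(island):
    return Decision.FAIL, None

# one type of island = one sub-module
SUBMODULES = {
    "NP": np_module,
    "PP": skip_module,
    "Gerund": skip_module,
    "Clause": skip_module,
    "##": wall_module,
}

def follow_path(islands):
    """islands: the islands left of the tensed verb, in sentence order.
    Visit them from right to left until a sub-module stops the path."""
    for island in reversed(islands):
        decision, word = SUBMODULES[island["type"]](island)
        if decision is Decision.SUBJECT:
            return word
        if decision is Decision.FAIL:
            return None
    return None

# "Initiatives leading to cessation of smoking in workplaces are adopted"
islands = [
    {"type": "##"},
    {"type": "NP", "head": "Initiatives"},
    {"type": "Gerund"},
    {"type": "PP"},
    {"type": "PP"},
]
subject = follow_path(islands)
print(subject)
```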
IV. Path Example : following up
Korea who we believe to have WMD is safe from us.
[Diagram : islands to the left of the tensed verb : Clause, PP]
PP module
Clause module : _ RelPron [[SUJ Pron] Verb]
Korea : SUBJ
IV. Path example : repair
Many adults believe education equates intelligence.
Clause module
Structure found : ## [[SUBJ NP] Verb [OBJ NP]] Verb
Structure after repair : [[SUBJ NP] Verb] OBJ
[Diagram labels : Clause, SUBJ, OBJ]
IV. Path example : sub-module call
On the walls were scarlet banners
[Diagram : islands PP, NP]
PP module
Wall module : ## [PP] Verb _
InvertedSubject module
banners : SUBJ
IV. Path example : change path
On the contrary, war hysteria was continuous and deliberate, and acts such as looting, murdering, the slaughters of prisoners, were considered as normal.
All three political Parties at the federal level, and certainly at the provincial level in different sections, have parity clauses.
Although no directive was ever issued, it was known that the chief of the Department intended that within one week no reference to the war with Eurasia, or to the alliance, should remain
[Diagram labels : Commas module, PP module, Clause module, PP module ; categories Adj, Conj, Noun]
+2.6 Recall
-0.07 Precision
IV. Evaluation on Susanne Corpus
            Tensed verb       Subject identification     SUBJECT RELATION
            identification    (if tensed verb correct)   (correct tensed verb
            (TreeTagger)                                 and correct subject)
precision   94.87             94.56                      89.51
recall      89.76             90.84                      81.53
f-measure   92.24             92.66                      85.33
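The f-measure figures are consistent with the balanced harmonic mean of precision and recall, F = 2PR / (P + R). That this is the formula used is an assumption, since the slides do not state it; the short check below simply recomputes the three values from the reported precision and recall.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as reported in the evaluation
scores = {
    "tensed verb identification": (94.87, 89.76),
    "subject identification": (94.56, 90.84),
    "subject relation": (89.51, 81.53),
}
for name, (p, r) in scores.items():
    print(f"{name}: {f_measure(p, r):.2f}")
```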
Evaluation of shallow subjects only. The subjects in the following constructions are not identified or evaluated :
- I've never seen the dog hiding his bones.
- She wants me to clean my shoes
- The book is read by the boy
Bibliography
Abney (1990) : « Rapid Incremental Parsing with Repair », Proceedings of the 6th New OED Conference, University of Waterloo, Waterloo, Ontario.
Abney (1995) : « Partial Parsing via Finite-State Cascades », Natural Language Engineering, Cambridge University Press. www.sfs.uni-tuebingen.de/~abney/StevenAbney.html#cass
Aït-Mokhtar et al. (1997) : « Incremental Finite-State Parsing », Proceedings of ANLP-97, Washington.
Bourigault (2007) : Syntex, analyseur syntaxique opérationnel, Thèse d'Habilitation à Diriger les Recherches, Université Toulouse - Le Mirail. w3.univ-tlse2.fr/erss/textes/pagespersos/bourigault/syntex.html
Charniak (2000) : « A maximum-entropy-inspired parser », Proceedings of the North American Chapter of the Association for Computational Linguistics, pp. 132–139. http://www.cfilt.iitb.ac.in/~anupama/charniak.php
Lin (1995) : « Dependency-based Evaluation of Minipar », Proceedings of JCAI. http://www.cs.ualberta.ca/~lindek/downloads.htm
Tapanainen & Järvinen (1997) : « A Dependency Parser for English », Technical Reports, No. TR-1, Department of General Linguistics, University of Helsinki, March 1997. www.connexor.com
TreeTagger : http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
Evaluation Corpus : ftp://ftp.cs.umanitoba.ca/pub/lindek/depeval