Deep Grammars in Hybrid Machine Translation University of Bergen Helge Dyvik
Mar 31, 2015
Lexicon, Lexical Semantics, Grammar, and Translation for Norwegian
A 4-year project (2002-2006) involving groups at:
•The University of Oslo
•The University of Bergen
•NTNU (The University of Trondheim)
Cooperation with PARC (John Maxwell) and others
The LOGON system: Schematic architecture
XLE: Xerox Linguistic Environment
A platform developed over more than 20 years at Xerox PARC (now PARC)
Developer: John Maxwell
•LFG grammar development
•Parsing
•Generation
•Transfer
•Stochastic parse selection
•Interaction with shallow methods
An LFG analysis:
Det regnet 'It rained'
ParGram: The Parallel Grammar Project
A long-term project (1993-)
•Develops parallel grammars on XLE: English, French, German, Norwegian, Japanese, Urdu, Welsh, Malagasy, Arabic, Hungarian, Chinese, Vietnamese
•‘Parallel grammars’ means parallel f-structures:
- A common inventory of features
- Common principles of analysis
LOGON Analysis Modules
[Diagram: analysis pipeline. Labels recovered from the figure:]
Input string
•Tokenization
•Named entities
•Compounds
•Morphology
LFG lexicons:
•NKL-derived
•Hand-coded
Lexical templates
Syntactic rules
Rule templates
c-structures
f-structures
MRSs
Norsk ordbank lexicon
XLE Parser
NorGram
String of stems and tags
Output-input
Supporting knowledge base
Scope of NorGram
Lexicon: about 80 000 lemmas. In addition:
•Automatically analyzed compounds
•Automatically recognized proper names
•"Guessed" nouns
Syntax: 229 complex rules, giving rise to about 48 000 arcs
Semantics: Minimal Recursion Semantics projections for all readings
Coverage
Performance on an unknown corpus of newspaper text:
•17 randomly selected pieces of text, limited to coherent text,
•comprising 1000 sentences
•taken from 9 newspapers
Adresseavisen, Aftenposten, Aftenposten nett, Bergens Tidende,
Dagbladet, Dagens Næringsliv, Dagsavisen, Fædrelandsvennen, Nordlys,
•from the editions on November 11th 2005.
The LOGON challenge:
From a resource grammar based on independent linguistic principles, derive MRS structures harmonized with the MRS structures of the HPSG English Resource Grammar.
Semantics for translation: Two issues
• The representational subset problem
- Desirable: normalization to flat structures with unordered elements.
• Complete and detailed semantic analyses may be unnecessary.
- Desirable: rich possibilities of underspecification
Basics of Minimal Recursion Semantics
•Developers: A. Copestake, D. Flickinger, R. Malouf, S. Riehemann, I. Sag
•A framework for the representation of semantic information
•Developed in the context of HPSG and machine translation (Verbmobil)
•Sources of inspiration:
- Quasi-Logical Form (H. Alshawi): underspecification, e.g. of quantifier scope
- Shake-and-bake translation (P. Whitelock): a bag of words as interface structure
An MRS representation
• is a bag of semantic entities (some corresponding to words, some not), each with a handle,
• plus a bag of handle constraints allowing the underspecification of scope,
• plus a handle and an index.
• Each semantic entity is referred to as an Elementary Predication (EP).
• Relations among EPs are captured by means of shared variables.
• There are three elementary variable types:
- handles (or 'labels') (h)
- events (e)
- referential indices (x)
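The components listed above can be written down as a small data structure. The following is a minimal Python sketch, not the LOGON or XLE representation: the class names, field names and predicate spellings (`EP`, `MRS`, `hcons`, `_cat_n`, `_sleep_v`) are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EP:
    """Elementary Predication: a predicate with its own handle and arguments."""
    handle: str        # label of this EP, e.g. "h1"
    pred: str          # predicate name (hypothetical spelling)
    args: tuple = ()   # variables: handles (h), events (e), referential indices (x)


@dataclass
class MRS:
    top: str                                    # top handle
    index: str                                  # index of the whole structure
    rels: list = field(default_factory=list)    # bag of EPs
    hcons: list = field(default_factory=list)   # bag of handle constraints


def variable_type(var):
    """Classify a variable by its first letter, following the three types above."""
    return {"h": "handle", "e": "event", "x": "referential index"}[var[0]]


def shared_variables(mrs):
    """Variables that occur in more than one EP; this is how EPs are related."""
    seen = {}
    for ep in mrs.rels:
        for var in ep.args:
            seen.setdefault(var, []).append(ep.pred)
    return {var: preds for var, preds in seen.items() if len(preds) > 1}


# Toy structure for 'The cat sleeps' (quantifier omitted to keep it short).
cat_sleeps = MRS(
    top="h0",
    index="e1",
    rels=[EP("h1", "_cat_n", ("x1",)), EP("h2", "_sleep_v", ("e1", "x1"))],
)
```

In this toy structure, `shared_variables(cat_sleeps)` reports that `x1` links the noun EP and the verb EP, which is the sense in which relations among EPs are captured by shared variables.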
From standard logical form to MRS
«Every ferry crosses some fjord»
Two readings:
Replace operators with generalized quantifiers:
every(variable, restriction, body)
some(variable, restriction, body)
The first reading (wide-scope every):
var restriction body
Make the structure flat:
• give each EP a handle
• replace embedded EPs by their handles
• collect all EPs on the same level (understood as conjunction)
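The three flattening steps can be sketched in a few lines of Python. This is an illustration only, under the assumption that the nested reading is encoded as tuples `(predicate, arg, ...)`; the function name `flatten` and the handle numbering are hypothetical.

```python
from itertools import count


def flatten(term, eps, fresh):
    """Flatten a nested term into `eps`; return the handle given to `term`."""
    handle = f"h{next(fresh)}"          # step 1: give each EP a handle
    pred, *args = term
    flat_args = [
        flatten(arg, eps, fresh) if isinstance(arg, tuple) else arg
        for arg in args                 # step 2: embedded EPs become handles
    ]
    eps.append((handle, pred, *flat_args))  # step 3: all EPs on one level
    return handle


# Wide-scope 'every' reading of "Every ferry crosses some fjord".
wide_every = (
    "every", "x",
    ("ferry", "x"),
    ("some", "y", ("fjord", "y"), ("cross", "x", "y")),
)

eps = []
top = flatten(wide_every, eps, count(1))
for ep in eps:
    print(ep)
```

The resulting list is read as a conjunction of EPs. The scope relations of the nested input are still fully recoverable from the handles, so this flat structure is not yet underspecified.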
Underspecified scope by means of handle constraints:
Wide scope: some
Wide scope: every
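How one underspecified structure yields both readings can be sketched as follows. This is a deliberately simplified Python illustration, not a general MRS scope resolver: it assumes that each restriction hole is simply identified with the noun label it is constrained to, that quantifiers nest only through their body holes, and that the handle names are arbitrary.

```python
from itertools import permutations

# handle: (predicate, variable, restriction hole, body hole)
QUANTS = {
    "h1": ("every", "x", "h2", "h3"),
    "h4": ("some", "y", "h5", "h6"),
}
# handle: (predicate, arguments...)
OTHERS = {"h7": ("ferry", "x"), "h8": ("fjord", "y"), "h9": ("cross", "x", "y")}
HCONS = {"h2": "h7", "h5": "h8"}   # restriction hole constrained to noun label
VERB = "h9"


def readings():
    """Yield (top handle, hole-to-label plugging) for each quantifier order."""
    for order in permutations(QUANTS):          # outermost quantifier first
        plug = dict(HCONS)                      # simplification: qeq as equality
        for outer, inner in zip(order, order[1:]):
            plug[QUANTS[outer][3]] = inner      # body hole takes next quantifier
        plug[QUANTS[order[-1]][3]] = VERB       # innermost body takes the verb
        yield order[0], plug


def render(handle, plug):
    """Rebuild the nested formula from the flat EPs and a plugging."""
    if handle in QUANTS:
        pred, var, rstr, body = QUANTS[handle]
        return f"{pred}({var}, {render(plug[rstr], plug)}, {render(plug[body], plug)})"
    pred, *args = OTHERS[handle]
    return f"{pred}({', '.join(args)})"


results = [render(top, plug) for top, plug in readings()]
for formula in results:
    print(formula)
```

The body holes `h3` and `h6` are left unconstrained in the flat structure; fixing them in the two possible ways gives the wide-scope `every` and the wide-scope `some` reading.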
MRS as feature structure (also adding event variables):
Norwegian translation: «Hver ferge krysser en fjord»
Projecting MRS representations from f-structures
«Katten sover» 'The cat sleeps'
Composition: Top-level MRS with unions of HCONS and RELS:
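The idea of collecting the top-level structure as unions can be sketched in Python. This is only an illustration under assumptions: the partial structures, the predicate spellings (`_katt_n`, `def_q`, `_sove_v`) and the function name `compose` are hypothetical and are not taken from NorGram.

```python
def compose(top, index, parts):
    """Build a top-level MRS whose RELS and HCONS are the (bag) unions
    of the RELS and HCONS of the partial structures in `parts`."""
    rels, hcons = [], []
    for part in parts:
        rels += part["RELS"]
        hcons += part["HCONS"]
    return {"TOP": top, "INDEX": index, "RELS": rels, "HCONS": hcons}


# Hypothetical partial structures for «Katten sover» 'The cat sleeps'.
noun_part = {
    "RELS": [("h3", "_katt_n", "x1"), ("h4", "def_q", "x1", "h5", "h6")],
    "HCONS": [("h5", "qeq", "h3")],
}
verb_part = {
    "RELS": [("h1", "_sove_v", "e1", "x1")],
    "HCONS": [("h0", "qeq", "h1")],
}

top_level = compose("h0", "e1", [noun_part, verb_part])
```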
Post-processing this structure brings us back to the LOGON MRS format:
http://decentius.aksis.uib.no/logon/xle-mrs.xml
bil 'car' (as in "Han kjøpte bil" 'He bought [a] car')
No SPEC
disse hans mange spørsmål 'these his many questions'
Multiple SPECs
Han jaget barnet ut nakent 'He chased the child out naked'
The Transfer Component
Developer of the formalism: Stephan Oepen
Example of transfer
Source sentence:
Henter han bilen sin?
fetches he car.DEF POSS.REFL.SG.MASC
'Does he fetch his car?'
Alternative reading: 'Does he fetch the one of the car?'
Parse output:
Choosing the first reading of Henter han bilen sin?
The variables have features. Interrogative is coded as [SF ques] on the event variable.
Two of four transfer outputs
Norwegian transfer input
One of four English transfer outputs
Generator output from the chosen transfer output
Transfer formalism (Stephan Oepen)
The form of a transfer rule:
C = context
I = input
F = filter
O = output
Simple example: Lexical transfer rule, transferring bekk into creek
No context, no filter, only the predicate is replaced.
Example with a context restriction: gå en tur (lit. 'go a trip') is transferred into the light-verb construction take a trip.
In the context of _tur_n as its second argument, _gå_v is transferred to _take_v.
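The roles of the four parts of a rule can be illustrated with a small Python sketch. It is not the LOGON transfer formalism or its rule syntax: the `Rule` class, the matcher and the predicate spellings `_bekk_n` and `_creek_n` are assumptions made for illustration, EPs are reduced to `(predicate, variables...)` tuples, and every rule argument is treated as a variable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    input_: tuple          # I: EPs consumed by the rule
    output: tuple          # O: EPs that replace them
    context: tuple = ()    # C: EPs that must be present and are kept
    filter_: tuple = ()    # F: EPs that must not be present


def match(pattern, ep, bindings):
    """Match one pattern EP against one EP; return new bindings or None."""
    if len(pattern) != len(ep) or pattern[0] != ep[0]:
        return None
    new = dict(bindings)
    for var, value in zip(pattern[1:], ep[1:]):
        if new.setdefault(var, value) != value:
            return None
    return new


def match_all(patterns, eps, bindings):
    """Yield bindings under which every pattern matches some EP."""
    if not patterns:
        yield bindings
        return
    for ep in eps:
        b = match(patterns[0], ep, bindings)
        if b is not None:
            yield from match_all(patterns[1:], eps, b)


def apply_rule(rule, eps):
    """Apply `rule` once to the bag `eps`; return the bag unchanged if it fails."""
    for b in match_all(rule.context + rule.input_, eps, {}):
        if rule.filter_ and next(match_all(rule.filter_, eps, b), None) is not None:
            continue                      # filter matched: rule is blocked
        rest = list(eps)
        for p in rule.input_:
            rest.remove((p[0], *[b[v] for v in p[1:]]))
        return rest + [(p[0], *[b[v] for v in p[1:]]) for p in rule.output]
    return list(eps)


# Lexical rule: no context, no filter, only the predicate is replaced.
bekk = Rule(input_=(("_bekk_n", "x"),), output=(("_creek_n", "x"),))

# Context rule: _gå_v becomes _take_v when its second argument is a _tur_n.
go_trip = Rule(
    context=(("_tur_n", "x2"),),
    input_=(("_gå_v", "e", "x1", "x2"),),
    output=(("_take_v", "e", "x1", "x2"),),
)

walk = [("_gå_v", "e1", "x1", "x2"), ("_tur_n", "x2")]
print(apply_rule(go_trip, walk))
```

Because `_tur_n` appears only in the context, it is required for the match but is left in place, while the verb EP is the only one rewritten.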
The SEM-I (Semantic Interface)
A documentation of the external semantic interface for a grammar, crucial for the writer of transfer rules.
To enforce maintenance of the SEM-I, LOGON parsing returns fail if every parse contains at least one predicate not in the SEM-I.
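The failure condition can be stated compactly in Python. This is a sketch of the condition only: the function name, the return convention and the predicate names are assumptions, and what LOGON does with the parses that pass the check is not specified here.

```python
def sem_i_check(parses, sem_i):
    """Return ("fail", []) if every parse uses a predicate outside the SEM-I,
    otherwise ("ok", parses fully covered by the SEM-I)."""
    covered = [parse for parse in parses if set(parse) <= sem_i]
    if not covered:
        return "fail", []
    return "ok", covered


# Hypothetical SEM-I and parses, each parse given as its list of predicates.
sem_i = {"_regne_v", "_det_q", "pron"}
parses = [
    ["pron", "_regne_v"],
    ["_det_q", "_regn_n"],      # "_regn_n" is missing from this toy SEM-I
]

status, usable = sem_i_check(parses, sem_i)
```

A single fully covered parse is enough to avoid failure; the check fails only when no parse is covered.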
A small section of the verb part of the NorGram SEM-I
Size of the Norwegian SEM-I: slightly less than 6000 entries
Parse Selection
Parsing, transfer and generation may each give many solutions, leading to a fanout tree:
The outputs at each of the three stages are statistically ranked.
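The fanout tree and one possible use of the per-stage rankings can be sketched in Python. Everything here is an assumption for illustration: the scores are invented, and multiplying the three stage scores along a path is just one simple way to combine rankings, not necessarily what LOGON does.

```python
def best_path(parses, transfer, generate):
    """Walk the fanout tree and return (score, sentence) for the best leaf.

    `parses` is a list of (analysis, score); `transfer` and `generate`
    map an item to a list of (result, score) for the next stage.
    """
    best = None
    for analysis, p_score in parses:
        for target, t_score in transfer(analysis):
            for sentence, g_score in generate(target):
                total = p_score * t_score * g_score   # assumed combination
                if best is None or total > best[0]:
                    best = (total, sentence)
    return best


# Toy tables with invented scores.
PARSES = [("A1", 0.7), ("A2", 0.3)]
TRANSFER = {"A1": [("T1", 0.5), ("T2", 0.5)], "A2": [("T3", 1.0)]}
GENERATE = {
    "T1": [("s1", 0.6), ("s2", 0.4)],
    "T2": [("s3", 1.0)],
    "T3": [("s4", 0.9)],
}

best = best_path(PARSES, TRANSFER.__getitem__, GENERATE.__getitem__)
leaves = sum(len(GENERATE[t]) for a, _ in PARSES for t, _ in TRANSFER[a])
```

In this toy tree there are four leaves, and the best one does not pass through the top-ranked output of every stage taken in isolation, which is why ranking only one stage is not sufficient.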
Example of a four-way ambiguity:
Det regnet 'It rained'/'It calculated'/'That one calculated'/'That rain'
The Parsebanker: Efficient treebank building by discriminants
Developer: Paul Meurer, Bergen
Predecessors in discriminant analysis:
David Carter (1997)
Stephan Oepen, Dan Flickinger & al. (2003)
Packed representations and discriminants (Paul Meurer)
Clicking on one discriminant is in this case sufficient to select a unique solution:
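The principle behind discriminant-based selection can be sketched in Python. This is not the Parsebanker implementation: the analyses are represented simply as sets of properties, and the property labels for the four readings of Det regnet are invented for illustration.

```python
def discriminants(analyses):
    """Properties that hold in some, but not all, of the analyses."""
    all_props = set().union(*analyses.values())
    common = set.intersection(*analyses.values())
    return all_props - common


def select(analyses, prop, keep=True):
    """Keep the analyses that have `prop` (or, with keep=False, that lack it)."""
    return {
        name: props
        for name, props in analyses.items()
        if (prop in props) == keep
    }


# Invented property sets for the four-way ambiguity of "Det regnet".
analyses = {
    "It rained": {"det:expletive", "regnet:verb", "regne:rain"},
    "It calculated": {"det:pronoun", "regnet:verb", "regne:calculate"},
    "That one calculated": {"det:demonstrative", "regnet:verb", "regne:calculate"},
    "That rain": {"det:determiner", "regnet:noun"},
}

remaining = select(analyses, "det:expletive")
```

Choosing the single discriminant `det:expletive` leaves one analysis, whereas rejecting `regnet:noun` with `keep=False` would only narrow the set to three, so further choices would be needed.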
The Parsebanker
'After all, a human being must be something more than a machine?'
TigerSearch
The implementation is under development by Paul Meurer
Find selected prepositional phrases with sentential objects:
Find selected prepositional phrases with the preposition 'om' and nominal objects:
Find topicalized objects: