The Oslo-Bergen Tagger OBT+stat - a short presentation
André Lynum, Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad


Jan 12, 2016


Transcript
Page 1: The Oslo-Bergen Tagger OBT+stat - a short presentation

The Oslo-Bergen Tagger OBT+stat - a short presentation

André Lynum, Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad

Page 2: The Oslo-Bergen Tagger OBT+stat - a short presentation

Morphosyntactic tagger and lemmatizer

• Bokmål and Nynorsk
• Based on lexicon and linguistic rules
• Statistical disambiguation for completely unambiguous output (currently Bokmål only)

Page 3: The Oslo-Bergen Tagger OBT+stat - a short presentation

Purpose

• Annotation for linguistic research (e.g. The Oslo Corpus)
• Large scale corpora annotation (e.g. NoWaC in progress)

Page 4: The Oslo-Bergen Tagger OBT+stat - a short presentation

Applications

• Grammar checker in Microsoft Word and others
• Open source and commercial translation systems (Apertium, NyNo, Kaldera)
• Commercial Content Management Systems (TextUrgy)

Page 5: The Oslo-Bergen Tagger OBT+stat - a short presentation

Resources

Lexicon based on Norsk ordbank

Bokmål:   151 229 entries
Nynorsk:  126 323 entries

Page 6: The Oslo-Bergen Tagger OBT+stat - a short presentation

Resources

Hand-made Constraint Grammar rules

Bokmål:   2214 morphological rules
Nynorsk:  3849 morphological rules

Page 7: The Oslo-Bergen Tagger OBT+stat - a short presentation

Resources

Development and test corpora

Training/development corpus: approx. 120,000 words each for Bokmål and Nynorsk

Test/evaluation corpus: approx. 30,000 words each for Bokmål and Nynorsk

Page 8: The Oslo-Bergen Tagger OBT+stat - a short presentation

Resources

Dependency syntax for both Bokmål and Nynorsk

Page 9: The Oslo-Bergen Tagger OBT+stat - a short presentation

Technology

Multitagger:                Common Lisp
CG Disambiguator:           VislCG3 (C++)
Statistical Disambiguator:  Ruby, HunPos

Page 10: The Oslo-Bergen Tagger OBT+stat - a short presentation

Pipeline

Page 11: The Oslo-Bergen Tagger OBT+stat - a short presentation

Results

Competitive results on varied domains

Page 12: The Oslo-Bergen Tagger OBT+stat - a short presentation

Multitagger

• Sophisticated tokenizer, morphological analyzer and compound word analyzer (guesser)
• Enumerates all possible tags and lemmas
• Tags composed of detailed morphosyntactic information

Page 13: The Oslo-Bergen Tagger OBT+stat - a short presentation

Multitagger output

<word>Dette</word>
"<dette>"
    "dette" verb inf i2 pa4
    "dette" pron nøyt ent pers 3
    "dette" det dem nøyt ent
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
<word>testsetning</word>
"<testsetning>"
    "testsetning" subst appell fem ub ent samset
    "testsetning" subst appell mask ub ent samset
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Page 14: The Oslo-Bergen Tagger OBT+stat - a short presentation

Multitagger output

<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
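The cohort layout above (a `<word>` line, a quoted word-form line, then one reading per line with a quoted lemma followed by tags) can be read back with a few lines of Python. This is a minimal sketch under that assumed layout; the function name `parse_cohorts` and the dictionary structure are illustrative and not part of OBT.

```python
import re

# Hypothetical helper, not part of OBT: parses the cohort layout shown on the slide.
WORD_RE = re.compile(r"<word>(.*?)</word>")
READING_RE = re.compile(r'"([^"<][^"]*)"\s+(.+)')


def parse_cohorts(text):
    """Group the readings (lemma, tag list) under each <word> token."""
    cohorts = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        word = WORD_RE.fullmatch(line)
        if word:
            cohorts.append({"word": word.group(1), "readings": []})
            continue
        if line.startswith('"<'):
            continue  # the quoted word-form line, e.g. "<en>"
        reading = READING_RE.fullmatch(line)
        if reading and cohorts:
            lemma, tags = reading.group(1), reading.group(2).split()
            cohorts[-1]["readings"].append((lemma, tags))
    return cohorts


SAMPLE = '''
<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
'''

cohorts = parse_cohorts(SAMPLE)
for lemma, tags in cohorts[0]["readings"]:
    print(lemma, tags)
```

Each reading is kept as a (lemma, tags) pair so that later steps can compare tag sets without re-parsing the text.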

Page 15: The Oslo-Bergen Tagger OBT+stat - a short presentation

CG Disambiguator

• Based on detailed Constraint Grammar rulesets for Bokmål and Nynorsk

• Rules compatible with the state of the art VislCG3 disambiguator

• Efficiently disambiguates multitagger cohorts with high precision

• Leaves some ambiguity by design

Page 16: The Oslo-Bergen Tagger OBT+stat - a short presentation

CG Rules

#:2553
SELECT:2553 (subst mask ent) IF
        (NOT 0 farlige-mask-subst)
        (NOT 0 fv)
        (NOT 0 adj)
        (NOT -1 komma/konj)
        (**-1C mask-det LINK NOT 0 nr2-det LINK NOT *1 ikke-adv-adj);
#  "en vidunderlig vakker sommerfugl"
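To make the effect of such a rule concrete, the sketch below shows what SELECT and REMOVE do to a cohort once the context conditions are known to hold. It is a simplified illustration, not VislCG3: the context tests are reduced to a boolean argument, and the names `apply_select` and `apply_remove` are hypothetical. It follows the usual Constraint Grammar convention that a rule never deletes the last remaining reading.

```python
def matches(reading_tags, target):
    """A reading matches when it carries every tag in the rule target."""
    return set(target) <= set(reading_tags)


def apply_select(readings, target, context_holds):
    """SELECT: keep only the readings that match the target."""
    if not context_holds:
        return readings
    selected = [r for r in readings if matches(r[1], target)]
    return selected or readings  # never empty the cohort


def apply_remove(readings, target, context_holds):
    """REMOVE: drop the readings that match the target."""
    if not context_holds:
        return readings
    remaining = [r for r in readings if not matches(r[1], target)]
    return remaining or readings  # never empty the cohort


# Cohort for "testsetning" as produced by the multitagger.
cohort = [
    ("testsetning", ["subst", "appell", "fem", "ub", "ent", "samset"]),
    ("testsetning", ["subst", "appell", "mask", "ub", "ent", "samset"]),
]

# Rule 2553 targets (subst mask ent); the context is assumed to hold here.
result = apply_select(cohort, ["subst", "mask", "ent"], context_holds=True)
print(result)
```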

Page 17: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example output

<word>Dette</word>
"<dette>"
    "dette" pron nøyt ent pers 3 SELECT:2607
;   "dette" verb inf i2 pa4 SELECT:2607
;   "dette" det dem nøyt ent SELECT:2607
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>en</word>
"<en>"
    "en" det mask ent kvant SELECT:2762
;   "en" adv REMOVE:3689
;   "en" pron ent pers hum SELECT:2762
;   "ene" verb imp tr1 SELECT:2762
<word>testsetning</word>
"<testsetning>"
    "testsetning" subst appell mask ub ent samset SELECT:2553
;   "testsetning" subst appell fem ub ent samset SELECT:2553
<word>.</word>
"<.>"
    "$." clb <<< <punkt>
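In this traced output, readings preceded by `;` appear to be the ones a rule has discarded, and the trailing `SELECT:n` / `REMOVE:n` names the rule responsible. Assuming that reading of the format, a small Python sketch can separate surviving readings from discarded ones. The function name `split_readings` is illustrative only.

```python
def split_readings(lines):
    """Separate surviving readings from those marked as discarded with ';'."""
    kept, discarded = [], []
    for raw in lines:
        line = raw.strip()
        # Skip blanks, <word> lines and quoted word-form lines such as "<en>".
        if not line or line.startswith("<word>") or line.startswith('"<'):
            continue
        if line.startswith(";"):
            discarded.append(line[1:].strip())
        else:
            kept.append(line)
    return kept, discarded


TRACE = '''
<word>en</word>
"<en>"
    "en" det mask ent kvant SELECT:2762
;   "en" adv REMOVE:3689
;   "en" pron ent pers hum SELECT:2762
;   "ene" verb imp tr1 SELECT:2762
'''.splitlines()

kept, discarded = split_readings(TRACE)
print(len(kept), "kept;", len(discarded), "discarded")
```

A cohort with more than one entry in `kept` is one that the CG disambiguator has left ambiguous.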

Page 18: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of ambiguity left unresolved

<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl
<word>kan</word>
"<kan>"
    "kunne" verb pres tr1 tr3 <aux1/infinitiv>
<word>være</word>
"<være>"
    "være" verb inf tr5
    "være" verb inf a5 pr1 pr2 <aux1/perf_part>
;   "være" subst appell nøyt ubøy REMOVE:3123
<word>vanskelige</word>
"<vanskelige>"
    "vanskelig" adj fl pos
;   "vanskelig" adj be ent pos REMOVE:2318
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Page 19: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of ambiguity left unresolved

<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl

Page 20: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of unresolved ambiguity

<word>Det</word>
"<det>"
    "det" pron nøyt ent pers 3 SELECT:2607
;   "det" det dem nøyt ent SELECT:2607
<word>dreier</word>
"<dreier>"
    "dreie" verb pres tr1 i2 tr11 SELECT:2467
;   "drei" subst appell mask ub fl SELECT:2467
;   "dreier" subst appell mask ub ent SELECT:2467
<word>seg</word>
"<seg>"
    "seg" pron akk refl SELECT:3333
;   "sige" verb pret i2 a3 pa4 SELECT:3333
<word>om</word>
"<om>"
    "om" prep SELECT:2653
;   "om" sbu SELECT:2653
<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk>
    "åndsverk" subst appell nøyt ub ent <*verk>
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Page 21: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of unresolved ambiguity

<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk>
    "åndsverk" subst appell nøyt ub ent <*verk>

Page 22: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of lemma ambiguity

<word>Det</word>
"<det>"
    "Det" subst prop <*>
<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos SELECT:3064
    "gammal" adj be ent pos SELECT:3064
;   "gammel" adj fl pos SELECT:3064
;   "gammal" adj fl pos SELECT:3064
<word>testamentet</word>
"<testamentet>"
    "testament" subst appell nøyt be ent
    "testamente" subst appell nøyt be ent
<word>.</word>
"<.>"

Page 23: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of lemma ambiguity

<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos SELECT:3064
    "gammal" adj be ent pos SELECT:3064

Page 24: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of lemma ambiguity

<word>Oslo</word>
"<oslo>"
    "Oslo" subst prop
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent
<word>vår</word>
"<vår>"
    "vår" det mask ent poss SELECT:2689
;   "vår" det fem ent poss SELECT:2689
;   "vår" subst appell mask ub ent SELECT:2689
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Page 25: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of lemma ambiguity

<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent

Page 26: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of unwanted ambiguity

Livet på jorden har tilpasset seg og tildels utnyttet de skiftende forhold.

Page 27: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of unwanted ambiguity

<word>og</word>
"<og>"
    "og" konj
    "og" konj clb
;   "og" adv REMOVE:2227
<word>til dels</word>
"<til dels>"
    "til dels" adv prep+subst @adv
<word>utnyttet</word>
"<utnyttet>"
    "utnytte" verb pret tr1
    "utnytte" verb perf-part tr1
;   "utnytte" adj nøyt ub ent <perf-part> tr1 REMOVE:2274
;   "utnytte" adj ub m/f ent <perf-part> tr1 REMOVE:2274
<word>de</word>
"<de>"
    "de" det dem fl SELECT:2780
;   "de" pron fl pers 3 nom SELECT:2780
<word>skiftende</word>
"<skiftende>"
    "skifte" adj <pres-part> tr1 i1 i2 tr11 pa1 pa2 pa5 tr13
<word>forhold</word>

Page 28: The Oslo-Bergen Tagger OBT+stat - a short presentation

Example of unwanted ambiguity

<word>utnyttet</word>
"<utnyttet>"
    "utnytte" verb pret tr1
    "utnytte" verb perf-part tr1

Page 29: The Oslo-Bergen Tagger OBT+stat - a short presentation

Statistical disambiguator

• Uses a statistical model to fully disambiguate
• Simple model based on existing resources
• Must discriminate between the ambiguities left by the CG disambiguator

Page 30: The Oslo-Bergen Tagger OBT+stat - a short presentation

Earlier ambiguities - now resolved

<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl     <Correct!>
    "setning" subst appell mask ub fl

Page 31: The Oslo-Bergen Tagger OBT+stat - a short presentation

Earlier ambiguities - now resolved

<word>om</word>
"<om>"
    "om" prep      <Correct!>
    "om" sbu
<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk>          <Correct!>
    "åndsverk" subst appell nøyt ub ent <*verk>

Page 32: The Oslo-Bergen Tagger OBT+stat - a short presentation

Earlier ambiguities - now resolved

<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos    <Correct!>
    "gammal" adj be ent pos
    "gammel" adj fl pos
    "gammal" adj fl pos

Page 33: The Oslo-Bergen Tagger OBT+stat - a short presentation

Earlier ambiguities - now resolved

<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent     <Correct!>

Page 34: The Oslo-Bergen Tagger OBT+stat - a short presentation

Statistical disambiguation process

• Statistical tagger is run independently of the CG disambiguator

• The output is aligned
• Statistical tagger result used to select among ambiguous results
• Simple lemma disambiguation
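The selection step described above can be sketched as follows, assuming the two outputs are already aligned token by token and that the statistical tagger's tag can be compared to a CG reading by exact string match. Both are assumptions of this sketch, as are the names `select_reading` and the returned source labels. Taking the first reading as a fallback is a deterministic stand-in chosen only for this example.

```python
def select_reading(cg_readings, stat_tag):
    """Pick one reading for a token.

    cg_readings: list of (lemma, tag_string) pairs left by the CG disambiguator.
    stat_tag:    tag string proposed by the independently run statistical tagger.
    Returns the chosen reading and a label saying how it was chosen.
    """
    if len(cg_readings) == 1:
        return cg_readings[0], "unambiguous"
    # Keep only the CG readings that agree with the statistical tagger.
    agreeing = [r for r in cg_readings if r[1] == stat_tag]
    if agreeing:
        # Several agreeing readings differ only in lemma; that case is
        # left to the separate lemma disambiguation step.
        return agreeing[0], "statistical"
    # The statistical tag is not among the CG readings (no coverage).
    return cg_readings[0], "fallback"


readings = [
    ("setning", "subst appell fem ub fl"),
    ("setning", "subst appell mask ub fl"),
]
choice, source = select_reading(readings, "subst appell fem ub fl")
print(choice, source)
```

Because the statistical tagger only chooses among readings the CG rules have kept, it cannot reintroduce a reading that a rule has already discarded.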

Page 35: The Oslo-Bergen Tagger OBT+stat - a short presentation

HMM modelling

• Robust performance on smaller amounts of training data
• Good unknown word handling
• Cheap and mature

Page 36: The Oslo-Bergen Tagger OBT+stat - a short presentation

Our HMM model

• Trained on 122 523 words in 8178 sentences
• Variety of domains
• More than 350 distinct tags
• Not very good accuracy really

Page 37: The Oslo-Bergen Tagger OBT+stat - a short presentation

HMM model integration

Ambiguities in ca. 4.5% of tokens
Coverage ca. 80%

Page 38: The Oslo-Bergen Tagger OBT+stat - a short presentation

Lemma disambiguation

Mainly resolved by tag disambiguation
But some lemmas remain ambiguous

Page 39: The Oslo-Bergen Tagger OBT+stat - a short presentation

Using word form frequencies

Idea: lemmas occur as word forms in large corpora

Use word frequencies from NoWaC to disambiguate among lemmas
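The idea can be sketched in a few lines: treat each candidate lemma as a word form, look up how often it occurs in a large corpus, and prefer the more frequent one. The frequency counts below are invented for illustration and are not NoWaC figures; `pick_lemma` is a hypothetical name.

```python
def pick_lemma(candidates, wordform_freq):
    """Choose the candidate lemma that is most frequent as a word form.

    Lemmas missing from the frequency table count as zero. On a tie,
    max() returns the first candidate in the list.
    """
    return max(candidates, key=lambda lemma: wordform_freq.get(lemma, 0))


# Made-up counts, for illustration only.
wordform_freq = {"by": 1000, "bye": 10, "testament": 50, "testamente": 200}

best = pick_lemma(["bye", "by"], wordform_freq)
print(best)
```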

Page 40: The Oslo-Bergen Tagger OBT+stat - a short presentation

Remaining ambiguities

Randomly selected

Page 41: The Oslo-Bergen Tagger OBT+stat - a short presentation

Expectations

• Cheap and cheerful modeling
• Facing a variety of hard disambiguation decisions
• On a large morphosyntactic tagset
• Evaluated on a slightly eclectic corpus

Page 42: The Oslo-Bergen Tagger OBT+stat - a short presentation

Results: CG Disambiguation

Precision  96.03%
Recall     99.02%
F-score    97.2%
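For reference, the balanced F-score is the harmonic mean of precision and recall. The sketch below applies that standard formula to the rounded precision and recall shown on the slide; it gives roughly 97.5, so the reported 97.2% may come from unrounded counts or a different averaging, which the slides do not specify.

```python
def f1_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded percentages as shown on the slide.
f1 = f1_score(96.03, 99.02)
print(round(f1, 2))
```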

Page 43: The Oslo-Bergen Tagger OBT+stat - a short presentation

Results: Full disambiguation

Accuracy 96.56%

Page 44: The Oslo-Bergen Tagger OBT+stat - a short presentation

Results: Full disambiguation

Overall accuracy  96.56%
Tagging accuracy  96.74%
Lemma accuracy    98.33%

Page 45: The Oslo-Bergen Tagger OBT+stat - a short presentation

Details

Tagger coverage  79.39%
Tagger accuracy  81.70%
Lemma coverage   54.23%
Lemma accuracy   86.71%

Page 46: The Oslo-Bergen Tagger OBT+stat - a short presentation

Forthcoming (technical)

• Optimizing for very large corpora (> billion words)
• More sophisticated modeling
• Discriminative modeling or MBT modeling
• Constrained decoding
• Better lemma disambiguation

Page 47: The Oslo-Bergen Tagger OBT+stat - a short presentation

Forthcoming (theoretical)

• Finding the best division of labor between data driven and rule driven approaches

• Pivoting on specific errors and ambiguities
• Working more with syntax (CG3 dependency trees)

Page 48: The Oslo-Bergen Tagger OBT+stat - a short presentation

Links

• http://tekstlab.uio.no/obt-ny/index.html
• http://github.com/andrely/OBT-Stat