Exploiting Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging
Seth Kulick
Linguistic Data Consortium, University of Pennsylvania
[email protected]
Nov. 2010
Outline
- Background
  - Two levels of annotation in the Arabic Treebank
  - Use of a morphological analyzer
- The problem
- My approach
- Evaluation, comparison, etc.
- Integration with parser
- Rant about the morphology/syntax interaction
Two Levels of Annotation
- "Source tokens" (whitespace/punctuation-delimited)
  - S_TEXT: source token text
  - VOC, POS, GLOSS: vocalized form with POS and gloss
  - From this point on, all annotation is on an abstracted form of the source text. This is an issue in every treebank.
- "Tree tokens" (1 source token -> 1 or more tree tokens), needed for treebanking
  - Partition of the source token's VOC, POS, GLOSS, based on the "reduced" form of the POS tags: one "core tag" per tree token
  - T_TEXT: tree token text; artificial, although assumed real
Example: Source Token -> 2 Tree Tokens
- Partition based on the reduced POS tags NOUN and POSS_PRON
- Trivial for VOC, POS, GLOSS; not for S_TEXT -> T_TEXT
Example: Source Token -> 1 Tree Token
- Partition based on the reduced POS tag DET+NOUN
Distribution of Reduced Core Tags
Morphological Analyzer
- Where does the annotation for the source token come from?
- A morphological analyzer: SAMA (née BAMA), the Standard Arabic Morphological Analyzer
  - For a given source token, it lists the different possible solutions
  - Each solution has a vocalization, POS, and gloss
- Good aspect: everything is "compiled out"; it doesn't over-generate
- Bad aspect: everything is "compiled out"; hard to keep overall track of the morphological possibilities (what words can take what prefixes, etc.)
SAMA example
- Input word: ktbh
The Problem
- Go from a sequence of source tokens to ...
  - the best SAMA solution for each source token
    - utilizes SAMA as the source of alternative solutions
    - can then split and create tree tokens for parsing
  - or a partial solution for each source token: a partition of each source token into reduced POS and T_TEXT
- How to partition the morphological analysis in a pipeline? (i.e., what is the morphology-syntax interface here?)
- And what happens for Arabic "dialects" for which there is no hand-crafted morphological analyzer? Or simply on new corpora for which SAMA has not been manually updated?
My approach
- Open- and closed-class words are very different
  - The ATB lists the closed-class words: PREP, SUB_CONJ, REL_PRON, ...
  - Rather than trying to keep track of affix possibilities in SAMA, write some regular expressions
  - Different morphology
- Very little overlap between open and closed class (Abbott & Costello, 1937)
  - e.g., "fy" can be an abbreviation, but is almost always PREP
- Exploit this: do something stupid for the closed class, and something else for the open class (NOUN, ADJ, IV, PV, ...)
- Not using a morphological analyzer
Middle Ground
- MADA (Habash & Rambow, 2005), SAMT (Shash et al., 2010)
  - pick a single solution from the SAMA possibilities
  - tokenization, POS, lemma, vocalization all at once
- AMIRA: "data-driven" pipeline (no SAMA)
  - tokenization, then (reduced) POS; no lemma or vocalization
- Us: want to get tokenization/tags for parsing
  - closed-class regexes: essentially a small-scale analyzer
  - open-class: CRF tokenizer/tagger
  - like MADA and SAMT: simultaneous tokenization/POS-tagging
  - like AMIRA: no SAMA
Closed-class words
- Regular expressions encoding tokenization and core POS for the words listed in the ATB morphological guidelines
- "wlm" "text-matches" REGEX#1 and REGEX#2; wlm/CONJ+REL_ADV "pos-matches" only REGEX#2
- Store the most frequent pos-matching regex for a given S_TEXT, and the most frequent POS tag for each group in the pos-matching regex
- Origin: checking SAMA
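The text-match vs. pos-match distinction can be sketched as follows. This is a minimal sketch: the two patterns and their tag sets are a toy inventory for illustration, not the actual ATB regexes.

```python
import re

# Each closed-class regex pairs a surface pattern with the POS tags its
# groups may carry. "text-match" checks only the surface form;
# "pos-match" also requires the gold tags to be among the allowed tags.
REGEXES = [
    {"name": "REGEX#1",
     "pattern": re.compile(r"^(w)(lm)$"),
     "tags": ({"SUB_CONJ", "PREP"}, {"NEG_PART"})},
    {"name": "REGEX#2",
     "pattern": re.compile(r"^(w)(lm)$"),
     "tags": ({"CONJ"}, {"REL_ADV"})},
]

def text_matches(s_text):
    """All regexes whose surface pattern matches S_TEXT."""
    return [r for r in REGEXES if r["pattern"].match(s_text)]

def pos_matches(s_text, gold_tags):
    """Text-matching regexes whose groups also allow the gold tags."""
    return [r for r in text_matches(s_text)
            if all(t in ok for t, ok in zip(gold_tags, r["tags"]))]

print([r["name"] for r in text_matches("wlm")])                      # ['REGEX#1', 'REGEX#2']
print([r["name"] for r in pos_matches("wlm", ["CONJ", "REL_ADV"])])  # ['REGEX#2']
```

At run time, only the text-match is available; the stored most-frequent pos-matching regex resolves the remaining ambiguity.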
Classifier for Open-class words
- Features: run S_TEXT through all open-class regexes
  - If it text-matches (training and testing):
    - feature: (stem-name, characters-matching-stem)
    - the stem-name encodes the existence of a prefix/suffix
  - If it also pos-matches (training):
    - gold label: the stem-name together with the POS
    - encodes the correct tokenization for the entire word and the POS for the stem
Classifier for Open-class words: Example 1
- Input word: yjry (happens)
- Gold label: stem:IV
  - sufficient, along with the input word, to identify the regex for the full tokenization
- Also derived features for stems:
  - _f1 = first letter of stem, _l1 = last letter of stem, etc.
  - stem_len=4, stem_spp_len=3
  - stem_f1=y, stem_l1=y, stem_spp_f1=y, stem_spp_l1=r
- Also stealing the proper noun listing from SAMA

Matching regular expression     Resulting feature
yjry/NOUN,ADJ,IV,... *          stem=yjry
yjr/NOUN+y/POSS_PRON            stem_spp=yjr

(* marks the row corresponding to the gold label)
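The feature extraction for yjry can be sketched like this. The two templates and the suffix list are illustrative stand-ins; the real stem-name inventory comes from the ATB guidelines.

```python
import re

# Toy open-class templates: each stem-name encodes which affixes are
# present (here, "stem" = bare word, "stem_spp" = stem + POSS_PRON suffix).
OPEN_TEMPLATES = [
    ("stem",     re.compile(r"^(?P<stem>\w+)$")),
    ("stem_spp", re.compile(r"^(?P<stem>\w+?)(y|h|hA|hm)$")),
]

def open_features(s_text):
    """Features for every text-matching template, plus derived features:
    stem length, first letter (_f1), and last letter (_l1)."""
    feats = {}
    for name, rx in OPEN_TEMPLATES:
        m = rx.match(s_text)
        if m:
            stem = m.group("stem")
            feats[name] = stem
            feats[name + "_len"] = len(stem)
            feats[name + "_f1"] = stem[0]
            feats[name + "_l1"] = stem[-1]
    return feats

f = open_features("yjry")
# Reproduces the slide's values: stem=yjry, stem_len=4, stem_f1=y,
# stem_l1=y, stem_spp=yjr, stem_spp_len=3, stem_spp_l1=r
```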
Classifier for Open-class words: Example 2
- Input: wAfAdt (and + reported)
- Gold label: p_stem:PV
- Also the derived features

Matching regular expression     Resulting feature
wAfAdt/all                      stem=wAfAdt
w+AfAdt/all *                   p_stem=AfAdt
Classifier for Open-class words: Example 3
- Input: ktbh (books + his)
- Gold label: stem_spp:NOUN
- Also the derived features

Matching regular expression        Resulting feature
ktbh/all                           stem=ktbh
k/PREP+tbh/NOUN                    p_stem=tbh
k/PREP+tb/NOUN+h/POSS_PRON         p_stem_spp=tb
ktb/NOUN+h/POSS_PRON *             stem_spp=ktb
ktb/IV,PV,CV+h/OBJ_PRON            stem_svop=ktb
k/PREP+tb/IV,PV,CV+h/OBJ_PRON      p_stem_svop=tb
Classifier for Open-class words: Example 4
- Input: Alktb (the + books)
- Gold label: stem:DETNOUN
- Derived features: stem_len=5, stemDET_len=3, p_stem_len=4, ...

Matching regular expression     Resulting feature
Alktb/all *                     stem=Alktb
A/INTERROG_PART+lktb/PV         p_stem=lktb
Classifier for Open-class words: Training
- Conditional Random Field classifier (Mallet)
- Each token is run through all the regexes, open and closed
  - If it pos-matches a closed-class regex: feature=MATCHES_CLOSED, gold label=CLOSED
  - Else: features assigned from the open-class regexes
- 72 gold labels: cross-product (stem-name, POS tag) + CLOSED
- Classifier is used only for open-class words, but gets all words in the sentence as a sequence
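The per-token training setup might be sketched as below. This is a toy stand-in: the actual system builds Mallet CRF instances from the full regex runs, and the derived features are omitted here.

```python
def training_instance(s_text, gold):
    """Build one (features, label) pair for the CRF.

    gold is either the string "CLOSED" (token pos-matched a closed-class
    regex) or a (stem_name, pos) pair for an open-class token.
    """
    if gold == "CLOSED":
        # Closed-class tokens collapse to a single feature and label.
        return {"MATCHES_CLOSED": True}, "CLOSED"
    stem_name, pos = gold
    # Open-class: the label is drawn from the (stem-name, POS) cross-product.
    feats = {stem_name: s_text}  # plus the derived _len/_f1/_l1 features
    return feats, f"{stem_name}:{pos}"

feats, label = training_instance("yjry", ("stem", "IV"))
print(label)  # stem:IV
```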
Classifier for Open-class words: Training (cont.)
- The gold label, together with the source token S_TEXT, maps to a single regex
  - results in a tokenization with a list of possible POS tags for the affixes (to be disambiguated from the stored lists of tags for affixes)
  - gets the tokenization and the POS tag for the stem at the same time
- 6 days to train; 3 days if all NOUN* and ADJ* labels are collapsed
  - current method is obviously wrong: a hack for publishing
  - to do: get coarser tags and use a different method for the rest
Interlude before Results: How complete is the coverage of the regexes?
- The regexes were easy to construct, independent of any particular ATB section
- Some possibilities mistakenly left out
  - e.g., NOUN_PROP+POSS_PRON
- Some "possibilities" purposefully left out
  - NOUN+NOUN: h/p typo correction
SAMA example (it's not really that ambiguous)
- Input word: ktbh (not ktbp!)  (h = ه, p = ة)
Interlude before Results: How much open/closed ambiguity is there?
- i.e., how often is it the case that the solution for a source token is open-class, but the S_TEXT matches both open- and closed-class regexes?
- If this happens a lot, this work will crash
- In the dev and test sections: 305 cases
  - 109 of them NOUN_PROP or ABBREV (fy)
- Overall tradeoff: give up on such cases; instead:
  - CRF for joint tokenization/tagging of open-class words
  - absurdly high baseline for closed-class words and prefixes
- Future: better multi-word name recognition
Experiments: training data
- Data: ATB v3.2, train/dev/test split as in (Roth et al., 2008)
  - no use of the dev section for us
- Training produces the open-class classifier plus two stored lists:
- "S_TEXT-solution" list: (S_TEXT, solution) pairs
  - open-class: the "solution" is the gold label
  - closed-class: the name of the single pos-matching regex
  - can be used to get the regex and the core POS tag for a given S_TEXT
- "regex-group-tag" list: ((regex-group-name, T_TEXT), POS_TAG) entries
  - used to get the most common POS tag for affixes
Lists example
- S_TEXT input = "wlm"
- Consult the S_TEXT-solution list to get the solution "REGEX#1"
  - gives the solution w:[PART..PREP] + lm/NEG_PART
- Consult the regex-group-tag list to get the POS for w
  - solution: w/CONJ+lm/NEG_PART
- The regex-group-tag list is also used for the affixes in open-class solutions
- The S_TEXT-solution list is also used in some cases for open-class words
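The two lookups above can be sketched with toy list contents. All entries here are assumed for illustration; the real lists are populated during training.

```python
# Most frequent pos-matching regex per source token (toy contents).
s_text_solution = {"wlm": "REGEX#1"}

# REGEX#1 splits wlm into w + lm; the prefix group's tag is ambiguous
# (None), while the lm group is fixed as NEG_PART.
regex_defs = {"REGEX#1": [("w", None), ("lm", "NEG_PART")]}

# Most frequent tag per ambiguous regex group, from training.
regex_group_tag = {("REGEX#1", "w"): "CONJ"}

def solve(s_text):
    """Tokenize and tag via the stored lists, as in the slide's example."""
    name = s_text_solution[s_text]
    solution = []
    for t_text, tag in regex_defs[name]:
        if tag is None:  # ambiguous group: consult the regex-group-tag list
            tag = regex_group_tag[(name, t_text)]
        solution.append((t_text, tag))
    return solution

print(solve("wlm"))  # [('w', 'CONJ'), ('lm', 'NEG_PART')]
```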
Four sources of solutions for a given S_TEXT
- Stored: S_TEXT was stored with a solution during training (open or closed)
- Random open-class match: chosen at random from the text-matching open-class regexes
- Random closed-class match: chosen at random from the text-matching closed-class regexes
- Mallet: solution found by the CRF classifier
Evaluation
- Training/dev-test: ATB3 v3.2
- Tokenization: word-error evaluation
- POS: scored on the "core tag"; counted as a miss if the tokenization is a miss
- Not assuming gold tokenization

Origin     Baseline              Run priority-stored    Run priority-classifier
           # tokens  Tok   POS   # tokens  Tok   POS    # tokens  Tok   POS
(All)      25305     95.8  84.7  25305     99.3  93.5   25305     99.2  93.5
Stored     22386     99.8  95.2  22386     99.8  95.2   7910      99.7  96.7
Open       2896      64.8  3.3   6         50.0  0.0    6         50.0  0.0
Closed     23        82.6  65.2  23        82.6  65.2   23        82.6  65.2
Mallet     0         -     -     2890      95.6  80.2   17366     99.0  92.2
Results: Baseline
- If S_TEXT is in the "S_TEXT-solution" list, use the most frequent stored solution
- else if S_TEXT text-matches >= 1 closed-class regex, pick one at random
- else pick a random text-matching open-class regex
- Almost all closed-class words were seen in training
- 3.3% POS score for open-class words not seen in training

Origin     # tokens  Tok   POS
(All)      25305     95.8  84.7
Stored     22386     99.8  95.2
Open       2896      64.8  3.3
Closed     23        82.6  65.2
Mallet     0         -     -
Results: Run priority-stored
- If S_TEXT is in the "S_TEXT-solution" list, use the most frequent stored solution
- else if S_TEXT text-matches >= 1 closed-class regex, pick one at random
- else use the result of the CRF classifier

Origin     # tokens  Tok   POS
(All)      25305     99.3  93.5
Stored     22386     99.8  95.2
Open       6         50.0  0.0
Closed     23        82.6  65.2
Mallet     2890      95.6  80.2
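The run strategies differ only in lookup priority. A minimal sketch, with the sources of solutions stubbed out as arguments (all names are illustrative): `stored` maps S_TEXT to its most frequent stored solution, `closed(s)`/`open_rx(s)` return the text-matching closed-/open-class regex solutions, and `crf(s)` is the Mallet classifier's solution.

```python
import random

def baseline(s, stored, closed, open_rx, crf):
    # stored solution, else random closed match, else random open match
    if s in stored:
        return stored[s]
    return random.choice(closed(s) or open_rx(s))

def priority_stored(s, stored, closed, open_rx, crf):
    # stored solution, else random closed match, else the CRF
    if s in stored:
        return stored[s]
    return random.choice(closed(s)) if closed(s) else crf(s)

def priority_classifier(s, stored, closed, open_rx, crf):
    # closed-class text match takes priority; everything else goes to the CRF
    if closed(s):
        return stored.get(s) or random.choice(closed(s))
    return crf(s)
```

Only the fallback after the lookups changes between runs, which is why the Stored and Closed rows of the tables barely move while the Open/Mallet rows shift.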
Results: Run priority-classifier
- If S_TEXT text-matches >= 1 closed-class regex:
  - if it is in the "S_TEXT-solution" list, use that
  - else pick a random text-matching closed-class regex
- else use the result of the CRF classifier
- Shows that the baseline for closed-class items is very high
- Prediction for future work: variation in the Mallet score, not so much in the closed-class baseline

Origin     # tokens  Tok   POS
(All)      25305     99.2  93.5
Stored     7910      99.7  96.7
Open       6         50.0  0.0
Closed     23        82.6  65.2
Mallet     17366     99.0  92.2
Comparison with Other Work
- MADA 3.0 (a MADA 3.1 comparison has not been done)
  - published results are not a good source of comparison: different data sets, and gold tokenization is assumed for the POS score
  - MADA produces different and additional data
  - comparison: train and test my system on the same data as MADA 3.0, treating the MADA output as SAMA output
- SAMT
  - can't make sense of the published results in comparison to MADA, or of how it does on tokenization for parsing
- MADA & SAMT: a "pure" test is impossible
- AMIRA (Diab et al.): comparison impossible
Comparison with MADA 3.0
- Oh yeah. But it may not look so good against MADA 3.1, if I can even train on the same data
- Need to do this with SAMT

Origin     MADA                  Run priority-stored    Run priority-classifier
           # tokens  Tok   POS   # tokens  Tok   POS    # tokens  Tok   POS
(All)      25305     99.0  92.0  25305     99.3  93.5   25305     99.2  93.5
Stored     -         -     -     22386     99.8  95.2   7910      99.7  96.7
Open       -         -     -     6         50.0  0.0    6         50.0  0.0
Closed     -         -     -     23        82.6  65.2   23        82.6  65.2
Mallet     -         -     -     2890      95.6  80.2   17366     99.0  92.2
Integration with Parser: pause to talk about data
- Three forms of tree tokens, e.g., S_TEXT = "Ant$Arh"
  - vocalized (SAMA): {inoti$Ar/NOUN + hu/POSS_PRON_3MS
  - unvocalized (vocalized with the diacritics stripped out): {nt$Ar/NOUN + h/POSS_PRON_3MS
  - input-string (T_TEXT): Ant$Ar/NOUN + h/POSS_PRON_3MS
- Most parsing work uses the unvocalized form (prev. incoherent)
- MADA produces the vocalized form, so the unvocalized form can be derived
- Mine produces the input-string (T_TEXT)
  - doesn't have the normalizations that exist in the unvocalized form (future work: build them into the closed-class regexes. Duh!)
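Deriving the unvocalized form from the vocalized (SAMA) form amounts to stripping diacritic characters from the Buckwalter transliteration. A minimal sketch; the diacritic set below is my assumption of the relevant Buckwalter symbols (short vowels a/i/u, sukun o, tanween F/N/K, shadda ~, dagger alif `):

```python
# Buckwalter diacritic characters (assumed set, see lead-in).
BW_DIACRITICS = set("aiuoFNK~`")

def unvocalize(vocalized):
    """Strip Buckwalter diacritics, e.g. {inoti$Ar -> {nt$Ar."""
    return "".join(c for c in vocalized if c not in BW_DIACRITICS)

print(unvocalize("{inoti$Ar"))  # {nt$Ar
print(unvocalize("hu"))         # h
```

Note that the input-string (T_TEXT) form can still differ from the result, e.g. Ant$Ar vs. {nt$Ar, which is exactly the normalization gap the slide mentions.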
Integration with Parser
- Mode 1: parser chooses its own tags
- Mode 2: parser forced to use the given tags
- Run 1: used the ADJ.VN, NOUN.VN gold tags
- Run 2: without .VN (not produced by SAMA/MADA)
- Run 3: uses the output of MADA or our system
- Evaluated with Sparseval
Integration with Parser (cont.)
- Two trends reversed with tagger output (Run 3)
  - T_TEXT better than UNVOC, probably because of better tokenization
  - better when the parser selects its own tags, for both systems
- What's going on here?
Tokenization/Parser Interface
- The parser can recover POS tags, but tokenization is harder
- 99.3% tokenization: what does it get wrong?
  - biggest category (7% of errors): "kmA"
  - k/PREP+mA/{REL_PRON,SUB_CONJ} vs. kmA/CONJ
- Build a big model of tokenization and parsing? (Green & Manning, 2010, ...)
  - tokenization: MADA 97.67, Stanford joint system 96.26
  - parsing: gold 81.1, MADA 79.2, joint 76.0
  - but "MADA is language specific and relies on manually constructed dictionaries. Conversely the lattice parser requires no linguistic resources."
- The morphology/syntax interface is not identity.
Future Work
- Better partition of the tagging/syntax work
  - absurd to have different targets for NOUN, NOUN_NUM, NOUN_QUANT, ADJ_NUM, etc.
  - can probably do this with a simple NOUN/IV/PV classifier, with a simple maxent classifier for a second pass
  - or some sort of integration with an NP/idafa chunker
- Do something with the closed-class baseline?
- Evaluate what the parser is doing; can we improve on that?
- Special handling for special cases (kmA)
- Better proper noun handling
- Integrate morphological patterns into the classifier
- Specialized regexes for dialects?
How Much Prior Linguistic Knowledge?
- Two classes of words
  - closed-class: PREP, SUB_CONJ, REL_PRON, ...
    - regular expressions for all closed-class input words
  - open-class: NOUN, ADJ, IV, PV, ...
    - simple generic templates
- Classifier used only for open-class words
  - predicts only the most likely stem/POS; the stem name and input string identify the regular expression
- Closed-class words remain at "baseline"
Future Work
- Use the proper noun list
- True morphological patterns
- Two levels of classifiers?
- Robustness: closed class doesn't vary, open class does
- Mix and match with MADA and SAMT