Exploiting Closed-Class Categories for Arabic Tokenization and Part-of-Speech Tagging
Seth Kulick
Linguistic Data Consortium, University of Pennsylvania
[email protected]
Nov. 2010
Outline
- Background
  - Two levels of annotation in the Arabic Treebank
  - Use of a morphological analyzer
- The problem
- My approach
- Evaluation, comparison, etc.
- Integration with parser
- Rant about the morphology/syntax interaction
Two Levels of Annotation
- "Source tokens" (whitespace/punctuation-delimited)
  - S_TEXT: source token text
  - VOC, POS, GLOSS: vocalized form with POS and gloss
  - From this point on, all annotation is on an abstracted form of the source text. This is an issue in every treebank.
- "Tree tokens" (1 source token -> 1 or more tree tokens), needed for treebanking
  - Partition of the source token's VOC, POS, GLOSS, based on the "reduced" form of the POS tags: one "core tag" per tree token
  - T_TEXT: tree token text; artificial, although assumed real
Example: Source Token -> 2 Tree Tokens
- Partition based on the reduced POS tags NOUN and POSS_PRON
- Trivial for VOC, POS, GLOSS; not for S_TEXT -> T_TEXT
Example: Source Token -> 1 Tree Token
- Partition based on the reduced POS tag DET+NOUN
Distribution of Reduced Core Tags
Morphological Analyzer
- Where does the annotation for the source token come from?
- A morphological analyzer: SAMA (née BAMA), the Standard Arabic Morphological Analyzer
  - For a given source token, it lists the different possible solutions
  - Each solution has a vocalization, POS, and gloss
- Good aspect: everything is "compiled out"; it doesn't over-generate
- Bad aspect: everything is "compiled out"; hard to keep overall track of the morphological possibilities (what words can take what prefixes, etc.)
SAMA example
- Input word: ktbh
The Problem
- Go from a sequence of source tokens to ...
  - the best SAMA solution for each source token
    - utilizes SAMA as the source of alternative solutions
    - can then split and create tree tokens for parsing
  - or a partial solution for each source token: a partition of each source token into reduced POS and T_TEXT
- How to partition the morphological analysis in a pipeline? (i.e., what is the morphology-syntax interface here?)
- And what happens for Arabic "dialects" for which there is no hand-crafted morphological analyzer? Or simply on new corpora for which SAMA has not been manually updated?
My approach
- Open- and closed-class words are very different
  - The ATB lists the closed-class words: PREP, SUB_CONJ, REL_PRON, ...
  - Rather than trying to keep track of affix possibilities in SAMA, write some regular expressions
  - Different morphology
- Very little overlap between open and closed class (Abbott & Costello, 1937)
  - e.g., "fy" can be an abbreviation, but is almost always PREP
- Exploit this: do something stupid for the closed class, and something else for the open class (NOUN, ADJ, IV, PV, ...)
- Not using a morphological analyzer
Middle Ground
- MADA (Habash & Rambow, 2005), SAMT (Shash et al., 2010)
  - pick a single solution from the SAMA possibilities
  - tokenization, POS, lemma, vocalization all at once
- AMIRA: "data-driven" pipeline (no SAMA)
  - tokenization, then (reduced) POS; no lemma or vocalization
- Us: want to get tokenization/tags for parsing
  - closed-class regexes: essentially a small-scale analyzer
  - open-class: CRF tokenizer/tagger
  - like MADA and SAMT: simultaneous tokenization/POS-tagging
  - like AMIRA: no SAMA
Closed-class words
- Regular expressions encoding tokenization and core POS for the words listed in the ATB morphological guidelines
- "wlm" "text-matches" REGEX#1 and REGEX#2; wlm/CONJ+REL_ADV "pos-matches" only REGEX#2
- Store the most frequent pos-matching regex for a given S_TEXT, and the most frequent POS tag for each group in the pos-matching regex
- Origin: checking SAMA
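The text-match vs. pos-match distinction can be sketched as follows. This is a minimal sketch: the two patterns and their tag sets are a toy inventory for illustration, not the actual ATB regexes.

```python
import re

# Each closed-class regex pairs a surface pattern with the POS tags its
# groups may carry. "text-match" checks only the surface form;
# "pos-match" also requires the gold tags to be among the allowed tags.
REGEXES = [
    {"name": "REGEX#1",
     "pattern": re.compile(r"^(w)(lm)$"),
     "tags": ({"SUB_CONJ", "PREP"}, {"NEG_PART"})},
    {"name": "REGEX#2",
     "pattern": re.compile(r"^(w)(lm)$"),
     "tags": ({"CONJ"}, {"REL_ADV"})},
]

def text_matches(s_text):
    """All regexes whose surface pattern matches S_TEXT."""
    return [r for r in REGEXES if r["pattern"].match(s_text)]

def pos_matches(s_text, gold_tags):
    """Text-matching regexes whose groups also allow the gold tags."""
    return [r for r in text_matches(s_text)
            if all(t in ok for t, ok in zip(gold_tags, r["tags"]))]

print([r["name"] for r in text_matches("wlm")])                      # ['REGEX#1', 'REGEX#2']
print([r["name"] for r in pos_matches("wlm", ["CONJ", "REL_ADV"])])  # ['REGEX#2']
```

At run time, only the text-match is available; the stored most-frequent pos-matching regex resolves the remaining ambiguity.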
Classifier for Open-class words
- Features: run S_TEXT through all open-class regexes
  - If it text-matches (training and testing):
    - feature: (stem-name, characters-matching-stem)
    - the stem-name encodes the existence of a prefix/suffix
  - If it also pos-matches (training):
    - gold label: the stem-name together with the POS
    - encodes the correct tokenization for the entire word and the POS for the stem
Classifier for Open-class words: Example 1
- Input word: yjry (happens)
- Gold label: stem:IV
  - sufficient, along with the input word, to identify the regex for the full tokenization
- Also derived features for stems:
  - _f1 = first letter of stem, _l1 = last letter of stem, etc.
  - stem_len=4, stem_spp_len=3
  - stem_f1=y, stem_l1=y, stem_spp_f1=y, stem_spp_l1=r
- Also stealing the proper noun listing from SAMA

Matching regular expression     Resulting feature
yjry/NOUN,ADJ,IV,... *          stem=yjry
yjr/NOUN+y/POSS_PRON            stem_spp=yjr

(* marks the row corresponding to the gold label)
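The feature extraction for yjry can be sketched like this. The two templates and the suffix list are illustrative stand-ins; the real stem-name inventory comes from the ATB guidelines.

```python
import re

# Toy open-class templates: each stem-name encodes which affixes are
# present (here, "stem" = bare word, "stem_spp" = stem + POSS_PRON suffix).
OPEN_TEMPLATES = [
    ("stem",     re.compile(r"^(?P<stem>\w+)$")),
    ("stem_spp", re.compile(r"^(?P<stem>\w+?)(y|h|hA|hm)$")),
]

def open_features(s_text):
    """Features for every text-matching template, plus derived features:
    stem length, first letter (_f1), and last letter (_l1)."""
    feats = {}
    for name, rx in OPEN_TEMPLATES:
        m = rx.match(s_text)
        if m:
            stem = m.group("stem")
            feats[name] = stem
            feats[name + "_len"] = len(stem)
            feats[name + "_f1"] = stem[0]
            feats[name + "_l1"] = stem[-1]
    return feats

f = open_features("yjry")
# Reproduces the slide's values: stem=yjry, stem_len=4, stem_f1=y,
# stem_l1=y, stem_spp=yjr, stem_spp_len=3, stem_spp_l1=r
```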
Classifier for Open-class words: Example 2
- Input: wAfAdt (and + reported)
- Gold label: p_stem:PV
- Also the derived features

Matching regular expression     Resulting feature
wAfAdt/all                      stem=wAfAdt
w+AfAdt/all *                   p_stem=AfAdt
Classifier for Open-class words: Example 3
- Input: ktbh (books + his)
- Gold label: stem_spp:NOUN
- Also the derived features

Matching regular expression        Resulting feature
ktbh/all                           stem=ktbh
k/PREP+tbh/NOUN                    p_stem=tbh
k/PREP+tb/NOUN+h/POSS_PRON         p_stem_spp=tb
ktb/NOUN+h/POSS_PRON *             stem_spp=ktb
ktb/IV,PV,CV+h/OBJ_PRON            stem_svop=ktb
k/PREP+tb/IV,PV,CV+h/OBJ_PRON      p_stem_svop=tb
Classifier for Open-class words: Example 4
- Input: Alktb (the + books)
- Gold label: stem:DETNOUN
- Derived features: stem_len=5, stemDET_len=3, p_stem_len=4, ...

Matching regular expression     Resulting feature
Alktb/all *                     stem=Alktb
A/INTERROG_PART+lktb/PV         p_stem=lktb
Classifier for Open-class words: Training
- Conditional Random Field classifier (Mallet)
- Each token is run through all the regexes, open and closed
  - If it pos-matches a closed-class regex: feature=MATCHES_CLOSED, gold label=CLOSED
  - Else: features assigned from the open-class regexes
- 72 gold labels: cross-product (stem-name, POS tag) + CLOSED
- Classifier is used only for open-class words, but gets all words in the sentence as a sequence
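The per-token training setup might be sketched as below. This is a toy stand-in: the actual system builds Mallet CRF instances from the full regex runs, and the derived features are omitted here.

```python
def training_instance(s_text, gold):
    """Build one (features, label) pair for the CRF.

    gold is either the string "CLOSED" (token pos-matched a closed-class
    regex) or a (stem_name, pos) pair for an open-class token.
    """
    if gold == "CLOSED":
        # Closed-class tokens collapse to a single feature and label.
        return {"MATCHES_CLOSED": True}, "CLOSED"
    stem_name, pos = gold
    # Open-class: the label is drawn from the (stem-name, POS) cross-product.
    feats = {stem_name: s_text}  # plus the derived _len/_f1/_l1 features
    return feats, f"{stem_name}:{pos}"

feats, label = training_instance("yjry", ("stem", "IV"))
print(label)  # stem:IV
```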
Classifier for Open-class words: Training (cont.)
- The gold label, together with the source token S_TEXT, maps to a single regex
  - results in a tokenization with a list of possible POS tags for the affixes (to be disambiguated from the stored lists of tags for affixes)
  - gets the tokenization and the POS tag for the stem at the same time
- 6 days to train; 3 days if all NOUN* and ADJ* labels are collapsed
  - current method is obviously wrong: a hack for publishing
  - to do: get coarser tags and use a different method for the rest
Interlude before Results: How complete is the coverage of the regexes?
- The regexes were easy to construct, independent of any particular ATB section
- Some possibilities mistakenly left out
  - e.g., NOUN_PROP+POSS_PRON
- Some "possibilities" purposefully left out
  - NOUN+NOUN: h/p typo correction
SAMA example (it's not really that ambiguous)
- Input word: ktbh (not ktbp!)  (h = ه, p = ة)
Interlude before Results: How much open/closed ambiguity is there?
- i.e., how often is it the case that the solution for a source token is open-class, but the S_TEXT matches both open- and closed-class regexes?
- If this happens a lot, this work will crash
- In the dev and test sections: 305 cases
  - 109 of them NOUN_PROP or ABBREV (fy)
- Overall tradeoff: give up on such cases; instead:
  - CRF for joint tokenization/tagging of open-class words
  - absurdly high baseline for closed-class words and prefixes
- Future: better multi-word name recognition
Experiments: training data
- Data: ATB v3.2, train/dev/test split as in (Roth et al., 2008)
  - no use of the dev section for us
- Training produces the open-class classifier plus two stored lists:
- "S_TEXT-solution" list: (S_TEXT, solution) pairs
  - open-class: the "solution" is the gold label
  - closed-class: the name of the single pos-matching regex
  - can be used to get the regex and the core POS tag for a given S_TEXT
- "regex-group-tag" list: ((regex-group-name, T_TEXT), POS_TAG) entries
  - used to get the most common POS tag for affixes
Lists example
- S_TEXT input = "wlm"
- Consult the S_TEXT-solution list to get the solution "REGEX#1"
  - gives the solution w:[PART..PREP] + lm/NEG_PART
- Consult the regex-group-tag list to get the POS for w
  - solution: w/CONJ+lm/NEG_PART
- The regex-group-tag list is also used for the affixes in open-class solutions
- The S_TEXT-solution list is also used in some cases for open-class words
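The two lookups above can be sketched with toy list contents. All entries here are assumed for illustration; the real lists are populated during training.

```python
# Most frequent pos-matching regex per source token (toy contents).
s_text_solution = {"wlm": "REGEX#1"}

# REGEX#1 splits wlm into w + lm; the prefix group's tag is ambiguous
# (None), while the lm group is fixed as NEG_PART.
regex_defs = {"REGEX#1": [("w", None), ("lm", "NEG_PART")]}

# Most frequent tag per ambiguous regex group, from training.
regex_group_tag = {("REGEX#1", "w"): "CONJ"}

def solve(s_text):
    """Tokenize and tag via the stored lists, as in the slide's example."""
    name = s_text_solution[s_text]
    solution = []
    for t_text, tag in regex_defs[name]:
        if tag is None:  # ambiguous group: consult the regex-group-tag list
            tag = regex_group_tag[(name, t_text)]
        solution.append((t_text, tag))
    return solution

print(solve("wlm"))  # [('w', 'CONJ'), ('lm', 'NEG_PART')]
```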
Four sources of solutions for a given S_TEXT
- Stored: S_TEXT was stored with a solution during training (open or closed)
- Random open-class match: chosen at random from the text-matching open-class regexes
- Random closed-class match: chosen at random from the text-matching closed-class regexes
- Mallet: solution found by the CRF classifier
Evaluation
- Training/dev-test: ATB3 v3.2
- Tokenization: word-error evaluation
- POS: scored on the "core tag"; counted as a miss if the tokenization is a miss
- Not assuming gold tokenization

Origin     Baseline              Run priority-stored    Run priority-classifier
           # tokens  Tok   POS   # tokens  Tok   POS    # tokens  Tok   POS
(All)      25305     95.8  84.7  25305     99.3  93.5   25305     99.2  93.5
Stored     22386     99.8  95.2  22386     99.8  95.2   7910      99.7  96.7
Open       2896      64.8  3.3   6         50.0  0.0    6         50.0  0.0
Closed     23        82.6  65.2  23        82.6  65.2   23        82.6  65.2
Mallet     0         -     -     2890      95.6  80.2   17366     99.0  92.2
Results: Baseline
- If S_TEXT is in the "S_TEXT-solution" list, use the most frequent stored solution
- else if S_TEXT text-matches >= 1 closed-class regex, pick one at random
- else pick a random text-matching open-class regex
- Almost all closed-class words were seen in training
- 3.3% POS score for open-class words not seen in training

Origin     # tokens  Tok   POS
(All)      25305     95.8  84.7
Stored     22386     99.8  95.2
Open       2896      64.8  3.3
Closed     23        82.6  65.2
Mallet     0         -     -
Results: Run priority-stored
- If S_TEXT is in the "S_TEXT-solution" list, use the most frequent stored solution
- else if S_TEXT text-matches >= 1 closed-class regex, pick one at random
- else use the result of the CRF classifier

Origin     # tokens  Tok   POS
(All)      25305     99.3  93.5
Stored     22386     99.8  95.2
Open       6         50.0  0.0
Closed     23        82.6  65.2
Mallet     2890      95.6  80.2
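The run strategies differ only in lookup priority. A minimal sketch, with the sources of solutions stubbed out as arguments (all names are illustrative): `stored` maps S_TEXT to its most frequent stored solution, `closed(s)`/`open_rx(s)` return the text-matching closed-/open-class regex solutions, and `crf(s)` is the Mallet classifier's solution.

```python
import random

def baseline(s, stored, closed, open_rx, crf):
    # stored solution, else random closed match, else random open match
    if s in stored:
        return stored[s]
    return random.choice(closed(s) or open_rx(s))

def priority_stored(s, stored, closed, open_rx, crf):
    # stored solution, else random closed match, else the CRF
    if s in stored:
        return stored[s]
    return random.choice(closed(s)) if closed(s) else crf(s)

def priority_classifier(s, stored, closed, open_rx, crf):
    # closed-class text match takes priority; everything else goes to the CRF
    if closed(s):
        return stored.get(s) or random.choice(closed(s))
    return crf(s)
```

Only the fallback after the lookups changes between runs, which is why the Stored and Closed rows of the tables barely move while the Open/Mallet rows shift.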
Results: Run priority-classifier
- If S_TEXT text-matches >= 1 closed-class regex:
  - if it is in the "S_TEXT-solution" list, use that
  - else pick a random text-matching closed-class regex
- else use the result of the CRF classifier
- Shows that the baseline for closed-class items is very high
- Prediction for future work: variation in the Mallet score, not so much in the closed-class baseline

Origin     # tokens  Tok   POS
(All)      25305     99.2  93.5
Stored     7910      99.7  96.7
Open       6         50.0  0.0
Closed     23        82.6  65.2
Mallet     17366     99.0  92.2
Comparison with Other Work
- MADA 3.0 (a MADA 3.1 comparison has not been done)
  - published results are not a good source of comparison: different data sets, and gold tokenization is assumed for the POS score
  - MADA produces different and additional data
  - comparison: train and test my system on the same data as MADA 3.0, treating the MADA output as SAMA output
- SAMT
  - can't make sense of the published results in comparison to MADA, or of how it does on tokenization for parsing
- MADA & SAMT: a "pure" test is impossible
- AMIRA (Diab et al.): comparison impossible
Comparison with MADA 3.0
- Oh yeah. But it may not look so good against MADA 3.1, if I can even train on the same data
- Need to do this with SAMT

Origin     MADA                  Run priority-stored    Run priority-classifier
           # tokens  Tok   POS   # tokens  Tok   POS    # tokens  Tok   POS
(All)      25305     99.0  92.0  25305     99.3  93.5   25305     99.2  93.5
Stored     -         -     -     22386     99.8  95.2   7910      99.7  96.7
Open       -         -     -     6         50.0  0.0    6         50.0  0.0
Closed     -         -     -     23        82.6  65.2   23        82.6  65.2
Mallet     -         -     -     2890      95.6  80.2   17366     99.0  92.2
Integration with Parser: pause to talk about data
- Three forms of tree tokens, e.g., S_TEXT = "Ant$Arh"
  - vocalized (SAMA): {inoti$Ar/NOUN + hu/POSS_PRON_3MS
  - unvocalized (vocalized with the diacritics stripped out): {nt$Ar/NOUN + h/POSS_PRON_3MS
  - input-string (T_TEXT): Ant$Ar/NOUN + h/POSS_PRON_3MS
- Most parsing work uses the unvocalized form (prev. incoherent)
- MADA produces the vocalized form, so the unvocalized form can be derived
- Mine produces the input-string (T_TEXT)
  - doesn't have the normalizations that exist in the unvocalized form (future work: build them into the closed-class regexes. Duh!)
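Deriving the unvocalized form from the vocalized (SAMA) form amounts to stripping diacritic characters from the Buckwalter transliteration. A minimal sketch; the diacritic set below is my assumption of the relevant Buckwalter symbols (short vowels a/i/u, sukun o, tanween F/N/K, shadda ~, dagger alif `):

```python
# Buckwalter diacritic characters (assumed set, see lead-in).
BW_DIACRITICS = set("aiuoFNK~`")

def unvocalize(vocalized):
    """Strip Buckwalter diacritics, e.g. {inoti$Ar -> {nt$Ar."""
    return "".join(c for c in vocalized if c not in BW_DIACRITICS)

print(unvocalize("{inoti$Ar"))  # {nt$Ar
print(unvocalize("hu"))         # h
```

Note that the input-string (T_TEXT) form can still differ from the result, e.g. Ant$Ar vs. {nt$Ar, which is exactly the normalization gap the slide mentions.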
Integration with Parser
- Mode 1: parser chooses its own tags
- Mode 2: parser forced to use the given tags
- Run 1: used the ADJ.VN, NOUN.VN gold tags
- Run 2: without .VN (not produced by SAMA/MADA)
- Run 3: uses the output of MADA or our system
- Evaluated with Sparseval
Integration with Parser (cont.)
- Two trends reversed with tagger output (Run 3)
  - T_TEXT better than UNVOC, probably because of better tokenization
  - better when the parser selects its own tags, for both systems
- What's going on here?
Tokenization/Parser Interface
- The parser can recover POS tags, but tokenization is harder
- 99.3% tokenization: what does it get wrong?
  - biggest category (7% of errors): "kmA"
  - k/PREP+mA/{REL_PRON,SUB_CONJ} vs. kmA/CONJ
- Build a big model of tokenization and parsing? (Green & Manning, 2010, ...)
  - tokenization: MADA 97.67, Stanford joint system 96.26
  - parsing: gold 81.1, MADA 79.2, joint 76.0
  - but "MADA is language specific and relies on manually constructed dictionaries. Conversely the lattice parser requires no linguistic resources."
- The morphology/syntax interface is not identity.
Future Work
- Better partition of the tagging/syntax work
  - absurd to have different targets for NOUN, NOUN_NUM, NOUN_QUANT, ADJ_NUM, etc.
  - can probably do this with a simple NOUN/IV/PV classifier, with a simple maxent classifier for a second pass
  - or some sort of integration with an NP/idafa chunker
- Do something with the closed-class baseline?
- Evaluate what the parser is doing; can we improve on that?
- Special handling for special cases (kmA)
- Better proper noun handling
- Integrate morphological patterns into the classifier
- Specialized regexes for dialects?
How Much Prior Linguistic Knowledge?
- Two classes of words
  - closed-class: PREP, SUB_CONJ, REL_PRON, ...
    - regular expressions for all closed-class input words
  - open-class: NOUN, ADJ, IV, PV, ...
    - simple generic templates
- Classifier used only for open-class words
  - predicts only the most likely stem/POS; the stem name and input string identify the regular expression
- Closed-class words remain at "baseline"
Future Work
- Use the proper noun list
- True morphological patterns
- Two levels of classifiers?
- Robustness: closed class doesn't vary, open class does
- Mix and match with MADA and SAMT