Instructions for COLING/ACL 2006 Proceedings (yoavg/fs/chunk/colacl06sub3.doc)


NEED TO FIND A HEADLINE

BLIND REVIEW
NO First Author
NO Affiliation / Address line 1
NO Affiliation / Address line 2
NO Affiliation / Address line 3
NO e-mail@domain

BLIND REVIEW
NO Second Author
NO Affiliation / Address line 1
NO Affiliation / Address line 2
NO Affiliation / Address line 3
NO e-mail@domain

Abstract

We deal with the task of Noun Phrase chunking of Hebrew texts. We show that the traditional definition of base-NPs as non-recursive noun phrases does not work well for Hebrew, and propose the notion of SimpleNPs instead. We discuss some syntactic properties of Hebrew related to noun phrases, and note that Hebrew SimpleNP chunking is a harder task than English base-NP chunking. We briefly describe some preliminary attempts with methods known to work well for English; these methods give low results (F-measure from 76 to 86) on Hebrew. We then discuss our successful attempt using SVMs. We also suggest the use of morphological features beyond the word and POS tag for augmenting the SVM-based chunking results, and show that they improve average precision by ~0.5%, recall by ~1%, and F-measure by ~0.75, resulting in a system with average performance of 92.99% precision, 93.41% recall and 93.20 F-measure.

1 Introduction

Modern Hebrew, the official language of Israel, is an agglutinative Semitic language with rich morphology. Like most other non-European languages, it lacks many NLP resources and tools.

NP chunking is the task of labelling noun phrases in natural language text. The input to this task is free text with part-of-speech tags (which indicate nouns, adjectives, etc.). The output is the same text with brackets around base noun phrases. A base noun phrase is an NP which does not contain another NP (it is not recursive). The definition of base-NPs has to be adapted to the case of Hebrew (and probably other Semitic languages as well) to handle its syntactic features correctly. NP chunking is the basis for many other NLP tasks such as shallow parsing, argument structure identification, question answering, machine translation and information extraction.

2 Previous work

Text chunking (and NP chunking in particular), first proposed by Abney (1991), is a well studied problem for English. The CoNLL-2000 shared task (Tjong Kim Sang and Buchholz, 2000) was general chunking. The best result achieved on the shared task data was by Zhang et al. (2002), who achieved NP chunking results of 94.39% precision, 94.37% recall and 94.38% F-measure using a generalized Winnow algorithm, enhancing the feature set with the output of a dependency parser. Kudo and Matsumoto (2000) used an SVM-based algorithm, and achieved NP chunking results of 93.72% precision, 94.02% recall and 93.87% F-measure on the same shared task data, using only the words and their POS tags. Similar results were accomplished using Conditional Random Fields (Sha and Pereira, 2003).

The NP chunks in the shared task data are actually base-NP chunks, which are non-recursive NPs, a definition first proposed by Ramshaw and Marcus (1995). This definition yields good NP chunks for English, but results in very short and uninformative chunks for Hebrew (and probably other Semitic languages).

Recently, Diab et al. (2004) used an SVM-based approach for Arabic text chunking. Their chunk data was derived from the Arabic TreeBank (REF?) using the same program that extracted the chunks for the shared task. They used the same features as Kudo and Matsumoto (2000), and achieved overall chunking performance of 92.06% precision, 92.09% recall and 92.08 F-measure (the results for the NP chunks alone were not reported).

3 Hebrew SimpleNP Chunks

The standard definition of English base-NPs is any noun phrase that does not contain another noun phrase, with possessives treated as a special case, viewing the possessive marker as the first word of a new base-NP (Ramshaw and Marcus, 1995). Applying this definition to the Hebrew TreeBank (REF?) results in very primitive chunks (Table 1 shows the average number of words in a base-NP for English and Hebrew). These chunks aren't useful for any practical purpose, so we propose a new definition for Hebrew NP chunks which allows for some nestedness. We call our chunks SimpleNP chunks.

                 English    Hebrew     Hebrew
                 BaseNPs    BaseNPs    SimpleNPs
Avg # of words   2.17       1.39       2.49
% length 1       30.95      63.32      32.83
% length 2       39.35      35.48      32.12
% length 3       18.68      0.83       14.78
% length 4       6.65       0.16       9.47
% length 5       2.70       0.16       4.56
% length > 5     1.67       0.05       6.22

Table 1 - Lengths of Hebrew and English NPs

3.1 What makes Hebrew different

The reason the traditional base-NP definition fails for the Hebrew TreeBank is the unique syntactic features of Hebrew such as Smixut and ___, and this is compounded by the decisions of the TreeBank annotators, which make the trees much less flat than in the Penn TreeBank. For example, take the English base noun phrase [The homeless people]. The Hebrew equivalent is "האנשים מחוסרי הבית" (1), which by the non-recursive NP definition will be bracketed as [ה אנשים] מחוסרי [ה בית] or, loosely translating back to English: [the home]less [people].

(1) האנשים מחוסרי הבית
Ha-ana$-im mehusr-ei ha-bait
The-people-[plur] lacking-[plur-construct] the-house
'The homeless people'

We will now present some syntactic properties of Hebrew which are relevant to NP chunking. Following that, we'll present our definition of SimpleNP chunks.

Smixut / Double Smixut: The Hebrew genitive case is formed by placing two nouns next to each other. This is called a "noun construct", or Smixut. The second noun can be treated as an adjective modifying the first noun. The first noun in such cases must be in a special construct form. The definite article is usually placed on the second word:

(2) בית ספר
Beit sefer
House-[construct] book
'house of book' (= school)

(3) בית הספר
Beit ha-sefer
House-[construct] the-book
'The house of the book' (= the school)

The construct form can also be chained:

(4) משרד ראש הממשלה
Misrad ro$ ha-mem$ala
Office-[const poss] head-[const] the-government
'The prime minister's office'

Possessive: של / suffix / smixut: As seen in (4), the smixut form can also be used to indicate possession. Another way of indicating possession is by using the possessive pronoun של ("$el" – 'of', 'belonging to') or an inflection of it (5). Yet another way is by using a possessive suffix on the noun (6). The various forms can sometimes be mixed together, as in (7):

(5) הבית שלי
Ha-bait $el-i
The-house of-[poss 1st person]
'My house'

(6) ביתי
beit-i
house-of-me
'My house'

(7) משרדו של ראש הממשלה
Misrad-o $el ro$ ha-mem$ala
Office-[poss] of head-[const] the-government
'The prime minister's office'


Adjective: Hebrew adjectives come after the noun, and agree with it in number and gender.

(8) תפוח ירוק
Tapu'ah yarok
Apple green
'A green apple'

Predicate Structure

Word order and the preposition "et": Hebrew sentences can be in either SVO1 or VSO form. In order to keep the object separate from the subject, definite direct objects are marked with the special preposition "et", which has no analog in English.

Possible null equative: The equative form in Hebrew can be null. (9) presents the non-null equative, (10) presents the null equative, while (11) and (12) are not equative at all, but look very similar to the null-equative form:

(9) הבית הוא גדול
Ha-bait hu gadol
The-house is big
'The house is big'

(10) הבית גדול
Ha-bait gadol
The-house big
'The house is big'

(11) בית גדול
Bait gadol
House big
'A big house'

(12) הבית הגדול
Ha-bait ha-gadol
The-house the-big
'The big house'

Morphology / agglutinative

Derivative? The English morphemes re-, ex-, un-, -less, -like, -able, etc. appear in Hebrew as separate lexical units: בר, דמוי, חסר, לא/אי/בלתי, לשעבר, מחדש, etc. For example, un- and -less correspond to חסרי, בלתי, לא.

3.2 Defining SimpleNPs

Our definition of SimpleNPs is pragmatic and ____. We wanted to preserve meaningful units of text that …, and so our definition starts with the most complex NP, and breaks it into smaller parts by stating what's not in a SimpleNP. This can be summarised in the following table:

1 SVO – Subject-Verb-Object

Outside SimpleNP        Exceptions
Prepositional Phrases   %-related PPs are allowed:
                        5% מהמכירות
                        '5% of the sales'
                        The possessive של is not considered a PP
Relative Clauses
Verb Phrases
Apposition2
Some conjunctions (the handling of conjunctions followed that of the TreeBank as to whether to mark them as part of the NP or not)3

Table 2 - Definition of SimpleNP chunks

3.3 Properties of SimpleNP chunks

Some examples of SimpleNP chunks resulting from this definition:

[תופעה זו] התבררה אתמול ב[וועדת העבודה והרווחה שלהכנסת] שדנה ב[נושא העסקת עובדים זרים]

I believe your translation abilities will be better than mine… [המעסיקים] אינם מצפים שיצליחו למשוך [מספר ניכר של עובדים ישראליים] ל[קטיף] בגלל [השכר הנמוך] המשולם4

ל[עבודה זו].

This definition can also yield some rather long and complex chunks, such as:

[כיבושיהם של גינגיס חאן והצבא המונגולי הטטרי שלו]

Some more discussion of properties: I find these sentences to highlight some of the decisions of where to cut; feel free to elaborate on it a little, or just dump them.

ב [ ה שבועיים ה אחרונים ] כבר ראיין [ שר ה [ כ עשרה קציני משטרה מנוסים ו צעיריםמשטרה ]

יחסית ] ב [ דרגת ניצב משנה ] , מ [ קרב שדרת ה ביניים של פיקוד ה משטרה ] , כ [ ה מועמדים ] לאייש

.[ את יחידת ה מטה ה מקצועית של הוא ]

2 Apposition structure is not annotated in the TreeBank. As a heuristic, we considered every comma inside a non-conjunctive NP which is not followed by an adjective or an adjective phrase to be marking the beginning of an apposition.
3 A special case are AP and possessor conjunctions, which are considered to be inside the SimpleNP.
4 For readers who are familiar with Hebrew and feel that המשולם is an adjective and should be inside the NP, we'll note that this is not the case – המשולם here is actually a verb in the Beinoni form.


ה קהל [ קרב ] מ מעטים [ רק ש להניח יש ] , כהנא [ מצביעי ] היו מלווים - ה פורעים

] פוטנציאליים ה [ מצביעיו או

ל [ דברי פקידי ממשלה מקומיים ] , [ מפעלים ] על [ שטח[ סך של ] ש עברה [ ה שנה ב ] הרוויחו . 7טטרי 3

. 2מיליארד רובל ( מיליארד דולר ) ] , ש [ את כולם2 כמעט ] לקחה [ מוסקווה ]

Note that some NPs are split, for example by the preposition 'on': [factories] on [Tataric grounds], and by the relative clause: [a sum of 3.7 billion rubles] that [almost all of which] were taken by [Moscow].

3.4 Hebrew SimpleNPs are harder than English baseNPs

The SimpleNPs derived from our definition are highly coherent units, but are also much more complex than the non-recursive English base-NPs. As can be seen in Table 1, our definition of SimpleNPs yields chunks which are on average considerably longer than the English chunks, with about 20% of the chunks having 4 or more words (as opposed to about 10% in English) and a significant portion (6.22%) of chunks having 6 or more words (1.67% in English).

Moreover, the baseline used at the CoNLL shared task, selecting the chunk tag most frequently associated with the current part-of-speech tag (Tjong Kim Sang and Buchholz, 2000), gives far inferior results for Hebrew SimpleNPs (see Table 3).
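The CoNLL baseline described above can be sketched in a few lines. This is a minimal illustration, not the shared task's actual implementation; the mini-corpus at the bottom is invented for demonstration and is not the Hebrew TreeBank.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Learn, for each POS tag, the chunk (IOB) tag it most
    frequently received in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, pos, iob in sent:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def apply_baseline(model, pos_sent, default="O"):
    """Tag each token with the chunk tag its POS most often received;
    unseen POS tags fall back to O."""
    return [model.get(pos, default) for pos in pos_sent]

# Toy illustration (invented mini-corpus):
train = [[("the", "DET", "B-NP"), ("dog", "NOUN", "I-NP"),
          ("ran", "VERB", "O")]]
model = train_baseline(train)
print(apply_baseline(model, ["DET", "NOUN", "VERB"]))  # ['B-NP', 'I-NP', 'O']
```

Because the prediction depends only on the POS tag, a language whose chunks are longer and more varied per POS (as Hebrew SimpleNPs are) loses much more from this approximation.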

4 Some baseline approaches

We experimented with several known methods for English NP chunking, with poor results. We won't describe them in detail, but briefly describe our experimental settings, and provide the best scores we got from each method, in comparison to the reported scores for English.

All tests were done on the corpus derived from the Hebrew TreeBank, which is described in section …. The last 500 sentences were used as the test set, and all the other sentences were used for training. The results were evaluated using the CoNLL shared task evaluation tools5. The approaches taken were Error Driven Pruning (EDP) (Cardie and Pierce, 1998) and Transformation Based Learning of IOB tagging (TBL) (Ramshaw and Marcus, 1995).

The Error Driven Pruning method does not take into account any lexical information and uses only the POS tags. For the Transformation Based method we used both the POS tag and the word itself, with the same templates described in …. Trying to use Transformation Based Learning with more features than just the POS and the word resulted in poorer performance. Our best results for these methods, as well as the CoNLL baseline (BASE), are presented in the following table:

Method   English BaseNPs    Hebrew SimpleNPs
         Prec    Rec        Prec    Rec     F
BASE     72.58   82.14      64.7    75.4    69.78
EDP      92.7    93.7       74.6    78.1    76.3
TBL      91.3    91.8       84.7    87.7    86.2

Table 3 - Some baseline results for SimpleNP chunking in Hebrew

5 SVM

We chose to adopt a tagging perspective for the SimpleNP chunking task, in which each word is tagged as either B, I or O depending on whether it is at the Beginning, Inside, or Outside of the given chunk, an approach first taken by Ramshaw and Marcus (1995), which is now the de-facto standard for this task. This tagging is actually a classification: each token is predicted as being either I, O or B, given some features from a predefined linguistic context (such as the words surrounding the given word, and their POS tags).
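The correspondence between bracketed chunks and B/I/O tags can be made concrete with a small sketch (our own illustration; chunk spans are given as half-open token index ranges):

```python
def chunks_to_iob(tokens, chunk_spans):
    """Convert bracketed chunk spans [(start, end), ...] (end
    exclusive, over token indices) into one B/I/O tag per token."""
    tags = ["O"] * len(tokens)
    for start, end in chunk_spans:
        tags[start] = "B"           # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"           # every other token inside it
    return tags

# "[the big house] is [a home]" -> chunks over tokens 0-2 and 4-5
print(chunks_to_iob(["the", "big", "house", "is", "a", "home"],
                    [(0, 3), (4, 6)]))  # ['B', 'I', 'I', 'O', 'B', 'I']
```

Once the corpus is in this form, chunking reduces to per-token classification, which is exactly the setting SVMs handle.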

One model that allows for this prediction is the SVM (support vector machine) (Vapnik, 1995). The SVM is a supervised machine learning algorithm which can handle many overlapping features, and which generalizes very well. It is a binary classifier, but can be extended to multiclass classification (Allwein et al., 2000; Kudo and Matsumoto, 2000).

5 http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt


SVMs have been successfully applied to many NLP tasks (Joachims, 1998) (THERE ARE PLENTY OF OTHER REFS AT http://chasen.org/~taku/software/yamcha/ (bottom of page) DO YOU HAVE ANY PREFS AS TO WHICH OF THESE TO USE?), and more specifically to base phrase chunking (Kudo and Matsumoto, 2000, 2003). They have also been successfully used outside of English: Diab et al. recently reported good results for Arabic (Diab et al., 2004) (MENTION ALSO HABASH AND RAMBOW'S USE OF SVM FOR POS TAGGING? OREN'S WORK?), and SVMs were successfully used on Japanese/Chinese as well. THE REF FOR THIS IS FOUND AT http://chasen.org/~taku/software/yamcha/, CALLED "Japanese Named Entity Extraction using Support Vector Machine", AND IS SAID TO BE IN JAPANESE. SHOULD WE KEEP THIS IN?

5.1 Traditional SVM application to chunking

The traditional setting of SVM for chunking considers the context of the token to be classified to be the window of two tokens before and after it, and the features to be the POS tags and lexical items (word forms) of all the tokens in the context. Some settings (Kudo and Matsumoto, 2000) also include the IOB tags of the two previously tagged tokens as features (see Fig. 1). This setting (including the two last IOB tags) performs nicely for the case of Hebrew SimpleNP chunking as well.
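The feature window just described can be sketched as follows. This is our own illustration of the feature layout, not YamCha's actual input format; the feature names and the dict representation are invented for clarity.

```python
def window_features(words, pos_tags, i, prev_iob, width=2):
    """Features for token i: the words and POS tags in a -2/+2
    window around it, plus the IOB tags already assigned to the
    two previous tokens (forward-moving tagging)."""
    feats = {}
    for d in range(-width, width + 1):
        j = i + d
        if 0 <= j < len(words):
            feats[f"w[{d}]"] = words[j]
            feats[f"p[{d}]"] = pos_tags[j]
        else:  # positions past the sentence boundary get a pad value
            feats[f"w[{d}]"] = feats[f"p[{d}]"] = "<pad>"
    feats["iob[-2]"] = prev_iob[-2] if len(prev_iob) >= 2 else "<pad>"
    feats["iob[-1]"] = prev_iob[-1] if prev_iob else "<pad>"
    return feats

f = window_features(["a", "green", "apple"], ["DET", "ADJ", "NOUN"],
                    1, ["B-NP"])
print(f["w[0]"], f["p[1]"], f["iob[-1]"], f["w[-2]"])
```

Each such feature dict is what the multiclass SVM classifies into B, I or O, one token at a time from left to right.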

WORD     POS    CHUNK
ה        NA     B-NP
מאבק     NOUN   I-NP
בין      PREP   O
רוסיה    NAME   B-NP
ל        PREP   O
ה        NA     B-NP
טטרים    NOUN   I-NP

Figure 1 - Basic SVM setting for Hebrew

5.2 Augmenting the SVM feature set with some morphological features

Hebrew is a morphologically rich language, and a recent POS tagger and morphological analyzer for Hebrew (---Undisclosed Ref due to Blind Submit---) addresses this issue and provides for each word not only the POS, but also other morphological features such as Gender, Number, Construct, Tense, Person, and the Number, Gender and Person of the word's suffix, if one exists.

Since SVMs can handle large feature sets very well, we can utilize some of these extra morphological features provided by the POS tagger to improve our chunking. In particular, we found the combination of the Number and Construct features to be effective in improving the chunking results, yielding an average increase of ~0.5% in precision, ~1% in recall, and ~0.75 in F-measure, resulting in an average F-measure of 93.2%.
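Augmenting the feature set amounts to adding extra columns per token. A minimal sketch, assuming each token arrives as a dict of tagger analyses (the dict layout and key names are our invention, not the tagger's actual output format):

```python
def token_features(tok, morph=("number", "construct")):
    """Word and POS features, optionally augmented with a chosen
    subset of morphological features. The default picks Number and
    Construct, the combination reported to work best; note that
    adding ALL morphological features hurt performance."""
    feats = {"word": tok["word"], "pos": tok["pos"]}
    for m in morph:
        feats[m] = tok.get(m, "NA")  # NA marks an absent feature
    return feats

tok = {"word": "בית", "pos": "NOUN", "number": "S", "construct": "Y",
       "gender": "M"}
print(token_features(tok))  # gender is deliberately left out
```

Making the morphological subset a parameter is what allows the WP / WPNC / WPG / ALL comparisons reported in section 6.5.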

6 Experiments

6.1 The corpus

The Hebrew TreeBank6 consists of 4995 hand-annotated sentences from the 'Ha'aretz' newspaper. As in any project of this size, it is not free of annotation inconsistencies and human errors. Aside from the syntactic structure, every word is POS-annotated, and also includes some morphological features such as number, gender, construct and so forth. The words in the TreeBank (as in our resulting SimpleNP corpus) are given after segmentation: של אנחנו / ב ה בית. The morphological disambiguator we use can also provide such segmentation. We derived the SimpleNP structures from the TreeBank using the definition given in section 3.2. Then, we converted the original Hebrew TreeBank tagset to the tagset of our POS tagger. For each token we specify its word form, its POS, its morphological features, and its correct IOB tag. The result is the Hebrew SimpleNP chunks corpus, which can be found at ----no link in this version----. The corpus consists of 4995 sentences, 27226 chunks and 120396 tokens. A sentence example is:

ב PREPOSITION NA NA N NA N NA N NA NA O
ה DEF_ART NA NA N NA N NA N NA NA B-NP
עבר NOUN M S N NA N NA N NA NA I-NP
היה AUXVERB M S N 3 N PAST N NA NA O
קל ADJECTIVE M S N NA N NA N NA NA O
יותר ADVERB NA NA N NA N NA N NA NA O
לקרוא VERB NA NA N NA Y TOINF N NA NA O
את ET_PREP NA NA N NA N NA N NA NA B-NP
ה DEF_ART NA NA N NA N NA N NA NA I-NP

6http://mila.cs.technion.ac.il/website/english/re-sources/corpora/treebank/index.html


מפה NOUN F S N NA N NA N NA NA I-NP
. PUNCTUATION NA NA N NA N NA N NA NA O
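One corpus line of the kind shown above can be parsed into its parts as follows. This is a sketch under the assumption that each line holds the word form, the POS, the nine morphological columns in the order listed in section 6.2, and finally the gold IOB tag; the Python key names are our own.

```python
# Assumed column order, following the feature list in section 6.2.
MORPH_COLS = ["gender", "number", "construct", "person",
              "to_infinitive", "tense", "suffix", "suffix_num",
              "suffix_gen"]

def parse_token_line(line):
    """Split one corpus line into (word, POS, morph dict, IOB tag)."""
    fields = line.split()
    word, pos, iob = fields[0], fields[1], fields[-1]
    morph = dict(zip(MORPH_COLS, fields[2:-1]))
    return word, pos, morph, iob

w, p, m, t = parse_token_line("עבר NOUN M S N NA N NA N NA NA I-NP")
print(w, p, m["number"], m["construct"], t)
```

Reading the corpus this way yields exactly the (word, POS, morphology, IOB) tuples that the feature extraction for the SVM consumes.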

6.2 POS tags and morphological features

The POS tagset we use consists of 22 tags, which are adjusted to Hebrew:

ADJECTIVE ADVERB ET_PREP AUXVERB CONJUNCTION DEF_ART DETERMINER EXISTENTIAL INTERJECTION INTEROGATIVE MODAL NEGATION PARTICLE NOUN NUMBER PRONOUN PREFIX PREPOSITION UNKNOWN PROPERNAME PUNCTUATION VERB

For each token, we also supply the following morphological features (in that order):

Feature         Possible Values
Gender          (M)ale, (F)emale, (B)oth (unmarked case), (NA)
Number          (S)ingle, (P)lural, (D)ual, can be (ALL), (NA)
Construct       (Y)es, (N)o
Person          (1)st, (2)nd, (3)rd, (123)all, (NA)
To-Infinitive   (Y)es, (N)o
Tense           PAST, PRESENT, FUTURE, BEINONI, IMPERATIVE, TOINF, BAREINF
(has) Suffix    (Y)es, (N)o
Suffix-Num      (S)ingle, (P)lural, (D)ual, (DP) dual plural, can be (ALL), (NA)
Suffix-Gen      (M)ale, (F)emale, (B)oth, (NA)

6.3 Setup and evaluation

For all the SVM chunking experiments, we used the YamCha7 toolkit (Kudo and Matsumoto, 2003). We use forward moving tagging, using a standard SVM with a polynomial kernel of degree 2, and C=1. For the multiclass classification we use pairwise voting.

For all the reported experiments we chose the context to be a -2/+2 token window, centered at the current token.

We use the standard metrics of accuracy (% of correctly tagged tokens), precision, recall and F-measure, with the only exception of removing all punctuation tokens from the data prior to evaluation, as the TreeBank is highly inconsistent regarding the bracketing of punctuation, and we don't consider the exclusion or inclusion of punctuation from our chunks to be errors (i.e., "[a book ,] [an apple]", "[a book] , [an apple]" and "[a book] [, an apple]" are all equivalent chunkings in our view).

7 http://chasen.org/~taku/software/yamcha/
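The punctuation normalization step can be sketched as a simple filter (our own illustration; the PUNCTUATION POS tag is taken from the tagset in section 6.2):

```python
PUNCT_POS = {"PUNCTUATION"}

def strip_punctuation(tokens, iob_tags, pos_tags):
    """Drop punctuation tokens (and their chunk tags) before
    scoring, so that bracketing differences around punctuation
    are not counted as chunking errors."""
    kept = [(w, t) for (w, t, p) in zip(tokens, iob_tags, pos_tags)
            if p not in PUNCT_POS]
    words = [w for w, _ in kept]
    tags = [t for _, t in kept]
    return words, tags

# "[a book ,] [an apple]" and "[a book] , [an apple]" both reduce
# to "[a book] [an apple]" after the comma is removed:
words, tags = strip_punctuation(["a", "book", ",", "an", "apple"],
                                ["B", "I", "I", "B", "I"],
                                ["DET", "NOUN", "PUNCTUATION",
                                 "DET", "NOUN"])
print(words, tags)
```

After this filter, the three bracketings listed above become identical tag sequences, which is precisely the intended equivalence.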

All our development work was done with the first 500 sentences allocated for testing, and the rest for training. For evaluation, we used a 10-fold cross-validation scheme, each time with a different block of 500 consecutive sentences serving for testing and the rest for training.
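The fold construction over consecutive 500-sentence blocks can be sketched as follows; since the corpus has 4995 sentences, we assume the last block is simply slightly short (495 sentences), which is our reading rather than something the text states.

```python
def cv_folds(sentences, block=500):
    """Yield (train, test) pairs where each test set is a different
    consecutive block of sentences and the rest is training data."""
    for start in range(0, len(sentences), block):
        test = sentences[start:start + block]
        train = sentences[:start] + sentences[start + block:]
        yield train, test

sents = list(range(4995))  # corpus size reported in section 6.1
folds = list(cv_folds(sents))
print(len(folds), len(folds[0][1]), len(folds[0][0]))  # 10 500 4495
```

Using consecutive blocks (rather than a random shuffle) keeps each test set made of whole, contiguous newspaper text, which matches the development split described above.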

6.4 Features used

We ran several SVM experiments, each with the settings described in section 6.3, but with a different feature set. In all of the experiments the two previously tagged IOB tags were included in the feature set. In the first experiment (denoted WP) we considered the word and POS tags of the context tokens to be part of the feature set. In the other experiments, we used different subsets of the morphological features of the tokens to enhance the feature set. We found that good results were achieved by using the Number and Construct features together with the word and POS tags (we'll denote this WPNC). Particularly bad results were achieved by using all the morphological features together.

6.5 Results

We'll discuss the results of the WP and WPNC experiments in detail, and also provide the results for WPG (using the Gender feature), ALL (using all available morphological features), and P (using only the POS tag).

Features   Acc     Prec    Rec     F
P          91.77   77.03   78.79   77.88
WP         97.49   92.54   92.35   92.44
WPNC       97.61   92.99   93.41   93.20
WPG        97.41   92.41   92.22   92.32
ALL        96.68   90.21   90.60   90.40

Table 4 - SVM results for Hebrew

Features   Acc     Prec    Rec     F
WPNC       0.128   0.456   1.058   0.758

Table 5 - Improvement over WP

As can be seen in Table 4, the lexical information is very important: augmenting the POS tags with lexical information boosted the F-measure from 77.88% to 92.44% (talk about the success of non-lexicalized EDP for English?). The addition of the extra morphological features of Construct and Number yields another increase in performance, resulting in a final F-measure of 93.2%. Note that the effect of these morphological features on the overall accuracy (the number of tokens BIO-tagged correctly) is minimal (Table 5), yet the effect on the precision and recall is much more pronounced. It is also interesting to note that the Gender feature hurts performance, even though Hebrew has agreement on both Number and Gender. This might indicate that gender agreement in Hebrew is not as strong, or that the Number agreement is more stable.

6.6 Error Analysis and the Effect of mor-phological features

An error analysis was done on the WPNC results for the whole corpus. At the individual token level, Nouns and Conjunctions caused the most confusion, followed by Adverbs and Adjectives. Table 6 presents the confusion matrix for all POS tags that had a substantial amount of errors. IO means that the correct chunk tag was I, but the system classified it as O.

Table 6 - WPNC Confusion Matrix

(Mention that Number errors can be dealt with using a specialized set of regexps?)

By examining the errors at the chunk level, we identified 7 common classes of errors:

Conjunction related Errors: bracketing “[a] and [b]” instead of “[a and b]” and vice versa

Split errors: bracketing [a][b] instead of [a b]

Merge errors: bracketing [a b] instead of [a][b]

“Short” errors: bracketing “a [b]” or “[a] b” instead of [a b]

“Long” errors: bracketing “[a b]” in-stead of “[a] b” or “a [b]”

WholeChunk errors: either missing a whole chunk, or bracketing something which doesn’t overlap with a chunk at all

“Missing Token” errors: this is a generalized form of conjunction errors: either “[a] T [b]” instead of “[a T b]” or vice versa, where T is a single token. The most frequent of such tokens (other than the conjuncts) was the possessive של ("$el").
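Classes like split and merge can be detected mechanically by comparing gold and predicted chunk spans. The sketch below is our own illustration of one such check, not the classifier actually used for the analysis:

```python
def iob_to_spans(tags):
    """Convert a B/I/O sequence into (start, end) chunk spans,
    with end exclusive. An I directly after O is leniently read
    as opening a chunk."""
    spans, start = [], None
    for i, t in enumerate(tags):
        if t == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif t == "I":
            if start is None:
                start = i
        else:  # "O"
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans

def is_merge_error(gold_spans, pred_spans, span):
    """True if a predicted span covers two or more whole gold
    chunks, i.e. [a b] was produced instead of [a][b]."""
    inside = [g for g in gold_spans
              if span[0] <= g[0] and g[1] <= span[1]]
    return len(inside) >= 2

gold = iob_to_spans(["B", "I", "B", "I", "O"])  # [a b][c d] e
pred = iob_to_spans(["B", "I", "I", "I", "O"])  # [a b c d] e
print(gold, pred, is_merge_error(gold, pred, pred[0]))
```

A split error is the mirror case: several predicted spans whose union is a single gold span.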

The data in Table 6 suggests that the Adverb- and Adjective-related errors are mostly of the “short” or “long” types, while the Noun-related errors (including proper names and pronouns) are of the “split” or “merge” types.

Of the error types, the most frequent was the conjunction-related errors, closely followed by split and merge errors. After those came extra Adverbs or Adjectives at the end of the chunk, and missing Adverbs before or after the chunk.

Conjunctions are a major source of errors for English chunking as well (Ramshaw and Marcus, 1995; Cardie and Pierce, 1998), and we plan to address them in future work. The split and merge errors are related to argument structure, which can be more complicated in Hebrew than in English because of the possibility of the null equative. The too-long and too-short errors we encountered were mostly attachment related.

Most of the errors stem from linguistic phenomena that cannot be inferred from the localized context that the SVM (and most other classification-based approaches) uses.

What about errors which were caused by errors in the TreeBank, or by the SimpleNP derivation scheme?

Next, we examine the types of errors that the addition of the Number and Construct features fixed. Table 7 summarises this information:

ERROR            WP    WPNC   # Fixed   % Fixed
CONJ             256   251    5         1.95
SPLIT            198   225    -27       -13.64
MERGE            366   222    144       39.34
EXT ADJ AFTER    120   117    3         2.50
EXTRA CHUNK      89    88     1         1.12
EXT ADV AFTER    77    81     -4        -5.19
MIS ADV A        67    65     2         2.99
MISSING CHUNK    50    54     -4        -8.00
MIS ADV B        53    48     5         9.43
EXTRA של TOK     47    47     0         0.00

Table 7 - Effect of Number and Construct information on most frequent error classes

The error classes most affected by the Number and Construct information were the split and the merge errors: WPNC has a tendency to split chunks, which resulted in some unjustified splits, but it compensates by fixing over a third of the merge mistakes. This result makes sense, as construct and local agreement information can aid in the identification of predicate boundaries.

7 Discussion and Future work

We've noted that due to syntactic features such as Smixut, the traditional definition of base-NP chunks does not translate well to Hebrew (and probably to other Semitic languages with the same syntactic features), and defined the notion of SimpleNP chunks instead. We proposed a method of identifying Hebrew SimpleNPs by supervised learning using Support Vector Machines, providing further evidence for the suitability of SVMs to chunk identification. We've also shown that using additional morphological features can enhance chunking accuracy. However, the set of morphological features used should be chosen with care: some features actually hurt performance, and too many features are worse than none. As in the case of English, a large proportion of the errors were caused by conjunction structures. This problem clearly requires more than local knowledge, and is probably out of the reach of SVMs; we plan to address this issue in future work.

Another direction of intended future work is to try to use morphological features for improving chunking results in other morphologically rich languages, notably Arabic.

For Hebrew, we intend to do full chunking, after properly defining the other phrase types. Also, we intend to look at Hebrew chunking using Conditional Random Fields, as they may be able to handle less local features.

References

I'M NOT SURE ABOUT THE LOCATION OF THE YEAR, AND ON THE FORMAT IN GENERAL. PLEASE VALIDATE ME.

Claire Cardie and David Pierce. 1998. Error-Driven Pruning of Treebank Grammars for Base Noun Phrase Identification. In Proceedings of the 17th International Conference on Computational Linguistics, pages 218-224. LOCATION?

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 Shared Task: Chunking. In Proceedings of CoNLL-2000 and LLL-2000, pages 127-132. Lisbon, Portugal.

Lance A. Ramshaw and Mitchell P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora. Cambridge, MA, USA.

Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In ---------

Erin L. Allwein, Robert E. Schapire, and Yoram Singer. 2000. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141.

Thorsten Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of ECML-98, 10th European Conference on Machine Learning.

Taku Kudo and Yuji Matsumoto. 2000. Use of support vector learning for chunk identification. In Proceedings of the 4th Conference on Very Large Corpora, pages 142-144.

Taku Kudo and Yuji Matsumoto. 2003. Fast Methods for Kernel-Based Text Analysis. In Proceedings of the 41st Annual Meeting of the ACL, pages ???-???.

Vladimir Vapnik. 1995. The Nature of Statistical Learning Theory. Springer Verlag, New York, USA.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. Technical Report CIS TR MS-CIS-02-35, University of Pennsylvania.

Steven P. Abney. 1991. Parsing by Chunks. In Robert C. Berwick, Steven P. Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers.

Tong Zhang, Fred Damerau and David Johnson. 2002. Text Chunking based on a Generalization of Winnow. Journal of Machine Learning Research, 2:615-637.

Meni’s Ref with 3morelines


POS         All   OI    OB   IO    IB    BO   BI
NOUN        602   7     27   0     342   4    222
CONJ        405   146   5    232   1     21   0
ADVERB      306   87    32   104   10    68   5
DEF_ART     247   18    10   9     104   5    101
ADJECTIVE   215   140   0    58    0     16   1
PROPNAME    168   4     8    0     82    0    74
NUMBER      158   14    6    4     64    8    62
PREP        152   81    0    62    0     9    0
PRONOUN     99    2     27   3     24    12   31