Chinese Verb Tense?
Using English Parallel Data to Map Tense onto Chinese and Subsequent Tense Classification.
Master’s Thesis
Presented to
The Faculty of the Graduate School of Arts and Sciences Brandeis University
Department of Computer Science Graduate Program in Computational Linguistics
Nianwen Xue, Advisor
In Partial Fulfillment of the Requirements for
Master’s Degree
by Elizabeth Baran
February 2013
Acknowledgements
I want to thank my advisor, Nianwen Xue, for all of the opportunities and support he has
provided over the past couple years. Thank you for giving me an outlet to grow and further
develop my passion for languages. What I have learned has been invaluable.
I would also like to thank my family for their constant love and support throughout this
process.
ABSTRACT
Chinese Verb Tense?
Using English Parallel Data to Map Tense onto Chinese and Subsequent Tense Classification.
A thesis presented to the Department of Computer Science Graduate School of Arts and Sciences
Brandeis University Waltham, Massachusetts
By Elizabeth Baran
We explore time in Chinese by mapping tense information from a manually aligned English parallel corpus onto Chinese verbs. We construct a detailed mapping procedure to accurately convey tense in English through combinations of word tokens and parts of speech, and then transfer that information onto verbs in Chinese. We explore the resulting Chinese data set and discuss the pros and cons of this mapping technique. Using this Chinese data set, augmented with tense, we attempt to automatically predict the tense of each verb in Chinese using a Conditional Random Fields algorithm along with a suite of linguistic features. We include an algorithm for extracting time expressions and associating them with verbs, and integrate that as a feature into our tense prediction algorithm. We achieve a 34-percentage-point accuracy gain over our baseline, as well as a much deeper understanding of how tense can transfer between English and Chinese in a translation environment.
TABLE OF CONTENTS

Introduction
Related Work
Data
We framed this task as a simple IOB recognition task and trained a Conditional Random Fields
algorithm using the crfsuite package (Okazaki, 2007), which is a first-order Markov model
implementation. If a word began a time expression, it was given the label “B”. If a word was
inside of a time expression but was not the first word, it was given the label “I”. Any word
outside of a time expression was labeled “O”. We used the following features to train our
algorithm:
Features for Timex Extraction
1. WORD: The current word.
2. POS: The POS of the current word.
3. PREV_POS: The part-of-speech of the previous token.
4. NEXT_POS: The part-of-speech of the next token.
5. NORMALIZED: The character string of the word with all digits substituted with a D, so “2009年” becomes “DDDD年”.
6. TimeChar: True if any of the characters in Figure 11 are part of the word. A time character signals some sort of time or duration when used on its own or as part of another word. We compiled this list ourselves, using our own intuitions about the language; it is essentially a white list incorporated into the algorithm. The characters are limited to those that are unambiguously related to time, so we expect that this feature can only help the algorithm, even if the current data set may be too small to demonstrate its significance.
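To make the setup concrete, the sketch below shows how this per-token feature extraction and training could look with the python-crfsuite binding to crfsuite. The helper names and data layout are our own illustration, not the thesis's original code.

```python
# A minimal sketch of the IOB feature extraction described above, assuming
# each sentence is a list of (word, POS) pairs. Feature names mirror the
# list in the text; the code itself is illustrative, not the thesis's own.
import re
import pycrfsuite  # Python binding to the crfsuite package (Okazaki, 2007)

# White list of unambiguous time characters from Figure 11.
TIME_CHARS = set("今明昨时候纪钟天日月年早晚期")

def token_features(sent, i):
    word, pos = sent[i]
    return {
        "WORD": word,
        "POS": pos,
        "PREV_POS": sent[i - 1][1] if i > 0 else "BOS",
        "NEXT_POS": sent[i + 1][1] if i + 1 < len(sent) else "EOS",
        # Substitute every digit with "D", e.g. "2009年" -> "DDDD年".
        "NORMALIZED": re.sub(r"\d", "D", word),
        "TimeChar": any(ch in TIME_CHARS for ch in word),
    }

def train_timex_model(sentences, iob_labels, model_path="timex.crfsuite"):
    """sentences: list of [(word, POS), ...]; iob_labels: parallel B/I/O tag lists."""
    trainer = pycrfsuite.Trainer(verbose=False)
    for sent, tags in zip(sentences, iob_labels):
        trainer.append([token_features(sent, i) for i in range(len(sent))], tags)
    trainer.train(model_path)
```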
Figure 11: Description of TimeChar Characters
Character Translation Example Contexts
今 now 如今 “up until now”, 今年 “this year”
明 tomorrow 明天 “tomorrow”
昨 yesterday 昨晚 “last night”
时 time; at that time; while 做功课时 “while doing homework”
候 period 小的时候 “when [I] was little”
纪 century; period 世纪 “century”
钟 hour 两个钟头 “two hours”
天 day 五天后 “after 5 days”
日 day 10月10日 “October 10th”
月 month 下个月 “next month”
年 year 去年 “last year”
早 early 早上 “morning”
晚 late 昨晚 “last night”
期 period 星期日 “Sunday”
We achieved 0.95 precision, 0.85 recall, and 0.89 F1, macro-averaged across the three IOB
categories, which is a fair improvement over the 0.94 precision, 0.74 recall, and 0.83 F1
achieved by TIRSemZh (Llorens, Saquete, Navarro, Li, & He, 2011). It is possible that a good
portion of this gain was due to some of the default configurations of the crfsuite classifier, since
our feature sets overlapped for the most part. The TimeChar feature, which was unique to our
algorithm, did not increase accuracy significantly, but the data set is too small to judge its
usefulness. Either way, it does not explain the extra 3 percentage points gained over TIRSemZh,
which had more features and also used semantic roles, so we must assume that the CRF
implementation we used was configured in a way more beneficial for this task.
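For clarity, macro-averaging here means computing precision, recall, and F1 separately for each of the B, I, and O labels and then averaging the three values. A minimal sketch of that computation, in our own formulation:

```python
# Our illustration of the reported metric: compute precision, recall, and
# F1 per label over flat lists of gold and predicted IOB tags, then
# average across the three labels.
def macro_prf(gold, pred, labels=("B", "I", "O")):
    per_label = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_label.append((prec, rec, f1))
    return tuple(sum(s[i] for s in per_label) / len(labels) for i in range(3))
```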
After we tested this model on the TempEval data, we constructed a final model using all of the
training and testing data combined. With this model, we extracted time expressions in our main
data set and created a parallel time file that would be used for features during the tense
prediction stage. An example of this file is shown in Appendix A. Time expressions are denoted
with brackets.
After we extracted time expressions in our main data set, we performed some simple analysis to
understand the nature of these time expressions. Figure 12 is a frequency distribution of time
expressions in the data, with digits normalized (i.e. numerical digits, Arabic and Chinese, are
mapped to “D”). In the entire data set, there were a total of 405 unique time expressions, which
were condensed to 194 normalized time expressions. The distribution is consistent with Zipf’s
Law, where the frequency of a word is inversely proportional to its rank, a phenomenon we see
often in frequency distributions in natural language (Zipf, 1932). Not surprisingly, the most
common normalized time expression is “DDDD年”, which is the format for specifying a year.
Following that is “目前”, which means “now”, and then “去年”, which means “last year”. An
example of one of the many hapaxes is “白垩纪”, which means “Cretaceous Period”.
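A rough sketch of this normalization and counting step follows; the exact numeral inventory mapped to “D” is our assumption, since the thesis does not enumerate it.

```python
# Normalize digits to "D" and count normalized time expression types.
# The numeral class below (Arabic, full-width, and common Chinese
# numerals) is our assumption; the thesis does not list the exact set.
import re
from collections import Counter

DIGITS = re.compile(r"[0-9０-９〇一二三四五六七八九十百千万]")

def normalize(timex):
    return DIGITS.sub("D", timex)

def frequency_distribution(timexes):
    # Returns e.g. [("DDDD年", n1), ("目前", n2), ("去年", n3), ...]
    return Counter(normalize(t) for t in timexes).most_common()
```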
Figure 12: Frequency Distribution of Normalized Time Expressions
We will revisit this data later on, when we begin to resolve these time expressions to verbs and
interpret their meaning with regard to tense.
LINKING TIME EXPRESSIONS TO VERBS
We use a rule-based approach to link time expressions to their verb counterparts. The rule is
based on the following assumption, which we have found to often hold in Chinese:

A time expression has jurisdiction over all verbs that are ancestors of its phrase node and ancestors of its sibling phrase nodes in a syntactic tree, unless obstructed by a CP or IP node.

Given this definition, we are able to associate time expressions with verbs by traversing the
syntactic tree. To test this method, we used data provided by Zhou et al. (2012) in which time
expressions were manually associated with events (i.e. verbs) using Mechanical Turk. In their
annotation scheme, a maximum of one time expression is associated with each event, whereas
our method for extracting time expressions has no maximum.
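Under one reading of this rule, the traversal can be sketched as follows with nltk.Tree and Chinese Treebank labels: climb from the time expression's phrase node, harvest verbs from each ancestor's other children, and stop at a CP or IP boundary. This is our illustrative rendering, not the thesis's implementation.

```python
# Illustrative traversal for the linking rule. Preterminal verb tags
# (VV, VA, VC, VE, ...) all begin with "V"; CP/IP nodes block the search.
from nltk import Tree

def collect_verbs(node):
    """Collect verb tokens under node without descending into CP/IP."""
    if not isinstance(node, Tree) or node.label() in ("CP", "IP"):
        return []
    if node.height() == 2 and node.label().startswith("V"):
        return [node[0]]  # preterminal: its single child is the word itself
    return [v for child in node for v in collect_verbs(child)]

def linked_verbs(tree, timex_pos):
    """timex_pos: tree position (tuple) of the time expression's phrase node."""
    verbs, pos = [], timex_pos
    while pos:
        parent = tree[pos[:-1]]  # tree[()] is the root itself
        for i, sibling in enumerate(parent):
            if i != pos[-1]:  # skip the subtree we just climbed out of
                verbs.extend(collect_verbs(sibling))
        if parent.label() in ("CP", "IP"):
            break  # a clause boundary obstructs the time expression's scope
        pos = pos[:-1]
    return verbs
```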
The annotated data came from 73 Chinese Treebank files and contained 2,902 annotated events.
Our rule-based approach achieved 64% accuracy if we consider a match to be an exact match,
and 68% accuracy when we consider a match to be one in which the gold match is included in
the set of time expressions that the rule-based algorithm extracted for a given event. We
consider the 68% figure to be more representative of reality, since the annotated gold data was
artificially constrained to one time expression per event.
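The two match criteria can be stated compactly as below, in our own formulation: preds holds, for each event, the set of time expressions our rule links to it, and golds holds the single gold expression per event.

```python
# "exact" requires the predicted set to be exactly the singleton gold
# expression; "inclusive" only requires the gold expression to appear
# among the predictions. Our formulation of the reported criteria.
def accuracy(preds, golds):
    exact = sum(p == {g} for p, g in zip(preds, golds)) / len(golds)
    inclusive = sum(g in p for p, g in zip(preds, golds)) / len(golds)
    return exact, inclusive
```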
Although further improvements could, and eventually should, be made to this time-verb linking
algorithm, the next step would require significantly more effort and data, which falls outside the
scope of this thesis. We therefore used this rule-based association method for our purposes and
proceeded to make the time associations in our main data set.
TENSE PREDICTION
We used a Conditional Random Fields algorithm that was part of the crfsuite package (Okazaki,
2007) to predict tense in Chinese. We looked at verbs only and attempted to tag them with
their correct tense, treating the verbs within a single sentence as the sequence for our
sequence model.
FEATURES

The following features were used to predict tense. These features were borrowed in part from
Xue (2008). Some of the simpler lexical features were borrowed from feature sets traditionally
used for Chinese POS-tagging (Ng & Low, 2004).
1. Most Frequent Tense
For this feature, we used 50,000 lines of complementary Chinese Treebank parallel data that
was automatically parsed and aligned. We performed our tense and aspect mappings as we did
with our gold data. Then we found the most common tag associated with each verb, excluding
VNOMAP and VOTHER whenever other tags existed for that verb. This feature is therefore the
string of the most common tag associated with the verb.
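A sketch of how this lookup table might be built; verb_tag_pairs and the helper name are our own illustration of the procedure described above.

```python
# Build the Most Frequent Tense table. verb_tag_pairs stands for the
# (verb, mapped tag) pairs obtained from the 50,000 lines of parallel
# data; names are illustrative, not the thesis's original code.
from collections import Counter, defaultdict

def build_mft_table(verb_tag_pairs):
    counts = defaultdict(Counter)
    for verb, tag in verb_tag_pairs:
        counts[verb][tag] += 1
    table = {}
    for verb, ctr in counts.items():
        # Prefer a substantive tag; fall back to VNOMAP/VOTHER only when
        # no other tag was ever seen for this verb.
        ranked = [t for t, _ in ctr.most_common() if t not in ("VNOMAP", "VOTHER")]
        table[verb] = ranked[0] if ranked else ctr.most_common(1)[0][0]
    return table
```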
2. Time Expressions
These are the strings of all time expressions associated with the verb as determined by our
algorithm described in LINKING TIME EXPRESSIONS TO VERBS.
3. Time Expression Value
We used the PKU dictionary (Wang & Yu, 2003) for this feature, which lists time expressions
along with a potential “tense” value: 过 (past), 未 (future), or 否 (none). If any of the time
expressions from the Time Expressions feature have tense values, these were used.
4. Verb Classes
We also used the PKU dictionary for this feature. If the verb is placed into one or more verb
classes, we use the numbers associated with all classes.
5. Position in Verb Compound
If the verb is part of a verb compound (VSB, VCD, VRD, VCP, VNV, VPT), this feature is its
position in the compound, either first or last.
6. Quotes
If the verb is in quotes, then this feature returns True.
7. Verb
The verb string.
8. Previous Word
The previous word token.
9. Verb POS
The POS of the verb based on the automatic parse.
10. Next POS
The POS of the next word in the sentence.
11. Previous and Current POS
The POS of the previous word plus the POS of the current word.
12. Current and Next POS
The POS of the current word plus the POS of the next word.
13. Next Next POS
The POS of the word following the next word.
14. Previous and Next POS
The POS of the previous word plus the POS of the next word.
15. Post-Verb Aspect Marker
The aspect marker that immediately follows the verb, if one exists.
16. Adverb
All adverbs that modify the verb.
17. Right DER
If the functional character 得 occurs after the verb, then this feature is True. This character is followed by a modifier that signals how, or the degree to which, the action is performed.
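To tie the list together, here is a condensed, illustrative assembly of a feature map for one verb, covering features 1, 2, 7 through 10, 15, and 17; the stub table and the linked time expressions stand in for the resources described above.

```python
# Condensed sketch of per-verb feature assembly (features 1, 2, 7-10, 15,
# and 17 above). MFT_TABLE stands in for the Most Frequent Tense lookup
# (see build_mft_table earlier); timexes holds the time expressions
# linked to this verb by our rule-based algorithm.
ASPECT_MARKERS = {"了", "着", "过"}
MFT_TABLE = {}  # verb -> most frequent tense tag (stub)

def verb_features(sent, i, timexes):
    """sent: list of (word, POS) pairs; i: index of the verb."""
    word, pos = sent[i]
    prev_w, _ = sent[i - 1] if i > 0 else ("BOS", "BOS")
    next_w, next_p = sent[i + 1] if i + 1 < len(sent) else ("EOS", "EOS")
    return {
        "MFT": MFT_TABLE.get(word, "NONE"),            # 1. Most Frequent Tense
        "TIMEX": "|".join(timexes) or "NONE",          # 2. Time Expressions
        "VERB": word,                                  # 7. Verb
        "PREV_WORD": prev_w,                           # 8. Previous Word
        "VERB_POS": pos,                               # 9. Verb POS
        "NEXT_POS": next_p,                            # 10. Next POS
        "ASPECT": next_w if next_w in ASPECT_MARKERS else "NONE",  # 15. Post-Verb Aspect Marker
        "RIGHT_DER": next_w == "得",                   # 17. Right DER
    }
```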
Figure 13 is an example of a tree structure taken from our data with some features highlighted,
namely the current verb, an adverb, and a time expression.
Figure 13: Example of features in a syntactic tree
RESULTS

We established two baseline measures. The first used the same data as the Most Frequent
Tense feature and tagged each verb with its most frequent tense, if it had one. Since most verbs
are most frequently tagged with VOTHER or VNOMAP, we excluded these when other options
were available. This baseline came to 0.214. The second baseline was simply to take the most
frequent tag overall, which was VOTHER, and tag all verbs as such. This was slightly higher at
0.219.
Using the Conditional Random Fields algorithm provided by crfsuite and 10-fold cross-validation
on our data set, we were able to achieve 0.552 accuracy, a gain of roughly 34 percentage points
over our baselines. See Table 5 for these figures.
Table 5: Results Compared to Baseline
Baseline    Final Accuracy
0.22        0.55
We looked at the removal of each individual feature to see how much it contributed to our
final score.
Table 6: Feature Significance
Feature #    Accuracy    Difference from Best
All          0.552
7            0.514       -0.038
16           0.524       -0.028
13           0.534       -0.018
4            0.536       -0.016
8            0.538       -0.013
14           0.540       -0.012
1            0.545       -0.007
2            0.545       -0.007
11           0.545       -0.007
5            0.546       -0.005
9            0.548       -0.004
12           0.548       -0.004
3            0.549       -0.003
6            0.549       -0.003
10           0.549       -0.003
15           0.549       -0.003
17           0.552       0.000
The top five most important features were the verb itself, the adverbs, the POS of the word
following the next word, the verb classes, and the previous word. Common adverbs like “已经”,
meaning “already”, and “将”, meaning “in the future”, encompass important temporal cues that
strictly confine the options for tense on the modified verb, so it makes sense that this is an
important feature. Our time expression features were not as significant as we expected;
however, we believe this only shows that we have not yet found a way to capture the relevant
information that they provide. The Most Frequent Tense feature was less significant than we
would have thought, which we consider a better scenario, since we would rather our algorithm
not rely on pre-compiled static information.
In terms of precision and recall, the results for each tag are displayed in Table 7.
Table 7: Precision, Recall, and F1 Scores for Each Tag