Chinese Verb Tense?
Using English Parallel Data to Map Tense onto Chinese and Subsequent Tense Classification.
Master’s Thesis
Presented to
The Faculty of the Graduate School of Arts and Sciences Brandeis University
Department of Computer Science Graduate Program in Computational Linguistics
Nianwen Xue, Advisor
In Partial Fulfillment of the Requirements for
Master’s Degree
by Elizabeth Baran
February 2013
Acknowledgements
I want to thank my advisor, Nianwen Xue, for all of the opportunities and support he has
provided over the past couple years. Thank you for giving me an outlet to grow and further
develop my passion for languages. What I have learned has been invaluable.
I would also like to thank my family for their constant love and support throughout this
process.
ABSTRACT
Chinese Verb Tense?
Using English Parallel Data to Map Tense onto Chinese and Subsequent Tense Classification.
A thesis presented to the Department of Computer Science Graduate School of Arts and Sciences
Brandeis University Waltham, Massachusetts
By Elizabeth Baran
We explore time in Chinese by mapping tense information from a manually aligned English parallel corpus onto Chinese verbs. We construct a detailed mapping procedure to accurately convey tense in English through combinations of word tokens and parts of speech, and then transfer that information onto verbs in Chinese. We explore the resulting Chinese data set and discuss the pros and cons of this mapping technique. Using this Chinese data set, augmented with tense, we attempt to automatically predict the tense of each verb in Chinese using a Conditional Random Fields algorithm along with a suite of linguistic features. We include an algorithm for extracting time expressions and associating them with verbs, and integrate that as a feature into our tense prediction algorithm. We achieve a 34-percentage-point accuracy gain over our baseline, as well as a much deeper understanding of how tense can transfer between English and Chinese in a translation environment.
TABLE OF CONTENTS

Introduction
Related Work
Data
We framed this task as a simple IOB recognition task and trained a Conditional Random Fields
algorithm using the crfsuite package (Okazaki, 2007), which is a first-order Markov model
implementation. If a word began a time expression, it was given the label “B”. If a word was
inside of a time expression but was not the first word, it was given the label “I”. Any word
outside of a time expression was labeled “O”. We used the following features to train our
algorithm:
Features for Timex Extraction
1. WORD: The current word.
2. POS: The POS of the current word.
3. PREV_POS: The part-of-speech of the previous token.
4. NEXT_POS: The part-of-speech of the next token.
5. NORMALIZED: The character string of the word with all digits substituted with a D, so “2009年” becomes “DDDD年”.
6. TimeChar: True if any of the characters in Figure 11 are part of the word. A time character signals some sort of time or duration when used on its own or as part of another word. We compiled this list ourselves, using our own intuitions about the language; it is essentially a white list incorporated into the algorithm. The characters are limited to those that are unambiguously related to time, so we expect that this feature can only help the algorithm, even if the current data set may be too small to demonstrate its significance.
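To make the setup concrete, the sketch below shows how this per-token feature extraction and training could look with the python-crfsuite binding to crfsuite. The helper names and data layout are our own illustration, not the thesis's original code.

```python
# A minimal sketch of the IOB feature extraction described above, assuming
# each sentence is a list of (word, POS) pairs. Feature names mirror the
# list in the text; the code itself is illustrative, not the thesis's own.
import re
import pycrfsuite  # Python binding to the crfsuite package (Okazaki, 2007)

# White list of unambiguous time characters from Figure 11.
TIME_CHARS = set("今明昨时候纪钟天日月年早晚期")

def token_features(sent, i):
    word, pos = sent[i]
    return {
        "WORD": word,
        "POS": pos,
        "PREV_POS": sent[i - 1][1] if i > 0 else "BOS",
        "NEXT_POS": sent[i + 1][1] if i + 1 < len(sent) else "EOS",
        # Substitute every digit with "D", e.g. "2009年" -> "DDDD年".
        "NORMALIZED": re.sub(r"\d", "D", word),
        "TimeChar": any(ch in TIME_CHARS for ch in word),
    }

def train_timex_model(sentences, iob_labels, model_path="timex.crfsuite"):
    """sentences: list of [(word, POS), ...]; iob_labels: parallel B/I/O tag lists."""
    trainer = pycrfsuite.Trainer(verbose=False)
    for sent, tags in zip(sentences, iob_labels):
        trainer.append([token_features(sent, i) for i in range(len(sent))], tags)
    trainer.train(model_path)
```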
Figure 11: Description of TimeChar Characters
Character Translation Example Contexts
今 now 如今 “up until now”, 今年 “this year”
明 tomorrow 明天 “tomorrow”
昨 yesterday 昨晚 “last night”
时 time; at that time; while 做功课时 “while doing homework”
候 period 小的时候 “when [I] was little”
纪 century; period 世纪 “century”
钟 hour 两个钟头 “two hours”
天 day 五天后 “after 5 days”
日 day 10月10日 “October 10th”
月 month 下个月 “next month”
年 year 去年 “last year”
早 early 早上 “morning”
晚 late 昨晚 “last night”
期 period 星期日 “Sunday”
We achieved 0.95 precision, 0.85 recall, and 0.89 F1, macro-averaged across the three IOB
categories, which is a fair improvement over the 0.94 precision, 0.74 recall, and 0.83 F1
achieved by TIRSemZh (Llorens, Saquete, Navarro, Li, & He, 2011). It is possible that a good
portion of this gain was due to some of the default configurations of the crfsuite classifier, since
our feature sets overlapped for the most part. The TimeChar feature, which was unique to our
algorithm, did not increase accuracy significantly, but the data set is too small to judge its
usefulness. Either way, it does not explain the extra 3 percentage points gained over TIRSemZh,
which had more features and also used semantic roles, so we must assume that the CRF
implementation we used was configured in a way more beneficial for this task.
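For clarity, macro-averaging here means computing precision, recall, and F1 separately for each of the B, I, and O labels and then averaging the three values. A minimal sketch of that computation, in our own formulation:

```python
# Our illustration of the reported metric: compute precision, recall, and
# F1 per label over flat lists of gold and predicted IOB tags, then
# average across the three labels.
def macro_prf(gold, pred, labels=("B", "I", "O")):
    per_label = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_label.append((prec, rec, f1))
    return tuple(sum(s[i] for s in per_label) / len(labels) for i in range(3))
```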
After we tested this model on the TempEval data, we constructed a final model using all of the
training and testing data combined. With this model, we extracted time expressions in our main
data set and created a parallel time file that would be used for features during the tense
prediction stage. An example of this file is shown in Appendix A. Time expressions are denoted
with brackets.
After we extracted time expressions in our main data set, we performed some simple analysis to
understand the nature of these time expressions. Figure 12 is a frequency distribution of time
expressions in the data, with digits normalized (i.e. numerical digits, Arabic and Chinese, are
mapped to “D”). In the entire data set, there were a total of 405 unique time expressions, which
were condensed to 194 normalized time expressions. The distribution is consistent with Zipf’s
Law, where the frequency of a word is inversely proportional to its rank, a phenomenon we see
often in frequency distributions in natural language (Zipf, 1932). Not surprisingly, the most
common normalized time expression is “DDDD年”, which is the format for specifying a year.
Following that is “目前”, which means “now”, and then “去年”, which means “last year”. An
example of one of the many hapaxes is “白垩纪”, which means “Cretaceous Period”.
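A rough sketch of this normalization and counting step follows; the exact numeral inventory mapped to “D” is our assumption, since the thesis does not enumerate it.

```python
# Normalize digits to "D" and count normalized time expression types.
# The numeral class below (Arabic, full-width, and common Chinese
# numerals) is our assumption; the thesis does not list the exact set.
import re
from collections import Counter

DIGITS = re.compile(r"[0-9０-９〇一二三四五六七八九十百千万]")

def normalize(timex):
    return DIGITS.sub("D", timex)

def frequency_distribution(timexes):
    # Returns e.g. [("DDDD年", n1), ("目前", n2), ("去年", n3), ...]
    return Counter(normalize(t) for t in timexes).most_common()
```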
Figure 12: Frequency Distribution of Normalized Time Expressions
We will revisit this data later on, when we begin to resolve these time expressions to verbs and
interpret their meaning with regard to tense.
LINKING TIME EXPRESSIONS TO VERBS
We use a rule-based approach to link time expressions to their verb counterparts. The rule is
based on the following assumption, which we have found to often hold in Chinese:

A time expression has jurisdiction over all verbs that are ancestors of its phrase node and ancestors of its sibling phrase nodes in a syntactic tree, unless obstructed by a CP or IP node.

Given this definition, we are able to associate time expressions with verbs by traversing the
syntactic tree. To test this method, we used data provided by Zhou et al. (2012) in which time
expressions were manually associated with events (i.e. verbs) using Mechanical Turk. In their
annotation scheme, a maximum of one time expression is associated with each event, whereas
our method for extracting time expressions has no maximum.
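Under one reading of this rule, the traversal can be sketched as follows with nltk.Tree and Chinese Treebank labels: climb from the time expression's phrase node, harvest verbs from each ancestor's other children, and stop at a CP or IP boundary. This is our illustrative rendering, not the thesis's implementation.

```python
# Illustrative traversal for the linking rule. Preterminal verb tags
# (VV, VA, VC, VE, ...) all begin with "V"; CP/IP nodes block the search.
from nltk import Tree

def collect_verbs(node):
    """Collect verb tokens under node without descending into CP/IP."""
    if not isinstance(node, Tree) or node.label() in ("CP", "IP"):
        return []
    if node.height() == 2 and node.label().startswith("V"):
        return [node[0]]  # preterminal: its single child is the word itself
    return [v for child in node for v in collect_verbs(child)]

def linked_verbs(tree, timex_pos):
    """timex_pos: tree position (tuple) of the time expression's phrase node."""
    verbs, pos = [], timex_pos
    while pos:
        parent = tree[pos[:-1]]  # tree[()] is the root itself
        for i, sibling in enumerate(parent):
            if i != pos[-1]:  # skip the subtree we just climbed out of
                verbs.extend(collect_verbs(sibling))
        if parent.label() in ("CP", "IP"):
            break  # a clause boundary obstructs the time expression's scope
        pos = pos[:-1]
    return verbs
```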
The annotated data came from 73 Chinese Treebank files and contained 2,902 annotated events.
Our rule-based approach achieved 64% accuracy if we consider a match to be an exact match,
and 68% accuracy when we consider a match to be one in which the gold match is included in
the set of time expressions that the rule-based algorithm extracted for a given event. We
consider the 68% figure to be more representative of reality, since the annotated gold data was
artificially constrained to one time expression per event.
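The two match criteria can be stated compactly as below, in our own formulation: preds holds, for each event, the set of time expressions our rule links to it, and golds holds the single gold expression per event.

```python
# "exact" requires the predicted set to be exactly the singleton gold
# expression; "inclusive" only requires the gold expression to appear
# among the predictions. Our formulation of the reported criteria.
def accuracy(preds, golds):
    exact = sum(p == {g} for p, g in zip(preds, golds)) / len(golds)
    inclusive = sum(g in p for p, g in zip(preds, golds)) / len(golds)
    return exact, inclusive
```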
Although further improvements could, and eventually should, be made to this time-verb linking
algorithm, the next step would require significantly more effort and data, which falls outside the
scope of this thesis. We therefore used this rule-based association method for our purposes and
proceeded to make the time associations in our main data set.
TENSE PREDICTION
We used a Conditional Random Fields algorithm that was part of the crfsuite package (Okazaki,
2007) to predict tense in Chinese. We looked at verbs only and attempted to tag them with
their correct tense, treating the verbs within a single sentence as the sequence for our
sequence model.
FEATURES

The following features were used to predict tense. These features were borrowed in part from
Xue (2008). Some of the simpler lexical features were borrowed from feature sets traditionally
used for Chinese POS-tagging (Ng & Low, 2004).
1. Most Frequent Tense
For this feature, we used 50,000 lines of complementary Chinese Treebank parallel data that
was automatically parsed and aligned. We performed our tense and aspect mappings as we did
with our gold data. Then we found the most common tag associated with each verb, excluding
VNOMAP and VOTHER whenever other tags existed for that verb. This feature is therefore the
string of the most common tag associated with the verb.
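A sketch of how this lookup table might be built; verb_tag_pairs and the helper name are our own illustration of the procedure described above.

```python
# Build the Most Frequent Tense table. verb_tag_pairs stands for the
# (verb, mapped tag) pairs obtained from the 50,000 lines of parallel
# data; names are illustrative, not the thesis's original code.
from collections import Counter, defaultdict

def build_mft_table(verb_tag_pairs):
    counts = defaultdict(Counter)
    for verb, tag in verb_tag_pairs:
        counts[verb][tag] += 1
    table = {}
    for verb, ctr in counts.items():
        # Prefer a substantive tag; fall back to VNOMAP/VOTHER only when
        # no other tag was ever seen for this verb.
        ranked = [t for t, _ in ctr.most_common() if t not in ("VNOMAP", "VOTHER")]
        table[verb] = ranked[0] if ranked else ctr.most_common(1)[0][0]
    return table
```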
2. Time Expressions
These are the strings of all time expressions associated with the verb as determined by our
algorithm described in LINKING TIME EXPRESSIONS TO VERBS.
3. Time Expression Value
We used the PKU dictionary (Wang & Yu, 2003) for this feature, which lists time expressions
along with a potential “tense” value: 过 (past), 未 (future), or 否 (none). If any of the time
expressions from the Time Expressions feature have tense values, these were used.
4. Verb Classes
We also used the PKU dictionary for this feature. If the verb is placed into one or more verb
classes, we use the numbers associated with all classes.
5. Position in Verb Compound
If the verb is part of a verb compound (VSB, VCD, VRD, VCP, VNV, VPT), this feature is its
position in the compound, either first or last.
6. Quotes
If the verb is in quotes, then this feature returns True.
7. Verb
The verb string.
8. Previous Word
The previous word token.
9. Verb POS
The POS of the verb based on the automatic parse.
10. Next POS
The POS of the next word in the sentence.
11. Previous and Current POS
The POS of the previous word plus the POS of the current word.
12. Current and Next POS
The POS of the current word plus the POS of the next word.
13. Next Next POS
The POS of the word following the next word.
14. Previous and Next POS
The POS of the previous word plus the POS of the next word.
15. Post-Verb Aspect Marker
The aspect marker that immediately follows the verb, if one exists.
16. Adverb
All adverbs that modify the verb.
17. Right DER
If the functional character 得 occurs after the verb, then this feature is True. This character is followed by a modifier that signals how, or the degree to which, the action is performed.
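To tie the list together, here is a condensed, illustrative assembly of a feature map for one verb, covering features 1, 2, 7 through 10, 15, and 17; the stub table and the linked time expressions stand in for the resources described above.

```python
# Condensed sketch of per-verb feature assembly (features 1, 2, 7-10, 15,
# and 17 above). MFT_TABLE stands in for the Most Frequent Tense lookup
# (see build_mft_table earlier); timexes holds the time expressions
# linked to this verb by our rule-based algorithm.
ASPECT_MARKERS = {"了", "着", "过"}
MFT_TABLE = {}  # verb -> most frequent tense tag (stub)

def verb_features(sent, i, timexes):
    """sent: list of (word, POS) pairs; i: index of the verb."""
    word, pos = sent[i]
    prev_w, _ = sent[i - 1] if i > 0 else ("BOS", "BOS")
    next_w, next_p = sent[i + 1] if i + 1 < len(sent) else ("EOS", "EOS")
    return {
        "MFT": MFT_TABLE.get(word, "NONE"),            # 1. Most Frequent Tense
        "TIMEX": "|".join(timexes) or "NONE",          # 2. Time Expressions
        "VERB": word,                                  # 7. Verb
        "PREV_WORD": prev_w,                           # 8. Previous Word
        "VERB_POS": pos,                               # 9. Verb POS
        "NEXT_POS": next_p,                            # 10. Next POS
        "ASPECT": next_w if next_w in ASPECT_MARKERS else "NONE",  # 15. Post-Verb Aspect Marker
        "RIGHT_DER": next_w == "得",                   # 17. Right DER
    }
```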
Figure 13 is an example of a tree structure taken from our data with some features highlighted,
namely the current verb, an adverb, and a time expression.
Figure 13: Example of features in a syntactic tree
RESULTS

We established two baseline measures. The first used the same data as the Most Frequent
Tense feature and tagged each verb with its most frequent tense, if it had one. Since most verbs
are most frequently tagged with VOTHER or VNOMAP, we excluded these when other options
were available. This baseline came to 0.214. The second baseline was simply to take the most
frequent tag overall, which was VOTHER, and tag all verbs as such. This was slightly higher at
0.219.
Using the Conditional Random Fields algorithm provided by crfsuite and 10-fold cross-validation
on our data set, we were able to achieve 0.552 accuracy, a gain of roughly 34 percentage points
over our baselines. See Table 5 for these figures.
Table 5: Results Compared to Baseline
Baseline    Final Accuracy
0.22        0.55
We looked at the removal of each individual feature to see how much it contributed to our
final score.
Table 6: Feature Significance
Feature #    Accuracy    Difference from Best
All          0.552
7            0.514       -0.038
16           0.524       -0.028
13           0.534       -0.018
4            0.536       -0.016
8            0.538       -0.013
14           0.540       -0.012
1            0.545       -0.007
2            0.545       -0.007
11           0.545       -0.007
5            0.546       -0.005
9            0.548       -0.004
12           0.548       -0.004
3            0.549       -0.003
6            0.549       -0.003
10           0.549       -0.003
15           0.549       -0.003
17           0.552       0.000
The top five most important features were the verb itself, the adverbs, the POS of the word
following the next word, the verb classes, and the previous word. Common adverbs like “已经”,
meaning “already”, and “将”, meaning “in the future”, encompass important temporal cues that
strictly confine the options for tense on the modified verb, so it makes sense that this is an
important feature. Our time expression features were not as significant as we expected;
however, we believe this only shows that we have not yet found a way to capture the relevant
information that they provide. The Most Frequent Tense feature was less significant than we
would have thought, which we consider a better scenario, since we would rather our algorithm
not rely on pre-compiled static information.
In terms of precision and recall, the results for each tag are displayed in Table 7.
Table 7: Precision, Recall, and F1 Scores for Each Tag