Linguistic annotation 2/14/2006 Nianwen Xue

Dec 17, 2015


Linguistic annotation

2/14/2006

Nianwen Xue


Outline

• Tokenization / segmentation, POS tagging
• Treebanking
    Constituent structure and structural ambiguity
    Basic grammatical relations and how argument structure is instantiated
• Propbanking/nombanking
    Cross-linguistic syntactic alternations, verb senses and argument structure
• Others: named entity, coreference, discourse connectives


Tokenization

• English
    In the new position he will oversee Mazda ’s U.S. sales , services , parts and marketing operations .
    We did n’t have much of a choice .
    U.S. trade officials said the Philippines and Thailand would be the main beneficiaries of the president ’s action .
    Anything ’s possible -- how about the new Guinea Fund ?
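The token boundaries in these examples can be approximated with a few rules. The sketch below is only an illustration, not the Penn Treebank's own tokenizer: the function name `tokenize`, the clitic list, and the rule that only a sentence-final period is split off are all assumptions made for this example.

```python
import re

# Clitics split off in the examples above (n't, 's, 're, 've, 'd, 'll, 'm),
# written with either a straight or a curly apostrophe.
CLITIC = re.compile(r"(n['’]t|['’](?:s|re|ve|d|ll|m))$", re.IGNORECASE)
PUNCTUATION = ",;:?!"

def tokenize(sentence):
    """Split one sentence into Penn Treebank-style tokens (rough sketch)."""
    tokens = []
    words = sentence.split()
    for position, word in enumerate(words):
        trailing = []
        # Only a sentence-final period is split off, so "U.S." and "Nov."
        # survive in the middle of a sentence.
        if position == len(words) - 1 and len(word) > 1 and word.endswith("."):
            word, trailing = word[:-1], ["."]
        # Peel other punctuation marks off the end of the word.
        while len(word) > 1 and word[-1] in PUNCTUATION:
            trailing.insert(0, word[-1])
            word = word[:-1]
        match = CLITIC.search(word)
        if match and match.start() > 0:
            tokens.extend([word[:match.start()], match.group(1)])
        else:
            tokens.append(word)
        tokens.extend(trailing)
    return tokens

print(tokenize("We didn't have much of a choice."))
```

The sketch deliberately leaves the hard cases open: a sentence that ends in "U.S." loses the abbreviation's period, and amounts such as "$1.5" or "0.7%" are not split at all.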



Tokenization

• The federal government suspended sales of the U.S. savings bonds because Congress has n’t lifted the ceiling on government debt .

• The Treasury said the U.S. will default on Nov. 9 if Congress does n’t act by then .



Tokenization

• Assets of the 400 taxable funds grew by $ 1.5 billion during the latest week .

• Exports in October stood at $ 5.29 billion , a mere 0.7 % increase from a year earlier , while imports increased sharply to $ 5.39 billion , up 20 % from last year .

• Do you notice any ambiguity in tokenization?



Exercise

• How many sentences in the WSJ corpus of the Penn Treebank contain “’re”?

• How many sentences in the WSJ corpus of the Penn Treebank contain “’d”?
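One way to approach the exercise, assuming the corpus is available as a list of tokenized sentences, is sketched below. The helper name `count_sentences_with_token` and the toy corpus are made up for illustration. With NLTK installed and its Treebank sample downloaded, `nltk.corpus.treebank.sents()` could supply the sentences, but that sample covers only part of the WSJ corpus, so its counts would not answer the exercise in full.

```python
def count_sentences_with_token(sentences, token):
    """Count the sentences (lists of tokens) that contain `token` at least once."""
    return sum(1 for sentence in sentences if token in sentence)

# Toy stand-in for the corpus: each sentence is a list of tokens.
toy_corpus = [
    ["We", "'re", "about", "to", "see", "."],
    ["They", "said", "they", "'d", "act", "."],
    ["We", "did", "n't", "have", "much", "of", "a", "choice", "."],
]

print(count_sentences_with_token(toy_corpus, "'re"))
print(count_sentences_with_token(toy_corpus, "'d"))
```

Because the count is made on tokens, the answer depends on how the corpus was tokenized in the first place.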


Big deal, you say

• The problem is pushed to the forefront for languages like Chinese, where there are no delimiting spaces between words

这句话里有几个词?

Howmanywordsarethereinthissentence?


zhe ju hua li you ji ge ci
这 句 话 里 有 几 个 词 ?
this CL sentence inside have [how many] CL word ?

How many words are there in this sentence ?


A much harder problem than it first appears…

• Well, what if we just create a list of words (a dictionary) and compare the sentence against this list?

• 日文章鱼怎么说 ?
    Dictionary entries: 日 “sun”, 日文 “Japanese”, 文章 “article”, 章鱼 “octopus”, 鱼 “fish”, 怎么 “how”, 说 “say”


• 日文 章鱼 怎么 说 ?
    Japanese octopus how say
    “How do you say octopus in Japanese?”

• 日 文章 鱼 怎么 说 ?
    sun article fish how say
    ???
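The ambiguity can be reproduced mechanically. The sketch below enumerates every way of covering the sentence with dictionary entries, using the seven entries listed for this example; the function name `segmentations` is made up for illustration, and a real segmenter would also need some way of choosing among the candidates.

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in dictionary:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], dictionary):
                results.append([prefix] + rest)
    return results

dictionary = {"日", "日文", "文章", "章鱼", "鱼", "怎么", "说"}

for candidate in segmentations("日文章鱼怎么说", dictionary):
    print(" ".join(candidate))
```

Both readings come back as equally valid dictionary matches, so the word list alone cannot pick between them.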


Computer problem vs human problem

• Well that may be a problem for the computer because the computer is dumb…

• Segmentation is difficult for humans as well
    What is a word?
    Different criteria do not coincide


What if we let native speakers follow their intuitions?

• Inadequate level of inter-annotator agreement
    Sproat, 1996: 70%
    Xue et al., 2005: 90%

• Conclusion: need a linguistic definition of wordhood to develop segmentation standards
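The slide does not say how the agreement figures were computed. One plausible measure, shown here purely as an assumption, compares the character spans of the words each annotator marked and reports the share they have in common (a Dice / F1 score). The function names and the two sample segmentations are illustrative only.

```python
def word_spans(words):
    """Map a segmentation to the set of (start, end) character spans of its words."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def agreement(segmentation_a, segmentation_b):
    """Dice / F1 overlap between the word spans of two non-empty segmentations."""
    a, b = word_spans(segmentation_a), word_spans(segmentation_b)
    return 2 * len(a & b) / (len(a) + len(b))

annotator_1 = ["日文", "章鱼", "怎么", "说"]
annotator_2 = ["日", "文章", "鱼", "怎么", "说"]

print(round(agreement(annotator_1, annotator_2), 3))
```

Under this measure two annotators agree on a word only when they place both of its boundaries in the same positions.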


Packard’s (2000) notions of word

• Orthographic word: Words are defined by delimiters in written text. This appears to have no relevance in Chinese since there are no such written delimiters

• Sociological word: Following (Chao, 1968, pp. 136-138), these are ‘that type of unit, intermediate in size between a phoneme and a sentence, which the general, non-linguistic public is conscious of, talks about, has an every day term for, and is practically concerned with in various ways.’ In English this is the lay notion of ‘word’, whereas in Chinese this is the character ( 字 zi).


Packard’s notions of word

• Lexical word: This corresponds to Di Sciullo and Williams’s (1987) listeme

• Semantic word: Roughly speaking this corresponds to a “unitary concept”.

• Phonological word: defined according to phonological criteria. Is it a domain in which a phonological process applies? Is it a prosodic unit?


Packard’s notions of word

• Morphological word: following Di Sciullo and Williams (1987), a morphological word is anything that is the output of a word-formation rule

• Syntactic word: These are all and only the constructions that occupy X0 in the syntax. Well, first you need to know what X0 is.

• Psycholinguistic word: this is the “‘word’ level of linguistic analysis that is … salient and highly relevant to the operation of the language processor”


Wordhood tests

• Phonological:
    Bound morpheme: a bound morpheme forms a word with its neighboring morpheme
• Syntactic:
    Insertion: if another morpheme can be inserted between X and Y, then X-Y is unlikely to be a word
    XP-substitution: if a morpheme in X-Y cannot be replaced with an XP of the same type, then X-Y is likely to be a word


Wordhood tests

• Semantic
    If the meaning of X-Y is non-compositional, then it is a word
• Others
    Productivity: if a rule that combines morpheme X and morpheme Y is not productive, then X-Y is likely to be a word
    Frequency of co-occurrence: if morphemes X and Y co-occur frequently, then they are likely to form a word


Exercise

• Given the wordhood criteria and wordhood tests we have discussed, how many words are there in “can’t”?


Answer

• Orthographic word: 1
• Sociological word: ?
• Lexical word: 2
• Semantic word: 2
• Phonological word: 1
• Morphological word: 2
• Syntactic word: 2
• Psycholinguistic word: ?


Chinese morphological types

• Reduplication
• Affixation
• Compounding
• Proper names
• Abbreviations


Verbal reduplication

说说 shuo-shuo speak-speak “speak a little”
看看 kan-kan look-look “take a look”
走走 zou-zou walk-walk “take a walk”
磨磨 mo-mo rub-rub “rub a little”
讨论讨论 taolun-taolun discuss-discuss “discuss a little”
请教请教 qingjiao-qingjiao ask-ask “ask a little”
研究研究 yanjiu-yanjiu research-research “look into”


Verbal reduplication

说一说 shuo-yi-shuo speak one speak “speak a little”
看一看 kan-yi-kan look one look “take a look”
走一走 zou-yi-zou walk one walk “take a walk”
磨一磨 mo-yi-mo rub one rub “rub a little”

* 讨论一讨论 taolun-yi-taolun discuss-one-discuss
* 请教一请教 qingjiao-yi-qingjiao ask-one-ask
* 研究一研究 yanjiu-yi-yanjiu research-one-research
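The two verbal patterns can be written as string operations. The generalization used below, that 一 can be inserted only when the verb is a single character, is read off the examples on these slides and is an assumption rather than a stated rule; the function names are made up for illustration.

```python
def reduplicate(verb):
    """Plain reduplication: AA for a one-character verb, ABAB for a two-character verb."""
    return verb + verb

def reduplicate_with_yi(verb):
    """A一A reduplication; returns None for the starred (unacceptable) cases."""
    if len(verb) != 1:
        return None
    return verb + "一" + verb

for verb in ["说", "看", "讨论", "研究"]:
    print(verb, reduplicate(verb), reduplicate_with_yi(verb))
```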


Adjectival reduplication

舒服 shufu: 舒舒服服 shushu-fufu “comfortable”; 舒服舒服 shufu-shufu “enjoy”
干净 ganjing: 干干净净 gangan-jingjing “very clean”; 干净干净 ganjing-ganjing “clean up”
糊涂 hutu: 糊糊涂涂 huhu-tutu “muddle-headed”; (?) 糊涂糊涂 hutu-hutu
快活 kuaihuo: 快快活活 kuaikuai-huohuo “happy”; 快活快活 kuaihuo-kuaihuo “make happy”
漂亮 piaoliang: 漂漂亮亮 piaopiao-liangliang “pretty”; 漂亮漂亮 piaoliang-piaoliang “make pretty”
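Each two-character adjective above appears in two reduplicated shapes: AABB, which keeps the adjectival reading, and ABAB, which reads as a verb. A minimal sketch of the two patterns, with illustrative function names, is:

```python
def aabb(adjective):
    """AABB pattern for a two-character adjective, e.g. 舒服 -> 舒舒服服."""
    first, second = adjective  # raises ValueError unless exactly two characters
    return first * 2 + second * 2

def abab(adjective):
    """ABAB pattern, e.g. 舒服 -> 舒服舒服."""
    return adjective * 2

for adjective in ["舒服", "干净", "漂亮"]:
    print(adjective, aabb(adjective), abab(adjective))
```

The code only generates the forms; whether a given ABAB form is acceptable, as with the questioned 糊涂糊涂, still has to be judged by a speaker.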


Prefixation

老 lao- 老王 lao-wang “old Wang”
小 xiao- 小王 xiao-wang “small Wang”
第 di- 第一 di-yi “first”
初 chu- 初三 chu-san “the third”
可 ke- 可爱 ke-ai “cute”, 可靠 ke-kao “reliable”


Suffixation

学 -xue 心理学 xinli-xue “psychology”
家 -jia 心理学家 xinli-xue-jia “psychologist”
化 -hua 绿化 lv-hua “greenize??”
率 -lv 录取率 luqu-lv “enrollment rate”
主义 -zhuyi 马克思主义 makesi-zhuyi “Marxism”


Compounding

Location:
    客厅 沙发 keting shafa “living room sofa”
    河 马 hema “river horse (hippopotamus)”
    海 狮 haishi “sea lion (seal)”
Used for:
    指甲 油 zhijia you “nail polish”
    乒乓 球 pingpang qiu “ping-pong ball”
    太阳眼镜 taiyang yanjing “sunglasses”
Material:
    大理石 地板 dalishi diban “marble floor”
    纸 老虎 zhi laohu “paper tiger”


Resultative verb compounding

Result:
    打破 dapo “break by hitting”
    拉开 lakai “open by pulling”
Achievement:
    写清楚 xieqingchu “write clearly”
    买到 maidao “succeed in buying”
Direction:
    跳过去 tiaoguoqu “jump across”
    走进来 zoujinlai “come walking in”


Subject-Verb compounds

头疼 tou-teng (head hurt) “have a headache”
嘴硬 zui-ying (mouth hard) “stubborn”
眼红 yan-hong (eye red) “covet”
心酸 xin-suan (heart sour) “feel sad”
命苦 ming-ku (fate bitter) “unlucky”


Subject-Verb compounds

我 的 头 很 疼
I DE head very hurt
“My head hurts badly.”

这 事 让 我 很 头疼
this matter make I very headache
“This gave me a real headache.”


Verb-object compounds

出版 chu-ban (emit edition) “publish”
睡觉 shui-jiao (sleep sleep) “sleep”
毕业 bi-ye (finish study) “graduate”
开刀 kai-dao (operate knife) “operate”
开玩笑 kai-wanxiao (make joke) “make a joke”
照相 zhao-xiang (shine image) “take a picture”


Verb-object compounds

别 开玩笑 !
do not joke
“Do not joke!”

开 他 的 玩笑 。
make he DE joke
“Make fun of him.”


Let’s try one

她 很 担心 孩子 的 健康 成长

Test type     Test               Test result  Prediction
Phonological  Bound morpheme     Yes?         One word
              Syllable count     Yes          One word
Syntactic     Insertion          No           One word
              XP substitution    No           One word
Semantic      Non-compositional  Yes          One word
Others        Productive         N/A          N/A
              Frequency          N/A          N/A


But …

• 担心 : 她 为 孩子 担 心

Test type     Test                     Test result  Prediction
Phonological  Bound morphemes?         No           Two words
              Syllable count           Yes          One word
Syntactic     Insertion                Yes          Two words
              XP substitution          Yes          Two words
Semantic      Non-compositional?       Yes          One word
Others        Productive?              N/A          N/A
              Frequent co-occurrence?  N/A          N/A
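Tallying the predictions makes the conflict explicit. The sketch below copies the predictions for 担心 in the two sentences from the tables above, with `None` standing in for N/A; the dictionary layout and the function name `tally` are illustrative choices, and a simple vote count is not offered here as a way to settle wordhood.

```python
from collections import Counter

# Number of words each test predicts for 担心; None stands for "N/A".
PREDICTIONS = {
    "她 很 担心 孩子 的 健康 成长": {
        "bound morpheme": 1,
        "syllable count": 1,
        "insertion": 1,
        "XP substitution": 1,
        "non-compositional": 1,
        "productive": None,
        "frequency": None,
    },
    "她 为 孩子 担 心": {
        "bound morpheme": 2,
        "syllable count": 1,
        "insertion": 2,
        "XP substitution": 2,
        "non-compositional": 1,
        "productive": None,
        "frequency": None,
    },
}

def tally(predictions):
    """Count how many applicable tests predict one word and how many predict two."""
    return Counter(value for value in predictions.values() if value is not None)

for sentence, predictions in PREDICTIONS.items():
    print(sentence, dict(tally(predictions)))
```

In the first sentence every applicable test predicts one word; in the second the tests split three to two.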


Summary

• Wordhood has to be decided in context
• When wordhood tests lead to conflicting predictions, decisions will have to be made based on what the annotated corpus is for.


Discussion question

• Based on word criteria we have discussed, is “make headway” one word or two words?


POS-tagging: throwing words into different buckets…

• Each category is a bucket
• How many buckets are there?
    Noun, Verb, Adjective, Preposition, Adverb
• Which bucket should “five”, “the”, “$” go into?


Penn Treebank Tagsets (buckets)

• CC - coordinating conjunction: and, but
• CD - cardinal number: one, two, three
• DT - determiner: a, the, this, that
• EX - existential there
• FW - foreign word
• IN - preposition or subordinating conjunction
• LS - list marker: firstly, secondly
• TO - to
• UH - interjection: uh, oh


CC or DT

• Neither/?? he nor/CC she likes skiing.
• Neither/?? man likes skiing.

• Either/?? Jean or/CC Mary likes singing.
• Either/?? girl likes singing.

• Both/?? Jack and/CC Tom hate singing.
• Both/?? men hate singing.


CC or DT

• Neither/CC he nor/CC she likes skiing.
• Neither/DT man likes skiing.

• Either/CC Jean or/CC Mary likes singing.
• Either/DT girl likes singing.

• Both/CC Jack and/CC Tom hate singing.
• Both/DT men hate singing.
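The pattern in these pairs is that the first word is tagged CC when a matching coordinator follows and DT when it directly modifies a noun. The heuristic below encodes just that observation. It is a deliberately crude sketch with made-up names, not a rule from the Penn Treebank guidelines: any later "and" or "or" in the sentence would trigger the CC reading, even one that belongs to a different phrase.

```python
# Coordinators that can pair with each word; "or" is accepted after
# "neither" as well as "nor".
PARTNERS = {"either": {"or"}, "neither": {"nor", "or"}, "both": {"and"}}

def tag_correlative(tokens, index):
    """Return 'CC' if tokens[index] has a matching coordinator later on, else 'DT'."""
    word = tokens[index].lower()
    later = {token.lower() for token in tokens[index + 1:]}
    return "CC" if PARTNERS[word] & later else "DT"

print(tag_correlative("Either Jean or Mary likes singing .".split(), 0))
print(tag_correlative("Both men hate singing .".split(), 0))
```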


CD or NN

• One/?? of the best reasons
• The only one/?? of its kind
• The only ones/?? of its kind


CD or NN

• One/CD of the best reasons
• The only one/NN of its kind
• The only ones/NN of its kind


EX or RB

• There/?? was a party in progress.
• There/?? ensued a melee.
• There/?? , a party was in progress.
• There/?? , ensued a melee.


EX or RB

• There/EX was a party in progress.
• There/EX ensued a melee.
• There/RB , a party was in progress.
• There/RB , ensued a melee.


The role of context in POS tagging

• Can we take a list of all the words in a language and decide which bucket each word should go into, without looking at the context in which the word occurs?

• water, can, drops
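A context-free word list runs into trouble as soon as a word fits more than one bucket. The tiny lexicon below is hand-written for illustration and is not taken from any corpus; it lists the Penn Treebank tags each of the three words can carry and picks out the ambiguous entries.

```python
# Hypothetical hand-written lexicon: word -> possible Penn Treebank tags.
LEXICON = {
    "water": {"NN", "VB"},      # "the water" / "to water the plant"
    "can": {"MD", "NN", "VB"},  # "can go" / "a can" / "to can fruit"
    "drops": {"NNS", "VBZ"},    # "two drops" / "it drops"
    "the": {"DT"},
}

def ambiguous_words(lexicon):
    """Return the words that a context-free lookup cannot tag uniquely."""
    return sorted(word for word, tags in lexicon.items() if len(tags) > 1)

print(ambiguous_words(LEXICON))
```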


Categorizing context

• Morphological

• Syntactic

• Semantic


Morphological context

• Inflectional morphology
    Verb: destroy, destroying, destroyed
    Noun: destruction, destructions
    He watered the plant.

• Derivational morphology
    Noun: destruction
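Morphological clues of this kind can be turned into a simple guesser. The suffix rules below cover only the endings seen on this slide and are an illustrative assumption; they misfire on words such as "bed" or "thing", which is one reason morphology alone is not enough.

```python
def guess_category(word):
    """Guess Noun or Verb from the word's ending; None when no clue applies."""
    if word.endswith(("tion", "tions")):
        return "Noun"  # derivational suffix -tion, optionally pluralized
    if word.endswith(("ing", "ed")):
        return "Verb"  # inflectional suffixes -ing and -ed
    return None

for word in ["destroying", "destroyed", "destruction", "watered", "destroy"]:
    print(word, guess_category(word))
```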


Syntactic context

• Verb: The bomb destroyed the building. He decided to water the plant.

• Noun: The destruction of the building


Semantic context

• Verb: action, activity
• Noun: state, object, etc.


What do we have in Chinese?

• Morphological clues: not as much
• Syntactic clues: not as rich, but exist
• Semantic clues: about the same


Syntactic clues

• Impoverished, but exist:
    这 座 大楼 的 倒塌
    this CL building DE collapse
    “the collapse of this building”

    这 座 大楼 看起来 要 倒塌
    this CL building seem will collapse
    “It looks like this building will collapse.”


Semantic clues

• Same as English:
    Noun: state, object, etc.
    Verb: action, activity, etc.


When syntactic and semantic clues are in conflict

这 座 大楼 的 倒塌
this CL building DE collapse
“the collapse of this building”

Option 1: 倒塌 is a verb regardless of its context
Option 2: 倒塌 can be a noun or a verb depending on its context
The Chinese Treebank decision: option 2

A POS tag based on syntactic clues encodes not only the word’s own lexical properties, but also information provided by its context

“Context-free” POS tags are no better than a dictionary
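Option 2 can be pictured as a tagger that looks at the neighboring word. The one-rule sketch below is an assumption made for illustration, not the Chinese Treebank's actual procedure: it tags a word NN when it directly follows 的 (DE) and VV otherwise, using the Chinese Treebank tag names for common noun and verb.

```python
def tag_in_context(tokens, index):
    """Tag tokens[index] as NN after 的 (DE) and as VV otherwise."""
    previous = tokens[index - 1] if index > 0 else None
    return "NN" if previous == "的" else "VV"

nominal = "这 座 大楼 的 倒塌".split()
verbal = "这 座 大楼 看起来 要 倒塌".split()

print(tag_in_context(nominal, nominal.index("倒塌")))
print(tag_in_context(verbal, verbal.index("倒塌")))
```

The same word form receives two different tags, which is exactly the information a context-free dictionary lookup would lose.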


Online references

• Chinese Treebank: www.cis.upenn.edu/~chinese
• Sproat, Richard. 2002. Coling tutorial: www.linguistics.uiuc.edu/rws
• Penn Treebank: www.cis.upenn.edu/~treebank/home.html