Feb 06, 2016
Using corpora to study classifiers in Mandarin Chinese
Richard [email protected]
08/12/2006, Berlin COST Action A31 WG1 Meeting 2
Chinese corpus linguistics
• In relation to English, Chinese has a much shorter history of using corpora
  – Sinica Balanced Corpus of Chinese
    • The first annotated corpus of Mandarin
    • Freely accessible online since the mid-1990s
• Rapid progress over the last decade
  – Corpus building and exploration technology
  – Publicly available corpus resources
Chinese text processing
• Computational processing of Chinese text is more complex than that of English
• Chinese text is encoded in double-byte native encodings
  – Potential confusion of bytes in running text
  – GB2312 for Simplified Chinese (SC) and Big5 for Traditional Chinese (TC)
  – The advent of Unicode has facilitated Chinese computing
    • But most existing data and tools are based on native encodings
• Word tokenization is an essential first step in serious Chinese computing
  – Defining legitimate “words” in running text
  – Involving dictionary matching and the use of statistical models
• Part-of-speech tagging depends on the results of tokenization
  – Accuracy of tokenization: 98%
  – Accuracy of POS tagging: 96%
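The dictionary-matching step mentioned above can be illustrated with forward maximum matching, one common baseline for Chinese word segmentation. This is only a minimal sketch: the slides do not say which tokenizer was used, and the function name `fmm_tokenize`, the `max_len` parameter, and the toy `LEXICON` are all assumptions made for illustration.

```python
def fmm_tokenize(text, lexicon, max_len=4):
    """Segment text by forward maximum matching against a word list."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (hypothetical); a real system would load a large dictionary
# and usually combine matching with a statistical model.
LEXICON = {"给", "了", "他", "一", "把", "手枪"}

print(fmm_tokenize("给了他一把手枪", LEXICON))
```

Pure dictionary matching cannot resolve genuinely ambiguous strings, which is one reason statistical models are used alongside it.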
Concordancers for Chinese
• Many concordancers designed for English do not work well with Chinese data
• There are presently three types of tools for Chinese
  – Unicode-based tools
    • WordSmith version 4 (commercial product)
    • Xaira (open source freeware)
  – Concordancers dependent on language support packs (or in WinXP, default non-Unicode font set as Chinese)
    • AntConc (freeware)
    • ConcApp (freeware)
    • MonoConc Pro (commercial product)
    • Concordance (shareware)
  – Web-based query systems bundled with specific online corpora
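For readers without one of these tools, the core of a concordancer (a keyword-in-context display) can be sketched over already tokenized text. This is an illustrative sketch only, not how any of the listed products work; the function name `kwic` and the sample sentence are assumptions. Python 3 strings are Unicode, so no language support pack is involved.

```python
def kwic(tokens, node, span=5):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((" ".join(left), tok, " ".join(right)))
    return lines


# Hypothetical tokenized sentence: 'he bought three CL book and one CL magazine'
tokens = ["他", "买", "了", "三", "本", "书", "和", "一", "本", "杂志"]

for left, node, right in kwic(tokens, "本", span=2):
    print(f"{left}\t[{node}]\t{right}")
```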
Chinese corpus resources
• Sinica Balanced Corpus
  – http://www.sinica.edu.tw/SinicaCorpus/
• Sinica Tagged Corpus of Early Mandarin
  – http://www.sinica.edu.tw/Early_Mandarin/
• Modern Chinese Language Corpus
  – http://219.238.40.213:8080/CpsQrySv.srf
• PKU-CCL Chinese Corpus
  – http://ccl.pku.edu.cn/YuLiao_Contents.Asp
• BLCU Modern Chinese Corpus
  – http://202.112.195.8:8089/ccir_login?input=*
• Chinese Internet Corpus
  – http://corpus.leeds.ac.uk/query-zh.html
• Lancaster Corpus of Mandarin Chinese
  – http://www.ling.lancs.ac.uk/corplang/lcmc/
• Lancaster Los Angeles Spoken Chinese Corpus
  – http://www.ling.lancs.ac.uk/corplang/llscc/
• More details of more corpora in more languages are on the handout
Lancaster Corpus of Mandarin Chinese (LCMC)
• Designed as a Chinese match for FLOB and Frown
• Representing written Mandarin as used in mainland China in the early 1990s
• A balanced corpus of one million words in 500 samples proportionally taken from 15 text categories
• Marked up in XML and encoded in Unicode
• Tokenized and POS tagged
• Freely searchable online
  – http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
• Released by ELRA and OTA free of charge for academic and educational purposes
• An indexed version for use with Xaira is available
• V1.2 incorporates validated details of classifier use
Lancaster Los Angeles Spoken Chinese Corpus (LLSCC)
• One million words of spoken Mandarin
• Both dialogues (55%) and monologues (45%)
• Both spontaneous (57%) and scripted (43%) speech
• Seven spoken registers
  – Face-to-face conversation, telephone conversation, play/movie scripts, TV talk show transcripts, formal debates, spontaneous oral narrative, edited oral narrative
• Marked up in XML and encoded in Unicode
• Tokenized and POS tagged
  – The Telephone Conversation part is tagged with details of classifier use
  – The unannotated version of this part is available from the LDC as CallHome Mandarin Transcripts
• More information
  – http://www.ling.lancs.ac.uk/corplang/llscc/
Annotation scheme for classifiers (q)
Tag Gloss
qu Unit classifier
ql Collective classifier
qa Arrangement classifier
qc Container classifier
qm Standard measure
qs Species classifier
qt Temporal classifier
qv Verbal classifier
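The tag table above maps directly onto a lookup structure. The sketch below assumes the corpus has already been read into (word, tag) pairs; the helper name `count_classifier_types` and the non-classifier tags in the sample (`m`, `n`, `v`, `u`) are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

# Tag-to-gloss mapping copied from the annotation scheme table
CLASSIFIER_TAGS = {
    "qu": "Unit classifier",
    "ql": "Collective classifier",
    "qa": "Arrangement classifier",
    "qc": "Container classifier",
    "qm": "Standard measure",
    "qs": "Species classifier",
    "qt": "Temporal classifier",
    "qv": "Verbal classifier",
}


def count_classifier_types(tagged_tokens):
    """Count classifier subtypes in a sequence of (word, tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        if tag in CLASSIFIER_TAGS:
            counts[CLASSIFIER_TAGS[tag]] += 1
    return counts


# Hypothetical tagged sample: 'three CL book', 'went twice'
sample = [("三", "m"), ("本", "qu"), ("书", "n"),
          ("去", "v"), ("了", "u"), ("两", "m"), ("次", "qv")]

print(count_classifier_types(sample))
```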
Why classifiers are necessary (1)
• Grammatically mandatory
    san ben shu        *san shu
    three CL book      three book
    three books        three books
• Distinguishing between word senses
    yi tiao xian       yi gen xian
    one CL line        one CL thread
    a line             a thread
Why classifiers are necessary (2)
• Resolving syntactic ambiguity
  – Example A)
      Ho laozong gei-le ta yi-ba shouqiang
      Ho general give-Asp him one-CL pistol
      General Ho gave him a pistol.
  – Example B)
      Ho laozong gei-le ta yi shouqiang
      Ho general give-Asp him one pistol (CL)
      General Ho shot him once with a pistol.
Use and name of classifiers
• The use of “classifiers” dates back over 3,300 years
  – Oracle bone inscriptions excavated from the Yin Ruins (1300-1100 B.C.)
• Classifiers became established as a separate word class in Chinese only in the 1950s
  – Ding et al (1952): A Talk on Grammar in Modern Chinese
• Different terms had been used for classifiers
  – But they were mainly treated as a subclass of nouns
Syntactic features of classifiers
• Classifiers were the last of the 11 word classes in Chinese to be established, because they cannot be used independently as sentential constituents
• Typically following a numeral or a demonstrative pronoun: zhe (这) ‘this’, na (那) ‘that’, or na (哪) ‘which’
• Monosyllabic classifiers can be reduplicated to function as different sentential constituents, expressing a general grammatical meaning with different situational variants (Guo 1999)
  – Co-existence or repetition of entities or events
    • “All around”, “many”, “one by one”, “continuous”
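These syntactic contexts can be looked for mechanically in a POS-tagged corpus. The sketch below is illustrative only: it assumes tokens are (word, tag) pairs, that numerals carry a tag `m`, that classifier tags begin with `q`, and that a reduplicated classifier is tokenized as two identical tokens. None of these assumptions is confirmed by the slides.

```python
DEMONSTRATIVES = {"这", "那", "哪"}  # zhe 'this', na 'that', na 'which'


def classifier_contexts(tagged_tokens):
    """Label each classifier by the context in which it occurs."""
    results = []
    i = 0
    while i < len(tagged_tokens):
        word, tag = tagged_tokens[i]
        if not tag.startswith("q"):
            i += 1
            continue
        # Reduplicated monosyllabic classifier, e.g. two adjacent 个 tokens
        if (len(word) == 1 and i + 1 < len(tagged_tokens)
                and tagged_tokens[i + 1] == (word, tag)):
            results.append((word * 2, "reduplicated"))
            i += 2
            continue
        prev_word, prev_tag = tagged_tokens[i - 1] if i > 0 else ("", "")
        if prev_tag == "m":
            context = "numeral"
        elif prev_word in DEMONSTRATIVES:
            context = "demonstrative"
        else:
            context = "other"
        results.append((word, context))
        i += 1
    return results


# Hypothetical tagged sample: 'three CL book', 'this CL road', 'CL-CL'
sample = [("三", "m"), ("本", "qu"), ("书", "n"),
          ("这", "r"), ("条", "qu"), ("路", "n"),
          ("个", "qu"), ("个", "qu")]

print(classifier_contexts(sample))
```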
Levels of grammaticalization
• Specialised classifiers
  – Fully grammaticalized
  – Functioning as classifiers only
  – Bleaching of lexical meaning, difficult to find translation equivalents in a non-classifier language
  – E.g. (n) 个, 件, 块, 颗, 辆, 枚, 匹, 幢; (v) 次, 遍, 场, 顿, 番, 回, 通, 趟, 下, 阵
• Concurrent classifiers
  – Mainly derived from nouns and verbs
  – Can be used as nouns/verbs and classifiers
  – The classifier use is semantically related to the lexical meaning of the original noun/verb
  – E.g. 口, 头, 台; 瓶, 碗; 包, 封, 卷, 捆, 束
• Temporary borrowings
  – Mainly borrowed from nouns, verbs, and adjectives
  – Functioning as classifiers only on an ad hoc basis
  – Full lexical meaning
  – E.g. 脸 (face), 屋子 (house); 刀 (knife), 枪 (gun), 脚 (foot), 拳 (fist)
Semantic types of classifiers (1)
• Nominal classifiers (6 types): quantifying nouns
  – Unit classifiers
    • Count individual entities
    • E.g. 个 (63.5% of unit classifiers, 38.8% of all classifiers), 位, 条, 张, 名, 件, 句, 家, 项, 封, 只, 片, 步, 块, 部, 份, 座, 届, 口, 支
  – Collective classifiers
    • Provide a collective reference for separate entities
    • E.g. 套 ‘set’, 批 ‘batch’, 双 ‘pair’, 系列 ‘series’, 副 ‘pair’, 群 ‘group’, 代 ‘generation’, 组 ‘group’, 对 ‘pair’, 队 ‘team’
  – Arrangement classifiers
    • Also refer to a collection, but focus on the constellation aspect (shape), i.e. how entities are arranged or grouped together
    • E.g. 层 ‘layer’, 堆 ‘pile’, 团 ‘ball’, 沓 ‘pad’, 串 ‘string’, 丝 ‘thread’, 排 ‘row’, 把 ‘handful’, 滴 ‘drop’, 束 ‘bunch’, 缕 ‘thread’, 行 ‘row’
Semantic types of classifiers (2)
• Nominal classifiers: quantifying nouns
  – Standard measure classifiers
    • Express exact measures of various kinds, in local or international units
    • E.g. 元, 块, 米, 吨, 克, 美元, 里, 厘米, 亩, 度, 平方米, 斤, 公里, 公斤, 分, 尺, 升, 丈, ℃
  – Container classifiers
    • Denote types of containers, which are borrowed temporarily to provide an inexact measure of mass or entities usually associated with such containers
    • E.g. 杯, 碗, 盒, 袋, 桶, 脸, 瓶, 壶, 盆, 盘, 锅, 瓢, 箱, 筐, 包, 匙, 罐, 腔, 坛, 锹, 盅, 车, 斗, 肚子
    • Special container classifiers can only take yi ‘one’ in the sense of ‘full’, and are more descriptive than quantifying
  – Species classifiers
    • Denote the type of entities grouped together
    • E.g. 种 (kind, over 90%), 类 (sort), 级 (grade), 样 (type), 等 (grade), 品 (class)
Semantic types of classifiers (3)
• Verbal classifiers: quantifying verbs
  – 9 specialised verbal classifiers
    • E.g. 次 (times, 40.8% of all verbal classifiers), 下 (stroke), 场 (course of action), 番 (once over), 阵 (step of action), 趟 (return journey), 回 (times), 遍 (once through), 顿 (criticising, abusing)
  – Borrowed verbal classifiers
    • An open set, mostly nouns denoting tools and related items
    • E.g. 声, 眼, 口, 刀, 脚, 拳, 巴掌, 枪, 棒
• Temporal classifiers: measuring time
  – Exact measures
    • 年, 天, 岁, 分钟, 小时, 夜, 周, 周年, 日, 周岁, 月, 载, 星期, 昼夜, 刻, 宿, 宵, 礼拜, 旬
  – Inexact measures
    • E.g. 会儿, 段, 辈子, 阵子, 会, 阵, 瞬间
Classifiers in writing and speech
• Unit classifiers are by far the most common, in both speech and writing
• Because of the weight of the generalised classifier ge, unit classifiers are particularly frequent in speech
• Other common types: temporal, verbal
• Infrequent types: container, arrangement, collective
[Figure: classifier frequency per 100,000 tokens by classifier type (Arrangement, Container, Collective, Std measure, Species, Temporal, Unit, Verbal), LCMC vs CallHome]
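Because the written and spoken samples differ in size, raw counts have to be normalised to a common base before they can be compared, here per 100,000 tokens. A minimal sketch of that arithmetic follows; the function name `per_100k` and the counts are hypothetical and are not figures from LCMC or CallHome.

```python
def per_100k(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per 100,000 tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 100_000 / corpus_size


# Hypothetical counts for a large written and a smaller spoken sample
written = per_100k(raw_count=15_000, corpus_size=1_000_000)
spoken = per_100k(raw_count=6_000, corpus_size=300_000)

print(written, spoken)
```

In this made-up case the spoken sample has the lower raw count but the higher normalised rate, which is why the comparison uses the normalised figure.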
Variation across genres
• Apart from the speech-writing difference, genres also differ in classifier use
• Most frequent in news reportage (A), humour (R), and speech (S): over 3K per 100K words
• Least common in news review (B), news editorial (C), religious writing (D), and academic prose (J): below 2K per 100K words
• Generally more common in imaginative writing (K-R) and speech (S) than in informative writing (A-J)
[Figure: classifier frequency per 100K words by genre (A-S)]
Distribution of classifier types
[Figure: proportion of classifier types (Verbal, Unit, Temporal, Species, Std measure, Collective, Container, Arrangement) by genre (A-S)]
• The distribution of different types of classifiers also varies across genres
• Unit classifiers are the most common type in all genres (2/3 of all classifiers)
• Container, arrangement, and collective classifiers are relatively rare in all genres
• Std measure classifiers are most frequent in news reportage (A) and official documents (H)
• Species classifiers are more common in informative than in imaginative writing
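A distribution of this kind shows each type's share of all classifiers within a genre, rather than its rate per 100K words, so a genre with few classifiers overall can still show a large share for one type. The sketch below illustrates the calculation; the function name `type_proportions` and all counts are hypothetical.

```python
def type_proportions(counts):
    """Convert raw counts per classifier type into proportions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


# Hypothetical counts for two genres (not taken from the corpus)
genre_counts = {
    "A": {"Unit": 600, "Std measure": 200, "Temporal": 120, "Verbal": 80},
    "K": {"Unit": 700, "Std measure": 20, "Temporal": 130, "Verbal": 150},
}

for genre, counts in genre_counts.items():
    shares = type_proportions(counts)
    print(genre, {name: round(share, 2) for name, share in shares.items()})
```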
Cognitive basis of classifier use
• Allan (1977): number of dimensions
• Adams and Conklin (1973): elasticity, hardness, discreteness
• Shi (2001): ratio between different dimensions, and materiality
• Dimensions and use of classifiers
  – 0-D: point, e.g. yi dian (点) mo ‘a point of ink’
  – 1-D: line, e.g. yi xian (线) xiwang ‘a thread of hope’
  – 2-D: area (Y being the longer dimension)
    • Y/X>>1 -> zhang (张): e.g. yi zhang zhaopian ‘a photo’
    • Y/X>>0 -> tiao (条): e.g. yi tiao malu ‘a road’
  – 3-D: block (Q=Y/X)
    • Z/Q >> 0 -> pian (片): e.g. yi pian shuye ‘a leaf’
    • Z/Q >> 1 -> kuai (块): e.g. yi kuai tang ‘a lump of sugar’
    • Z/Q >> sufficiently large -> gen (根): e.g. yi gen dianxian ‘a cable’
• While the use of nominal classifiers is closely associated with shape, this is not the only criterion by which nouns and classifiers co-select each other
  – Five co-selection criteria
Co-selection by similarity
• Classifiers are closely related to shapes which are historically associated with the nouns that have given rise to these classifiers, e.g. tiao (条)
  – tiao (条): ‘small branch/twig’ -> ‘long, narrow, flexible’: jie (街) ‘street’, tui (腿) ‘leg’, lu (路) ‘road’, xian (线) ‘line; thread’, he (河) ‘river’, yu (鱼) ‘fish’, etc; ‘bamboo slips for writing’ -> guiding (规定) ‘regulation’, jianyi (建议) ‘suggestion’, falu (法律) ‘law’, xinwen (新闻) ‘news’, etc
  – kuai (块): ‘soil lump/block’ -> something of a lumpy/blocky shape, e.g. a wrist watch; ‘territory soil’ -> something with a boundary, e.g. a scar
Co-selection by metonymy
• The original lexical meanings of classifiers refer to the most salient features of the entities being classified, e.g.
  – kou (口) ‘mouth’ (for pigs), tou (头) ‘head’ (for cattle), wei (尾) ‘tail’ (for fish), ding (顶) ‘top’ (for hats, sedan chairs etc)
• BUT long-term linguistic conventions are always important in language use
  – *tou: rabbit, cat
  – *wei: peacock, squirrel
Co-selection by relatedness
• The original lexical meanings of classifiers refer to actions closely related to the entities being classified, e.g.
  – bao (包) ‘wrap -> pack (result of packing)’
  – chuan (串) ‘string together -> string, bunch’
  – kun (捆) ‘tie up, fasten -> bundle’
  – peng (捧) ‘hold in both hands -> a double handful’
Co-selection by association
• The original lexical meanings of classifiers refer to tools, containers, and places, etc closely associated with the entities being classified, e.g.
  – dao (刀) ‘knife -> a cut of (meat)’
  – wan (碗) ‘bowl -> a bowl of (rice)’
  – chuang (床) ‘bed -> a bed of (quilt/sheet etc)’
  – mu (幕) ‘curtain -> an act of (play)’
Co-selection by conventions
• Sometimes, co-selection has to be explained by linguistic convention, because it is not always possible to trace the grammaticalization path of a classifier and so ascertain the relationship between its original lexical meaning and the entities being classified
  – In what way is tiao historically related to renming ‘human life’?
  – Why is tou used for pigs and cattle but not rabbits or cats?
  – Why is wei used for fish but not for peacocks or squirrels, even though their tails are as salient as, if not more salient than, those of fish?
• Such missing links have to be accounted for by linguistic conventions of the speech community
Collocates
• Let’s now have a look at the noun collocates of some common classifiers in Chinese to see how well the proposed co-selection criteria work
• Defining collocates (in 2 million words)
  – Window span of L5-R5
  – z>3.0
  – Minimum co-occurrence frequency of 5
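The three criteria above can be sketched as a small collocation routine. The slides do not state which z-score formula or tool produced the figures that follow, so this sketch assumes the Berry-Rogghe formulation, z = (O - E) / sqrt(E(1 - p)), with p = F(collocate) / (N - F(node)) and E = p × F(node) × span. The function name `collocates` and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter


def collocates(tokens, node, left=5, right=5, min_z=3.0, min_freq=5):
    """Rank collocates of `node` by z-score within a left/right token window."""
    n = len(tokens)
    freq = Counter(tokens)
    node_freq = freq[node]
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        observed.update(w for w in window if w != node)
    span = left + right
    results = []
    for word, obs in observed.items():
        if obs < min_freq:
            continue  # minimum co-occurrence frequency
        p = freq[word] / (n - node_freq)
        if p >= 1:
            continue  # degenerate case: z-score undefined
        expected = p * node_freq * span
        z = (obs - expected) / math.sqrt(expected * (1 - p))
        if z > min_z:
            results.append((word, obs, round(z, 1)))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Toy corpus (hypothetical): six 张 + 纸 pairs separated by unique filler tokens
toy = []
for k in range(6):
    toy += ["张", "纸"] + [f"w{k}_{j}" for j in range(200)]

print(collocates(toy, "张"))
```

Overlapping windows count a collocate once per node occurrence, which is one of several design choices on which real concordancers differ.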
Collocates of zhang (张)
Collocate Gloss Frequency z-score
牌 playing card 64 85.9
纸条 notepaper 9 49.7
支票 cheque 6 40.4
照片 photo 7 26.6
票 ticket 10 21.5
纸 paper 13 21.2
脸 (thick/thin) face, cheek 12 17.2
皮 skin/leather 7 17.2
画 drawing 6 9.5
床 (prototypical) bed 6 9.3
Collocates of tiao (条) - 1
Collocate Gloss Frequency z-score
规定 stipulation 51 41.9
条例 regulation 11 26.1
街 street 11 25.4
腿 leg 14 22.5
车道 (traffic) lane 6 20.4
路 road 23 19.7
直线 straight line 6 19.4
河 river 6 11.7
Collocates of tiao (条) - 2
Collocate Gloss Frequency z-score
指令 instruction 6 10.7
建议 suggestion 7 9.2
鱼 fish 9 8.8
线 line; thread 7 7.4
原则 principle 8 7.0
意见 comment 6 5.3
新闻 news 6 4.3
Collocates of kuai (块)
Collocate Gloss Frequency z-score
平地 level ground 6 59.0
石头 stone 11 51.1
布 cloth 6 23.0
地 land, field 9 3.2
Collocates of ge (个)
• Generalised classifier ge (个): a bamboo (竹) split into halves, initially used as a counter for bamboos and arrows. When a bamboo chip is used for counting, it becomes a symbol of the entity being counted. In other words, the entity loses its shape, colour, function or any other attribute and becomes a unit of counting, ge.
• Ge can be used for any noun (people or things, large or small) that does not have a specific classifier, and it can replace the specific classifiers of many nouns.
• A total of 115 noun collocates
• 29 refer to human beings, 86 to non-human entities
• 66 refer to concrete entities, 49 to abstract entities
  – 12 related to time
• Top 20 noun collocates (z>8.8, F>5, in order of z-score)
  – 月 ‘month’, 星期 ‘week’, 人 ‘person’, 小时 ‘hour’, 电话 ‘phone call’, 礼拜 ‘week’, 字 ‘character’, 百分点 ‘percentage point’, 地方 ‘place’, 角落 ‘corner’, 项目 ‘project’, 钟头 ‘hour’, 问题 ‘problem, question’, 电饭锅 ‘rice cooker’, 女人 ‘woman’, 字儿 ‘character’, 例子 ‘example’, 盒子 ‘box’, 照相机 ‘camera’, 东西 ‘stuff’
Classifiers for dongxi (东西)
• A noun with a rather general and vague referent; it can refer to anything except a human being
  – It is an insult to say someone is a dongxi, or is not a dongxi
• The vagueness in reference makes it possible to use a nominal classifier of any type for dongxi
• Unit classifier
  – (General) ge (个), jian (件) ‘piece’, fen (份) ‘portion’
  – (Shape) tiao (条), zhang (张), and kuai (块)
  – (Book/paper) ben (本, for books), pian (篇, for a piece of writing)
• Collective classifier
  – tao (套) ‘set’
• Arrangement classifier
  – dui (堆) ‘pile’
• Container classifier
  – xiangzi (箱子) ‘box’, bao (包) ‘pack’
• Standard measure classifier
  – dun (吨) ‘ton’
• Species classifier
  – yang (样) ‘type’, zhong (种) ‘kind’, lei (类) ‘class’
Variations
• Not all instances of classifier use are in line with these co-selection criteria
  – Regional variation
    • dao (刀) ‘knife’
      – Mandarin: yi-ba (把) dao ‘a knife’
      – Cantonese: yi-zhang (张) dao ‘a knife’
    • niu (牛) ‘cattle’
      – Mandarin: yi-tou (头) niu ‘a cow’
      – Wu: yi-zhi (只) niu ‘a cow’
    • ren (人) ‘person’
      – Mandarin: yi-ge (个) ren
      – Fuzhou: yi-zhi (只) ren
  – Unconventional, creative use of classifiers, often found in literary works
  – Diachronic variation
Thank you!