Scientometrics, Vol. 24, No. 2 (1992) 201-220

ZIPF'S DATA ON THE FREQUENCY OF CHINESE WORDS REVISITED

R. ROUSSEAU*, QIAOQIAO ZHANG**

* Katholieke Industriële Hogeschool West-Vlaanderen, Zeedijk 101, B-8400 Oostende (Belgium) and

Speciale Licentie Documentatie- en Bibliotheekwetenschap, University of Antwerp, Universiteitsplein 1, B-2610 Wilrijk (Belgium)

** Department of Information Science, The City University, Northampton Square, London EC1V 0HB (UK) and

Department of Sci-Tech Inf., China National Rice Research Institute, Hangzhou (The People's Republic of China)

(Received February 20, 1990)

On the occasion of the 40th anniversary of George Zipf's premature death, we re-analyse his data on the frequency of Chinese words. We find the best fitting Lotka, Zipf, Bradford and Leimkuhler distributions and show that only Lotka's function is not rejected by a Kolmogorov-Smirnov test. Adding an additional term to Leimkuhler's function leads to a statistically acceptable fit. In this way we can determine a core (nucleus) of most frequently used Chinese words.

Introduction

George Kingsley Zipf was born in Freeport, Illinois (USA), on January 7, 1902 and died, only 48 years old, on September 25, 1950. He graduated summa cum laude from Harvard College in 1924, and spent the next year in Germany, studying at Bonn and Berlin. He returned to Harvard and received his Ph.D. in Comparative Philology in 1930. He stayed at Harvard as instructor in German until 1936, assistant professor of German until 1939, and university lecturer until the year of his death. This paper is written to commemorate the 40th anniversary of Zipf's premature death. The data mentioned in this introductory section are taken from the introduction, by George Miller,1 to the reprinted edition of Zipf's book 'The psycho-biology of language: an introduction to dynamic philology'.2

In the informetric and linguistic literature the terms 'Zipf curve' and 'Zipf's equation' express either a relation between the frequency of occurrence of an event and the number of different events occurring with that frequency (see e.g. Nicholls' formulation3), or a relation between the frequency of occurrence of an event and its rank when the events are ordered with respect to the frequency of occurrence (the most frequent one first). Using the terminology of Ref. 4 (see also Ref. 5) we will restrict the usage of the term 'Zipf's equation' or 'Zipf's law' to the second situation. More precisely, when r denotes the rank and f(r) denotes the frequency of the event at rank r, Zipf's equation states that

r · f(r) = C,   (1)

where C is a constant.

The equation

f(r) = C / r^β,   (2)

where β is a positive parameter, will be termed the generalised form of Zipf's equation. The former situation, dealing with the frequency of occurrence of an event (denoted as y) and the number of different events occurring with that frequency (denoted as g(y)), will be termed Lotka's law or Lotka's equation; more precisely, it states that

g(y) = A / y^α,   (3)

where A and α are parameters. The special case with α = 2 is of special historical interest. Moreover, it can be shown that under some conditions Lotka's inverse square law is mathematically equivalent to Zipf's law.6,7
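
As an illustration of the two viewpoints, the following Python fragment (our sketch, not part of the original analysis) computes both the rank-frequency (Zipf) form and the frequency-of-frequency (Lotka) form from an arbitrary list of tokens; the sample sentence is purely hypothetical.

```python
# Our illustration: compute the rank-frequency (Zipf) form and the
# frequency-of-frequency (Lotka) form of a list of tokens.
from collections import Counter

def zipf_and_lotka(tokens):
    """Return the rank-frequency list f(r) and the dict g(y)."""
    counts = Counter(tokens)                      # word -> number of occurrences
    zipf = sorted(counts.values(), reverse=True)  # f(1) >= f(2) >= ...
    lotka = Counter(counts.values())              # y -> number of words occurring y times
    return zipf, dict(lotka)

if __name__ == "__main__":
    sample = "the cat sat on the mat the cat".split()   # hypothetical toy text
    f, g = zipf_and_lotka(sample)
    print(f)   # [3, 2, 1, 1, 1]
    print(g)   # {3: 1, 2: 1, 1: 3}
```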

Zipf tried to explain the observed regularities in the occurrence of words in texts by the principle of least effort. According to this principle, people try to find an equilibrium between uniformity and diversity in the use of words. For example, when someone does not remember the name of an object or a person, he may replace it by 'thing' or 'that man', but when a sentence is cluttered with 'things' and 'men' it soon becomes unintelligible. On the other hand, when a scientist, say an historian, speaks on some topic that occurred during the Middle Ages, and uses the correct terminology, he will, after a short while, have to expand his sentences and add some explanations, or else the audience will again not be able to follow. So a vocabulary should not be too poor or too rich: there must be an equilibrium.

Mandelbrot, however, has shown that the principle of least effort, as formulated by Zipf, is too vague, and that Zipf curves can be explained as the result of a particular stochastic process. In his proof Mandelbrot used the nowadays very popular notion of fractals (see Ref. 8, and part IV of Ref. 5). So with Miller1 we may say that Zipf belongs among those rare but stimulating men whose failures are more fruitful than most men's successes. The authors of this paper, however, would like to add some comments here.

Although several people, such as Wyllis,4 do not hesitate to point out the absence of serious statistical methods in Zipf's original work, we still believe that the stochastic approach as used e.g. by Mandelbrot8 or Hill9,10 does not explain everything. The tendency, observed by Zipf, for words that are in heavy use to be abbreviated or replaced by simpler words is a very real phenomenon. Consider, for example, the chains: automobile - auto - car, or television - tele - TV, or telefacsimile - telefax - fax. The equilibrium hypothesis and the principle of least effort, too, have some intuitive value. So, although the purely mathematical, say, stochastic approach leads to the correct distribution function, and some of Zipf's vague ideas were incorporated in the mathematical derivations, we do not think that mathematics gives the ultimate explanation of the use of words in texts or in spoken language.

The history of Zipf's law

In 1932 the Harvard University Press published 'Selected Studies of the Principle of Relative Frequency in Language', a report written by George K. Zipf.11 In this report he investigated the occurrences of words in Latin (Plautus) and Chinese texts.

To perform the analysis of the Chinese texts Zipf needed the help of two Chinese collaborators: Mr Kan Yu Wang and Mr H. Y. Chang. We may say that Mr Wang and Mr Chang were to Zipf as Mr E. Lancaster Jones was to Bradford.12 As these investigations are among the first studies in quantitative linguistics, this shows that, through these men, China played a pioneering role in the history of quantitative linguistics, and hence also in informetrics.

Zipf and his collaborators analysed twenty fragments of text, each containing 1000 syllables, taken from twenty different sources (see Appendix: Tables 6 and 7). This yielded a corpus of 20,000 syllables.

In this paper we wish to re-analyse Zipf's data. To do this we will use recent advances in informetrics, resulting in a new interpretation of the data. It seems, moreover, that many people have forgotten that Zipf studied not only Western languages, but also Chinese. Indeed, in 1989 Meyer13 wrote that 'until now it was unknown whether Zipf's law is valid for other [than Western] languages, for example for the Chinese one'. Even in China the investigations by Zipf, Wang and Chang are not common knowledge, as witnessed by Miss Xu,14 who wrote that 'Professor Zipf only made a statistic of Indo-European language families'. Moreover, Meyer's statement was reprinted as a Press Digest15 in Current Contents. So, it seems that publishing and re-analysing Zipf's original data will help to set the record straight.

Another remarkable point is the following: in 1932 Zipf only used Lotka's inverse square law (without referring to Lotka,16 who had published his work as early as 1926). Furthermore, he did not perform any statistical test and did not include the most frequently occurring words (occurring more than 50 times), because applying Lotka's law to these higher frequencies would have yielded fractional numbers. (Note that this difficulty is easily resolved by taking a continuous approach.) The main point is that he did not consider the rank-frequency form (although he had mentioned it in Ref. 17). It is only in 'The psycho-biology of language'2 that he wrote, after having used Lotka's inverse square distribution: 'There is, however, another method of viewing and plotting these frequency distributions which is less dependent upon the size of the bulk and which reveals an additional feature. As suggested by a friend, one can consider the words of a vocabulary as ranked in the order of their frequency, e.g. the first most frequent word, the second most frequent, the third most frequent, the five-hundredth most frequent, the thousandth most frequent, etc. We can indicate on the abscissa of a double logarithmic chart the number of the word in the series and on the ordinate its frequency'. Could it be that Zipf did not invent 'Zipf's law', but that a friend showed him the mathematical relation? We have, however, no clue who this mysterious friend might have been. In the works of Zipf we have examined we found only one person whom he addresses as 'my friend', namely R. Y. Chao18 (p. 548). As Professor Chao is also mentioned in Ref. 11, p. 5, the conjecture that this friend is Professor R. Y. Chao is not too far-fetched. Note also that in Refs 11 and 2 there is no reference to Estoup or to Pareto, although Estoup is mentioned on page 3 of Zipf's very first publication on the relative frequency of words.17 Later, in 'Human behavior and the principle of least effort'18 Zipf did refer to Lotka, Estoup and Pareto. Clearly, the relation between Zipf's function and Bradford's law was not yet established at that moment. It is also interesting that besides Estoup, Zipf18 refers in this later work also to Godfrey Dewey19 and to E. V. Condon20 as scientists who have noted the hyperbolic nature of the frequency of word usage. Perhaps it was to one of these works that the mysterious friend drew Zipf's attention. In this context we should mention the work of Petruszewycz,21 who thinks that the mysterious friend could have been Alan N. Holden. The reason for this conjecture is that Holden is mentioned in an earlier work of Zipf and that he worked for the same company as Condon,20 namely the Bell Telephone Company. For further information on the history of Zipf's law we refer the reader to Refs 21 and 22.

An informetric analysis of Zipf's data on Chinese word frequencies: fitting a Lotka and a Zipf function

In this section we will present Zipf's data on Chinese word frequencies (Tables 1 and 2) and we will try to fit Lotka's inverse square law and Zipf's law. We will show (see Table 1) that Lotka's law fits; Zipf's law, however, is rejected by a Kolmogorov-Smirnov test, even on the 1% level (see Table 2). For more information on this test we refer e.g. to Ref. 5, Section 1.3.6.

Table 1 Frequency of occurrence of Chinese words (Zipf,11 1932)

A: number of occurrences
B: frequency of occurrence
C: cumulative frequency of occurrence
D: cumulative relative frequency of occurrence
E: expected cumulative frequency of occurrence (α = 2)
F: expected cumulative relative frequency of occurrence
G: absolute value of differences between columns D and F

A B C D E F G

1 2046 2046 0.6122 2031.70 0.6079 0.0043
2 494 2540 0.7600 2539.62 0.7599 0.0001
3 216 2756 0.8247 2765.37 0.8275 0.0028
4 100 2856 0.8546 2892.33 0.8655 0.0109
5 99 2955 0.8842 2973.61 0.8898 0.0056
6 66 3021 0.9039 3030.06 0.9067 0.0028
7 41 3062 0.9162 3071.50 0.9191 0.0029
8 25 3087 0.9237 3103.25 0.9286 0.0049
9 30 3117 0.9327 3128.35 0.9361 0.0034

10 20 3137 0.9387 3148.67 0.9422 0.0035
11 25 3162 0.9461 3165.44 0.9472 0.0011
12 22 3184 0.9527 3179.55 0.9514 0.0013
13 10 3194 0.9557 3191.58 0.9550 0.0007
14 14 3208 0.9599 3201.94 0.9581 0.0018
15 13 3221 0.9638 3210.96 0.9608 0.0030
16 10 3231 0.9668 3218.91 0.9632 0.0036
17 10 3241 0.9698 3225.93 0.9653 0.0045
18 6 3247 0.9716 3232.22 0.9672 0.0044
19 5 3252 0.9731 3237.83 0.9688 0.0043
20 5 3257 0.9746 3242.91 0.9704 0.0042
21 4 3261 0.9758 3247.52 0.9717 0.0041
22 2 3263 0.9764 3251.73 0.9730 0.0034
23 5 3268 0.9779 3255.54 0.9741 0.0038
26 3 3271 0.9788 3265.33 0.9771 0.0017
28 4 3275 0.9800 3270.72 0.9787 0.0013
29 4 3279 0.9811 3273.12 0.9794 0.0017
30 6 3285 0.9829 3275.39 0.9801 0.0028
32 6 3291 0.9847 3279.50 0.9813 0.0034
33 2 3293 0.9853 3281.34 0.9819 0.0034
34 1 3294 0.9856 3283.11 0.9824 0.0032
35 1 3295 0.9859 3284.78 0.9829 0.0030
36 1 3296 0.9862 3286.36 0.9834 0.0028
37 1 3297 0.9865 3287.83 0.9838 0.0027
38 1 3298 0.9868 3289.23 0.9842 0.0026
41 4 3302 0.9880 3293.04 0.9854 0.0026
43 2 3304 0.9886 3295.31 0.9860 0.0026
44 2 3306 0.9892 3296.35 0.9863 0.0029
45 3 3309 0.9901 3297.35 0.9866 0.0035
46 1 3310 0.9904 3298.32 0.9869 0.0035
47 2 3312 0.9910 3299.22 0.9872 0.0038
50 1 3313 0.9913 3301.76 0.9880 0.0033
52 1 3314 0.9916 3303.30 0.9884 0.0032
55 2 3316 0.9922 3305.41 0.9891 0.0031
57 1 3317 0.9925 3306.68 0.9894 0.0031
58 1 3318 0.9928 3307.28 0.9896 0.0032
60 1 3319 0.9931 3308.41 0.9900 0.0031
66 2 3321 0.9937 3311.45 0.9909 0.0028
68 1 3322 0.9940 3312.36 0.9911 0.0029
72 1 3323 0.9943 3313.96 0.9916 0.0027
73 1 3324 0.9946 3314.36 0.9917 0.0029
75 1 3325 0.9949 3315.10 0.9920 0.0029
78 1 3326 0.9952 3316.13 0.9923 0.0029
81 1 3327 0.9955 3317.07 0.9925 0.0030
83 1 3328 0.9958 3317.67 0.9927 0.0031

101 1 3329 0.9961 3321.98 0.9940 0.0021
102 2 3331 0.9967 3322.18 0.9941 0.0026
105 1 3332 0.9970 3322.75 0.9942 0.0028
109 1 3333 0.9973 3323.45 0.9945 0.0028
118 1 3334 0.9976 3324.86 0.9949 0.0027
131 1 3335 0.9979 3326.56 0.9954 0.0025
142 1 3336 0.9982 3327.73 0.9957 0.0025
146 1 3337 0.9985 3328.13 0.9959 0.0026
147 1 3338 0.9988 3328.23 0.9959 0.0029
156 1 3339 0.9991 3329.03 0.9961 0.0030
246 1 3340 0.9994 3333.75 0.9975 0.0019
290 1 3341 0.9997 3335.02 0.9979 0.0018
905 1 3342 1.0000 3339.66 0.9993 0.0007

The texts analysed by Zipf and his collaborators contained 13,252 words, of which 3342 were different. (Note that Zipf, 11 p. 23, writes 13,248 and 3332, and admits that there is some error.) This means that on the average each word occurred almost four times. Note that the number of words occurring exactly once is slightly higher than expected.

A Kolmogorov-Smirnov test on the 10% level allows a maximum difference between the observed and the expected cumulative relative distribution equal to 0.021 (= 1.22/√3342). The largest difference occurs for y = 4 and is equal to 0.0109. This implies that the words in the Chinese texts analysed by Zipf and his collaborators satisfy Lotka's inverse square law.
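
This check can be reproduced along the following lines. The sketch is ours: it assumes the expected cumulative counts of column E are obtained from the inverse square law normalised by π²/6 (which reproduces e.g. E(1) = 3342/1.6449 = 2031.7), and only the first five rows of Table 1 are typed in.

```python
# Our sketch of the test behind Table 1.  The expected cumulative number of
# words occurring at most y times under Lotka's inverse square law is
#   E(y) = T * (sum_{k=1..y} 1/k^2) / (pi^2/6).
import math

T = 3342            # number of distinct words
# (y, number of words occurring exactly y times): first five rows of Table 1 only
observed = [(1, 2046), (2, 494), (3, 216), (4, 100), (5, 99)]

zeta2 = math.pi ** 2 / 6
d_max, obs_cum = 0.0, 0
for y, n_words in observed:
    obs_cum += n_words
    exp_cum = T * sum(1.0 / k ** 2 for k in range(1, y + 1)) / zeta2
    d_max = max(d_max, abs(obs_cum - exp_cum) / T)

print(round(d_max, 4), round(1.22 / math.sqrt(T), 3))   # 0.0109 versus 0.021
# Over the full table the maximum difference is 0.0109 (at y = 4), below the
# 10% critical value 0.021, so Lotka's inverse square law is not rejected.
```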

To find Zipf's distribution (1) we need only the parameter C. Summing Eq. (1) over all r yields

Σ f(r) = C Σ (1/r),   (4)

where both sums run over r = 1, ..., N, with N = 3342. As the harmonic sum 1 + 1/2 + ... + 1/N is approximately equal to ln(N) + γ (Euler's constant, approximately equal to 0.5772), we find, using Eq. (4): 13252 = C (ln(3342) + 0.5772), or C = 1524.7. If we lower our standards and try a fit on the 1% level, the maximum allowed difference between the observed relative cumulative distribution and the theoretical one is 0.014 (again, using a Kolmogorov-Smirnov test). For r = 1 this difference is already too high, namely (1524.7 - 905)/13252 = 0.047. Putting C = 905 (the observed frequency at rank one) yields e.g. for R(10) a theoretical value of 2651, which must be compared with the observed value of 2390, or a relative difference of 0.0197. This shows that Zipf's distribution cannot be fitted to Zipf's data on the frequency of Chinese words. This is also illustrated by Fig. 1.

Fig. 1. Zipf's data on a double logarithmic scale. Dots: observations (to prevent cluttering the figure, only some selected data points are represented); full line: theoretical distribution according to Zipf's law; dotted line: theoretical distribution according to column E of Table 2
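
The estimate of C via the harmonic sum, and the subsequent check with C = 905, can be redone with a few lines of Python; this is our own sketch and was not part of the original computations.

```python
# Our sketch of the computation of C in Eq. (1) via Eq. (4).
import math

N = 3342        # number of distinct words (ranks)
TOTAL = 13252   # total number of occurrences

C = TOTAL / (math.log(N) + 0.5772)          # harmonic sum ~ ln(N) + gamma
print(round(C, 1))                          # 1524.7, as in the text

# As in the text, taking C equal to the observed top frequency 905 gives
R10 = 905 * sum(1.0 / r for r in range(1, 11))
print(round(R10))                           # ~2651, against the observed R(10) = 2390
```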

Next, we will try to fit the generalised form of Zipf's equation to the data. To do this we take the logarithm of Eq. (2), which results in

ln(f(r)) = ln(C) - β ln(r),   (5)

hence β and ln(C) can be obtained by linear regression. A calculation based on the 67 ranks of Table 2 (the 67 data points given by Zipf) yields β = 0.8124 and ln(C) = 6.624, hence C = 754; column D of Table 2 shows the corresponding predicted cumulative frequencies. The correlation coefficient for this linear equation is -0.99414, which is excellent. Yet, when we turn to Eq. (2) the fit is rejected by a Kolmogorov-Smirnov test. However, when the number of observations is large, the requirements for a good fit according to a K-S test are very severe. Moreover, to go from Eq. (5) to Eq. (2) we have to take exponentials. In this process we also take exponentials of the errors! This explains why the fit to Eq. (2) is rejected.
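
A minimal sketch of the log-log regression behind Eq. (5) is given below. It is ours (the authors' original calculation is not reproduced here), it uses NumPy, and for brevity it lists only the first five (rank, frequency) pairs of Table 2 rather than all 67.

```python
# Our sketch of the regression ln f(r) = ln C - beta * ln r of Eq. (5).
import numpy as np

pairs = [(1, 905), (2, 290), (3, 246), (4, 156), (5, 147)]   # first rows of Table 2 only
r = np.array([p[0] for p in pairs], dtype=float)
f = np.array([p[1] for p in pairs], dtype=float)

slope, intercept = np.polyfit(np.log(r), np.log(f), 1)
beta, C = -slope, float(np.exp(intercept))
print(round(beta, 3), round(C))
# On the full 67 ranks the paper obtains beta = 0.8124 and ln(C) = 6.624 (C = 754),
# with correlation coefficient -0.994.
```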

Table 2 Rank-frequency form of Chinese words (Zipf,11 1932)

A: rank (r)
B: frequency (f(r)) at rank r
C: cumulative frequency at rank r, denoted by R(r)
D: cumulative frequency predicted by the generalised form of Zipf's law, C = 754, β = 0.81
E: cumulative frequency predicted by the generalised form of Zipf's law, C = 689, β = 0.784

A B C D E

1 905 905 754 689
2 290 1195 1183 1089
3 246 1441 1492 1380
4 156 1597 1736 1612
5 147 1744 1940 1807
6 146 1890 2116 1976
7 142 2032 2271 2126
8 131 2163 2410 2261
9 118 2281 2537 2384

10 109 2390 2653 2497
11 105 2495 2760 2602
13 102 2699 2954 2792
14 101 2800 3042 2879
15 83 2883 3126 2961
16 81 2964 3205 3039
17 78 3042 3280 3114
18 75 3117 3352 3185
19 73 3190 3421 3253
20 72 3262 3487 3319
21 68 3330 3551 3382
23 66 3462 3671 3502
24 60 3522 3728 3559
25 58 3580 3783 3614
26 57 3637 2836 3668
28 55 3747 3938 3771
29 52 3799 3987 3820
30 50 3849 4035 3868
32 47 3943 4126 3961
33 46 3989 4170 4005
35 45 4079 4255
36 45 4124 4131
38 44 4212 4375 4212
40 43 4298 4451 4289
42 41 4380 4524
44 41 4462 4434
45 38 4500 4629 4469

46 37 4537 4663 4503
47 36 4573 4696 4537
48 35 4608 4728 4570
49 34 4642 4760 4603
51 33 4708 4822 4667
54 32 4804 4912
57 32 4900 4847
60 30 4990 5080
63 30 5080 5013
65 29 5138 5210
67 29 5196 5116
70 28 5280 5332
71 28 5308 5215
73 26 5360 5402
74 26 5386 5287
77 23 5455 5492
79 23 5501 5401
81 22 5545 5578 5445
83 21 5587 5620
85 21 5629 5531
88 20 5689 5721
90 20 5729 5634
93 19 5786 5817
95 19 5824 5733
98 18 5878 5909

101 18 5932 5846
106 17 6017 6049
111 17 6102 6023
116 16 6182 6213
121 16 6262 6188
127 15 6352 6381
134 15 6457 6388
141 14 6555 6439
148 14 6653 6587
153 13 6718 6735
158 13 6783 6719
169 12 6915 6929
180 12 7047 6991
193 11 7190 7195
205 11 7322 7270
215 10 7422 7415
225 10 7522 7472
240 9 7657 7642
255 9 7792 7753
268 8 7896 7875
280 8 7992 7969

300 7 8132 8122
321 7 8279 8295
354 6 8477 8495
387 6 8675 8752
436 5 8920 8976
486 5 9170 9333
536 4 9370 9476
586 4 9570 9833
694 3 9894 10118
802 3 10218 10723

1049 2 10712 11233
1296 2 11206 12245
2319 1 12229 13645
3342 1 13252 15480

In an attempt to find a better fit for Eq. (5) we used other ranks (see column E of Table 2) and excluded the first one. This gave β = 0.784, C = 689 and a correlation coefficient of -0.995. Unfortunately, this too deviates too much from the observed data. Hence, we conclude that, although the generalised form of Zipf's equation agrees in general terms with the observed data, it does not fit in the statistical sense. One of the problems, namely the large difference between f(1) (= 905) and f(2) (= 290), will be investigated in the next section.

Zipf's data and the Bradford distribution

In view of the relation between Lotka's inverse square law, the Bradford distribution and Leimkuhler's law (see e.g. Refs 5, 6, 23, 24, where it is shown that these laws are 'equivalent'), it is natural to try to fit Leimkuhler's distribution

R(r) = a ln(1 + b r)   (6)

to Zipf's data. Here, as in Table 2, R(r) denotes the cumulative number of occurrences of the first r words. Figure 2 shows the rank-frequency distribution of the data on semi-logarithmic scales. We will also try to find Bradford groups, as is done e.g. in Refs 5, 25, 26.

Fig. 2. Cumulative form of Chinese word frequencies (semi-logarithmic scale). All words with ranks belonging to the left of the square bracket are core words

Applying Egghe's global method,25,5 we are free in our choice of the number of groups (denoted as p). We will take p = 5. The Bradford multiplier k is then equal to

k = (e^γ f(1))^(1/p) = (1.781)^0.2 (905)^0.2 ≈ 4.38.

From this we calculate the number of items in every group, denoted as y0: y0 is equal to 13252 / 5 = 2650.4. The number of sources (different words) in the first Bradford group is then r0 = T(k - 1)/(k^p - 1), where T is the total number of sources (different words). In our case this yields: r0 = (3342)(3.38)/((4.38)^5 - 1) ≈ 7.

This leads to Table 3, which shows that the number of occurrences is not constant. As this number must be a constant for data satisfying Bradford's law, we conclude that Zipf's data do not satisfy Bradford's distribution.

Table 3 Bradford groups for Zipf's data on Chinese word frequencies

Group # words # occurrences

1   r0 = 7        2032
2   r0 k ≈ 31     2180
3   r0 k^2 ≈ 135  2751
4   r0 k^3 ≈ 589  3135
5   r0 k^4 ≈ 2580 3154
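
The construction of the Bradford groups of Table 3 can be summarised in the following sketch. It is ours, and it only reproduces the theoretical group sizes, not the counting of the actual occurrences in the corpus.

```python
# Our sketch of Egghe's global method as applied above: Bradford multiplier
# k = (e^gamma * f(1))^(1/p), items per group y0 = A/p, and
# r0 = T(k-1)/(k^p - 1) sources in the first group.
import math

A, T, f1, p = 13252, 3342, 905, 5
gamma = 0.5772

k = (math.exp(gamma) * f1) ** (1.0 / p)       # ~4.38
y0 = A / p                                    # 2650.4 items per group
r0 = T * (k - 1) / (k ** p - 1)               # ~7 sources in the first group
groups = [round(r0 * k ** i) for i in range(p)]

print(round(k, 2), y0, groups)                # 4.38 2650.4 [7, 31, 135, 589, 2581]
# Counting the actual occurrences falling in these groups gives Table 3; as the
# occurrence totals there are far from constant, Bradford's law is rejected.
```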

From these calculations we find for the parameters a and b in Eq. (6):

a = y0 / ln(k) = 2650.4 / 1.477 = 1794.4,

and b = (k - 1) / r0 = 4.3799 / 7.0124 = 0.6246.

Hence we obtain the following equation:

R(r) = 1794.4 ln(1 + 0.6246 r).

A quick check shows that e.g. R(240) ≈ 9002, which should be compared with the observed value of 7657. Hence, this equation does not fit the data.

We have tried other methods to find R(r), including the use of truncation, or an application of the Weber-Fechner law, i.e. R(r) = w ln(b r). (Among the more exotic attempts we mention a multiple regression on 67 points with ln(r) and (ln(r))^2 as independent variables. This yielded R(r) = 574 + 513.096 ln(r) + 137.7 (ln(r))^2, with a generalised correlation coefficient R^2 equal to 0.9978. This model is, of course, totally unexplained!) The negative results of these attempts prompted us to a closer examination of the data. An obvious feature is the large gap between the first and the second number of occurrences. So, perhaps the first number is 'too high' and consists of two parts: a part described by the Leimkuhler law and an additional part. This leads to the conjecture that the frequency of occurrence of Chinese words can be described by the equation:

R1(r) = c + a ln(1 + b r).   (7)

This equation was proposed in the literature by Asai.27

Using a computer program we found the following least-squares solution of the parameters for Eq. (7): c = 880, a = 2103.47 and b = 0.10202, with a coefficient of determination equal to 0.9997.
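
The computer program used at the time is not described further; as an illustration, a least-squares fit of Eq. (7) can nowadays be obtained e.g. with SciPy. The sketch below is ours: the function name, the starting values p0 and the small subset of rows of Table 4 are our own choices, not part of the original analysis.

```python
# Our sketch of a least-squares fit of Eq. (7), R1(r) = c + a*ln(1 + b*r).
import numpy as np
from scipy.optimize import curve_fit

# (rank r, observed cumulative frequency R(r)): a handful of rows of Table 4 only
rows = [(1, 905), (2, 1195), (3, 1441), (10, 2390), (101, 5932),
        (1296, 11206), (3342, 13252)]
r = np.array([p[0] for p in rows], dtype=float)
R = np.array([p[1] for p in rows], dtype=float)

def shifted_leimkuhler(r, c, a, b):
    return c + a * np.log(1.0 + b * r)

(c, a, b), _ = curve_fit(shifted_leimkuhler, r, R, p0=(900.0, 2000.0, 0.1))
print(round(c), round(a, 2), round(b, 5))
# On the full 67-point data set the paper reports c = 880, a = 2103.47 and
# b = 0.10202, with a coefficient of determination of 0.9997.
```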

Table 4 shows the result of fitting this equation to Zipf's data. A Kolmogorov-Smirnov test on the 1% level allows a maximum value of D (the maximum difference between the observed and theoretical cumulative relative distribution) of 0.0141. On the 5% level this value is 0.0118, and on the 10% level the value is 0.0106. Table 4 shows that Eq. (7) is accepted at the 1% level. Moreover, were it not for the first value, it would have been accepted at the 10% level!

Table 4 Cumulative rank-frequency form of Zipf's data

A: Rank (r)
B: Observed cumulative distribution
C: Theoretical distribution (R1(r) = 880 + 2103.47 ln(1 + 0.10202 r))
D: Absolute value of the difference between the observed and the theoretical relative distribution

A B C D
1 905 1084 0.0135
2 1195 1271 0.0057
3 1441 1442 0.0001
4 1597 1560 0.0028
5 1744 1747 0.0002
6 1890 1885 0.0004
7 2032 2014 0.0014
8 2163 2135 0.0021
9 2281 2250 0.0023

10 2390 2359 0.0023
11 2495 2463 0.0024
13 2699 2656 0.0032
14 2800 2746 0.0041
15 2883 2833 0.0038
16 2964 2916 0.0036
17 3042 2996 0.0035
18 3117 3073 0.0033
19 3190 3147 0.0032
20 3262 3219 0.0032
21 3330 3288 0.0032
23 3462 3421 0.0031
24 3522 3484 0.0029
25 3580 3545 0.0026
26 3637 3605 0.0024
28 3747 3719 0.0021
29 3799 3774 0.0019
30 3849 3828 0.0016
32 3943 3931 0.0009
33 3989 3981 0.0006
36 4124 4123 0.0001
38 4212 4213 0.0001
40 4298 4299 0.0001
44 4462 4462 0.0000
45 4500 4500 0.0000
46 4537 4538 0.0001
47 4573 4576 0.0002
48 4608 4613 0.0004
49 4642 4649 0.0005
51 4708 4719 0.0008

57 4900 4917 0.0013
63 5080 5098 0.0014
67 5196 5210 0.0011
71 5308 5317 0.0007
74 5386 5394 0.0006
79 5501 5516 0.0011
81 5545 5563 0.0014
85 5629 5653 0.0018
90 5729 5761 0.0024
95 5824 5864 0.0030

101 5932 5981 0.0037
111 6102 6163 0.0046
121 6262 6330 0.0051
134 6457 6530 0.0055
148 6653 6725 0.0054
158 6783 6854 0.0054
180 7047 7114 0.0051
205 7322 7374 0.0039
225 7522 7561 0.0029
255 7792 7814 0.0017
280 7992 8004 0.0009
321 8279 8282 0.0002
387 8675 8665 0.0008
486 9170 9133 0.0028
586 9570 9520 0.0038
802 10218 10170 0.0036

1296 11206 11170 0.0027
3342 13252 13153 0.0075

Using the second term in the equation R1(r) = 880 + 2103.47 ln(1 + 0.10202 r) and the method to find a 0.75-nucleus as described in Ref. 28 yields a nucleus of 29 core words (cf. Fig. 2). These words are shown in Table 5.

Conclusion

Following Zipf,11,2 Meyer13 and Xu14 we may say that Chinese word frequencies follow Zipf's distribution as well as Western languages do. The fit is, however, only global (as a general trend) and cannot be corroborated by a statistical test. There seems, moreover, to be a problem with the word (or words) with the highest frequencies (a problem that arises with any language).

Scientoraetrics 24 (1992) 215

Page 16: Zipf's data on the frequency of Chinese words …...Lotka's law to these higher frequencies would have yielded fractional numbers. (Note that this difficulty is easily resolved by

R. ROUSSEAU, QIA O Q IA O ZHANG: FREQUENCY OF CHINESE WORDS

Table 5

Most frequently used Chinese words

Rank  Transliteration (translation)        Frequency

1     de (of, adj.)                        905
2     le (liao)* (see note below)          290
3     shi (yes, are, is, etc.)             246
4     ren (people, human beings)           156
5     ta (he)                              147
6     zai (in, on, at, etc.)               146
7     wo (I)                               142
8     you (have, possess)                  131
9     yi (one)                             118
10    yao (want, desire)                   109
11    bu (no, not)                         105
12    zhe (this)                           102
13    na (that)                            102
14    dao (to, arrive)                     101
15    lai (come)                           83
16    shuo (say)                           81
17    ye (also, too)                       78
18    shang (up, above, etc.)              75
19    mang (busy)                          73
20    qu (go)                              72
21    zhao (whereabouts, wear, touch)      68
22    meiyou (have not)                    66
23    women (we)                           66
24    yige (a piece of, a kind of)         60
25    dou (already)                        58
26    shi (make, cause)                    57
27    jiu (with regards to)                55
28    zuo (make, do, be a ...)             55
29    he (and, with, as)                   52

* le (liao): (1) particle: expression of completion or change of an action; (2) suffix: expression of completion of an action; (3) verb: end up, understand.

Concerning this problem, Arapov29 writes that the real cause of the deviation from the Zipfian curve seems to be related to additional constraints on the use of auxiliary words, which have high frequencies. These additional restrictions are not accounted for in mathematical derivations of the law (Hill,9,10 Arapov,29 Mandelbrot8). If Zipf's law held precisely for these words too, then the internal structure of any text would be controlled by the physical size of the text (and vice versa). Knowing the number of different words used would predict the length of the text. However, in practice the length of texts is usually much less than predicted by Zipf's law.30

Our analysis has clearly shown that the rank-frequency form, and especially the cumulative rank-frequency form, contains more information than the Lotka form. Indeed, it was only in the rank-frequency form that difficulties in fitting, and hence some special features in the data, became apparent. The value and importance of cumulative rank distributions has been stressed many times before, notably by Brookes (see e.g. Refs 31, 32) and by one of the authors.33

We conclude our reinvestigation of Zipf's data on Chinese word frequencies by stating the following problems. Can other distributions (perhaps the cumulative rank-frequency form of Lotka's law with α ≠ 2, see Refs 18, 33, 34) give better fits to Zipf's data? What theory, probably based on a mathematical formulation of philological principles, can really explain all features of Zipf's data?

We would like to thank M. Dekeyser and G. W. Wu for helpful comments concerning this paper. We thank Professor B. C. Brookes for bringing the book Studies on Zipf's Law to our attention.

Appendix

Sources of Chinese words

1. Original titles and transliterations

(1) Liang Qichao: zui ku yu zui le
(2) Mei Xiang: xin ji yuan
(3) Dai Jitao: hu zhou de yin xiang
(4) Mei Xiang: gei Luolan de yi feng xin
(5) Ye Shaojun: xing yun
(6) Zhang Taiyan: liu xue de mu di yu fang fa
(7) Hu Shi: xin si chao de ding yi
(8) Mei Xiang: lao can you ji
(9) Shen Chongwen: tong zhi de yan dou de gu shi
(10) Mei Xiang: jian fa
(11) Mei Xiang: shi mian de yi ye
(12) Mei Xiang: mang fu
(13) Chen Duxiu: lao dong zhe de jue wu
(14) Zhang Weici: zheng zhi xue da gang
(15) Hu Shi: zui hou de yi ke
(16) Sun Zhongshan: min zu zhu yi
(17) Sun Zhongshan: lao dong jie ji yu bu ping deng tiao yue
(18) Liang Qichao: ren sheng de yi yi
(19) Sun Zhongshan: nu zi ying gai yan jiu san min zhu yi
(20) Mei Xiang: guo xiao zi xun qin ji

2. Translated titles (translation by the authors of this paper)

(1) Liang Qichao: The Bitterest and the Happiest
(2) unknown: New Epoch
(3) Dai Jitao: Impression of Huzhou
(4) unknown: A Letter to Rowland
(5) Ye Shaojun: Nebula
(6) Zhang Taiyan: Objectives and Methods of Studying Abroad
(7) Hu Shi: Definitions of New Trends of Thoughts
(8) unknown: Travel Notes of Laochan
(9) Shen Chongwen: The Story of the Comrades' Pipes
(10) unknown: Haircut
(11) unknown: A Sleepless Night
(12) unknown: A Blind Woman
(13) Chen Duxiu: The Consciousness of the Working Classes
(14) Zhang Weici: The Syllabus of Political Science
(15) Hu Shi Yi: The Last Lesson (translated by Hu Shi)
(16) Sun Zhongshan: Nationalism
(17) Sun Zhongshan: The Working Classes and the Unequal Treaties
(18) Liang Qichao: The Meaning of Life
(19) Sun Zhongshan: Woman Should Study San Min Zhu Yi*
(20) unknown: Story of Guo Xiaozi Tracing his Parents**

* San Min Zhu Yi: The People's Welfare, Civil Rights, and Democracy.
** By the term 'Guo Xiaozi' is meant a son who deeply loves his parents. This son wants to find his parents to help them.

References

1. G. A. MILLER, Introduction to The Psycho-biology of Language: An Introduction to Dynamic Philology, by G. K. Zipf, M.I.T. Press, Cambridge (Mass.), 1965.
2. G. K. ZIPF, The Psycho-biology of Language: An Introduction to Dynamic Philology, Houghton Mifflin, 1935. Reprinted in 1965 by the M.I.T. Press, Cambridge (Mass.).
3. P. T. NICHOLLS, Estimation of Zipf parameters, Journal of the American Society for Information Science, 38 (1987) 443-445.
4. R. E. WYLLYS, Empirical and theoretical bases of Zipf's law, Library Trends, 30 (1) (1981) 53-64.
5. L. EGGHE, R. ROUSSEAU, An Introduction to Informetrics, Elsevier, Amsterdam, 1990.
6. R. ROUSSEAU, Relations between continuous versions of bibliometric laws, Journal of the American Society for Information Science, 41 (1990) 197-203.
7. L. EGGHE, The exact place of Zipf's and Pareto's law amongst the classical informetric laws, Scientometrics, 20 (1991) 93-106.
8. B. MANDELBROT, The Fractal Geometry of Nature, Freeman, New York, 1977.
9. B. M. HILL, The rank-frequency form of Zipf's law, Journal of the American Statistical Association, 69 (1974) 1017-1026.
10. B. M. HILL, A theoretical derivation of the Zipf (Pareto) law, In: H. GUITER, M. V. ARAPOV (Eds), Studies on Zipf's Law, Brockmeyer, Bochum, 1982, pp. 53-64.
11. G. K. ZIPF, Selected Studies of the Principle of Relative Frequency in Language, Harvard University Press, Cambridge (Mass.), 1932.
12. S. C. BRADFORD, Sources of information on specific subjects, Engineering, 137 (1934) 85-86.
13. J. MEYER, Gilt das Zipfsche Gesetz auch für die chinesische Schriftsprache? (Does Zipf's law also hold for the Chinese written language?), NTZ Archiv, 11 (1989) 13-16.
14. W.-X. XU, Zipf's law and mechanism of distribution of Chinese term frequency. Paper presented at the 2nd International Conference on Bibliometrics, Scientometrics and Informetrics, London (Ontario), July 1989.
15. Press Digest 2707, Current Contents, July 3, 1989.
16. A. J. LOTKA, The frequency distribution of scientific productivity, Journal of the Washington Academy of Sciences, 16 (1926) 317-323.
17. G. K. ZIPF, Relative Frequency as a Determinant of Phonetic Change, Harvard Studies in Classical Philology, Vol. 40, Harvard University Press, Cambridge (Mass.), 1929.
18. G. K. ZIPF, Human Behavior and the Principle of Least Effort, Hafner, New York and London (reprinted edition), 1965.
19. G. DEWEY, Relative Frequency of English Speech Sounds, Harvard University Press, Cambridge (Mass.), 1923.
20. E. V. CONDON, Statistics of vocabulary, Science, 67 (1928) 300.
21. M. PETRUSZEWYCZ, L'histoire de la loi d'Estoup-Zipf: documents, Math. Sci. Hum., 11 (44) (1973) 41-56.
22. D. H. HERTZEL, Bibliometrics, history of the development of ideas in. In: Encyclopedia of Library and Information Science, A. KENT (Ed.), Vol. 42, Suppl. 7 (1987) 144-219.
23. L. EGGHE, The Duality of Informetric Systems with Applications to the Empirical Laws, Ph.D. Thesis, The City University, London (UK), 1989.
24. R. ROUSSEAU, Een vleugje bibliometrie: de equivalentie tussen de wetten van Bradford en Leimkuhler (Some bibliometrics: the equivalence between the Bradford and the Leimkuhler laws), Wiskunde en Onderwijs, 13 (1987) 71-78.
25. L. EGGHE, Applications of the theory of Bradford's law to the calculation of Leimkuhler's law and to the completion of bibliographies, Journal of the American Society for Information Science, 41 (1990) 469-492.
26. Q. ZHANG, Obsolescence and Bradford Distribution of Rice Literature, M.Sc. Thesis, The City University, London (UK), 1986.
27. I. ASAI, A general formulation of Bradford's distribution: the graph-oriented approach, Journal of the American Society for Information Science, 32 (1981) 113-119.
28. R. ROUSSEAU, The nuclear zone of a Leimkuhler curve, Journal of Documentation, 43 (1987) 322-333.
29. M. V. ARAPOV, A variational approach to frequency-rank distributions of text elements, In: H. GUITER, M. V. ARAPOV (Eds), Studies on Zipf's Law, Brockmeyer, Bochum, 1982, pp. 29-52.
30. R. E. PRATHER, Comparison and extension of theories of Zipf and Halstead, The Computer Journal, 31 (1988) 248-252.
31. B. C. BROOKES, Quantitative analysis in the humanities: the advantage of ranking techniques, In: H. GUITER, M. V. ARAPOV (Eds), Studies on Zipf's Law, Brockmeyer, Bochum, 1982, pp. 65-115.
32. B. C. BROOKES, Comments on the scope of bibliometrics. In: Informetrics 87/88, L. EGGHE, R. ROUSSEAU (Eds), Elsevier, Amsterdam, 1988, pp. 29-41.
33. R. ROUSSEAU, Lotka's law and its Leimkuhler representation, Library Science with a Slant to Documentation and Information Studies, 25 (1988) 150-178.
34. L. EGGHE, New Bradfordian laws equivalent with old Lotka laws, evolving from a source-item duality argument, In: Informetrics 89/90, L. EGGHE, R. ROUSSEAU (Eds), Elsevier, Amsterdam, 1990, pp. 79-96.
