Scientometrics, Vol. 24, No. 2 (1992) 201-220

ZIPF'S DATA ON THE FREQUENCY OF CHINESE WORDS REVISITED

R. ROUSSEAU*, QIAOQIAO ZHANG**

* Katholieke Industriële Hogeschool West-Vlaanderen, Zeedijk 101, B-8400 Oostende (Belgium) and Speciale Licentie Documentatie- en Bibliotheekwetenschap, University of Antwerp, Universiteitsplein 1, B-2610 Wilrijk (Belgium)
** Dep. of Information Science, The City University, Northampton Square, London, EC1V 0HB (UK) and Department of Sci-Tech Inf., China National Rice Research Institute, Hangzhou (The People's Republic of China)

(Received February 20, 1990)
On the occasion of the 40th anniversary of George Zipf's premature death, we re-analyse his data on the frequency of Chinese words. We find the best fitting Lotka, Zipf, Bradford and Leimkuhler distributions and show that only Lotka's function is not rejected by a Kolmogorov-Smirnov test. Adding an additional term to Leimkuhler's function leads to a statistically acceptable fit. In this way we can determine a core (nucleus) of most frequently used Chinese words.
Introduction
George Kingsley Zipf was born in Freeport, Illinois (USA), on January 7, 1902 and died, only 48 years old, on September 25, 1950. He graduated summa cum laude from Harvard College in 1924, and spent the next year in Germany, studying at Bonn and Berlin. He returned to Harvard and received his Ph.D. in Comparative Philology in 1930. He stayed at Harvard and became instructor in German until 1936, assistant professor of German until 1939, and university lecturer until the year of his death. This paper is written to commemorate the 40th anniversary of Zipf's premature death. The data mentioned in this introductory section are taken from the introduction, by George Miller,1 of the reprinted edition of Zipf's book 'The psycho-biology of language: an introduction to dynamic philology'.2
Scientometrics 24 (1992), Elsevier, Amsterdam-Oxford-New York-Tokyo; Akadémiai Kiadó

In the informetric and linguistic literature the terms 'Zipf curve' and 'Zipf's equation' express either a relation between the frequency of occurrence of an event and the number of different events occurring with that frequency (see e.g. Nicholls' formulation3), or a relation between the frequency of occurrence of an event and its rank when the events are ordered with respect to the frequency of occurrence (the most frequent one first). Using the terminology of Ref. 4 (see also Ref. 5), we will restrict the usage of the term 'Zipf's equation' or 'Zipf's law' to the second situation. More precisely, when r denotes the rank, and f(r) denotes the frequency of the event at rank r, Zipf's equation states that

$$f(r) \cdot r = C, \qquad (1)$$

where C is a constant.
The equation

$$f(r) = C / r^{\beta}, \qquad (2)$$

where β is a positive parameter, will be termed the generalised form of Zipf's equation. The former situation, dealing with the frequency of occurrence of an event (denoted as y) and the number of different events occurring with that frequency (denoted as g(y)), will be termed Lotka's law or Lotka's equation: more precisely, it states that

$$g(y) = A / y^{\alpha}, \qquad (3)$$

where A and α are parameters. The special case with α = 2 has a special historical interest. Moreover, it can be shown that under some conditions Lotka's inverse square law is mathematically equivalent to Zipf's law.6,7
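The two viewpoints distinguished above can be illustrated with a minimal sketch. It takes a table of word counts and produces both the rank-frequency form f(r) used in Zipf's equation and the size-frequency form g(y) used in Lotka's equation. The function names and the toy counts are hypothetical and are not Zipf's data.

```python
from collections import Counter


def rank_frequency(word_counts):
    """Return the frequencies f(r) in decreasing order; index 0 holds rank 1."""
    return sorted(word_counts.values(), reverse=True)


def size_frequency(word_counts):
    """Return {y: g(y)}: the number of distinct words occurring exactly y times."""
    return dict(sorted(Counter(word_counts.values()).items()))


# Hypothetical toy counts, chosen only to illustrate the two forms.
toy_counts = {"a": 6, "b": 3, "c": 2, "d": 2, "e": 1, "f": 1, "g": 1}

ranked = rank_frequency(toy_counts)   # rank-frequency form (Zipf)
lotka = size_frequency(toy_counts)    # size-frequency form (Lotka)

# Both forms describe the same total number of occurrences.
total_from_ranks = sum(ranked)
total_from_sizes = sum(y * g for y, g in lotka.items())
```

Both representations carry the same totals; they differ only in whether the words are ordered by rank or grouped by their number of occurrences.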
Zipf tried to explain the observed regularities in the occurrence of words in texts by the principle of least effort. According to this principle, people try to find an equilibrium between uniformity and diversity in the use of words. For example, when someone does not remember the name of an object or a person, he may replace it by 'thing' or 'that man', but when a sentence is cluttered with 'things' and 'men' it soon becomes unintelligible. On the other hand, when a scientist, say an historian, speaks on some topic from the Middle Ages and uses the correct terminology, he will, after a short while, have to expand his sentences and add some explanations, or else again the audience will not be able to follow. So a vocabulary should not be too poor or too rich: there must be an equilibrium.
Mandelbrot, however, has shown that the principle of least effort, as formulated
by Zipf, is too vague, and that Zipf curves can be explained as the result of a
particular stochastic process. In his proof Mandelbrot used the nowadays very popular
notion of fractals (see Ref. 8, and part IV of Ref. 5). So with Miller 1 we may say that
Zipf belongs among those rare but stimulating men whose failures are more fruitful
than most men's successes. The authors of this paper, however, would like to add some comments here.
Although several people, such as Wyllys,4 do not hesitate to point out the absence of serious statistical methods in Zipf's original work, we still believe that the stochastic approach as used e.g. by Mandelbrot8 or Hill9,10 does not explain everything. The tendency, observed by Zipf, for words that are in heavy use to be abbreviated or replaced by simpler words is a very real phenomenon. Consider, for example, the chains: automobile - auto - car, or television - tele - TV, or telefacsimile - telefax - fax. The equilibrium hypothesis and the principle of least effort, too, have some intuitive value. So, although the purely mathematical, say stochastic, approach leads to the correct distribution function, and some of Zipf's vague ideas were incorporated in the mathematical derivations, we do not think that mathematics gives the ultimate explanation of the use of words in texts or in spoken language.
The history of Zipf's law
In 1932 the Harvard University Press published 'Selected Studies of the Principle of Relative Frequency in Language', a report written by George K. Zipf.11 In this report he investigated the occurrences of words in Latin (Plautus) and Chinese texts. To perform the analysis of the Chinese texts Zipf needed the help of two Chinese collaborators: Mr Kan Yu Wang and Mr H. Y. Chang. We may say that Mr Wang and Mr Chang were to Zipf as Mr E. Lancaster Jones was to Bradford.12 As these investigations are among the first studies in quantitative linguistics, this shows that, through these men, China played a pioneering role in the history of quantitative linguistics, and hence also in informetrics.
Zipf and his collaborators analysed twenty fragments of text, each containing 1000 syllables, taken from twenty different sources (see Appendix: Tables 6 and 7). This yielded a corpus of 20,000 syllables.
In this paper we wish to re-analyse Zipf's data. To do this we will use recent advances in informetrics, resulting in a new interpretation of the data. It seems, moreover, that many people have forgotten that Zipf studied not only Western languages, but also Chinese. Indeed, in 1989 Meyer13 wrote that 'until now it was unknown whether Zipf's law is valid for other [than Western] languages, for example for the Chinese one'. Even in China the investigations by Zipf, Wang and Chang are not common knowledge, as witnessed by Ms Xu,14 who wrote that 'Professor Zipf only made a statistic of Indo-European language families'. Moreover, Meyer's statement was reprinted as a Press Digest15 in Current Contents. So, it seems that publishing and re-analysing Zipf's original data will help to set the record straight.
Another remarkable point is the following: in 1932 Zipf only used Lotka's inverse square law (without referring to Lotka,16 who had published his work as early as 1926). Furthermore, he did not perform any statistical test and did not include the most frequently occurring words (occurring more than 50 times), because applying Lotka's law to these higher frequencies would have yielded fractional numbers. (Note that this difficulty is easily resolved by taking a continuous approach.) The main point is that he did not consider the rank-frequency form (although he had mentioned it in Ref. 17). It is only in 'The psycho-biology of language'2 that he wrote, after having used Lotka's inverse square distribution: 'There is, however, another method of viewing and plotting these frequency distributions which is less dependent upon the size of the bulk and which reveals an additional feature. As suggested by a friend, one can consider the words of a vocabulary as ranked in the order of their frequency, e.g. the first most frequent word, the second most frequent, the third most frequent, the five-hundredth most frequent, the thousandth most frequent, etc. We can indicate on the abscissa of a double logarithmic chart the number of the word in the series and on the ordinate its frequency'. Could it be that Zipf did not invent 'Zipf's law', but that a friend showed him the mathematical relation? We have, however, no clue who this mysterious friend might have been. In the works of Zipf we have examined we found only one person whom he addresses as 'my friend', namely R. Y. Chao18 (p. 548). As Professor Chao is also mentioned in Ref. 11, p. 5, the conjecture that this friend is Professor R. Y. Chao is not too far-fetched. Note also that in Refs 11 and 2 there is no reference to Estoup or to Pareto, although Estoup is mentioned on page 3 of Zipf's very first publication on the relative frequency of words.17 Later, in 'Human behavior and the principle of least effort'18 Zipf did refer to Lotka, Estoup and Pareto. Clearly, the relation between Zipf's function and Bradford's law was not yet established at that moment. It is also interesting that, besides Estoup, Zipf18 also refers in this later work to Godfrey Dewey19 and to E. V. Condon20 as scientists who have noted the hyperbolic nature of the frequency of word usage. Perhaps it was to one of these works that the mysterious friend drew Zipf's attention. In this context we should mention the work of Petruszewycz,21 who thinks that the mysterious friend could have been Alan N. Holden. The reason for this conjecture is that Holden is mentioned in an earlier work of Zipf and that he worked for the same company as Condon,20 namely the Bell Telephone Company. For further information on the history of Zipf's law we refer the reader to Refs 21 and 22.
An informetric analysis of Zipf's data on Chinese word frequencies: fitting a Lotka and a Zipf function
In this section we will present Zipf's data on Chinese word frequencies (Tables 1 and 2) and we will try to fit Lotka's inverse square law and Zipf's law. We will show (see Table 1) that Lotka's law fits; Zipf's law, however, is rejected by a Kolmogorov-Smirnov test, even on the 1% level (see Table 2). For more information on this test we refer e.g. to Ref. 5, 1.3.6.
Table 1
Frequency of occurrence of Chinese words (Zipf,11 1932)

A: number of occurrences
B: frequency of occurrence
C: cumulative frequency of occurrence
D: cumulative relative frequency of occurrence
E: expected cumulative frequency of occurrence (α = 2)
F: expected cumulative relative frequency of occurrence
G: absolute value of differences between columns D and F
The texts analysed by Zipf and his collaborators contained 13,252 words, of which 3342 were different. (Note that Zipf, 11 p. 23, writes 13,248 and 3332, and admits that there is some error.) This means that on the average each word occurred almost four times. Note that the number of words occurring exactly once is slightly higher than expected.
A Kolmogorov-Smirnov test on the 10% level allows a maximum difference between the observed and the expected cumulative relative distribution equal to 0.021 (= 1.22/√3342). The largest difference occurs for y = 4 and is equal to 0.0109. As 0.0109 is smaller than 0.021, this implies that the words in the Chinese texts analysed by Zipf and his collaborators satisfy Lotka's inverse square law.
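The threshold used above can be reproduced with a short sketch. It assumes the usual large-sample Kolmogorov-Smirnov coefficients (1.22, 1.36 and 1.63 for the 10%, 5% and 1% levels); only the 10% coefficient is stated explicitly in the text, and the function names are hypothetical.

```python
import math

# Large-sample Kolmogorov-Smirnov coefficients (assumed standard tabulated values).
KS_COEFFICIENT = {0.10: 1.22, 0.05: 1.36, 0.01: 1.63}


def ks_critical_value(n, level):
    """Maximum allowed difference between two cumulative relative distributions."""
    return KS_COEFFICIENT[level] / math.sqrt(n)


def ks_max_difference(observed_cum_rel, expected_cum_rel):
    """Largest absolute difference between observed and expected cumulative values."""
    return max(abs(o - e) for o, e in zip(observed_cum_rel, expected_cum_rel))


# Lotka form: the sample consists of the 3342 different words.
lotka_threshold = ks_critical_value(3342, 0.10)

# The largest difference reported in the text (at y = 4) is 0.0109.
lotka_fit_accepted = 0.0109 <= lotka_threshold
```

With these inputs the threshold is about 0.021, so the reported maximum difference of 0.0109 lies well inside the acceptance region.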
To find Zipf's distribution (1) we need only the parameter C. Summing Eq. (1) over all r yields:

$$\sum_{r=1}^{N} f(r) = C \sum_{r=1}^{N} \frac{1}{r}, \qquad (4)$$

with N = 3342. As $\sum_{r=1}^{N} 1/r \approx \ln(N) + \gamma$ (where γ is Euler's constant, approximately equal to 0.5772), we find (using Eq. (4)): 13252 = C (ln(3342) + 0.5772), or C = 1524.7. If we lower our standards and try a fit on the 1% level, the maximum allowed difference between the observed relative cumulative distribution and the theoretical one is 0.014 (again, using a Kolmogorov-Smirnov test). For r = 1 this difference is already too high, namely (1524.7 - 905) / 13252 = 0.047. Putting C = 905 (the observed frequency of rank one) yields e.g. for R(10) a theoretical value of 2651, which must be compared with the observed value of 2390, or a relative difference of 0.0197. This shows that Zipf's distribution cannot be fitted to Zipf's data on the frequency of Chinese words.
This is also illustrated by Fig. 1.
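The arithmetic of this paragraph can be checked with a minimal sketch. All numbers are taken from the text; the variable names are hypothetical, and the 1% coefficient 1.63 is the assumed standard large-sample Kolmogorov-Smirnov value.

```python
import math

N_TYPES = 3342        # number of different words
N_TOKENS = 13252      # total number of word occurrences
F1_OBSERVED = 905     # observed frequency of the word at rank 1
R10_OBSERVED = 2390   # observed cumulative frequency of the first 10 ranks
EULER_GAMMA = 0.5772  # Euler's constant as rounded in the text

# Eq. (4) with the approximation sum(1/r) ~ ln(N) + gamma.
c_fit = N_TOKENS / (math.log(N_TYPES) + EULER_GAMMA)

# Relative difference at rank 1 between the fitted and the observed frequency.
gap_rank1 = (c_fit - F1_OBSERVED) / N_TOKENS

# Alternative choice C = f(1) = 905: theoretical cumulative frequency R(10).
harmonic_10 = sum(1.0 / r for r in range(1, 11))
r10_theory = F1_OBSERVED * harmonic_10
gap_rank10 = (r10_theory - R10_OBSERVED) / N_TOKENS

# Kolmogorov-Smirnov threshold on the 1% level (assumed coefficient 1.63).
ks_1pct = 1.63 / math.sqrt(N_TOKENS)
```

Both relative differences exceed the 1% threshold of about 0.014, which is the reason the fit is rejected for either choice of C.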
[Figure 1: plot of f(r) against r on double logarithmic axes]

Fig. 1. Zipf's data on a double logarithmic scale. Dots: observations (to prevent clutter, only some selected data are shown); full line: theoretical distribution according to Zipf's law; dotted line: theoretical distribution according to column E of Table 2
Next, we will try to fit the generalised form of Zipf's equation to the data. To do this we take the logarithm of Eq. (2), which results in:

$$\ln(f(r)) = \ln(C) - \beta \ln(r), \qquad (5)$$

hence β and ln(C) can be obtained by linear regression. A calculation based on the 67 ranks of Table 2 (the 67 data points given by Zipf), column D, yields β = 0.8124 and ln(C) = 6.624, hence C = 754. The correlation coefficient for this linear equation is -0.99414, which is excellent. Yet, when we turn to Eq. (2) the fit is rejected by a Kolmogorov-Smirnov test. However, when the number of observations is large, the requirements for a good fit according to a K-S test are very severe. Moreover, to go from Eq. (5) to Eq. (2) we have to take exponentials. In this process we also take exponentials of the errors! This explains why the fit to Eq. (2) is rejected.
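The regression step of Eq. (5) can be sketched as follows. This is a minimal ordinary least-squares fit on log-transformed values, not the authors' program; the function name is hypothetical, and the input is synthetic data generated from the reported parameters (C = 754, β = 0.8124), since the 67 observed points of Table 2 are not available here.

```python
import math


def fit_generalised_zipf(ranks, freqs):
    """Least-squares fit of ln f(r) = ln C - beta ln r; returns (beta, C)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic, noise-free data (an assumption for illustration, not Zipf's observations).
ranks = list(range(1, 68))
freqs = [754.0 * r ** -0.8124 for r in ranks]

beta, c = fit_generalised_zipf(ranks, freqs)
```

On noise-free input the fit returns the generating parameters. With real data the residuals are minimised on the logarithmic scale, so taking exponentials afterwards can leave large deviations on the original scale, which is the point made in the paragraph above.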
Table 2
Rank-frequency form of Chinese words (Zipf,11 1932)

A: Rank (r)
B: Frequency (f(r)) at rank r
C: Cumulative frequency at rank r, denoted by R(r)
D: Cumulative frequency predicted by the generalised form of Zipf's law, C = 754, β = 0.81
E: Cumulative frequency predicted by the generalised form of Zipf's law,
In an attempt to find a better fit for Eq. (5) we used other ranks (see column E of Table 2) and excluded the first one. This gave β = 0.784, C = 689 and a correlation coefficient of -0.995. Unfortunately, this too deviates too much from the observed data. Hence, we conclude that, although the generalised form of Zipf's equation agrees in general terms with the observed data, it does not fit in the statistical sense. One of the problems, namely the large difference between f(1) (= 905) and f(2) (= 290), will be investigated in the next section.
Zipf's data and the Bradford distribution
In view of the relation between Lotka's inverse square law, the Bradford distribution and Leimkuhler's law (see e.g. Refs 5, 6, 23, 24, where it is shown that these laws are 'equivalent') it is natural to try to fit Leimkuhler's distribution

$$R(r) = a \ln(1 + b r) \qquad (6)$$

to Zipf's data. Here, as in Table 2, R(r) denotes the cumulative number of occurrences of the first r words. Figure 2 shows the rank-frequency distribution of the data on semi-logarithmic scales. We will also try to find Bradford groups as is done e.g. in Refs 5, 25, 26.
[Figure 2: cumulative frequency plotted against lg(r)]

Fig. 2. Cumulative form of Chinese word frequencies (semi-logarithmic scale). All words with ranks to the left of the square bracket are core words
Applying Egghe's global method,25,5 we are free in our choice of the number of groups (denoted as p). We will take p = 5. The Bradford multiplier k is then equal to

$$k = (e^{\gamma} f(1))^{1/p} = (1.781)^{0.2} (905)^{0.2} \approx 4.38.$$

From this we calculate the number of items in every group, denoted as y_0: y_0 is equal to 13252 / 5 = 2650.4. The number of sources (different words) in the first Bradford group is then r_0 = T (k - 1) / (k^p - 1), where T is the total number of sources (different words). In our case this yields:

$$r_0 = (3342)(3.38) / ((4.38)^{5} - 1) \approx 7.$$

This leads to Table 3, which shows that the number of occurrences is not constant. As this number must be a constant for data satisfying Bradford's law, we conclude that Zipf's data do not satisfy Bradford's distribution.
Table 3
Bradford groups for Zipf's data on Chinese word frequencies

Group   # words            # occurrences
1       r_0 = 7            2032
2       r_0 k ≈ 31         2180
3       r_0 k^2 ≈ 135      2751
4       r_0 k^3 ≈ 589      3135
5       r_0 k^4 ≈ 2580     3154
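The group construction behind Table 3 can be retraced with a minimal sketch. It uses only the formulas and numbers quoted in the text (with e^γ rounded to 1.781, as the authors do); the variable names are hypothetical, and the observed occurrences per group are copied from Table 3.

```python
EXP_GAMMA = 1.781   # e^gamma as rounded in the text
F1 = 905            # frequency of the most frequent word
N_TOKENS = 13252    # total number of occurrences
N_TYPES = 3342      # total number of different words (T)
P = 5               # chosen number of Bradford groups

k = (EXP_GAMMA * F1) ** (1.0 / P)             # Bradford multiplier
y0 = N_TOKENS / P                             # occurrences expected per group
r0 = N_TYPES * (k - 1) / (k ** P - 1)         # words in the first group
group_sizes = [r0 * k ** i for i in range(P)]  # words in groups 1..p

# Occurrences actually found in each group (Table 3).
observed_occurrences = [2032, 2180, 2751, 3135, 3154]
deviation_from_y0 = [obs - y0 for obs in observed_occurrences]
```

By construction the group sizes add up to T = 3342 words. The observed occurrences add up to 13252 but range from 2032 to 3154 instead of staying near y_0 = 2650.4, which is the non-constancy noted in the text.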
From these calculations we find for the parameters a and b in Eq. (6):

$$a = y_0 / \ln(k) = 2650.4 / 1.477 = 1794.4,$$

and

$$b = \frac{k - 1}{r_0} = 4.3799 / 7.0124 = 0.6246.$$

Hence we obtain the following equation:

$$R(r) = 1794.4 \ln(1 + 0.6246 r).$$

A quick check shows that e.g. R(240) ≈ 9002, which should be compared with the observed value of 7657. Hence, this equation does not fit the data.
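The quick check of R(240) can be reproduced with a short sketch. It follows the numbers exactly as printed: note that the printed numerator 4.3799 equals k rather than k - 1, so the sketch takes b = 0.6246 from the text instead of recomputing it from the symbolic formula. The function and variable names are hypothetical.

```python
import math

Y0 = 2650.4   # occurrences per Bradford group
K = 4.3799    # Bradford multiplier
R0 = 7.0124   # words in the first Bradford group


def leimkuhler(r, a, b):
    """Cumulative occurrences of the first r words, R(r) = a ln(1 + b r)."""
    return a * math.log(1.0 + b * r)


a = Y0 / math.log(K)

# Ratio as printed in the text (assumption: the printed digits are followed as-is).
b_printed = 4.3799 / 7.0124

r240_theory = leimkuhler(240, 1794.4, 0.6246)
r240_observed = 7657
relative_gap = (r240_theory - r240_observed) / 13252
```

The theoretical value exceeds the observed one by roughly a tenth of all occurrences, far beyond any Kolmogorov-Smirnov threshold used in the paper.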
We have tried other methods to find R(r), including the use of truncation, or an application of the Weber-Fechner law, i.e. R(r) = a ln(b r). (Among the more exotic attempts we mention a multiple regression on 67 points with ln(r) and (ln(r))^2 as independent variables. This yielded R(r) = 574 + 513.096 ln(r) + 137.7 (ln(r))^2, with a generalised correlation coefficient R^2 equal to 0.9978. This model is, of course, totally unexplained!) The negative results of these attempts prompted us to a closer examination of the data. An obvious feature is the large gap between the first and the second number of occurrences. So, perhaps the first number is 'too high' and consists of two parts: a part described by the Leimkuhler law and an additional part. This leads to the conjecture that the frequency of occurrence of Chinese words can be described by the equation:

$$R_1(r) = c + a \ln(1 + b r). \qquad (7)$$

This equation was proposed in the literature by Asai.27
Using a computer program we found the following least-squares solution of the parameters for Eq. (7): c = 880, a = 2103.47 and b = 0.10202, with a coefficient of determination equal to 0.9997.
Table 4 shows the result of fitting this equation to Zipf's data. A Kolmogorov-Smirnov test on the 1% level allows a maximum value of D (the maximum difference between the observed and theoretical cumulative relative distribution) of 0.0141. On the 5% level this value is 0.0118, and on the 10% level the value is 0.0106. Table 4 shows that Eq. (7) is accepted at the 1% level. Moreover, were it not for the first value, it would have been accepted at the 10% level!
Table 4
Cumulative rank-frequency form of Zipf's data

A: Rank (r)
B: Observed cumulative distribution
C: Theoretical distribution (R_1(r) = 880 + 2103.47 ln(1 + 0.10202 r))
D: Absolute value of the difference between the observed and the theoretical relative distribution

A     B       C       D
1     905     1084    0.0135
2     1195    1271    0.0057
3     1441    1442    0.0001
4     1597    1560    0.0028
5     1744    1747    0.0002
6     1890    1885    0.0004
7     2032    2014    0.0014
8     2163    2135    0.0021
9     2281    2250    0.0023
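The acceptance statement for Eq. (7) can be retraced on the rows of Table 4 that are available here. The sketch below is a minimal check under stated assumptions: it covers only ranks 1 to 9, uses the fitted parameters from the text, and assumes the standard large-sample Kolmogorov-Smirnov coefficients; all names are hypothetical.

```python
import math

N_TOKENS = 13252
C, A, B = 880.0, 2103.47, 0.10202   # fitted parameters of Eq. (7)


def r1(rank):
    """Theoretical cumulative frequency R_1(r) = c + a ln(1 + b r)."""
    return C + A * math.log(1.0 + B * rank)


# Observed cumulative frequencies for ranks 1..9 (column B of Table 4).
observed = {1: 905, 2: 1195, 3: 1441, 4: 1597, 5: 1744,
            6: 1890, 7: 2032, 8: 2163, 9: 2281}

# Column D: absolute difference of the relative cumulative distributions.
deviation = {r: abs(obs - r1(r)) / N_TOKENS for r, obs in observed.items()}
worst_rank = max(deviation, key=deviation.get)

# Kolmogorov-Smirnov thresholds (assumed coefficients 1.22, 1.36, 1.63).
ks = {level: coeff / math.sqrt(N_TOKENS)
      for level, coeff in {0.10: 1.22, 0.05: 1.36, 0.01: 1.63}.items()}

accepted_at_1pct = deviation[worst_rank] <= ks[0.01]
largest_without_rank1 = max(d for r, d in deviation.items() if r != 1)
```

For these nine ranks the largest deviation occurs at rank 1 and stays below the 1% threshold, while the remaining deviations stay below the 10% threshold, in line with the remark about the first value.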
Using the second term in the equation R_1(r) = 880 + 2103.47 ln(1 + 0.10202 r) and the method to find a 0.75-nucleus as described in Ref. 28 yields a nucleus of 29 core words (cf. Fig. 1). These words are shown in Table 5.
Conclusion
Following Zipf,11,2 Meyer13 and Xu14 we may say that Chinese word frequencies follow Zipf's distribution as well as Western languages do. The fit is, however, only global (as a general trend) and cannot be corroborated by a statistical test. There seems, moreover, to be a problem with the word (or words) with the highest frequencies (a problem that arises with any language).
Table 5
Most frequently used Chinese words (the column of Chinese characters is not reproduced)

Rank  Transliteration  Translation                          Frequency
1     de               (of, adj.)                           905
2     le or liao *                                          290
3     shi              (yes, are, is etc.)                  246
4     ren              (people, human being)                156
5     ta               (he)                                 147
6     zai              (in, on, at etc.)                    146
7     wo               (I)                                  142
8     you              (have, possess)                      131
9     yi               (one)                                118
10    yao              (want, desire)                       195
11    bu               (no, not)                            195
12    zhe              (this)                               195
13    na               (that)                               182
14    dao              (to, arrive)                         101
15    lai              (come)                               83
16    shuo             (say)                                81
17    ye               (also, too)                          78
18    shang            (up, above etc.)                     75
19    mang             (busy)                               73
20    qu               (go)                                 72
21    zhao             (whereabouts, wear, touch)           68
22    meiyou           (have not)                           66
23    women            (we)                                 66
24    yige             (a piece of, a kind of)              60
25    du               (all, already)                       58
26    shi              (make, cause)                        57
27    jiu              (with regards to)                    55
28    zuo              (make, do, be a ...)                 55
29    he               (and, with, as)                      52

* (1) particle: expression of completion or change of an action; (2) suffix: expression of completion of an action; (3) verb: end up, understand
Concerning this problem, Arapov29 writes that the real cause of the deviation from the Zipfian curve seems to be related to additional constraints on the use of auxiliary words which have high frequencies. These additional restrictions are not accounted for in mathematical derivations of the law (Hill,9,10 Arapov,29 Mandelbrot8). If Zipf's law held precisely for these words too, then the internal structure of any text would be controlled by the physical size of the text (and vice versa). Knowing the number of different words used would predict the length of the text. However, in practice the length of texts is usually much less than predicted by Zipf's law.30
Our analysis has clearly shown that the rank-frequency form, and especially the cumulative rank-frequency form contains more information than the Lotka form. Indeed, it was only in the rank-frequency form that difficulties in fitting, and hence some special features in the data, became apparent. The value and importance of the cumulative rank distributions was already stressed many times before, notably by Brookes, see e.g. Refs. 31, 32 and by one of the authors. 33
We conclude our reinvestigation of Zipf's data on Chinese word frequencies by stating the following problems. Can other distributions (perhaps the cumulative rank-frequency form of Lotka's law with α ≠ 2, see Refs 18, 33, 34) give better fits to Zipf's data? What theory, probably based on a mathematical formulation of philological principles, can really explain all features of Zipf's data?
We would like to thank M. Dekeyser and G. W. Wu for helpful comments concerning this paper. We thank Professor B. C. Brookes for bringing the book Studies on Zipf's Law to our attention.
Appendix
Sources of Chinese words
1. Original titles and transliterations (the Chinese characters are not reproduced)

(1) Liang Qichao: zui ku yu zui le
(2) Wei Xiang: xin ji yuan (lu zi mei zhou lun)
(3) Dai Jitao: hu zhou de gan xiang
(4) Wei Xiang: gei luo lan de yi feng xin
(5) Ye Shaojun: yun yi
(6) Zhang Taiyan: liu xue de mu di yu fang fa
(7) Hu Shi: xin si chao de ding yi
(8) Wei Xiang: lao can you ji
(9) Shen Chongwen: tong zhi de yan dou de gu shi
(10) Wei Xiang: jian fa
(11) Wei Xiang: shi mian de yi ye
(12) Wei Xiang: mang fu
(13) Chen Duxiu: lao dong zhe jue wu
(14) Zhang Weici: zheng zhi xue da gang
(15) Hu Shi Yi: zui hou de yi ke
(16) Sun Zhongshan: min zu zhu yi
(17) Sun Zhongshan: lao dong jie ji yu bu ping deng tiao yue
(18) Liang Qichao: ren sheng de yi yi
(19) Sun Zhongshan: nu zi ying gai yan jiu san min zhu yi
(20) Wei Xiang: guo xiao zi xun qin ji
2. Translated titles (translation by the authors of this paper)

(1) Liang Qichao: The Bitterest and the Happiest
(2) unknown: New Epoch
(3) Dai Jitao: Impression of Huzhou
(4) unknown: A Letter to Rowland
(5) Ye Shaojun: Nebula
(6) Zhang Taiyan: Objectives and Methods of Studying Abroad
(7) Hu Shi: Definitions of New Trends of Thoughts
(8) unknown: Travel Notes of Laochan
(9) Shen Chongwen: The Story of the Comrades' Pipes
(10) unknown: Haircut
(11) unknown: A Sleepless Night
(12) unknown: A Blind Woman
(13) Chen Duxiu: The Consciousness of the Working Classes
(14) Zhang Weici: The Syllabus of Political Science
(15) Hu Shi Yi: The Last Lessons (translated by Hu Shi)
(16) Sun Zhongshan: Nationalism
(17) Sun Zhongshan: The Working Classes and the Unequal Treaties
(18) Liang Qichao: The Meaning of Life
(19) Sun Zhongshan: Woman Should Study San Min Zhu Yi*
(20) unknown: Story of Guo Xiaozi Tracing his Parents**

* San Min Zhu Yi: The People's Welfare, Civil Rights, and Democracy.
** By the term 'Guo Xiaozi' is meant a son who deeply loves his parents. This son wants to find his parents to help them.
References
1. G. MILLER, Introduction to The Psycho-biology of Language: An Introduction to Dynamic Philology, by G. K. Zipf, M.I.T. Press, Cambridge (Mass.), 1965.
2. G. K. ZIPF, The Psycho-biology of Language: An Introduction to Dynamic Philology, Houghton Mifflin, 1935. Reprinted in 1965 by the M.I.T. Press, Cambridge (Mass.).
3. P. T. NICHOLLS, Estimation of Zipf parameters, Journal of the American Society for Information Science, 38 (1987) 443-445.
4. R. E. WYLLYS, Empirical and theoretical bases of Zipf's law, Library Trends, 30 (1) (1981) 53-64.
5. L. EGGHE, R. ROUSSEAU, An Introduction to Informetrics, Elsevier, Amsterdam, 1990.
6. R. ROUSSEAU, Relations between continuous versions of bibliometric laws, Journal of the American Society for Information Science, 41 (1990) 197-203.
7. L. EGGHE, The exact place of Zipf's and Pareto's law amongst the classical informetric laws, Scientometrics, 20 (1991) 93-106.
8. B. MANDELBROT, The Fractal Geometry of Nature, Freeman, New York, 1977.
9. B. M. HILL, The rank-frequency form of Zipf's law, Journal of the American Statistical Association, 69 (1974) 1017-1026.
10. B. M. HILL, A theoretical derivation of the Zipf (Pareto) law, In: H. GUITER, M. V. ARAPOV (Eds), Studies on Zipf's Law, Brockmeyer, Bochum, 1982, pp. 53-64.
11. G. K. ZIPF, Selected Studies of the Principle of Relative Frequency in Language, Harvard University Press, Cambridge (Mass.), 1932.
12. S. C. BRADFORD, Sources of information on specific subjects, Engineering, 137 (1934) 85-86.
13. J. MEYER, Gilt das Zipfsche Gesetz auch für die chinesische Schriftsprache? NTZ Archiv, 11 (1989) 13-16.
14. W.-X. XU, Zipf's law and mechanism of distribution of Chinese term frequency. Paper presented at the 2nd International Conference on Bibliometrics, Scientometrics and Informetrics, London (Ontario), July 1989.
15. Press Digest 2707 p, Current Contents, July 3, 1989.
16. A. J. LOTKA, The frequency distribution of scientific productivity, Journal of the Washington Academy of Sciences, 16 (1926) 317-323.
17. G. K. ZIPF, Relative Frequency as a Determinant of Phonetic Change, Harvard Studies in Classical Philology, Vol. 40, Harvard University Press, Cambridge (Mass.), 1929.
18. G. K. ZIPF, Human Behavior and the Principle of Least Effort, Hafner, New York and London (reprinted edition), 1965.
19. G. DEWEY, Relative Frequency of English Speech Sounds, Harvard University Press, Cambridge (Mass.), 1923.
20. E. V. CONDON, Statistics of vocabulary, Science, 67 (1928) 300.
21. M. PETRUSZEWYCZ, L'histoire de la loi d'Estoup-Zipf: documents, Math. Sci. hum., 11 (44) (1973) 41-56.
22. D. H. HERTZEL, Bibliometrics, history of the developments of ideas in. In: Encyclopedia of Library and Information Science, A. KENT (Ed.), Vol. 42; suppl. 7 (1987) 144-219.
23. L. EGGHE, The Duality of Informetric Systems with Applications to the Empirical Laws, Ph.D. Thesis, The City University London (UK), 1989.
24. R. ROUSSEAU, Een vleugje bibliometrie: de equivalentie tussen de wetten van Bradford en Leimkuhler (Some bibliometrics: the equivalence between the Bradford and the Leimkuhler laws), Wiskunde en Onderwijs, 13 (1987) 71-78.
25. L. EGGHE, Applications of the theory of Bradford's law to the calculation of Leimkuhler's law and to the completion of bibliographies, Journal of the American Society for Information Science, 41 (1990) 469-492.
26. Q. ZHANG, Obsolescence and Bradford Distribution of Rice Literature, M.Sc. Thesis, The City University London (UK), 1986.
27. I. ASAI, A general formulation of Bradford's distribution: the graph-oriented approach, Journal of the American Society for Information Science, 32 (1981) 113-119.
28. R. ROUSSEAU, The nuclear zone of a Leimkuhler curve, Journal of Documentation, 43 (1987) 322-333.
29. M. V. ARAPOV, A variational approach to frequency-rank distributions of text elements, In: H. GUITER, M. V. ARAPOV (Eds), Studies on Zipf's Law, Brockmeyer, Bochum, 1982, pp. 29-52.
30. R. E. PRATHER, Comparison and extension of theories of Zipf and Halstead, The Computer Journal, 31 (1988) 248-252.
31. B. C. BROOKES, Quantitative analysis in the humanities: the advantage of ranking techniques, In: H. GUITER, M. V. ARAPOV (Eds), Studies on Zipf's Law, Brockmeyer, Bochum, 1982, pp. 65-115.
32. B. C. BROOKES, Comments on the scope of bibliometrics. In: Informetrics 87/88, L. EGGHE, R. ROUSSEAU (Eds), Elsevier, Amsterdam, 1988, pp. 29-41.
33. R. ROUSSEAU, Lotka's law and its Leimkuhler representation, Library Science with a Slant to Documentation and Information Studies, 25 (1988) 150-178.
34. L. EGGHE, New Bradfordian laws equivalent with old Lotka laws, evolving from a source-item duality argument, In: Informetrics 89/90, L. EGGHE, R. ROUSSEAU (Eds), Elsevier, Amsterdam, 1990, 79-96.