arXiv:cond-mat/0412004v2  9 Jan 2005

Power laws, Pareto distributions and Zipf's law

M. E. J. Newman
Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109, U.S.A.
When the probability of measuring a particular value of some quantity varies inversely as a power of that value, the quantity is said to follow a power law, also known variously as Zipf's law or the Pareto distribution. Power laws appear widely in physics, biology, earth and planetary sciences, economics and finance, computer science, demography and the social sciences. For instance, the distributions of the sizes of cities, earthquakes, solar flares, moon craters, wars and people's personal fortunes all appear to follow power laws. The origin of power-law behaviour has been a topic of debate in the scientific community for more than a century. Here we review some of the empirical evidence for the existence of power-law forms and the theories proposed to explain them.
I. INTRODUCTION
Many of the things that scientists measure have a typical size or "scale"—a typical value around which individual measurements are centred. A simple example would be the heights of human beings. Most adult human beings are about 180cm tall. There is some variation around this figure, notably depending on sex, but we never see people who are 10cm tall, or 500cm. To make this observation more quantitative, one can plot a histogram of people's heights, as I have done in Fig. 1a. The figure shows the heights in centimetres of adult men in the United States measured between 1959 and 1962, and indeed the distribution is relatively narrow and peaked around 180cm. Another telling observation is the ratio of the heights of the tallest and shortest people. The Guinness Book of Records claims the world's tallest and shortest adult men (both now dead) as having had heights of 272cm and 57cm respectively, making the ratio 4.8. This is a relatively low value; as we will see in a moment, some other quantities have much higher ratios of largest to smallest.
Figure 1b shows another example of a quantity with a typical scale: the speeds in miles per hour of cars on the motorway. Again the histogram of speeds is strongly peaked, in this case around 75mph.
But not all things we measure are peaked around a typical value. Some vary over an enormous dynamic range, sometimes many orders of magnitude. A classic example of this type of behaviour is the sizes of towns and cities. The largest population of any city in the US is 8.00 million for New York City, as of the most recent (2000) census. The town with the smallest population is harder to pin down, since it depends on what you call a town. The author recalls in 1993 passing through the town of Milliken, Oregon, population 4, which consisted of one large house occupied by the town's entire human population, a wooden shack occupied by an extraordinary number of cats and a very impressive flea market. According to the Guinness Book, however, America's smallest town is Duffield, Virginia, with a population of 52. Whichever way you look at it, the ratio of largest to smallest population is at least 150 000. Clearly this is quite different from what we saw for heights of people. And an even more startling pattern is revealed when we look at the histogram of the sizes of cities, which is shown in Fig. 2.

In the left panel of the figure, I show a simple histogram of the distribution of US city sizes. The histogram is highly right-skewed, meaning that while the bulk of the distribution occurs for fairly small sizes—most US cities have small populations—there is a small number of cities with population much higher than the typical value, producing the long tail to the right of the histogram. This right-skewed form is qualitatively quite different from the histograms of people's heights, but is not itself very surprising. Given that we know there is a large dynamic range from the smallest to the largest city sizes, we can immediately deduce that there can only be a small number of very large cities. After all, in a country such as America with a total population of 300 million people, you could at most have about 40 cities the size of New York. And the 2700 cities in the histogram of Fig. 2 cannot have a mean population of more than 3 × 10^8/2700 ≈ 110 000.
What is surprising, on the other hand, is the right panel of Fig. 2, which shows the histogram of city sizes again, but this time replotted with logarithmic horizontal and vertical axes. Now a remarkable pattern emerges: the histogram, when plotted in this fashion, follows quite closely a straight line. This observation seems first to have been made by Auerbach [1], although it is often attributed to Zipf [2]. What does it mean? Let p(x) dx be the fraction of cities with population between x and x + dx. If the histogram is a straight line on log-log scales, then ln p(x) = −α ln x + c, where α and c are constants. (The minus sign is optional, but convenient since the slope of the line in Fig. 2 is clearly negative.) Taking the exponential of both sides, this is equivalent to:

p(x) = C x^{−α},   (1)

with C = e^c. Distributions of the form (1) are said to follow a power law. The constant α is called the exponent of the power law. (The constant C is mostly uninteresting; once α
FIG. 1 Left: histogram of heights in centimetres of American males. Data from the National Health Examination Survey, 1959–1962 (US Department of Health and Human Services). Right: histogram of speeds in miles per hour of cars on UK motorways. Data from Transport Statistics 2003 (UK Department for Transport).
FIG. 2 Left: histogram of the populations of all US cities with population of 10 000 or more. Right: another histogram of the same data, but plotted on logarithmic scales. The approximate straight-line form of the histogram in the right panel implies that the distribution follows a power law. Data from the 2000 US Census.
is fixed, it is determined by the requirement that the distribution p(x) sum to 1; see Section III.A.)
Power-law distributions occur in an extraordinarily diverse range of phenomena. In addition to city populations, the sizes of earthquakes [3], moon craters [4], solar flares [5], computer files [6] and wars [7], the frequency of use of words in any human language [2, 8], the frequency of occurrence of personal names in most cultures [9], the numbers of papers scientists write [10], the number of citations received by papers [11], the number of hits on web pages [12], the sales of books, music recordings and almost every other branded commodity [13, 14], the numbers of species in biological taxa [15], people's annual incomes [16] and a host of other variables all follow power-law distributions.1
Power-law distributions are the subject of this article. In the following sections, I discuss ways of detecting power-law behaviour, give empirical evidence for power laws in a variety of systems and describe some of the mechanisms by which power-law behaviour can arise.

Readers interested in pursuing the subject further may also wish to consult the reviews by Sornette [18] and Mitzenmacher [19], as well as the bibliography by Li.2

1 Power laws also occur in many situations other than the statistical distributions of quantities. For instance, Newton's famous 1/r^2 law for gravity has a power-law form with exponent α = 2. While such laws are certainly interesting in their own way, they are not the topic of this paper. Thus, for instance, there has in recent years been some discussion of the "allometric" scaling laws seen in the physiognomy and physiology of biological organisms [17], but since these are not statistical distributions they will not be discussed here.

2 http://linkage.rockefeller.edu/wli/zipf/.
FIG. 3 (a) Histogram of the set of 1 million random numbers described in the text, which have a power-law distribution with exponent α = 2.5. (b) The same histogram on logarithmic scales. Notice how noisy the results get in the tail towards the right-hand side of the panel. This happens because the number of samples in the bins becomes small and statistical fluctuations are therefore large as a fraction of sample number. (c) A histogram constructed using "logarithmic binning". (d) A cumulative histogram or rank/frequency plot of the same data. The cumulative distribution also follows a power law, but with an exponent of α − 1 = 1.5.
II. MEASURING POWER LAWS
Identifying power-law behaviour in either natural or man-made systems can be tricky. The standard strategy makes use of a result we have already seen: a histogram of a quantity with a power-law distribution appears as a straight line when plotted on logarithmic scales. Just making a simple histogram, however, and plotting it on log scales to see if it looks straight is, in most cases, a poor way to proceed.
Consider Fig. 3. This example shows a fake data set: I have generated a million random real numbers drawn from a power-law probability distribution p(x) = C x^{−α} with exponent α = 2.5, just for illustrative purposes.3
Panel (a) of the figure shows a normal histogram of the
3 This can be done using the so-called transformation method. If we can generate a random real number r uniformly distributed in the range 0 ≤ r < 1, then x = x_min (1 − r)^{−1/(α−1)} is a random power-law-distributed real number in the range x_min ≤ x < ∞ with exponent α. Note that there has to be a lower limit x_min on the range; the power-law distribution diverges as x → 0—see Section II.A.
numbers, produced by binning them into bins of equal size 0.1. That is, the first bin goes from 1 to 1.1, the second from 1.1 to 1.2, and so forth. On the linear scales used this produces a nice smooth curve.
To reveal the power-law form of the distribution it is better, as we have seen, to plot the histogram on logarithmic scales, and when we do this for the current data we see the characteristic straight-line form of the power-law distribution, Fig. 3b. However, the plot is in some respects not a very good one. In particular the right-hand end of the distribution is noisy because of sampling errors. The power-law distribution dwindles in this region, meaning that each bin only has a few samples in it, if any. So the fractional fluctuations in the bin counts are large and this appears as a noisy curve on the plot. One way to deal with this would be simply to throw out the data in the tail of the curve. But there is often useful information in those data and furthermore, as we will see in Section II.A, many distributions follow a power law only in the tail, so we are in danger of throwing out the baby with the bathwater.
An alternative solution is to vary the width of the bins in the histogram. If we are going to do this, we must also normalize the sample counts by the width of the
bins they fall in. That is, the number of samples in a bin of width ∆x should be divided by ∆x to get a count per unit interval of x. Then the normalized sample count becomes independent of bin width on average and we are free to vary the bin widths as we like. The most common choice is to create bins such that each is a fixed multiple wider than the one before it. This is known as logarithmic binning. For the present example, for instance, we might choose a multiplier of 2 and create bins that span the intervals 1 to 1.1, 1.1 to 1.3, 1.3 to 1.7 and so forth (i.e., the sizes of the bins are 0.1, 0.2, 0.4 and so forth). This means the bins in the tail of the distribution get more samples than they would if bin sizes were fixed, and this reduces the statistical errors in the tail. It also has the nice side-effect that the bins appear to be of constant width when we plot the histogram on log scales.
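A rough sketch of how logarithmic binning might be implemented, with the counts normalized by bin width as described above (function and parameter names are mine, not the author's):

```python
import math
import random

def log_bin(samples, xmin=1.0, ratio=2.0, nbins=20):
    """Bin samples into bins whose widths grow by a fixed ratio,
    normalizing each count by its bin width to give samples per
    unit interval of x."""
    counts = [0] * nbins
    for x in samples:
        if x < xmin:
            continue
        k = int(math.log(x / xmin) / math.log(ratio))
        if k < nbins:
            counts[k] += 1
    result = []
    for k in range(nbins):
        lo, hi = xmin * ratio ** k, xmin * ratio ** (k + 1)
        result.append((lo, counts[k] / (hi - lo)))  # (bin start, density)
    return result

# Demonstration on synthetic power-law data with alpha = 2.5, xmin = 1.
random.seed(1)
samples = [(1.0 - random.random()) ** (-1.0 / 1.5) for _ in range(100_000)]
densities = log_bin(samples)
```

Because each count is divided by its bin width, the returned densities trace the same p(x) regardless of the chosen multiplier.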
I used logarithmic binning in the construction of Fig. 2b, which is why the points representing the individual bins appear equally spaced. In Fig. 3c I have done the same for our computer-generated power-law data. As we can see, the straight-line power-law form of the histogram is now much clearer and can be seen to extend for at least a decade further than was apparent in Fig. 3b.
Even with logarithmic binning there is still some noise in the tail, although it is sharply decreased. Suppose the bottom of the lowest bin is at x_min and the ratio of the widths of successive bins is a. Then the kth bin extends from x_{k−1} = x_min a^{k−1} to x_k = x_min a^k and the expected number of samples falling in this interval is

∫_{x_{k−1}}^{x_k} p(x) dx = C ∫_{x_{k−1}}^{x_k} x^{−α} dx = C (a^{α−1} − 1)/(α − 1) · (x_min a^k)^{−α+1}.   (2)

Thus, so long as α > 1, the number of samples per bin goes down as k increases and the bins in the tail will have more statistical noise than those that precede them. As we will see in the next section, most power-law distributions occurring in nature have 2 ≤ α ≤ 3, so noisy tails are the norm.
Another, and in many ways a superior, method of plotting the data is to calculate a cumulative distribution function. Instead of plotting a simple histogram of the data, we make a plot of the probability P(x) that x has a value greater than or equal to x:

P(x) = ∫_x^∞ p(x′) dx′.   (3)
The plot we get is no longer a simple representation of the distribution of the data, but it is useful nonetheless. If the distribution follows a power law p(x) = C x^{−α}, then

P(x) = C ∫_x^∞ x′^{−α} dx′ = (C/(α − 1)) x^{−(α−1)}.   (4)

Thus the cumulative distribution function P(x) also follows a power law, but with a different exponent α − 1, which is 1 less than the original exponent. Thus, if we plot P(x) on logarithmic scales we should again get a straight line, but with a shallower slope.
But notice that there is no need to bin the data at all to calculate P(x). By its definition, P(x) is well-defined for every value of x and so can be plotted as a perfectly normal function without binning. This avoids all questions about what sizes the bins should be. It also makes much better use of the data: binning of data lumps all samples within a given range together into the same bin and so throws out any information that was contained in the individual values of the samples within that range. Cumulative distributions don't throw away any information; it's all there in the plot.
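In practice the empirical P(x) comes straight out of a sort, one point per sample and no bins anywhere. A minimal sketch (function name mine):

```python
def ccdf(samples):
    """Empirical cumulative distribution: for each sample value x,
    the fraction of samples greater than or equal to x. Sorting in
    decreasing order gives the points of a rank/frequency plot."""
    xs = sorted(samples, reverse=True)
    n = len(xs)
    return [(x, (rank + 1) / n) for rank, x in enumerate(xs)]

# Each sample becomes one plotted point; the largest value has P = 1/n.
points = ccdf([3.0, 1.0, 2.0])
```

Plotting these points on logarithmic scales gives exactly the kind of cumulative plot shown in Fig. 3d.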
Figure 3d shows our computer-generated power-law data as a cumulative distribution, and indeed we again see the tell-tale straight-line form of the power law, but with a shallower slope than before. Cumulative distributions like this are sometimes also called rank/frequency plots for reasons explained in Appendix A. Cumulative distributions with a power-law form are sometimes said to follow Zipf's law or a Pareto distribution, after two early researchers who championed their study. Since power-law cumulative distributions imply a power-law form for p(x), "Zipf's law" and "Pareto distribution" are effectively synonymous with "power-law distribution". (Zipf's law and the Pareto distribution differ from one another in the way the cumulative distribution is plotted—Zipf made his plots with x on the horizontal axis and P(x) on the vertical one; Pareto did it the other way around. This causes much confusion in the literature, but the data depicted in the plots are of course identical.4)
We know the value of the exponent α for our artificial data set since it was generated deliberately to have a particular value, but in practical situations we would often like to estimate α from observed data. One way to do this would be to fit the slope of the line in plots like Figs. 3b, c or d, and this is the most commonly used method. Unfortunately, it is known to introduce systematic biases into the value of the exponent [20], so it should not be relied upon. For example, a least-squares fit of a straight line to Fig. 3b gives α = 2.26 ± 0.02, which is clearly incompatible with the known value of α = 2.5 from which the data were generated.
An alternative, simple and reliable method for extracting the exponent is to employ the formula

α = 1 + n [ Σ_{i=1}^{n} ln(x_i / x_min) ]^{−1}.   (5)

Here the quantities x_i, i = 1 . . . n are the measured values of x and x_min is again the minimum value of x. (As
4 See http://www.hpl.hp.com/research/idl/papers/ranking/ for a useful discussion of these and related points.
discussed in the following section, in practical situations x_min usually corresponds not to the smallest value of x measured but to the smallest for which the power-law behaviour holds.) The derivation of this formula is given in Appendix B. An error estimate for α can be derived by a standard bootstrap or jackknife resampling method [21]; for large data sets of the type discussed in this paper, a bootstrap is normally the more computationally economical of the two.
Applying Eq. (5) to our present data gives an estimate of α = 2.500 ± 0.002 for the exponent, which agrees well with the known value of 2.5.
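Eq. (5) and the bootstrap error estimate are both simple to implement. A sketch in Python (function names are mine; the text does not prescribe an implementation):

```python
import math
import random

def mle_exponent(samples, xmin):
    """Estimate alpha via Eq. (5), keeping only samples with x >= xmin."""
    tail = [x for x in samples if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def bootstrap_error(samples, xmin, reps=100, seed=0):
    """Standard error on alpha from resampling the data with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    ests = []
    for _ in range(reps):
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        ests.append(mle_exponent(resample, xmin))
    mean = sum(ests) / reps
    return math.sqrt(sum((e - mean) ** 2 for e in ests) / (reps - 1))

# Synthetic data with known alpha = 2.5 and xmin = 1.
random.seed(3)
data = [(1.0 - random.random()) ** (-1.0 / 1.5) for _ in range(20_000)]
alpha_hat = mle_exponent(data, 1.0)
err = bootstrap_error(data, 1.0, reps=50)
```

Note that `mle_exponent` discards values below x_min before counting n, in keeping with the caveat about n discussed later in this section.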
A. Examples of power laws
In Fig. 4 we show cumulative distributions of twelve different quantities measured in physical, biological, technological and social systems of various kinds. All have been proposed to follow power laws over some part of their range. The ubiquity of power-law behaviour in the natural world has led many scientists to wonder whether there is a single, simple, underlying mechanism linking all these different systems together. Several candidates for such mechanisms have been proposed, going by names like "self-organized criticality" and "highly optimized tolerance". However, the conventional wisdom is that there are actually many different mechanisms for producing power laws and that different ones are applicable to different cases. We discuss these points further in Section IV.
(a) Word frequency: Estoup [8] observed that the frequency with which words are used appears to follow a power law, and this observation was famously examined in depth and confirmed by Zipf [2]. Panel (a) of Fig. 4 shows the cumulative distribution of the number of times that words occur in a typical piece of English text, in this case the text of the novel Moby Dick by Herman Melville.5 Similar distributions are seen for words in other languages.
(b) Citations of scientific papers: As first observed by Price [11], the numbers of citations received by scientific papers appear to have a power-law distribution. The data in panel (b) are taken from the Science Citation Index, as collated by Redner [23], and are for papers published in 1981. The plot shows the cumulative distribution of the number of citations received by a paper between publication and June 1997.
5 The most common words in this case are, in order, "the", "of", "and", "a" and "to", and the same is true for most written English texts. Interestingly, however, it is not true for spoken English. The most common words in spoken English are, in order, "I", "and", "the", "to" and "that" [22].
(c) Web hits: The cumulative distribution of the number of "hits" received by web sites (i.e., servers, not pages) during a single day from a subset of the users of the AOL Internet service. The site with the most hits, by a long way, was yahoo.com. After Adamic and Huberman [12].
(d) Copies of books sold: The cumulative distribution of the total number of copies sold in America of the 633 bestselling books that sold 2 million or more copies between 1895 and 1965. The data were compiled painstakingly over a period of several decades by Alice Hackett, an editor at Publisher's Weekly [24]. The best selling book during the period covered was Benjamin Spock's The Common Sense Book of Baby and Child Care. (The Bible, which certainly sold more copies, is not really a single book, but exists in many different translations, versions and publications, and was excluded by Hackett from her statistics.) Substantially better data on book sales than Hackett's are now available from operations such as Nielsen BookScan, but unfortunately at a price this author cannot afford. I should be very interested to see a plot of sales figures from such a modern source.
(e) Telephone calls: The cumulative distribution of the number of calls received on a single day by 51 million users of AT&T long distance telephone service in the United States. After Aiello et al. [25]. The largest number of calls received by a customer in that day was 375 746, or about 260 calls a minute (obviously to a telephone number that has many people manning the phones). Similar distributions are seen for the number of calls placed by users and also for the numbers of email messages that people send and receive [26, 27].
(f) Magnitude of earthquakes: The cumulative distribution of the Richter (local) magnitude of earthquakes occurring in California between January 1910 and May 1992, as recorded in the Berkeley Earthquake Catalog. The Richter magnitude is defined as the logarithm, base 10, of the maximum amplitude of motion detected in the earthquake, and hence the horizontal scale in the plot, which is drawn as linear, is in effect a logarithmic scale of amplitude. The power law relationship in the earthquake distribution is thus a relationship between amplitude and frequency of occurrence. The data are from the National Geophysical Data Center, www.ngdc.noaa.gov.
(g) Diameter of moon craters: The cumulative distribution of the diameter of moon craters. Rather than measuring the (integer) number of craters of a given size on the whole surface of the moon, the vertical axis is normalized to measure number of craters per square kilometre, which is why the axis goes below 1, unlike the rest of the plots, since it is
FIG. 4 Cumulative distributions or "rank/frequency plots" of twelve quantities reputed to follow power laws. The distributions were computed as described in Appendix A. Data in the shaded regions were excluded from the calculations of the exponents in Table I. Source references for the data are given in the text. (a) Numbers of occurrences of words in the novel Moby Dick by Herman Melville. (b) Numbers of citations to scientific papers published in 1981, from time of publication until June 1997. (c) Numbers of hits on web sites by 60 000 users of the America Online Internet service for the day of 1 December 1997. (d) Numbers of copies of bestselling books sold in the US between 1895 and 1965. (e) Number of calls received by AT&T telephone customers in the US for a single day. (f) Magnitude of earthquakes in California between January 1910 and May 1992. Magnitude is proportional to the logarithm of the maximum amplitude of the earthquake, and hence the distribution obeys a power law even though the horizontal axis is linear. (g) Diameter of craters on the moon. Vertical axis is measured per square kilometre. (h) Peak gamma-ray intensity of solar flares in counts per second, measured from Earth orbit between February 1980 and November 1989. (i) Intensity of wars from 1816 to 1980, measured as battle deaths per 10 000 of the population of the participating countries. (j) Aggregate net worth in dollars of the richest individuals in the US in October 2003. (k) Frequency of occurrence of family names in the US in the year 1990. (l) Populations of US cities in the year 2000.
entirely possible for there to be less than one crater of a given size per square kilometre. After Neukum and Ivanov [4].
(h) Intensity of solar flares: The cumulative distribution of the peak gamma-ray intensity of solar flares. The observations were made between 1980 and 1989 by the instrument known as the Hard X-Ray Burst Spectrometer aboard the Solar Maximum Mission satellite launched in 1980. The spectrometer used a CsI scintillation detector to measure gamma-rays from solar flares and the horizontal axis in the figure is calibrated in terms of scintillation counts per second from this detector. The data are from the NASA Goddard Space Flight Center, umbra.nascom.nasa.gov/smm/hxrbs.html. See also Lu and Hamilton [5].
(i) Intensity of wars: The cumulative distribution of the intensity of 119 wars from 1816 to 1980. Intensity is defined by taking the number of battle deaths among all participant countries in a war, dividing by the total combined populations of the countries and multiplying by 10 000. For instance, the intensities of the First and Second World Wars were 141.5 and 106.3 battle deaths per 10 000 respectively. The worst war of the period covered was the small but horrifically destructive Paraguay-Bolivia war of 1932–1935 with an intensity of 382.4. The data are from Small and Singer [28]. See also Roberts and Turcotte [7].
(j) Wealth of the richest people: The cumulative distribution of the total wealth of the richest people in the United States. Wealth is defined as aggregate net worth, i.e., total value in dollars at current market prices of all an individual's holdings, minus their debts. For instance, when the data were compiled in 2003, America's richest person, William H. Gates III, had an aggregate net worth of $46 billion, much of it in the form of stocks of the company he founded, Microsoft Corporation. Note that net worth doesn't actually correspond to the amount of money individuals could spend if they wanted to: if Bill Gates were to sell all his Microsoft stock, for instance, or otherwise divest himself of any significant portion of it, it would certainly depress the stock price. The data are from Forbes magazine, 6 October 2003.
(k) Frequencies of family names: Cumulative distribution of the frequency of occurrence in the US of the 89 000 most common family names, as recorded by the US Census Bureau in 1990. Similar distributions are observed for names in some other cultures as well (for example in Japan [29]) but not in all cases. Korean family names for instance appear to have an exponential distribution [30].
(l) Populations of cities: Cumulative distribution of the size of the human populations of US cities as recorded by the US Census Bureau in 2000.
Few real-world distributions follow a power law over their entire range, and in particular not for smaller values of the variable being measured. As pointed out in the previous section, for any positive value of the exponent α the function p(x) = C x^{−α} diverges as x → 0. In reality therefore, the distribution must deviate from the power-law form below some minimum value x_min. In our computer-generated example of the last section we simply cut off the distribution altogether below x_min so that p(x) = 0 in this region, but most real-world examples are not that abrupt. Figure 4 shows distributions with a variety of behaviours for small values of the variable measured; the straight-line power-law form asserts itself only for the higher values. Thus one often hears it said that the distribution of such-and-such a quantity "has a power-law tail".
Extracting a value for the exponent α from distributions like these can be a little tricky, since it requires us to make a judgement, sometimes imprecise, about the value x_min above which the distribution follows the power law. Once this judgement is made, however, α can be calculated simply from Eq. (5).6 (Care must be taken to use the correct value of n in the formula; n is the number of samples that actually go into the calculation, excluding those with values below x_min, not the overall total number of samples.)
Table I lists the estimated exponents for each of the distributions of Fig. 4, along with standard errors calculated by bootstrapping 100 times, and also the values of x_min used in the calculations. Note that the quoted errors correspond only to the statistical sampling error in the estimation of α; I have included no estimate of any errors introduced by the fact that a single power-law function may not be a good model for the data in some cases or for variation of the estimates with the value chosen for x_min.
In the author's opinion, the identification of some of the distributions in Fig. 4 as following power laws should be considered unconfirmed. While the power law seems to be an excellent model for most of the data sets depicted, a tenable case could be made that the distributions of web hits and family names might have two different power-law regimes with slightly different exponents.7
And the data for the numbers of copies of books sold
6 Sometimes the tail is also cut off because there is, for one reason or another, a limit on the largest value that may occur. An example is the finite-size effects found in critical phenomena—see Section IV.E. In this case, Eq. (5) must be modified [20].
7 Significantly more tenuous claims to power-law behaviour for other quantities have appeared elsewhere in the literature, for instance in the discussion of the distribution of the sizes of electrical blackouts [31, 32]. These however I consider insufficiently substantiated for inclusion in the present work.
quantity                               minimum x_min   exponent α
(a) frequency of use of words          1               2.20(1)
(b) number of citations to papers      100             3.04(2)
(c) number of hits on web sites        1               2.40(1)
(d) copies of books sold in the US     2 000 000       3.51(16)
(e) telephone calls received           10              2.22(1)
(f) magnitude of earthquakes           3.8             3.04(4)
(g) diameter of moon craters           0.01            3.14(5)
(h) intensity of solar flares          200             1.83(2)
(i) intensity of wars                  3               1.80(9)
(j) net worth of Americans             $600m           2.09(4)
(k) frequency of family names          10 000          1.94(1)
(l) population of US cities            40 000          2.30(5)

TABLE I Parameters for the distributions shown in Fig. 4. The labels on the left refer to the panels in the figure. Exponent values were calculated using the maximum likelihood method of Eq. (5) and Appendix B, except for the moon craters (g), for which only cumulative data were available. For this case the exponent quoted is from a simple least-squares fit and should be treated with caution. Numbers in parentheses give the standard error on the trailing figures.
cover rather a small range—little more than one decade horizontally. Nonetheless, one can, without stretching the interpretation of the data unreasonably, claim that power-law distributions have been observed in language, demography, commerce, information and computer sciences, geology, physics and astronomy, and this on its own is an extraordinary statement.
B. Distributions that do not follow a power law
Power-law distributions are, as we have seen, impressively ubiquitous, but they are not the only form of broad distribution. Lest I give the impression that everything interesting follows a power law, let me emphasize that there are quite a number of quantities with highly right-skewed distributions that nonetheless do not obey power laws. A few of them, shown in Fig. 5, are the following:
(a) The abundance of North American bird species, which spans over five orders of magnitude but is probably distributed according to a log-normal. A log-normally distributed quantity is one whose logarithm is normally distributed; see Section IV.G and Ref. [33] for further discussions.
(b) The number of entries in people's email address books, which spans about three orders of magnitude but seems to follow a stretched exponential. A stretched exponential is a curve of the form e^{−ax^b} for some constants a, b.
(c) The distribution of the sizes of forest fires, which spans six orders of magnitude and could follow a power law but with an exponential cutoff.
FIG. 5 Cumulative distributions of some quantities whose distributions span several orders of magnitude but that nonetheless do not follow power laws. (a) The number of sightings of 591 species of birds in the North American Breeding Bird Survey 2003. (b) The number of addresses in the email address books of 16 881 users of a large university computer system [34]. (c) The size in acres of all wildfires occurring on US federal land between 1986 and 1996 (National Fire Occurrence Database, USDA Forest Service and Department of the Interior). Note that the horizontal axis is logarithmic in frames (a) and (c) but linear in frame (b).
This being an article about power laws, I will not discuss further the possible explanations for these distributions, but the scientist confronted with a new set of data having a broad dynamic range and a highly skewed distribution should certainly bear in mind that a power-law model is only one of several possibilities for fitting it.
III. THE MATHEMATICS OF POWER LAWS
A continuous real variable with a power-law distribution has a probability p(x) dx of taking a value in the interval from x to x + dx, where

p(x) = C x^(−α),   (6)

with α > 0. As we saw in Section II.A, there must be some lowest value xmin at which the power law is obeyed, and we consider only the statistics of x above this value.
A. Normalization
The constant C in Eq. (6) is given by the normalization requirement that

1 = ∫_xmin^∞ p(x) dx = C ∫_xmin^∞ x^(−α) dx = [C/(1 − α)] [x^(−α+1)]_xmin^∞.   (7)

We see immediately that this only makes sense if α > 1, since otherwise the right-hand side of the equation would diverge: power laws with exponents less than unity cannot be normalized and don't normally occur in nature. If α > 1 then Eq. (7) gives

C = (α − 1) xmin^(α−1),   (8)

and the correct normalized expression for the power law itself is

p(x) = [(α − 1)/xmin] (x/xmin)^(−α).   (9)
Some distributions follow a power law for part of their range but are cut off at high values of x. That is, above some value they deviate from the power law and fall off quickly towards zero. If this happens, then the distribution may be normalizable no matter what the value of the exponent α. Even so, exponents less than unity are rarely, if ever, seen.
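The normalization of Eq. (9) is easy to verify numerically. The sketch below integrates the density on a logarithmic grid (well suited to fat tails) up to an arbitrarily chosen large cutoff; the exponent and cutoff here are illustrative values, not anything from the text:

```python
import math

def powerlaw_pdf(x, alpha, xmin):
    """Normalized power-law density of Eq. (9)."""
    return (alpha - 1.0) / xmin * (x / xmin) ** (-alpha)

def integrate_log_grid(f, a, b, n=200000):
    """Trapezoidal rule on a logarithmic grid: substitute u = ln x,
    so the integrand becomes f(e^u) e^u on a uniform grid in u."""
    la, lb = math.log(a), math.log(b)
    h = (lb - la) / n
    total = 0.0
    prev = f(a) * a
    for i in range(1, n + 1):
        x = math.exp(la + i * h)
        cur = f(x) * x
        total += 0.5 * (prev + cur) * h
        prev = cur
    return total

alpha, xmin = 2.5, 1.0
mass = integrate_log_grid(lambda x: powerlaw_pdf(x, alpha, xmin), xmin, 1e8)
print(mass)   # approaches 1: Eq. (9) is correctly normalized for alpha > 1
```

With α ≤ 1 the same integral grows without bound as the cutoff is raised, which is the divergence noted below Eq. (7).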
B. Moments
The mean value of x in our power law is given by

〈x〉 = ∫_xmin^∞ x p(x) dx = C ∫_xmin^∞ x^(−α+1) dx = [C/(2 − α)] [x^(−α+2)]_xmin^∞.   (10)

Note that this expression becomes infinite if α ≤ 2. Power laws with such low values of α have no finite mean. The distributions of sizes of solar flares and wars in Table I are examples of such power laws.
What does it mean to say that a distribution has no finite mean? Surely we can take the data for real solar flares and calculate their average? Indeed we can, but this is only because the data set is of finite size. Equation (10) can be made to give a finite value of 〈x〉 if we cut the integral off at some upper limit, i.e., if there is a maximum as well as a minimum value of x. In any real data set of finite size there is indeed such a maximum, which is just the largest value of x observed. But if we make more measurements and generate a larger data set, we have a non-negligible chance of getting a larger maximum value of x, and this will make the value of 〈x〉 larger in turn. The divergence of Eq. (10) is telling us that as we go to larger and larger data sets, our estimate of the mean 〈x〉 will increase without bound. We discuss this more below.
For α > 2 however, the mean does not diverge: the value of 〈x〉 will settle down to a normal finite value as the data set becomes large, and that value is given by Eq. (10) to be

〈x〉 = [(α − 1)/(α − 2)] xmin.   (11)
We can also calculate higher moments of the distribution p(x). For instance, the second moment, the mean square, is given by

〈x²〉 = [C/(3 − α)] [x^(−α+3)]_xmin^∞.   (12)
This diverges if α ≤ 3. Thus power-law distributions in this range, which includes almost all of those in Table I, have no finite mean square in the limit of a large data set, and thus also no finite variance or standard deviation. We discuss the meaning of this statement further below. If α > 3, then the second moment is finite and well-defined, taking the value

〈x²〉 = [(α − 1)/(α − 3)] xmin².   (13)
These results can easily be extended to show that in general all moments 〈x^m〉 exist for m < α − 1 and all higher moments diverge. The ones that do exist are given by

〈x^m〉 = [(α − 1)/(α − 1 − m)] xmin^m.   (14)
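The moment formulas, Eqs. (11) and (13), can be checked by Monte Carlo for an exponent large enough that both converge. The sketch below draws samples by inverse-transform sampling from Eq. (9); the choice α = 4.5 is illustrative, picked so that even the fluctuations of the sample mean square are well behaved:

```python
import random

# Inverse-transform sampling from the normalized power law of Eq. (9):
# the CDF is F(x) = 1 - (x/xmin)^(-(alpha-1)), so x = xmin (1-u)^(-1/(alpha-1)).
rng = random.Random(42)
alpha, xmin, n = 4.5, 1.0, 200000
xs = [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
      for _ in range(n)]

mean = sum(xs) / n
mean_sq = sum(x * x for x in xs) / n

# alpha > 3, so Eqs. (11) and (13) both give finite predictions.
print(mean,    (alpha - 1) / (alpha - 2) * xmin)        # both near 1.4
print(mean_sq, (alpha - 1) / (alpha - 3) * xmin ** 2)   # both near 7/3
```

Repeating the experiment with α ≤ 2 shows the pathology discussed above: the sample mean never settles down but keeps growing with n.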
C. Largest value
Suppose we draw n measurements from a power-law distribution. What value is the largest of those measurements likely to take? Or, more precisely, what is the probability π(x) dx that the largest value falls in the interval between x and x + dx?
The definitive property of the largest value in a sample is that there are no others larger than it. The probability that a particular sample will be larger than x is given by the quantity P(x) defined in Eq. (3):

P(x) = ∫_x^∞ p(x′) dx′ = [C/(α − 1)] x^(−α+1) = (x/xmin)^(−α+1),   (15)

so long as α > 1. And the probability that a sample is not greater than x is 1 − P(x). Thus the probability that a particular sample we draw, sample i, will lie between x and x + dx and that all the others will be no greater than it is p(x) dx × [1 − P(x)]^(n−1). Then there are n ways to choose i, giving a total probability
π(x) = n p(x) [1 − P(x)]^(n−1).   (16)
Now we can calculate the mean value 〈xmax〉 of the largest sample thus:

〈xmax〉 = ∫_xmin^∞ x π(x) dx = n ∫_xmin^∞ x p(x) [1 − P(x)]^(n−1) dx.   (17)
Using Eqs. (9) and (15), this is

〈xmax〉 = n(α − 1) ∫_xmin^∞ (x/xmin)^(−α+1) [1 − (x/xmin)^(−α+1)]^(n−1) dx
        = n xmin ∫_0^1 y^(n−1) (1 − y)^(−1/(α−1)) dy
        = n xmin B(n, (α − 2)/(α − 1)),   (18)
where I have made the substitution y = 1 − (x/xmin)^(−α+1) and B(a, b) is Legendre's beta-function,8 which is defined by

B(a, b) = Γ(a)Γ(b)/Γ(a + b),   (19)

with Γ(a) the standard Γ-function:

Γ(a) = ∫_0^∞ t^(a−1) e^(−t) dt.   (20)
The beta-function has the interesting property that for large values of either of its arguments it itself follows a power law.9 For instance, for large a and fixed b, B(a, b) ∼ a^(−b). In most cases of interest, the number n of samples from our power-law distribution will be large (meaning much greater than 1), so

B(n, (α − 2)/(α − 1)) ∼ n^(−(α−2)/(α−1)),   (21)

and

〈xmax〉 ∼ n^(1/(α−1)).   (22)

Thus, as long as α > 1, we find that 〈xmax〉 always increases as n becomes larger.10
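Eqs. (18), (21) and (22) can be checked numerically. The sketch below evaluates the exact beta-function result with log-gamma arithmetic (to avoid overflow at large n) and watches the rescaled mean maximum settle to a constant; the exponent α = 2.5 is an illustrative choice:

```python
import math

def log_beta(a, b):
    """Legendre beta-function of Eq. (19), via log-gammas to avoid
    overflow when the arguments are large."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

alpha = 2.5
b = (alpha - 2.0) / (alpha - 1.0)   # second argument in Eq. (18)

# Eq. (18): <x_max>/xmin = n B(n, (alpha-2)/(alpha-1)); Eq. (22)
# predicts growth as n^(1/(alpha-1)), here n^(2/3).
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    exact = n * math.exp(log_beta(n, b))
    scaled = exact / n ** (1.0 / (alpha - 1.0))
    print(n, exact, scaled)   # 'scaled' settles near Gamma(1/3) = 2.679
```

The limiting constant Γ((α − 2)/(α − 1)) comes from the asymptotic form B(a, b) ∼ Γ(b) a^(−b), a slightly sharper statement of the power law quoted in the text.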
This allows us to complete the calculation of the moments in Section III.B. Consider for instance the second moment, which is often of interest in power laws. For the crucial case 2 < α ≤ 3, which covers most of the power-law distributions observed in real life, we saw in Eq. (12) that the second moment of the distribution diverges as the size of the data set becomes infinite. But in reality all data sets are finite and so have a finite maximum sample xmax. This means that (12) becomes

〈x²〉 = [C/(3 − α)] [x^(−α+3)]_xmin^xmax.   (23)
8 Also called the Eulerian integral of the first kind.
9 This can be demonstrated by approximating the Γ-functions of Eq. (19) using Stirling's formula.
10 Equation (22) can also be derived by a simpler, although less rigorous, heuristic argument: if P(x) = 1/n for some value of x then we expect there to be on average one sample in the range from x to ∞, and this of course will be the largest sample. Thus a rough estimate of 〈xmax〉 can be derived by setting our expression for P(x), Eq. (15), equal to 1/n and rearranging for x, which immediately gives 〈xmax〉 ∼ n^(1/(α−1)).
As xmax becomes large this expression is dominated by the upper limit, and using the result, Eq. (22), for xmax, we get

〈x²〉 ∼ n^((3−α)/(α−1)).   (24)

So, for instance, if α = 5/2, then the mean-square sample value, and hence also the sample variance, goes as n^(1/3) as the size of the data set gets larger.
D. Top-heavy distributions and the 80/20 rule
Another interesting question is where the majority of the distribution of x lies. For any power law with exponent α > 1, the median is well defined. That is, there is a point x1/2 that divides the distribution in half so that half the measured values of x lie above x1/2 and half lie below. That point is given by

∫_x1/2^∞ p(x) dx = (1/2) ∫_xmin^∞ p(x) dx,   (25)

or

x1/2 = 2^(1/(α−1)) xmin.   (26)
So, for example, if we are considering the distribution of wealth, there will be some well-defined median wealth that divides the richer half of the population from the poorer. But we can also ask how much of the wealth itself lies in those two halves. Obviously more than half of the total amount of money belongs to the richer half of the population. The fraction of the money in the richer half is given by

∫_x1/2^∞ x p(x) dx / ∫_xmin^∞ x p(x) dx = (x1/2/xmin)^(−α+2) = 2^(−(α−2)/(α−1)),   (27)
provided α > 2 so that the integrals converge. Thus, for instance, if α = 2.1 for the wealth distribution, as indicated in Table I, then a fraction 2^(−0.091) ≈ 94% of the wealth is in the hands of the richer 50% of the population, making the distribution quite top-heavy.
More generally, the fraction of the population whose personal wealth exceeds x is given by the quantity P(x), Eq. (15), and the fraction of the total wealth in the hands of those people is

W(x) = ∫_x^∞ x′ p(x′) dx′ / ∫_xmin^∞ x′ p(x′) dx′ = (x/xmin)^(−α+2),   (28)
assuming again that α > 2. Eliminating x/xmin between Eqs. (15) and (28), we find that the fraction W of the wealth in the hands of the richest P of the population is

W = P^((α−2)/(α−1)),   (29)

of which Eq. (27) is a special case. This again has a power-law form, but with a positive exponent now. In
FIG. 6 The fraction W of the total wealth in a country held by the fraction P of the richest people, if wealth is distributed following a power law with exponent α (curves shown for α = 2.1, 2.2, 2.4, 2.7 and 3.5). If α = 2.1, for instance, as it appears to in the United States (Table I), then the richest 20% of the population hold about 86% of the wealth (dashed lines).
Fig. 6 I show the form of the curve of W against P for various values of α. For all values of α the curve is concave downwards, and for values only a little above 2 the curve has a very fast initial increase, meaning that a large fraction of the wealth is concentrated in the hands of a small fraction of the population. Curves of this kind are called Lorenz curves, after Max Lorenz, who first studied them around the turn of the twentieth century [35].
Using the exponents from Table I, we can for example calculate that about 80% of the wealth should be in the hands of the richest 20% of the population (the so-called "80/20 rule", which is borne out by more detailed observations of the wealth distribution), the top 20% of web sites get about two-thirds of all web hits, and the largest 10% of US cities house about 60% of the country's total population.
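Eq. (29) makes figures of this kind easy to reproduce; a minimal sketch:

```python
def wealth_fraction(P, alpha):
    """Eq. (29): fraction W of total wealth held by the richest
    fraction P of the population, valid for exponent alpha > 2."""
    return P ** ((alpha - 2.0) / (alpha - 1.0))

alpha = 2.1   # US wealth distribution exponent from Table I
print(wealth_fraction(0.5, alpha))   # about 0.94: richer half holds ~94%
print(wealth_fraction(0.2, alpha))   # about 0.86: the "80/20 rule" regime
print(wealth_fraction(0.2, 2.4))     # larger alpha: less top-heavy
```

As α approaches 2 from above the exponent (α − 2)/(α − 1) goes to zero and W tends to 1 for any P, the extreme regime discussed next.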
If α ≤ 2 then the situation becomes even more extreme. In that case, the integrals in Eq. (28) diverge at their upper limits, meaning that in fact they depend on the value xmax of the largest sample, as described in Section III.C. But for α > 1, Eq. (22) tells us that the expected value of xmax goes to ∞ as n becomes large, and in that limit the fraction of money in the top half of the population, Eq. (27), tends to unity. In fact, the fraction of money in the top anything of the population, even the top 1%, tends to unity, as Eq. (28) shows. In other words, for distributions with α < 2, essentially all of the wealth (or other commodity) lies in the tail of the distribution. The distribution of family names in the US, which has an exponent α = 1.9, is an example of this type of behaviour. For the data of Fig. 4k, about 75% of the population have names in the top 15 000. Estimates of the total number of unique family names in the US put the figure at around 1.5 million. So in this case 75% of the population have names in the most common 1%—a very top-heavy distribution indeed. The line α = 2 thus separates the regime in which you will with some frequency meet people with uncommon names from the regime in which you will rarely meet such people.
E. Scale-free distributions
A power-law distribution is also sometimes called a scale-free distribution. Why? Because a power law is the only distribution that is the same whatever scale we look at it on. By this we mean the following.
Suppose we have some probability distribution p(x) for a quantity x, and suppose we discover or somehow deduce that it satisfies the property that

p(bx) = g(b) p(x),   (30)

for any b. That is, if we increase the scale or units by which we measure x by a factor of b, the shape of the distribution p(x) is unchanged, except for an overall multiplicative constant. Thus for instance, we might find that computer files of size 2kB are 1/4 as common as files of size 1kB. Switching to measuring size in megabytes we also find that files of size 2MB are 1/4 as common as files of size 1MB. Thus the shape of the file-size distribution curve (at least for these particular values) does not depend on the scale on which we measure file size.
This scale-free property is certainly not true of most distributions. It is not true for instance of the exponential distribution. In fact, as we now show, it is only true of one type of distribution, the power law.
Starting from Eq. (30), let us first set x = 1, giving p(b) = g(b) p(1). Thus g(b) = p(b)/p(1) and (30) can be written as

p(bx) = p(b) p(x)/p(1).   (31)

Since this equation is supposed to be true for any b, we can differentiate both sides with respect to b to get

x p′(bx) = p′(b) p(x)/p(1),   (32)
where p′ indicates the derivative of p with respect to its argument. Now we set b = 1 and get

x dp/dx = [p′(1)/p(1)] p(x).   (33)

This is a simple first-order differential equation which has the solution

ln p(x) = [p′(1)/p(1)] ln x + constant.   (34)

Setting x = 1 we find that the constant is simply ln p(1), and then taking exponentials of both sides

p(x) = p(1) x^(−α),   (35)

where α = −p′(1)/p(1). Thus, as advertised, the power-law distribution is the only function satisfying the scale-free criterion (30).
This fact is more than just a curiosity. As we will see in Section IV.E, there are some systems that become scale-free for certain special values of their governing parameters. The point defined by such a special value is called a "continuous phase transition" and the argument given above implies that at such a point the observable quantities in the system should adopt a power-law distribution. This indeed is seen experimentally and the distributions so generated provided the original motivation for the study of power laws in physics (although most experimentally observed power laws are probably not the result of phase transitions—a variety of other mechanisms produce power-law behaviour as well, as we will shortly see).
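The scale-free property of Eq. (30) is easy to check numerically: for a power law the ratio p(bx)/p(x) is the constant g(b) = b^(−α) whatever the value of x, whereas for an exponential it is not. A minimal sketch (the exponent 2.5 and the test points are arbitrary choices):

```python
import math

# Compare p(bx)/p(x) at several scales x for two candidate densities.
power = lambda x: x ** -2.5        # pure power law (unnormalized)
expo = lambda x: math.exp(-x)      # exponential (unnormalized)

b = 2.0
ratios_power = [power(b * x) / power(x) for x in (1.0, 10.0, 100.0)]
ratios_expo = [expo(b * x) / expo(x) for x in (1.0, 10.0, 100.0)]

print(ratios_power)   # identical: 2^(-2.5), independent of the scale x
print(ratios_expo)    # varies strongly with x: not scale-free
```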
F. Power laws for discrete variables
So far I have focused on power-law distributions for continuous real variables, but many of the quantities we deal with in practical situations are in fact discrete—usually integers. For instance, populations of cities, numbers of citations to papers or numbers of copies of books sold are all integer quantities. In most cases, the distinction is not very important. The power law is obeyed only in the tail of the distribution where the values measured are so large that, to all intents and purposes, they can be considered continuous. Technically however, power-law distributions should be defined slightly differently for integer quantities.
If k is an integer variable, then one way to proceed is to declare that it follows a power law if the probability pk of measuring the value k obeys

pk = C k^(−α),   (36)

for some constant exponent α. Clearly this distribution cannot hold all the way down to k = 0, since it diverges there, but it could in theory hold down to k = 1. If we discard any data for k = 0, the constant C would then be given by the normalization condition

1 = Σ_{k=1}^∞ pk = C Σ_{k=1}^∞ k^(−α) = C ζ(α),   (37)

where ζ(α) is the Riemann ζ-function. Rearranging, C = 1/ζ(α) and

pk = k^(−α)/ζ(α).   (38)
If, as is usually the case, the power-law behaviour is seen only in the tail of the distribution, for values k ≥ kmin, then the equivalent expression is

pk = k^(−α)/ζ(α, kmin),   (39)

where ζ(α, kmin) = Σ_{k=kmin}^∞ k^(−α) is the generalized or incomplete ζ-function.

Most of the results of the previous sections can be generalized to the case of discrete variables, although the mathematics is usually harder and often involves special functions in place of the more tractable integrals of the continuous case.
It has occasionally been proposed that Eq. (36) is not the best generalization of the power law to the discrete case. An alternative and often more convenient form is

pk = C Γ(k)Γ(α)/Γ(k + α) = C B(k, α),   (40)

where B(a, b) is, as before, the Legendre beta-function, Eq. (19). As mentioned in Section III.C, the beta-function behaves as a power law B(k, α) ∼ k^(−α) for large k and so the distribution has the desired asymptotic form. Simon [36] proposed that Eq. (40) be called the Yule distribution, after Udny Yule who derived it as the limiting distribution in a certain stochastic process [37], and this name is often used today. Yule's result is described in Section IV.D.
The Yule distribution is nice because sums involving it can frequently be performed in closed form, where sums involving Eq. (36) can only be written in terms of special functions. For instance, the normalizing constant C for the Yule distribution is given by

1 = C Σ_{k=1}^∞ B(k, α) = C/(α − 1),   (41)

and hence C = α − 1 and

pk = (α − 1) B(k, α).   (42)

The first and second moments (i.e., the mean and mean square of the distribution) are

〈k〉 = (α − 1)/(α − 2),   〈k²〉 = (α − 1)²/[(α − 2)(α − 3)],   (43)

and there are similarly simple expressions corresponding to many of our earlier results for the continuous case.
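The closed-form results of Eqs. (41)–(43) can be checked by direct summation. The sketch below truncates the sums at an arbitrary cutoff and uses log-gamma arithmetic to avoid overflow; it also confirms the B(k, α) ∼ k^(−α) tail:

```python
import math

def yule_pk(k, alpha):
    """Yule distribution, Eq. (42): pk = (alpha - 1) B(k, alpha),
    with the beta-function evaluated through log-gammas."""
    log_b = math.lgamma(k) + math.lgamma(alpha) - math.lgamma(k + alpha)
    return (alpha - 1.0) * math.exp(log_b)

alpha, kmax = 3.0, 200000
pks = [yule_pk(k, alpha) for k in range(1, kmax + 1)]

norm = sum(pks)
mean = sum(k * p for k, p in zip(range(1, kmax + 1), pks))
tail_ratio = yule_pk(100000, alpha) / 100000 ** -alpha

print(norm)         # near 1, confirming Eq. (41)
print(mean)         # near (alpha - 1)/(alpha - 2) = 2, confirming Eq. (43)
print(tail_ratio)   # near (alpha - 1) Gamma(alpha) = 4: power-law tail
```

For α = 3 the terms have the simple closed form pk = 4/[k(k + 1)(k + 2)], which telescopes, so the truncation error of the sums is easy to bound by hand.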
IV. MECHANISMS FOR GENERATING POWER-LAW DISTRIBUTIONS
In this section we look at possible candidate mechanisms by which power-law distributions might arise in natural and man-made systems. Some of the possibilities that have been suggested are quite complex—notably the physics of critical phenomena and the tools of the renormalization group that are used to analyse it. But let us start with some simple algebraic methods of generating power-law functions and progress to the more involved mechanisms later.
A. Combinations of exponentials
A much more common distribution than the power law is the exponential, which arises in many circumstances, such as survival times for decaying atomic nuclei or the Boltzmann distribution of energies in statistical mechanics. Suppose some quantity y has an exponential distribution:

p(y) ∼ e^(ay).   (44)

The constant a might be either negative or positive. If it is positive then there must also be a cutoff on the distribution—a limit on the maximum value of y—so that the distribution is normalizable.
Now suppose that the real quantity we are interested in is not y but some other quantity x, which is exponentially related to y thus:

x ∼ e^(by),   (45)

with b another constant, also either positive or negative. Then the probability distribution of x is

p(x) = p(y) dy/dx ∼ e^(ay)/(b e^(by)) = x^(−1+a/b)/b,   (46)

which is a power law with exponent α = 1 − a/b.
A version of this mechanism was used by Miller [38] to
mechanism was used by Miller [38] to
explain the power-law distribution of the frequencies ofwords as
follows (see also [39]). Suppose we type ran-domly on a
typewriter,11 pressing the space bar withprobability qs per stroke
and each letter with equal prob-ability ql per stroke. If there are
m letters in the alpha-bet then ql = (1− qs)/m. (In this simplest
version of theargument we also type no punctuation, digits or
othernon-letter symbols.) Then the frequency x with whicha
particular word with y letters (followed by a space)occurs is
x =
[
1 − qsm
]y
qs ∼ eby, (47)
where b = ln(1− qs)− lnm. The number (or fraction) ofdistinct
possible words with length between y and y +dygoes up exponentially
as p(y) ∼ my = eay with a = lnm.
11 This argument is sometimes called the "monkeys with typewriters" argument, the monkey being the traditional exemplar of a random typist.
Thus, following our argument above, the distribution of frequencies of words has the form p(x) ∼ x^(−α) with

α = 1 − a/b = [2 ln m − ln(1 − qs)] / [ln m − ln(1 − qs)].   (48)

For the typical case where m is reasonably large and qs quite small this gives α ≈ 2 in approximate agreement with Table I.
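Eq. (48) is easy to evaluate. The sketch below does so for an English-like alphabet; the values m = 26 and qs = 0.18 are illustrative assumptions, not fitted parameters:

```python
import math

def miller_alpha(m, qs):
    """Eq. (48): Zipf exponent for random typing with an m-letter
    alphabet and space probability qs per keystroke."""
    return ((2.0 * math.log(m) - math.log(1.0 - qs))
            / (math.log(m) - math.log(1.0 - qs)))

# Large alphabet, small space probability: exponent close to 2.
print(miller_alpha(26, 0.18))
# Tiny alphabet: the exponent drops noticeably below 2
# (the formula always gives a value between 1 and 2).
print(miller_alpha(2, 0.5))
```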
This is a reasonable theory as far as it goes, but real text is not made up of random letters. Most combinations of letters don't occur in natural languages; most are not even pronounceable. We might imagine that some constant fraction of possible letter sequences of a given length would correspond to real words and the argument above would then work just fine when applied to that fraction, but upon reflection this suggestion is obviously bogus. It is clear for instance that very long words simply don't exist in most languages, although there are exponentially many possible combinations of letters available to make them up. This observation is backed up by empirical data. In Fig. 7a we show a histogram of the lengths of words occurring in the text of Moby Dick, and one would need a particularly vivid imagination to convince oneself that this histogram follows anything like the exponential assumed by Miller's argument. (In fact, the curve appears roughly to follow a log-normal [33].)
There may still be some merit in Miller's argument however. The problem may be that we are measuring word "length" in the wrong units. Letters are not really the basic units of language. Some basic units are letters, but some are groups of letters. The letters "th" for example often occur together in English and make a single sound, so perhaps they should be considered to be a separate symbol in their own right and contribute only one unit to the word length?
Following this idea to its logical conclusion we can imagine replacing each fundamental unit of the language—whatever that is—by its own symbol and then measuring lengths in terms of numbers of symbols. The pursuit of ideas along these lines led Claude Shannon in the 1940s to develop the field of information theory, which gives a precise prescription for calculating the number of symbols necessary to transmit words or any other data [40, 41]. The units of information are bits and the true "length" of a word can be considered to be the number of bits of information it carries. Shannon showed that if we regard words as the basic divisions of a message, the information y carried by any particular word is

y = −k ln x,   (49)

where x is the frequency of the word as before and k is a constant. (The reader interested in finding out more about where this simple relation comes from is recommended to look at the excellent introduction to information theory by Cover and Thomas [42].)
But this has precisely the form that we want. Inverting it we have x = e^(−y/k), and if the probability distribution of
FIG. 7 (a) Histogram of the lengths in letters of all distinct words in the text of the novel Moby Dick. (b) Histogram of the information content à la Shannon of words in Moby Dick. The former does not, by any stretch of the imagination, follow an exponential, but the latter could easily be said to do so. (Note that the vertical axes are logarithmic.)
the "lengths" measured in terms of bits is also exponential as in Eq. (44) we will get our power-law distribution. Figure 7b shows the latter distribution, and indeed it follows a nice exponential—much better than Fig. 7a.
This is still not an entirely satisfactory explanation. Having made the shift from pure word length to information content, our simple count of the number of words of length y—that it goes exponentially as m^y—is no longer valid, and now we need some reason why there should be exponentially more distinct words in the language of high information content than of low. That this is the case is experimentally verified by Fig. 7b, but the reason must be considered still a matter of debate. Some possibilities are discussed by, for instance, Mandelbrot [43] and more recently by Mitzenmacher [19].
Another example of the "combination of exponentials" mechanism has been discussed by Reed and Hughes [44]. They consider a process in which a set of items, piles or groups each grows exponentially in time, having size x ∼ e^(bt) with b > 0. For instance, populations of organisms reproducing freely without resource constraints grow exponentially. Items also have some fixed probability of dying per unit time (populations might have a stochastically constant probability of extinction), so that the times t at which they die are exponentially distributed p(t) ∼ e^(at) with a < 0.

These functions again follow the form of Eqs. (44) and (45) and result in a power-law distribution of the sizes x of the items or groups at the time they die. Reed and Hughes suggest that variations on this argument may explain the sizes of biological taxa, incomes and cities, among other things.
B. Inverses of quantities
Suppose some quantity y has a distribution p(y) that passes through zero, thus having both positive and negative values. And suppose further that the quantity we are really interested in is the reciprocal x = 1/y, which will have distribution

p(x) = p(y) dy/dx = −p(y)/x².   (50)
The large values of x, those in the tail of the distribution, correspond to the small values of y close to zero and thus the large-x tail is given by

p(x) ∼ x^(−2),   (51)

where the constant of proportionality is p(y = 0). More generally, any quantity x = y^(−γ) for some γ will have a power-law tail to its distribution p(x) ∼ x^(−α), with α = 1 + 1/γ. It is not clear who the first author or authors were to describe this mechanism,12 but clear descriptions have been given recently by Bouchaud [45], Jan et al. [46] and Sornette [47].
One might argue that this mechanism merely generates a power law by assuming another one: the power-law relationship between x and y generates a power-law distribution for x. This is true, but the point is that the mechanism takes some physical power-law relationship between x and y—not a stochastic probability distribution—and from that generates a power-law probability distribution. This is a non-trivial result.
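The mechanism can be checked with any density that is nonzero at the origin. The sketch below uses a standard normal for y (an arbitrary choice) and fits the tail of x = 1/y with the standard maximum likelihood estimator, which should return an exponent near the value 2 predicted by Eq. (51):

```python
import math
import random

# y drawn from a density nonzero at the origin; the reciprocal
# x = 1/y should develop the x^(-2) tail of Eq. (51).
random.seed(3)
xs = [abs(1.0 / y)
      for y in (random.gauss(0.0, 1.0) for _ in range(1000000))
      if y != 0.0]

# Fit the tail above a threshold xmin with the standard maximum
# likelihood estimator for a power-law exponent.
xmin = 10.0
tail = [x for x in xs if x >= xmin]
alpha_hat = 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)
print(len(tail), alpha_hat)   # exponent near 2
```

The threshold matters: far above xmin = 10 the density of y is essentially flat at p(0), so the tail is very nearly a pure x^(−2) power law and the fit is clean.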
One circumstance in which this mechanism arises is in measurements of the fractional change in a quantity. For instance, Jan et al. [46] consider one of the most famous systems in theoretical physics, the Ising model of a magnet. In its paramagnetic phase, the Ising model has a magnetization that fluctuates around zero. Suppose we measure the magnetization m at uniform intervals and calculate the fractional change δ = (∆m)/m between each successive pair of measurements. The change ∆m is roughly normally distributed and has a typical size set by the width of that normal distribution. The 1/m on the other hand produces a power-law tail when small values of m coincide with large values of ∆m, so that the tail of the distribution of δ follows p(δ) ∼ δ^(−2) as above.
In Fig. 8 I show a cumulative histogram of measurements of δ for simulations of the Ising model on a square lattice, and the power-law distribution is clearly visible. Using Eq. (5), the value of the exponent is α = 1.98 ± 0.04, in good agreement with the expected value of 2.
12 A correspondent tells me that a similar mechanism was described in an astrophysical context by Chandrasekhar in a paper in 1943, but I have been unable to confirm this.
FIG. 8 Cumulative histogram of the magnetization fluctuations of a 128 × 128 nearest-neighbour Ising model on a square lattice. The model was simulated at a temperature of 2.5 times the spin-spin coupling for 100 000 time steps using the cluster algorithm of Swendsen and Wang [48] and the magnetization per spin measured at intervals of ten steps. The fluctuations were calculated as the ratio δi = 2(mi+1 − mi)/(mi+1 + mi).
C. Random walks
Many properties of random walks are distributed according to power laws, and this could explain some power-law distributions observed in nature. In particular, a randomly fluctuating process that undergoes "gambler's ruin",13 i.e., that ends when it hits zero, has a power-law distribution of possible lifetimes.
Consider a random walk in one dimension, in which a walker takes a single step randomly one way or the other along a line in each unit of time. Suppose the walker starts at position 0 on the line and let us ask what the probability is that the walker returns to position 0 for the first time at time t (i.e., after exactly t steps). This is the so-called first return time of the walk and represents the lifetime of a gambler's ruin process. A trick for answering this question is depicted in Fig. 9. We consider first the unconstrained problem in which the walk is allowed to return to zero as many times as it likes, before returning there again at time t. Let us denote the probability of this event as ut. Let us also denote by ft the probability that the first return time is t. We note that both of these probabilities are non-zero only for even values of their
13 Gambler's ruin is so called because a gambler's night of betting ends when his or her supply of money hits zero (assuming the gambling establishment declines to offer him or her a line of credit).
FIG. 9 The position of a one-dimensional random walker (vertical axis) as a function of time (horizontal axis). The probability u2n that the walk returns to zero at time t = 2n is equal to the probability f2m that it returns to zero for the first time at some earlier time t = 2m, multiplied by the probability u2n−2m that it returns again a time 2n − 2m later, summed over all possible values of m. We can use this observation to write a consistency relation, Eq. (52), that can be solved for ft, Eq. (60).
arguments since there is no way to get back to zero in any odd number of steps.
As Fig. 9 illustrates, the probability ut = u2n, with n integer, can be written

u2n = 1 if n = 0;   u2n = Σ_{m=1}^n f2m u2n−2m if n ≥ 1,   (52)

where m is also an integer and we define f0 = 0 and u0 = 1. This equation can conveniently be solved for f2n using a generating function approach. We define

U(z) = Σ_{n=0}^∞ u2n z^n,   F(z) = Σ_{n=1}^∞ f2n z^n.   (53)
Then, multiplying Eq. (52) throughout by z^n and summing, we find

U(z) = 1 + Σ_{n=1}^∞ Σ_{m=1}^n f2m u2n−2m z^n
     = 1 + Σ_{m=1}^∞ f2m z^m Σ_{n=m}^∞ u2n−2m z^(n−m)
     = 1 + F(z) U(z).   (54)

So

F(z) = 1 − 1/U(z).   (55)
The function U(z) however is quite easy to calculate. The probability u2n that we are at position zero after 2n steps is

u2n = 2^(−2n) (2n choose n),   (56)
so14

U(z) = Σ_{n=0}^∞ (2n choose n) z^n/4^n = 1/√(1 − z).   (57)
And hence

F(z) = 1 − √(1 − z).   (58)
Expanding this function using the binomial theorem thus:

F(z) = (1/2) z + [(1/2 × 1/2)/2!] z² + [(1/2 × 1/2 × 3/2)/3!] z³ + …
     = Σ_{n=1}^∞ (2n choose n) / [(2n − 1) 2^(2n)] z^n,   (59)
and comparing this expression with Eq. (53), we immediately see that

f2n = (2n choose n) / [(2n − 1) 2^(2n)],   (60)

and we have our solution for the distribution of first return times.
Now consider the form of f2n for large n. Writing out the binomial coefficient as (2n choose n) = (2n)!/(n!)², we take logs thus:

ln f2n = ln(2n)! − 2 ln n! − 2n ln 2 − ln(2n − 1),   (61)

and use Stirling's formula ln n! ≈ n ln n − n + (1/2) ln n to get ln f2n ≈ (1/2) ln 2 − (1/2) ln n − ln(2n − 1), or
f2n ≈ √(2/n) / (2n − 1).   (62)
In the limit n → ∞, this implies that f2n ∼ n^(−3/2), or equivalently

ft ∼ t^(−3/2).   (63)
So the distribution of return times follows a power law with exponent α = 3/2. Note that the distribution has a divergent mean (because α ≤ 2). As discussed in Section III.C, in practice this implies that the mean is determined by the size of the sample. If we measure the first return times of a large number of random walks, the mean will of course be finite. But the more walks we measure, the larger that mean will become, without bound.
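These properties are easy to see in a direct simulation. The Python sketch below (illustrative, not from the original text; a cap on walk length keeps the heavy-tailed excursions from running indefinitely) samples first-return times of a simple random walk. Eq. (60) gives f_2 = 1/2, so about half of all walks should return after exactly two steps:

```python
import random

random.seed(1)

# Sample the first-return time of a simple random walk on the
# integers: start at 0, step +/-1 with equal probability, and stop
# when the walk first revisits 0.  The cap bounds the run time,
# since the t^(-3/2) tail produces occasional very long excursions.
def first_return(cap=5000):
    pos, t = 0, 0
    while t < cap:
        pos += random.choice((-1, 1))
        t += 1
        if pos == 0:
            return t
    return None  # did not return within the cap

times = [first_return() for _ in range(10_000)]
returned = [t for t in times if t is not None]

# Eq. (60) gives f_2 = 1/2, so roughly half of all sampled walks
# should return after exactly two steps.
frac_t2 = sum(1 for t in returned if t == 2) / len(times)
print(round(frac_t2, 2))
```

Raising the cap and the number of walks makes the growing sample mean described above directly visible: ever longer return times keep entering the sample.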
As an example application, the random walk can be considered a simple model for the lifetime of biological taxa. A taxon is a branch of the evolutionary tree, a group of species all descended by repeated speciation from a common ancestor.15 The ranks of the Linnean hierarchy—genus, family, order and so forth—are examples of taxa. If a taxon gains and loses species at random over time, then the number of species performs a random walk, the taxon becoming extinct when the number of species reaches zero for the first (and only) time. (This is one example of "gambler's ruin".) Thus the time for which taxa live should have the same distribution as the first return times of random walks.

14 The enthusiastic reader can easily derive this result for him- or herself by expanding (1 − z)^{−1/2} using the binomial theorem.
In fact, it has been argued that the distribution of the lifetimes of genera in the fossil record does indeed follow a power law [49]. The best fits to the available fossil data put the value of the exponent at α = 1.7 ± 0.3, which is in agreement with the simple random walk model [50].16
D. The Yule process
One of the most convincing and widely applicable mechanisms for generating power laws is the Yule process, whose invention was, coincidentally, also inspired by observations of the statistics of biological taxa as discussed in the previous section.
In addition to having a (possibly) power-law distribution of lifetimes, biological taxa also have a very convincing power-law distribution of sizes. That is, the distribution of the number of species in a genus, family or other taxonomic group appears to follow a power law quite closely. This phenomenon was first reported by Willis and Yule in 1922 for the example of flowering plants [15]. Three years later, Yule [37] offered an explanation using a simple model that has since found wide application in other areas. He argued as follows.
Suppose first that new species appear but they never die; species are only ever added to genera and never removed. This differs from the random walk model of the last section, and certainly from reality as well. It is believed that in practice all species and all genera become extinct in the end. But let us persevere; there is nonetheless much of worth in Yule's simple model.
15 Modern phylogenetic analysis, the quantitative comparison of species' genetic material, can provide a picture of the evolutionary tree and hence allow the accurate "cladistic" assignment of species to taxa. For prehistoric species, however, whose genetic material is not usually available, determination of evolutionary ancestry is difficult, so classification into taxa is based instead on morphology, i.e., on the shapes of organisms. It is widely acknowledged that such classifications are subjective and that the taxonomic assignments of fossil species are probably riddled with errors.

16 To be fair, I consider the power law for the distribution of genus lifetimes to fall in the category of "tenuous" identifications to which I alluded in footnote 7. This theory should be taken with a pinch of salt.

Species are added to genera by speciation, the splitting
of one species into two, which is known to happen by a variety of mechanisms, including competition for resources, spatial separation of breeding populations and genetic drift. If we assume that this happens at some stochastically constant rate, then it follows that a genus with k species in it will gain new species at a rate proportional to k, since each of the k species has the same chance per unit time of dividing in two. Let us further suppose that occasionally, say once every m speciation events, the new species produced is, by chance, sufficiently different from the others in its genus as to be considered the founder member of an entire new genus. (To be clear, we define m such that m species are added to pre-existing genera and then one species forms a new genus. So m + 1 new species appear for each new genus and there are m + 1 species per genus on average.) Thus the number of genera goes up steadily in this model, as does the number of species within each genus.
We can analyse this Yule process mathematically as follows.17 Let us measure the passage of time in the model by the number of genera n. At each time-step one new species founds a new genus, thereby increasing n by 1, and m other species are added to various pre-existing genera which are selected in proportion to the number of species they already have. We denote by p_{k,n} the fraction of genera that have k species when the total number of genera is n. Thus the number of such genera is n p_{k,n}. We now ask what the probability is that the next species added to the system happens to be added to a particular genus i having k_i species in it already. This probability is proportional to k_i, and so when properly normalized is just k_i / \sum_i k_i. But \sum_i k_i is simply the total number of species, which is n(m + 1). Furthermore, between the appearance of the nth and the (n + 1)th genera, m other new species are added, so the probability that genus i gains a new species during this interval is m k_i / [n(m + 1)]. And the total expected number of genera of size k that gain a new species in the same interval is
    \frac{mk}{n(m+1)} \times n p_{k,n} = \frac{m}{m+1}\, k p_{k,n}.    (64)
Now we observe that the number of genera with k species will decrease on each time step by exactly this number, since by gaining a new species they become genera with k + 1 instead. At the same time the number increases because of species that previously had k − 1 species and now have an extra one. Thus we can write a master equation for the new number (n + 1)p_{k,n+1} of genera with k species thus:

    (n+1) p_{k,n+1} = n p_{k,n} + \frac{m}{m+1} \bigl[ (k-1) p_{k-1,n} - k p_{k,n} \bigr].    (65)

17 Yule's analysis of the process was considerably more involved than the one presented here, essentially because the theory of stochastic processes as we now know it did not yet exist in his time. The master equation method we employ is a relatively modern innovation, introduced in this context by Simon [36].

The only exception to this equation is for genera of size 1, which instead obey the equation
    (n+1) p_{1,n+1} = n p_{1,n} + 1 - \frac{m}{m+1}\, p_{1,n},    (66)

since by definition exactly one new such genus appears on each time step.
Now we ask what form the distribution of the sizes of genera takes in the limit of long times. To do this we allow n → ∞ and assume that the distribution tends to some fixed value p_k = \lim_{n→∞} p_{k,n} independent of n. Then Eq. (66) becomes p_1 = 1 − m p_1/(m + 1), which has the solution

    p_1 = \frac{m+1}{2m+1}.    (67)
And Eq. (65) becomes

    p_k = \frac{m}{m+1} \bigl[ (k-1) p_{k-1} - k p_k \bigr],    (68)
which can be rearranged to read

    p_k = \frac{k-1}{k+1+1/m}\, p_{k-1},    (69)
and then iterated to get

    p_k = \frac{(k-1)(k-2) \ldots 1}{(k+1+1/m)(k+1/m) \ldots (3+1/m)}\, p_1
        = (1+1/m) \frac{(k-1)(k-2) \ldots 1}{(k+1+1/m)(k+1/m) \ldots (2+1/m)},    (70)
where I have made use of Eq. (67). This can be simplified further by making use of a handy property of the Γ-function, Eq. (20), that Γ(a) = (a − 1)Γ(a − 1). Using this, and noting that Γ(1) = 1, we get

    p_k = (1+1/m) \frac{\Gamma(k)\, \Gamma(2+1/m)}{\Gamma(k+2+1/m)} = (1+1/m)\, B(k, 2+1/m),    (71)
where B(a, b) is again the beta-function, Eq. (19). This, we note, is precisely the distribution defined in Eq. (40), which Simon called the Yule distribution. Since the beta-function has a power-law tail B(a, b) ∼ a^{−b}, we can immediately see that p_k also has a power-law tail with an exponent

    \alpha = 2 + \frac{1}{m}.    (72)
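These predictions can be checked with a direct simulation of Yule's model. In the Python sketch below (illustrative, not from the original text), every species records the genus it belongs to, so drawing a uniformly random species selects a genus in proportion to its size, exactly the preferential rule described above. With m = 2 the fraction of one-species genera should approach p_1 = (m + 1)/(2m + 1) = 0.6 from Eq. (67):

```python
import random
from collections import Counter

random.seed(42)

m = 2              # species added to pre-existing genera per new genus
n_genera = 20_000

# `genus_of` lists, for every species, the genus it belongs to.
# Drawing a uniformly random entry therefore picks a genus with
# probability proportional to its current number of species.
genus_of = [0]     # one genus with a single species to start
for g in range(1, n_genera):
    for _ in range(m):   # m preferential speciation events
        genus_of.append(genus_of[random.randrange(len(genus_of))])
    genus_of.append(g)   # one new species founds genus g

sizes = Counter(genus_of)            # genus -> number of species
p1 = sum(1 for s in sizes.values() if s == 1) / n_genera
print(round(p1, 3))  # Eq. (67) predicts p_1 = (m+1)/(2m+1) = 0.6
```

A histogram of the genus sizes produced this way also shows the power-law tail of Eq. (71), with exponent approaching 2 + 1/m as the number of genera grows.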
The mean number m + 1 of species per genus for the example of flowering plants is about 3, making m ≃ 2 and α ≃ 2.5. The actual exponent for the distribution found by Willis and Yule [15] is α = 2.5 ± 0.1, which is in excellent agreement with the theory.
Most likely this agreement is fortuitous, however. The Yule process is probably not a terribly realistic explanation for the distribution of the sizes of genera, principally because it ignores the fact that species (and genera) become extinct. However, it has been adapted and generalized by others to explain power laws in many other systems, most famously city sizes [36], paper citations [51, 52], and links to pages on the world wide web [53, 54]. The most general form of the Yule process is as follows.
Suppose we have a system composed of a collection of objects, such as genera, cities, papers, web pages and so forth. New objects appear every once in a while as cities grow up or people publish new papers. Each object also has some property k associated with it, such as number of species in a genus, people in a city or citations to a paper, that is reputed to obey a power law, and it is this power law that we wish to explain. Newly appearing objects have some initial value of k which we will denote k_0. New genera initially have only a single species k_0 = 1, but new towns or cities might have quite a large initial population—a single person living in a house somewhere is unlikely to constitute a town in their own right but k_0 = 100 people might do so. The value of k_0 can also be zero in some cases: newly published papers usually have zero citations, for instance.

In between the appearance of one object and the next, m new species/people/citations etc. are added to the entire system. That is, some cities or papers will get new people or citations, but not necessarily all will. And in the simplest case these are added to objects in proportion to the number that the object already has. Thus the probability of a city gaining a new member is proportional to the number already there; the probability of a paper getting a new citation is proportional to the number it already has. In many cases this seems like a natural process. For example, a paper that already has many citations is more likely to be discovered during a literature search and hence more likely to be cited again. Simon [36] dubbed this type of "rich-get-richer" process the Gibrat principle. Elsewhere it also goes by the names of the Matthew effect [55], cumulative advantage [51], or preferential attachment [53].
There is a problem, however, when k_0 = 0. For example, if new papers appear with no citations and garner citations in proportion to the number they currently have, which is zero, then no paper will ever get any citations! To overcome this problem one typically assigns new citations not in proportion simply to k, but to k + c, where c is some constant. Thus there are three parameters k_0, c and m that control the behaviour of the model.
By an argument exactly analogous to the one given above, one can
then derive the master equation
    (n+1) p_{k,n+1} = n p_{k,n} + m \frac{k-1+c}{k_0+c+m}\, p_{k-1,n} - m \frac{k+c}{k_0+c+m}\, p_{k,n}, \quad \text{for } k > k_0,    (73)

and

    (n+1) p_{k_0,n+1} = n p_{k_0,n} + 1 - m \frac{k_0+c}{k_0+c+m}\, p_{k_0,n}, \quad \text{for } k = k_0.    (74)
(Note that k is never less than k0, since each object appears
with k = k0 initially.)
Looking for stationary solutions of these equations as before, we define p_k = \lim_{n→∞} p_{k,n} and find that

    p_{k_0} = \frac{k_0+c+m}{(m+1)(k_0+c)+m},    (75)

and

    p_k = \frac{(k-1+c)(k-2+c) \ldots (k_0+c)}{(k-1+c+\alpha)(k-2+c+\alpha) \ldots (k_0+c+\alpha)}\, p_{k_0}
        = \frac{\Gamma(k+c)\, \Gamma(k_0+c+\alpha)}{\Gamma(k_0+c)\, \Gamma(k+c+\alpha)}\, p_{k_0},    (76)
where I have made use of the Γ-function notation introduced for Eq. (71) and, for reasons that will become clear in just a moment, I have defined α = 2 + (k_0 + c)/m. As before, this expression can also be written in terms of the beta-function, Eq. (19):

    p_k = \frac{B(k+c, \alpha)}{B(k_0+c, \alpha)}\, p_{k_0}.    (77)
Since the beta-function follows a power law in its tail, B(a, b) ∼ a^{−b}, the general Yule process generates a power-law distribution p_k ∼ k^{−α} with exponent related to the three parameters of the process according to

    \alpha = 2 + \frac{k_0 + c}{m}.    (78)
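The power-law tail of Eq. (77) is easy to verify numerically: evaluating B(k + c, α) through the log of the Γ-function and measuring the log-log slope over a decade of large k should give a value close to −α. A minimal Python sketch (the helper names `log_beta` and `tail_slope` are illustrative, not from the original text):

```python
from math import lgamma, log

# log of the beta-function B(a, b), computed via the log-Gamma
# function to avoid overflow at large arguments
def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Log-log slope of p_k between two large values of k, using
# p_k proportional to B(k + c, alpha), Eq. (77), with
# alpha = 2 + (k0 + c)/m, Eq. (78).  For a power-law tail
# the slope should approach -alpha.
def tail_slope(k0, c, m, k_lo=1000, k_hi=2000):
    alpha = 2 + (k0 + c) / m
    return ((log_beta(k_hi + c, alpha) - log_beta(k_lo + c, alpha))
            / (log(k_hi) - log(k_lo)))

# Yule's original case (k0 = 1, c = 0, m = 2): alpha = 2.5
print(round(tail_slope(1, 0, 2), 3))
```

Trying other parameter combinations, such as k_0 = 0, c = 1, confirms that the measured slope tracks −[2 + (k_0 + c)/m] in each case.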
For example, the original Yule process for number of species per genus has c = 0 and k_0 = 1, which reproduces the result of Eq. (72). For citations of papers or links to web pages we have k_0 = 0 and we must have c > 0 to get any citations or links at all. So α = 2 + c/m. In his work
on citations Price [51] assumed that c = 1, so that paper citations have the same exponent α = 2 + 1/m as the standard Yule process, although there doesn't seem to be any very good reason for making this assumption. As we saw in Table I (and as Price himself also reported), real citations seem to have an exponent α ≃ 3, so we should expect c ≃ m. For the data from the Science Citation Index examined in Section II.A, the mean number m of citations per paper is 8.6. So we should put c ≃ 8.6 too if we want the Yule process to match the observed exponent.
The most widely studied model of links on the web, that of Barabási and Albert [53], assumes c = m so that α = 3, but again there doesn't seem to be a good reason for this assumption. The measured exponent for numbers of links to web sites is about α = 2.2, so if the Yule process is to match the data in this case, we should put c ≃ 0.2m.
However, the important point is that the Yule process is a plausible and general mechanism that can explain a number of the power-law distributions observed in nature and can produce a wide range of exponents to match the observations by suitable adjustments of the parameters. For several of the distributions shown in Fig. 4, especially citations, city populations and personal income, it is now the most widely accepted theory.
E. Phase transitions and critical phenomena
A completely different mechanism for generating power laws, one that has received a huge amount of attention over the past few decades from the physics community, is that of critical phenomena.
Some systems have only a single macroscopic length-scale, size-scale or time-scale governing them. A classic example is a magnet, which has a correlation length that measures the typical size of magnetic domains. Under certain circumstances this length-scale can diverge, leaving the system with no scale at all. As we will now see, such a system is "scale-free" in the sense of Section III.E and hence the distributions of macroscopic physical quantities have to follow power laws. Usually the circumstances under which the divergence takes place are very specific ones. The parameters of the system have to be tuned very precisely to produce the power-law behaviour. This is something of a disadvantage; it makes the divergence of length-scales an unlikely explanation for generic power-law distributions of the type highlighted in this paper. As we will shortly see, however, there are some elegant and interesting ways around this problem.
The precise point at which the length-scale in a system diverges is called a critical point or a phase transition. More specifically it is a continuous phase transition. (There are other kinds of phase transitions too.) Things that happen in the vicinity of continuous phase transitions are known as critical phenomena, of which power-law distributions are one example.
FIG. 10 The percolation model on a square lattice: squares on the lattice are coloured in independently at random with some probability p. In this example p = 1/2.
To better understand the physics of critical phenomena, let us explore one simple but instructive example, that of the "percolation transition". Consider a square lattice like the one depicted in Fig. 10 in which some of the squares have been coloured in. Suppose we colour each square with independent probability p, so that on average a fraction p of them are coloured in. Now we look at the clusters of coloured squares that form, i.e., the contiguous regions of adjacent coloured squares. We can ask, for instance, what the mean area 〈s〉 is of the cluster to which a randomly chosen square belongs. If that square is not coloured in then the area is zero. If it is coloured in but none of the adjacent ones is coloured in then the area is one, and so forth.
When p is small, only a few squares are coloured in and most coloured squares will be alone on the lattice, or maybe grouped in twos or threes. So 〈s〉 will be small. This situation is depicted in Fig. 11 for p = 0.3. Conversely, if p is large—almost 1, which is the largest value it can have—then most squares will be coloured in and they will almost all be connected together in one large cluster, the so-called spanning cluster. In this situation we say that the system percolates. Now the mean size of the cluster to which a vertex belongs is limited only by the size of the lattice itself and as we let the lattice size become large 〈s〉 also becomes large. So we have two distinctly different behaviours, one for small p in which 〈s〉 is small and doesn't depend on the size of the system, and one for large p in which 〈s〉 is much larger and increases with the size of the system.
And what happens in between these two extremes? As we increase p from small values, the value of 〈s〉 also increases. But at some point we reach the start of the regime in which 〈s〉 goes up with system size instead of staying constant. We now know that this point is at p = 0.5927462..., which is called the critical value of p and is denoted p_c. If the size of the lattice is large, then 〈s〉 also becomes large at this point, and in the limit where the lattice size goes to infinity 〈s〉 actually diverges. To illustrate this phenomenon, I show in Fig. 12 a plot of 〈s〉 from simulations of the percolation model and the divergence is clear.

FIG. 11 Three examples of percolation systems on 100 × 100 square lattices with p = 0.3, p = p_c = 0.5927... and p = 0.9. The first and last are well below and above the critical point respectively, while the middle example is precisely at it.
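The behaviour of 〈s〉 is simple to reproduce in simulation. The Python sketch below (illustrative, not from the original text; it uses a modest lattice rather than the 1000 × 1000 lattices of Fig. 12) colours squares with probability p, labels clusters with a flood fill, and computes 〈s〉 as the sum of squared cluster sizes over the number of lattice squares, since uncoloured squares contribute zero:

```python
import random

random.seed(0)

# Mean area <s> of the cluster containing a randomly chosen square
# for site percolation on an L x L lattice: a square in a cluster of
# size s contributes s, and uncoloured squares contribute 0, so
# <s> = (sum over clusters of s^2) / L^2.
def mean_cluster_size(p, L=200):
    grid = [[random.random() < p for _ in range(L)] for _ in range(L)]
    seen = [[False] * L for _ in range(L)]
    total = 0
    for i in range(L):
        for j in range(L):
            if grid[i][j] and not seen[i][j]:
                size, stack = 0, [(i, j)]
                seen[i][j] = True
                while stack:  # iterative flood fill over neighbours
                    x, y = stack.pop()
                    size += 1
                    for nx, ny in ((x+1, y), (x-1, y), (x, y+1), (x, y-1)):
                        if (0 <= nx < L and 0 <= ny < L
                                and grid[nx][ny] and not seen[nx][ny]):
                            seen[nx][ny] = True
                            stack.append((nx, ny))
                total += size * size
    return total / (L * L)

# <s> is small well below p_c but grows rapidly as p approaches it
print(mean_cluster_size(0.3), mean_cluster_size(0.55))
```

Sweeping p over a grid of values with this function and averaging over many lattices reproduces the divergence near p_c shown in Fig. 12, limited only by the finite lattice size.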
Now consider not just the mean cluster size but the entire distribution of cluster sizes. Let p(s) be the probability that a randomly chosen square belongs to a cluster of area s. In general, what forms can p(s) take as a function of s? The important point to notice is that p(s), being a probability distribution, is a dimensionless quantity—just a number—but s is an area. We could measure s in terms of square metres, or whatever units the lattice is calibrated in. The average 〈s〉 is also an area and then there is the area of a unit square itself, which we will denote a. Other than these three quantities, however, there are no other independent parameters with dimensions in this problem. (There is the area of the whole lattice, but we are considering the limit where that becomes infinite, so it's out of the picture.)
If we want to make a dimensionless function p(s) out of these three dimensionful parameters, there are three
FIG. 12 The mean area of the cluster to which a randomly chosen square belongs for the percolation model described in the text, calculated from an average over 1000 simulations on a 1000 × 1000 squar