arXiv:cond-mat/0412004v2  9 Jan 2005

Power laws, Pareto distributions and Zipf's law

M. E. J. Newman
Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109, U.S.A.
When the probability of measuring a particular value of some quantity varies inversely as a power of that value, the quantity is said to follow a power law, also known variously as Zipf's law or the Pareto distribution. Power laws appear widely in physics, biology, earth and planetary sciences, economics and finance, computer science, demography and the social sciences. For instance, the distributions of the sizes of cities, earthquakes, solar flares, moon craters, wars and people's personal fortunes all appear to follow power laws. The origin of power-law behaviour has been a topic of debate in the scientific community for more than a century. Here we review some of the empirical evidence for the existence of power-law forms and the theories proposed to explain them.
I. INTRODUCTION
Many of the things that scientists measure have a typical size or "scale"—a typical value around which individual measurements are centred. A simple example would be the heights of human beings. Most adult human beings are about 180cm tall. There is some variation around this figure, notably depending on sex, but we never see people who are 10cm tall, or 500cm. To make this observation more quantitative, one can plot a histogram of people's heights, as I have done in Fig. 1a. The figure shows the heights in centimetres of adult men in the United States measured between 1959 and 1962, and indeed the distribution is relatively narrow and peaked around 180cm. Another telling observation is the ratio of the heights of the tallest and shortest people. The Guinness Book of Records claims the world's tallest and shortest adult men (both now dead) as having had heights of 272cm and 57cm respectively, making the ratio 4.8. This is a relatively low value; as we will see in a moment, some other quantities have much higher ratios of largest to smallest.
Figure 1b shows another example of a quantity with a typical scale: the speeds in miles per hour of cars on the motorway. Again the histogram of speeds is strongly peaked, in this case around 75mph.
But not all things we measure are peaked around a typical value. Some vary over an enormous dynamic range, sometimes many orders of magnitude. A classic example of this type of behaviour is the sizes of towns and cities. The largest population of any city in the US is 8.00 million for New York City, as of the most recent (2000) census. The town with the smallest population is harder to pin down, since it depends on what you call a town. The author recalls in 1993 passing through the town of Milliken, Oregon, population 4, which consisted of one large house occupied by the town's entire human population, a wooden shack occupied by an extraordinary number of cats and a very impressive flea market. According to the Guinness Book, however, America's smallest town is Duffield, Virginia, with a population of 52. Whichever way you look at it, the ratio of largest to smallest population is at least 150 000. Clearly this is quite different from what we saw for heights of people. And an even more startling pattern is revealed when we look at the histogram of the sizes of cities, which is shown in Fig. 2.

In the left panel of the figure, I show a simple histogram of the distribution of US city sizes. The histogram is highly right-skewed, meaning that while the bulk of the distribution occurs for fairly small sizes—most US cities have small populations—there is a small number of cities with population much higher than the typical value, producing the long tail to the right of the histogram. This right-skewed form is qualitatively quite different from the histograms of people's heights, but is not itself very surprising. Given that we know there is a large dynamic range from the smallest to the largest city sizes, we can immediately deduce that there can only be a small number of very large cities. After all, in a country such as America with a total population of 300 million people, you could at most have about 40 cities the size of New York. And the 2700 cities in the histogram of Fig. 2 cannot have a mean population of more than 3 × 10^8/2700 ≈ 110 000.
What is surprising, on the other hand, is the right panel of Fig. 2, which shows the histogram of city sizes again, but this time replotted with logarithmic horizontal and vertical axes. Now a remarkable pattern emerges: the histogram, when plotted in this fashion, follows quite closely a straight line. This observation seems first to have been made by Auerbach [1], although it is often attributed to Zipf [2]. What does it mean? Let p(x) dx be the fraction of cities with population between x and x + dx. If the histogram is a straight line on log-log scales, then ln p(x) = −α ln x + c, where α and c are constants. (The minus sign is optional, but convenient since the slope of the line in Fig. 2 is clearly negative.) Taking the exponential of both sides, this is equivalent to:

p(x) = C x^{−α},   (1)

with C = e^c. Distributions of the form (1) are said to follow a power law. The constant α is called the exponent of the power law. (The constant C is mostly uninteresting; once α
FIG. 1 Left: histogram of heights in centimetres of American males. Data from the National Health Examination Survey, 1959–1962 (US Department of Health and Human Services). Right: histogram of speeds in miles per hour of cars on UK motorways. Data from Transport Statistics 2003 (UK Department for Transport).
FIG. 2 Left: histogram of the populations of all US cities with population of 10 000 or more. Right: another histogram of the same data, but plotted on logarithmic scales. The approximate straight-line form of the histogram in the right panel implies that the distribution follows a power law. Data from the 2000 US Census.
is fixed, it is determined by the requirement that the distribution p(x) sum to 1; see Section III.A.)
Power-law distributions occur in an extraordinarily diverse range of phenomena. In addition to city populations, the sizes of earthquakes [3], moon craters [4], solar flares [5], computer files [6] and wars [7], the frequency of use of words in any human language [2, 8], the frequency of occurrence of personal names in most cultures [9], the numbers of papers scientists write [10], the number of citations received by papers [11], the number of hits on web pages [12], the sales of books, music recordings and almost every other branded commodity [13, 14], the numbers of species in biological taxa [15], people's annual incomes [16] and a host of other variables all follow power-law distributions.1
Power-law distributions are the subject of this article. In the following sections, I discuss ways of detecting power-law behaviour, give empirical evidence for power laws in a variety of systems and describe some of the mechanisms by which power-law behaviour can arise.

Readers interested in pursuing the subject further may also wish to consult the reviews by Sornette [18] and Mitzenmacher [19], as well as the bibliography by Li.2

1 Power laws also occur in many situations other than the statistical distributions of quantities. For instance, Newton's famous 1/r^2 law for gravity has a power-law form with exponent α = 2. While such laws are certainly interesting in their own way, they are not the topic of this paper. Thus, for instance, there has in recent years been some discussion of the "allometric" scaling laws seen in the physiognomy and physiology of biological organisms [17], but since these are not statistical distributions they will not be discussed here.

2 http://linkage.rockefeller.edu/wli/zipf/.
FIG. 3 (a) Histogram of the set of 1 million random numbers described in the text, which have a power-law distribution with exponent α = 2.5. (b) The same histogram on logarithmic scales. Notice how noisy the results get in the tail towards the right-hand side of the panel. This happens because the number of samples in the bins becomes small and statistical fluctuations are therefore large as a fraction of sample number. (c) A histogram constructed using "logarithmic binning". (d) A cumulative histogram or rank/frequency plot of the same data. The cumulative distribution also follows a power law, but with an exponent of α − 1 = 1.5.
II. MEASURING POWER LAWS
Identifying power-law behaviour in either natural or man-made systems can be tricky. The standard strategy makes use of a result we have already seen: a histogram of a quantity with a power-law distribution appears as a straight line when plotted on logarithmic scales. Just making a simple histogram, however, and plotting it on log scales to see if it looks straight is, in most cases, a poor way to proceed.
Consider Fig. 3. This example shows a fake data set: I have generated a million random real numbers drawn from a power-law probability distribution p(x) = C x^{−α} with exponent α = 2.5, just for illustrative purposes.3
Panel (a) of the figure shows a normal histogram of the
3 This can be done using the so-called transformation method. If we can generate a random real number r uniformly distributed in the range 0 ≤ r < 1, then x = x_min (1 − r)^{−1/(α−1)} is a random power-law-distributed real number in the range x_min ≤ x < ∞ with exponent α. Note that there has to be a lower limit x_min on the range; the power-law distribution diverges as x → 0—see Section II.A.
numbers, produced by binning them into bins of equal size 0.1. That is, the first bin goes from 1 to 1.1, the second from 1.1 to 1.2, and so forth. On the linear scales used this produces a nice smooth curve.
To reveal the power-law form of the distribution it is better, as we have seen, to plot the histogram on logarithmic scales, and when we do this for the current data we see the characteristic straight-line form of the power-law distribution, Fig. 3b. However, the plot is in some respects not a very good one. In particular the right-hand end of the distribution is noisy because of sampling errors. The power-law distribution dwindles in this region, meaning that each bin only has a few samples in it, if any. So the fractional fluctuations in the bin counts are large and this appears as a noisy curve on the plot. One way to deal with this would be simply to throw out the data in the tail of the curve. But there is often useful information in those data and furthermore, as we will see in Section II.A, many distributions follow a power law only in the tail, so we are in danger of throwing out the baby with the bathwater.
An alternative solution is to vary the width of the bins in the histogram. If we are going to do this, we must also normalize the sample counts by the width of the
bins they fall in. That is, the number of samples in a bin of width ∆x should be divided by ∆x to get a count per unit interval of x. Then the normalized sample count becomes independent of bin width on average and we are free to vary the bin widths as we like. The most common choice is to create bins such that each is a fixed multiple wider than the one before it. This is known as logarithmic binning. For the present example, for instance, we might choose a multiplier of 2 and create bins that span the intervals 1 to 1.1, 1.1 to 1.3, 1.3 to 1.7 and so forth (i.e., the sizes of the bins are 0.1, 0.2, 0.4 and so forth). This means the bins in the tail of the distribution get more samples than they would if bin sizes were fixed, and this reduces the statistical errors in the tail. It also has the nice side-effect that the bins appear to be of constant width when we plot the histogram on log scales.
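A rough sketch of how logarithmic binning might be implemented, with the counts normalized by bin width as described above (function and parameter names are mine, not the author's):

```python
import math
import random

def log_bin(samples, xmin=1.0, ratio=2.0, nbins=20):
    """Bin samples into bins whose widths grow by a fixed ratio,
    normalizing each count by its bin width to give samples per
    unit interval of x."""
    counts = [0] * nbins
    for x in samples:
        if x < xmin:
            continue
        k = int(math.log(x / xmin) / math.log(ratio))
        if k < nbins:
            counts[k] += 1
    result = []
    for k in range(nbins):
        lo, hi = xmin * ratio ** k, xmin * ratio ** (k + 1)
        result.append((lo, counts[k] / (hi - lo)))  # (bin start, density)
    return result

# Demonstration on synthetic power-law data with alpha = 2.5, xmin = 1.
random.seed(1)
samples = [(1.0 - random.random()) ** (-1.0 / 1.5) for _ in range(100_000)]
densities = log_bin(samples)
```

Because each count is divided by its bin width, the returned densities trace the same p(x) regardless of the chosen multiplier.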
I used logarithmic binning in the construction of Fig. 2b, which is why the points representing the individual bins appear equally spaced. In Fig. 3c I have done the same for our computer-generated power-law data. As we can see, the straight-line power-law form of the histogram is now much clearer and can be seen to extend for at least a decade further than was apparent in Fig. 3b.
Even with logarithmic binning there is still some noise in the tail, although it is sharply decreased. Suppose the bottom of the lowest bin is at x_min and the ratio of the widths of successive bins is a. Then the kth bin extends from x_{k−1} = x_min a^{k−1} to x_k = x_min a^k and the expected number of samples falling in this interval is

∫_{x_{k−1}}^{x_k} p(x) dx = C ∫_{x_{k−1}}^{x_k} x^{−α} dx = C (a^{α−1} − 1)/(α − 1) · (x_min a^k)^{−α+1}.   (2)

Thus, so long as α > 1, the number of samples per bin goes down as k increases and the bins in the tail will have more statistical noise than those that precede them. As we will see in the next section, most power-law distributions occurring in nature have 2 ≤ α ≤ 3, so noisy tails are the norm.
Another, and in many ways a superior, method of plotting the data is to calculate a cumulative distribution function. Instead of plotting a simple histogram of the data, we make a plot of the probability P(x) that x has a value greater than or equal to x:

P(x) = ∫_x^∞ p(x′) dx′.   (3)
The plot we get is no longer a simple representation of the distribution of the data, but it is useful nonetheless. If the distribution follows a power law p(x) = C x^{−α}, then

P(x) = C ∫_x^∞ x′^{−α} dx′ = (C/(α − 1)) x^{−(α−1)}.   (4)

Thus the cumulative distribution function P(x) also follows a power law, but with a different exponent α − 1, which is 1 less than the original exponent. Thus, if we plot P(x) on logarithmic scales we should again get a straight line, but with a shallower slope.
But notice that there is no need to bin the data at all to calculate P(x). By its definition, P(x) is well-defined for every value of x and so can be plotted as a perfectly normal function without binning. This avoids all questions about what sizes the bins should be. It also makes much better use of the data: binning of data lumps all samples within a given range together into the same bin and so throws out any information that was contained in the individual values of the samples within that range. Cumulative distributions don't throw away any information; it's all there in the plot.
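In practice the empirical P(x) comes straight out of a sort, one point per sample and no bins anywhere. A minimal sketch (function name mine):

```python
def ccdf(samples):
    """Empirical cumulative distribution: for each sample value x,
    the fraction of samples greater than or equal to x. Sorting in
    decreasing order gives the points of a rank/frequency plot."""
    xs = sorted(samples, reverse=True)
    n = len(xs)
    return [(x, (rank + 1) / n) for rank, x in enumerate(xs)]

# Each sample becomes one plotted point; the largest value has P = 1/n.
points = ccdf([3.0, 1.0, 2.0])
```

Plotting these points on logarithmic scales gives exactly the kind of cumulative plot shown in Fig. 3d.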
Figure 3d shows our computer-generated power-law data as a cumulative distribution, and indeed we again see the tell-tale straight-line form of the power law, but with a shallower slope than before. Cumulative distributions like this are sometimes also called rank/frequency plots for reasons explained in Appendix A. Cumulative distributions with a power-law form are sometimes said to follow Zipf's law or a Pareto distribution, after two early researchers who championed their study. Since power-law cumulative distributions imply a power-law form for p(x), "Zipf's law" and "Pareto distribution" are effectively synonymous with "power-law distribution". (Zipf's law and the Pareto distribution differ from one another in the way the cumulative distribution is plotted—Zipf made his plots with x on the horizontal axis and P(x) on the vertical one; Pareto did it the other way around. This causes much confusion in the literature, but the data depicted in the plots are of course identical.4)
We know the value of the exponent α for our artificial data set since it was generated deliberately to have a particular value, but in practical situations we would often like to estimate α from observed data. One way to do this would be to fit the slope of the line in plots like Figs. 3b, c or d, and this is the most commonly used method. Unfortunately, it is known to introduce systematic biases into the value of the exponent [20], so it should not be relied upon. For example, a least-squares fit of a straight line to Fig. 3b gives α = 2.26 ± 0.02, which is clearly incompatible with the known value of α = 2.5 from which the data were generated.
An alternative, simple and reliable method for extracting the exponent is to employ the formula

α = 1 + n [ Σ_{i=1}^{n} ln(x_i / x_min) ]^{−1}.   (5)

Here the quantities x_i, i = 1 . . . n are the measured values of x and x_min is again the minimum value of x. (As
4 See http://www.hpl.hp.com/research/idl/papers/ranking/ for a useful discussion of these and related points.
discussed in the following section, in practical situations x_min usually corresponds not to the smallest value of x measured but to the smallest for which the power-law behaviour holds.) The derivation of this formula is given in Appendix B. An error estimate for α can be derived by a standard bootstrap or jackknife resampling method [21]; for large data sets of the type discussed in this paper, a bootstrap is normally the more computationally economical of the two.
Applying Eq. (5) to our present data gives an estimate of α = 2.500 ± 0.002 for the exponent, which agrees well with the known value of 2.5.
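Eq. (5) and the bootstrap error estimate are both simple to implement. A sketch in Python (function names are mine; the text does not prescribe an implementation):

```python
import math
import random

def mle_exponent(samples, xmin):
    """Estimate alpha via Eq. (5), keeping only samples with x >= xmin."""
    tail = [x for x in samples if x >= xmin]
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def bootstrap_error(samples, xmin, reps=100, seed=0):
    """Standard error on alpha from resampling the data with replacement."""
    rng = random.Random(seed)
    n = len(samples)
    ests = []
    for _ in range(reps):
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        ests.append(mle_exponent(resample, xmin))
    mean = sum(ests) / reps
    return math.sqrt(sum((e - mean) ** 2 for e in ests) / (reps - 1))

# Synthetic data with known alpha = 2.5 and xmin = 1.
random.seed(3)
data = [(1.0 - random.random()) ** (-1.0 / 1.5) for _ in range(20_000)]
alpha_hat = mle_exponent(data, 1.0)
err = bootstrap_error(data, 1.0, reps=50)
```

Note that `mle_exponent` discards values below x_min before counting n, in keeping with the caveat about n discussed later in this section.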
A. Examples of power laws
In Fig. 4 we show cumulative distributions of twelve different quantities measured in physical, biological, technological and social systems of various kinds. All have been proposed to follow power laws over some part of their range. The ubiquity of power-law behaviour in the natural world has led many scientists to wonder whether there is a single, simple, underlying mechanism linking all these different systems together. Several candidates for such mechanisms have been proposed, going by names like "self-organized criticality" and "highly optimized tolerance". However, the conventional wisdom is that there are actually many different mechanisms for producing power laws and that different ones are applicable to different cases. We discuss these points further in Section IV.
(a) Word frequency: Estoup [8] observed that the frequency with which words are used appears to follow a power law, and this observation was famously examined in depth and confirmed by Zipf [2]. Panel (a) of Fig. 4 shows the cumulative distribution of the number of times that words occur in a typical piece of English text, in this case the text of the novel Moby Dick by Herman Melville.5 Similar distributions are seen for words in other languages.
(b) Citations of scientific papers: As first observed by Price [11], the numbers of citations received by scientific papers appear to have a power-law distribution. The data in panel (b) are taken from the Science Citation Index, as collated by Redner [23], and are for papers published in 1981. The plot shows the cumulative distribution of the number of citations received by a paper between publication and June 1997.
5 The most common words in this case are, in order, "the", "of", "and", "a" and "to", and the same is true for most written English texts. Interestingly, however, it is not true for spoken English. The most common words in spoken English are, in order, "I", "and", "the", "to" and "that" [22].
(c) Web hits: The cumulative distribution of the number of "hits" received by web sites (i.e., servers, not pages) during a single day from a subset of the users of the AOL Internet service. The site with the most hits, by a long way, was yahoo.com. After Adamic and Huberman [12].
(d) Copies of books sold: The cumulative distribution of the total number of copies sold in America of the 633 bestselling books that sold 2 million or more copies between 1895 and 1965. The data were compiled painstakingly over a period of several decades by Alice Hackett, an editor at Publisher's Weekly [24]. The best selling book during the period covered was Benjamin Spock's The Common Sense Book of Baby and Child Care. (The Bible, which certainly sold more copies, is not really a single book, but exists in many different translations, versions and publications, and was excluded by Hackett from her statistics.) Substantially better data on book sales than Hackett's are now available from operations such as Nielsen BookScan, but unfortunately at a price this author cannot afford. I should be very interested to see a plot of sales figures from such a modern source.
(e) Telephone calls: The cumulative distribution of the number of calls received on a single day by 51 million users of AT&T long distance telephone service in the United States. After Aiello et al. [25]. The largest number of calls received by a customer in that day was 375 746, or about 260 calls a minute (obviously to a telephone number that has many people manning the phones). Similar distributions are seen for the number of calls placed by users and also for the numbers of email messages that people send and receive [26, 27].
(f) Magnitude of earthquakes: The cumulative distribution of the Richter (local) magnitude of earthquakes occurring in California between January 1910 and May 1992, as recorded in the Berkeley Earthquake Catalog. The Richter magnitude is defined as the logarithm, base 10, of the maximum amplitude of motion detected in the earthquake, and hence the horizontal scale in the plot, which is drawn as linear, is in effect a logarithmic scale of amplitude. The power law relationship in the earthquake distribution is thus a relationship between amplitude and frequency of occurrence. The data are from the National Geophysical Data Center, www.ngdc.noaa.gov.
(g) Diameter of moon craters: The cumulative distribution of the diameter of moon craters. Rather than measuring the (integer) number of craters of a given size on the whole surface of the moon, the vertical axis is normalized to measure number of craters per square kilometre, which is why the axis goes below 1, unlike the rest of the plots, since it is
FIG. 4 Cumulative distributions or "rank/frequency plots" of twelve quantities reputed to follow power laws. The distributions were computed as described in Appendix A. Data in the shaded regions were excluded from the calculations of the exponents in Table I. Source references for the data are given in the text. (a) Numbers of occurrences of words in the novel Moby Dick by Herman Melville. (b) Numbers of citations to scientific papers published in 1981, from time of publication until June 1997. (c) Numbers of hits on web sites by 60 000 users of the America Online Internet service for the day of 1 December 1997. (d) Numbers of copies of bestselling books sold in the US between 1895 and 1965. (e) Number of calls received by AT&T telephone customers in the US for a single day. (f) Magnitude of earthquakes in California between January 1910 and May 1992. Magnitude is proportional to the logarithm of the maximum amplitude of the earthquake, and hence the distribution obeys a power law even though the horizontal axis is linear. (g) Diameter of craters on the moon. Vertical axis is measured per square kilometre. (h) Peak gamma-ray intensity of solar flares in counts per second, measured from Earth orbit between February 1980 and November 1989. (i) Intensity of wars from 1816 to 1980, measured as battle deaths per 10 000 of the population of the participating countries. (j) Aggregate net worth in dollars of the richest individuals in the US in October 2003. (k) Frequency of occurrence of family names in the US in the year 1990. (l) Populations of US cities in the year 2000.
entirely possible for there to be less than one crater of a given size per square kilometre. After Neukum and Ivanov [4].
(h) Intensity of solar flares: The cumulative distribution of the peak gamma-ray intensity of solar flares. The observations were made between 1980 and 1989 by the instrument known as the Hard X-Ray Burst Spectrometer aboard the Solar Maximum Mission satellite launched in 1980. The spectrometer used a CsI scintillation detector to measure gamma-rays from solar flares and the horizontal axis in the figure is calibrated in terms of scintillation counts per second from this detector. The data are from the NASA Goddard Space Flight Center, umbra.nascom.nasa.gov/smm/hxrbs.html. See also Lu and Hamilton [5].
(i) Intensity of wars: The cumulative distribution of the intensity of 119 wars from 1816 to 1980. Intensity is defined by taking the number of battle deaths among all participant countries in a war, dividing by the total combined populations of the countries and multiplying by 10 000. For instance, the intensities of the First and Second World Wars were 141.5 and 106.3 battle deaths per 10 000 respectively. The worst war of the period covered was the small but horrifically destructive Paraguay-Bolivia war of 1932–1935 with an intensity of 382.4. The data are from Small and Singer [28]. See also Roberts and Turcotte [7].
(j) Wealth of the richest people: The cumulative distribution of the total wealth of the richest people in the United States. Wealth is defined as aggregate net worth, i.e., total value in dollars at current market prices of all an individual's holdings, minus their debts. For instance, when the data were compiled in 2003, America's richest person, William H. Gates III, had an aggregate net worth of $46 billion, much of it in the form of stocks of the company he founded, Microsoft Corporation. Note that net worth doesn't actually correspond to the amount of money individuals could spend if they wanted to: if Bill Gates were to sell all his Microsoft stock, for instance, or otherwise divest himself of any significant portion of it, it would certainly depress the stock price. The data are from Forbes magazine, 6 October 2003.
(k) Frequencies of family names: Cumulative distribution of the frequency of occurrence in the US of the 89 000 most common family names, as recorded by the US Census Bureau in 1990. Similar distributions are observed for names in some other cultures as well (for example in Japan [29]) but not in all cases. Korean family names for instance appear to have an exponential distribution [30].
(l) Populations of cities: Cumulative distribution of the size of the human populations of US cities as recorded by the US Census Bureau in 2000.
Few real-world distributions follow a power law over their entire range, and in particular not for smaller values of the variable being measured. As pointed out in the previous section, for any positive value of the exponent α the function p(x) = C x^{−α} diverges as x → 0. In reality therefore, the distribution must deviate from the power-law form below some minimum value x_min. In our computer-generated example of the last section we simply cut off the distribution altogether below x_min so that p(x) = 0 in this region, but most real-world examples are not that abrupt. Figure 4 shows distributions with a variety of behaviours for small values of the variable measured; the straight-line power-law form asserts itself only for the higher values. Thus one often hears it said that the distribution of such-and-such a quantity "has a power-law tail".
Extracting a value for the exponent α from distributions like these can be a little tricky, since it requires us to make a judgement, sometimes imprecise, about the value x_min above which the distribution follows the power law. Once this judgement is made, however, α can be calculated simply from Eq. (5).6 (Care must be taken to use the correct value of n in the formula; n is the number of samples that actually go into the calculation, excluding those with values below x_min, not the overall total number of samples.)
Table I lists the estimated exponents for each of the distributions of Fig. 4, along with standard errors calculated by bootstrapping 100 times, and also the values of x_min used in the calculations. Note that the quoted errors correspond only to the statistical sampling error in the estimation of α; I have included no estimate of any errors introduced by the fact that a single power-law function may not be a good model for the data in some cases or for variation of the estimates with the value chosen for x_min.
In the author's opinion, the identification of some of the distributions in Fig. 4 as following power laws should be considered unconfirmed. While the power law seems to be an excellent model for most of the data sets depicted, a tenable case could be made that the distributions of web hits and family names might have two different power-law regimes with slightly different exponents.7
And the data for the numbers of copies of books sold
6 Sometimes the tail is also cut off because there is, for one reason or another, a limit on the largest value that may occur. An example is the finite-size effects found in critical phenomena—see Section IV.E. In this case, Eq. (5) must be modified [20].
7 Significantly more tenuous claims to power-law behaviour for other quantities have appeared elsewhere in the literature, for instance in the discussion of the distribution of the sizes of electrical blackouts [31, 32]. These however I consider insufficiently substantiated for inclusion in the present work.
quantity                               minimum x_min   exponent α
(a) frequency of use of words          1               2.20(1)
(b) number of citations to papers      100             3.04(2)
(c) number of hits on web sites        1               2.40(1)
(d) copies of books sold in the US     2 000 000       3.51(16)
(e) telephone calls received           10              2.22(1)
(f) magnitude of earthquakes           3.8             3.04(4)
(g) diameter of moon craters           0.01            3.14(5)
(h) intensity of solar flares          200             1.83(2)
(i) intensity of wars                  3               1.80(9)
(j) net worth of Americans             $600m           2.09(4)
(k) frequency of family names          10 000          1.94(1)
(l) population of US cities            40 000          2.30(5)

TABLE I Parameters for the distributions shown in Fig. 4. The labels on the left refer to the panels in the figure. Exponent values were calculated using the maximum likelihood method of Eq. (5) and Appendix B, except for the moon craters (g), for which only cumulative data were available. For this case the exponent quoted is from a simple least-squares fit and should be treated with caution. Numbers in parentheses give the standard error on the trailing figures.
cover rather a small range—little more than one decade horizontally. Nonetheless, one can, without stretching the interpretation of the data unreasonably, claim that power-law distributions have been observed in language, demography, commerce, information and computer sciences, geology, physics and astronomy, and this on its own is an extraordinary statement.
B. Distributions that do not follow a power law
Power-law distributions are, as we have seen, impressively ubiquitous, but they are not the only form of broad distribution. Lest I give the impression that everything interesting follows a power law, let me emphasize that there are quite a number of quantities with highly right-skewed distributions that nonetheless do not obey power laws. A few of them, shown in Fig. 5, are the following:
(a) The abundance of North American bird species, which spans over five orders of magnitude but is probably distributed according to a log-normal. A log-normally distributed quantity is one whose logarithm is normally distributed; see Section IV.G and Ref. [33] for further discussions.
(b) The number of entries in people's email address books, which spans about three orders of magnitude but seems to follow a stretched exponential. A stretched exponential is a curve of the form e^{−ax^b} for some constants a, b.
(c) The distribution of the sizes of forest fires, which spans six orders of magnitude and could follow a power law but with an exponential cutoff.
FIG. 5 Cumulative distributions of some quantities whose distributions span several orders of magnitude but that nonetheless do not follow power laws. (a) The number of sightings of 591 species of birds in the North American Breeding Bird Survey 2003. (b) The number of addresses in the email address books of 16 881 users of a large university computer system [34]. (c) The size in acres of all wildfires occurring on US federal land between 1986 and 1996 (National Fire Occurrence Database, USDA Forest Service and Department of the Interior). Note that the horizontal axis is logarithmic in frames (a) and (c) but linear in frame (b).
This being an article about power laws, I will not discuss further the possible explanations for these distributions, but the scientist confronted with a new set of data having a broad dynamic range and a highly skewed distribution should certainly bear in mind that a power-law model is only one of several possibilities for fitting it.
III. THE MATHEMATICS OF POWER LAWS
A continuous real variable with a power-law distribution has a probability p(x) dx of taking a value in the interval from x to x + dx, where

p(x) = C x^(−α),   (6)

with α > 0. As we saw in Section II.A, there must be some lowest value xmin at which the power law is obeyed, and we consider only the statistics of x above this value.
A. Normalization
The constant C in Eq. (6) is given by the normalization requirement that

1 = ∫_xmin^∞ p(x) dx = C ∫_xmin^∞ x^(−α) dx = [C/(1 − α)] [x^(−α+1)]_xmin^∞.   (7)

We see immediately that this only makes sense if α > 1, since otherwise the right-hand side of the equation would diverge: power laws with exponents less than unity cannot be normalized and don't normally occur in nature. If α > 1 then Eq. (7) gives

C = (α − 1) xmin^(α−1),   (8)

and the correct normalized expression for the power law itself is

p(x) = [(α − 1)/xmin] (x/xmin)^(−α).   (9)
Some distributions follow a power law for part of their range but are cut off at high values of x. That is, above some value they deviate from the power law and fall off quickly towards zero. If this happens, then the distribution may be normalizable no matter what the value of the exponent α. Even so, exponents less than unity are rarely, if ever, seen.
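The normalization of Eq. (9) is easy to verify numerically. The sketch below integrates the density on a logarithmic grid (well suited to fat tails) up to an arbitrarily chosen large cutoff; the exponent and cutoff here are illustrative values, not anything from the text:

```python
import math

def powerlaw_pdf(x, alpha, xmin):
    """Normalized power-law density of Eq. (9)."""
    return (alpha - 1.0) / xmin * (x / xmin) ** (-alpha)

def integrate_log_grid(f, a, b, n=200000):
    """Trapezoidal rule on a logarithmic grid: substitute u = ln x,
    so the integrand becomes f(e^u) e^u on a uniform grid in u."""
    la, lb = math.log(a), math.log(b)
    h = (lb - la) / n
    total = 0.0
    prev = f(a) * a
    for i in range(1, n + 1):
        x = math.exp(la + i * h)
        cur = f(x) * x
        total += 0.5 * (prev + cur) * h
        prev = cur
    return total

alpha, xmin = 2.5, 1.0
mass = integrate_log_grid(lambda x: powerlaw_pdf(x, alpha, xmin), xmin, 1e8)
print(mass)   # approaches 1: Eq. (9) is correctly normalized for alpha > 1
```

With α ≤ 1 the same integral grows without bound as the cutoff is raised, which is the divergence noted below Eq. (7).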
B. Moments
The mean value of x in our power law is given by

〈x〉 = ∫_xmin^∞ x p(x) dx = C ∫_xmin^∞ x^(−α+1) dx = [C/(2 − α)] [x^(−α+2)]_xmin^∞.   (10)

Note that this expression becomes infinite if α ≤ 2. Power laws with such low values of α have no finite mean. The distributions of sizes of solar flares and wars in Table I are examples of such power laws.
What does it mean to say that a distribution has no finite mean? Surely we can take the data for real solar flares and calculate their average? Indeed we can, but this is only because the data set is of finite size. Equation (10) can be made to give a finite value of 〈x〉 if we cut the integral off at some upper limit, i.e., if there is a maximum as well as a minimum value of x. In any real data set of finite size there is indeed such a maximum, which is just the largest value of x observed. But if we make more measurements and generate a larger data set, we have a non-negligible chance of getting a larger maximum value of x, and this will make the value of 〈x〉 larger in turn. The divergence of Eq. (10) is telling us that as we go to larger and larger data sets, our estimate of the mean 〈x〉 will increase without bound. We discuss this more below.
For α > 2 however, the mean does not diverge: the value of 〈x〉 will settle down to a normal finite value as the data set becomes large, and that value is given by Eq. (10) to be

〈x〉 = [(α − 1)/(α − 2)] xmin.   (11)
We can also calculate higher moments of the distribution p(x). For instance, the second moment, the mean square, is given by

〈x²〉 = [C/(3 − α)] [x^(−α+3)]_xmin^∞.   (12)
This diverges if α ≤ 3. Thus power-law distributions in this range, which includes almost all of those in Table I, have no finite mean square in the limit of a large data set, and thus also no finite variance or standard deviation. We discuss the meaning of this statement further below. If α > 3, then the second moment is finite and well-defined, taking the value

〈x²〉 = [(α − 1)/(α − 3)] xmin².   (13)
These results can easily be extended to show that in general all moments 〈x^m〉 exist for m < α − 1 and all higher moments diverge. The ones that do exist are given by

〈x^m〉 = [(α − 1)/(α − 1 − m)] xmin^m.   (14)
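The moment formulas, Eqs. (11) and (13), can be checked by Monte Carlo for an exponent large enough that both converge. The sketch below draws samples by inverse-transform sampling from Eq. (9); the choice α = 4.5 is illustrative, picked so that even the fluctuations of the sample mean square are well behaved:

```python
import random

# Inverse-transform sampling from the normalized power law of Eq. (9):
# the CDF is F(x) = 1 - (x/xmin)^(-(alpha-1)), so x = xmin (1-u)^(-1/(alpha-1)).
rng = random.Random(42)
alpha, xmin, n = 4.5, 1.0, 200000
xs = [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
      for _ in range(n)]

mean = sum(xs) / n
mean_sq = sum(x * x for x in xs) / n

# alpha > 3, so Eqs. (11) and (13) both give finite predictions.
print(mean,    (alpha - 1) / (alpha - 2) * xmin)        # both near 1.4
print(mean_sq, (alpha - 1) / (alpha - 3) * xmin ** 2)   # both near 7/3
```

Repeating the experiment with α ≤ 2 shows the pathology discussed above: the sample mean never settles down but keeps growing with n.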
C. Largest value
Suppose we draw n measurements from a power-law distribution. What value is the largest of those measurements likely to take? Or, more precisely, what is the probability π(x) dx that the largest value falls in the interval between x and x + dx?
The definitive property of the largest value in a sample is that there are no others larger than it. The probability that a particular sample will be larger than x is given by the quantity P(x) defined in Eq. (3):

P(x) = ∫_x^∞ p(x′) dx′ = [C/(α − 1)] x^(−α+1) = (x/xmin)^(−α+1),   (15)

so long as α > 1. And the probability that a sample is not greater than x is 1 − P(x). Thus the probability that a particular sample we draw, sample i, will lie between x and x + dx and that all the others will be no greater than it is p(x) dx × [1 − P(x)]^(n−1). Then there are n ways to choose i, giving a total probability
π(x) = n p(x) [1 − P(x)]^(n−1).   (16)
Now we can calculate the mean value 〈xmax〉 of the largest sample thus:

〈xmax〉 = ∫_xmin^∞ x π(x) dx = n ∫_xmin^∞ x p(x) [1 − P(x)]^(n−1) dx.   (17)
Using Eqs. (9) and (15), this is

〈xmax〉 = n(α − 1) ∫_xmin^∞ (x/xmin)^(−α+1) [1 − (x/xmin)^(−α+1)]^(n−1) dx
        = n xmin ∫_0^1 y^(n−1) (1 − y)^(−1/(α−1)) dy
        = n xmin B(n, (α − 2)/(α − 1)),   (18)
where I have made the substitution y = 1 − (x/xmin)^(−α+1) and B(a, b) is Legendre's beta-function,8 which is defined by

B(a, b) = Γ(a)Γ(b)/Γ(a + b),   (19)

with Γ(a) the standard Γ-function:

Γ(a) = ∫_0^∞ t^(a−1) e^(−t) dt.   (20)
The beta-function has the interesting property that for large values of either of its arguments it itself follows a power law.9 For instance, for large a and fixed b, B(a, b) ∼ a^(−b). In most cases of interest, the number n of samples from our power-law distribution will be large (meaning much greater than 1), so

B(n, (α − 2)/(α − 1)) ∼ n^(−(α−2)/(α−1)),   (21)

and

〈xmax〉 ∼ n^(1/(α−1)).   (22)

Thus, as long as α > 1, we find that 〈xmax〉 always increases as n becomes larger.10
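Eqs. (18), (21) and (22) can be checked numerically. The sketch below evaluates the exact beta-function result with log-gamma arithmetic (to avoid overflow at large n) and watches the rescaled mean maximum settle to a constant; the exponent α = 2.5 is an illustrative choice:

```python
import math

def log_beta(a, b):
    """Legendre beta-function of Eq. (19), via log-gammas to avoid
    overflow when the arguments are large."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

alpha = 2.5
b = (alpha - 2.0) / (alpha - 1.0)   # second argument in Eq. (18)

# Eq. (18): <x_max>/xmin = n B(n, (alpha-2)/(alpha-1)); Eq. (22)
# predicts growth as n^(1/(alpha-1)), here n^(2/3).
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    exact = n * math.exp(log_beta(n, b))
    scaled = exact / n ** (1.0 / (alpha - 1.0))
    print(n, exact, scaled)   # 'scaled' settles near Gamma(1/3) = 2.679
```

The limiting constant Γ((α − 2)/(α − 1)) comes from the asymptotic form B(a, b) ∼ Γ(b) a^(−b), a slightly sharper statement of the power law quoted in the text.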
This allows us to complete the calculation of the moments in Section III.B. Consider for instance the second moment, which is often of interest in power laws. For the crucial case 2 < α ≤ 3, which covers most of the power-law distributions observed in real life, we saw in Eq. (12) that the second moment of the distribution diverges as the size of the data set becomes infinite. But in reality all data sets are finite and so have a finite maximum sample xmax. This means that (12) becomes

〈x²〉 = [C/(3 − α)] [x^(−α+3)]_xmin^xmax.   (23)
8 Also called the Eulerian integral of the first kind.
9 This can be demonstrated by approximating the Γ-functions of Eq. (19) using Stirling's formula.
10 Equation (22) can also be derived by a simpler, although less rigorous, heuristic argument: if P(x) = 1/n for some value of x then we expect there to be on average one sample in the range from x to ∞, and this of course will be the largest sample. Thus a rough estimate of 〈xmax〉 can be derived by setting our expression for P(x), Eq. (15), equal to 1/n and rearranging for x, which immediately gives 〈xmax〉 ∼ n^(1/(α−1)).
As xmax becomes large this expression is dominated by the upper limit, and using the result, Eq. (22), for xmax, we get

〈x²〉 ∼ n^((3−α)/(α−1)).   (24)

So, for instance, if α = 5/2, then the mean-square sample value, and hence also the sample variance, goes as n^(1/3) as the size of the data set gets larger.
D. Top-heavy distributions and the 80/20 rule
Another interesting question is where the majority of the distribution of x lies. For any power law with exponent α > 1, the median is well defined. That is, there is a point x1/2 that divides the distribution in half so that half the measured values of x lie above x1/2 and half lie below. That point is given by

∫_x1/2^∞ p(x) dx = (1/2) ∫_xmin^∞ p(x) dx,   (25)

or

x1/2 = 2^(1/(α−1)) xmin.   (26)
So, for example, if we are considering the distribution of wealth, there will be some well-defined median wealth that divides the richer half of the population from the poorer. But we can also ask how much of the wealth itself lies in those two halves. Obviously more than half of the total amount of money belongs to the richer half of the population. The fraction of the money in the richer half is given by

∫_x1/2^∞ x p(x) dx / ∫_xmin^∞ x p(x) dx = (x1/2/xmin)^(−α+2) = 2^(−(α−2)/(α−1)),   (27)
provided α > 2 so that the integrals converge. Thus, for instance, if α = 2.1 for the wealth distribution, as indicated in Table I, then a fraction 2^(−0.091) ≈ 94% of the wealth is in the hands of the richer 50% of the population, making the distribution quite top-heavy.
More generally, the fraction of the population whose personal wealth exceeds x is given by the quantity P(x), Eq. (15), and the fraction of the total wealth in the hands of those people is

W(x) = ∫_x^∞ x′ p(x′) dx′ / ∫_xmin^∞ x′ p(x′) dx′ = (x/xmin)^(−α+2),   (28)
assuming again that α > 2. Eliminating x/xmin between Eqs. (15) and (28), we find that the fraction W of the wealth in the hands of the richest P of the population is

W = P^((α−2)/(α−1)),   (29)

of which Eq. (27) is a special case. This again has a power-law form, but with a positive exponent now. In
FIG. 6 The fraction W of the total wealth in a country held by the fraction P of the richest people, if wealth is distributed following a power law with exponent α (curves shown for α = 2.1, 2.2, 2.4, 2.7 and 3.5). If α = 2.1, for instance, as it appears to in the United States (Table I), then the richest 20% of the population hold about 86% of the wealth (dashed lines).
Fig. 6 I show the form of the curve of W against P for various values of α. For all values of α the curve is concave downwards, and for values only a little above 2 the curve has a very fast initial increase, meaning that a large fraction of the wealth is concentrated in the hands of a small fraction of the population. Curves of this kind are called Lorenz curves, after Max Lorenz, who first studied them around the turn of the twentieth century [35].
Using the exponents from Table I, we can for example calculate that about 80% of the wealth should be in the hands of the richest 20% of the population (the so-called "80/20 rule", which is borne out by more detailed observations of the wealth distribution), the top 20% of web sites get about two-thirds of all web hits, and the largest 10% of US cities house about 60% of the country's total population.
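Eq. (29) makes figures of this kind easy to reproduce; a minimal sketch:

```python
def wealth_fraction(P, alpha):
    """Eq. (29): fraction W of total wealth held by the richest
    fraction P of the population, valid for exponent alpha > 2."""
    return P ** ((alpha - 2.0) / (alpha - 1.0))

alpha = 2.1   # US wealth distribution exponent from Table I
print(wealth_fraction(0.5, alpha))   # about 0.94: richer half holds ~94%
print(wealth_fraction(0.2, alpha))   # about 0.86: the "80/20 rule" regime
print(wealth_fraction(0.2, 2.4))     # larger alpha: less top-heavy
```

As α approaches 2 from above the exponent (α − 2)/(α − 1) goes to zero and W tends to 1 for any P, the extreme regime discussed next.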
If α ≤ 2 then the situation becomes even more extreme. In that case, the integrals in Eq. (28) diverge at their upper limits, meaning that in fact they depend on the value xmax of the largest sample, as described in Section III.C. But for α > 1, Eq. (22) tells us that the expected value of xmax goes to ∞ as n becomes large, and in that limit the fraction of money in the top half of the population, Eq. (27), tends to unity. In fact, the fraction of money in the top anything of the population, even the top 1%, tends to unity, as Eq. (28) shows. In other words, for distributions with α < 2, essentially all of the wealth (or other commodity) lies in the tail of the distribution. The distribution of family names in the US, which has an exponent α = 1.9, is an example of this type of behaviour. For the data of Fig. 4k, about 75% of the population have names in the top 15 000. Estimates of the total number of unique family names in the US put the figure at around 1.5 million. So in this case 75% of the population have names in the most common 1%—a very top-heavy distribution indeed. The line α = 2 thus separates the regime in which you will with some frequency meet people with uncommon names from the regime in which you will rarely meet such people.
E. Scale-free distributions
A power-law distribution is also sometimes called a scale-free distribution. Why? Because a power law is the only distribution that is the same whatever scale we look at it on. By this we mean the following.
Suppose we have some probability distribution p(x) for a quantity x, and suppose we discover or somehow deduce that it satisfies the property that

p(bx) = g(b) p(x),   (30)

for any b. That is, if we increase the scale or units by which we measure x by a factor of b, the shape of the distribution p(x) is unchanged, except for an overall multiplicative constant. Thus for instance, we might find that computer files of size 2kB are 1/4 as common as files of size 1kB. Switching to measuring size in megabytes we also find that files of size 2MB are 1/4 as common as files of size 1MB. Thus the shape of the file-size distribution curve (at least for these particular values) does not depend on the scale on which we measure file size.
This scale-free property is certainly not true of most distributions. It is not true for instance of the exponential distribution. In fact, as we now show, it is only true of one type of distribution, the power law.
Starting from Eq. (30), let us first set x = 1, giving p(b) = g(b) p(1). Thus g(b) = p(b)/p(1) and (30) can be written as

p(bx) = p(b) p(x)/p(1).   (31)

Since this equation is supposed to be true for any b, we can differentiate both sides with respect to b to get

x p′(bx) = p′(b) p(x)/p(1),   (32)
where p′ indicates the derivative of p with respect to its argument. Now we set b = 1 and get

x dp/dx = [p′(1)/p(1)] p(x).   (33)

This is a simple first-order differential equation which has the solution

ln p(x) = [p′(1)/p(1)] ln x + constant.   (34)

Setting x = 1 we find that the constant is simply ln p(1), and then taking exponentials of both sides

p(x) = p(1) x^(−α),   (35)

where α = −p′(1)/p(1). Thus, as advertised, the power-law distribution is the only function satisfying the scale-free criterion (30).
This fact is more than just a curiosity. As we will see in Section IV.E, there are some systems that become scale-free for certain special values of their governing parameters. The point defined by such a special value is called a "continuous phase transition" and the argument given above implies that at such a point the observable quantities in the system should adopt a power-law distribution. This indeed is seen experimentally and the distributions so generated provided the original motivation for the study of power laws in physics (although most experimentally observed power laws are probably not the result of phase transitions—a variety of other mechanisms produce power-law behaviour as well, as we will shortly see).
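The scale-free property of Eq. (30) is easy to check numerically: for a power law the ratio p(bx)/p(x) is the constant g(b) = b^(−α) whatever the value of x, whereas for an exponential it is not. A minimal sketch (the exponent 2.5 and the test points are arbitrary choices):

```python
import math

# Compare p(bx)/p(x) at several scales x for two candidate densities.
power = lambda x: x ** -2.5        # pure power law (unnormalized)
expo = lambda x: math.exp(-x)      # exponential (unnormalized)

b = 2.0
ratios_power = [power(b * x) / power(x) for x in (1.0, 10.0, 100.0)]
ratios_expo = [expo(b * x) / expo(x) for x in (1.0, 10.0, 100.0)]

print(ratios_power)   # identical: 2^(-2.5), independent of the scale x
print(ratios_expo)    # varies strongly with x: not scale-free
```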
F. Power laws for discrete variables
So far I have focused on power-law distributions for continuous real variables, but many of the quantities we deal with in practical situations are in fact discrete—usually integers. For instance, populations of cities, numbers of citations to papers or numbers of copies of books sold are all integer quantities. In most cases, the distinction is not very important. The power law is obeyed only in the tail of the distribution where the values measured are so large that, to all intents and purposes, they can be considered continuous. Technically however, power-law distributions should be defined slightly differently for integer quantities.
If k is an integer variable, then one way to proceed is to declare that it follows a power law if the probability pk of measuring the value k obeys

pk = C k^(−α),   (36)

for some constant exponent α. Clearly this distribution cannot hold all the way down to k = 0, since it diverges there, but it could in theory hold down to k = 1. If we discard any data for k = 0, the constant C would then be given by the normalization condition

1 = Σ_{k=1}^∞ pk = C Σ_{k=1}^∞ k^(−α) = C ζ(α),   (37)

where ζ(α) is the Riemann ζ-function. Rearranging, C = 1/ζ(α) and

pk = k^(−α)/ζ(α).   (38)
If, as is usually the case, the power-law behaviour is seen only in the tail of the distribution, for values k ≥ kmin, then the equivalent expression is

pk = k^(−α)/ζ(α, kmin),   (39)

where ζ(α, kmin) = Σ_{k=kmin}^∞ k^(−α) is the generalized or incomplete ζ-function.

Most of the results of the previous sections can be generalized to the case of discrete variables, although the mathematics is usually harder and often involves special functions in place of the more tractable integrals of the continuous case.
It has occasionally been proposed that Eq. (36) is not the best generalization of the power law to the discrete case. An alternative and often more convenient form is

pk = C Γ(k)Γ(α)/Γ(k + α) = C B(k, α),   (40)

where B(a, b) is, as before, the Legendre beta-function, Eq. (19). As mentioned in Section III.C, the beta-function behaves as a power law B(k, α) ∼ k^(−α) for large k and so the distribution has the desired asymptotic form. Simon [36] proposed that Eq. (40) be called the Yule distribution, after Udny Yule who derived it as the limiting distribution in a certain stochastic process [37], and this name is often used today. Yule's result is described in Section IV.D.
The Yule distribution is nice because sums involving it can frequently be performed in closed form, where sums involving Eq. (36) can only be written in terms of special functions. For instance, the normalizing constant C for the Yule distribution is given by

1 = C Σ_{k=1}^∞ B(k, α) = C/(α − 1),   (41)

and hence C = α − 1 and

pk = (α − 1) B(k, α).   (42)

The first and second moments (i.e., the mean and mean square of the distribution) are

〈k〉 = (α − 1)/(α − 2),   〈k²〉 = (α − 1)²/[(α − 2)(α − 3)],   (43)

and there are similarly simple expressions corresponding to many of our earlier results for the continuous case.
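The closed-form results of Eqs. (41)–(43) can be checked by direct summation. The sketch below truncates the sums at an arbitrary cutoff and uses log-gamma arithmetic to avoid overflow; it also confirms the B(k, α) ∼ k^(−α) tail:

```python
import math

def yule_pk(k, alpha):
    """Yule distribution, Eq. (42): pk = (alpha - 1) B(k, alpha),
    with the beta-function evaluated through log-gammas."""
    log_b = math.lgamma(k) + math.lgamma(alpha) - math.lgamma(k + alpha)
    return (alpha - 1.0) * math.exp(log_b)

alpha, kmax = 3.0, 200000
pks = [yule_pk(k, alpha) for k in range(1, kmax + 1)]

norm = sum(pks)
mean = sum(k * p for k, p in zip(range(1, kmax + 1), pks))
tail_ratio = yule_pk(100000, alpha) / 100000 ** -alpha

print(norm)         # near 1, confirming Eq. (41)
print(mean)         # near (alpha - 1)/(alpha - 2) = 2, confirming Eq. (43)
print(tail_ratio)   # near (alpha - 1) Gamma(alpha) = 4: power-law tail
```

For α = 3 the terms have the simple closed form pk = 4/[k(k + 1)(k + 2)], which telescopes, so the truncation error of the sums is easy to bound by hand.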
IV. MECHANISMS FOR GENERATING POWER-LAW DISTRIBUTIONS
In this section we look at possible candidate mechanisms by which power-law distributions might arise in natural and man-made systems. Some of the possibilities that have been suggested are quite complex—notably the physics of critical phenomena and the tools of the renormalization group that are used to analyse it. But let us start with some simple algebraic methods of generating power-law functions and progress to the more involved mechanisms later.
A. Combinations of exponentials
A much more common distribution than the power law is the exponential, which arises in many circumstances, such as survival times for decaying atomic nuclei or the Boltzmann distribution of energies in statistical mechanics. Suppose some quantity y has an exponential distribution:

p(y) ∼ e^(ay).   (44)

The constant a might be either negative or positive. If it is positive then there must also be a cutoff on the distribution—a limit on the maximum value of y—so that the distribution is normalizable.
Now suppose that the real quantity we are interested in is not y but some other quantity x, which is exponentially related to y thus:

x ∼ e^(by),   (45)

with b another constant, also either positive or negative. Then the probability distribution of x is

p(x) = p(y) dy/dx ∼ e^(ay)/(b e^(by)) = x^(−1+a/b)/b,   (46)

which is a power law with exponent α = 1 − a/b.
A version of this mechanism was used by Miller [38] to
mechanism was used by Miller [38] to
explain the power-law distribution of the frequencies ofwords as
follows (see also [39]). Suppose we type ran-domly on a
typewriter,11 pressing the space bar withprobability qs per stroke
and each letter with equal prob-ability ql per stroke. If there are
m letters in the alpha-bet then ql = (1− qs)/m. (In this simplest
version of theargument we also type no punctuation, digits or
othernon-letter symbols.) Then the frequency x with whicha
particular word with y letters (followed by a space)occurs is
x =
[
1 − qsm
]y
qs ∼ eby, (47)
where b = ln(1− qs)− lnm. The number (or fraction) ofdistinct
possible words with length between y and y +dygoes up exponentially
as p(y) ∼ my = eay with a = lnm.
11 This argument is sometimes called the "monkeys with typewriters" argument, the monkey being the traditional exemplar of a random typist.
Thus, following our argument above, the distribution of frequencies of words has the form p(x) ∼ x^(−α) with

α = 1 − a/b = [2 ln m − ln(1 − qs)] / [ln m − ln(1 − qs)].   (48)

For the typical case where m is reasonably large and qs quite small this gives α ≈ 2 in approximate agreement with Table I.
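Eq. (48) is easy to evaluate. The sketch below does so for an English-like alphabet; the values m = 26 and qs = 0.18 are illustrative assumptions, not fitted parameters:

```python
import math

def miller_alpha(m, qs):
    """Eq. (48): Zipf exponent for random typing with an m-letter
    alphabet and space probability qs per keystroke."""
    return ((2.0 * math.log(m) - math.log(1.0 - qs))
            / (math.log(m) - math.log(1.0 - qs)))

# Large alphabet, small space probability: exponent close to 2.
print(miller_alpha(26, 0.18))
# Tiny alphabet: the exponent drops noticeably below 2
# (the formula always gives a value between 1 and 2).
print(miller_alpha(2, 0.5))
```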
This is a reasonable theory as far as it goes, but real text is not made up of random letters. Most combinations of letters don't occur in natural languages; most are not even pronounceable. We might imagine that some constant fraction of possible letter sequences of a given length would correspond to real words and the argument above would then work just fine when applied to that fraction, but upon reflection this suggestion is obviously bogus. It is clear for instance that very long words simply don't exist in most languages, although there are exponentially many possible combinations of letters available to make them up. This observation is backed up by empirical data. In Fig. 7a we show a histogram of the lengths of words occurring in the text of Moby Dick, and one would need a particularly vivid imagination to convince oneself that this histogram follows anything like the exponential assumed by Miller's argument. (In fact, the curve appears roughly to follow a log-normal [33].)
There may still be some merit in Miller's argument however. The problem may be that we are measuring word "length" in the wrong units. Letters are not really the basic units of language. Some basic units are letters, but some are groups of letters. The letters "th" for example often occur together in English and make a single sound, so perhaps they should be considered to be a separate symbol in their own right and contribute only one unit to the word length?
Following this idea to its logical conclusion we can imagine replacing each fundamental unit of the language—whatever that is—by its own symbol and then measuring lengths in terms of numbers of symbols. The pursuit of ideas along these lines led Claude Shannon in the 1940s to develop the field of information theory, which gives a precise prescription for calculating the number of symbols necessary to transmit words or any other data [40, 41]. The units of information are bits and the true "length" of a word can be considered to be the number of bits of information it carries. Shannon showed that if we regard words as the basic divisions of a message, the information y carried by any particular word is

y = −k ln x,   (49)

where x is the frequency of the word as before and k is a constant. (The reader interested in finding out more about where this simple relation comes from is recommended to look at the excellent introduction to information theory by Cover and Thomas [42].)
But this has precisely the form that we want. Inverting it we have x = e^(−y/k), and if the probability distribution of
FIG. 7 (a) Histogram of the lengths in letters of all distinct words in the text of the novel Moby Dick. (b) Histogram of the information content à la Shannon of words in Moby Dick. The former does not, by any stretch of the imagination, follow an exponential, but the latter could easily be said to do so. (Note that the vertical axes are logarithmic.)
the "lengths" measured in terms of bits is also exponential as in Eq. (44) we will get our power-law distribution. Figure 7b shows the latter distribution, and indeed it follows a nice exponential—much better than Fig. 7a.
This is still not an entirely satisfactory explanation. Having made the shift from pure word length to information content, our simple count of the number of words of length y—that it goes exponentially as m^y—is no longer valid, and now we need some reason why there should be exponentially more distinct words in the language of high information content than of low. That this is the case is experimentally verified by Fig. 7b, but the reason must be considered still a matter of debate. Some possibilities are discussed by, for instance, Mandelbrot [43] and more recently by Mitzenmacher [19].
Another example of the "combination of exponentials" mechanism has been discussed by Reed and Hughes [44]. They consider a process in which a set of items, piles or groups each grows exponentially in time, having size x ∼ e^(bt) with b > 0. For instance, populations of organisms reproducing freely without resource constraints grow exponentially. Items also have some fixed probability of dying per unit time (populations might have a stochastically constant probability of extinction), so that the times t at which they die are exponentially distributed p(t) ∼ e^(at) with a < 0.

These functions again follow the form of Eqs. (44) and (45) and result in a power-law distribution of the sizes x of the items or groups at the time they die. Reed and Hughes suggest that variations on this argument may explain the sizes of biological taxa, incomes and cities, among other things.
B. Inverses of quantities
Suppose some quantity y has a distribution p(y) that passes through zero, thus having both positive and negative values. And suppose further that the quantity we are really interested in is the reciprocal x = 1/y, which will have distribution

p(x) = p(y) dy/dx = −p(y)/x².   (50)
The large values of x, those in the tail of the distribution, correspond to the small values of y close to zero and thus the large-x tail is given by

p(x) ∼ x^(−2),   (51)

where the constant of proportionality is p(y = 0). More generally, any quantity x = y^(−γ) for some γ will have a power-law tail to its distribution p(x) ∼ x^(−α), with α = 1 + 1/γ. It is not clear who the first author or authors were to describe this mechanism,12 but clear descriptions have been given recently by Bouchaud [45], Jan et al. [46] and Sornette [47].
One might argue that this mechanism merely generates a power law by assuming another one: the power-law relationship between x and y generates a power-law distribution for x. This is true, but the point is that the mechanism takes some physical power-law relationship between x and y—not a stochastic probability distribution—and from that generates a power-law probability distribution. This is a non-trivial result.
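The mechanism can be checked with any density that is nonzero at the origin. The sketch below uses a standard normal for y (an arbitrary choice) and fits the tail of x = 1/y with the standard maximum likelihood estimator, which should return an exponent near the value 2 predicted by Eq. (51):

```python
import math
import random

# y drawn from a density nonzero at the origin; the reciprocal
# x = 1/y should develop the x^(-2) tail of Eq. (51).
random.seed(3)
xs = [abs(1.0 / y)
      for y in (random.gauss(0.0, 1.0) for _ in range(1000000))
      if y != 0.0]

# Fit the tail above a threshold xmin with the standard maximum
# likelihood estimator for a power-law exponent.
xmin = 10.0
tail = [x for x in xs if x >= xmin]
alpha_hat = 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)
print(len(tail), alpha_hat)   # exponent near 2
```

The threshold matters: far above xmin = 10 the density of y is essentially flat at p(0), so the tail is very nearly a pure x^(−2) power law and the fit is clean.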
One circumstance in which this mechanism arises is in measurements of the fractional change in a quantity. For instance, Jan et al. [46] consider one of the most famous systems in theoretical physics, the Ising model of a magnet. In its paramagnetic phase, the Ising model has a magnetization that fluctuates around zero. Suppose we measure the magnetization m at uniform intervals and calculate the fractional change δ = (∆m)/m between each successive pair of measurements. The change ∆m is roughly normally distributed and has a typical size set by the width of that normal distribution. The 1/m on the other hand produces a power-law tail when small values of m coincide with large values of ∆m, so that the tail of the distribution of δ follows p(δ) ∼ δ^(−2) as above.
In Fig. 8 I show a cumulative histogram of measurements of δ for simulations of the Ising model on a square lattice, and the power-law distribution is clearly visible. Using Eq. (5), the value of the exponent is α = 1.98 ± 0.04, in good agreement with the expected value of 2.
12 A correspondent tells me that a similar mechanism was described in an astrophysical context by Chandrasekhar in a paper in 1943, but I have been unable to confirm this.
FIG. 8 Cumulative histogram of the magnetization fluctuations of a 128 × 128 nearest-neighbour Ising model on a square lattice. The model was simulated at a temperature of 2.5 times the spin-spin coupling for 100 000 time steps using the cluster algorithm of Swendsen and Wang [48] and the magnetization per spin measured at intervals of ten steps. The fluctuations were calculated as the ratio δi = 2(mi+1 − mi)/(mi+1 + mi).
C. Random walks
Many properties of random walks are distributed according to power laws, and this could explain some power-law distributions observed in nature. In particular, a randomly fluctuating process that undergoes "gambler's ruin",13 i.e., that ends when it hits zero, has a power-law distribution of possible lifetimes.
Consider a random walk in one dimension, in which a walker takes a single step randomly one way or the other along a line in each unit of time. Suppose the walker starts at position 0 on the line and let us ask what the probability is that the walker returns to position 0 for the first time at time t (i.e., after exactly t steps). This is the so-called first return time of the walk and represents the lifetime of a gambler's ruin process. A trick for answering this question is depicted in Fig. 9. We consider first the unconstrained problem in which the walk is allowed to return to zero as many times as it likes, before returning there again at time t. Let us denote the probability of this event as ut. Let us also denote by ft the probability that the first return time is t. We note that both of these probabilities are non-zero only for even values of their
13 Gambler's ruin is so called because a gambler's night of betting ends when his or her supply of money hits zero (assuming the gambling establishment declines to offer him or her a line of credit).
FIG. 9 The position of a one-dimensional random walker (vertical axis) as a function of time (horizontal axis). The probability u2n that the walk returns to zero at time t = 2n is equal to the probability f2m that it returns to zero for the first time at some earlier time t = 2m, multiplied by the probability u2n−2m that it returns again a time 2n − 2m later, summed over all possible values of m. We can use this observation to write a consistency relation, Eq. (52), that can be solved for ft, Eq. (60).
arguments since there is no way to get back to zero in any odd number of steps.
As Fig. 9 illustrates, the probability ut = u2n, with n integer, can be written

u2n = 1 if n = 0;   u2n = Σ_{m=1}^n f2m u2n−2m if n ≥ 1,   (52)

where m is also an integer and we define f0 = 0 and u0 = 1. This equation can conveniently be solved for f2n using a generating function approach. We define

U(z) = Σ_{n=0}^∞ u2n z^n,   F(z) = Σ_{n=1}^∞ f2n z^n.   (53)
Then, multiplying Eq. (52) throughout by z^n and summing, we find

U(z) = 1 + Σ_{n=1}^∞ Σ_{m=1}^n f2m u2n−2m z^n
     = 1 + Σ_{m=1}^∞ f2m z^m Σ_{n=m}^∞ u2n−2m z^(n−m)
     = 1 + F(z) U(z).   (54)

So

F(z) = 1 − 1/U(z).   (55)
The function U(z) however is quite easy to calculate. The probability u2n that we are at position zero after 2n steps is

u2n = 2^(−2n) (2n choose n),   (56)
so14

U(z) = Σ_{n=0}^∞ (2n choose n) z^n/4^n = 1/√(1 − z).   (57)
And hence

F(z) = 1 − √(1 − z).   (58)
Expanding this function using the binomial theorem thus:

F(z) = (1/2) z + [(1/2 × 1/2)/2!] z² + [(1/2 × 1/2 × 3/2)/3!] z³ + …
     = Σ_{n=1}^∞ (2n choose n) / [(2n − 1) 2^(2n)] z^n,   (59)
and comparing this expression with Eq. (53), we immediately see that

f2n = (2n choose n) / [(2n − 1) 2^(2n)],   (60)

and we have our solution for the distribution of first return times.
Now consider the form of f2n for large n. Writing out the binomial coefficient as (2n choose n) = (2n)!/(n!)², we take logs thus:

ln f2n = ln(2n)! − 2 ln n! − 2n ln 2 − ln(2n − 1),   (61)

and use Stirling's formula ln n! ≈ n ln n − n + (1/2) ln n to get ln f2n ≈ (1/2) ln 2 − (1/2) ln n − ln(2n − 1), or
f2n ≈ √(2/n) / (2n − 1).   (62)
In the limit n → ∞, this implies that f2n ∼ n^(−3/2), or equivalently

ft ∼ t^(−3/2).   (63)
So the distribution of return times follows a power law with exponent α = 3/2. Note that the distribution has a divergent mean (because α ≤ 2). As discussed in Section III.C, in practice this implies that the mean is determined by the size of the sample. If we measure the first return times of a large number of random walks, the mean will of course be finite. But the more walks we measure, the larger that mean will become, without bound.
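These properties are easy to see in a direct simulation. The Python sketch below (illustrative, not from the original text; a cap on walk length keeps the heavy-tailed excursions from running indefinitely) samples first-return times of a simple random walk. Eq. (60) gives f_2 = 1/2, so about half of all walks should return after exactly two steps:

```python
import random

random.seed(1)

# Sample the first-return time of a simple random walk on the
# integers: start at 0, step +/-1 with equal probability, and stop
# when the walk first revisits 0.  The cap bounds the run time,
# since the t^(-3/2) tail produces occasional very long excursions.
def first_return(cap=5000):
    pos, t = 0, 0
    while t < cap:
        pos += random.choice((-1, 1))
        t += 1
        if pos == 0:
            return t
    return None  # did not return within the cap

times = [first_return() for _ in range(10_000)]
returned = [t for t in times if t is not None]

# Eq. (60) gives f_2 = 1/2, so roughly half of all sampled walks
# should return after exactly two steps.
frac_t2 = sum(1 for t in returned if t == 2) / len(times)
print(round(frac_t2, 2))
```

Raising the cap and the number of walks makes the growing sample mean described above directly visible: ever longer return times keep entering the sample.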
As an example application, the random walk can be considered a simple model for the lifetime of biological taxa. A taxon is a branch of the evolutionary tree, a group of species all descended by repeated speciation from a common ancestor.15 The ranks of the Linnean hierarchy—genus, family, order and so forth—are examples of taxa. If a taxon gains and loses species at random over time, then the number of species performs a random walk, the taxon becoming extinct when the number of species reaches zero for the first (and only) time. (This is one example of "gambler's ruin".) Thus the time for which taxa live should have the same distribution as the first return times of random walks.

14 The enthusiastic reader can easily derive this result for him- or herself by expanding (1 − z)^{−1/2} using the binomial theorem.
In fact, it has been argued that the distribution of the lifetimes of genera in the fossil record does indeed follow a power law [49]. The best fits to the available fossil data put the value of the exponent at α = 1.7 ± 0.3, which is in agreement with the simple random walk model [50].16
D. The Yule process
One of the most convincing and widely applicable mechanisms for generating power laws is the Yule process, whose invention was, coincidentally, also inspired by observations of the statistics of biological taxa as discussed in the previous section.
In addition to having a (possibly) power-law distribution of lifetimes, biological taxa also have a very convincing power-law distribution of sizes. That is, the distribution of the number of species in a genus, family or other taxonomic group appears to follow a power law quite closely. This phenomenon was first reported by Willis and Yule in 1922 for the example of flowering plants [15]. Three years later, Yule [37] offered an explanation using a simple model that has since found wide application in other areas. He argued as follows.
Suppose first that new species appear but they never die; species are only ever added to genera and never removed. This differs from the random walk model of the last section, and certainly from reality as well. It is believed that in practice all species and all genera become extinct in the end. But let us persevere; there is nonetheless much of worth in Yule's simple model.
15 Modern phylogenetic analysis, the quantitative comparison of species' genetic material, can provide a picture of the evolutionary tree and hence allow the accurate "cladistic" assignment of species to taxa. For prehistoric species, however, whose genetic material is not usually available, determination of evolutionary ancestry is difficult, so classification into taxa is based instead on morphology, i.e., on the shapes of organisms. It is widely acknowledged that such classifications are subjective and that the taxonomic assignments of fossil species are probably riddled with errors.

16 To be fair, I consider the power law for the distribution of genus lifetimes to fall in the category of "tenuous" identifications to which I alluded in footnote 7. This theory should be taken with a pinch of salt.

Species are added to genera by speciation, the splitting
of one species into two, which is known to happen by a variety of mechanisms, including competition for resources, spatial separation of breeding populations and genetic drift. If we assume that this happens at some stochastically constant rate, then it follows that a genus with k species in it will gain new species at a rate proportional to k, since each of the k species has the same chance per unit time of dividing in two. Let us further suppose that occasionally, say once every m speciation events, the new species produced is, by chance, sufficiently different from the others in its genus as to be considered the founder member of an entire new genus. (To be clear, we define m such that m species are added to pre-existing genera and then one species forms a new genus. So m + 1 new species appear for each new genus and there are m + 1 species per genus on average.) Thus the number of genera goes up steadily in this model, as does the number of species within each genus.
We can analyse this Yule process mathematically as follows.17 Let us measure the passage of time in the model by the number of genera n. At each time-step one new species founds a new genus, thereby increasing n by 1, and m other species are added to various pre-existing genera which are selected in proportion to the number of species they already have. We denote by p_{k,n} the fraction of genera that have k species when the total number of genera is n. Thus the number of such genera is n p_{k,n}. We now ask what the probability is that the next species added to the system happens to be added to a particular genus i having k_i species in it already. This probability is proportional to k_i, and so when properly normalized is just k_i / \sum_i k_i. But \sum_i k_i is simply the total number of species, which is n(m + 1). Furthermore, between the appearance of the nth and the (n + 1)th genera, m other new species are added, so the probability that genus i gains a new species during this interval is m k_i / [n(m + 1)]. And the total expected number of genera of size k that gain a new species in the same interval is
    \frac{mk}{n(m+1)} \times n p_{k,n} = \frac{m}{m+1}\, k p_{k,n}.    (64)
Now we observe that the number of genera with k species will decrease on each time step by exactly this number, since by gaining a new species they become genera with k + 1 instead. At the same time the number increases because of species that previously had k − 1 species and now have an extra one. Thus we can write a master equation for the new number (n + 1)p_{k,n+1} of genera with k species thus:

    (n+1) p_{k,n+1} = n p_{k,n} + \frac{m}{m+1} \bigl[ (k-1) p_{k-1,n} - k p_{k,n} \bigr].    (65)

17 Yule's analysis of the process was considerably more involved than the one presented here, essentially because the theory of stochastic processes as we now know it did not yet exist in his time. The master equation method we employ is a relatively modern innovation, introduced in this context by Simon [36].

The only exception to this equation is for genera of size 1, which instead obey the equation
    (n+1) p_{1,n+1} = n p_{1,n} + 1 - \frac{m}{m+1}\, p_{1,n},    (66)

since by definition exactly one new such genus appears on each time step.
Now we ask what form the distribution of the sizes of genera takes in the limit of long times. To do this we allow n → ∞ and assume that the distribution tends to some fixed value p_k = \lim_{n→∞} p_{k,n} independent of n. Then Eq. (66) becomes p_1 = 1 − m p_1/(m + 1), which has the solution

    p_1 = \frac{m+1}{2m+1}.    (67)
And Eq. (65) becomes

    p_k = \frac{m}{m+1} \bigl[ (k-1) p_{k-1} - k p_k \bigr],    (68)
which can be rearranged to read

    p_k = \frac{k-1}{k+1+1/m}\, p_{k-1},    (69)
and then iterated to get

    p_k = \frac{(k-1)(k-2) \ldots 1}{(k+1+1/m)(k+1/m) \ldots (3+1/m)}\, p_1
        = (1+1/m) \frac{(k-1)(k-2) \ldots 1}{(k+1+1/m)(k+1/m) \ldots (2+1/m)},    (70)
where I have made use of Eq. (67). This can be simplified further by making use of a handy property of the Γ-function, Eq. (20), that Γ(a) = (a − 1)Γ(a − 1). Using this, and noting that Γ(1) = 1, we get

    p_k = (1+1/m) \frac{\Gamma(k)\, \Gamma(2+1/m)}{\Gamma(k+2+1/m)} = (1+1/m)\, B(k, 2+1/m),    (71)
where B(a, b) is again the beta-function, Eq. (19). This, we note, is precisely the distribution defined in Eq. (40), which Simon called the Yule distribution. Since the beta-function has a power-law tail B(a, b) ∼ a^{−b}, we can immediately see that p_k also has a power-law tail with an exponent

    \alpha = 2 + \frac{1}{m}.    (72)
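These predictions can be checked with a direct simulation of Yule's model. In the Python sketch below (illustrative, not from the original text), every species records the genus it belongs to, so drawing a uniformly random species selects a genus in proportion to its size, exactly the preferential rule described above. With m = 2 the fraction of one-species genera should approach p_1 = (m + 1)/(2m + 1) = 0.6 from Eq. (67):

```python
import random
from collections import Counter

random.seed(42)

m = 2              # species added to pre-existing genera per new genus
n_genera = 20_000

# `genus_of` lists, for every species, the genus it belongs to.
# Drawing a uniformly random entry therefore picks a genus with
# probability proportional to its current number of species.
genus_of = [0]     # one genus with a single species to start
for g in range(1, n_genera):
    for _ in range(m):   # m preferential speciation events
        genus_of.append(genus_of[random.randrange(len(genus_of))])
    genus_of.append(g)   # one new species founds genus g

sizes = Counter(genus_of)            # genus -> number of species
p1 = sum(1 for s in sizes.values() if s == 1) / n_genera
print(round(p1, 3))  # Eq. (67) predicts p_1 = (m+1)/(2m+1) = 0.6
```

A histogram of the genus sizes produced this way also shows the power-law tail of Eq. (71), with exponent approaching 2 + 1/m as the number of genera grows.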
The mean number m + 1 of species per genus for the example of flowering plants is about 3, making m ≃ 2 and α ≃ 2.5. The actual exponent for the distribution found by Willis and Yule [15] is α = 2.5 ± 0.1, which is in excellent agreement with the theory.
Most likely this agreement is fortuitous, however. The Yule process is probably not a terribly realistic explanation for the distribution of the sizes of genera, principally because it ignores the fact that species (and genera) become extinct. However, it has been adapted and generalized by others to explain power laws in many other systems, most famously city sizes [36], paper citations [51, 52], and links to pages on the world wide web [53, 54]. The most general form of the Yule process is as follows.
Suppose we have a system composed of a collection of objects, such as genera, cities, papers, web pages and so forth. New objects appear every once in a while as cities grow up or people publish new papers. Each object also has some property k associated with it, such as number of species in a genus, people in a city or citations to a paper, that is reputed to obey a power law, and it is this power law that we wish to explain. Newly appearing objects have some initial value of k which we will denote k_0. New genera initially have only a single species k_0 = 1, but new towns or cities might have quite a large initial population—a single person living in a house somewhere is unlikely to constitute a town in their own right but k_0 = 100 people might do so. The value of k_0 can also be zero in some cases: newly published papers usually have zero citations, for instance.

In between the appearance of one object and the next, m new species/people/citations etc. are added to the entire system. That is, some cities or papers will get new people or citations, but not necessarily all will. And in the simplest case these are added to objects in proportion to the number that the object already has. Thus the probability of a city gaining a new member is proportional to the number already there; the probability of a paper getting a new citation is proportional to the number it already has. In many cases this seems like a natural process. For example, a paper that already has many citations is more likely to be discovered during a literature search and hence more likely to be cited again. Simon [36] dubbed this type of "rich-get-richer" process the Gibrat principle. Elsewhere it also goes by the names of the Matthew effect [55], cumulative advantage [51], or preferential attachment [53].
There is a problem, however, when k_0 = 0. For example, if new papers appear with no citations and garner citations in proportion to the number they currently have, which is zero, then no paper will ever get any citations! To overcome this problem one typically assigns new citations not in proportion simply to k, but to k + c, where c is some constant. Thus there are three parameters k_0, c and m that control the behaviour of the model.
By an argument exactly analogous to the one given above, one can
then derive the master equation
    (n+1) p_{k,n+1} = n p_{k,n} + m \frac{k-1+c}{k_0+c+m}\, p_{k-1,n} - m \frac{k+c}{k_0+c+m}\, p_{k,n}, \quad \text{for } k > k_0,    (73)

and

    (n+1) p_{k_0,n+1} = n p_{k_0,n} + 1 - m \frac{k_0+c}{k_0+c+m}\, p_{k_0,n}, \quad \text{for } k = k_0.    (74)
(Note that k is never less than k0, since each object appears
with k = k0 initially.)
Looking for stationary solutions of these equations as before, we define p_k = \lim_{n→∞} p_{k,n} and find that

    p_{k_0} = \frac{k_0+c+m}{(m+1)(k_0+c)+m},    (75)

and

    p_k = \frac{(k-1+c)(k-2+c) \ldots (k_0+c)}{(k-1+c+\alpha)(k-2+c+\alpha) \ldots (k_0+c+\alpha)}\, p_{k_0}
        = \frac{\Gamma(k+c)\, \Gamma(k_0+c+\alpha)}{\Gamma(k_0+c)\, \Gamma(k+c+\alpha)}\, p_{k_0},    (76)
where I have made use of the Γ-function notation introduced for Eq. (71) and, for reasons that will become clear in just a moment, I have defined α = 2 + (k_0 + c)/m. As before, this expression can also be written in terms of the beta-function, Eq. (19):

    p_k = \frac{B(k+c, \alpha)}{B(k_0+c, \alpha)}\, p_{k_0}.    (77)
Since the beta-function follows a power law in its tail, B(a, b) ∼ a^{−b}, the general Yule process generates a power-law distribution p_k ∼ k^{−α} with exponent related to the three parameters of the process according to

    \alpha = 2 + \frac{k_0 + c}{m}.    (78)
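The power-law tail of Eq. (77) is easy to verify numerically: evaluating B(k + c, α) through the log of the Γ-function and measuring the log-log slope over a decade of large k should give a value close to −α. A minimal Python sketch (the helper names `log_beta` and `tail_slope` are illustrative, not from the original text):

```python
from math import lgamma, log

# log of the beta-function B(a, b), computed via the log-Gamma
# function to avoid overflow at large arguments
def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

# Log-log slope of p_k between two large values of k, using
# p_k proportional to B(k + c, alpha), Eq. (77), with
# alpha = 2 + (k0 + c)/m, Eq. (78).  For a power-law tail
# the slope should approach -alpha.
def tail_slope(k0, c, m, k_lo=1000, k_hi=2000):
    alpha = 2 + (k0 + c) / m
    return ((log_beta(k_hi + c, alpha) - log_beta(k_lo + c, alpha))
            / (log(k_hi) - log(k_lo)))

# Yule's original case (k0 = 1, c = 0, m = 2): alpha = 2.5
print(round(tail_slope(1, 0, 2), 3))
```

Trying other parameter combinations, such as k_0 = 0, c = 1, confirms that the measured slope tracks −[2 + (k_0 + c)/m] in each case.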
For example, the original Yule process for number of species per genus has c = 0 and k_0 = 1, which reproduces the result of Eq. (72). For citations of papers or links to web pages we have k_0 = 0 and we must have c > 0 to get any citations or links at all. So α = 2 + c/m. In his work
on citations Price [51] assumed that c = 1, so that paper citations have the same exponent α = 2 + 1/m as the standard Yule process, although there doesn't seem to be any very good reason for making this assumption. As we saw in Table I (and as Price himself also reported), real citations seem to have an exponent α ≃ 3, so we should expect c ≃ m. For the data from the Science Citation Index examined in Section II.A, the mean number m of citations per paper is 8.6. So we should put c ≃ 8.6 too if we want the Yule process to match the observed exponent.
The most widely studied model of links on the web, that of Barabási and Albert [53], assumes c = m so that α = 3, but again there doesn't seem to be a good reason for this assumption. The measured exponent for numbers of links to web sites is about α = 2.2, so if the Yule process is to match the data in this case, we should put c ≃ 0.2m.
However, the important point is that the Yule process is a plausible and general mechanism that can explain a number of the power-law distributions observed in nature and can produce a wide range of exponents to match the observations by suitable adjustments of the parameters. For several of the distributions shown in Fig. 4, especially citations, city populations and personal income, it is now the most widely accepted theory.
E. Phase transitions and critical phenomena
A completely different mechanism for generating power laws, one that has received a huge amount of attention over the past few decades from the physics community, is that of critical phenomena.
Some systems have only a single macroscopic length-scale, size-scale or time-scale governing them. A classic example is a magnet, which has a correlation length that measures the typical size of magnetic domains. Under certain circumstances this length-scale can diverge, leaving the system with no scale at all. As we will now see, such a system is "scale-free" in the sense of Section III.E and hence the distributions of macroscopic physical quantities have to follow power laws. Usually the circumstances under which the divergence takes place are very specific ones. The parameters of the system have to be tuned very precisely to produce the power-law behaviour. This is something of a disadvantage; it makes the divergence of length-scales an unlikely explanation for generic power-law distributions of the type highlighted in this paper. As we will shortly see, however, there are some elegant and interesting ways around this problem.
The precise point at which the length-scale in a system diverges is called a critical point or a phase transition. More specifically it is a continuous phase transition. (There are other kinds of phase transitions too.) Things that happen in the vicinity of continuous phase transitions are known as critical phenomena, of which power-law distributions are one example.
FIG. 10 The percolation model on a square lattice: squares on the lattice are coloured in independently at random with some probability p. In this example p = 1/2.
To better understand the physics of critical phenomena, let us explore one simple but instructive example, that of the "percolation transition". Consider a square lattice like the one depicted in Fig. 10 in which some of the squares have been coloured in. Suppose we colour each square with independent probability p, so that on average a fraction p of them are coloured in. Now we look at the clusters of coloured squares that form, i.e., the contiguous regions of adjacent coloured squares. We can ask, for instance, what the mean area 〈s〉 is of the cluster to which a randomly chosen square belongs. If that square is not coloured in then the area is zero. If it is coloured in but none of the adjacent ones is coloured in then the area is one, and so forth.
When p is small, only a few squares are coloured in and most coloured squares will be alone on the lattice, or maybe grouped in twos or threes. So 〈s〉 will be small. This situation is depicted in Fig. 11 for p = 0.3. Conversely, if p is large—almost 1, which is the largest value it can have—then most squares will be coloured in and they will almost all be connected together in one large cluster, the so-called spanning cluster. In this situation we say that the system percolates. Now the mean size of the cluster to which a vertex belongs is limited only by the size of the lattice itself and as we let the lattice size become large 〈s〉 also becomes large. So we have two distinctly different behaviours, one for small p in which 〈s〉 is small and doesn't depend on the size of the system, and one for large p in which 〈s〉 is much larger and increases with the size of the system.
And what happens in between these two extremes? As we increase p from small values, the value of 〈s〉 also increases. But at some point we reach the start of the regime in which 〈s〉 goes up with system size instead of staying constant. We now know that this point is at p = 0.5927462..., which is called the critical value of p and is denoted p_c. If the size of the lattice is large, then 〈s〉 also becomes large at this point, and in the limit where the lattice size goes to infinity 〈s〉 actually diverges. To illustrate this phenomenon, I show in Fig. 12 a plot of 〈s〉 from simulations of the percolation model and the divergence is clear.

FIG. 11 Three examples of percolation systems on 100 × 100 square lattices with p = 0.3, p = p_c = 0.5927... and p = 0.9. The first and last are well below and above the critical point respectively, while the middle example is precisely at it.
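The behaviour of 〈s〉 is simple to reproduce in simulation. The Python sketch below (illustrative, not from the original text; it uses a modest lattice rather than the 1000 × 1000 lattices of Fig. 12) colours squares with probability p, labels clusters with a flood fill, and computes 〈s〉 as the sum of squared cluster sizes over the number of lattice squares, since uncoloured squares contribute zero:

```python
import random

random.seed(0)

# Mean area <s> of the cluster containing a randomly chosen square
# for site percolation on an L x L lattice: a square in a cluster of
# size s contributes s, and uncoloured squares contribute 0, so
# <s> = (sum over clusters of s^2) / L^2.
def mean_cluster_size(p, L=200):
    grid = [[random.random() < p for _ in range(L)] for _ in range(L)]
    seen = [[False] * L for _ in range(L)]
    total = 0
    for i in range(L):
        for j in range(L):
            if grid[i][j] and not seen[i][j]:
                size, stack = 0, [(i, j)]
                seen[i][j] = True
                while stack:  # iterative flood fill over neighbours
                    x, y = stack.pop()
                    size += 1
                    for nx, ny in ((x+1, y), (x-1, y), (x, y+1), (x, y-1)):
                        if (0 <= nx < L and 0 <= ny < L
                                and grid[nx][ny] and not seen[nx][ny]):
                            seen[nx][ny] = True
                            stack.append((nx, ny))
                total += size * size
    return total / (L * L)

# <s> is small well below p_c but grows rapidly as p approaches it
print(mean_cluster_size(0.3), mean_cluster_size(0.55))
```

Sweeping p over a grid of values with this function and averaging over many lattices reproduces the divergence near p_c shown in Fig. 12, limited only by the finite lattice size.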
Now consider not just the mean cluster size but the entire distribution of cluster sizes. Let p(s) be the probability that a randomly chosen square belongs to a cluster of area s. In general, what forms can p(s) take as a function of s? The important point to notice is that p(s), being a probability distribution, is a dimensionless quantity—just a number—but s is an area. We could measure s in terms of square metres, or whatever units the lattice is calibrated in. The average 〈s〉 is also an area and then there is the area of a unit square itself, which we will denote a. Other than these three quantities, however, there are no other independent parameters with dimensions in this problem. (There is the area of the whole lattice, but we are considering the limit where that becomes infinite, so it's out of the picture.)
If we want to make a dimensionless function p(s) out of these three dimensionful parameters, there are three
FIG. 12 The mean area of the cluster to which a randomly chosen square belongs for the percolation model described in the text, calculated from an average over 1000 simulations on a 1000 × 1000 squar