
    The econometrics of inequality and poverty

    Lecture 6: Modeling the income distribution

    Michel Lubrano

    January 2010

Contents

1 Type of survey samples
  1.1 Random samples
  1.2 Using weights
  1.3 Stratified samples
  1.4 Two stage sampling
  1.5 Grouped data
  1.6 IID samples

2 Natural estimators and resampling methods
  2.1 The use of order statistics
  2.2 Jackknife and bootstrapping

3 Non parametric estimation of densities
  3.1 Histograms
  3.2 Kernel estimation

4 Sampling properties
  4.1 Assumptions and notations
  4.2 Bias and variance
  4.3 Approximations of the bias and the variance
  4.4 Determining the ideal window and kernel

5 Choice of the window
  5.1 Subjective choices
  5.2 Reference to a known distribution
  5.3 Cross validation on the likelihood
  5.4 Least squares cross validation
  5.5 Density estimation with weighted samples
  5.6 Using R

6 General estimation methods
  6.1 Inference for grouped data
  6.2 Inference for Pareto IID samples
  6.3 Maximum likelihood
  6.4 Graphical and regression methods
  6.5 Using R for Pareto fit
  6.6 Bayesian inference
  6.7 Inference for the Lognormal process
  6.8 Using R to compare Pareto and Lognormal

7 Using mixtures for IID samples
  7.1 Informal introduction
  7.2 Mixture of distributions
  7.3 Estimation procedures
  7.4 Difficulties of estimation
  7.5 Estimating mixture in R

    1 Type of survey samples

    Please read the first chapter of Deaton (1997).

The data we are interested in are survey data on households. Many types of information can be collected from households, such as unemployment, wages, education, or health status. Here we are mainly concerned with income and sometimes consumption. We have a finite population of size N and want to draw a sample of smaller size n from that population. How can we proceed? The design of a survey has to follow precise rules. We want to get information on a population, and it is too costly to interview the entire population every year. A census occurs at most every five years and gives information on the whole population. Its coverage is usually not complete: homeless people, armed forces, and so on are missed.

    1.1 Random samples

A survey has to be framed, which means that we have to know the size and composition of the true population. A census is useful to frame a survey; other administrative data can be used too. The census, for instance, provides a list of households to sample. Then we have to decide on the size n of the survey. The sample is then drawn at random. The sample mean is

$$\bar x = \frac{1}{n} \sum_{i=1}^{n} x_i$$


where $\bar x_s$ is the estimated mean for each stratum. In each stratum we can of course have a particular weighting scheme superimposed on the stratification. Stratification often improves the representativeness of the sample by reducing sampling error. It can produce a weighted mean that has less variability than the arithmetic mean of a simple random sample of the population. In fact

$$\mathrm{Var}(\bar x) = \sum_{s=1}^{S} \left(\frac{N_s}{N}\right)^2 \mathrm{Var}(\bar x_s)$$

because the strata are independent. It can be shown that this variance is lower than the variance of

$$\bar x_{srs} = \sum_{s=1}^{S} \frac{n_s}{n}\, \bar x_s.$$
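These two formulas are straightforward to code; a minimal R sketch, where the strata sizes and summaries are hypothetical values rather than data from a real survey:

# Stratified mean and its variance from strata summaries
Ns <- c(500, 300, 200)        # hypothetical population size of each stratum
xbar.s <- c(10.2, 15.1, 30.4) # estimated mean in each stratum
v.s <- c(0.04, 0.09, 0.25)    # Var(xbar_s) in each stratum
w <- Ns / sum(Ns)
xbar <- sum(w * xbar.s)       # stratified mean
v.xbar <- sum(w^2 * v.s)      # Var(xbar), the strata being independent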

1.4 Two stage sampling

Within each stratum, most household surveys collect their data in two stages: first sampling clusters, then selecting households within each cluster. This has the advantage of reducing the cost of interviews, which are done in the same cluster or village, and of easing re-interviews. But observations within a cluster are correlated, so the information collected might be less precise. The mean is computed in the same way using weights, but its variance might be greater than under one stage sampling.

1.5 Grouped data

Survey data report private information on households. These data are politically sensitive depending on their content. For instance, there are debates in France about the use of racial information to study discrimination. In Belgium, it is forbidden to ask questions about the language used at home (French or Flemish). So for a long time these data were simply not available. Researchers had access to data that were so aggregated that they were presented in groups. The treatment of these grouped data required special tools and estimation techniques. For instance, Singh and Maddala or McDonald use grouped data for US income. We reproduce these data in Table 1. The columns with dates contain percentages summing to 100%. The first column gives the upper endpoint of each income class, presumably in thousands of dollars per year per household. This leads to a histogram, as we shall explain below.

1.6 IID samples

When the statistician is lucky, the data come from an IID sample, like macroeconomic data. These samples are much easier to analyse: a mean is computed in the usual way, and parametric densities can be estimated by maximum likelihood.


Table 1: US data on income

Endpoints   1970   1975   1980
 2.5         6.6    3.5    2.1
 5.0        12.5    8.5    4.1
 7.5        15.2   10.6    6.2
10.0        16.6   10.6    6.5
12.5        15.8   11.4    7.3
15.0        11.0   10.9    6.9
20.0        13.1   18.8   14.0
25.0         4.6   11.6   13.7
35.0         3.0    9.5   19.8
50.0         1.1    3.2   12.8
             0.5    1.4    6.7

2 Natural estimators and resampling methods

- Order statistics
- Estimation of F(x)
- Estimation of L(j/n)
- Estimation of poverty deficit curves: to be completed for stochastic dominance
- Bootstrapping

2.1 The use of order statistics

The first estimation techniques presented here are relatively simple. They rely on the fact that the observations can be ordered, so that we can compute what are called the order statistics. Suppose then that the observations of X are sorted in increasing order and denote this ranking

$$x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}. \qquad (1)$$

x_{(1)} is the smallest observation and x_{(n)} the largest. In this framework it becomes easy to estimate a distribution and its quantiles in a natural way. Indeed, a distribution is defined as F(x) = Prob(X < x). It can be approximated by

$$\mathrm{Prob}(X \leq x_{(i)}) \simeq i/n \qquad (2)$$

when enough observations are available. The first decile of this distribution corresponds to the value x_{0.10} such that Prob(X ≤ x_{0.10}) = 0.10. It then suffices to find the observation whose rank i corresponds roughly to i/n = 0.10 in the ordered sequence of X. More generally, if Q(p) denotes the quantile of order p, it is estimated as

$$\hat Q(p) = x_{(s)}, \qquad s - 1 < np \leq s. \qquad (3)$$
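A minimal R sketch of these natural estimators, assuming a numeric vector x holds the sample:

# Empirical distribution and quantiles from order statistics
x <- sort(x)                  # x_(1) <= ... <= x_(n)
n <- length(x)
Fhat <- (1:n) / n             # Prob(X <= x_(i)) ~ i/n, eq. (2)
Qhat <- function(p) x[ceiling(n * p)]   # eq. (3): rank s with s - 1 < np <= s
Qhat(0.10)                    # first decile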


Quantile estimation makes it possible, for instance, to compute a measure of dispersion such as the interdecile ratio (x_{0.90} − x_{0.10})/x_{0.50}. From these same order statistics, one can define an estimator of the generalised Lorenz curve by plotting

$$\hat L(p = i/n) = \frac{1}{n} \sum_{j=1}^{i} x_{(j)}. \qquad (4)$$

Here we have used the partial sums of the order statistics. The Lorenz curve itself is obtained by normalising by the mean. Finally, the Gini coefficient can be estimated by means of a single sum, rather than the double sum of the original definition in terms of expected absolute difference divided by twice the mean:

$$\hat I_G = \frac{2}{n(n-1)\bar x} \sum_{i=1}^{n} i\, x_{(i)} - \frac{n+1}{n-1}. \qquad (5)$$

The same type of computation applies to the Sen-Shorrocks-Thon poverty index:

$$\hat I_{SST} = \frac{1}{n^2} \sum_{i=1}^{q} (2n - 2i + 1)\, \frac{z - y_{(i)}}{z},$$

where q corresponds to the rank of the poverty line z in the distribution of X.
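These single-sum formulas translate directly into R; a minimal sketch:

# Gini from a single sum over order statistics, eq. (5)
gini.os <- function(x) {
  x <- sort(x); n <- length(x)
  2 * sum((1:n) * x) / (n * (n - 1) * mean(x)) - (n + 1) / (n - 1)
}
# Sen-Shorrocks-Thon index for a poverty line z
sst <- function(y, z) {
  y <- sort(y); n <- length(y); q <- sum(y < z)
  sum((2 * n - 2 * (1:q) + 1) * (z - y[1:q]) / z) / n^2
}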

2.2 Jackknife and bootstrapping

We thus have simple estimators, but we do not always know how to compute their standard deviations. It was rather easy to compute the variance of the mean, but the variance of the mode, for instance, is much more difficult to establish, especially when the sampling design is complex. The bootstrap and, to a lesser extent, the jackknife are two methods for assessing sampling variability.

There are two sources of randomness:

1. We have samples from a finite population. We must know the sample design, which can be quite complicated, in order to appreciate this source of randomness. This is not always easy; for instance, N might not be known precisely.

2. There are errors of observation, or the observed variable is simply random by nature, as it results from decision making under uncertainty.

Two types of methods were designed in the literature:

1. the jackknife,

2. the bootstrap.

These are resampling techniques, and R packages exist for bootstrapping. The bootstrap resamples n observations with replacement from the original sample. The jackknife provides n samples of size n − 1


by eliminating one observation at a time from the original sample. With each technique, the statistic whose variance we want is evaluated on each bootstrap or jackknife sample.

The bootstrap is available in R through the boot package. We must first load the library, then define a function with two arguments: the first represents the original data, the second the resampling indices (or weights) generated by the package. Here is an example with the Gini coefficient; the original listing is truncated, so this is a minimal reconstruction:

library(boot)
library(ineq)
# statistic: the original data and the resampling indices generated by boot
gini.stat <- function(x, i) Gini(x[i])
r <- boot(y79, gini.stat, R = 1000)
plot(density(r$t), main = "Distribution of the Gini")


[Figure 1: Bootstrapping the Gini. Density of the bootstrap replications r$t, concentrated between 0.250 and 0.265]

3 Non parametric estimation of densities

3.1 Histograms

hist(y79, breaks = 50)

where y79 is the FES data for 1979. The corresponding graph is given in Figure 2.

3.2 Kernel estimation

The histogram has the bad property of being a step function: it is discontinuous and not differentiable. We would like a smooth representation, and we feel that this is possible when we have a full sample rather than grouped data. Rosenblatt (1956) had the idea of replacing the indicator function by a kernel K which, like the indicator function, integrates to one. We thus have the new estimator

$$\hat f(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\!\left(\frac{x - x_i}{h}\right). \qquad (7)$$

Some properties of the kernel can be deduced from the properties of the indicator function and of the histogram:

- $\int K(\psi)\, d\psi = 1$;
- h → 0 when n → ∞;
- K(ψ) → 0 as ψ → ±∞;
- a common choice for K is the standardised normal density, in which case K(ψ) ≈ 0 for |ψ| > 3;
- the value chosen for h is crucial, as it defines the neighbourhood |x − x_i|/h ≤ 3.

It is very important to understand the role played by h in determining the shape of the estimated density. We have simulated 500 observations drawn from a mixture of normals N(μ_i, 1) with μ_1 = 1, μ_2 = 5 and p = 0.25:

$$f(x) = 0.25\, f_N(x|1, 1) + (1 - 0.25)\, f_N(x|5, 1).$$

We then estimated the density of these observations using a normal kernel and three values of h. For the while, we accept the fact that the optimal value of h is given by

$$h = c\, n^{-1/5}.$$

We have selected three values for c in the following graphs. The bimodality is clearly visible in the central case; it disappears in the first figure, and sampling noise appears in the third.
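This experiment is easy to reproduce; a minimal sketch:

# Simulate the normal mixture and compare three bandwidths
set.seed(1)
n <- 500
z <- runif(n) < 0.25                  # component indicator, p = 0.25
x <- ifelse(z, rnorm(n, 1, 1), rnorm(n, 5, 1))
h0 <- n^(-1/5)                        # h = c * n^(-1/5)
for (c in c(10, 1, 0.1))
  plot(density(x, bw = c * h0), main = paste("c =", c))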

[Figure 2: Histogram with 50 cells of FES 1979. Frequencies of y79 over the range 0 to 400]


4 Sampling properties

We have reviewed many factors that influence the final result of density estimation. The two basic ingredients were the choice of the kernel and the choice of the smoothing window. How exactly can we measure their influence on the precision of the final result? We want to measure the final difference between the estimator and the true density. A natural measure of the gap between an estimator and the true value is the mean squared error

$$\mathrm{MSE}_x(\hat\theta) = \mathrm{E}[\hat\theta - \theta]^2 \qquad (8)$$

which decomposes easily into

$$\mathrm{MSE}_x(\hat\theta) = \mathrm{Bias}[\hat\theta]^2 + \mathrm{Var}[\hat\theta]. \qquad (9)$$

But we want to estimate a density, not a point estimator. We therefore need a global measure that takes all x into account. Integrating over x yields the MISE, or mean integrated squared error

$$\mathrm{MISE}(\hat f) = \mathrm{E} \int [\hat f(x) - f(x)]^2\, dx. \qquad (10)$$

This corresponds to a notion of risk. If one simply wants to minimise the loss, it suffices to consider

$$\mathrm{ISE}(\hat f) = \int [\hat f(x) - f(x)]^2\, dx. \qquad (11)$$

The MISE is the most widely used criterion, but it is difficult to compute. One often settles for approximations, which can be found by noting that the MISE decomposes into

$$\mathrm{MISE}(\hat f) = \int [\mathrm{E}(\hat f(x)) - f(x)]^2\, dx + \int \mathrm{Var}[\hat f(x)]\, dx. \qquad (12)$$

It then suffices to find approximations for the bias and the variance and to plug them into this formula.

4.1 Assumptions and notations

We have already made some assumptions concerning the kernel and the window size. We recall them and introduce some useful notation:

- $\int K(t)\, dt = 1$;
- $\int K^2(t)\, dt = c_K < \infty$;
- $\int t^2 K(t)\, dt = \sigma_2$.

The quantity σ_2 will play an important role in the expression of the results. Finally, concerning the window:

- h → 0 when n → ∞;
- nh → ∞ when n → ∞.

The window must tend to zero as the sample size grows, but not too fast.


4.2 Bias and variance

The bias and the variance of the estimator can be computed as expectations with respect to the true but unknown distribution f(.). Starting from the formula of the simple kernel estimator, we have

$$\mathrm{E}(\hat f(x)) = \int \frac{1}{h} K\!\left(\frac{x - y}{h}\right) f(y)\, dy \qquad (13)$$

which will serve to compute the bias, and

$$n\,\mathrm{Var}(\hat f(x)) = \int \frac{1}{h^2} K\!\left(\frac{x - y}{h}\right)^2 f(y)\, dy - \left[\int \frac{1}{h} K\!\left(\frac{x - y}{h}\right) f(y)\, dy\right]^2. \qquad (14)$$

4.3 Approximations of the bias and the variance

The exact formulas for the bias and the variance involve integrals and are not directly applicable, except in very particular cases of little practical interest. Approximations have therefore been sought by means of a first order Taylor expansion.

Let us start with the change of variable y = x − ht, with Jacobian h. Making this change of variable in the expression of the bias, we get

$$\mathrm{bias} = \int K(t)\,[f(x - ht) - f(x)]\, dt. \qquad (15)$$

Expanding f(x − ht) around h = 0,

$$f(x - ht) = f(x) - ht f'(x) + \tfrac{1}{2} h^2 t^2 f''(x) + \ldots \qquad (16)$$

Using the fact that the kernel has zero mean and variance σ_2,

$$\mathrm{bias} \simeq \tfrac{1}{2}\, h^2 f''(x)\, \sigma_2 + \ldots \qquad (17)$$

Similar computations for the variance show that

$$\mathrm{Var}(\hat f(x)) \simeq \frac{1}{nh}\, f(x)\, c_K \qquad (18)$$

assuming n large and h small. The approximation of the MISE is therefore

$$\mathrm{AMISE} \simeq \frac{1}{4} h^4 \sigma_2^2 \int f''(x)^2\, dx + \frac{1}{nh}\, c_K. \qquad (19)$$

The bias depends only on the window size and not on the sample size. The variance, on the other hand, depends on the sample size. Moreover, the bias is reduced by reducing h,


but reducing h increases the variance. The choice of h thus involves a trade-off between systematic errors and random errors. This is what the bandwidth graphs illustrate. If we want to minimise the MISE (here the AMISE), we note that the first term is of order h⁴ while the second is of order 1/(nh). Bias and variance are of the same order when h⁴ ≈ 1/(nh), that is h⁵ ≈ 1/n, giving

$$h \propto n^{-1/5}. \qquad (20)$$

We shall meet this rate throughout nonparametric inference.

4.4 Determining the ideal window and kernel

We differentiate the approximate MISE with respect to h and find the optimal h by setting this derivative to zero. The result is

$$h_{opt} = \sigma_2^{-2/5}\, c_K^{1/5} \left\{\int f''(x)^2\, dx\right\}^{-1/5} n^{-1/5}
         = \left(\frac{c_K}{n\, \sigma_2^2 \int f''(x)^2\, dx}\right)^{1/5}. \qquad (21)$$

The ideal window depends on many things:

- It tends to zero at a very slow rate.
- It depends on the fluctuations of f. If f fluctuates a lot, a small h is needed. Some methods calibrate h against a known density such as the normal (Silverman's rule).
- Finally, h depends on the kernel. The kernel can always be normalised so that σ_2 = 1, in which case it enters only through c_K. Silverman's rule again exploits this fact.

Plugging the optimal h into the expression of the MISE leads to

$$\mathrm{MISE} \simeq \frac{5}{4}\, \sigma_2^{2/5}\, c_K^{4/5} \left(\int f''(x)^2\, dx\right)^{1/5} n^{-4/5}. \qquad (22)$$

The ideal kernel is the one that minimises the MISE for given f. To find it we must in fact minimise c_K subject to the constraint that the kernel is a density, i.e. integrates to one, and is normed, i.e. σ_2 = 1. One can show that the minimiser is the Epanechnikov kernel, which has a very simple expression:

$$K(t) = \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{t^2}{5}\right) & \text{if } |t| \leq \sqrt{5} \\ 0 & \text{otherwise.} \end{cases} \qquad (23)$$


This expression can be further simplified by a change of variable (rescaling t by √5), but then the kernel, while still integrating to 1, no longer has variance equal to 1.

The efficiency of other kernels relative to the Epanechnikov kernel can be computed by defining the ratio

$$\mathrm{eff}(K) = \frac{\left(\int t^2 K_e(t)\, dt\right)^{1/2} \int K_e(t)^2\, dt}{\left(\int t^2 K(t)\, dt\right)^{1/2} \int K(t)^2\, dt}. \qquad (24)$$

Using the properties of the Epanechnikov kernel, this becomes

$$\mathrm{eff}(K) = \frac{3/(5\sqrt{5})}{\left(\int t^2 K(t)\, dt\right)^{1/2} \int K(t)^2\, dt}. \qquad (25)$$

It is then interesting to compute the efficiency of the usual kernels:

Kernel         K(t)                                   Efficiency
Epanechnikov   (3/(4√5))(1 − t²/5) for |t| ≤ √5       1
Biweight       (15/16)(1 − t²)² for |t| ≤ 1           0.99
Gaussian       (1/√(2π)) exp(−t²/2)                   0.95
Rectangular    1/2 for |t| < 1                        0.93

Even for what seems the worst of these choices, the rectangular kernel which leads to the histogram, the efficiency is very close to 1. We shall therefore not spend too much time choosing an efficient kernel; only other considerations come into play. The Epanechnikov kernel has the drawback of being differentiable only at the first order, whereas the biweight is twice differentiable and the Gaussian kernel is infinitely differentiable. Some kernels have a finite support, others an infinite one, which makes a difference in terms of numerical efficiency: with the Gaussian kernel, for instance, one can spend much time computing many values that receive a negligible weight.

5 Choice of the window

The choice of the window is decisive for the appearance of the result, and it is guided by the purpose pursued. If the aim is to present the content of the data, a subjective choice will often do. If the aim is to present conclusions, a little undersmoothing is useful: the reader can smooth by eye, but cannot reconstruct details that have been erased by too large an h. When a large number of results must be presented, an automatic method is very useful. Likewise, if results are to be compared, a standardised method for choosing h is preferable. Note that automatic methods cannot be called objective, as they all rest on particular assumptions.

5.1 Subjective choices

One considers several graphs and chooses by eye the value of h that gives the most aesthetically pleasing result. See the previous figures.

5.2 Reference to a known distribution

We have seen that the optimal h is given by

$$h_{opt} = \sigma_2^{-2/5}\, c_K^{1/5} \left(\int f''(x)^2\, dx\right)^{-1/5} n^{-1/5}. \qquad (26)$$

Some elements of this expression, such as n and K, are known. But f is of course unknown, since it is what we are trying to estimate, and we must compute ∫f''(x)²dx. If we assume that the true distribution f is normal with zero mean and variance σ², then

$$\int f''_{N(0,\sigma^2)}(x)^2\, dx = \sigma^{-5}\, \frac{3}{8\sqrt{\pi}} \approx 0.212\, \sigma^{-5}. \qquad (27)$$

If we now choose a normal kernel, we can verify that σ_2 = 1 and c_K = 0.5/√π. Collecting all the pieces, the optimal h in this case is given by

$$h \simeq 1.06\, \hat\sigma\, n^{-1/5}. \qquad (28)$$

It suffices to estimate the sample variance consistently and we have an optimal h. This is what is usually called Silverman's rule.

This works very well as long as we are close to the normal case, but much less well as soon as we move away from it. In particular, if the true distribution f is a mixture, Silverman's formula tends to oversmooth as soon as the modes of the mixture move apart. Various studies have shown that this rule also oversmooths in case of asymmetry of f, but not in the case of kurtosis; in particular, if f is a Student, the rule works well.

To improve Silverman's rule, one looks for a better evaluation of the dispersion. If R is the interquartile range, the rule becomes

$$h = 0.79\, R\, n^{-1/5}. \qquad (29)$$

But this rule worsens the oversmoothing in case of bimodality. Finally one may choose

$$h = 0.9\, A\, n^{-1/5} \qquad (30)$$

where A is the minimum of σ̂ and R/1.34. The graphs presented above were computed with

$$h = c\, A\, n^{-1/5} \qquad (31)$$

with c taking the values 10, 1 and 0.1.
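Rule (30) is one line of R; this is, up to details, what the default bandwidth selector bw.nrd0() implements:

# Silverman's rule of thumb, eq. (30)
h.silverman <- function(x) {
  A <- min(sd(x), IQR(x) / 1.34)
  0.9 * A * length(x)^(-1/5)
}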

5.3 Cross validation on the likelihood

We now pursue the likelihood idea and apply it to the choice of h. If the likelihood function is given by Σ log f(x_i), a pseudo likelihood function is

$$\log L = \sum_i \log \hat f(x_i, h). \qquad (32)$$

The problem is that the optimum of this function is reached for h = 0. We therefore apply the cross-validation principle and evaluate not $\hat f(x_i, h)$ but $\hat f_{-i}(x_i, h)$,

$$\hat f_{-i}(x_i, h) = \frac{1}{h(n-1)} \sum_{j \neq i} K\!\left(\frac{x_j - x_i}{h}\right), \qquad (33)$$

which consists in leaving out one observation at a time. This is a general principle of the nonparametric approach that will be used again later.

This likelihood method amounts to choosing the h that minimises the Kullback-Leibler distance between f and $\hat f$,

$$\int f(x) \log \frac{f(x)}{\hat f(x)}\, dx. \qquad (34)$$

But the h obtained is severely affected by the tail behaviour of f, so this criterion is not widely used. It is nevertheless useful to know, since it introduced a principle that will prove very useful later for nonparametric regression.

5.4 Least squares cross validation

This time we optimise a criterion more elaborate than the simple pseudo likelihood above. Consider the integrated squared error

$$\mathrm{ISE}(h) = \int (\hat f(x, h) - f(x))^2\, dx. \qquad (35)$$

Expanding the square, we see that this expression can be simplified, because one of the terms does not depend on h:

$$\mathrm{ISE}(h) \propto \int \hat f(x, h)^2\, dx - 2 \int \hat f(x, h)\, f(x)\, dx. \qquad (36)$$


We therefore estimate this quantity from the sample and then find the value of h that minimises it.

The cross-validation method takes its name from the particular way of estimating $\hat f(x)$ by dropping one observation. Define the estimator

$$\hat f_{-i}(x, h) = \frac{1}{h(n-1)} \sum_{j \neq i} K\!\left(\frac{x - x_j}{h}\right). \qquad (37)$$

The notation −i means that observation i is left out when estimating f(x_i). From there, we note that $\int \hat f(x, h) f(x)\, dx$ is the expectation of $\hat f(x, h)$. An unbiased estimator of this expectation is given by the empirical mean of the $\hat f_{-i}(x, h)$:

$$\mathrm{E}\!\left(\int \hat f(x, h)\, f(x)\, dx\right) \simeq \frac{1}{n} \sum_{i=1}^{n} \hat f_{-i}(x_i). \qquad (38)$$

One should show why this leave-one-out estimator is chosen and why the whole sample is not used. It remains to compute the first term of the ISE by means of

$$\int \hat f^2\, dx = \frac{1}{n^2 h^2} \sum_i \sum_j \int_x K\!\left(\frac{x_i - x}{h}\right) K\!\left(\frac{x_j - x}{h}\right) dx. \qquad (39)$$

The solution is given by

$$\int \hat f^2\, dx = \frac{1}{n^2 h} \sum_i \sum_j K^{*}\!\left(\frac{x_i - x_j}{h}\right) \qquad (40)$$

where K* = K ∗ K is the convolution of the kernel with itself. If the kernel is a normal N(0,1), then K* is the N(0,2) density.

It will be immediately understood that the method is computationally heavy. For each value of h, ISE(h) must be evaluated, which involves a double sum; moreover the function can have several local minima. Pagan and Ullah mention the binning technique used in XploRe to reduce computing time. Note also the kinship with the previous method, although the sample functions used here are more complex.
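For a Gaussian kernel, where K* is the N(0,2) density, the criterion can be coded directly; a minimal sketch (in practice R's built-in bw.ucv() does this job):

# Least squares cross-validation score, eqs. (36), (38), (40)
lscv <- function(h, x) {
  n <- length(x)
  d <- outer(x, x, "-") / h                      # all pairwise (x_i - x_j)/h
  t1 <- sum(dnorm(d, sd = sqrt(2))) / (n^2 * h)  # estimate of int f-hat^2, eq. (40)
  t2 <- (sum(dnorm(d)) - n * dnorm(0)) / (n * (n - 1) * h)  # leave-one-out term, eq. (38)
  t1 - 2 * t2
}
# h.opt <- optimize(lscv, interval = c(0.05, 5), x = y79)$minimum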

5.5 Density estimation with weighted samples

When there are weights w_i, we must first normalise them so that they sum to unity. The usual formula is then simply modified into

$$\hat f(x) = \frac{1}{h} \sum_{i=1}^{n} w_i\, K\!\left(\frac{x - x_i}{h}\right).$$

Equal weights w_i = 1/n give back the unweighted estimator.
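A minimal sketch of this weighted estimator with a Gaussian kernel (note that R's density() also accepts a weights argument, as listed in Table 3 below):

# Weighted kernel density on a grid, weights summing to one
kde.w <- function(x, w, h, grid = seq(min(x), max(x), length.out = 200)) {
  sapply(grid, function(g) sum(w * dnorm((g - x) / h)) / h)
}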

    17

  • 8/3/2019 The econometrics of inequality and poverty

    18/31

5.6 Using R

The standard stats package includes a routine for estimating densities. The density object is created by simply calling density(x), where x represents the data set, assuming the data are presented in a column. By default a Gaussian kernel is used, together with the classical Silverman rule for the bandwidth. Many other options are available and can be found in the help; we list the main ones in Table 3.

Table 3: Options for density estimation

Bandwidth          Kernel                     Weights
bw = bw.nrd0(x)    kernel = "gaussian"        weights = rep(1/nx, nx)
bw = bw.ucv(x)     kernel = "epanechnikov"
bw = bw.SJ(x)      kernel = "triangular"

To obtain a graph, it suffices to pass the output object of density to the routine plot, for instance plot(density(x)). If we want to change the default method for determining the bandwidth, using for instance the cross validation method, we can use

plot(density(y79, bw = bw.ucv(y79)))

6 General estimation methods

Inference, including Bayesian inference, for the Pareto and other simple densities; R and inference for the Singh-Maddala (SM) distribution using the FES data.

6.1 Inference for grouped data

The likelihood function for this type of data takes a multinomial form,

$$L(\theta) = N! \prod_{i=1}^{g} \frac{P_i(\theta)^{n_i}}{n_i!}$$

where P_i(θ) is the probability of being in the i-th of the g groups of the population,

$$P_i(\theta) = \int_{I_i} f(y; \theta)\, dy,$$

and the n_i/N are the observed frequencies.

Fitting various functions by this method gives bell-shaped densities, whereas the parameter values found by Thurow (1970) implied U or L shapes. The Gini coefficients computed this way correspond to those estimated directly by the US Bureau of the Census.


To compare densities with one another, McDonald uses the value of the likelihood function, sums of squared residuals

$$SSR = \sum_i (n_i - N p_i(\theta))^2,$$

or χ² statistics

$$\chi^2 = \sum_i (n_i - N p_i(\theta))^2 / (N p_i(\theta)).$$

A likelihood ratio test makes it possible to test parametric reductions and to compare certain distributions with one another. For income data the GB2 dominates, but the SM comes just after, ahead of the GG for example. The lognormal is bad.

McDonald and Ransom (1979) give some additional details on estimation with grouped data.

The first method consists in maximising the multinomial likelihood function; this is the so-called scoring method.

The minimum χ² method consists in minimising the χ² distance

$$n \sum_i \frac{(n_i/n - p_i)^2}{p_i}.$$

This distance is distributed as a χ² with g − k − 1 degrees of freedom, which makes it possible to test the adequacy of the function to the data.

The least squares method consists in minimising

$$\sum_i (n_i - n p_i)^2.$$

This last method gives results that generally differ from the first two and are rather bad.

The method used by Singh and Maddala (1976) is a little different. Estimation is not based on the probability of an interval, which amounts to minimising the difference between a histogram and its parametric approximation, but on the difference between the natural estimator of the empirical distribution and the analytical expression of this distribution. For the SM we have

$$F(x) = 1 - \frac{1}{(1 + a_1 x^{a_2})^{a_3}}$$

and estimation consists in minimising

$$\sum_i \left[\log(1 - \hat F(x_i)) + a_3 \log(1 + a_1 x_i^{a_2})\right]^2.$$

Several things can be noted about this method:

- It minimises a least squares norm and not a χ² one, so unweighted least squares are a first possible source of errors.
- At the infinite endpoint, log(1 − F̂) cannot be computed, whereas this posed no problem in the density-based case. The same problem arises for truncations such as incomes below or above a given level.
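As an illustration of the minimum χ² method, here is a minimal sketch fitting a lognormal to the 1970 column of Table 1; the choice of distribution and the starting values are ours, for illustration only:

# Minimum chi-square fit of a lognormal to grouped data
ends <- c(2.5, 5, 7.5, 10, 12.5, 15, 20, 25, 35, 50, Inf)
freq <- c(6.6, 12.5, 15.2, 16.6, 15.8, 11.0, 13.1, 4.6, 3.0, 1.1, 0.5) / 100
chi2 <- function(theta) {
  P <- diff(c(0, plnorm(ends, theta[1], theta[2])))  # class probabilities P_i
  sum((freq - P)^2 / P)                              # chi-square distance (up to n)
}
fit <- optim(c(log(10), 0.8), chi2)                  # starting values are guesses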


6.2 Inference for Pareto IID samples

Inference is quite easy for the usual Pareto I model; it is detailed for instance in Arnold (2008). Suppose we have an IID sample of X drawn from a Pareto I. Once we have obtained the estimates of x_m and of α, it is easy to produce an estimate of any needed transformation of these parameters, such as the Gini coefficient, and to find its standard deviation using the delta method (which is not very precise). With Bayesian inference, such estimates are easier to obtain.

6.3 Maximum likelihood

The likelihood function is

$$L(x; x_m, \alpha) = \alpha^n x_m^{n\alpha} \left(\prod x_i\right)^{-(\alpha+1)} 1\!\mathrm{I}(x_{(1)} \geq x_m).$$

It is easy to see that we have two sufficient statistics which immediately give the MLE:

$$\hat x_m = x_{(1)}, \qquad \hat\alpha = \left[\frac{1}{n} \sum \log(x_i / x_{(1)})\right]^{-1}.$$

As underlined by Arnold (2008), these estimators are positively biased, as

$$\mathrm{E}(\hat x_m) = x_m (1 - 1/(n\alpha))^{-1}, \qquad \mathrm{Var}(\hat x_m) = x_m^2\, n\alpha\, [(n\alpha - 1)^2 (n\alpha - 2)]^{-1},$$

$$\mathrm{E}(\hat\alpha) = \alpha\, n/(n-2), \qquad \mathrm{Var}(\hat\alpha) = \alpha^2 n^2\, [(n-2)^2 (n-3)]^{-1}.$$

Knowing the bias, it is easy to propose unbiased estimators by simply correcting the initial maximum likelihood estimators; do it as an exercise. Once we know the estimates of x_m and of α, it is easy to produce an estimate of any needed transformation of these parameters, such as the Gini coefficient, and to find its standard deviation using the delta method (which is not very precise).
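These closed forms are one-liners in R; a minimal sketch, including the bias correction left as an exercise:

# Pareto I maximum likelihood with a simple bias correction
pareto.ml <- function(x) {
  n <- length(x)
  xm <- min(x)                        # xm-hat = x_(1)
  a <- 1 / mean(log(x / xm))          # alpha-hat
  list(xm = xm, alpha = a,
       alpha.unb = a * (n - 2) / n)   # corrects E(alpha-hat) = n alpha / (n - 2)
}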

6.4 Graphical and regression methods

This is in fact the original method. We know how to estimate the empirical distribution function in a simple way, and we know that

$$1 - F(x_i) = (x_i / x_m)^{-\alpha}.$$

Taking logs on each side leads to the regression

$$\log(1 - \hat F(x_i)) = \mathrm{cste} - \alpha \log(x_i) + \epsilon_i.$$

If we do not get a straight line when plotting the two logs, this is evidence that the sample does not come from a Pareto distribution. We can also estimate α in a similar way using the empirical Lorenz curve. These estimators are consistent.


6.5 Using R for Pareto fit

# Compute and plot log(1 - F) against log(y) for the FES data
library(ineq)
data1 = read.table("fes79.csv", header = F, sep = ";")
data2 = read.table("fes88.csv", header = F, sep = ";")
data3 = read.table("fes92.csv", header = F, sep = ";")
data4 = read.table("fes96.csv", header = F, sep = ";")
# deflate each year to 1979 prices
y79 = sort(data1[,1]) / 223.5 * 223.5
y88 = sort(data2[,1]) / 421.7 * 223.5
y92 = sort(data3[,1]) / 546.4 * 223.5
y96 = sort(data4[,1]) / 602.4 * 223.5
# the original listing is truncated here; a minimal completion of the
# plot described in the comment above:
n = length(y79)
plot(log(y79), log(1 - (1:n) / (n + 1)), xlab = "log(y)", ylab = "log(1 - F)")


[Figure 3: Pareto tail for the income distribution. log(1 − F) plotted against log(y)]


6.6 Bayesian inference

When x_m is also an unknown parameter, inference becomes more delicate and a Gibbs sampler is needed. When the sample is given and observed, it is natural to assign to x_m the minimum value of the sample, either the observed value or a value determined on a priori grounds. We have the same problem with the Weibull, where X is supposed to be positive, so there is a minimum value taken equal to zero when the general form of the density includes a location parameter.

Consider a gamma prior for α, and write the likelihood function after a transformation of the variable: taking logs turns the data density into a translated exponential density. The Pareto distribution is related to the exponential distribution as follows. Suppose X is Pareto-distributed with minimum x_m and index α, and let

$$Y = \log \frac{X}{x_m}.$$

Then Y is exponentially distributed with intensity α, or equivalently with expected value 1/α:

$$\Pr(Y > y) = e^{-\alpha y},$$

so that the cumulative distribution function is 1 − e^{−αy} and the pdf is

$$f(y; \alpha) = \begin{cases} \alpha e^{-\alpha y}, & y \geq 0, \\ 0, & y < 0. \end{cases}$$

The likelihood function for α, given an independent and identically distributed sample y = (y_1, ..., y_n) drawn from the variable, is

$$L(\alpha; y) = \prod_{i=1}^{n} \alpha \exp(-\alpha y_i) = \alpha^n \exp\!\left(-\alpha \sum_{i=1}^{n} y_i\right) = \alpha^n \exp(-\alpha n \bar y),$$

where

$$\bar y = \frac{1}{n} \sum_{i=1}^{n} y_i$$

is the sample mean of y. The conjugate prior for the exponential distribution is the gamma distribution (of which the exponential distribution is a special case). The following parameterisation of the gamma pdf is useful:

$$\mathrm{Gamma}(\alpha; \nu, s) = \frac{s^\nu}{\Gamma(\nu)}\, \alpha^{\nu - 1} \exp(-\alpha s).$$

The posterior distribution p can then be expressed in terms of the likelihood function defined above and a gamma prior:

$$p(\alpha|y) \propto L(\alpha; y) \times \mathrm{Gamma}(\alpha; \nu, s)
            = \alpha^n \exp(-\alpha n \bar y)\, \frac{s^\nu}{\Gamma(\nu)}\, \alpha^{\nu - 1} \exp(-\alpha s)
            \propto \alpha^{(\nu + n) - 1} \exp(-\alpha (s + n \bar y)).$$


The posterior density p has now been specified up to a missing normalising constant. Since it has the form of a gamma pdf, this constant can easily be filled in, and one obtains

$$p(\alpha|y) = \mathrm{Gamma}(\alpha;\, \nu + n,\, s + n \bar y).$$

Here the parameter ν can be interpreted as the number of prior observations and s as the sum of the prior observations. Once the posterior is obtained, we can generate random numbers from it in order to find the distribution of the Gini coefficient, for instance, or of any other transformation of α.
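A minimal sketch of this last step, assuming x holds a Pareto sample and using hypothetical prior values ν = s = 1 (the Gini coefficient of a Pareto I with index α is 1/(2α − 1)):

# Posterior of alpha and implied distribution of the Gini coefficient
nu <- 1; s <- 1                      # hypothetical prior values
y <- log(x / min(x))                 # y_i = log(x_i / x_m) with xm-hat = x_(1)
alpha.post <- rgamma(10000, shape = nu + length(y), rate = s + sum(y))
gini.post <- 1 / (2 * alpha.post - 1)
hist(gini.post, breaks = 50)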

6.7 Inference for the Lognormal process

The probability density function of a log-normal distribution is

$$f_X(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \qquad x > 0,$$

where μ and σ are the mean and standard deviation of the variable's natural logarithm; in particular μ = E(log(x)). The likelihood function is rather simple to write once we note that this pdf is just the normal pdf times the Jacobian of the transformation, which is 1/x. We have

$$f_L(x; \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{x_i}\, f_N(\ln x_i; \mu, \sigma)$$

where f_L denotes the probability density function of the log-normal distribution and f_N that of the normal distribution. Therefore, using the same indices to denote log-likelihoods, we can write the log-likelihood function in the following way:

$$\ell_L(\mu, \sigma | x_1, \ldots, x_n) = -\sum_k \ln x_k + \ell_N(\mu, \sigma | \ln x_1, \ldots, \ln x_n) = \text{constant} + \ell_N(\mu, \sigma | \ln x_1, \ldots, \ln x_n).$$

Since the first term is constant with regard to μ and σ, both log-likelihood functions, ℓ_L and ℓ_N, reach their maximum at the same μ and σ. Hence, using the formulas for the normal distribution maximum likelihood estimators and the equality above, we deduce that for the log-normal distribution

$$\hat\mu = \frac{\sum_k \ln x_k}{n}, \qquad \hat\sigma^2 = \frac{\sum_k (\ln x_k - \hat\mu)^2}{n}.$$

This means that in a lognormal sample, the two parameters can be estimated by the sample mean and variance of the logs.


6.8 Using R to compare Pareto and Lognormal

Once the ineq package is loaded and the observations are read in, the following code compares the non-parametric Lorenz curve with the Lorenz curves corresponding to a Pareto density and to a lognormal density (the definition of the grid p is added here to make the listing self-contained):

p = seq(0, 1, 0.01)
plot(Lc(y79))
lines(p, Lc.pareto(p, parameter = 2), col = "red")
text(0.9, 0.6, "Pareto 2.0")
lines(p, Lc.lognorm(p, parameter = 0.45), col = "blue")
text(0.45, 0.4, "Lognormal 0.45")

[Figure 4: Lorenz curves for the Pareto (2.0) and the Lognormal (0.45) against the empirical Lorenz curve]


7 Using mixtures for IID samples

7.1 Informal introduction

Let us go back to the FES data sets. Which kind of density can we fit to these data? We have illustrated several stylised facts:

- the Pareto does not fit the data, as shown by the Lorenz curve;
- the lognormal seems to fit the data better, as shown again by the Lorenz curve;
- the high incomes, greater than exp(4.5) = 90.02, seem to behave like a Pareto.

If we estimate the Pareto regression on the whole sample, the results seem to be good, when in fact they are not, as shown by the graphs:

            Estimate   Std.Err   t-value     Pr(>|t|)
Intercept     7.5148    0.0502   149.5581    0
X            -1.9736    0.0116  -170.3988    0

When the regression is run on the truncated sample, the results are far better:

Residual Standard Error = 0.1453
R-Square = 0.9786
F-statistic (df = 1, 2190) = 99911.3
p-value = 0

            Estimate   Std.Err   t-value     Pr(>|t|)
Intercept    18.6256    0.0622   299.6337    0
X            -4.0838    0.0129  -316.0875    0

But we need to confirm these results by a plot of the Lorenz curve:

z = exp(4.5)
plot(Lc(y79[y79 > z]))
lines(p, Lc.pareto(p, parameter = 4), col = "red")
lines(p, Lc.lognorm(p, parameter = 0.25), col = "blue")


[Figure 5: Lorenz fit for high incomes]

We see on this plot that the red Lorenz curve, corresponding to a Pareto with α = 4.0, fits slightly better than the lognormal with σ = 0.25. So we cannot use a single distribution to model these data. This is confirmed by a non-parametric estimate of the density. We estimate the data density using a kernel and plot it together with a lognormal fit obtained with

ly79 = log(y79)
plot(density(y79))
xx = seq(0, 350, 1)
lines(xx, dlnorm(xx, meanlog = mean(ly79), sdlog = sd(ly79)), col = "red")

We see clearly that even if the overall fit could pass for nice, the two modes are smoothed into something which is not even in between them, while the right tail seems to be fitted quite well.

    7.2 Mixture of distributions

    When a single density is not enough to represent correctly the distribution of a sample, a simple

    explanation is that the observed sample is heterogenous and this result from the mixing of dif-

    ferent populations, each being represented by a particular density indexed by a given parameter.


[Figure 6: Non parametric estimate of the density for FES79 (N = 6230, bandwidth = 5.982)]

The trouble is that we do not know, first, how many different sub-populations there are and, second, what their proportions are. This lack of knowledge makes the problem difficult. As a simplification, suppose we have only two subpopulations, each described by a density indexed by θ_i, in unknown proportion p. The density of one observation is

$$f(x|\theta) = p\, f_N(x|\mu_1, \sigma_1^2) + (1 - p)\, f_N(x|\mu_2, \sigma_2^2)$$

if we suppose, again as a simplification, that the two members of the mixture are normal densities. If we knew the sample separation, i.e. which observation belongs to group 1 or 2, the inference problem would be very simple. But of course the allocation of the observations is unknown.

7.3 Estimation procedures

It is convenient to introduce a new random variable Z, associated with each observation x_i, that says whether x_i belongs to the first component of the mixture (z_i = 1) or to the second (z_i = 2). Suppose that we know the n values of z. We can
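One standard way of exploiting this latent allocation is the EM algorithm, which alternates between reconstructing z and updating the parameters. A minimal sketch for two normal components, with hypothetical starting values:

# EM iterations for a two-component normal mixture
em.mix <- function(x, p = 0.5, mu = c(50, 100), sd = c(10, 40), iter = 200) {
  for (k in 1:iter) {
    # E step: posterior probability that each x_i comes from component 1
    d1 <- p * dnorm(x, mu[1], sd[1])
    d2 <- (1 - p) * dnorm(x, mu[2], sd[2])
    z1 <- d1 / (d1 + d2)
    # M step: weighted updates of proportion, means and standard deviations
    p <- mean(z1)
    mu <- c(weighted.mean(x, z1), weighted.mean(x, 1 - z1))
    sd <- c(sqrt(sum(z1 * (x - mu[1])^2) / sum(z1)),
            sqrt(sum((1 - z1) * (x - mu[2])^2) / sum(1 - z1)))
  }
  list(p = p, mu = mu, sd = sd)
}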


7.5 Estimating mixture in R

The complexity of the estimation procedures is reflected in the procedures proposed in R. In order to simplify the problem, the program starts by considering a histogram, which means grouped data. So we first have to select the number of cells in the histogram. Then we have to give starting values for the parameters, and first of all the number of components. While it is quite safe to estimate a two-component mixture, many references in the empirical literature indicate that trying to fit more than two components is rarely successful. Usually equal weights are given as starting values for the p_i. A visual inspection of the histogram gives clues about plausible values for the means. The prior variance is small when the prior mean corresponds to a sharp part of the histogram, and much larger for the prior mean corresponding to the tail.

library(mixdist)
# the original listing is truncated after "FES.mix"; a plausible
# reconstruction following the text and Table 4 (arguments and
# starting values below are our guesses):
FES.grp = mixgroup(y79, breaks = 50)
FES.par = mixparam(mu = c(45, 90), sigma = c(7, 40), pi = c(0.5, 0.5))
FES.mix = mix(FES.grp, FES.par, dist = "lnorm")
plot(FES.mix)


Table 4: Parameter estimates for a two-component mixture

member       p         mu      sigma
1          0.1369    45.42     6.764
2          0.8631    89.14    40.811

[Figure 7: Mixture of two lognormal densities. Probability density of X over the range 0 to 400]

References

ARNOLD, B. C. (2008): "Pareto and Generalized Pareto Distributions," in Modeling Income Distributions and Lorenz Curves, ed. by D. Chotikapanich, vol. 5 of Economic Studies in Equality, Social Exclusion and Well-Being, chap. 7, pp. 119-145. Springer, New York.

DEATON, A. (1997): The Analysis of Household Surveys. The Johns Hopkins University Press, Baltimore and London.