SC968 Panel data methods for sociologists Lecture 2, part 1 Introducing panel data

Mar 31, 2015


Lizbeth Benbow

SC968 Panel data methods for sociologists

Lecture 2, part 1

Introducing panel data


Overview

Panel data
• What it is
• How to get to know the data

Change over time
• Tabulating
• Calculating transition probabilities


What is panel data?

A data set containing observations on multiple phenomena observed at a single point in time is called cross-sectional data

A data set containing observations on a single phenomenon observed over multiple time periods is called time series data

Observations on multiple phenomena over multiple time periods are panel data

Cross-sectional and time series data are one-dimensional; panel data are two-dimensional

Panel data can be used to answer both longitudinal and cross-sectional questions!


Using panel data in Stata

Data on n cases, over t time periods, giving a total of n × t observations if every case is observed in every period

One record per observation

i.e. long format

Stata tools for analyzing panel data begin with the prefix xt

First you need to tell Stata that you have panel data, using xtset


+------------------------------------------------------------+
|      pid  wave     sex  age    mastat    jbstat     fihhmn |
|------------------------------------------------------------|
| 10019057     1  female   59  never ma   retired        780 |
| 10019057     2  female   60  never ma   retired     759.14 |
| 10019057     3  female   61  never ma   retired      923.5 |
| 10019057     4  female   62  never ma   retired       62.5 |
| 10019057     5  female   63  never ma   retired        663 |
| 10019057     6  female   64  never ma   retired  missing o |
| 10019057     7  female   65  never ma   retired   1254.963 |
| 10019057     8  female   66  never ma   retired   1270.432 |
| 10019057     9  female   67  never ma   retired   1364.555 |
| 10019057    10  female   67  never ma   retired    1479.74 |
| 10019057    11  female   68  never ma   retired    1328.25 |
| 10019057    12  female   69  never ma   retired    1371.49 |
| 10019057    13  female   71  never ma   retired  missing o |
| 10019057    14  female   71  never ma   retired   1372.333 |
| 10019057    15  female   73  never ma   retired   1475.812 |
|------------------------------------------------------------|
| 10028005     1    male   30  never ma  employed   1501.155 |
| 10028005     2    male   31  never ma  employed   1636.259 |
| 10028005     3    male   32  never ma  employed   1943.283 |
| 10028005     6    male   35  never ma  employed    2001.54 |
| 10028005     7    male   36  never ma  employed    1634.33 |
| 10028005     9    male   38  never ma  employed   1587.945 |
+------------------------------------------------------------+

Complete and incomplete person-wave data


Telling Stata you have panel data

. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit

pid: unique cross-wave identifier

wave: time variable


. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit

(unbalanced), but with gaps: cases not observed for every time period

delta: period between observations, in units of the time variable
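The two flags in the xtset message can be reproduced by hand. The following is a minimal Python sketch (not part of the original slides, and not how Stata implements xtset); it uses the waves observed for the two people in the listing above, and the function name `describe_panel` is a hypothetical helper introduced only for illustration.

```python
# Waves observed for the two people shown in the person-wave listing.
waves_by_pid = {
    10019057: list(range(1, 16)),   # observed in all 15 waves
    10028005: [1, 2, 3, 6, 7, 9],   # waves 4, 5 and 8 are missing
}

def describe_panel(waves_by_pid):
    """Hypothetical helper: summarise a panel roughly as the xtset message does."""
    wave_sets = [set(w) for w in waves_by_pid.values()]
    # Balanced: every case is observed at exactly the same time points.
    balanced = all(s == wave_sets[0] for s in wave_sets)
    # Gaps: some case has a missing wave between its first and last observation.
    has_gaps = any(len(s) < max(s) - min(s) + 1 for s in wave_sets)
    return {
        "balanced": balanced,
        "has_gaps": has_gaps,
        "first": min(min(s) for s in wave_sets),
        "last": max(max(s) for s in wave_sets),
    }

summary = describe_panel(waves_by_pid)
print(summary)
```

For these two cases the sketch flags the panel as unbalanced and as having gaps, with waves running from 1 to 15, which matches the message Stata prints for the full data set.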


Describing the patterns in panel data

. xtdes, patterns(20)

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------------
     1294     28.12   28.12 |  111111111111111
      248      5.39   33.51 |  1..............
      157      3.41   36.93 |  11.............
      115      2.50   39.43 |  ..............1
      105      2.28   41.71 |  111............
      104      2.26   43.97 |  1111...........
       73      1.59   45.56 |  11111..........
       69      1.50   47.05 |  ............111
       66      1.43   48.49 |  ..........11111
       62      1.35   49.84 |  .............11
       60      1.30   51.14 |  .1.............
       60      1.30   52.45 |  11111111111....
       58      1.26   53.71 |  11111111.......
       58      1.26   54.97 |  111111111......
       57      1.24   56.21 |  11111111111111.
       55      1.20   57.40 |  .....1.........
       54      1.17   58.57 |  ........1111111
       54      1.17   59.75 |  .11111111111111
       54      1.17   60.92 |  1111111111.....
       53      1.15   62.07 |  .........111111
     1745     37.93  100.00 | (other patterns)
 ---------------------------+-----------------
     4601    100.00         |  XXXXXXXXXXXXXXX
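Each pattern string has one character per wave: "1" if the person was observed in that wave and "." if not. As an illustration only, the sketch below builds such strings in Python and counts them. The first two people are taken from the earlier listing; the ids 1 and 2 are made-up cases added so that one pattern occurs twice, and `participation_pattern` is a hypothetical helper name.

```python
from collections import Counter

def participation_pattern(observed_waves, n_waves):
    """Hypothetical helper: '1' for an observed wave, '.' for a missing one."""
    observed = set(observed_waves)
    return "".join("1" if w in observed else "." for w in range(1, n_waves + 1))

waves_by_pid = {
    10019057: range(1, 16),         # from the listing: all 15 waves
    10028005: [1, 2, 3, 6, 7, 9],   # from the listing: some waves missing
    1: [1],                         # made-up case: wave 1 only
    2: [1],                         # made-up case: wave 1 only
}

pattern_counts = Counter(
    participation_pattern(waves, 15) for waves in waves_by_pid.values()
)

# Most frequent patterns first, like the xtdes table.
for pattern, freq in pattern_counts.most_common():
    print(f"{freq:6d}  {pattern}")
```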


Examining change over two waves

      2001 |
Employment |      2002 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       991         15         46 |     1,052
         2 |        20         12          9 |        41
         3 |        56         20        495 |       571
-----------+---------------------------------+----------
     Total |     1,067         47        550 |     1,664

      1991 |
Employment |      1992 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       961         35         76 |     1,072
         2 |        36         38         24 |        98
         3 |        40         23        524 |       587
-----------+---------------------------------+----------
     Total |     1,037         96        624 |     1,757


Calculating transition probabilities

The transition probability is the probability of transitioning from one state to another

$$p_{ij} = \Pr\{X_t = j \mid X_{t-1} = i\}$$

$$p_{ij} = N_{ij} \Big/ \sum_{j=1}^{n} N_{ij}$$

So to calculate by hand: cell count / row total
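As a worked instance of "cell count / row total", the Python sketch below (an illustration, not part of the original slides) divides each cell of the 1991 to 1992 employment-status cross-tabulation by its row total. The function name `transition_probabilities` is hypothetical.

```python
# Cell counts N_ij from the 1991 (rows) by 1992 (columns) cross-tabulation.
counts_1991_92 = [
    [961, 35, 76],    # status 1 in 1991
    [36, 38, 24],     # status 2 in 1991
    [40, 23, 524],    # status 3 in 1991
]

def transition_probabilities(counts):
    """Divide every cell by its row total: p_ij = N_ij / sum_j N_ij."""
    return [[cell / sum(row) for cell in row] for row in counts]

probs = transition_probabilities(counts_1991_92)

for origin, row in enumerate(probs, start=1):
    print(origin, " ".join(f"{p:.2f}" for p in row))
```

Each row of the result sums to one, because everyone who starts in a state must end up in exactly one destination state.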


Transition probability matrix

      2001 |
Employment |      2002 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |      0.94       0.01       0.04 |      1.00
         2 |      0.49       0.29       0.22 |      1.00
         3 |      0.10       0.04       0.87 |      1.00
-----------+---------------------------------+----------

      1991 |
Employment |      1992 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |      0.90       0.03       0.07 |      1.00
         2 |      0.37       0.39       0.24 |      1.00
         3 |      0.07       0.04       0.89 |      1.00
-----------+---------------------------------+----------


Transition probability matrices in Stata

. xttrans jbstat if wave<3, freq

   current |
  economic |      current economic activity
  activity |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       961         35         76 |     1,072
           |     89.65       3.26       7.09 |    100.00
-----------+---------------------------------+----------
         2 |        36         38         24 |        98
           |     36.73      38.78      24.49 |    100.00
-----------+---------------------------------+----------
         3 |        40         23        524 |       587
           |      6.81       3.92      89.27 |    100.00
-----------+---------------------------------+----------
     Total |     1,037         96        624 |     1,757
           |     59.02       5.46      35.52 |    100.00

If you leave out the “if” qualifier, you get mean transition probabilities over all waves, from t to t+1


Change in a categorical variable over time: a decision tree

[Figure: decision tree of employment status over successive waves, with branches labelled empl, unemp and olf. Branch probabilities as extracted: 0.90, 0.03, 0.07; 0.91, 0.03, 0.06; 0.26, 0.49, 0.25; 0.10, 0.03, 0.87]


Change in a continuous variable over time

Size transition matrix

Quantile transition matrix

Mean transition matrix

Median transition matrix


Size transition matrix

Absolute mobility, e.g. movement in and out of poverty

Boundaries are set exogenously, i.e. predetermined, e.g. poverty defined a priori as an income below £5,000

Boundaries do not depend on the distribution under investigation, e.g. when comparing mobility in the 1990s and 2000s

Incorporates both movements in the positions of individuals and economic growth


Quantile transition matrix

Mobility as a relative concept

Same number of individuals in each class

Only records movements involving re-ranking

Cannot take account of economic growth, for example when comparing matrices

Cannot draw a complete picture if comparing mobility in different cohorts/countries/welfare regimes


Mean/median transition matrices

Both absolute and relative approaches incorporated into matrices

Class boundaries defined as percentages of mean or median income of the origin and destination distributions

Example: 25%, 50%, 75% of median income

Note that this is not the same as quartiles


Example: income 1991-1992

wave = 1

            household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%       181.86              0
 5%       349.82              0
10%       458.98              0       Obs                2795
25%     826.6895              0       Sum of Wgt.        2795

50%     1511.067                      Mean           1773.253
                        Largest       Std. Dev.      1299.089
75%     2365.493       9230.818
90%     3329.769       9230.818       Variance        1687633
95%     4062.217       9230.818       Skewness       1.836874
99%     6748.689       9230.818       Kurtosis       8.622895

wave = 2

            household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%     207.9433              0
 5%     338.7431              0
10%       460.68              0       Obs                2639
25%       861.67              5       Sum of Wgt.        2639

50%         1508                      Mean           1795.179
                        Largest       Std. Dev.      1229.827
75%     2449.813       8405.636
90%     3414.511       8405.636       Variance        1512476
95%     4103.649       10491.08       Skewness       1.352148
99%     5824.449       10491.08       Kurtosis       6.370836


Category boundaries for each method

Matrix    Year | Boundary 1 (n)    | Boundary 2 (n)     | Boundary 3 (n)      | Boundary 4 (n)
---------------+-------------------+--------------------+---------------------+---------------------
Size      1991 | 0 - 800 (580)     | 800 - 1500 (650)   | 1500 - 2200 (504)   | 2200 - 9231 (715)
          1992 | 0 - 800 (580)     | 800 - 1500 (645)   | 1500 - 2200 (473)   | 2200 - 10491 (751)
Quartile  1991 | 0 - 827 (609)     | 827 - 1511 (615)   | 1511 - 2365 (611)   | 2365 - 9231 (614)
          1992 | 0 - 862 (610)     | 862 - 1508 (612)   | 1508 - 2450 (612)   | 2450 - 10491 (615)
Mean      1991 | 0 - 887 (654)     | 887 - 1773 (814)   | 1773 - 2660 (506)   | 2660 - 9231 (475)
          1992 | 0 - 898 (652)     | 898 - 1795 (766)   | 1795 - 2693 (501)   | 2693 - 10491 (530)
Median    1991 | 0 - 750 (539)     | 750 - 1500 (685)   | 1500 - 2250 (540)   | 2250 - 9231 (685)
          1992 | 0 - 746 (536)     | 746 - 1491 (686)   | 1491 - 2237 (505)   | 2262 - 10491 (722)
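The mean-based rows of the table can be checked against the summary statistics on the previous slide. The sketch below is an illustration only: it assumes the class boundaries are 50%, 100% and 150% of mean income, which is inferred from the numbers in the table rather than stated on the slide, and `mean_based_boundaries` is a hypothetical helper.

```python
# Mean household income from the summary statistics for each wave.
mean_income = {1991: 1773.253, 1992: 1795.179}

def mean_based_boundaries(mean, shares=(0.5, 1.0, 1.5)):
    """Class boundaries as fixed shares of the mean (shares are an assumption)."""
    return [round(mean * share) for share in shares]

boundaries = {year: mean_based_boundaries(m) for year, m in mean_income.items()}

for year, cuts in boundaries.items():
    print(year, cuts)
```

Because the boundaries move with each year's mean, the classes are not quartiles: the number of people in each class differs, as the (n) values in the table show.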


Warning!

Measurement error causes an over-estimation of mobility

If mother’s and baby’s weights are reported to the nearest half pound, this can affect which band an observation falls in

A respondent may describe their marital status as separated in year 1 and single in year 2


Finally…..

Greater challenges to understanding and checking panel data

Transition matrices are a good way to summarise mobility patterns

Different methods of constructing matrices lead to distinct interpretations

May need to take account of measurement error when modelling change


SC968 Panel data methods for sociologists

Lecture 2, part 2

Concepts for panel data analysis


Overview

Types of questions, types of variables: time-invariant, time-varying and trend

Between- and within-individual variation

Concept of individual heterogeneity

From OLS to models that allow causal interpretations: fixed effects and random effects models

The basics of these models’ implementation in Stata


Types of variable

Those which vary between individuals but hardly ever over time
• Sex
• Ethnicity
• Parents’ social class when you were 14
• The type of primary school you attended (once you’ve become an adult)

Those which vary over time, but not between individuals
• The retail price index
• National unemployment rates
• Age, in a cohort study

Those which vary both over time and between individuals
• Income
• Health
• Psychological wellbeing
• Number of children you have
• Marital status

Trend variables: vary between individuals and over time, but in highly predictable ways
• Age
• Year


Between- and within-individual variation

If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample:

• The fact that individuals are systematically different from one another (between-individual variation)
• The fact that individuals’ behaviour varies between observations over time (within-individual variation)

Data on k individuals (rows), each observed m times (columns):

$$\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \dots & \dots & \dots & \dots \\ x_{k1} & x_{k2} & \dots & x_{km} \end{pmatrix}$$

Total variation is the sum, over all individuals and years, of the square of the difference between each observation of x and the mean

$$T = \sum_{i=1}^{k} \sum_{j=1}^{m} (x_{ij} - \bar{x})^2$$

Within variation is the sum of the squares of the deviations of each individual’s observations from his or her own mean

$$W = \sum_{i=1}^{k} \sum_{j=1}^{m} (x_{ij} - \bar{x}_i)^2$$

Between variation is the sum of squares of differences between individual means and the whole-sample mean

$$B = \sum_{i=1}^{k} \sum_{j=1}^{m} (\bar{x}_i - \bar{x})^2$$

Remember: from the variation, you get to the variance, and from that to the standard deviation:

$$SD = \sqrt{T/(N-1)}$$
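The three sums of squares can be computed directly from these definitions. The Python sketch below is an illustration on made-up data (three individuals, three observations each; the numbers are not from the survey) and checks that total variation splits exactly into within plus between variation.

```python
from math import sqrt

# Made-up balanced panel: k = 3 individuals (rows), m = 3 observations each.
x = [
    [2.0, 4.0, 6.0],
    [1.0, 1.0, 4.0],
    [7.0, 8.0, 9.0],
]
k, m = len(x), len(x[0])
N = k * m

grand_mean = sum(sum(row) for row in x) / N
individual_means = [sum(row) / m for row in x]

# Total: every observation against the whole-sample mean.
T = sum((v - grand_mean) ** 2 for row in x for v in row)
# Within: every observation against that individual's own mean.
W = sum((v - mean_i) ** 2 for row, mean_i in zip(x, individual_means) for v in row)
# Between: each individual's mean against the whole-sample mean, once per observation.
B = sum((mean_i - grand_mean) ** 2 for mean_i in individual_means for _ in range(m))

overall_sd = sqrt(T / (N - 1))

print(f"T = {T:.3f}, W = {W:.3f}, B = {B:.3f}, W + B = {W + B:.3f}")
print(f"overall SD = {overall_sd:.3f}")
```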


xtsum in Stata

Similar to ordinary “sum” command

. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit

. xtsum female partner age ue_sick LIKERT wave if nwaves == 15

Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
female   overall |  .5397574   .4984321          0          1 |     N =   16324
         between |             .4989059          0          1 |     n =    1237
         within  |                    0   .5397574   .5397574 | T-bar = 13.1964
partner  overall |  .6892954   .4627963          0          1 |     N =   16292
         between |             .4217842          0          1 |     n =    1234
         within  |              .243531   -.244038   1.622629 | T-bar = 13.2026
age      overall |  40.03349   19.74332          0         98 |     N =   19410
         between |             19.27238        6.4   90.93333 |     n =    1294
         within  |              4.31763   31.30015   54.30015 |     T =      15
ue_sick  overall |  .0672924   .2505353          0          1 |     N =   16302
         between |             .1738938          0          1 |     n =    1237
         within  |             .1852756   -.866041   1.000626 | T-bar = 13.1787
LIKERT   overall |  11.26167   5.344825          0         36 |     N =   15661
         between |             3.609665          0   29.69231 |     n =    1225
         within  |             4.030974  -6.738331   35.12834 | T-bar = 12.7845
wave     overall |         8   4.320605          1         15 |     N =   19410
         between |                    0          8          8 |     n =    1294
         within  |             4.320605          1         15 |     T =      15

if nwaves == 15: have chosen a balanced sample

female: all variation is “between”

partner: most variation is “between”, because it’s fairly rare to switch between having and not having a partner

wave: all variation is “within”, because this is a balanced sample


More on xtsum….

In the Observations column of the xtsum output above:

N: observations with non-missing variable

n: number of individuals

T-bar: average number of time-points

In the “between” rows, min & max refer to xi-bar, the individual means

In the “within” rows, min & max refer to individual deviations from own averages, with the global average added back in


The xttab command

. xttab jbstat if nwaves == 15 & jbstat >= 1 & jbstat != 5 & jbstat <= 8

                  Overall             Between            Within
   jbstat |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
 self-emp |     1388     8.66        228    18.45          42.72
 employed |     8982    56.03        974    78.80          68.27
 unemploy |      539     3.36        274    22.17          17.51
  retired |     2687    16.76        314    25.40          58.49
 family c |     1159     7.23        292    23.62          28.97
 ft studt |      718     4.48        271    21.93          42.93
 lt sick, |      558     3.48        105     8.50          39.08
----------+-----------------------------------------------------
    Total |    16031   100.00       2458   198.87          50.28
                               (n = 1236)

For simplicity, omitted jbstats of missing, maternity leave, gov training and other.

Overall: pooled sample, broken down by person/years

Between: number of people who spent any time in this state

Within: of those who spent any time in this state, the proportion of their time (on average) they spent in it


Which statistical model for panel data?

Your research question will guide which models are most suitable but the nature of your data is also important:

What is the effect on income of having more children?

• What is the difference in income between individuals who have a different number of children?
• What is the difference in income before and after the birth of a child?
• What is the difference in income between men and women, and before and after the birth of a child?
• How does income change in the time leading up to the birth of a child? (Survival analysis: later in this course!)

Is your research question cross-sectional or longitudinal, or both?

• Cross-sectional: exploit variation between individuals
• Longitudinal: exploit variation “within” individuals over time; this permits a causal interpretation of effects, and “between” variation can also be considered if needed


Longitudinal analysis is concerned with modelling individual heterogeneity

A very simple concept: people are different!

In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity:

• Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions
• Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using

With panel data we can do something about unobserved heterogeneity as we can differentiate between person-level unobserved x that are identical over time and those that vary over time!


OLS with panel data

pid  wave     y   x1
  1     1  2340    0
  1     2  2405    5
  1     3  2730   10
  1     4  3250   15
  1     5  3705   20
  1     6  4030   25
  2     1  1885    5
  2     2  2145   10
  2     3  2275   15
  2     4  2470   20
  2     5  2762   25
  2     6  3120   30
  3     1   780   10
  3     2  1170   15
  3     3  1365   20
  3     4  2405   25
  3     5  2405   30
  3     6  2470   35

OLS, t=1: y = 2448 - 156*x1

OLS, pooled: y = 1925 + 29*x1

[Figure: two scatter plots of Income against Number of years since leaving school for pid=1, pid=2 and pid=3, with fitted lines; panels titled “OLS: cross-section” and “OLS: pooled”]

Cross-sectional estimates may be quite misleading (omitted variable bias)!

By adding more data points from the same units at different points in time we can get better estimates. But the assumptions of OLS may be violated!
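Both fitted lines can be recomputed from the 18 rows of the table. The sketch below is an illustrative Python recalculation (the slides themselves use Stata); the helper `ols` is a hypothetical one-regressor least-squares function written for this example.

```python
# Toy panel from the slide: income y and years since leaving school x1,
# for three people (pid) observed over six waves.
y = {1: [2340, 2405, 2730, 3250, 3705, 4030],
     2: [1885, 2145, 2275, 2470, 2762, 3120],
     3: [780, 1170, 1365, 2405, 2405, 2470]}
x1 = {1: [0, 5, 10, 15, 20, 25],
      2: [5, 10, 15, 20, 25, 30],
      3: [10, 15, 20, 25, 30, 35]}

def ols(xs, ys):
    """Return (intercept, slope) of a one-regressor least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Cross-section: wave 1 only (the first observation of each person).
cs_intercept, cs_slope = ols([x1[p][0] for p in y], [y[p][0] for p in y])

# Pooled: stack all 18 person-wave observations, ignoring who they belong to.
pooled_x = [v for p in y for v in x1[p]]
pooled_y = [v for p in y for v in y[p]]
pooled_intercept, pooled_slope = ols(pooled_x, pooled_y)

print(f"cross-section: y = {cs_intercept:.0f} + ({cs_slope:.0f})*x1")
print(f"pooled:        y = {pooled_intercept:.0f} + ({pooled_slope:.0f})*x1")
```

The cross-section slope is negative because, at wave 1, the people with more years since leaving school happen to be the ones with lower incomes; pooling all waves brings in each person's upward trajectory and flips the sign.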


An illustration of how unobserved heterogeneity matters

Considering this is from panel data, two problems become apparent:

• Error terms for persons 1, 2 and 3 differ systematically
• The association between x and y appears to be biased

[Figure: two scatter plots of Income against Number of years since leaving school for pid=1, pid=2 and pid=3; panels titled “OLS: pooled” and “OLS: unobs het”, the second marking the error terms w1 and w3 and the person-specific component u1]

Panel data allows you to:

Break down the error term ($w_i$) into two components: the unobservable characteristics of the person ($u_i$), and genuine “error” ($\varepsilon_i$),

then model $u_i$ and $\varepsilon_i$


Expanding the OLS model to consider unobserved heterogeneity

Analytically, think of splitting the error term into its two components $u_i$ and $\varepsilon_i$

$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_K x_{iK} + u_i + \varepsilon_i$$

… and consider that you have repeated observations over time

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$u_i$: individual-specific, fixed over time

$\varepsilon_{it}$: varies over time, usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)

… and then reduce the complexity of the information available in some way, or add further assumptions. Your options:

• Focus on “between” variation: lose info on “within” variation
• Focus on “within” variation: lose info on “between” variation
• Model both types of variation, making further assumptions


Within and between estimators

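$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$u_i$: individual-specific, fixed over time

$\varepsilon_{it}$: varies over time, usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)

Not interested in within variation? Use the means of all observations for each person i. This is the “between” estimator:

$$\bar{y}_i = \alpha + \beta \bar{x}_i + u_i + \bar{\varepsilon}_i$$

Not interested in “between” variation? Why not “remove” it in that case! This is the “within” estimator (“fixed effects”):

$$(y_{it} - \bar{y}_i) = \beta (x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)$$

Interested in both? Well, let’s treat $\bar{x}_i$ as an imperfect measure of the person fixed effect, and use between variation where within variation is poorly captured:

$$(y_{it} - \theta \bar{y}_i) = (1 - \theta)\alpha + \beta (x_{it} - \theta \bar{x}_i) + \{(1 - \theta) u_i + (\varepsilon_{it} - \theta \bar{\varepsilon}_i)\}$$

θ governs the weight given to between-group variation, and is derived from the variances of $u_i$ and $\varepsilon_{it}$: with θ = 1 the between variation is removed entirely (the fixed effects estimator), with θ = 0 the model is pooled OLS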


Between estimator

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$$\bar{y}_i = \alpha + \beta \bar{x}_i + u_i + \bar{\varepsilon}_i$$

Interpret as: how much does y change between different people

Not much used
• Except to calculate the θ parameter for random effects, but Stata does this, not you!
• It’s inefficient compared to random effects
• It doesn’t use as much information as is available in the data (only uses means)

Assumption required: that ui is uncorrelated with xi
• Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x’s (via the betas) as opposed to the correlation?

Can’t estimate effects of variables whose mean is invariant over individuals
• Age in a cohort study
• Macro-level variables


Focusing on “within” variation – the fixed effects family

“Fixed effects” estimator: xtreg y x, fe

Basic idea: For each individual, calculate the mean of x and the mean of y. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - \bar{y}_i)$ and each $x_{it}$ is replaced by $(x_{it} - \bar{x}_i)$

Identical to: Least Squares Dummy Variables regression: areg y x, absorb(pid)

Include a dummy indicator for each individual; all time-constant individual-level differences, including the person-specific error component $u_i$, will then be captured in the person-specific intercept.

Members of the same family, which you may come across in the literature:

First differences: regress D.(y x)

For each individual, and each time period’s y and x, calculate the difference between the value in this period and that in the last period. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - y_{it-1})$ and each $x_{it}$ is replaced by $(x_{it} - x_{it-1})$

“Hybrid models”: regress y x mean_x z

Run standard OLS, but add the individual mean $\bar{x}_i$ of each time-varying variable as an additional regressor


Fixed effects estimator

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$$(y_{it} - \bar{y}_i) = \beta (x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)$$

Ignores between-group variation, so it’s an inefficient estimator

However, few assumptions are required for FE to be consistent: ui is allowed to correlate with xi

Disadvantage: can’t estimate the effects of any time-invariant variables

Need to consider change in interpretation of effects

Fixed effects: y = 65*x1

[Figure: scatter plot titled “Fixed Effects” of demeaned Income against demeaned Number of years since leaving school for pid=1, pid=2 and pid=3]

pid  wave     y   x1    y_bar_i  x_bar_i  (y - y_bar_i)  (x - x_bar_i)
  1     1  2340    0     3076.7     12.5         -736.7          -12.5
  1     2  2405    5     3076.7     12.5         -671.7           -7.5
  1     3  2730   10     3076.7     12.5         -346.7           -2.5
  1     4  3250   15     3076.7     12.5          173.3            2.5
  1     5  3705   20     3076.7     12.5          628.3            7.5
  1     6  4030   25     3076.7     12.5          953.3           12.5
  2     1  1885    5     2442.8     17.5         -557.8          -12.5
  2     2  2145   10     2442.8     17.5         -297.8           -7.5
  2     3  2275   15     2442.8     17.5         -167.8           -2.5
  2     4  2470   20     2442.8     17.5           27.2            2.5
  2     5  2762   25     2442.8     17.5          319.2            7.5
  2     6  3120   30     2442.8     17.5          677.2           12.5
  3     1   780   10     1765.8     22.5         -985.8          -12.5
  3     2  1170   15     1765.8     22.5         -595.8           -7.5
  3     3  1365   20     1765.8     22.5         -400.8           -2.5
  3     4  2405   25     1765.8     22.5          639.2            2.5
  3     5  2405   30     1765.8     22.5          639.2            7.5
  3     6  2470   35     1765.8     22.5          704.2           12.5
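The demeaned columns in the table, and the fixed effects slope of about 65, can be reproduced with a few lines of code. This Python sketch is an illustration of the within transformation on the toy data, not a replacement for xtreg; the helper name `demean` is hypothetical.

```python
# Toy panel from the slide (three people, six waves each).
y = {1: [2340, 2405, 2730, 3250, 3705, 4030],
     2: [1885, 2145, 2275, 2470, 2762, 3120],
     3: [780, 1170, 1365, 2405, 2405, 2470]}
x1 = {1: [0, 5, 10, 15, 20, 25],
      2: [5, 10, 15, 20, 25, 30],
      3: [10, 15, 20, 25, 30, 35]}

def demean(values):
    """Subtract the person's own mean from each of their observations."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Within transformation: stack the person-demeaned x and y.
x_within = [d for p in y for d in demean(x1[p])]
y_within = [d for p in y for d in demean(y[p])]

# OLS through the origin on the demeaned data (both variables have mean zero).
beta_fe = (sum(a * b for a, b in zip(x_within, y_within))
           / sum(a * a for a in x_within))

print(f"fixed effects slope: {beta_fe:.2f}")
```

Because each person's mean has been removed, only changes over time within a person contribute to this slope, which is why it differs so much from the pooled OLS estimate of about 29.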


Want to look at the effect of a non-time-varying variable? Use $x_{it}$ and $\bar{x}_i$ in OLS

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$$y_{it} = \alpha + \beta_1 x_{it} + \beta_2 \bar{x}_i + \beta_3 z_i + u_i^{residual} + \varepsilon_{it}$$

$z_i$: non-time-varying individual characteristics, for which you do not need to include group means

• The effect of any unobserved characteristic otherwise transported in the effect of $x_{it}$ is shifted to the effect of $\bar{x}_i$: $\beta_1$ approximates the coefficient in the FE model, and $\beta_3$ gives you, approximately, the OLS estimate for non-time-varying variables
• Typically there is no interest in the effect of $\bar{x}_i$, so no need to worry about its interpretation. Note that $\beta_3$ is approximately equal to the effect in the pooled OLS
• Disadvantage: can only control for unobserved heterogeneity associated with the observed time-varying variables $x_{it}$; the rest remains in $u_i^{residual}$

Hint: create $\bar{x}_i$ yourself

pid  wave     y   x   z  x_bar
  1     1  2340   1   1    1.5
  1     2  2405   2   1    1.5
  1     3  2730   2   1    1.5
  1     4  3250   2   1    1.5
  1     5  3705   1   1    1.5
  1     6  4030   1   1    1.5
  2     1  1885   0   2   0.66
  2     2  2145   1   2   0.66
  2     3  2275   1   2   0.66
  2     4  2470   1   2   0.66
  2     5  2762   1   2   0.66
  2     6  3120   0   2   0.66
  3     1   780   1   2   0.33
  3     2  1170   1   2   0.33
  3     3  1365   0   2   0.33
  3     4  2405   0   2   0.33
  3     5  2405   0   2   0.33
  3     6  2470   0   2   0.33
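Creating the person mean yourself amounts to averaging x within each pid and attaching that average to every row for that person. The Python sketch below shows this on the x column of the table; it is an illustration only, and the variable names are hypothetical.

```python
# The time-varying x for each person, in wave order, from the table.
x_by_pid = {
    1: [1, 2, 2, 2, 1, 1],
    2: [0, 1, 1, 1, 1, 0],
    3: [1, 1, 0, 0, 0, 0],
}

# Person mean of x: one number per pid.
x_bar = {pid: sum(values) / len(values) for pid, values in x_by_pid.items()}

# Long-format rows (pid, wave, x, x_bar): the mean is repeated on every row.
long_rows = [
    (pid, wave, x, x_bar[pid])
    for pid, values in x_by_pid.items()
    for wave, x in enumerate(values, start=1)
]

for row in long_rows[:3]:
    print(row)
```

The means come out as 1.5, 0.67 and 0.33 after rounding; the table shows 0.66 for the second person, which looks like truncation rather than rounding.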


Random effects estimator

Uses both within- and between-group variation, so makes best use of the data and is efficient.

Starts off with the idea that using $\bar{x}_i$ is not the best we can do to capture within variation: the more imprecise the estimate of the person-level variation (as measured by the person mean $\bar{x}_i$), the more we should draw on the information from other units ($\bar{x}$)

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$$(y_{it} - \theta \bar{y}_i) = (1 - \theta)\alpha + \beta (x_{it} - \theta \bar{x}_i) + \{(1 - \theta) u_i + (\varepsilon_{it} - \theta \bar{\varepsilon}_i)\}$$

The “random effects model” here is estimated by RE generalised least squares (GLS)

Assumption required: that ui is uncorrelated with xi
• Rather heroic assumption: think of examples
• Will see a test for this later

Note that the within and between effects are constrained to be identical (much more like OLS in this respect, so no causal interpretation!). E.g., when you include a location indicator in your model, you are saying that the effect on y of moving to a new town is the same as the effect on y of living in different towns. When you include a female dummy, you are saying that the effect of being female on y is the same as the effect on y of changing gender.


Estimating fixed effects in Stata

. xtreg LIKERT female ue_sick partner age age2 badh, fe

Fixed-effects (within) regression               Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0501                         Obs per group: min =         1
       between = 0.1906                                        avg =       7.3
       overall = 0.1285                                        max =        14

                                                F(5,20882)         =    220.44
corr(u_i, Xb)  = 0.1561                         Prob > F           =    0.0000

      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  (dropped)
     ue_sick |   1.951485   .1394164    14.00   0.000     1.678218    2.224752
     partner |   -.298668    .118635    -2.52   0.012    -.5312018   -.0661342
         age |   .1141748   .0214403     5.33   0.000     .0721501    .1561994
        age2 |  -.0011833   .0002209    -5.36   0.000    -.0016163   -.0007503
   badhealth |   1.230831   .0428556    28.72   0.000      1.14683    1.314831
       _cons |   6.252975   .4932977    12.68   0.000     5.286073    7.219877
-------------+----------------------------------------------------------------
     sigma_u |  3.9934565
     sigma_e |  4.0525618
         rho |  .49265449   (fraction of variance due to u_i)

F test that all u_i=0:     F(3316, 20882) =     4.56         Prob > F = 0.0000

R-sq: “R-square-like” statistic

age, age2: the age effect peaks at age 48

sigma_u, sigma_e: “u” and “e” are the two parts of the error term

Talk about xtmixed
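Two of the annotations can be verified from the printed coefficients. The Python sketch below (an illustration, using only numbers copied from the output above) finds the turning point of the quadratic in age and recomputes rho as the share of the error variance due to u_i.

```python
# Coefficients on age and age squared from the fixed effects output.
b_age = 0.1141748
b_age2 = -0.0011833

# A quadratic b_age*age + b_age2*age^2 has its maximum where the
# derivative b_age + 2*b_age2*age equals zero.
turning_point = -b_age / (2 * b_age2)

# Standard deviations of the two error components from the same output.
sigma_u = 3.9934565
sigma_e = 4.0525618

# rho: fraction of the total error variance that is due to u_i.
rho = sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)

print(f"age effect peaks at about {turning_point:.1f} years")
print(f"rho = {rho:.4f}")
```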


Between regression:

. xtreg LIKERT female ue_sick partner age age2 badh, be

Between regression (regression on group means)  Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0480                         Obs per group: min =         1
       between = 0.2322                                        avg =       7.3
       overall = 0.1482                                        max =        14

                                                F(6,3310)          =    166.80
sd(u_i + avg(e_i.))=  3.833357                  Prob > F           =    0.0000

      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.476659   .1350226    10.94   0.000     1.211923    1.741395
     ue_sick |   2.038192    .312191     6.53   0.000     1.426085    2.650299
     partner |  -.0101941   .1777423    -0.06   0.954      -.35869    .3383019
         age |   .0827335   .0219026     3.78   0.000     .0397895    .1256775
        age2 |  -.0009489   .0002263    -4.19   0.000    -.0013927   -.0005052
   badhealth |   2.275832   .0926521    24.56   0.000     2.094171    2.457493
       _cons |   3.953941   .4430909     8.92   0.000     3.085181    4.822701

Not much used, but useful to compare coefficients with fixed effects

Coefficient on “partner” was negative and significant in FE model.

In FE, the “partner” coeff really measures the events of gaining or losing a partner


Random effects regression

. xtreg LIKERT female ue_sick partner age age2 badh, re theta

Random-effects GLS regression                   Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0500                         Obs per group: min =         1
       between = 0.2239                                        avg =       7.3
       overall = 0.1471                                        max =        14

Random effects u_i ~ Gaussian                   Wald chi2(6)       =   2013.32
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------- theta --------------------
  min      5%    median     95%      max
0.1986   0.1986   0.5482   0.6629   0.6629

      LIKERT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.493431   .1259931    11.85   0.000     1.246489    1.740373
     ue_sick |   2.045302   .1271039    16.09   0.000     1.796183    2.294422
     partner |  -.1947691   .0973734    -2.00   0.045    -.3856175   -.0039207
         age |   .1058038    .014544     7.27   0.000     .0772981    .1343094
        age2 |  -.0011062   .0001498    -7.39   0.000    -.0013998   -.0008126
   badhealth |   1.433115   .0385506    37.17   0.000     1.357558    1.508673
       _cons |   5.181864   .3137662    16.52   0.000     4.566894    5.796835
-------------+----------------------------------------------------------------
     sigma_u |  3.0248563
     sigma_e |  4.0525618
         rho |   .3577895   (fraction of variance due to u_i)

Option “theta” gives a summary of weights

theta tells you how good an approximation xi_bar is of the person-level effect, or how much of the within variation was used to determine the effect size: zero = OLS estimator, 1 = FE estimator
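The theta summary can be reproduced from sigma_u and sigma_e. The sketch below assumes the formula given in Stata's xtreg documentation for the random effects estimator, theta_i = 1 - sqrt(sigma_e^2 / (T_i * sigma_u^2 + sigma_e^2)), where T_i is the number of observations for person i; that formula is not shown on the slides, so treat it as an assumption of this illustration.

```python
from math import sqrt

# Error-component standard deviations from the random effects output.
sigma_u = 3.0248563
sigma_e = 4.0525618

def theta(n_obs, sigma_u, sigma_e):
    """Weight for a person observed n_obs times (formula assumed from the xtreg docs)."""
    return 1 - sqrt(sigma_e ** 2 / (n_obs * sigma_u ** 2 + sigma_e ** 2))

# The output reports between 1 and 14 observations per group.
theta_min = theta(1, sigma_u, sigma_e)
theta_max = theta(14, sigma_u, sigma_e)

print(f"theta for 1 observation:   {theta_min:.4f}")
print(f"theta for 14 observations: {theta_max:.4f}")
```

People observed more often get a theta closer to 1, so their estimate leans more on within variation; people observed only once get a theta close to the pooled OLS end of the scale.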


And what about OLS?

OLS simply treats within- and between-group variation as the same

Pools data across waves

. reg LIKERT female ue_sick partner age age2 badh

      Source |       SS       df       MS              Number of obs =   24204
-------------+------------------------------           F(  6, 24197) =  706.54
       Model |  103583.505     6  17263.9175           Prob > F      =  0.0000
    Residual |  591239.694 24197  24.4344214           R-squared     =  0.1491
-------------+------------------------------           Adj R-squared =  0.1489
       Total |  694823.199 24203  28.7081436           Root MSE      =  4.9431

      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.409466   .0640651    22.00   0.000     1.283895    1.535038
     ue_sick |   2.031815   .1240757    16.38   0.000     1.788619    2.275011
     partner |  -.0751296   .0769271    -0.98   0.329    -.2259116    .0756524
         age |   .0983746   .0103316     9.52   0.000      .078124    .1186252
        age2 |  -.0010613   .0001049   -10.12   0.000     -.001267   -.0008557
   badhealth |   1.841796   .0357165    51.57   0.000     1.771789    1.911802
       _cons |   4.450393   .2212733    20.11   0.000     4.016684    4.884102


Test whether pooling data is valid

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

If the ui do not vary between individuals, they can be treated as part of α and OLS is fine.

Breusch-Pagan Lagrange multiplier test

H0: variance of ui = 0

H1: variance of ui not equal to zero

If H0 is not rejected, you can pool the data and use OLS

Post-estimation test after random effects

. quietly xtreg LIKERT female ue_sick partner age age2 badh, re

. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        LIKERT[pid,t] = Xb + u[pid] + e[pid,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                  LIKERT |   28.70814       5.357998
                       e |   16.42326       4.052562
                       u |   9.149756       3.024856

        Test:   Var(u) = 0
                              chi2(1) = 10816.48
                          Prob > chi2 =     0.0000


Comparing models

Compare coefficients between models

Reasonably similar: differences in “partner” and “badhealth” coeffs

R-squareds are similar

Within and between estimators maximise the within and between R-squared respectively.

              |    FE         RE         BE         OLS
--------------+-------------------------------------------
female        |             1.49 ***   1.48 ***   1.41 ***
ue_sick       |  1.95 ***   2.05 ***   2.04 ***   2.03 ***
partner       | -0.30 **   -0.19 **   -0.01      -0.08
age           |  0.11 ***   0.11 ***   0.08 ***   0.10 ***
age2          |  0.00 ***   0.00 ***   0.00 ***   0.00 ***
badhealth     |  1.23 ***   1.43 ***   2.28 ***   1.84 ***
_cons         |  6.25 ***   5.18 ***   3.95 ***   4.45 ***
--------------+-------------------------------------------
R-2 within    |  0.050      0.050      0.048
R-2 between   |  0.191      0.224      0.232
R-2 overall   |  0.129      0.147      0.148      0.149