SC968 Panel data methods for sociologists Lecture 2, part 1 Introducing panel data
Mar 31, 2015
Overview
• Panel data: what it is; how to get to know the data
• Change over time: tabulating; calculating transition probabilities
What is panel data?
A data set containing observations on multiple phenomena observed at a single point in time is called cross-sectional data
A data set containing observations on a single phenomenon observed over multiple time periods is called time series data
Observations on multiple phenomena over multiple time periods are panel data
Cross-sectional and time series data are one-dimensional; panel data are two-dimensional
Panel data can be used to answer both longitudinal and cross-sectional questions!
Using panel data in Stata
Data on n cases, over t time periods, giving a total of n × t observations if every case is observed in every period (fewer in an unbalanced panel)
One record per observation
i.e. long format
Stata tools for analyzing panel data begin with the prefix xt
You first need to tell Stata that you have panel data, using xtset
  +------------------------------------------------------------------+
  |      pid   wave      sex   age     mastat     jbstat      fihhmn |
  |------------------------------------------------------------------|
  | 10019057      1   female    59   never ma    retired         780 |
  | 10019057      2   female    60   never ma    retired      759.14 |
  | 10019057      3   female    61   never ma    retired       923.5 |
  | 10019057      4   female    62   never ma    retired        62.5 |
  | 10019057      5   female    63   never ma    retired         663 |
  | 10019057      6   female    64   never ma    retired   missing o |
  | 10019057      7   female    65   never ma    retired    1254.963 |
  | 10019057      8   female    66   never ma    retired    1270.432 |
  | 10019057      9   female    67   never ma    retired    1364.555 |
  | 10019057     10   female    67   never ma    retired     1479.74 |
  | 10019057     11   female    68   never ma    retired     1328.25 |
  | 10019057     12   female    69   never ma    retired     1371.49 |
  | 10019057     13   female    71   never ma    retired   missing o |
  | 10019057     14   female    71   never ma    retired    1372.333 |
  | 10019057     15   female    73   never ma    retired    1475.812 |
  |------------------------------------------------------------------|
  | 10028005      1     male    30   never ma   employed    1501.155 |
  | 10028005      2     male    31   never ma   employed    1636.259 |
  | 10028005      3     male    32   never ma   employed    1943.283 |
  | 10028005      6     male    35   never ma   employed     2001.54 |
  | 10028005      7     male    36   never ma   employed     1634.33 |
  | 10028005      9     male    38   never ma   employed    1587.945 |
  +------------------------------------------------------------------+
Complete and incomplete person-wave data
Telling Stata you have panel data
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
• pid: unique cross-wave identifier (the panel variable)
• wave: the time variable
• delta: period between observations, in units of the time variable
• “unbalanced” and “but with gaps”: cases are not observed for every time period
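The checks behind the “(unbalanced)” and “but with gaps” messages can be sketched outside Stata. The Python below is only an illustration, not how xtset is implemented: `describe_panel` is a hypothetical helper, and the pid/wave pairs are copied from the person-wave listing above.

```python
from collections import defaultdict

# (pid, wave) pairs copied from the person-wave listing above
records = [(10019057, w) for w in range(1, 16)]
records += [(10028005, w) for w in (1, 2, 3, 6, 7, 9)]

def describe_panel(records):
    """Hypothetical helper: summarise a long-format panel, loosely like xtset."""
    waves_by_pid = defaultdict(set)
    for pid, wave in records:
        waves_by_pid[pid].add(wave)
    all_waves = set().union(*waves_by_pid.values())
    return {
        "n_cases": len(waves_by_pid),
        "n_obs": len(records),
        "first_wave": min(all_waves),
        "last_wave": max(all_waves),
        # balanced: every case is observed in every wave present in the data
        "balanced": all(w == all_waves for w in waves_by_pid.values()),
        # gaps: some case misses a wave between its first and last observation
        "has_gaps": any(len(w) < max(w) - min(w) + 1 for w in waves_by_pid.values()),
    }

info = describe_panel(records)
print(info)
```

For these two cases the sketch reports 21 person-wave records, an unbalanced panel, and at least one case with gaps.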
Describing the patterns in panel data
. xtdes, patterns(20)

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------------
     1294     28.12   28.12 |  111111111111111
      248      5.39   33.51 |  1..............
      157      3.41   36.93 |  11.............
      115      2.50   39.43 |  ..............1
      105      2.28   41.71 |  111............
      104      2.26   43.97 |  1111...........
       73      1.59   45.56 |  11111..........
       69      1.50   47.05 |  ............111
       66      1.43   48.49 |  ..........11111
       62      1.35   49.84 |  .............11
       60      1.30   51.14 |  .1.............
       60      1.30   52.45 |  11111111111....
       58      1.26   53.71 |  11111111.......
       58      1.26   54.97 |  111111111......
       57      1.24   56.21 |  11111111111111.
       55      1.20   57.40 |  .....1.........
       54      1.17   58.57 |  ........1111111
       54      1.17   59.75 |  .11111111111111
       54      1.17   60.92 |  1111111111.....
       53      1.15   62.07 |  .........111111
     1745     37.93  100.00 | (other patterns)
 ---------------------------+-----------------
     4601    100.00         |  XXXXXXXXXXXXXXX
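To see where a pattern string such as `111..11.1......` comes from, here is a small Python sketch. It is an illustration only: `participation_patterns` is a hypothetical helper, the two cases are the ones from the person-wave listing earlier, and Stata's own implementation may differ.

```python
from collections import Counter, defaultdict

# (pid, wave) pairs for the two cases in the person-wave listing
records = [(10019057, w) for w in range(1, 16)]
records += [(10028005, w) for w in (1, 2, 3, 6, 7, 9)]

def participation_patterns(records, first_wave, last_wave):
    """Hypothetical helper: one '1'/'.' string per case, one character per wave."""
    waves_by_pid = defaultdict(set)
    for pid, wave in records:
        waves_by_pid[pid].add(wave)
    return {
        pid: "".join("1" if w in waves else "." for w in range(first_wave, last_wave + 1))
        for pid, waves in waves_by_pid.items()
    }

patterns = participation_patterns(records, 1, 15)
pattern_counts = Counter(patterns.values())
for pattern, freq in pattern_counts.most_common():
    print(freq, pattern)
```

Counting identical strings across all cases gives the Freq. column of the xtdes table.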
Examining change over two waves
      2001 |
Employment |     2002 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       991         15         46 |     1,052
         2 |        20         12          9 |        41
         3 |        56         20        495 |       571
-----------+---------------------------------+----------
     Total |     1,067         47        550 |     1,664

      1991 |
Employment |     1992 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       961         35         76 |     1,072
         2 |        36         38         24 |        98
         3 |        40         23        524 |       587
-----------+---------------------------------+----------
     Total |     1,037         96        624 |     1,757
Calculating transition probabilities
The transition probability is the probability of transitioning from one state to another
$$p_{ij} = \Pr\{X_{t+1} = j \mid X_t = i\}$$

$$p_{ij} = N_{ij} \Big/ \sum_{j=1}^{n} N_{ij}$$

So to calculate by hand: transition probability = cell count / row total
Transition probability matrix
      2001 |
Employment |     2002 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |      0.94       0.01       0.04 |      1.00
         2 |      0.49       0.29       0.22 |      1.00
         3 |      0.10       0.04       0.87 |      1.00
-----------+---------------------------------+----------

      1991 |
Employment |     1992 Employment status
    status |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |      0.90       0.03       0.07 |      1.00
         2 |      0.37       0.39       0.24 |      1.00
         3 |      0.07       0.04       0.89 |      1.00
-----------+---------------------------------+----------
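The “cell count / row total” rule can be checked with a few lines of Python. This is a sketch rather than Stata code; the counts are the 1991 to 1992 cross-tabulation shown earlier, and `transition_probabilities` is a hypothetical helper name.

```python
# Cell counts from the 1991 -> 1992 employment status cross-tabulation
counts_1991_92 = [
    [961, 35, 76],   # origin state 1
    [36, 38, 24],    # origin state 2
    [40, 23, 524],   # origin state 3
]

def transition_probabilities(counts):
    """Divide each cell by its row total, so every row sums to 1."""
    probs = []
    for row in counts:
        row_total = sum(row)
        probs.append([cell / row_total for cell in row])
    return probs

P = transition_probabilities(counts_1991_92)
for row in P:
    print("  ".join(f"{p:.2f}" for p in row))
```

Rounded to two decimals, the rows reproduce the 1991 to 1992 transition probability matrix.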
Transition probability matrices in Stata
. xttrans jbstat if wave<3, freq

   current |
  economic |     current economic activity
  activity |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |       961         35         76 |     1,072
           |     89.65       3.26       7.09 |    100.00
-----------+---------------------------------+----------
         2 |        36         38         24 |        98
           |     36.73      38.78      24.49 |    100.00
-----------+---------------------------------+----------
         3 |        40         23        524 |       587
           |      6.81       3.92      89.27 |    100.00
-----------+---------------------------------+----------
     Total |     1,037         96        624 |     1,757
           |     59.02       5.46      35.52 |    100.00
If you leave out the “if” qualifier, you get mean transition probabilities over all waves, from t to t+1
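Change in a categorical variable over time: a decision tree
[Figure: decision tree of transitions between the states empl (employed), unemp (unemployed) and olf (out of the labour force). Branch probabilities, in the order they appear on the slide: 0.90, 0.03, 0.07; then 0.91, 0.03, 0.06; 0.26, 0.49, 0.25; 0.10, 0.03, 0.87.]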
Change in a continuous variable over time
Size transition matrix
Quantile transition matrix
Mean transition matrix
Median transition matrix
Size transition matrix
• Absolute mobility, e.g. movement in and out of poverty
• Boundaries set exogenously, i.e. predetermined, e.g. poverty defined a priori as an income below £5,000
• Boundaries do not depend on the distribution under investigation, e.g. when comparing mobility in the 1990s and 2000s
• Incorporates both movements in the positions of individuals and economic growth
Quantile transition matrix
• Mobility as a relative concept
• Same number of individuals in each class
• Only records movements involving re-ranking
• Cannot take account of economic growth, for example when comparing matrices
• Cannot draw a complete picture if comparing mobility in different cohorts/countries/welfare regimes
Mean/median transition matrices
• Both absolute and relative approaches incorporated into matrices
• Class boundaries defined as percentages of mean or median income of the origin and destination distributions
• Example: 25%, 50%, 75% of median income. Note that this is not the same as quartiles
Example: income 1991-1992
wave = 1

          household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%       181.86              0
 5%       349.82              0
10%       458.98              0       Obs                2795
25%     826.6895              0       Sum of Wgt.        2795

50%     1511.067                      Mean           1773.253
                        Largest       Std. Dev.      1299.089
75%     2365.493       9230.818
90%     3329.769       9230.818       Variance        1687633
95%     4062.217       9230.818       Skewness       1.836874
99%     6748.689       9230.818       Kurtosis       8.622895

wave = 2

          household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%     207.9433              0
 5%     338.7431              0
10%       460.68              0       Obs                2639
25%       861.67              5       Sum of Wgt.        2639

50%         1508                      Mean           1795.179
                        Largest       Std. Dev.      1229.827
75%     2449.813       8405.636
90%     3414.511       8405.636       Variance        1512476
95%     4103.649       10491.08       Skewness       1.352148
99%     5824.449       10491.08       Kurtosis       6.370836
Category boundaries for each method
Matrix     Year   Boundary 1 (n)   Boundary 2 (n)     Boundary 3 (n)      Boundary 4 (n)
Size       1991   0 - 800 (580)    800 - 1500 (650)   1500 - 2200 (504)   2200 - 9231 (715)
           1992   0 - 800 (580)    800 - 1500 (645)   1500 - 2200 (473)   2200 - 10491 (751)
Quartile   1991   0 - 827 (609)    827 - 1511 (615)   1511 - 2365 (611)   2365 - 9231 (614)
           1992   0 - 862 (610)    862 - 1508 (612)   1508 - 2450 (612)   2450 - 10491 (615)
Mean       1991   0 - 887 (654)    887 - 1773 (814)   1773 - 2660 (506)   2660 - 9231 (475)
           1992   0 - 898 (652)    898 - 1795 (766)   1795 - 2693 (501)   2693 - 10491 (530)
Median     1991   0 - 750 (539)    750 - 1500 (685)   1500 - 2250 (540)   2250 - 9231 (685)
           1992   0 - 746 (536)    746 - 1491 (686)   1491 - 2237 (505)   2262 - 10491 (722)
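The mean-based boundaries in this table are consistent with cut points at 50%, 100% and 150% of mean income. That reading is inferred from the reported means (1773.253 and 1795.179) and is not stated on the slide. A Python sketch under that assumption, with hypothetical helper names and the further assumption that classes include their lower bound:

```python
def relative_boundaries(centre, shares=(0.5, 1.0, 1.5)):
    """Cut points as shares of a mean or median (the shares are an assumption)."""
    return [round(centre * share) for share in shares]

def income_class(income, cut_points):
    """1-based class index; assumes classes include their lower bound."""
    for k, cut in enumerate(cut_points, start=1):
        if income < cut:
            return k
    return len(cut_points) + 1

size_cuts = [800, 1500, 2200]                  # fixed, set exogenously
mean_cuts_1991 = relative_boundaries(1773.253)
mean_cuts_1992 = relative_boundaries(1795.179)

print(mean_cuts_1991, mean_cuts_1992)
# The same income can fall into different classes under different methods
print(income_class(880, size_cuts), income_class(880, mean_cuts_1991))
```

An income of 880 is in class 2 under the size boundaries but in class 1 under the 1991 mean-based boundaries, which is one reason the matrices lead to different readings of mobility.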
Warning!
• Measurement error causes an over-estimation of mobility
• If mother’s and baby’s weights are reported to the nearest half pound, this can affect which band an observation falls in
• A respondent may describe their marital status as separated in year 1 and single in year 2
Finally…..
• Panel data pose greater challenges for understanding and checking the data
• Transition matrices are a good way to summarise mobility patterns
• Different methods of constructing matrices lead to distinct interpretations
• You may need to take account of measurement error when modelling change
SC968 Panel data methods for sociologists Lecture 2, part 2
Concepts for panel data analysis
Overview
• Types of questions, types of variables: time-invariant, time-varying and trend
• Between- and within-individual variation
• Concept of individual heterogeneity
• From OLS to models that allow causal interpretations: fixed effects and random effects models
• The basics of these models’ implementation in Stata
Types of variable
• Those which vary between individuals but hardly ever over time: sex; ethnicity; parents’ social class when you were 14; the type of primary school you attended (once you’ve become an adult)
• Those which vary over time, but not between individuals: the retail price index; national unemployment rates; age, in a cohort study
• Those which vary both over time and between individuals: income; health; psychological wellbeing; number of children you have; marital status
• Trend variables, which vary between individuals and over time, but in highly predictable ways: age; year
Between- and within-individual variation
If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample:
• The fact that individuals are systematically different from one another (between-individual variation)
• The fact that individuals’ behaviour varies between observations over time (within-individual variation)
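With k individuals (i = 1, …, k), each observed at m time points (j = 1, …, m), the data can be written as

$$\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{k1} & x_{k2} & \dots & x_{km} \end{pmatrix}$$

Total variation is the sum, over all individuals and years, of the square of the difference between each observation of x and the mean:

$$T = \sum_{i=1}^{k} \sum_{j=1}^{m} (x_{ij} - \bar{x})^2$$

Within variation is the sum of the squares of each individual’s observation from his or her mean:

$$W = \sum_{i=1}^{k} \sum_{j=1}^{m} (x_{ij} - \bar{x}_i)^2$$

Between variation is the sum of squares of differences between individual means and the whole-sample mean:

$$B = \sum_{i=1}^{k} \sum_{j=1}^{m} (\bar{x}_i - \bar{x})^2$$

Remember: from the variation you get to the variance, and from the variance to the standard deviation:

$$SD = \sqrt{T/(N-1)}$$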
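A small numerical check of these definitions may help. The Python sketch below uses a made-up balanced panel of three individuals with three observations each (the values are illustrative only, not from the lecture data) and confirms that total variation splits into within plus between variation.

```python
import math

# Made-up balanced panel: individual id -> observations over time (illustrative values)
panel = {1: [2.0, 4.0, 6.0], 2: [1.0, 1.0, 4.0], 3: [7.0, 8.0, 9.0]}

def decompose(panel):
    """Return (total, within, between) sums of squares."""
    all_obs = [x for xs in panel.values() for x in xs]
    grand_mean = sum(all_obs) / len(all_obs)
    total = sum((x - grand_mean) ** 2 for x in all_obs)
    within = 0.0
    between = 0.0
    for xs in panel.values():
        person_mean = sum(xs) / len(xs)
        within += sum((x - person_mean) ** 2 for x in xs)
        # the squared deviation of the person mean is counted once per observation
        between += len(xs) * (person_mean - grand_mean) ** 2
    return total, within, between

total, within, between = decompose(panel)
n_obs = sum(len(xs) for xs in panel.values())
overall_sd = math.sqrt(total / (n_obs - 1))
print(total, within, between, overall_sd)
```

Here T = 72, W = 16 and B = 56, so T = W + B, and the overall standard deviation is the square root of 72/8, which is 3.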
xtsum in STATA
Similar to ordinary “sum” command
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit

. xtsum female partner age ue_sick LIKERT wave if nwaves == 15

Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
female   overall |  .5397574   .4984321          0          1 |     N =   16324
         between |             .4989059          0          1 |     n =    1237
         within  |                    0   .5397574   .5397574 | T-bar = 13.1964
partner  overall |  .6892954   .4627963          0          1 |     N =   16292
         between |             .4217842          0          1 |     n =    1234
         within  |              .243531   -.244038   1.622629 | T-bar = 13.2026
age      overall |  40.03349   19.74332          0         98 |     N =   19410
         between |             19.27238        6.4   90.93333 |     n =    1294
         within  |              4.31763   31.30015   54.30015 |     T =      15
ue_sick  overall |  .0672924   .2505353          0          1 |     N =   16302
         between |             .1738938          0          1 |     n =    1237
         within  |             .1852756   -.866041   1.000626 | T-bar = 13.1787
LIKERT   overall |  11.26167   5.344825          0         36 |     N =   15661
         between |             3.609665          0   29.69231 |     n =    1225
         within  |             4.030974  -6.738331   35.12834 | T-bar = 12.7845
wave     overall |         8   4.320605          1         15 |     N =   19410
         between |                    0          8          8 |     n =    1294
         within  |             4.320605          1         15 |     T =      15
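• “if nwaves == 15”: have chosen a balanced sample
• female: all variation is “between” (the within standard deviation is 0)
• partner: most variation is “between”, because it’s fairly rare to switch between having and not having a partner
• wave: all variation is “within”, because this is a balanced sample (the between standard deviation is 0)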
More on xtsum….
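• N: number of observations with a non-missing value of the variable
• n: number of individuals
• T-bar: average number of time-points per individual
• “between” row: min and max refer to the individual means, xi_bar
• “within” row: min and max refer to individuals’ deviations from their own averages, with the global average added back in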
The xttab command
. xttab jbstat if nwaves == 15 & jbstat >= 1 & jbstat != 5 & jbstat <= 8

                  Overall             Between            Within
   jbstat |    Freq.  Percent      Freq.  Percent        Percent
----------+-----------------------------------------------------
 self-emp |    1388      8.66       228     18.45          42.72
 employed |    8982     56.03       974     78.80          68.27
 unemploy |     539      3.36       274     22.17          17.51
  retired |    2687     16.76       314     25.40          58.49
 family c |    1159      7.23       292     23.62          28.97
 ft studt |     718      4.48       271     21.93          42.93
 lt sick, |     558      3.48       105      8.50          39.08
----------+-----------------------------------------------------
    Total |   16031    100.00      2458    198.87          50.28
                               (n = 1236)
For simplicity, the jbstat categories missing, maternity leave, gov training and other are omitted.
• Overall: the pooled sample, broken down by person-years
• Between: the number of people who spent any time in this state
• Within: of those who spent any time in this state, the proportion of their time (on average) they spent in it
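The three blocks of xttab can be mimicked on a toy example. The Python sketch below uses made-up status sequences for three people over four waves, and `xttab_like` is a hypothetical helper. It computes the within percentage as a simple average across people, as described above; for unbalanced panels Stata's exact weighting may differ, so treat that part as an assumption.

```python
from collections import Counter

# Made-up balanced panel: person id -> status in each of four waves
states = {
    1: ["employed", "employed", "unemployed", "employed"],
    2: ["retired", "retired", "retired", "retired"],
    3: ["unemployed", "unemployed", "employed", "employed"],
}

def xttab_like(states):
    """Overall, between and within summaries for a categorical panel variable."""
    n_people = len(states)
    n_obs = sum(len(seq) for seq in states.values())
    overall = Counter(s for seq in states.values() for s in seq)
    table = {}
    for state, freq in overall.items():
        # people who were ever observed in this state
        ever = [seq for seq in states.values() if state in seq]
        within = sum(seq.count(state) / len(seq) for seq in ever) / len(ever)
        table[state] = {
            "overall_freq": freq,
            "overall_pct": 100 * freq / n_obs,
            "between_freq": len(ever),
            "between_pct": 100 * len(ever) / n_people,
            "within_pct": 100 * within,
        }
    return table

table = xttab_like(states)
for state, row in table.items():
    print(state, row)
```

As in the Stata output, the between percentages can add up to more than 100 because one person can appear in several states.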
Which statistical model for panel data?
Your research question will guide which models are most suitable but the nature of your data is also important:
What is the effect on income of having more children?
• What is the difference in income between individuals who have a different number of children?
• What is the difference in income before and after the birth of a child?
• What is the difference in income between men and women and before and after the birth of a child?
• How does income change in the time leading up to the birth of a child? (Survival analysis: later in this course!)
Is your research question cross-sectional or longitudinal, or both?
• Cross-sectional: exploit variation between individuals
• Longitudinal: exploit variation “within” individuals over time, which permits causal interpretation of effects; can also consider “between” variation if needed
Longitudinal analysis is concerned with modelling individual heterogeneity
• A very simple concept: people are different!
• In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity
• Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions
• Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using
• With panel data we can do something about unobserved heterogeneity, as we can differentiate between unobserved person-level characteristics that are constant over time and those that vary over time!
OLS with panel data
pid  wave     y    x1
  1     1  2340     0
  1     2  2405     5
  1     3  2730    10
  1     4  3250    15
  1     5  3705    20
  1     6  4030    25
  2     1  1885     5
  2     2  2145    10
  2     3  2275    15
  2     4  2470    20
  2     5  2762    25
  2     6  3120    30
  3     1   780    10
  3     2  1170    15
  3     3  1365    20
  3     4  2405    25
  3     5  2405    30
  3     6  2470    35

OLS, t=1:    y = 2448 - 156*x1
OLS, pooled: y = 1925 + 29*x1
[Figure: “OLS: cross-section”. Income against number of years since leaving school, for pid=1, pid=2 and pid=3.]
[Figure: “OLS: pooled”. Income against number of years since leaving school, for pid=1, pid=2 and pid=3.]
The effect captured in a cross-section may be quite misleading (omitted variable bias)!
By adding more data points from the same units at different points in time we can get better estimates.
But the assumptions of OLS may be violated!
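The two OLS lines quoted for the toy data (t=1 and pooled) can be reproduced with a few lines of plain Python. This is a sketch using the textbook bivariate OLS formulas rather than Stata's regress; the data are the 18 person-wave rows of the example table, and `ols` is a hypothetical helper.

```python
# Toy panel from the example table: (pid, wave, y = income, x1 = years since leaving school)
panel = [
    (1, 1, 2340, 0), (1, 2, 2405, 5), (1, 3, 2730, 10),
    (1, 4, 3250, 15), (1, 5, 3705, 20), (1, 6, 4030, 25),
    (2, 1, 1885, 5), (2, 2, 2145, 10), (2, 3, 2275, 15),
    (2, 4, 2470, 20), (2, 5, 2762, 25), (2, 6, 3120, 30),
    (3, 1, 780, 10), (3, 2, 1170, 15), (3, 3, 1365, 20),
    (3, 4, 2405, 25), (3, 5, 2405, 30), (3, 6, 2470, 35),
]

def ols(xs, ys):
    """Bivariate OLS; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Cross-section: wave 1 only (three observations, one per person)
wave1 = [row for row in panel if row[1] == 1]
a_cs, b_cs = ols([r[3] for r in wave1], [r[2] for r in wave1])

# Pooled: all 18 person-wave observations, ignoring who they belong to
a_pool, b_pool = ols([r[3] for r in panel], [r[2] for r in panel])

print(f"OLS t=1:    intercept {a_cs:.0f}, slope {b_cs:.0f}")
print(f"OLS pooled: intercept {a_pool:.0f}, slope {b_pool:.0f}")
```

The sign of the slope flips between the two fits, which is the point of the example: the cross-section mixes up differences between people with the effect of x1.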
An illustration of how unobserved heterogeneity matters
Considering this is from panel data, two problems become apparent:
• Error terms for persons 1, 2 and 3 differ systematically
• The association between x and y appears to be biased
[Figure: “OLS: pooled”. Income against number of years since leaving school, for pid=1, pid=2 and pid=3.]
[Figure: “OLS: unobs het”. Same axes and cases, with the errors w1 and w3 and the component u1 marked.]
Panel data allows you to:
• Break down the error term (wi) into two components: the unobservable characteristics of the person (ui), and genuine “error” (ei)
• Then model ui and ei
Expanding the OLS model to consider unobserved heterogeneity
$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_K x_{iK} + u_i + \varepsilon_i$$

Analytically, think of splitting the error term into its two components, $u_i$ and $\varepsilon_i$
• $u_i$: individual-specific, fixed over time
• $\varepsilon_i$: varies over time; usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)

… and consider that you have repeated observations over time

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

… and then reduce the complexity of the information available in some way, or add further assumptions. Your options:
• Focus on “between” variation: lose information on “within” variation
• Focus on “within” variation: lose information on “between” variation
• Model both types of variation, making further assumptions
Within and between estimators
$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

• $u_i$: individual-specific, fixed over time
• $\varepsilon_{it}$: varies over time; usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)

Not interested in within variation? Use the means of all observations for each person i:

$$\bar{y}_i = \alpha + \beta \bar{x}_i + u_i + \bar{\varepsilon}_i$$

This is the “between” estimator

Not interested in “between” variation? Why not “remove” it in that case!

$$(y_{it} - \bar{y}_i) = \beta (x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)$$

And this is the “within” estimator, i.e. “fixed effects”

Interested in both? Well, let’s treat xi_bar as an imperfect measure of the person fixed effect, and use between variation where within variation is poorly captured:

$$(y_{it} - \theta \bar{y}_i) = (1 - \theta)\alpha + \beta (x_{it} - \theta \bar{x}_i) + \{(1 - \theta) u_i + (\varepsilon_{it} - \theta \bar{\varepsilon}_i)\}$$

θ measures the weight given to between-group variation, and is derived from the variances of ui and εi
Between estimator
$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$$\bar{y}_i = \alpha + \beta \bar{x}_i + u_i + \bar{\varepsilon}_i$$

• Interpret as how much y changes between different people
• Not much used, except to calculate the θ parameter for random effects (but Stata does this, not you!)
• It’s inefficient compared to random effects: it doesn’t use as much information as is available in the data (it only uses means)
• Assumption required: that ui is uncorrelated with xi. Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x’s (via the betas) as opposed to the correlation?
• Can’t estimate effects of variables whose mean is invariant over individuals, e.g. age in a cohort study, or macro-level variables
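As a rough illustration of “regression on group means”, the Python sketch below applies the between estimator to the toy income data used earlier in this lecture. The resulting slope is computed here and is not a figure reported on the slides; `ols` and `mean` are hypothetical helpers.

```python
# Toy panel from the OLS example: pid -> list of (y = income, x1 = years since leaving school)
panel = {
    1: [(2340, 0), (2405, 5), (2730, 10), (3250, 15), (3705, 20), (4030, 25)],
    2: [(1885, 5), (2145, 10), (2275, 15), (2470, 20), (2762, 25), (3120, 30)],
    3: [(780, 10), (1170, 15), (1365, 20), (2405, 25), (2405, 30), (2470, 35)],
}

def mean(values):
    return sum(values) / len(values)

def ols(xs, ys):
    """Bivariate OLS; returns (intercept, slope)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope

# Step 1: collapse each person to one observation, the person means of y and x
y_bar = {pid: mean([y for y, _ in obs]) for pid, obs in panel.items()}
x_bar = {pid: mean([x for _, x in obs]) for pid, obs in panel.items()}

# Step 2: run OLS on the means only (one data point per person)
pids = sorted(panel)
a_be, b_be = ols([x_bar[p] for p in pids], [y_bar[p] for p in pids])
print(f"between: intercept {a_be:.1f}, slope {b_be:.1f}")
```

With only three people the estimate rests on three points, and in this toy example it comes out negative, because the people with more years since leaving school happen to have lower average incomes.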
Focusing on “within” variation – the fixed effects family
“Fixed effects” estimator (xtreg y x, fe)
Basic idea: for each individual, calculate the mean of x and the mean of y. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - \bar{y}_i)$ and each $x_{it}$ is replaced by $(x_{it} - \bar{x}_i)$
Identical to: Least Squares Dummy Variables regression (areg y x, absorb(pid))
Include a dummy indicator for each individual; all time-constant individual-level differences, including the individual-specific error component $u_i$, will then be captured in the person-specific intercept.
Members of the same family, which you may come across in the literature:
• First differences (regress D.(y x)): for each individual, and each time period’s y and x, calculate the difference between the value in this period and that in the last period. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - y_{it-1})$ and each $x_{it}$ is replaced by $(x_{it} - x_{it-1})$
• “Hybrid models” (regress y x mean_x z): run standard OLS but add the individual mean $\bar{x}_i$ of each time-varying variable as an additional regressor
Fixed effects estimator
Fixed effects: y=65*x1
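$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$$(y_{it} - \bar{y}_i) = \beta (x_{it} - \bar{x}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)$$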
Ignores between-group variation – so it’s an inefficient estimator
However, few assumptions are required for FE to be consistent: ui is allowed to correlate with xi
Disadvantage: can’t estimate the effects of any time-invariant variables
Need to consider change in interpretation of effects
[Figure: “Fixed Effects”. Income against number of years since leaving school, both centred on zero, for pid=1, pid=2 and pid=3.]
pid  wave     y    x1   yi_bar  xi_bar  (y - yi_bar)  (x - xi_bar)
  1     1  2340     0   3076.7    12.5        -736.7         -12.5
  1     2  2405     5   3076.7    12.5        -671.7          -7.5
  1     3  2730    10   3076.7    12.5        -346.7          -2.5
  1     4  3250    15   3076.7    12.5         173.3           2.5
  1     5  3705    20   3076.7    12.5         628.3           7.5
  1     6  4030    25   3076.7    12.5         953.3          12.5
  2     1  1885     5   2442.8    17.5        -557.8         -12.5
  2     2  2145    10   2442.8    17.5        -297.8          -7.5
  2     3  2275    15   2442.8    17.5        -167.8          -2.5
  2     4  2470    20   2442.8    17.5          27.2           2.5
  2     5  2762    25   2442.8    17.5         319.2           7.5
  2     6  3120    30   2442.8    17.5         677.2          12.5
  3     1   780    10   1765.8    22.5        -985.8         -12.5
  3     2  1170    15   1765.8    22.5        -595.8          -7.5
  3     3  1365    20   1765.8    22.5        -400.8          -2.5
  3     4  2405    25   1765.8    22.5         639.2           2.5
  3     5  2405    30   1765.8    22.5         639.2           7.5
  3     6  2470    35   1765.8    22.5         704.2          12.5
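The demeaning steps behind the table and the “Fixed effects: y=65*x1” result can be sketched in plain Python. This is an illustration of the within transformation on the toy data, not Stata's xtreg: it gives the slope only, without standard errors or the degrees-of-freedom correction.

```python
# Toy panel: pid -> list of (y = income, x1 = years since leaving school)
panel = {
    1: [(2340, 0), (2405, 5), (2730, 10), (3250, 15), (3705, 20), (4030, 25)],
    2: [(1885, 5), (2145, 10), (2275, 15), (2470, 20), (2762, 25), (3120, 30)],
    3: [(780, 10), (1170, 15), (1365, 20), (2405, 25), (2405, 30), (2470, 35)],
}

# Step 1: subtract each person's own means from their y and x
demeaned = []
for pid, obs in panel.items():
    y_bar = sum(y for y, _ in obs) / len(obs)
    x_bar = sum(x for _, x in obs) / len(obs)
    demeaned += [(y - y_bar, x - x_bar) for y, x in obs]

# Step 2: OLS on the demeaned data; both variables now have mean zero,
# so the slope is sum(x*y) / sum(x*x) and there is no intercept
beta_fe = sum(yd * xd for yd, xd in demeaned) / sum(xd ** 2 for _, xd in demeaned)
print(f"fixed effects slope: {beta_fe:.2f}")
```

The slope of about 65 uses only each person's change around their own average, which is why it differs so much from the cross-sectional and pooled OLS estimates.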
Want to look at the effect of non-time-varying x? Use $x_{it}$ and $\bar{x}_i$ in OLS

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$$y_{it} = \alpha + \beta_1 x_{it} + \beta_2 \bar{x}_i + \beta_3 z_i + u_i^{residual} + \varepsilon_{it}$$

• The effect of any unobserved characteristic otherwise transported in the effect of $x_{it}$ is shifted to the effect of $\bar{x}_i$: $\beta_1$ approximates the coefficient in the FE model; $\beta_3$ gives you, approximately, the OLS estimate for non-time-varying variables
• $z_i$: non-time-varying individual characteristics, for which you do not need to include group means
• Typically no interest in the effect of $\bar{x}_i$, so no need to worry about its interpretation. Note that … is approximately equal to the effect in the pooled OLS
• Disadvantage: can only control for unobserved heterogeneity associated with the observed time-varying variables $x_{it}$; the residual heterogeneity $u_i^{residual}$ remains in the error term
Hint: create $\bar{x}_i$ yourself

pid  wave     y   x   z  x_bar
  1     1  2340   1   1   1.5
  1     2  2405   2   1   1.5
  1     3  2730   2   1   1.5
  1     4  3250   2   1   1.5
  1     5  3705   1   1   1.5
  1     6  4030   1   1   1.5
  2     1  1885   0   2   0.66
  2     2  2145   1   2   0.66
  2     3  2275   1   2   0.66
  2     4  2470   1   2   0.66
  2     5  2762   1   2   0.66
  2     6  3120   0   2   0.66
  3     1   780   1   2   0.33
  3     2  1170   1   2   0.33
  3     3  1365   0   2   0.33
  3     4  2405   0   2   0.33
  3     5  2405   0   2   0.33
  3     6  2470   0   2   0.33
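Creating the person mean yourself amounts to a grouped average that is then attached to every row of that person. A minimal Python sketch using the x values from the example data above; the variable names are illustrative.

```python
from collections import defaultdict

# (pid, wave, x) rows from the example data above
rows = [
    (1, 1, 1), (1, 2, 2), (1, 3, 2), (1, 4, 2), (1, 5, 1), (1, 6, 1),
    (2, 1, 0), (2, 2, 1), (2, 3, 1), (2, 4, 1), (2, 5, 1), (2, 6, 0),
    (3, 1, 1), (3, 2, 1), (3, 3, 0), (3, 4, 0), (3, 5, 0), (3, 6, 0),
]

# Collect each person's x values, then average them
x_by_pid = defaultdict(list)
for pid, wave, x in rows:
    x_by_pid[pid].append(x)
x_bar = {pid: sum(xs) / len(xs) for pid, xs in x_by_pid.items()}

# Attach the person mean to every person-wave row, ready to use as a regressor
rows_with_mean = [(pid, wave, x, x_bar[pid]) for pid, wave, x in rows]
print({pid: round(m, 2) for pid, m in x_bar.items()})
```

The means are 1.5, 0.67 and 0.33; the example table shows the second one as 0.66.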
Random effects estimator
The “random effects model” here is RE Generalised Least Squares

$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

$$(y_{it} - \theta \bar{y}_i) = (1 - \theta)\alpha + \beta (x_{it} - \theta \bar{x}_i) + \{(1 - \theta) u_i + (\varepsilon_{it} - \theta \bar{\varepsilon}_i)\}$$

• Uses both within- and between-group variation, so makes best use of the data and is efficient
• Starts off with the idea that using xi_bar is not the best we can do to capture within variation: the more imprecise the estimate of the person-level variation (as measured by the person’s xi_bar), the more we should draw on the information from other units (x_bar)
• Assumption required: that ui is uncorrelated with xi. A rather heroic assumption (think of examples); we will see a test for this later
• Note that the within and between effects are constrained to be identical (much more like OLS in this respect, so no causal interpretation!)
• E.g., when you include a location indicator in your model, you are saying that the effect on y of moving to a new town is the same as the effect on y of living in different towns. When you include a female dummy, you are saying that the effect of being female on y is the same as the effect on y of changing gender.
Estimating fixed effects in STATA
. xtreg LIKERT female ue_sick partner age age2 badh, fe

Fixed-effects (within) regression               Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0501                         Obs per group: min =         1
       between = 0.1906                                        avg =       7.3
       overall = 0.1285                                        max =        14

                                                F(5,20882)         =    220.44
corr(u_i, Xb)  = 0.1561                         Prob > F           =    0.0000

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  (dropped)
     ue_sick |   1.951485   .1394164    14.00   0.000     1.678218    2.224752
     partner |   -.298668    .118635    -2.52   0.012    -.5312018   -.0661342
         age |   .1141748   .0214403     5.33   0.000     .0721501    .1561994
        age2 |  -.0011833   .0002209    -5.36   0.000    -.0016163   -.0007503
   badhealth |   1.230831   .0428556    28.72   0.000      1.14683    1.314831
       _cons |   6.252975   .4932977    12.68   0.000     5.286073    7.219877
-------------+----------------------------------------------------------------
     sigma_u |  3.9934565
     sigma_e |  4.0525618
         rho |  .49265449   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(3316, 20882) =     4.56         Prob > F = 0.0000
• sigma_u and sigma_e: “u” and “e” are the two parts of the error term
• age and age2: the age profile peaks at age 48
• “R-square-like” statistic
Talk about xtmixed
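The “peaks at age 48” note can be checked from the two age coefficients in the fixed effects output. The Python sketch below assumes that age2 is age squared, which the slides do not state explicitly; under that assumption the quadratic b_age*age + b_age2*age^2 has its turning point at -b_age / (2*b_age2).

```python
# Coefficients copied from the fixed effects output above
b_age = 0.1141748
b_age2 = -0.0011833   # assumed to be the coefficient on age squared

# Setting the derivative b_age + 2*b_age2*age to zero gives the turning point
peak_age = -b_age / (2 * b_age2)
print(f"LIKERT score peaks at about age {peak_age:.1f}")
```

Because b_age2 is negative, the turning point is a maximum.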
Between regression:
. xtreg LIKERT female ue_sick partner age age2 badh, be

Between regression (regression on group means)  Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0480                         Obs per group: min =         1
       between = 0.2322                                        avg =       7.3
       overall = 0.1482                                        max =        14

                                                F(6,3310)          =    166.80
sd(u_i + avg(e_i.))=  3.833357                  Prob > F           =    0.0000

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.476659   .1350226    10.94   0.000     1.211923    1.741395
     ue_sick |   2.038192    .312191     6.53   0.000     1.426085    2.650299
     partner |  -.0101941   .1777423    -0.06   0.954      -.35869    .3383019
         age |   .0827335   .0219026     3.78   0.000     .0397895    .1256775
        age2 |  -.0009489   .0002263    -4.19   0.000    -.0013927   -.0005052
   badhealth |   2.275832   .0926521    24.56   0.000     2.094171    2.457493
       _cons |   3.953941   .4430909     8.92   0.000     3.085181    4.822701
------------------------------------------------------------------------------
Not much used, but useful to compare coefficients with fixed effects
Coefficient on “partner” was negative and significant in FE model.
In FE, the “partner” coeff really measures the events of gaining or losing a partner
Random effects regression
. xtreg LIKERT female ue_sick partner age age2 badh, re theta

Random-effects GLS regression                   Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0500                         Obs per group: min =         1
       between = 0.2239                                        avg =       7.3
       overall = 0.1471                                        max =        14

Random effects u_i ~ Gaussian                   Wald chi2(6)       =   2013.32
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------- theta --------------------
  min      5%     median        95%      max
0.1986   0.1986   0.5482     0.6629   0.6629

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.493431   .1259931    11.85   0.000     1.246489    1.740373
     ue_sick |   2.045302   .1271039    16.09   0.000     1.796183    2.294422
     partner |  -.1947691   .0973734    -2.00   0.045    -.3856175   -.0039207
         age |   .1058038    .014544     7.27   0.000     .0772981    .1343094
        age2 |  -.0011062   .0001498    -7.39   0.000    -.0013998   -.0008126
   badhealth |   1.433115   .0385506    37.17   0.000     1.357558    1.508673
       _cons |   5.181864   .3137662    16.52   0.000     4.566894    5.796835
-------------+----------------------------------------------------------------
     sigma_u |  3.0248563
     sigma_e |  4.0525618
         rho |  .3577895   (fraction of variance due to u_i)
------------------------------------------------------------------------------
• Option “theta” gives a summary of the weights
• θ tells you how good an approximation xi_bar is of the person-level effect, or how much of the within variation was used to determine the effect size
• θ = 0 gives the OLS estimator; θ = 1 gives the FE estimator
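The theta summary can be related to sigma_u and sigma_e in the output. The Python sketch below uses the standard random effects GLS weight, theta = 1 - sqrt(sigma_e^2 / (T*sigma_u^2 + sigma_e^2)), where T is the number of observations for a person. This formula is not given on the slides, so treat it as an assumption; it does reproduce the reported minimum and maximum for T = 1 and T = 14.

```python
import math

# Estimates copied from the random effects output above
sigma_u = 3.0248563
sigma_e = 4.0525618

def theta(n_periods, sigma_u, sigma_e):
    """RE GLS weight for a person observed n_periods times (formula assumed, see lead-in)."""
    return 1 - math.sqrt(sigma_e ** 2 / (n_periods * sigma_u ** 2 + sigma_e ** 2))

# rho: fraction of the error variance that is due to u_i
rho = sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)

for n_periods in (1, 7, 14):
    print(n_periods, round(theta(n_periods, sigma_u, sigma_e), 4))
print("rho", round(rho, 4))
```

People observed only once get a weight close to the OLS end (about 0.20), while people observed 14 times get a weight closer to the fixed effects end (about 0.66). The reported median of 0.5482 is consistent with a person observed 7 times.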
And what about OLS?
• OLS simply treats within- and between-group variation as the same
• Pools data across waves

. reg LIKERT female ue_sick partner age age2 badh

      Source |       SS       df       MS              Number of obs =   24204
-------------+------------------------------           F(  6, 24197) =  706.54
       Model |  103583.505     6  17263.9175           Prob > F      =  0.0000
    Residual |  591239.694 24197  24.4344214           R-squared     =  0.1491
-------------+------------------------------           Adj R-squared =  0.1489
       Total |  694823.199 24203  28.7081436           Root MSE      =  4.9431

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.409466   .0640651    22.00   0.000     1.283895    1.535038
     ue_sick |   2.031815   .1240757    16.38   0.000     1.788619    2.275011
     partner |  -.0751296   .0769271    -0.98   0.329    -.2259116    .0756524
         age |   .0983746   .0103316     9.52   0.000      .078124    .1186252
        age2 |  -.0010613   .0001049   -10.12   0.000     -.001267   -.0008557
   badhealth |   1.841796   .0357165    51.57   0.000     1.771789    1.911802
       _cons |   4.450393   .2212733    20.11   0.000     4.016684    4.884102
------------------------------------------------------------------------------
Test whether pooling data is valid
$$y_{it} = \alpha + \beta x_{it} + u_i + \varepsilon_{it}$$

If the ui do not vary between individuals, they can be treated as part of α and OLS is fine.
Breusch-Pagan Lagrange multiplier test
• H0: variance of ui = 0
• H1: variance of ui not equal to zero
If H0 is not rejected, you can pool the data and use OLS
This is a post-estimation test, run after random effects
. quietly xtreg LIKERT female ue_sick partner age age2 badh, re

. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        LIKERT[pid,t] = Xb + u[pid] + e[pid,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                  LIKERT |   28.70814       5.357998
                       e |   16.42326       4.052562
                       u |   9.149756       3.024856

        Test:   Var(u) = 0
                              chi2(1) = 10816.48
                          Prob > chi2 =     0.0000
Comparing models
• Compare coefficients between models: reasonably similar, with differences in the “partner” and “badhealth” coefficients
• R-squareds are similar
• Within and between estimators maximise the within and between R-squared respectively

                 FE          RE          BE          OLS
female                       1.49 ***    1.48 ***    1.41 ***
ue_sick          1.95 ***    2.05 ***    2.04 ***    2.03 ***
partner         -0.30 **    -0.19 **    -0.01       -0.08
age              0.11 ***    0.11 ***    0.08 ***    0.10 ***
age2             0.00 ***    0.00 ***    0.00 ***    0.00 ***
badhealth        1.23 ***    1.43 ***    2.28 ***    1.84 ***
_cons            6.25 ***    5.18 ***    3.96 ***    4.45 ***

R-2 within       0.050       0.050       0.048
R-2 between      0.191       0.224       0.232
R-2 overall      0.129       0.147       0.148       0.149