Applied Survey Data Analysis: Design-based Inference from Complex Samples (A brief introduction!)
Presenter: Steven G. Heeringa, University of Michigan, Ann Arbor, Michigan ([email protected])
5th School on Survey Sampling and Survey Methodology, Cuiaba, Brazil, October 17, 2017
- Prediction approach (see Valliant, Dorfman, Royall, 2000, Finite
Population Sampling and Inference: A Prediction Approach, Wiley)
- Hierarchical Linear Models (see Pfeffermann et al., 1998, “Weighting
for unequal selection probabilities in multilevel models,” JRSS-B;
Rabe-Hesketh and Skrondal, 2006, “Multilevel Modelling of
Complex Survey Data,” JRSS-A)
- Bayesian Models (Little, 2003, Ch. 4 in Analysis of Survey Data)
Statistical Inference (Model-based)
1) Incorporates a probability model, f(y|x, θ), for the variables.
   a) ex: Y(1),…,Y(n) are independent observations from a normal distribution with mean μ and variance σ².
   b) ex: coin flip – Bernoulli trial: P(Head) = .50, P(Tail) = .50
2) Goal is to estimate the parameters, θ, of f(y|x,θ). Assumes no finite limit to the process that generates the sample Ys.
3) Estimators, tests of hypothesis, confidence intervals derived on the basis of the probability model. Extremely powerful if the probability model is correct.
   ex: F-ratio test, e.g. F_{1,22} = MS_{GM} / MS_{error}
“Design-Based” Inference from Sample Surveys
• Goal is inference about θ, the value of the statistic in a finite population;
• Inference is based on the expected sampling distribution of sample estimates, not on a probability model, f(y|x,θ), for the Ys;
• Confidence interval approach to inference about θ:

  θ̂ ± t_{df,1−α/2} · se(θ̂)
  (1)      (2)          (3)

  Example (ȳ ± t_{df,.975} · se(ȳ)):
  30.0 ± 1.96 × (1.30) = (27.45, 32.55)

• t_{df,1−α/2} is the value of the Student t distribution with df degrees of freedom cutting off upper-tail probability α/2. Assumes normality of the sampling distribution of θ̂.
• se(θ̂) is the estimated standard error (square root of the variance) of the sample estimate θ̂. Special computational formulas and/or methods are needed.
Sampling Distributions: SRS and Cluster Sampling

[Figure: histograms ("% of Sample", 0 to 0.6, vs. mean of y, plotted from −5.4 to 52.2) of the simulated sampling distribution of the mean under SRS and under cluster sampling with cluster sizes b = 10 and b = 50, for n = 500, 1,000, and 5,000. Summary statistics:]

   n       SRS                    b = 10                 b = 50
           Mean      SD           Mean      SD           Mean      SD
   500     25.04614  2.811056     24.95748  4.607751     25.04225  8.485906
   1,000   24.9862   2.019553     25.02338  3.219163     24.97321  5.898481
   5,000   25.0038   0.909071     24.99775  1.438167     25.02288  2.620669
Three Elements of Design-Based Inference Using Confidence Intervals (CIs)

  θ̂ ± t_{df,1−α/2} · se(θ̂)
  (1)      (2)          (3)

where:
  θ̂ = survey weighted estimate of θ;
  t_{df,1−α/2} = critical value from Student t with df degrees of freedom;
  se(θ̂) = robust, design-corrected estimate of SE(θ̂).
Weighting and Estimation
Survey Weights
Weighting may simultaneously incorporate all three components: unequal
probabilities of selection, nonresponse, and post-stratification
1) Weight for unequal probabilities of selection: wsel;
2) Weight for sample nonresponse: wnr;
3) Poststratification weight for population noncoverage and sampling
variance reduction: wps.
Then compute the overall weight as: w = wsel x wnr x wps
See: Valliant, R., The Effect of Multiple Weighting Steps on Variance
Estimation, Journal of Official Statistics, Vol. 20, No. 1, 2004, pp. 1–
18.
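As a minimal illustration (hypothetical weight values, pure Python), the overall weight is simply the case-by-case product of the three components:

```python
# Overall survey weight as the product of the three components
# (hypothetical factors for three sampled cases, illustration only).
w_sel = [2.0, 4.0, 4.0]     # inverse of selection probability
w_nr  = [1.25, 1.10, 1.25]  # nonresponse adjustment
w_ps  = [0.95, 1.05, 1.00]  # post-stratification factor

# w = w_sel x w_nr x w_ps, computed case by case
w = [s * r * p for s, r, p in zip(w_sel, w_nr, w_ps)]
```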
Examples of Survey Weighted Estimates

  ȳ_w = Σ_{i=1}^{n} W_i y_i / Σ_{i=1}^{n} W_i   estimates Ȳ;

  s²_w = Σ_{i=1}^{n} W_i (y_i − ȳ_w)² / (Σ_{i=1}^{n} W_i − 1)   estimates S²;

  b_{1,w} = Σ_{i=1}^{n} W_i (x_i − x̄_w)(y_i − ȳ_w) / Σ_{i=1}^{n} W_i (x_i − x̄_w)²   estimates the simple linear regression coefficient, B₁.
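These estimators are straightforward to compute directly; a minimal sketch in pure Python with hypothetical data (here y = x + 1 for every case, so the weighted slope comes out exactly 1):

```python
# Survey weighted mean, element variance, and simple regression slope
# (hypothetical data; the weights W_i are illustrative only).
ys = [2.0, 3.0, 5.0, 7.0]
xs = [1.0, 2.0, 4.0, 6.0]
ws = [1.0, 2.0, 2.0, 1.0]

sw = sum(ws)
ybar_w = sum(w * y for w, y in zip(ws, ys)) / sw                       # estimates Ybar
xbar_w = sum(w * x for w, x in zip(ws, xs)) / sw
s2_w = sum(w * (y - ybar_w) ** 2 for w, y in zip(ws, ys)) / (sw - 1)   # estimates S^2
b1_w = (sum(w * (x - xbar_w) * (y - ybar_w) for w, x, y in zip(ws, xs, ys))
        / sum(w * (x - xbar_w) ** 2 for w, x in zip(ws, xs)))          # estimates B1
```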
Survey Weighted Estimation: Pseudo-Maximum Likelihood for Logistic Regression

Pseudo ln(Likelihood):

  PL(B) = Σ_{h=1}^{H} Σ_{α=1}^{a_h} Σ_{i=1}^{n_{hα}} w_{hαi} [ y_{hαi} ln(π(x_{hαi})) + (1 − y_{hαi}) ln(1 − π(x_{hαi})) ]

where:

  π(x_{hαi}) = e^{x_{hαi}B} / (1 + e^{x_{hαi}B}),   π̂(x_{hαi}) = e^{x_{hαi}b} / (1 + e^{x_{hαi}b})

and b = the vector of coefficient estimates that solves the estimating equations:

  U(B) = ∂ ln PL(B)/∂B = Σ_h Σ_α Σ_i w_{hαi} x′_{hαi} y_{hαi} − Σ_h Σ_α Σ_i w_{hαi} x′_{hαi} π(x_{hαi}) = 0, evaluated at B = b.
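A schematic of how the weighted estimating equations can be solved, using Newton-Raphson on a tiny hypothetical data set (one predictor plus intercept). This is an illustrative sketch, not the production algorithm of any particular package:

```python
# Newton-Raphson sketch of pseudo-maximum likelihood for a weighted
# logistic regression (hypothetical data; weights ws are illustrative).
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0,   0,   1,   0,   1,   1]
ws = [1.0, 2.0, 1.0, 1.0, 2.0, 1.0]   # survey weights w_i

b = [0.0, 0.0]                        # [intercept, slope]
for _ in range(50):
    # Score U(b) = sum_i w_i x_i (y_i - pi_i) and weighted information matrix J.
    U = [0.0, 0.0]
    J = [[0.0, 0.0], [0.0, 0.0]]
    for x, y, w in zip(xs, ys, ws):
        xi = [1.0, x]
        pi = 1.0 / (1.0 + math.exp(-(b[0] + b[1] * x)))
        for j in range(2):
            U[j] += w * xi[j] * (y - pi)
            for k in range(2):
                J[j][k] += w * pi * (1 - pi) * xi[j] * xi[k]
    # Solve the 2x2 system J * step = U and update b.
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    step = [( J[1][1] * U[0] - J[0][1] * U[1]) / det,
            (-J[1][0] * U[0] + J[0][0] * U[1]) / det]
    b = [b[j] + step[j] for j in range(2)]
    if max(abs(s) for s in step) < 1e-10:
        break
```

At convergence the weighted score equations are satisfied to numerical precision, which is exactly the condition U(B) = 0 stated above.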
Degrees of Freedom
Three Elements of Design-Based Inference (CIs)

  θ̂ ± t_{df,1−α/2} · se(θ̂)
  (1)      (2)          (3)

where:
  θ̂ = survey weighted estimate of θ;
  t_{df,1−α/2} = critical value from Student t with df degrees of freedom;
  se(θ̂) = robust, design-corrected estimate of SE(θ̂).

Note: the df here are for the sampling distribution of the estimates!
Degrees of Freedom in Variance Estimation for Complex Sample Data

Simple Rule:

  degrees of freedom = # of clusters − # of strata = Σ_{h=1}^{H} a_h − H

ex: Two-clusters-per-stratum design: d.f. = 2·H − H = H

See: Valliant, R. and Rust, K.F., Degrees of Freedom Approximations and Rules-of-Thumb, Journal of Official Statistics, Vol. 26, No. 4, 2010, pp. 585–602. (They propose a simple estimator of degrees of freedom that leads to improved confidence interval coverage relative to the simple rule above, which is currently used by most software packages.)
Degrees of Freedom in Confidence Interval Construction

  CI_{.95} = θ̂ ± t_{.975,df} · se(θ̂)

  df         t_{.975,df}
   1         12.706
   5          2.5706
  10          2.2281
  20          2.0860
  30          2.0423
  40          2.0211
   ∞          1.9600
  Z_{.975}    1.9600
Variance of Sample Estimates
Three Elements of Design-Based Inference (CIs)

  θ̂ ± t_{df,1−α/2} · se(θ̂)
  (1)      (2)          (3)

where:
  θ̂ = survey weighted estimate of θ;
  t_{df,1−α/2} = critical value from Student t with df degrees of freedom;
  se(θ̂) = robust, design-corrected estimate of SE(θ̂).
Complex Sample Variance Estimation
• Direct (closed form) results exist for estimating variances of descriptive estimates computed from Simple Random, Stratified Random or Equal-size Cluster Samples (which are linear statistics).
• Direct (closed form) results exist for estimating variances of linear statistics, such as totals:

  Example: Population Total

  Y = Σ_{i=1}^{N} y_i ;   Ŷ = Σ_{i=1}^{n} w_i y_i = Σ_{i=1}^{n} y_i*, where y_i* = w_i y_i.

• More complicated if the estimate is a nonlinear function of sample quantities (weights and unequal size clusters make most common statistics nonlinear functions of random variables).
• In general, weighted survey estimates are nonlinear functions of linear statistics:

  ȳ_w = u/v = Σ_h Σ_α Σ_i w_{hαi} y_{hαi} / Σ_h Σ_α Σ_i w_{hαi}

  where y_{hαi} = measurement on unit i in cluster α of stratum h, and w_{hαi} = the corresponding weight.

• Certain other statistics such as regression coefficients and correlation coefficients are also nonlinear functions of linear statistics:

  b̂ = u/v = Σ_h Σ_α Σ_i w_{hαi} y_{hαi} x_{hαi} / Σ_h Σ_α Σ_i w_{hαi} x²_{hαi}
Complex Sample Variance Estimation Methods
• Taylor Series approximation or linearization technique
• Approximate nonlinear statistics as a linear function of estimates of totals, derive corresponding variance estimator
• Leads to a specific form of the variance estimator for each statistic (default variance estimation method in SAS and Stata)
• Replication or Resampling Methods
• Jackknife Repeated Replication (JRR)
• Balanced Repeated Replication (BRR)
• Bootstrap Methods
Design Variables for Variance Estimation
• General design variable inputs:
– Stratum Code (e.g. stratum_var, h = 1,…,H)
– Cluster Code for PSUs or Elements (min. 2 / stratum)
– Final Survey Weight (wi for each case, i = 1,…,n)
• SAS V9.2+ (Design variables included in command code)
• Stata (Global declaration of design variables using svyset)
svyset cluster_var [pweight = wgt_var],
strata(stratum_var)
PROC SURVEYMEANS DATA = example;
  STRATUM stratum_var;
  CLUSTER cluster_var;
  WEIGHT wgt_var;
  VAR varname;
RUN;
Alternative for Replication Variance Estimates
• For replicated variance estimation methods, an alternative is to provide a vector of replicate weights (1 weight per replicate)
  – Eliminates the need to release sample design stratum and cluster groups, enhancing disclosure protection (the replication method used must still be documented)
  – Enables data producers to perform nonresponse adjustment and post-stratification for each replicate
• SAS V9.2+: PROC SURVEYMEANS DATA= VARMETHOD=JACKKNIFE;
• Primary stage units (PSUs) in multi-stage sample designs are
considered to be selected with replacement from the primary
stage strata. Any finite population correction for the primary
stage sample is ignored. The resulting estimates of sampling
variance will be slight overestimates (see Kish, 1965, 5.3B).
• Multi-stage sampling within selected PSUs results in a single
ultimate cluster of observations for that PSU.
• Assume: single-stage selection of ultimate clusters, with
replacement, where all elements in a given ultimate cluster are
sampled. Greatly simplifies variance estimation formulae, and
variance in ultimate clusters from PSUs within a stratum is the
dominant source of variance in sample estimates.
Taylor Series Linearization: Function of 2 Variables

  f(x,z) ≈ f(x_o,z_o) + (∂f/∂x)|_{x_o,z_o} (x − x_o) + (∂f/∂z)|_{x_o,z_o} (z − z_o)

  var(f(x,z)) ≈ A² var(x) + B² var(z) + 2AB cov(x,z)

  where A = (∂f/∂x)|_{x_o,z_o} and B = (∂f/∂z)|_{x_o,z_o}.
Taylor Series Linearization: Function of 2 or More Variables

  ȳ_w = f(x,z) = x/z = R,  where x = Σ_h Σ_α Σ_i w_{hαi} y_{hαi} and z = Σ_h Σ_α Σ_i w_{hαi}

  A = (∂f/∂x)|_{x_o,z_o} = 1/z_o ;   B = (∂f/∂z)|_{x_o,z_o} = −x_o/z_o²

  Var(x/z) ≈ (1/z_o²) [ Var(x) + R² Var(z) − 2R Cov(x,z) ]

  where R = x_o/z_o, and x_o, z_o are the values of x, z computed from the sample.
Simple Data Example (4 Strata, 2 PSUs/Stratum)
Stratum  PSU (Cluster)  Case  yi   wi
1        1              1     .58  1
1        1              2     .48  2
1        2              1     .42  2
1        2              2     .57  2
2        1              1     .39  1
2        1              2     .46  2
2        2              1     .50  2
2        2              2     .21  1
3        1              1     .39  1
3        1              2     .47  2
3        2              1     .44  1
3        2              2     .43  1
4        1              1     .64  1
4        1              2     .55  1
4        2              1     .47  2
4        2              2     .50  2
Taylor Series Approach: Variance Estimate for Mean of y in Data Example

  ȳ_{w,TSL} = x/z = Σ_h Σ_α Σ_i w_{hαi} y_{hαi} / Σ_h Σ_α Σ_i w_{hαi} = 11.37/24 = 0.47375

  var(x) = 0.9777;  var(z) = 6.0000;  cov(x,z) = 2.4000

  var(ȳ_{w,TSL}) = (1/z_o²) [ var(x) + ȳ² var(z) − 2 ȳ cov(x,z) ]
                 = (1/24²) [ 0.9777 + 0.4737² × 6.0000 − 2 × 0.4737 × 2.4000 ]
                 = 0.00008731

  se(ȳ_{w,TSL}) = √0.00008731 = 0.009343
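The calculation above can be reproduced directly from the 16-case data example (pure Python); the slide's se of 0.009343 is matched up to rounding of the intermediate ratio 0.4737:

```python
# Taylor series linearization (TSL) variance of the weighted mean for the
# 4-stratum, 2-PSU-per-stratum data example above.
import math

# (stratum, psu, y, w) for the 16 sample cases
data = [(1,1,.58,1),(1,1,.48,2),(1,2,.42,2),(1,2,.57,2),
        (2,1,.39,1),(2,1,.46,2),(2,2,.50,2),(2,2,.21,1),
        (3,1,.39,1),(3,1,.47,2),(3,2,.44,1),(3,2,.43,1),
        (4,1,.64,1),(4,1,.55,1),(4,2,.47,2),(4,2,.50,2)]

x = sum(w * y for _, _, y, w in data)   # weighted total of y: 11.37
z = sum(w for _, _, _, w in data)       # sum of weights: 24
ybar = x / z                            # 0.47375

# Per-PSU totals of x = w*y and z = w; with 2 PSUs per stratum the
# with-replacement variance of a total is sum_h (t_h1 - t_h2)^2.
def psu_totals(val):
    t = {}
    for h, a, y, w in data:
        t[(h, a)] = t.get((h, a), 0.0) + val(y, w)
    return t

tx = psu_totals(lambda y, w: w * y)
tz = psu_totals(lambda y, w: w)
var_x = sum((tx[(h, 1)] - tx[(h, 2)]) ** 2 for h in range(1, 5))    # 0.9777
var_z = sum((tz[(h, 1)] - tz[(h, 2)]) ** 2 for h in range(1, 5))    # 6.0
cov_xz = sum((tx[(h, 1)] - tx[(h, 2)]) * (tz[(h, 1)] - tz[(h, 2)])
             for h in range(1, 5))                                  # 2.4

var_ybar = (var_x + ybar ** 2 * var_z - 2 * ybar * cov_xz) / z ** 2
se_ybar = math.sqrt(var_ybar)           # approximately 0.00935
```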
Alternatives to Taylor Series Linearization
• The linearization technique is useful if the estimate can be expressed as a function of sample totals; readily available in SAS and Stata
• Linearization requires analytic manipulations, computation of derivatives
• Not directly suitable for percentiles such as the median or functions of percentiles (non-smooth functions)
• Replication techniques are useful for almost any type of estimate
• Empirical comparisons: Kish and Frankel (1974), Kovar, Rao and Wu (1988), Korn and Graubard (1999)
Common Steps in Replicated Methods
1. Create replicates using sampled cases (unique to each method). Software uses sampling error codes provided in the data set.
   – JRR replicate sample: leave out one PSU, and re-weight cases in the “deletion stratum”.
   – BRR replicate sample: given two PSUs per stratum, leave out one PSU from each stratum (based on a “balanced” Hadamard matrix to eliminate covariances when estimating variances), and re-weight cases (see ASDA, Chapter 3).
2. Create a revised weight for each replicate sample.
3. Compute weighted estimates of the population statistic of interest for each replicate using the replicate weight.
4. Apply the replicated variance estimation formula to derive standard errors.
5. Construct confidence intervals (or hypothesis tests) based on the estimated statistics, standard errors, and correct degrees of freedom (generally using the a − H approximation).
• JRR, BRR, and bootstrap programs adjust the values of the
sample selection weights to create “replicate weights” using the
above procedures. Default - no adjustment to original survey
weights.
• If nonresponse adjustment and post-stratification adjustments
are included in the final survey weight, survey statisticians
advocate that these be recomputed for each sample replicate.
• WesVar PC (Westat, Inc.) assists in developing replicate weight
adjustments that reflect nonresponse and post-stratification.
• Replicate weights are added to data set:
w=[w(1), …,w(K)]
JRR: Constructing Replicates, Replicate Weights
• There are several approaches to constructing the JRR
replicates. The Delete One method that is often the default
in survey analysis software is illustrated here.
• For the Delete One JRR variance estimator, each stratum
will contribute a(h) replicates, where for each replicate one
of the clusters in a single stratum is deleted.
• In our data example, H = 4 and a(h) = 2 for h = 1,…,4, so the total number of required JRR replicates for the Delete One JRR is:

  Σ_{h=1}^{4} a(h) = 8
JRR: Constructing Replicates, Replicate Weights
• Suppose there are H strata with a(h) clusters. H=4, a(h)=2 in example data set.
• A JRR replicate is constructed by deleting one PSU from one stratum. The first replicate leaves out the 1st PSU in Stratum 1.
• The replicate weight for this first replicate multiplies the weights for remaining cases in the “deletion stratum” by a factor of a(h)/[a(h)-1]. This equals 2 in our example. Replicate 1 weight values remain unchanged for cases in all other strata.
JRR Replicate 1: Data Example
Stratum  PSU (Cluster)  Case  yi   wi,rep
1        1              1     .    .
1        1              2     .    .
1        2              1     .42  2x2
1        2              2     .57  2x2
2        1              1     .39  1
2        1              2     .46  2
2        2              1     .50  2
2        2              2     .21  1
3        1              1     .39  1
3        1              2     .47  2
3        2              1     .44  1
3        2              2     .43  1
4        1              1     .64  1
4        1              2     .55  1
4        2              1     .47  2
4        2              2     .50  2
JRR Replicate 2 (of 8): Data Example

Stratum  PSU (Cluster)  Case  yi   wi,rep
1        1              1     .58  1x2
1        1              2     .48  2x2
1        2              1     .    .
1        2              2     .    .
2        1              1     .39  1
2        1              2     .46  2
2        2              1     .50  2
2        2              2     .21  1
3        1              1     .39  1
3        1              2     .47  2
3        2              1     .44  1
3        2              2     .43  1
4        1              1     .64  1
4        1              2     .55  1
4        2              1     .47  2
4        2              2     .50  2
JRR Variance Estimation: Delete 1
• If stratum and cluster codes are available in the data set:

  var_JRR(q̂) = Σ_{h=1}^{H} [(a_h − 1)/a_h] Σ_{α=1}^{a_h} (q̂_{(hα)} − q̂)²

  where q̂_{(hα)} = weighted estimate of Q for the replicate where cluster α in stratum h has been deleted;

  df = Σ_{h=1}^{H} a_h − H = a − H

• If only (a) replicate weights are available in the data set:

  var_JRR(q̂) = [(a − 1)/a] Σ_{k=1}^{a} (q̂_{(k)} − q̂)²

  df = a − 1
JRR: Estimating the Sampling Variance

  var_JRR,DelOne(q̂) = 0.5 Σ_{k=1}^{8} (q̂_{(k)} − q̂)²
                    = 0.5 Σ_{k=1}^{8} (ȳ_{(k)} − ȳ)²
                    = .0000801449

  se(ȳ) = √.0000801449 = .008952

  CI(ȳ) = 0.47375 ± t_{.975,4} × se(ȳ)
        = 0.47375 ± 2.7764 × (.008952) = (0.44890, 0.49860)
51
Half Sample Replicates
• Assume a paired selection design (2 PSUs per stratum
design)
• A half sample is defined by choosing one PSU from each
stratum.
• A complement of a half sample is made up of all those
PSUs not in the half sample. A complement is also a half
sample.
• There are 2H possible half samples and their complements.
We only need H half samples for variance estimation.
(Why?)
52
Half-Sample Replication

Consider the following choice of 4 half samples:

               Stratum
Half sample    1  2  3  4
     1         +  +  +  -
     2         +  -  -  -
     3         -  -  +  -
     4         -  +  -  -

+ : first element (PSU) is selected
- : second element (PSU) is selected
Balanced Repeated Replication (BRR)
• Square arrays of + and - that define balanced half samples are called Hadamard matrices. Plackett and Burman (Biometrika, 1946) tabulated these matrices for H = 4, 8, 12, 16, …, 200 (all multiples of 4).
• Computer algorithms are available for creating these matrices (Stata, WesVar PC).
• What if the number of strata is not a multiple of 4? In this case only partial balance can be achieved.
• For H = 3, drop one column from the matrix for H = 4. For H = 5, choose the matrix for H = 8 and drop 3 strata (columns).
BRR: Replicates, Replicate Weights
• Suppose there are H strata with a(h) clusters. H=4, a(h)=2 in example data set
• A BRR replicate is constructed by deleting one PSU from each stratum according to the pattern specified in the Hadamard matrix. Example: the first half-sample replicate leaves out the 2nd PSU in Strata 1,2,3 and the 1st PSU in Stratum 4.
• The replicate weight for this first replicate multiplies the weights for remaining cases in the half-sample by a factor of 2.
BRR Replicate 1: Data Example
Stratum  PSU (Cluster)  Case  yi   wi,rep
1        1              1     .58  1x2
1        1              2     .48  2x2
1        2              1     .    .
1        2              2     .    .
2        1              1     .39  1x2
2        1              2     .46  2x2
2        2              1     .    .
2        2              2     .    .
3        1              1     .39  1x2
3        1              2     .47  2x2
3        2              1     .    .
3        2              2     .    .
4        1              1     .    .
4        1              2     .    .
4        2              1     .47  2x2
4        2              2     .50  2x2
BRR: Constructing Replicates, Replicate Weights (2)
• Example: the second replicate leaves out the 2nd
PSU in Stratum 1 and the 1st PSU in Strata 2,3,4.
• Again, the replicate weight for this second
replicate multiplies the weights for remaining
cases in the half-sample by a factor of 2.
• Four half-sample replicates are created based on
the deletion pattern in the Hadamard matrix
BRR Replicate 2: Data Example
Stratum  PSU (Cluster)  Case  yi   wi,2,brr
1        1              1     .58  1x2
1        1              2     .48  2x2
1        2              1     .    .
1        2              2     .    .
2        1              1     .    .
2        1              2     .    .
2        2              1     .50  2x2
2        2              2     .21  1x2
3        1              1     .    .
3        1              2     .    .
3        2              1     .44  1x2
3        2              2     .43  1x2
4        1              1     .    .
4        1              2     .    .
4        2              1     .47  2x2
4        2              2     .50  2x2
BRR: Constructing Estimates

Four replicates are created, and the weighted estimate

  q̂_rep = Σ_{i∈rep} y_i w_{i,rep} / Σ_{i∈rep} w_{i,rep}

is computed for each of the r = 1,…,4 replicates:

  q̂_1 = 0.4708;  q̂_2 = 0.4633;  q̂_3 = 0.4614;  q̂_4 = 0.4692.

The full sample estimate is also computed:

  q̂ = Σ_{i=1}^{n} y_i w_i / Σ_{i=1}^{n} w_i = 0.4737
BRR: Estimating the Sampling Variance

  var_BRR(q̂) = (1/c) Σ_{r=1}^{c} (q̂_r − q̂)²   [Formula v2 for half samples, c = 4 replicates]
             = (1/4) Σ_{r=1}^{4} (ȳ_r − ȳ)²
             = .00007248

  se(ȳ) = √.00007248 = .008514

  CI(ȳ) = 0.47375 ± t_{.975,4} × se(ȳ)
        = 0.47375 ± 2.7764 × (.008514) = (0.45011, 0.49739)

Recall that JRR resulted in a 95% CI of (0.44890, 0.49860); the two methods yield extremely similar inferences.
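A sketch reproducing the BRR computation from the four half-samples defined by the Hadamard pattern on the previous slides (pure Python); the replicate estimates and variance match the slide's values up to rounding:

```python
# BRR variance of the weighted mean for the 4-stratum data example.
import math

# (stratum, psu, y, w) for the 16 sample cases
data = [(1,1,.58,1),(1,1,.48,2),(1,2,.42,2),(1,2,.57,2),
        (2,1,.39,1),(2,1,.46,2),(2,2,.50,2),(2,2,.21,1),
        (3,1,.39,1),(3,1,.47,2),(3,2,.44,1),(3,2,.43,1),
        (4,1,.64,1),(4,1,.55,1),(4,2,.47,2),(4,2,.50,2)]

# One string per half-sample; position h-1 is stratum h:
# '+' keeps PSU 1, '-' keeps PSU 2 (the slide's Hadamard pattern).
halfsamples = ["+++-", "+---", "--+-", "-+--"]

def wmean(rows):
    return sum(w * y for _, _, y, w in rows) / sum(w for _, _, _, w in rows)

qhat = wmean(data)                                  # full sample: 0.47375
reps = []
for hs in halfsamples:
    kept = [(h, a, y, 2 * w)                        # replicate factor of 2
            for h, a, y, w in data
            if a == (1 if hs[h - 1] == "+" else 2)]
    reps.append(wmean(kept))

var_brr = sum((q - qhat) ** 2 for q in reps) / len(reps)
se_brr = math.sqrt(var_brr)                         # approximately 0.0085
```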
Rao-Wu Rescaling Bootstrap: Replicate Formation
• b = 1,…,B bootstrap replicates are formed by sampling with replacement (SWR) m_h PSUs from each of the h = 1,…,H primary stage strata.
• Recommendation (Rust and Rao, 1996) is to set m_h = a_h − 1 (one less than the number of sample PSUs in stratum h).
• For m_h = a_h − 1, the bootstrap weight for the selected replicate is:

  w^{(b)}_{hαi} = w_{hαi} × [a_h/(a_h − 1)] × r^{(b)}_{hα}

  where r^{(b)}_{hα} = the count of times PSU hα is selected in replicate b.
Rao-Wu Rescaling Bootstrap: Variance Estimation
• The variance estimate for the b = 1,…,B bootstrap estimates is the Monte Carlo approximation:

  var_Boot(q̂) = (1/B) Σ_{b=1}^{B} (q̂^{(b)} − q̂)²

• For large numbers of bootstrap replicates, i.e., B >> 100, a histogram of the bootstrap estimates simulates the sampling distribution of the estimator
  – May be used to examine the asymptotic normality assumption
  – May be used to derive asymmetric CIs for population values
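A minimal sketch of the Rao-Wu rescaling bootstrap applied to the same data example (pure Python): with a_h = 2 PSUs per stratum and m_h = a_h − 1 = 1 draw, the rescaling factor a_h/(a_h − 1) = 2, so each replicate keeps one randomly drawn PSU per stratum with its weights doubled. The seed is arbitrary; the variance is a Monte Carlo approximation and will vary slightly with B and the seed.

```python
# Rao-Wu rescaling bootstrap (m_h = a_h - 1) for the data example.
import math
import random

data = [(1,1,.58,1),(1,1,.48,2),(1,2,.42,2),(1,2,.57,2),
        (2,1,.39,1),(2,1,.46,2),(2,2,.50,2),(2,2,.21,1),
        (3,1,.39,1),(3,1,.47,2),(3,2,.44,1),(3,2,.43,1),
        (4,1,.64,1),(4,1,.55,1),(4,2,.47,2),(4,2,.50,2)]

def wmean(rows):
    return sum(w * y for _, _, y, w in rows) / sum(w for _, _, _, w in rows)

random.seed(20171017)              # arbitrary seed for reproducibility
qhat = wmean(data)                 # 0.47375
B = 500
boot = []
for _ in range(B):
    rows = []
    for h in range(1, 5):
        drawn = random.choice((1, 2))            # SWR draw of m_h = 1 PSU
        rows += [(s, a, y, 2 * w)                # factor a_h/(a_h-1) = 2
                 for s, a, y, w in data
                 if s == h and a == drawn]       # r = 1 for the drawn PSU, 0 otherwise
    boot.append(wmean(rows))

var_boot = sum((q - qhat) ** 2 for q in boot) / B
se_boot = math.sqrt(var_boot)
```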
Empirical Comparison of Methods
• Data Set: Health and Retirement Survey: HRS Wave 1
(1992)
• Software/Variance estimation methods
– SRS
– WesVar BRR and JRR with Replicate Weighting for
Nonresponse and Poststratification
– SAS TSL, BRR, JRR using only full sample weight
and program creation of replicates
– R System: Rao, Wu (1988, 1992) Bootstrap with full
sample weight (NR and PS adjustments not redone
for each Bootstrap replicate)
Comparison of Variance Estimation Methods
HRS (1992) Means, Full Adult Sample (n = 9759)

Variable        | n    | Mean (Wgt) | SRS (SE) | W-BRR (SE) | W-JKK (SE) | S-TSL (SE) | S-BRR (SE) | S-JKK (SE) | R-BS (SE)
Years of School | 9759 | 12.31      | 0.031    | 0.078      | 0.077      | 0.077      | 0.077      | 0.077      | 0.076
“Conditional” Analysis
• Cochran (1977), West et al. (2008) – “Conditional” analysis “conditions” on observed subpopulation sizes as though they were fixed. Results from using “if”, “by”, or “where” statements when analyzing the data.
• Conditional analysis is OK for simple random samples…
• …but not necessarily for stratified samples:
  – The distribution of subpopulation cases, m(h), to strata h = 1,…,H is a random variable. Rarely fixed.
  – Correct if the subpopulation of interest is used to define explicit strata, e.g., Census Region in a multi-stage national sample of U.S. households, or gender in a stratified (by sex) sample of men and women in a University student body.
“Unconditional” vs. “Conditional” Analysis
• “Unconditional” analysis treats stratum subpopulation sizes as a random variable, m(h), h = 1,…,H.
• Variability of the subpopulation across strata and to clusters within strata (including m(hi) = 0) must be reflected in the variance estimation.
• The Stata subpop() and SAS DOMAIN keywords/options ensure that this variability is reflected in standard errors of estimates.
Unconditional Analyses in Stata
• over(varname) option for command
– varname is a categorical variable
– Analyses (correct) will be replicated for each
level of the categorical varname
• subpop(varname) for command
– varname is a user-generated 0,1 variable where
1 indicates membership in subclass
– applies to all svy procedures
Unconditional Analyses in Stata
• The Stata subpop( ) feature always produces the correct result.
• Stata examines the design distribution (by strata and clusters) of the subpopulation. If a stratum contains 0 cases, that stratum and its clusters do not contribute to degrees of freedom, i.e., DF = # clusters -# of strata.
  – The DOMAIN keyword will produce an analysis for each level of the categorical variable varname
• PROC SURVEYFREQ;
    TABLES domain*var1*var2;
  – Produces a separate table of var1*var2 for each level of the categorical variable (e.g., gender) that is listed first in the statement.
SAS Notes
• PROC SURVEYREG, PROC SURVEYLOGISTIC: DOMAIN statement
• No built-in feature for subpopulation analysis in these procedures prior to Version 9.2.
• Alternatives:
  – Subsetting “if” (NOT recommended; SAS will give you a warning message if used)
SAS Code for Subpopulation Estimates Example
(Output not shown).
proc format ;
value edf 1='0-11 Years' 2='12 Years' 3='13-15 Years' 4='16+
Years' ;
title "Analysis Example 5.12: Proportions in SubGroups: HRS" ;
proc surveymeans data=hrs mean clm ;
strata stratum ;
cluster secu ;
weight kwgthh ;
domain kfinr*edcat ;
var h8atota ;
format edcat edf. ;
run ;
SAS Notes
• SAS counts all strata and clusters toward estimation of degrees of freedom, even under the domain xxx; statement.
• This “trick” may lead to overestimation of degrees of freedom. Strata with 0 subpopulation cases will be counted toward the df, e.g., always df = 32 for NHANES II. This will yield narrower CIs than subpopulation analysis in other software systems such as Stata or R.
Sampling Errors for Functions of Estimates: Differences of Subpopulation Means

  var( Σ_{j=1}^{J} a_j θ̂_j ) = Σ_{j=1}^{J} a_j² var(θ̂_j) + 2 Σ_{j=1}^{J−1} Σ_{k=j+1}^{J} a_j a_k cov(θ̂_j, θ̂_k)

  where a_j, a_k are any chosen constants.

Example:

  var(ȳ_sub1 − ȳ_sub2) = var(ȳ_sub1) + var(ȳ_sub2) − 2 cov(ȳ_sub1, ȳ_sub2)

  where ȳ_sub1, ȳ_sub2 are estimates of the mean of y for two subclasses.
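For example, given variance and covariance estimates taken from the software's variance-covariance matrix (hypothetical values here), the standard error of the difference is:

```python
# se of a difference of two subpopulation means, from variance and
# covariance estimates (hypothetical values standing in for entries of
# the estimated variance-covariance matrix).
import math

var1, var2 = 0.0040, 0.0025   # var(ybar_sub1), var(ybar_sub2)
cov12 = 0.0010                # cov(ybar_sub1, ybar_sub2); positive under clustering

var_diff = var1 + var2 - 2 * cov12
se_diff = math.sqrt(var_diff)

# With cov12 = 0 (e.g., subpopulations defined by distinct design strata)
# the se would be sqrt(0.0065); the positive covariance reduces it.
```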
Sampling Errors for Functions of Estimates:
Complex Sample Designs
• Sampling errors for statistics that are functions of survey
estimates require computations of variances and
covariances of estimates. These should be stored by the
software in a variance/covariance matrix.
• In complex sample designs, the covariance terms for
subpopulation estimates are often positive due to the
clustering (non-independence) in the design.
• This is true even when the subpopulations are distinct
(male and female). The covariance will be zero if the
subpopulations are defined by distinct design strata (e.g.,
Northeast vs. South Region).
Stata Example: Compute Subpopulation Means
. svy, vce(linearized): mean numadl, over(arthrtis)
svy, subpop(age18p): tab irregular, se ci col deff
svy, subpop(age18p): proportion irregular
svy, subpop(age18p): mean irregular
• The logistic or polytomous regression model framework can be used to analyze categorical data when one variable can be designated as the dependent variable and the other variables as explanatory or predictor variables.
• Many times we are interested only in investigating
associations between a set of categorical variables.
• A common approach is to compute a chi-square
statistic under the null hypothesis, and fail to
reject the null hypothesis if this chi-square statistic
is within the range of values that would be
expected.
Chi-square Tests: Pearson and Likelihood Ratio
• The chi-square test statistic is a measure of distance between the expected and observed counts in cells.
• Pearson:

  X²_Pearson = n Σ_r Σ_c (p̂_rc − π̂_rc)² / π̂_rc

• Likelihood Ratio:

  G² = 2n Σ_r Σ_c p̂_rc ln(p̂_rc / π̂_rc)

• If the observed data are consistent with the null hypothesis, then we would expect this distance measure (the test statistic) to be small.
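A small sketch computing both statistics from a table of estimated proportions, with the expected proportions formed under independence (hypothetical 2×2 table, n = 100):

```python
# Pearson X^2 and likelihood-ratio G^2 from a table of cell proportions
# (hypothetical data; expected proportions under independence H0).
import math

n = 100
p = [[0.30, 0.20],
     [0.20, 0.30]]   # observed cell proportions p_rc

rows = [sum(r) for r in p]                                    # row margins
cols = [sum(p[i][j] for i in range(2)) for j in range(2)]     # column margins
pi0 = [[rows[i] * cols[j] for j in range(2)] for i in range(2)]  # expected under H0

X2 = n * sum((p[i][j] - pi0[i][j]) ** 2 / pi0[i][j]
             for i in range(2) for j in range(2))
G2 = 2 * n * sum(p[i][j] * math.log(p[i][j] / pi0[i][j])
                 for i in range(2) for j in range(2))
```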
Adjusting for Design Effects
• How do we incorporate the design features into this analysis? Since ignoring the design features (weighting) can introduce bias in the estimated proportions that go into the chi-square statistics, the unadjusted analysis is not valid.
• Two approaches:
  – Fellegi (JASA, 1980) corrects the SRS chi-square statistic.
  – Rao-Scott (1984, Annals of Statistics), Rao-Thomas (1987)
Rao-Scott Method
• Use the weighted proportions in the construction of Chi-square statistics. Develop an approximate F-reference distribution for determining a p-value.
• This is analytically complicated and requires computation of generalized design effects. These are the eigenvalues of the “design effect” matrix for the proportions used to compute the Chi-square statistic.
• Most software packages enabling analysis of complex sample survey data have implemented this approach.