Optimal allocation algorithm for a multi-way stratification design
P.D. Falorsi, P. Righi, Italian National Statistical Institute
NTTS 2011 Conference, 22 – 24 February 2011, Brussels
Outline
o Overview
o Multi-way Sampling Design
o Multi-way optimal allocation procedure
o Monte Carlo simulation
NTTS 2011 22-24 February 2011 P.D. Falorsi and P. Righi 2
Overview
o Large-scale surveys in Official Statistics usually produce estimates for a set of parameters over a huge number of highly detailed estimation domains
o These domains generally define non-nested partitions of the target population
o When the domain indicator variables are available at the sampling frame level, we may plan a sample covering each domain
Overview
Why fix a sample size in each domain:
o It allows direct estimators to be applied
o The sampling errors of the main estimates can be evaluated when planning the sample
o When the direct estimator is not reliable (small area problem), having sampled units in the domains makes it possible to:
   o bound the bias of small area indirect estimators;
   o use models with specific small area effects.
Overview
Standard solution for fixing the sample sizes in domains belonging to two or more partitions:
o Stratify the sample with strata given by the cross-classification of the variables defining the different partitions (cross-classification or one-way stratified design)
Overview
Main drawbacks:
o Too detailed stratification
o Risk of sample size explosion
o Inefficient sample allocation
o Risk of statistical burden
Overview
Some examples (1):
Inefficient sample allocation
Overview
Some examples (2): statistical burden
Strata distribution by number of enterprises in the Small and Medium Enterprises Survey (2003)

Enterprises    Absolute     Cumulative   %            % Cumulative
per stratum    frequency    frequency    Frequency    frequency
1                  4,700        4,700       18.7          18.7
2                  2,512        7,212       10.0          28.7
3-5                3,816       11,028       15.2          43.9
6-10               2,815       13,843       11.2          55.1
>10               11,286       25,129       44.9         100.0
Overview
Some examples (3):
Italian Graduates' Career Survey, 2010: sample size about 90,000 units
Number of domains in the non-nested partitions and number of cross-classified strata, by type of degree

                    3 years     Long
First partition         448      425
Second partition         94      198
Strata                2,981    4,778

o Sample size explosion
o Inefficient sample allocation
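As a rough illustration of the sample size explosion, the counts above can be turned into a lower bound on the sample size forced by the cross-classified design. The sketch assumes a minimum of 2 units per stratum, which is the constraint quoted for the standard approach in the Monte Carlo comparison; the variable names are illustrative only.

```python
# Counts taken from the table above (Italian Graduates' Career Survey).
strata = {"3 years": 2981, "Long": 4778}
domains = {"3 years": 448 + 94, "Long": 425 + 198}

MIN_UNITS_PER_STRATUM = 2  # assumption: same minimum as in the simulation

# Units that must be sampled before any precision target is considered.
min_sample = {deg: MIN_UNITS_PER_STRATUM * n for deg, n in strata.items()}
total_min_sample = sum(min_sample.values())

# Average number of cross-classified strata generated per planned domain.
strata_per_domain = {deg: strata[deg] / domains[deg] for deg in strata}

print(min_sample)
print(total_min_sample)
print(strata_per_domain)
```

Under this assumption a sizeable share of the roughly 90,000 sampled units is committed by the stratification alone, independently of the precision required in the domains.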
Multi-way Sampling Design
o A multi-way (or incomplete) stratification design (MWD) satisfies the sample allocation at domain level without cross-classification
o The sample sizes of the cross-classified strata are random variables
o Main problem of MWD: defining a random selection procedure
Multi-way Sampling Design
Use of the Cube Method (Deville and Tillé, 2004)
o The method selects balanced samples in the model-assisted framework
o A sample s is balanced on a set of auxiliary variables z (balancing variables) if the Horvitz-Thompson estimates of their totals equal the known population totals:
   $\hat{t}_{z,ht} = \sum_{k \in s} z_k / \pi_k = \sum_{k \in U} z_k = t_z$
o MWD is a special case of balanced sampling
o The method works well with a large population and a large number of domains
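The balancing condition can be checked numerically. The sketch below assumes the usual MWD choice of balancing variables, z_k = π_k δ_k, where δ_k is the vector of domain indicators of unit k; this choice, the function names and the toy population are assumptions, not taken from the slides. With these variables the Horvitz-Thompson estimate reduces to the realised domain sample sizes, and the population total to the planned ones.

```python
def ht_domain_totals(delta, pi, sample):
    """Horvitz-Thompson estimate of the totals of z_k = pi_k * delta_k.

    Since z_k / pi_k = delta_k, this is the number of sampled units per domain.
    """
    n_domains = len(delta[0])
    return [sum(pi[k] * delta[k][j] / pi[k] for k in sample)
            for j in range(n_domains)]

def planned_domain_sizes(delta, pi):
    """Population totals of z_k: the expected sample size of each domain."""
    n_domains = len(delta[0])
    return [sum(p * row[j] for p, row in zip(pi, delta))
            for j in range(n_domains)]

def is_balanced(delta, pi, sample, tol=1e-9):
    """True if the sample reproduces the planned size in every domain."""
    est = ht_domain_totals(delta, pi, sample)
    plan = planned_domain_sizes(delta, pi)
    return all(abs(e - p) <= tol for e, p in zip(est, plan))

# Toy population: 8 units, two partitions (A1/A2 and B1/B2).
# Columns of delta: A1, A2, B1, B2.
delta = [
    [1, 0, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0], [1, 0, 0, 1],
    [0, 1, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0], [0, 1, 0, 1],
]
pi = [0.5] * 8  # planned size: 2 units in each of the four domains

print(is_balanced(delta, pi, [0, 1, 4, 5]))  # 2 units in every domain
print(is_balanced(delta, pi, [0, 2, 4, 5]))  # 3 units in B1, 1 in B2
```

The first sample satisfies both partitions at once although the four cross-classified cells are not controlled individually, which is the point of the multi-way design.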
Multi-way optimal allocation procedure
o The aim of the work is to define a procedure that gives the optimal allocation and a selection method suitable for large-scale surveys
o The procedure is based on three main steps
1. Sample allocation (optimization step, vector π)
   o minimizes the overall sample size n while guaranteeing that the sampling variances are lower than prefixed precision thresholds
   o deals with a multivariate-multidomain problem
2. Definition of the final inclusion probabilities (calibration step)
3. Sample selection (balancing step)
Multi-way optimal allocation procedure
Notation and essential terms:
o Domain b, partition d
o Domain indicator variable
o Parameter of interest and estimator
o Balancing variables
Multi-way optimal allocation procedure
Variance approximation of balanced sampling:
With
Multi-way optimal allocation procedure
1. Theoretical constrained optimization problem (optimization step):
Constraints
Multi-way optimal allocation procedure
Equivalent problem:
Given
With
The inequality constraints are equivalent to
with
Multi-way optimal allocation procedure
Issues of the theoretical optimization problem:
o Solution by means of a modified Chromy algorithm that takes the constraints into account
o Iterative procedure, because the unknown terms appear on both the left and the right side of the inequality constraints
Multi-way optimal allocation procedure
Optimal allocation algorithm:
o Assign values to the unknown terms on the right side of the inequality (initialization values or values obtained in the previous iteration)
o Keep these values fixed and use the modified Chromy algorithm to obtain the values on the left side
o Iterate the modified Chromy algorithm until the convergence criterion is satisfied, using the left-side values of the previous iteration for the right side
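The iteration can be sketched in a few lines for the simplest case of a single variance constraint. Several parts are assumptions, since the formulas were not preserved in the slides:
o the constraint is written as sum_k w_k/π_k ≤ V* + R(π), with known coefficients w_k on the left and R(π) = sum_k w_k/π_k − V(π) evaluated at the previous iterate on the right;
o the inner problem is solved in closed form (π_k proportional to sqrt(w_k), with take-all units at π_k = 1), which stands in for the modified Chromy algorithm and only handles one constraint;
o the toy variance function and all names are hypothetical.

```python
import math

def solve_inner(w, c, pi_min=1e-9):
    """Minimise sum(pi) subject to sum(w_k / pi_k) <= c and 0 < pi_k <= 1.

    Closed-form single-constraint solution; units whose solution would
    exceed 1 are fixed at 1 (take-all) and the rest is solved again.
    """
    free = list(range(len(w)))
    pi = [1.0] * len(w)
    budget = c
    while free:
        s = sum(math.sqrt(w[k]) for k in free)
        over = [k for k in free if math.sqrt(w[k]) * s > budget]
        if not over:
            for k in free:
                pi[k] = max(pi_min, math.sqrt(w[k]) * s / budget)
            break
        budget -= sum(w[k] for k in over)  # a take-all unit contributes w_k / 1
        free = [k for k in free if k not in over]
    return pi

def allocate(w, variance, v_star, pi0=0.5, tol=1e-8, max_iter=500):
    """Fixed-point iteration: freeze the right side, solve for the left side."""
    pi = [pi0] * len(w)
    for it in range(1, max_iter + 1):
        # Right side of the constraint, evaluated at the previous iterate.
        rhs = v_star + sum(wk / p for wk, p in zip(w, pi)) - variance(pi)
        new_pi = solve_inner(w, rhs)
        change = max(abs(a - b) for a, b in zip(new_pi, pi))
        pi = new_pi
        if change < tol:
            return pi, it, True
    return pi, max_iter, False

def toy_variance(y, sigma2):
    """Hypothetical variance: residuals on one balancing variable z_k = pi_k."""
    def variance(pi):
        den = sum(p * (1.0 - p) for p in pi)
        b = sum((1.0 - p) * yk for p, yk in zip(pi, y)) / den if den > 0 else 0.0
        return sum((1.0 / p - 1.0) * ((yk - p * b) ** 2 + sigma2)
                   for p, yk in zip(pi, y))
    return variance

y = [1.0, 2.0, 4.0, 8.0, 16.0]
sigma2 = 4.0
w = [yk ** 2 + sigma2 for yk in y]
pi, iters, converged = allocate(w, toy_variance(y, sigma2), v_star=200.0)
print(sum(pi), iters, converged)
```

The function reports whether the convergence criterion was met rather than assuming it; the number of iterations in this toy case says nothing about the behaviour of the modified Chromy algorithm on the real multivariate-multidomain problem.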
Multi-way optimal allocation procedure
Predicted constrained optimization problem:
o In practice the term is unknown and a prediction must be used
o Given a superpopulation model
express
Multi-way optimal allocation procedure
o To take the uncertainty of the model into account, we replace the variance with the Anticipated Variance (AV)
o An upward approximation is given by
being obtained by means of the predicted value
Multi-way Sampling Design
Remark: the cross-classification stratified design assumes a fixed superpopulation model defined in each stratum:
$E(y_{r,k}) = \mu_{rh}$ for $k \in$ stratum $h$, $Var(y_{r,k}) = \sigma^2_{rh}$, $Cov(y_{r,k}, y_{r,l}) = 0$
Multi-way optimal allocation procedure
2. Definition of the final inclusion probabilities (calibration step):
o Given the vector π, use a calibration procedure to calculate the final inclusion probabilities such that each expected domain sample size is an integer
3. Sample selection (balancing step) with cube method
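One way to picture the calibration step is raking (iterative proportional fitting) of the inclusion probabilities to integer domain totals. The slides do not say which calibration procedure is used, so raking is an assumption here, as are the function names, the toy probabilities and the rule of rounding each domain total up to the next integer. The sketch needs the integer targets of the two partitions to share the same total, and it does not enforce π ≤ 1.

```python
import math

def domain_sums(pi, labels):
    """Sum of inclusion probabilities (expected sample size) per domain."""
    sums = {}
    for p, lab in zip(pi, labels):
        sums[lab] = sums.get(lab, 0.0) + p
    return sums

def rake(pi, partitions, targets, tol=1e-10, max_iter=500):
    """Rescale pi partition by partition until every domain sum hits its target.

    partitions[i][k] is the domain label of unit k in partition i;
    targets[i][label] is the integer expected sample size for that domain.
    """
    pi = list(pi)
    for _ in range(max_iter):
        worst = 0.0
        for labels, target in zip(partitions, targets):
            sums = domain_sums(pi, labels)
            worst = max(worst, max(abs(sums[lab] - target[lab]) for lab in target))
            pi = [p * target[lab] / sums[lab] for p, lab in zip(pi, labels)]
        if worst < tol:
            break
    return pi

# Toy input: 8 units, partition A (A1/A2) crossed with partition B (B1/B2).
pi = [0.25, 0.15, 0.25, 0.15, 0.55, 0.30, 0.55, 0.30]
part_a = ["A1"] * 4 + ["A2"] * 4
part_b = ["B1", "B2"] * 4
partitions = [part_a, part_b]

# Round each expected domain size up to an integer; here both partitions total 3.
targets = [{lab: math.ceil(s) for lab, s in domain_sums(pi, labels).items()}
           for labels in partitions]

pi_star = rake(pi, partitions, targets)
print(targets)
print(domain_sums(pi_star, part_a), domain_sums(pi_star, part_b))
```

Rounding up makes the calibrated sample slightly larger than the optimal one, which is the direction of the change reported in the simulation results (for example 171 to 182).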
Monte Carlo Simulation
Objectives of simulation:
o Test the convergence of the optimization algorithm (optimization step)
o Verify the expected AV against the Monte Carlo empirical AV
o Comparison with the standard cross-classified stratified design
Monte Carlo Simulation
Data:
o Subpopulation of the Istat Italian Graduates' Career Survey (3,427 units)
o Driving allocation variables:
   o employment status (yes/no), y1;
   o actively seeking work (yes/no), y2.
o We generate the values of the two variables by means of a logistic additive model (prediction model)
o Explanatory variables: degree mark, sex, age class and aggregation of subject area degree (different for y1 and y2)
o The parameters are estimated with the data from the previous survey
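The generation of a binary variable from an additive logistic model can be sketched as follows. The coefficient values, the covariate coding and the toy units are hypothetical (the parameter estimates from the previous survey are not reported in the slides), and the subject-area term is left out for brevity.

```python
import math
import random

# Hypothetical coefficients of an additive logistic model for one variable.
COEF = {
    "intercept": -1.0,
    "mark": 0.02,                      # effect per degree-mark point
    "sex": {"F": 0.0, "M": 0.3},
    "age": {"<25": 0.0, "25+": -0.2},
}

def probability(unit, coef=COEF):
    """P(y = 1) for one unit: inverse logit of the additive linear predictor."""
    eta = (coef["intercept"] + coef["mark"] * unit["mark"]
           + coef["sex"][unit["sex"]] + coef["age"][unit["age"]])
    return 1.0 / (1.0 + math.exp(-eta))

def generate(units, rng, coef=COEF):
    """Draw one 0/1 value per unit from its predicted probability."""
    return [1 if rng.random() < probability(u, coef) else 0 for u in units]

units = [
    {"mark": 110, "sex": "F", "age": "<25"},
    {"mark": 95, "sex": "M", "age": "25+"},
    {"mark": 102, "sex": "M", "age": "<25"},
]
rng = random.Random(2011)  # fixed seed so the draw is reproducible
probs = [probability(u) for u in units]
values = generate(units, rng)
print(probs)
print(values)
```

The predicted probabilities play the role of the predicted values used in the allocation, while the 0/1 draws correspond to the generated population values used in the simulation.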
Monte Carlo Simulation
Survey target estimates:
o 8 types of estimation domains
o Two partitions define the most disaggregated domains:
   o First partition: university by subject area degree (9 classes)
   o Second partition: degree by sex
o Domains: 448+94; strata: 2,981 (university, degree, sex)
o In the simulation: domains 20+15; strata 91
Monte Carlo Simulation
Error thresholds fixed in terms of CV(%)
Results, assuming the values as known:
o Iterations of the modified Chromy algorithm: 6
o Optimal sample size 171, after calibration 182
Results, assuming the values as predicted:
o Iterations of the modified Chromy algorithm: 3
o Optimal sample size 699, after calibration 707
Monte Carlo Simulation
Analysis of the allocation with the predicted values
o The sample allocation procedure uses an approximation of the AV

Average of Expected Anticipated CV(%)
Partition      y1      y2
1             8.1    17.8
2             9.2    19.1

Average of Empirical (10,000 Monte Carlo simulations) Anticipated CV(%)
Partition      y1      y2
1             6.7    14.7
2             7.4    15.5

The simulation confirms the input AV is an upward approximation of the real AV
Monte Carlo Simulation
Comparison with the standard approach:
o The implicit model is similar to the model used in our approach
o The allocation differences depend on the constraint of a minimum number of units (2) in each stratum
o The sample size is 751 units (+7.4%)
o Considering only the domains with small population strata (<10 units per stratum on average), the standard approach produces a +14.4% sample size
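The +7.4% figure can be checked against the sample sizes reported for the predicted-values case. The short computation below suggests that the baseline is the optimal size before calibration (699) rather than the calibrated size (707); this reading is an inference from the arithmetic, not something stated in the slides.

```python
def pct_increase(new, base):
    """Relative increase of new over base, in percent."""
    return 100.0 * (new - base) / base

optimal, calibrated, standard = 699, 707, 751

vs_optimal = round(pct_increase(standard, optimal), 1)
vs_calibrated = round(pct_increase(standard, calibrated), 1)

print(vs_optimal)      # 7.4
print(vs_calibrated)   # 6.2
```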