Estimating the Size of Hidden Populations using Respondent-Driven Sampling Data

Mark S. Handcock, Department of Statistics, University of California - Los Angeles
Krista J. Gile, Department of Mathematics, University of Massachusetts - Amherst
Corinne M. Mar, Center for Studies in Demography and Ecology, University of Washington

Supported by NIH grants 1R21HD063000 and 5R21HD075714-02, NSF awards MMS-0851555, SES-1230081 and MMS-1357619 and the DoD ONR MURI award N00014-08-1-1015.

Working Papers available at http://www.stat.ucla.edu/~handcock and http://arXiv.org

UNAIDS Reference Group Consultation, UMass, June 9-10, 2014
Hard-to-Reach Population Methods Research Group
- Isabelle Beaudry, UMass
- Ian E. Fellows, Fellows Statistics
- Krista J. Gile, UMass
- Mark S. Handcock, UCLA
- Lisa G. Johnston, Tulane University, UCSF
- Corinne M. Mar, University of Washington
- http://hpmrg.org
The key is the modeling of the sampling process
- Salganik and Heckathorn (2004): Markov chain model over classes
- Volz and Heckathorn (2008): Markov chain model over people
- Gile (2008, 2011): Adjusts for with-replacement effects - Successive Sampling (SS)
- Gile and Handcock (2008, 2011): Network model-assisted estimator, more realistic representation of RDS
Successive Sampling (SS)
Consider the following successive sampling (SS), or probability proportional to size without replacement (PPSWOR), sampling procedure:

- Begin with a population of $N$ units, denoted by indices $1, \ldots, N$, with varying sizes represented by $d_1, d_2, \ldots, d_N$.
- Let $G_1, \ldots, G_N$ be the indices of the successively sampled people.
- Sample the first unit from the full population $\{1, \ldots, N\}$ with probability proportional to size $d_i$. Assign the index of this unit to the random variable $G_1$.
- Select each subsequent unit with probability proportional to size from among the remaining units, such that

$$
P(G_i = k \mid G_1, \ldots, G_{i-1}) =
\begin{cases}
\dfrac{d_k}{\sum_{j \notin \{G_1, \ldots, G_{i-1}\}} d_j} & k \notin \{G_1, \ldots, G_{i-1}\} \\
0 & \text{else.}
\end{cases}
$$
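The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `successive_sample` and the example sizes are assumptions made here, and all sizes are assumed to be strictly positive.

```python
import random

def successive_sample(sizes, n, rng):
    """Draw an ordered PPSWOR sample of n indices (hypothetical helper).

    At each step one of the remaining units is selected with probability
    proportional to its size, then removed from the pool.
    """
    remaining = list(range(len(sizes)))
    sample = []
    for _ in range(n):
        weights = [sizes[k] for k in remaining]  # sizes of units not yet sampled
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        sample.append(pick)
        remaining.remove(pick)
    return sample

rng = random.Random(1)
sizes = [5, 1, 3, 2, 8]          # illustrative unit sizes d_1..d_5
order = successive_sample(sizes, 3, rng)
print(order)                      # three distinct indices, in sampling order
```

The first draw selects unit k with probability d_k divided by the total size; later draws renormalize over the units that remain, which matches the conditional probability given above.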
SS for RDS
Gile (2011) argues that RDS can be approximated by Successive Sampling under a configuration model for the network:
- Node $i$ has given degree $d_i$; consider $d_i$ edge-ends.
- Pairs of edge-ends are matched up at random.
- This is a configuration model.
- Suppose $G_1, G_2, \ldots, G_k$ are sampled by Successive Sampling according to $d$.
- Then if the network is unknown, but known to be a configuration model, tracing a link from $G_k$ will select nodes according to Successive Sampling.
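The random matching of edge-ends can be illustrated with a short sketch. The function name `configuration_model` and the example degrees are assumptions made for illustration; this simple version permits self-loops and repeated edges, and requires the degrees to sum to an even number.

```python
import random
from collections import Counter

def configuration_model(degrees, rng):
    """Pair up edge-ends uniformly at random (hypothetical helper).

    Node i contributes degrees[i] edge-ends ("stubs"); the shuffled stubs
    are matched in consecutive pairs to form edges.
    """
    stubs = [i for i, d in enumerate(degrees) for _ in range(d)]
    if len(stubs) % 2 != 0:
        raise ValueError("the degrees must sum to an even number")
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))

rng = random.Random(0)
degrees = [3, 2, 2, 1]            # illustrative degree sequence
edges = configuration_model(degrees, rng)

# Every node keeps exactly its prescribed degree.
realized = Counter()
for a, b in edges:
    realized[a] += 1
    realized[b] += 1
print(edges, dict(realized))
```

Because a randomly chosen edge-end belongs to node i with probability proportional to its remaining edge-ends, following a link in such a network reaches nodes in proportion to degree, which is the intuition behind the Successive Sampling approximation.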
Is there information in RDS data about population size?

Idea:
- Under Successive Sampling, "larger" units are typically sampled earlier
- Early sample: lots of "big" units, few "small"
- Later sample: fewer "big", more "small"
- No change implies the population is not much depleted
- Big change implies the population is very depleted

This can be quantified to estimate N.
Note: the information about N is in the ordered sample pattern!
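A small simulation can make this idea concrete. The sketch below is an illustration under assumptions chosen here (degrees uniform on 1 to 13, sample size 500, populations of 600 and 20000); none of these settings come from the slides. It uses the standard exponential-clock construction: giving unit i an exponential waiting time with rate d_i and sorting the times yields an ordered PPSWOR sample.

```python
import random

def successive_sample_order(degrees, n, rng):
    """Ordered PPSWOR sample of size n via exponential clocks (sketch)."""
    keys = [(rng.expovariate(d), i) for i, d in enumerate(degrees)]
    keys.sort()                       # earliest clock = first unit sampled
    return [i for _, i in keys[:n]]

def early_minus_late(N, n, reps, rng):
    """Average (mean degree of first half) - (mean degree of second half)."""
    total = 0.0
    for _ in range(reps):
        degrees = [rng.randint(1, 13) for _ in range(N)]  # assumed degree model
        order = successive_sample_order(degrees, n, rng)
        d = [degrees[i] for i in order]
        half = n // 2
        total += sum(d[:half]) / half - sum(d[half:]) / (n - half)
    return total / reps

rng = random.Random(0)
drop_small = early_minus_late(N=600, n=500, reps=30, rng=rng)     # heavily depleted
drop_large = early_minus_late(N=20000, n=500, reps=30, rng=rng)   # barely depleted
print(round(drop_small, 2), round(drop_large, 2))
```

With the same sample size, the decline in degree over the sample is pronounced when the population is nearly exhausted and close to zero when the population is much larger than the sample.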
Modeling the sampling process for non-ignorable sampling

- RDS is not ignorable: $P(G \mid D_{obs}, D_{unobs}) \neq P(G \mid D_{obs})$
- Information about $N$ is in the sequence of observations.
- Make inference from a joint model for the sample sequence and unit sizes.
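Inferential Approach

Observed data:
$D_{obs}$ = unit sizes (degrees) of observed units in order of observation

Goal:
$P(N \mid D_{obs})$ (posterior distribution of $N$ given the data)

Parameters:
$N$ = population size
$\eta$ = parameter of the distribution of unit sizes

$$
\begin{aligned}
P(N \mid D_{obs}) &\propto \int P(D_{obs} \mid N, \eta)\, P(\eta, N)\, d\eta \\
&= \int P(D_{obs} \mid N, \eta)\, P(\eta)\, P(N)\, d\eta \qquad \text{(independent priors)} \\
&= \int (\text{likelihood})\, (\text{prior for } \eta)\, (\text{prior for } N)\, d\eta \\
&= \int P(D_{obs} \mid G, U, D, \eta)\, P(G \mid U, \eta)\, P(U \mid N, \eta)\, P(\eta)\, P(N)\, d\eta \\
&= \int P(\text{samp given degrees})\, P(\text{degrees})\, (\text{prior } \eta)\, (\text{prior } N)\, d\eta
\end{aligned}
$$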
Models for Degrees

Parametric model for the degrees:

$$ d_i \overset{iid}{\sim} f(\cdot \mid \eta) $$

with support $d = 0, 1, \ldots$ and parameter $\eta$.

See the extensive papers by Handcock and Jones. To specify $f(\cdot \mid \eta)$, we can consider:

1. Poisson.
2. Negative binomial. This allows Gamma over-dispersion over Poisson.
3. Yule, Waring. This allows power-law over-dispersion over Poisson.
4. Poisson-log-normal. This allows log-normal over-dispersion over Poisson. It allows more over-dispersion than the Negative Binomial but less than the power-law models.
5. Conway-Maxwell-Poisson distribution. This allows both under-dispersion and over-dispersion with a single additional parameter over a Poisson.
6. Non-parametric lower tails: to allow for poor fit in the lower degrees.

These are all coded up in the CRAN degreenet package and/or the size package.
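As an illustration of item 5, the sketch below evaluates a Conway-Maxwell-Poisson probability mass function in its standard (λ, ν) form, where P(X = x) is proportional to λ^x / (x!)^ν. This is not the code in degreenet or size: the function names are hypothetical, the (λ, ν) parametrization is an assumption (the slides parametrize by mean and standard deviation), and the support is truncated at an assumed maximum degree K so the normalizing sum is finite.

```python
import math

def cmp_pmf(lam, nu, K):
    """Conway-Maxwell-Poisson pmf on 0..K, renormalized after truncation at K."""
    # Work on the log scale to avoid overflow in lam**x and factorials.
    log_w = [x * math.log(lam) - nu * math.lgamma(x + 1) for x in range(K + 1)]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def mean_var(pmf):
    """Mean and variance of a pmf given as a list indexed by x = 0, 1, ..."""
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, var

poisson_like = cmp_pmf(3.0, 1.0, 100)     # nu = 1 recovers the Poisson
under = mean_var(cmp_pmf(3.0, 2.0, 100))  # nu > 1: variance below the mean
over = mean_var(cmp_pmf(3.0, 0.5, 100))   # nu < 1: variance above the mean
print(under, over)
```

Setting ν = 1 gives the Poisson, ν > 1 gives under-dispersion and ν < 1 gives over-dispersion, which is the single extra parameter referred to in the list.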
Prior for the degree distribution model

Each degree distribution model is parametrized by its mean and standard deviation:

$$ \eta = g(\mu, \sigma) $$

$$ \mu \mid \sigma \sim N(\mu_0, \sigma_0 / df_{mean}), \qquad \sigma \sim \mathrm{Inv}\chi(\sigma_0; df_{sigma}) $$

Use a diffuse default prior on the degree model parameters (equivalent sample sizes $df_{mean} = 1$ and $df_{sigma} = 5$).
Prior for Population Size N

Many possibilities:
- The data truncates the prior below the sample size.
- A uniform prior is improper.
- Natural parametric models (e.g., Negative Binomial, Poisson-log-normal, Conway-Maxwell-Poisson).
- Natural parametric models are too thin in the tails.
- Instead: specify prior knowledge about the sample proportion (i.e., $n/N$).

- Simple prior: uniform on $n/N$.
- Gives a closed-form prior on $N$ with infinite mean (median = $2n$).
- Generalize to $n/N \sim \mathrm{Beta}(\alpha, \beta)$.

The density on $N$ (considered continuous) is:

$$ \pi(N) = \frac{n^{\alpha}}{B(\alpha, \beta)} \cdot \frac{(N - n)^{\beta - 1}}{N^{\alpha + \beta}} \quad \text{for } N > n, $$

which reduces to $\beta n (N - n)^{\beta - 1} / N^{\beta + 1}$ when $\alpha = 1$.

- The distribution has tail behavior $O(1/N^{\alpha+1})$.
- Elicit the median or mode from field researchers and translate to $\beta$ and/or $\alpha$.
Figure: Three example prior distributions for the population size (N), labeled mean=1000, median=1000 and mode=1000. They correspond to α = 1 and β = 1.55, 1.16 and 3. (Horizontal axis: population size (N), 500 to 3000; vertical axis: density.)
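Likelihood: Notation

Let:
- $D_{obs} = (D_1, \ldots, D_n)$ be the random ordered observed degrees (ordered for notation).
- $D_{unobs} = (D_{n+1}, \ldots, D_N)$ be the unordered random unobserved degrees.
- $d_{obs} = (d_1, \ldots, d_n)$ and $d_{unobs} = (d_{n+1}, \ldots, d_N)$ be their realized values.
- $G = (G_1, \ldots, G_n)$ be the random indices of the ordered sample and $g_{obs}$ be the observed sequence.

Likelihood

$$
\begin{aligned}
P(D_{obs} \mid N, \eta) &= \sum_{d} \sum_{g} p(D_{obs} = d_{obs} \mid G = g, D = d, \eta)\, p(G = g \mid D = d, \eta)\, p(D = d \mid \eta) \\
&= \frac{N!}{(N-n)!} \sum_{d_{unobs} \in \mathcal{D}_U(d_{obs})} p(G = (1, \ldots, n) \mid D = d) \prod_{j=1}^{N} f(d_j \mid \eta)
\end{aligned}
$$

where $d = (d_{obs}, d_{unobs})$ and $\mathcal{D}_U(d_{obs})$ is the set of possible $d_{unobs}$ given $d_{obs}$.

$$ P(G \mid D = d, N, \eta) = \prod_{k=1}^{n} \frac{d_k}{\lambda_k} $$

where

$$ \lambda_k = \sum_{i=k}^{N} d_i = \sum_{i=k}^{n} d_i + \sum_{i=n+1}^{N} d_i, \qquad k = 1, \ldots, n, $$

depends on both $d_{obs}$ and $d_{unobs}$.

$$ L[N, \eta \mid D_{obs}, G] \propto \frac{N!}{(N-n)!} \sum_{d_{unobs} \in \mathcal{D}_U(d_{obs})} \prod_{k=1}^{n} \frac{d_k}{\lambda_k} \cdot \prod_{j=1}^{N} f(d_j \mid \eta) $$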
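The ordering term of the likelihood is simple to evaluate once the unobserved degrees are given. The sketch below is an illustration only; the function name `log_order_prob` and its arguments are assumptions made here. It computes the log of the product of d_k / λ_k, using the fact that λ_k involves the unobserved degrees only through their total.

```python
import math

def log_order_prob(d_obs, unobs_total):
    """log P(G = (1, ..., n) | D = d) under successive sampling (sketch).

    d_obs       : observed degrees in order of observation (all positive)
    unobs_total : sum of the unobserved degrees d_{n+1} + ... + d_N
    """
    remaining = sum(d_obs) + unobs_total      # lambda_1: total size not yet sampled
    logp = 0.0
    for d_k in d_obs:
        logp += math.log(d_k) - math.log(remaining)
        remaining -= d_k                      # lambda_{k+1} = lambda_k - d_k
    return logp

# Two observed units with degrees 2 then 1, and one unobserved unit of degree 1:
# lambda_1 = 4, lambda_2 = 2, so the probability is (2/4) * (1/2).
p = math.exp(log_order_prob([2, 1], 1))
print(p)
```

The full likelihood still requires summing this term, weighted by the degree model, over all possible unobserved degree vectors, which is the source of the computational burden.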
Inference

- The likelihood can be maximized.
- It can be combined with priors to compute the posterior.
- Note the computational complexity: the likelihood involves a sum over $N - n$ embedded sums over infinite spaces.
Example: N=1000, homophily=2, diff. activity=3

[Figure: three panels, each titled "posterior for population size". Horizontal axis: population size (500 to 2500); vertical axis: density.]
Application: Estimating the numbers of those most at risk for HIV in Cities in El Salvador

- Surveillance surveys in El Salvador
- Focus on high-risk groups: female sex workers (FSW)
- RDS study of size n = 184 in 2010.

Figure: Graphical representation of the recruitment tree for the sampling of female sex workers in Sonsonate, El Salvador in 2010. The nodes are the respondents and the wave number increases as you go down the page. The node gray scale is proportional to the network size reported by the worker, with white being degree one and black the maximum degree.
[Figure: three panels. Horizontal axis: population size (184 to 4000); vertical axis: posterior density (x 10^-4).]

Figure: Posterior distribution for the number of female sex workers in Sonsonate based on three prior distributions for the population size: flat, matching the midpoint UNAIDS estimate, and interval-matching the UNAIDS estimate. The prior is dashed. The red mark is at the posterior median. The green mark is at the posterior mean. The blue lines are at the lower and upper bounds of the 95% highest-probability-density interval. The purple lines mark the lower and upper UNAIDS guidelines.
Simulation Study

Simulate Population
- 1000, 835, 715, 625, 555, or 525 nodes
- 20% "Infected"

Simulate Social Network (from ERGM, using statnet)
- Mean degree 7
- Homophily on Infection: $\alpha = \dfrac{E(\#\text{ infected to infected tie})}{E_{R=0}(\#\text{ infected to infected tie})} = 5$ (or other)
- Differential Activity: $\omega = \dfrac{\text{mean degree infected}}{\text{mean degree uninfected}} = 1$ (or other)

Simulate Respondent-Driven Sample
- 500 total samples
- 10 seeds, chosen proportional to degree
- 2 coupons each
- Coupons at random to relations
- Sample without replacement

Blue parameters varied in study.
Evaluating Performance: Frequentist properties of Bayesian method

- Point estimates: are they about right on average?
- Using the Bayesian framework, use probability intervals for the population size (Highest Posterior Density Credible Intervals - CIs)
- Compare Frequentist properties: CI width and coverage rates
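For a posterior evaluated on a grid of population sizes, a highest-posterior-density region can be found by adding grid points in order of decreasing probability until the target mass is reached. The sketch below is a generic illustration, not the authors' code; the function name `hpd_set` and the toy posterior are assumptions made here.

```python
def hpd_set(values, probs, level=0.95):
    """Smallest set of grid values with posterior mass at least `level` (sketch)."""
    # Visit grid points from most to least probable.
    order = sorted(range(len(values)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(values[i])
        mass += probs[i]
        if mass >= level:
            break
    return sorted(chosen)

# Toy discrete posterior on five grid points (illustrative numbers only).
values = [1, 2, 3, 4, 5]
probs = [0.05, 0.2, 0.5, 0.2, 0.05]
region = hpd_set(values, probs, level=0.85)
interval = (region[0], region[-1])   # an interval when the posterior is unimodal
print(region, interval)
```

In a simulation study, the coverage rate is then the proportion of simulated samples whose interval contains the true population size.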
Figure: Spread of central 95% of simulated population size estimates(posterior means) for 5 population sizes for low, accurate, and high priors.Dots represent means. Estimates are represented as multiples of the truepopulation size (red line at 1 indicates true population size). Numbers belowthe bars are coverage rates of 95% HPD intervals.
Figure: Spread of central 95% of simulated population size estimates(posterior means) for population size 1000 for low, accurate, and high priors,with varying levels of homophily (α) and differential activity (ω).
Discussion
- It is important to estimate the networked population size
- There is information on the population size implicit in the decreasing degrees of the sample nodes over time
- Using the successive sampling model we can model the decrease
- We can incorporate prior information about the population size using the Bayesian framework
- We can incorporate other features of the population
- We can estimate population means (e.g., prevalence and counts)
- In the Bayesian framework we can estimate uncertainty of the estimates in a natural way
Cautions
- The difference between the model with disease and without highlights the importance of the specification of the model for the degree distribution.
- The estimates depend on the prior distribution for population size.
- The estimates are biased because the successive sampling model is not perfect, and it will be increasingly misspecified as we get further from the configuration network.
- There is another important (and influential) tuning parameter: K, the truncation value for degrees.
- This approach is promising. It is designed to be combined with data from other methods (e.g., scale up) to provide the most accurate overall estimate.
- Fundamentally, RDS data typically do not contain much information about the population size. The Bayesian approach enables us to quantify this.
[Figure: Loess fit to sequence of degrees (all samples); red is minimum, green is maximum, blue is best. Horizontal axis: time sequence (0 to 500); vertical axis: p(<= observed degree), 0 to 1.]
Prior for the population size

- A uniform prior on the sample proportion $n/N$ translates to a closed form for the prior on $N$, which has infinite mean
- Generalize to $n/N \sim \mathrm{Beta}(1, \beta)$

The density function on $N$ (considered as a continuous variable) is:

$$ f(N \mid n) = \frac{\beta n (N - n)^{\beta - 1}}{N^{\beta + 1}} \quad \text{for } N > n $$

The distribution has tail behavior $\approx 1/N^2$. The mode of the prior is at $0.5\, n (\beta + 1)$ and the median is given by $n / (1 - (1/2)^{1/\beta})$. The median or mode can be elicited from field researchers and translated to $\beta$. A uniform distribution on the sample proportion corresponds to a median of twice the sample size.
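The formulas above translate directly into code. The following is a minimal sketch for the α = 1 case, with hypothetical function names and an illustrative sample size of n = 500; the mode formula is used here only for β ≥ 1, where the density has an interior maximum or is decreasing from N = n.

```python
import math

def prior_density(N, n, beta):
    """Density of N when n/N ~ Beta(1, beta); zero for N <= n."""
    if N <= n:
        return 0.0
    return beta * n * (N - n) ** (beta - 1) / N ** (beta + 1)

def prior_median(n, beta):
    """Median of the prior: n / (1 - (1/2)^(1/beta))."""
    return n / (1.0 - 0.5 ** (1.0 / beta))

def prior_mode(n, beta):
    """Mode of the prior: 0.5 * n * (beta + 1), assuming beta >= 1."""
    return 0.5 * n * (beta + 1.0)

def beta_from_median(n, median):
    """Invert the median formula to turn an elicited median into beta."""
    return math.log(0.5) / math.log(1.0 - n / median)

def beta_from_mode(n, mode):
    """Invert the mode formula to turn an elicited mode into beta."""
    return 2.0 * mode / n - 1.0

n = 500                                   # illustrative sample size
print(prior_median(n, 1.0))               # uniform on n/N: median is twice n
print(beta_from_mode(n, 1000.0))          # beta implied by an elicited mode of 1000
print(beta_from_median(n, 1500.0))        # beta implied by an elicited median of 1500
```

An elicited median or mode from field researchers can be passed to the two inversion helpers to obtain the β that encodes it.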
[Figure: two panels titled "Prior for population size", one with prior mode = 1000 and one with prior mode = 10000; truth = 1000 is marked in each. Horizontal axis: population size; vertical axis: prior density.]
[Figure: posteriorsize() population size estimates, revision 178, 200 RDS samples, prior.size.mode = truth. Circle is mode, triangle is mean. Vertical axis: population size (500 to 2500); horizontal axis: differential activity level (1, 2, 4), in two panels for homophily ratio 1 and homophily ratio 5.]
Misspecification of the degree distribution shape

- Because of differential activity, the Conway-Maxwell-Poisson model class does not cover the bimodality of the degree distribution

[Figure: "Posterior for Population Size". Horizontal axis: population size (600 to 1400); vertical axis: density. truth = 1000, mode = 763, median = 801, mean = 835.]
Misspecification of the degree distribution shape

- Because of differential activity, the Conway-Maxwell-Poisson model class does not cover the bimodality of the degree distribution
- Solution: Model the degree distributions of the diseased and the non-diseased with separate Conway-Maxwell-Poisson models.

[Figure: "Posterior for Population Size". Horizontal axis: population size (500 to 2000); vertical axis: density. truth = 1000, mode = 988, median = 993, mean = 1049.]

[Figure: "Posterior for Mean Degree: true overall mean degree is 7". Horizontal axis: degree (4 to 16); vertical axis: density. Curves: No Disease, With Disease.]

[Figure: "Posterior for s.d. degree". Horizontal axis: degree (2.0 to 4.5); vertical axis: density. Curves: No Disease, With Disease.]
[Figure: population size estimates, revision 170, 200 RDS samples, prior.size.mode = truth. Circle is mode, triangle is mean. Vertical axis: population size (500 to 3000); horizontal axis: differential activity level (1, 2, 4), in two panels for homophily ratio 1 and homophily ratio 5.]
Modeling of other characteristics of the population
- As we model many population characteristics, including the disease status, we can compute estimates of them directly
- Example: disease prevalence
[Figure: disease prevalence estimates, revision 170, 200 RDS samples, prior.size.mode = truth. Population size: red is 525, green is 715, and blue is 1000. Vertical axis: disease prevalence (0.14 to 0.26); horizontal axis: differential activity level (1, 2, 4), in two panels for homophily ratio 1 and homophily ratio 5.]
Confidence intervals, design effects and standard errors

- Using the Bayesian framework, we can naturally compute the probability intervals for the population size and other characteristics
- Examples: CI coverage for the population size and prevalence
[Figure: coverage, the proportion of samples whose 95% CI covered the true population size; revision 170, 200 RDS samples, prior.size.mode = truth. True population size: red is 525, green is 715, blue is 1000. Vertical axis ticks: 0.88 to 0.96; horizontal axis: differential activity level (1, 2, 4), in two panels for homophily ratio 1 and homophily ratio 5.]
[Figure: coverage, the proportion of samples whose 95% CI covered the true prevalence; revision 170, 200 RDS samples, prior sampling fraction = truth. True prevalence = 0.2. Vertical axis ticks: 0.85 to 1.00; horizontal axis: differential activity level (1, 2, 4), in two panels for homophily ratio 1 and homophily ratio 5.]
Application: The study of HIV/AIDS in San Francisco

- Surveillance surveys by San Francisco Department of Public Health
- Focus on African-American (AA) men-who-have-sex-with-men (MSM)
- RDS study of size n = 256 in 2009.
- Intensive study provides a population size estimate of 4439.
- Census data indicated 21518 AA men in San Francisco.
[Figure: two pairs of panels. "posterior for population size" (horizontal axis: population size, 0 to 20000; vertical axis: density) and "posterior for the number of AA MSM with HIV" (horizontal axis: HIV+ count; vertical axis: density). Each panel marks "SF mean".]
Comparison to the Gile SS estimator

- The SS estimator in Gile JASA (2011) requires N to be known
- We use the posterior mode as a plug-in estimate of N

Figure: Spread of central 95% of simulated prevalence estimates for population size 1000, with varying levels of homophily (α) and differential activity (ω). Solid lines represent prevalence estimates based on the posterior mean, dashed lines represent comparable estimates using the prior mean. Relative efficiency (MSE posterior/MSE prior) is given above each bar, and the coverage of nominal 95% confidence intervals is below each bar.