
Estimating the Size of Hidden Populations using Respondent-Driven Sampling Data

Mark S. Handcock, Department of Statistics, University of California - Los Angeles

Krista J. Gile, Department of Mathematics, University of Massachusetts - Amherst

Corinne M. Mar, Center for Studies in Demography and Ecology, University of Washington

Supported by NIH grants 1R21HD063000 and 5R21HD075714-02, NSF awards MMS-0851555, SES-1230081 and MMS-1357619, and the DoD ONR MURI award N00014-08-1-1015.

Working Papers available at http://www.stat.ucla.edu/~handcock and http://arXiv.org

UNAIDS Reference Group Consultation, UMass, June 9-10, 2014

Hard-to-Reach Population Methods Research Group

- Isabelle Beaudry, UMass
- Ian E. Fellows, Fellows Statistics
- Krista J. Gile, UMass
- Mark S. Handcock, UCLA
- Lisa G. Johnston, Tulane University, UCSF
- Corinne M. Mar, University of Washington
- http://hpmrg.org

Inferential approaches

The key is the modeling of the sampling process:

- Salganik and Heckathorn (2004): Markov chain model over classes
- Volz and Heckathorn (2008): Markov chain model over people
- Gile (2008, 2011): Adjusts for with-replacement effects - Successive Sampling (SS)
- Gile and Handcock (2008, 2011): Network model-assisted estimator, more realistic representation of RDS

Successive Sampling (SS)

Consider the following successive sampling (SS), or probability proportional to size without replacement (PPSWOR), sampling procedure:

- Begin with a population of $N$ units, denoted by indices $1, \ldots, N$, with varying sizes represented by $d_1, d_2, \ldots, d_N$.

- Let $G_1, \ldots, G_N$ be the indices of the successively sampled people.

- Sample the first unit from the full population $\{1, \ldots, N\}$ with probability proportional to size $d_i$. Assign the index of this unit to the random variable $G_1$.

- Select each subsequent unit with probability proportional to size from among the remaining units, such that

$$
P(G_i = k \mid G_1, \ldots, G_{i-1}) =
\begin{cases}
\dfrac{d_k}{\sum_{j \notin \{G_1, \ldots, G_{i-1}\}} d_j} & k \notin \{G_1, \ldots, G_{i-1}\} \\
0 & \text{else.}
\end{cases}
$$
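The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `successive_sample` and the example sizes are hypothetical, and all sizes are assumed to be strictly positive.

```python
import random

def successive_sample(sizes, n, rng=None):
    """Draw n distinct unit indices by successive sampling (PPSWOR).

    sizes: positive unit sizes d_1..d_N (an assumption of this sketch).
    Returns the indices G_1..G_n in order of selection.
    """
    rng = rng or random.Random()
    remaining = list(range(len(sizes)))
    sample = []
    for _ in range(n):
        weights = [sizes[i] for i in remaining]
        # Choose a position among the remaining units with
        # probability proportional to size.
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        sample.append(remaining.pop(pos))
    return sample

# Hypothetical example: five units with sizes 1..5, sample three of them.
example = successive_sample([1, 2, 3, 4, 5], 3, random.Random(0))
```

Each draw renormalizes over the units not yet sampled, which is exactly the denominator in the conditional probability above.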

SS for RDS

Gile (2011) argues that RDS can be approximated by Successive Sampling under a configuration model for the network:

- Node $i$ has given degree $d_i$; consider its $d_i$ edge-ends.
- Pairs of edge-ends are matched up at random.
- This is a configuration model.

- Suppose $G_1, G_2, \ldots, G_k$ are drawn by Successive Sampling according to $d$.
- Then if the network is unknown, but known to be a configuration model, tracing a link from $G_k$ will select nodes according to Successive Sampling.
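The edge-end construction can be illustrated with a small sketch. The helper names `edge_ends` and `configuration_pairs` and the degree vector are hypothetical; self-loops and multi-edges are not removed here, which is a simplification. The point it shows is that picking a uniformly random edge-end selects a node with probability proportional to its degree.

```python
import random
from collections import Counter

def edge_ends(degrees):
    """One list entry per edge-end: node i appears degrees[i] times."""
    return [i for i, d in enumerate(degrees) for _ in range(d)]

def configuration_pairs(degrees, rng):
    """Match edge-ends uniformly at random into ties.

    Assumes the total degree is even; self-loops and multi-edges
    are kept (a simplification of this sketch).
    """
    stubs = edge_ends(degrees)
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))

rng = random.Random(1)
degrees = [1, 2, 3, 4]            # hypothetical degrees
stubs = edge_ends(degrees)
n_draws = 20000
draws = Counter(rng.choice(stubs) for _ in range(n_draws))
# Empirical selection frequency of each node when following a random edge-end.
freq = {i: draws[i] / n_draws for i in range(len(degrees))}
```

Here `freq[i]` should be close to `degrees[i] / sum(degrees)`, the first-draw probability under successive sampling.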

Is there information in RDS data about population size?

Idea:

- Under Successive Sampling, "larger" units are typically sampled earlier.
- Early sample: lots of "big" units, few "small".
- Later sample: fewer "big", more "small".
- No change implies the population is not much depleted.
- Big change implies the population is very depleted.

This can be quantified to estimate N!

Note: the information about N is in the ordered sample pattern!
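A small simulation can make the depletion signal concrete. The sketch below is an assumption-laden illustration, not the authors' procedure: degrees are drawn from a hypothetical uniform distribution on 1..20, and the "signal" is simply the mean degree of the first half of the sample minus that of the second half.

```python
import random

def ss_degrees(degrees, n, rng):
    """Degrees of n units drawn by successive sampling, in selection order."""
    remaining = list(degrees)
    out = []
    for _ in range(n):
        pos = rng.choices(range(len(remaining)), weights=remaining, k=1)[0]
        out.append(remaining.pop(pos))
    return out

def early_late_gap(N, n=200, reps=20, seed=0):
    """Average of (mean degree, first half) - (mean degree, second half)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(reps):
        # Hypothetical degree distribution: uniform on 1..20.
        degrees = [rng.randint(1, 20) for _ in range(N)]
        d = ss_degrees(degrees, n, rng)
        half = n // 2
        gaps.append(sum(d[:half]) / half - sum(d[half:]) / (n - half))
    return sum(gaps) / reps

gap_small = early_late_gap(N=250)    # sample is 80% of the population
gap_large = early_late_gap(N=2000)   # sample is 10% of the population
```

With the same sample size, the drop in degree over the course of the sample is much larger when the population is small, which is the pattern the model exploits.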

Modeling the sampling process for non-ignorable sampling

- RDS is not ignorable: $P(G \mid D_{obs}, D_{unobs}) \neq P(G \mid D_{obs})$

- Information about $N$ is in the sequence of observations.
- Make inference from joint model for sample sequence and unit sizes.

Inferential Approach

Observed data:

$D_{obs}$ = Unit sizes (degrees) of observed units in order of observation

Goal: $P(N \mid D_{obs})$ (posterior distribution of $N$ given the data)

Parameters:

$N$ = Population size

$\eta$ = Parameter of distribution of unit sizes

Inferential Approach

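$$
\begin{aligned}
P(N \mid D_{obs}) &\propto \int P(D_{obs} \mid N, \eta)\, P(\eta, N)\, d\eta \\
&= \int P(D_{obs} \mid N, \eta)\, P(\eta)\, P(N)\, d\eta \qquad \text{(independent priors)} \\
&= \int (\text{likelihood})\, (\text{prior for } \eta)\, (\text{prior for } N)\, d\eta \\
&= \int P(D_{obs} \mid G, U, D, \eta)\, P(G \mid U, \eta)\, P(U \mid N, \eta)\, P(\eta)\, P(N)\, d\eta \\
&= \int P(\text{samp given degrees})\, P(\text{degrees})\, (\text{prior } \eta)\, (\text{prior } N)\, d\eta
\end{aligned}
$$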

Models for Degrees

Parametric model for the degrees:

$$d_i \overset{iid}{\sim} f(\cdot \mid \eta)$$

with support $d = 0, 1, \ldots$, and parameter $\eta$.

Models for Degrees

Extensive papers by Handcock and Jones. To specify $f(\cdot \mid \eta)$, we can consider:

1. Poisson.
2. Negative binomial. This allows Gamma over-dispersion over Poisson.
3. Yule, Waring. These allow power-law over-dispersion over Poisson.
4. Poisson-log-normal. This allows log-normal over-dispersion over Poisson. It is more over-dispersed than the Negative Binomial but less than the power-law models.
5. Conway-Maxwell-Poisson distribution. This allows both under-dispersion and over-dispersion with a single additional parameter over a Poisson.
6. Non-parametric lower tails: to allow for poor fit in the lower degrees.

These are all coded up in the CRAN degreenet package and/or the size package.
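As an illustration of how one extra parameter gives both under- and over-dispersion, here is a minimal Python sketch of the Conway-Maxwell-Poisson probabilities, $P(X = x) \propto \lambda^x / (x!)^{\nu}$. It is not the degreenet implementation; the function names and the truncation point `max_degree` are assumptions of the sketch.

```python
import math

def cmp_pmf(lam, nu, max_degree=200):
    """Conway-Maxwell-Poisson probabilities on 0..max_degree.

    The infinite support is truncated at max_degree (hypothetical choice)
    and renormalized; weights are computed on the log scale for stability.
    """
    log_w = [x * math.log(lam) - nu * math.lgamma(x + 1)
             for x in range(max_degree + 1)]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def mean_var(pmf):
    """Mean and variance of a distribution on 0, 1, 2, ..."""
    mean = sum(x * p for x, p in enumerate(pmf))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
    return mean, var

poisson_like = mean_var(cmp_pmf(3.0, 1.0))   # nu = 1 recovers the Poisson
under = mean_var(cmp_pmf(3.0, 2.0))          # nu > 1: variance below the mean
over = mean_var(cmp_pmf(3.0, 0.5))           # nu < 1: variance above the mean
```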

Prior for the degree distribution model

Each degree distribution model is parametrized by its mean and standard deviation.

$$\eta = g(\mu, \sigma)$$

$$\mu \mid \sigma \sim N(\mu_0, \sigma_0 / df_{mean}), \qquad \sigma \sim \text{Inv}\chi(\sigma_0; df_{sigma})$$

Use a diffuse default prior on the degree model parameters (equivalent sample sizes $df_{mean} = 1$ and $df_{sigma} = 5$).

Prior for Population Size N

Many possibilities:

- The data truncates the prior below the sample size.
- Uniform prior is improper.
- Natural parametric models (e.g., Negative Binomial, Poisson-log-normal, Conway-Maxwell-Poisson).
- Natural parametric models are too thin in the tails.

- Instead: specify prior knowledge about the sample proportion (i.e., $n/N$).

Prior for Population Size N

- Simple prior: uniform on $n/N$.
- Gives closed form prior on $N$ with infinite mean (median = $2n$).
- Generalize to $n/N \sim \text{Beta}(\alpha, \beta)$.

The density on $N$ (considered continuous) is:

$$\pi(N) = \beta n (N - n)^{\beta - 1} / N^{\alpha + \beta} \quad \text{for } N > n.$$

- The distribution has tail behavior $O(1/N^{\alpha+1})$.
- Elicit median or mode from field researchers and translate to $\beta$ and/or $\alpha$.

[Figure: three prior density curves; x-axis: population size (N), 500 to 3000; y-axis: density; legend: mean = 1000, median = 1000, mode = 1000.]

Figure: Three example prior distributions for the population size (N). They correspond to α = 1 and β = 1.55, 1.16 and 3.

Likelihood: Notation

Let:

- $D_{obs} = (D_1, \ldots, D_n)$ be the random ordered observed degrees (ordered for notation).
- $D_{unobs} = (D_{n+1}, \ldots, D_N)$ be the unordered random unobserved degrees.
- $d_{obs} = (d_1, \ldots, d_n)$ and $d_{unobs} = (d_{n+1}, \ldots, d_N)$ be their realized values.
- $G = (G_1, \ldots, G_n)$ be the random indices of the ordered sample and $g_{obs}$ be the observed sequence.

Likelihood

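$$
\begin{aligned}
P(D_{obs} \mid N, \eta)
&= \sum_{d} \sum_{g} p(D_{obs} = d_{obs} \mid G = g, D = d, \eta)\, p(G = g \mid D = d, \eta)\, p(D = d \mid \eta) \\
&= \frac{N!}{(N-n)!} \sum_{d_{unobs} \in DU(d_{obs})} p(G = (1, \ldots, n) \mid D = d) \prod_{j=1}^{N} f(d_j \mid \eta)
\end{aligned}
$$

where $DU(d_{obs})$ is the set of possible $d_{unobs}$ given $d_{obs}$.

$$
P(G \mid D = d, N, \eta) = \prod_{k=1}^{n} \frac{d_k}{\lambda_k}
$$

where

$$
\lambda_k = \sum_{i=k}^{N} d_i = \sum_{i=k}^{n} d_i + \sum_{i=n+1}^{N} d_i, \qquad k = 1, \ldots, n,
$$

depends on both $d_{obs}$ and $d_{unobs}$.

$$
L[N, \eta \mid D_{obs}, G] \propto \frac{N!}{(N-n)!} \sum_{d_{unobs} \in DU(d_{obs})} \prod_{k=1}^{n} \frac{d_k}{\lambda_k} \cdot \prod_{j=1}^{N} f(d_j \mid \eta)
$$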
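The ordering term $\prod_{k} d_k / \lambda_k$ is straightforward to evaluate once a candidate set of unobserved degrees is fixed. The following is a minimal sketch (the function name `log_order_prob` is hypothetical, and degrees are assumed positive); it evaluates one term of the sum over $d_{unobs}$, not the full likelihood.

```python
import math

def log_order_prob(d_obs, d_unobs):
    """log of prod_{k=1..n} d_k / lambda_k for one choice of d_unobs.

    lambda_k is the total degree of all units not sampled before draw k,
    so lambda_1 is the total degree of the whole population.
    """
    remaining = sum(d_obs) + sum(d_unobs)    # lambda_1
    logp = 0.0
    for d in d_obs:
        logp += math.log(d) - math.log(remaining)
        remaining -= d                       # lambda_{k+1} = lambda_k - d_k
    return logp

# Hypothetical example: observed degrees (3, 2) in that order, one unobserved
# unit of degree 1, so the probability is (3/6) * (2/3) = 1/3.
p_example = math.exp(log_order_prob([3, 2], [1]))
```

Because $\lambda_k$ includes the unobserved degrees, the same observed sequence receives different probability under different population sizes, which is where the information about $N$ enters.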

Inference

- Likelihood can be maximized.
- Can combine with priors to compute posterior.
- Note computational complexities based on sum over $N - n$ embedded sums over infinite spaces.

Example: N=1000, homophily=2, diff. activity=3

[Figure: three panels showing the posterior for population size; x-axis: population size (500 to 2500); y-axis: density.]

Application: Estimating the numbers of those most at risk for HIV in Cities in El Salvador

- Surveillance surveys in El Salvador
- Focus on high-risk groups: female sex workers (FSW)
- RDS study of size n = 184 in 2010.

Figure: Graphical representation of the recruitment tree for the sampling of female sex workers in Sonsonate, El Salvador in 2010. The nodes are the respondents and the wave number increases as you go down the page. The node gray scale is proportional to the network size reported by the worker, with white being degree one and black the maximum degree.

[Figure: three posterior density panels; x-axis: population size (184 to 4000); y-axis: posterior density (×10⁻⁴), 0 to 20.]

Figure: Posterior distribution for the number of female sex workers in Sonsonate based on three prior distributions for the population size: flat, matching the midpoint UNAIDS estimate, and interval-matching the UNAIDS estimate. The prior is dashed. The red mark is at the posterior median. The green mark is at the posterior mean. The blue lines are at the lower and upper bounds of the 95% highest-probability-density interval. The purple lines mark the lower and upper UNAIDS guidelines.

Simulation Study

Simulate Population

- 1000, 835, 715, 625, 555, or 525 nodes
- 20% "Infected"

Simulate Social Network (from ERGM, using statnet)

- Mean degree 7
- Homophily on infection: $\alpha = \dfrac{E(\#\text{ infected to infected ties})}{E_{R=0}(\#\text{ infected to infected ties})} = 5$ (or other)
- Differential activity: $\omega = \dfrac{\text{mean degree infected}}{\text{mean degree uninfected}} = 1$ (or other)

Simulate Respondent-Driven Sample

- 500 total samples
- 10 seeds, chosen proportional to degree
- 2 coupons each
- Coupons at random to relations
- Sample without replacement

Blue parameters varied in study.
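The sampling stage of this design can be sketched as follows. This is a simplified stand-in, not the study's code: a Bernoulli random graph replaces the ERGM fitted with statnet, infection status, homophily and differential activity are omitted, every coupon is assumed to be redeemed, and respondents recruit in the order they joined. The function names are hypothetical.

```python
import random
from collections import deque

def random_graph(n_nodes, mean_degree, rng):
    """Bernoulli random graph as adjacency lists (stand-in for the ERGM)."""
    p = mean_degree / (n_nodes - 1)
    adj = [[] for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj

def rds_sample(adj, n_target=500, n_seeds=10, coupons=2, rng=None):
    """Respondent-driven sample without replacement, in order of recruitment."""
    rng = rng or random.Random()
    degrees = [len(a) for a in adj]
    # Seeds: chosen proportional to degree, without replacement.
    candidates = [i for i, d in enumerate(degrees) if d > 0]
    seeds = []
    while len(seeds) < n_seeds:
        weights = [degrees[i] for i in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        candidates.remove(pick)
        seeds.append(pick)
    sampled = list(seeds)
    in_sample = set(seeds)
    queue = deque(seeds)
    while queue and len(sampled) < n_target:
        recruiter = queue.popleft()
        # Coupons go at random to relations not yet in the sample.
        eligible = [j for j in adj[recruiter] if j not in in_sample]
        for recruit in rng.sample(eligible, min(coupons, len(eligible))):
            if len(sampled) >= n_target:
                break
            sampled.append(recruit)
            in_sample.add(recruit)
            queue.append(recruit)
    return sampled

adj = random_graph(1000, 7, random.Random(0))
sample = rds_sample(adj, rng=random.Random(1))
sample_degrees = [len(adj[i]) for i in sample]   # ordered degrees, i.e. D_obs
```

The ordered degrees in `sample_degrees` are the kind of data the size model takes as input.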

Evaluating Performance: Frequentist properties of Bayesian method

- Point estimates: are they about right on average?
- Using the Bayesian framework, use probability intervals for the population size (Highest Posterior Density Credible Intervals, CIs).
- Compare Frequentist properties: CI width and coverage rates.
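To make the two evaluation quantities concrete, here is a minimal sketch of a highest-posterior-density interval computed from a posterior on a grid of population sizes, and of a coverage rate over repeated simulations. The function names and the toy numbers are hypothetical, and the (min, max) summary is only an interval when the posterior is unimodal.

```python
def hpd_interval(values, probs, mass=0.95):
    """Smallest set of grid points with total probability >= mass.

    Returned as (min, max) of the chosen points, which is a proper
    interval only for a unimodal posterior.
    """
    order = sorted(range(len(values)), key=lambda i: probs[i], reverse=True)
    total = 0.0
    chosen = []
    for i in order:
        chosen.append(values[i])
        total += probs[i]
        if total >= mass:
            break
    return min(chosen), max(chosen)

def coverage_rate(intervals, truth):
    """Proportion of intervals that contain the true value."""
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return hits / len(intervals)

# Toy posterior on five grid points (hypothetical numbers).
toy_interval = hpd_interval([1, 2, 3, 4, 5], [0.1, 0.2, 0.4, 0.2, 0.1], mass=0.75)
toy_coverage = coverage_rate([(1, 3), (2, 5), (4, 6)], truth=3)
```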

[Figure: y-axis: Size Estimate/Truth (0.5 to 3.5); x-axis: prior mean. Values printed on the figure:]

Prior      Prior mean   Mean estimate   Coverage (%)   HPDU ratio
low             527.5             549             94          1.1
accurate          555             568             96          1.1
high             1110             618             99          1.3

low               625             728             96          1.3
accurate          750             814             96          1.6
high             1500            1053             99          2.3

low               750             995             96          1.6
accurate         1000            1244             96          2.4
high             2000            1897            100          4.8

low              1000            1423             94          1.9
accurate         1500            1918             99          3.2
high             3000            2978            100          5.7

low              2750            3712             83          1.5
accurate         5000            5587             93          2.3
high            10000            7712            100          3.5

Figure: Spread of central 95% of simulated population size estimates (posterior means) for 5 population sizes for low, accurate, and high priors. Dots represent means. Estimates are represented as multiples of the true population size (red line at 1 indicates true population size). Numbers below the bars are coverage rates of 95% HPD intervals.

Misspecification of the Network Structure

[Figure: y-axis: Size Estimate/Truth (0.5 to 3.5); x-axis: prior mean (750, 1000, 2000) within each combination of α and ω. Values printed on the figure:]

ω      α      Prior mean   Mean estimate   Coverage (%)   HPDU ratio
0.5    1.0           750             977             94          1.6
0.5    1.0          1000            1167             97          2.1
0.5    1.0          2000            1571            100          3.1
0.5    1.8           750             974             94          1.5
0.5    1.8          1000            1167             96          2.0
0.5    1.8          2000            1783             96          2.7
1.0    1.0           750             995             96          1.6
1.0    1.0          1000            1244             96          2.4
1.0    1.0          2000            1897            100          4.8
1.0    1.8           750            1002             98          1.6
1.0    1.8          1000            1258             98          2.4
1.0    1.8          2000            1901            100          4.4
2.0    1.0           750            1004             90          1.5
2.0    1.0          1000            1144             95          1.9
2.0    1.0          2000            1324             97          2.2
2.0    1.8           750             984             82          1.4
2.0    1.8          1000            1161             86          1.6
2.0    1.8          2000            1541             92          1.9

Figure: Spread of central 95% of simulated population size estimates (posterior means) for population size 1000 for low, accurate, and high priors, with varying levels of homophily (α) and differential activity (ω).

Discussion

- It is important to estimate the networked population size.
- There is information on the population size implicit in the decreasing degrees of the sample nodes over time.
- Using the successive sampling model we can model the decrease.
- We can incorporate prior information about the population size using the Bayesian framework.
- We can incorporate other features of the population.
- We can estimate population means (e.g., prevalence and counts).
- In the Bayesian framework we can estimate uncertainty of the estimates in a natural way.

Cautions

- The difference between the model with disease and without highlights the importance of the specification of the model for the degree distribution.

- The estimates depend on the prior distribution for population size.

- The estimates are biased because the successive sampling model is not perfect, and it will be increasingly misspecified as we get further from the configuration network.

- There is another important (and influential) tuning parameter: K, the truncation value for degrees.

- This approach is promising. It is designed to be combined with data from other methods (e.g., scale-up) to provide the most accurate overall estimate.

- Fundamentally, RDS data typically does not contain much information about the population size. The Bayesian approach enables us to quantify this.

[Figure: Loess fit to the sequence of degrees (ALL); red is minimum, green is maximum, blue is best. x-axis: time sequence (0 to 500); y-axis: p(<= observed degree) (0 to 1).]

Prior for the population size

- A uniform prior on the sample proportion $n/N$ translates to a closed form for the prior on $N$, which has infinite mean.

- Generalize to $n/N \sim \text{Beta}(1, \beta)$.

The density function on $N$ (considered as a continuous variable) is:

$$f(N \mid n) = \beta n (N - n)^{\beta - 1} / N^{\beta + 1} \quad \text{for } N > n.$$

The distribution has tail behavior $\approx 1/N^2$. The mode of the prior is at $0.5\,n(\beta + 1)$ and the median is given by $n / (1 - (1/2)^{1/\beta})$. The median or mode can be elicited from field researchers and translated to $\beta$. A uniform distribution on the sample proportion corresponds to a median of twice the sample size.
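The translation from an elicited median or mode to $\beta$ follows by inverting the two formulas above. The sketch below is a hypothetical illustration of that arithmetic for the $\text{Beta}(1, \beta)$ case; the function names and the sample size n = 500 are assumptions, and the mode formula is used only for $\beta > 1$.

```python
import math

def prior_density(N, n, beta):
    """Density of N implied by n/N ~ Beta(1, beta); zero for N <= n."""
    if N <= n:
        return 0.0
    return beta * n * (N - n) ** (beta - 1) / N ** (beta + 1)

def prior_median(n, beta):
    """Median n / (1 - (1/2)^(1/beta))."""
    return n / (1.0 - 0.5 ** (1.0 / beta))

def prior_mode(n, beta):
    """Mode 0.5 * n * (beta + 1); assumes beta > 1."""
    return 0.5 * n * (beta + 1.0)

def beta_from_median(n, median):
    """Invert the median formula: beta = log(1/2) / log(1 - n/median)."""
    return math.log(0.5) / math.log(1.0 - n / median)

def beta_from_mode(n, mode):
    """Invert the mode formula: beta = 2 * mode / n - 1."""
    return 2.0 * mode / n - 1.0

n = 500                                  # hypothetical sample size
beta_uniform = beta_from_median(n, 2 * n)   # uniform on n/N: median is 2n
beta_mode = beta_from_mode(n, 1000)         # elicited mode of 1000
```

With these inputs, a median of twice the sample size gives $\beta = 1$ (the uniform case), and a mode of 1000 with n = 500 gives $\beta = 3$.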

[Figure: two panels of the prior for population size, one with prior mode = 1000 (x-axis 0 to 15000) and one with prior mode = 10000 (x-axis 0 to 50000); truth = 1000 in both. x-axis: population size; y-axis: prior density.]


[Figure: posteriorsize() population size estimates, REVISION 178, 200 RDS samples, prior.size.mode = truth; circle is mode, triangle is mean. x-axis: differential activity level (1, 2, 4) within homophily ratio 1 and homophily ratio 5; y-axis: population size (500 to 2500).]

Misspecification of the degree distribution shape

- Because of differential activity, the Conway-Maxwell-Poisson model class does not cover the bimodality of the degree distribution.

[Figure: posterior for population size; truth = 1000, mode = 763, median = 801, mean = 835. x-axis: population size (600 to 1400); y-axis: density.]

- Solution: Model the degree distributions of the diseased and the non-diseased with separate Conway-Maxwell-Poisson models.

[Figure: posterior for population size; truth = 1000, mode = 988, median = 993, mean = 1049. x-axis: population size (500 to 2000); y-axis: density.]

[Figure: posterior for mean degree (true overall mean degree is 7) and posterior for s.d. degree, each with curves for No Disease and With Disease. x-axis: degree; y-axis: density.]

[Figure: population size estimates, REVISION 170, 200 RDS samples, prior.size.mode = truth. x-axis groups: 1, 2, 4 and 1, 2, 4; y-axis: population size (500 to 3000).]

Modeling of other characteristics of the population

- As we model many population characteristics, including the disease status, we can compute estimates of them directly.

- Example: disease prevalence

[Figure: disease prevalence estimates, REVISION 170, 200 RDS samples, prior.size.mode = truth. Population size: red is 525, green is 715, and blue is 1000. x-axis groups: 1, 2, 4 and 1, 2, 4; y-axis: disease prevalence (0.14 to 0.26).]

Confidence intervals, design effects and standard errors

- Using the Bayesian framework, we can naturally compute the probability intervals for the population size and other characteristics.

- Examples: CI coverage for the population size and prevalence

[Figure: coverage, the proportion of samples whose 95% CI covered the true population size; REVISION 170, 200 RDS samples, prior.size.mode = truth. True population size: red is 525, green is 715, blue is 1000. x-axis groups: 1, 2, 4 and 1, 2, 4; y-axis: 0.88 to 0.96.]

[Figure: coverage, the proportion of samples whose 95% CI covered the true prevalence; REVISION 170, 200 RDS samples, prior sampling fraction = truth. True prevalence = 0.2. x-axis groups: 1, 2, 4 and 1, 2, 4; y-axis: 0.85 to 1.00.]

Application: The study of HIV/AIDS in San Francisco

- Surveillance surveys by San Francisco Department of Public Health

- Focus on African-American (AA) men who have sex with men (MSM)

- RDS study of size n = 256 in 2009.
- Intensive study provides a population size estimate of 4439.
- Census data indicated 21518 AA men in San Francisco.

[Figure: posterior for population size (x-axis: population size, 0 to 20000) and posterior for the number of AA MSM with HIV (x-axis: HIV+ count); y-axis: density. "SF mean" is marked on each panel. Two versions of each panel are shown.]


Comparison to the Gile SS estimator

- The SS estimator in Gile JASA (2011) requires N to be known.
- We use the posterior mode as a plug-in estimate of N.

[Figure: y-axis: prevalence (.14 to .26); x-axis: prior mean (750, 1000, 2000) within each combination of α and ω. Values printed on the figure, with the two coverage numbers given in the order printed:]

ω      α      Prior mean   MSE ratio   Coverage (%)
0.5    1.0           750        0.80   72, 88
0.5    1.0          1000        1.21   90, 90
0.5    1.0          2000        0.84   92, 92
0.5    1.8           750        0.83   76, 89
0.5    1.8          1000        1.12   94, 90
0.5    1.8          2000        1.08   99, 92
1.0    1.0           750        1.06   92, 94
1.0    1.0          1000        1.03   95, 96
1.0    1.0          2000        0.98   100, 98
1.0    1.8           750        1.04   88, 94
1.0    1.8          1000        1.02   96, 98
1.0    1.8          2000        0.99   100, 99
2.0    1.0           750        0.54   40, 80
2.0    1.0          1000        2.78   96, 79
2.0    1.0          2000        0.61   64, 78
2.0    1.8           750        0.66   52, 80
2.0    1.8          1000        2.18   97, 83
2.0    1.8          2000        0.85   89, 84

Figure: Spread of central 95% of simulated prevalence estimates for population size 1000, with varying levels of homophily (α) and differential activity (ω). Solid lines represent prevalence estimates based on the posterior mean, dashed lines represent comparable estimates using the prior mean. Relative efficiency (MSE posterior/MSE prior) is given above each bar, and the coverage of nominal 95% confidence intervals is below each bar.
