arXiv:2002.00849v2 [cs.SI] 4 Feb 2020
Using Sampled Network Data With The Autologistic Actor
Attribute Model
Alex D. Stivala ∗ H. Colin Gallagher † David A. Rolls ‡ Peng Wang †
Garry L. Robins † ‡
February 5, 2020
Abstract
Social science research increasingly benefits from statistical methods for understanding the structured nature
of social life, including for social network data. However, the application of statistical network models within
large-scale community research is hindered by too little understanding of the validity of their inferences under
realistic data collection conditions, including sampled or missing network data. The autologistic actor attribute
model (ALAAM) is a statistical model based on the well-established exponential random graph model (ERGM)
for social networks. ALAAMs can be regarded as a social influence model, predicting an individual-level outcome
based on the actor’s network ties, concurrent outcomes of his/her network partners, and attributes of the
actor and his/her network partners. In particular, an ALAAM can be used to measure contagion effects, that is,
the propensity of two actors connected by a social network tie to both have the same value of an attribute. We
investigate the effect of using simple random samples and snowball samples of network data on ALAAM parameter
inference, and find that parameter inference can still work well even with a nontrivial fraction of missing
nodes. However, it is safer to take a snowball sample of the network and estimate conditional on the snowball
sampling structure.
Keywords: Autologistic actor attribute model, ALAAM, Social influence model, Missing data, Network sampling, Snowball sampling
1 Introduction
An autologistic actor attribute model (ALAAM) (Robins et al., 2001b; Daraganova and Robins, 2013) is a statistical model
based on the widely used exponential random graph model (ERGM) for social networks (Frank and Strauss, 1986; Robins
et al., 2001a, 2007; Lusher et al., 2013). An ERGM can be used to model social networks, predicting the presence of a tie
between two actors based on other ties (structural properties of the network) and attributes of the actors (nodes) themselves.
By contrast, an ALAAM can be used to predict an attribute of an actor, based on the actor’s ties to other nodes in the network,
as well as attributes of the actor and its network partners. In this way it is similar to logistic regression, but, unlike logistic
regression or similar statistical techniques, specifically does not assume independence of the predicted attributes across actors
— an actor’s outcome attribute may depend also on those of its neighbors in the network. Hence an ALAAM may be used
as a model of social influence, examining how some attribute of an actor in a network is affected by his or her position in the
network and the attributes of other actors in the network.
Although ALAAMs are not widely cited (they are not included in Silk et al. (2017), for example) they have been used to
model the acquisition of descriptive norms as social category learning in social networks (Kashima et al., 2013), spatial and
network influence processes on unemployment (Daraganova and Pattison, 2013), and the performance of researchers (Letina,
2016; Letina et al., 2016).
As a result, ALAAMs have not been as widely studied as the better-known multiple regression quadratic assignment
procedure (MRQAP) (Krackhardt, 1988; Dekker et al., 2007) or network autocorrelation model (Ord, 1975; Cliff and Ord,
1981; Doreian, 1981; Anselin, 1990; Friedkin, 1990; Leenders, 2002). It is known that the network autocorrelation model has
a systematic negative bias in the estimation of the network effect under almost all conditions (Mizruchi and Neuman, 2008;
Neuman and Mizruchi, 2010), and Wang et al. (2014) present a detailed analysis of the statistical power and type I error rate
of the network autocorrelation model. More recently, Sewell (2017) proposes a method to use network autocorrelation with
egocentric network samples, where only ties directly involving randomly sampled actors are considered, and Dittrich et al.
(2017, 2019) propose a Bayesian approach to the network autocorrelation model.
∗Corresponding author at: Institute of Computational Science, Università della Svizzera italiana, Lugano, Switzerland. Email: alexan-[email protected]
†Centre for Transformative Innovation, Faculty of Business and Law, Swinburne University of Technology, Australia
‡Melbourne School of Psychological Sciences, The University of Melbourne, Australia
ALAAMs, being the social influence counterpart of the social selection ERGM model, share the latter’s flexibility in that a
wide variety of effects can be included in the model (subject to dependence assumptions). This makes them potentially much
larger in scope (but more complex) than models such as network autocorrelation, which only estimate a single network effect.
In this paper we investigate the effect of missing data and network sampled data on the inference of ALAAM parameters. We
examine two forms of sampling: first, data missing at random, as would occur when taking a simple random sample from the
network, and, second, network samples obtained via snowball sampling.
A practical motivation for this study is the manner in which social influence models may be applied to epidemiological
studies of health outcomes in large-scale community samples (e.g. Gibbs et al. 2013). Conventionally, these studies often use
cross-sectional data to examine the prevalence of health conditions (e.g. Dirkzwager et al. 2006) or probable mental health
conditions across a given population (e.g. Kessler et al. 2002; Bryant et al. 2014). In such settings, a methodological priority is
placed on obtaining some form of random sample: given the impracticality of interviewing an entire general population,
random sampling allows one to make valid inferences based on a representative sample of individuals drawn from that population.
However, the individualistic assumptions upon which such research designs are usually based are increasingly at odds with
conceptual frameworks of human behavior which appeal to the structured nature of social life. For instance, health researchers
and other social scientists increasingly seek to understand health and other individual outcomes as embedded within larger
social systems of interconnected actors, employing an array of socio-structural concepts, including social capital (De Silva
et al., 2005; Kawachi et al., 2013), community resilience (Norris et al., 2008), and community wellbeing (Gibbs et al., 2015).
Social network research methods offer a resolution to this tension, particularly statistical models for social networks,
illuminating the multifaceted role that network ties play in the prevalence of health issues, including processes of social
support, social selection, and social influence and diffusion (Valente, 2010). However, in many instances, a complete census
of a network may be prohibitively expensive and difficult: it may be the case that not every member of the population can be
identified, much less recruited into a study. Complicating matters is that network researchers cannot rely on random sampling
techniques, which have been designed to mitigate the very thing which is of primary interest: social interdependence among
actors (Robins, 2015). We are thus left with an important dilemma: sociocentric research methods are often difficult to execute
properly, depending on the research setting, yet the questions that these methods can address become no less important, such
as the prevalence and spread of mental health issues across a disaster-affected area. A central issue is therefore the degree to
which network data may be missing, either randomly, or by some design (i.e. sampled), and still yield valid inferences.
As a result, a considerable amount of research has been directed at identifying the impact of different forms of missing
data, or the sampling of network data, and adequately accounting for it within statistical models. Work examining the effect
of missing data on social network parameter estimation goes back at least to Holland and Leinhardt (1973). More recently,
Kossinets (2006) examines the effect of missing data on estimating network statistics of a bipartite graph. Smith and Moody
(2013) investigate the effects of nodes missing at random, and Smith et al. (2017) the effects of non-random missing data,
on network centrality, degree homophily, and topology measures. Robins et al. (2004) investigate the effect of missing data
on ERGM parameter estimation. Koskinen et al. (2013) propose a Bayesian technique to handle missing data in ERGM
parameter estimation. Previous work that has examined the effect of snowball sampling on the estimation of social network
properties includes Handcock and Gile (2010); Thompson and Frank (2000); Illenberger and Flotterod (2012); Pattison et al.
(2013). Handcock and Gile (2010), Pattison et al. (2013), and Stivala et al. (2016) describe methods to estimate ERGM
parameters from snowball samples, while Thompson and Frank (2000) describe a maximum likelihood estimator of population
network parameters given a snowball sample, and Illenberger and Flotterod (2012) describe estimators of topological network
parameters using snowball samples. Nonetheless, efforts to describe the appropriate sampling techniques for network research
are still at an early stage, and the application of sociocentric (whole-network) methods to questions of health remains limited
to relatively small, well-delineated systems in which the collection of near-complete whole-network data is feasible, such as
networks of students within a high school (e.g. Kiuru et al. (2012)).
However, research has still only partially addressed this question of missing and sampled network data. Previous work on
missing network data and the sampling of network data pertains mostly to network statistics (e.g. Borgatti et al. 2006; Smith and
Moody 2013; Smith et al. 2017), or social selection models, namely ERGM, in which one aims to predict tie formation based
on an exogenous set of actor attributes, and endogenous processes of network tie formation. By contrast, no existing work
examines the effect of missing or sampled data on its social influence counterpart, ALAAM, in which the aim is to predict
an individual outcome based on an exogenous set of network ties, an exogenous set of personal attributes, and endogenous
processes of social influence or diffusion. Thus, social influence models are advantageous in addressing the same questions
as the conventional regression models used within epidemiological studies (e.g. logistic regression), while at the same time
taking account of network interdependencies among these outcomes in a principled manner. This article therefore aims to
commence research on whether ALAAM represents a valid socio-structural approach with missing and sampled network data,
which represent a more realistic set of circumstances in many empirical settings.
Figure 1: The structural ALAAM configurations used. [Panels: Activity, Contagion. Legend: actor with attribute; actor with or without attribute.]
2 Autologistic Actor Attribute Models (ALAAMs)
ALAAMs were first described in Robins et al. (2001a), modeling the probability of attribute Y (a vector of binary attributes)
given the network X (a matrix of 0-1 tie variables). The model can be expressed in the form (Daraganova and Robins, 2013):
\[
\Pr(Y = y \mid X = x) = \frac{1}{\kappa(\theta_I)} \exp\left( \sum_I \theta_I z_I(y, x, w) \right) \tag{1}
\]
where θI is the parameter corresponding to the network-attribute statistic zI , in which the “configuration” I is defined by
a combination of dependent attribute variables y, network variables x, and actor covariates w, and κ(θI) is a normalizing
quantity which ensures a proper probability distribution.
The ALAAM predicts the outcome variable Y , taking into account network dependencies in a principled manner which
is not possible with standard logistic regression. Assumptions about which attributes Y are independent, and therefore which
configurations I are allowed in the model, determine the class of the model. In the simplest case, in which any two attribute
variables Y_i and Y_j are assumed to be independent, the only possible configuration is a single node. Then there are no network
effects, and the model reverts to standard logistic regression (Daraganova and Robins, 2013, p. 105).
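As a brief worked illustration of this reduction (a sketch only: the symbols \theta_0 for the attribute density parameter, \theta_w for a single covariate parameter, and w_i for the covariate of actor i are illustrative and not taken from the text), suppose the only statistics are \sum_i y_i and \sum_i y_i w_i. The exponent in Equation (1) is then a sum over actors, the normalizing quantity factorizes, and the distribution becomes a product of independent logistic terms:

```latex
\[
\Pr(Y = y \mid X = x)
  = \frac{\exp\bigl(\sum_i y_i(\theta_0 + \theta_w w_i)\bigr)}
         {\prod_i \bigl(1 + \exp(\theta_0 + \theta_w w_i)\bigr)}
  = \prod_i \frac{\exp\bigl(y_i(\theta_0 + \theta_w w_i)\bigr)}
                 {1 + \exp(\theta_0 + \theta_w w_i)}
\]
```

This is the likelihood of a logistic regression with intercept \theta_0 and coefficient \theta_w.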
The simplest network dependence assumption is that an attribute variable Y_i is conditionally dependent on network tie X_jk
if and only if {i} ∩ {k, j} ≠ ∅, that is, if and only if the actor i is one end of the tie X_jk. Hence the configurations I allowed in
this class of model include stars (Daraganova and Robins, 2013, p. 107), as well as the contagion effect, that is, the propensity
of two nodes with a tie between them to both have the attribute Y .
We will use this dependence assumption, and also include covariate effects, in which other attributes of an actor influence
its outcome attribute Y , just as they would in standard logistic regression if we did not have the network effects.
The ALAAM configurations we will use are:
Attribute density The number of nodes with attribute Y ;
Activity Presence of a tie at a node with attribute Y . That is, whether having attribute Y is associated with having a tie to
others;
Contagion The propensity for two nodes with a tie between them to both have attribute Y ;
Binary The propensity for a node to have attribute Y based on another binary attribute U ;
Continuous The propensity for a node to have attribute Y based on its continuous attribute V .
The structural configurations are shown in Figure 1.
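The configurations listed above can be turned into counts on a given network. The following is a minimal sketch, assuming the usual forms of these statistics for an undirected network (Activity as the sum of degrees of nodes with the attribute, Contagion as the number of ties whose two ends both have the attribute); the function name and the dictionary-based data layout are illustrative choices, not part of the software used in this paper.

```python
def alaam_statistics(adj, y, u, v):
    """Count the five ALAAM statistics on an undirected network.

    adj: dict mapping each node to the set of its neighbours
    y:   dict of binary (0/1) outcome attribute Y
    u:   dict of binary (0/1) covariate U
    v:   dict of continuous covariate V
    """
    density = sum(y[i] for i in adj)
    # Activity: ties incident to nodes that have the attribute
    activity = sum(y[i] * len(adj[i]) for i in adj)
    # Contagion: each undirected tie is visited from both ends, so halve
    contagion = sum(y[i] * y[j] for i in adj for j in adj[i]) // 2
    binary = sum(y[i] * u[i] for i in adj)
    continuous = sum(y[i] * v[i] for i in adj)
    return {
        "density": density,
        "activity": activity,
        "contagion": contagion,
        "binary": binary,
        "continuous": continuous,
    }


# Small example: a path 0-1-2 plus an isolated node 3
example_adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
example_y = {0: 1, 1: 1, 2: 0, 3: 1}
example_u = {0: 1, 1: 0, 2: 1, 3: 1}
example_v = {0: 0.5, 1: -1.0, 2: 2.0, 3: 0.25}
example_stats = alaam_statistics(example_adj, example_y, example_u, example_v)
```

In this example only the tie between nodes 0 and 1 joins two actors who both have the attribute, so it is the only tie that contributes to the Contagion count.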
By estimating the parameters θI , we can make inferences as to whether or not the corresponding effects are significant,
given the other effects included in the model. For example, if the Contagion parameter is positive and significant, it means that,
in the network in question, the number of directly connected actors that both have the outcome attribute is more than would
be expected to occur by chance, given the other effects in the model. The Binary and Continuous parameters therefore have
exactly the same meaning as they would in standard logistic regression, with the exception that, in our model which includes
network effects, their value is estimated while controlling for the structural effects.
In this paper we examine the effect of missing data and snowball samples on the validity of these inferences.
3 Snowball Sampling
Snowball sampling (Coleman, 1958; Goodman, 1961) is a method to generate a sample of nodes in a network by “link tracing”,
that is, when a node is in the sample, obtaining further nodes by “following” links (ties) from the node in question. There are
many variations and other techniques related to snowball sampling, and a useful set of papers (Goodman, 2011; Heckathorn,
2011; Handcock and Gile, 2011) clarifies the distinctions between them.
To obtain a snowball sample from a network, the first step is to take a random sample of some number of nodes, called the
seed nodes or wave 0 of the sample. Then wave 1 of the sample consists of all nodes that have a tie to a node in wave 0, but
are not themselves in wave 0. Wave l of the snowball sample consists of all nodes that have a tie to a node in wave l − 1 of
the sample, but are not themselves in waves 0, . . . , l−1. The l-wave snowball sample is then the set of all nodes thus obtained,
and the ties between them — that is, the subgraph induced by the nodes in the sample. This is illustrated in Figure 2.
We use two forms of snowball sampling in this paper. In the first form, at most a fixed number m of ties is followed from a
node to obtain more nodes in the sample. This is the “traditional” form described in Goodman (1961), and is the situation in a
survey design where respondents are asked to name up to m friends, for example. This is also known as a fixed choice design,
and has also been described as degree censoring (Kossinets, 2006) at m.
In the second form, there is no such limit m and all ties from a sampled node are followed. This is equivalent to l steps of
breadth-first search (BFS) in the network, and is therefore also known as BFS sampling, and is used for example in Newman
(2003); Lee et al. (2006); Kurant et al. (2011); Pattison et al. (2013); Stivala et al. (2016). In the graphs shown in the results in
the following sections, we denote this form with “m = Inf” since it is equivalent to sampling with an infinite value of the fixed
maximum number of ties m.
The snowball sampling parameters that govern the structure and size of a snowball sample are then: the number of waves,
the number of seed nodes, and m, the maximum number of ties to follow (when m = Inf it is BFS sampling, otherwise it is a
fixed choice snowball sampling design).
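The sampling procedure described in this section can be sketched as follows. This is an illustrative implementation, not the R scripts used for the paper; in particular, it assumes that under a fixed choice design the m ties followed from a node are chosen uniformly at random from all of that node's ties, which the text does not specify. The function and parameter names are hypothetical.

```python
import random


def snowball_sample(adj, n_seeds, n_waves, m=None, rng=None):
    """Return a dict {node: wave} for a snowball sample.

    adj:     dict mapping each node to the set of its neighbours
    n_seeds: number of seed nodes (wave 0), drawn at random
    n_waves: number of waves to follow after the seeds
    m:       maximum number of ties followed per node;
             None means all ties are followed (BFS, "m = Inf")
    """
    rng = rng or random.Random()
    seeds = rng.sample(sorted(adj), n_seeds)
    wave_of = {s: 0 for s in seeds}
    frontier = seeds
    for wave in range(1, n_waves + 1):
        next_frontier = []
        for node in frontier:
            neighbours = sorted(adj[node])
            if m is not None and len(neighbours) > m:
                # assumption: the m followed ties are a random subset
                neighbours = rng.sample(neighbours, m)
            for nb in neighbours:
                if nb not in wave_of:  # not already in an earlier wave
                    wave_of[nb] = wave
                    next_frontier.append(nb)
        frontier = next_frontier
    return wave_of


# Example: a ring of 10 nodes, one seed, two waves, no degree censoring
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
waves = snowball_sample(ring, n_seeds=1, n_waves=2, rng=random.Random(0))
```

The network used for estimation is then the subgraph induced by the keys of the returned dictionary.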
The advantage of snowball sampling in obtaining a sample retaining network structural information is illustrated by
Figures 2 and 3. Figure 2 is a two-wave snowball sample (with ten seed nodes) from the network science collaboration network
(Figure 4) (Newman, 2006). This sample includes 146 nodes (of the original 1589), i.e. approximately 9% of the nodes in
the original network. Figure 3 is the network resulting from sampling exactly the same number of nodes at random from the
original network. It is clear that this retains very little of the original network structure (most nodes are isolates, and the largest
structures are cliques of size three, i.e. triangles) compared to the snowball sample.
3.1 Conditional Estimation
ALAAM parameters are usually (when a full network is available) estimated by maximum likelihood estimation (MLE) using
a stochastic approximation algorithm (Snijders, 2002; Lusher et al., 2013). In short, this involves a Markov chain Monte Carlo
(MCMC) procedure, whereby, at each iteration, the outcome variable for a randomly chosen actor is toggled, and the resulting
changes in the sufficient statistics of the model are used to compute the acceptance probability in a Metropolis–Hastings
algorithm to simulate a sequence of simulated outcome vectors. These are then used in an iterative algorithm to estimate
ALAAM parameters by stochastic approximation.
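The simulation step of this procedure can be sketched as below. This is a simplified illustration of the Metropolis–Hastings toggle only (the stochastic approximation that updates the parameters is not shown), it is not the IPNet implementation, and it assumes the five statistics used in this paper in their usual form. The function names and the parameter ordering in `theta` are hypothetical.

```python
import math
import random


def change_statistics(adj, y, u, v, i):
    """Change in (density, activity, contagion, binary, continuous)
    when the outcome of node i goes from 0 to 1, others held fixed."""
    return (
        1.0,
        float(len(adj[i])),
        float(sum(y[j] for j in adj[i])),
        float(u[i]),
        float(v[i]),
    )


def simulate_outcomes(adj, u, v, theta, n_iter, rng=None):
    """Simulate an outcome vector by toggling one random node per iteration."""
    rng = rng or random.Random()
    nodes = sorted(adj)
    y = {i: 0 for i in nodes}
    for _ in range(n_iter):
        i = rng.choice(nodes)
        delta = sum(t * d for t, d in zip(theta, change_statistics(adj, y, u, v, i)))
        # toggling 0 -> 1 changes the log-probability by +delta, 1 -> 0 by -delta
        log_ratio = delta if y[i] == 0 else -delta
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            y[i] = 1 - y[i]
    return y
```

Because only one outcome variable changes per proposal, the acceptance probability depends only on the change statistics of the toggled node, so the normalizing quantity in Equation (1) never needs to be computed.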
When parameters are to be estimated from a snowball sample, however, a conditional estimation algorithm, as proposed
in the similar situation for ERGM estimation (Pattison et al., 2013) is used. Conditional estimation for snowball sampled data
in the context of ALAAM parameter estimation is described in Daraganova (2009); Daraganova and Pattison (2013); Kashima
et al. (2013). The essential idea is that the estimates are conditional on the fixed network and on the outcome attributes fixed
in the outermost wave (wave 2 in the two-wave snowball sample illustrated in Figure 2), as in this last wave of the snowball
sample, there is no data about their network nominations (other than those already in earlier waves), and so the outcome
variable of these nodes is made exogenous to the model. In concrete terms, this means that in the MCMC procedure, only
nodes in the innermost waves have their outcome variable toggled.
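In terms of the toggle proposal, the conditioning just described amounts to restricting which nodes may be chosen. A minimal sketch, assuming the sample is available as a hypothetical mapping `wave_of` from each sampled node to its wave number:

```python
import random


def toggle_candidates(wave_of):
    """Nodes whose outcome may be toggled: every wave except the outermost."""
    outermost = max(wave_of.values())
    return sorted(node for node, wave in wave_of.items() if wave < outermost)


def propose_toggle(wave_of, rng=None):
    """Pick the node for the next toggle proposal uniformly at random
    from the inner waves; outermost-wave outcomes stay fixed."""
    rng = rng or random.Random()
    return rng.choice(toggle_candidates(wave_of))
```

The outcomes of the outermost wave still enter the change statistics of their inner-wave neighbours, but they are never proposed for a toggle themselves.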
4 Implementation
ALAAM estimation is done using IPNet (Wang et al., 2009), with the estimations, one for each of the N_A = 100 simulated
ALAAMs, run in parallel on a Linux compute cluster using GNU parallel (Tange, 2011). Only converged parameter estimates,
defined as those with a reported t-ratio with a magnitude less than 0.1, are included. A parameter estimate is considered
significant if its magnitude is more than twice its standard error.
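The two filtering rules just stated are simple enough to write down directly. The sketch below assumes that the t-ratio, estimate, and standard error have already been read from the estimation output; the function names are illustrative.

```python
def is_converged(t_ratio, threshold=0.1):
    """Convergence rule: magnitude of the reported t-ratio below the threshold."""
    return abs(t_ratio) < threshold


def is_significant(estimate, std_error):
    """Significance rule: magnitude of the estimate more than twice its standard error."""
    return abs(estimate) > 2.0 * std_error
```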
Scripts for sampling in networks, visualization, and bootstrap error estimation are written in R (R Core Team, 2013) using
the igraph package (Csardi and Nepusz, 2006). Graphs are generated with the R ggplot2 package (Wickham, 2009); locally
weighted polynomial regression curves are generated using the loess function with default parameters.
In graphs and inference, 95% confidence intervals are used. Confidence intervals on RMSE plots are computed by the
non-parametric bias-corrected and accelerated (BCa) method (Efron, 1987; Davison and Hinkley, 1997). This method adjusts
for bias and skewness in the bootstrap distribution. The bootstrap replicates are constructed by taking random resamples of size
N_A with replacement from the N_A estimates (one for each simulated ALAAM). Bootstrap confidence intervals are estimated
with 20000 replicates using the R boot package (Davison and Hinkley, 1997).
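For readers unfamiliar with the BCa construction, the following is a compact sketch using only the Python standard library. It is not the R boot package code used for the results in this paper, it uses far fewer replicates by default, and it follows the textbook BCa recipe (bias correction from the bootstrap distribution, acceleration from jackknife values). All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def rmse(estimates, true_value):
    """Root mean square error of the estimates about the true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


def bca_interval(data, stat, n_boot=2000, alpha=0.05, rng=None):
    """BCa bootstrap confidence interval for stat(data); data is a list."""
    rng = rng or random.Random()
    n = len(data)
    norm = NormalDist()
    theta_hat = stat(data)
    # resamples of size n drawn with replacement from the n values
    boot = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(n_boot))
    # bias correction: fraction of replicates below the observed statistic
    frac = sum(b < theta_hat for b in boot) / n_boot
    frac = min(max(frac, 1.0 / n_boot), 1.0 - 1.0 / n_boot)  # keep inv_cdf finite
    z0 = norm.inv_cdf(frac)
    # acceleration from leave-one-out (jackknife) values
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = sum(jack) / n
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(p):
        # shift the nominal percentile p using z0 and the acceleration
        z = z0 + norm.inv_cdf(p)
        q = norm.cdf(z0 + z / (1.0 - accel * z))
        index = min(max(int(q * n_boot), 0), n_boot - 1)
        return boot[index]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)
```

A call such as `bca_interval(estimates, lambda xs: rmse(xs, true_value))` would give an interval for the RMSE of one parameter at one sample size. The sketch does not guard against the degenerate case where the acceleration term makes the denominator non-positive.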
The confidence interval shown for estimated type I and type II error rates is the 95% confidence interval for the binomial
proportion computed using the Wilson score interval (Wilson, 1927).
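The Wilson score interval has a closed form, shown here as a short sketch (the function name and arguments are illustrative):

```python
import math
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion successes / n."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width
```

Unlike the simple normal approximation, this interval does not collapse to zero width when the observed error rate is 0% or 100%, which matters here because several of the error rates reported below sit at or near those extremes.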
Figure 2: A two-wave snowball sample, obtaining 146 nodes from a 1589-node network. [Legend: Wave 0, 1, 2.]
Figure 3: A network sample obtained as the subgraph induced by a random sample of 146 nodes from a 1589-node network.
Figure 4: The network science collaboration network, N = 1589.
Network      N     Components  Mean    Max.    Density  Clustering   Positive outcome %
                               degree  degree           coefficient  mean   s.d.
ERGM         500   5           4.90    11      0.00983  0.10347      15     2.19
Project 90   4430  1           8.31    159     0.00188  0.34332      15     0.34
Add Health   2539  1           8.24    27      0.00324  0.14189      15     0.54
Table 1: Network statistics of the simulated (ERGM) and empirical networks.
Network      N     Density  Activity  Contagion  Binary  Continuous
ERGM         500   -7.20    0.55      1.00       1.20    1.15
Project 90   4430  -15.0    0.55      1.00       1.20    1.15
Add Health   2539  -12.5    0.55      1.00       1.20    1.15
Table 2: Parameters of the simulated ALAAMs.
5 Simulation Studies
To evaluate the behavior of ALAAM estimation with missing data, we require ALAAMs with known parameters. This is done
by simulating an ALAAM (that is, generating the outcome variable from specified ALAAM parameters) on a fixed network.
Hence we first need a network to use. We use an undirected network with 500 nodes. This network was taken as a single
sample from a sequence of simulated networks generated by PNet (Wang et al., 2009) with the ERGM edge, alternating k-
star, alternating k-triangle, and alternating 2-path parameters equal to -4.0, 0.2, 1.0, and -0.2, respectively. These parameters
are chosen to generate “reasonable” network statistics, in line with parameters estimated from empirical networks, and are the
same as those used in Pattison et al. (2013). The networks are sampled from a Markov chain Monte Carlo (MCMC) simulation,
with sufficient burn-in (of the order of 10^7 iterations) to ensure initialization effects are minimized.
As well as the simulated network, we use two empirical networks, as used in a study of respondent-driven sampling
(Goel and Salganik, 2010). The first is the “Project 90” network, a sexual contact network of high-risk heterosexuals in
Colorado Springs (Potterat et al., 2004; Woodhouse et al., 1994; Klovdahl et al., 1994; Rothenberg et al., 1995). The second
is a friendship network from the National Longitudinal Study of Adolescent Health (“Add Health”) (Harris and Udry, 2015;
Moody, 2001). As in Goel and Salganik (2010), we use only the giant components of these networks. Descriptive statistics of
the networks are shown in Table 1.
Each node in each network has a binary and a continuous attribute. The binary attribute is assigned the positive value for
50% of the nodes, chosen at random. For the continuous attribute, the attribute values v_i at the nodes i are independent and identically distributed, v_i ~ N(0,1).
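The attribute assignment can be sketched as follows. This is an illustration rather than the scripts used for the paper; it assumes that "50% of the nodes" means exactly half of them (rounded down for an odd number of nodes), and the function name is hypothetical.

```python
import random


def assign_attributes(nodes, rng=None):
    """Assign a binary attribute to a random half of the nodes and an
    independent standard normal continuous attribute to every node."""
    rng = rng or random.Random()
    nodes = list(nodes)
    positive = set(rng.sample(nodes, len(nodes) // 2))
    u = {i: int(i in positive) for i in nodes}
    v = {i: rng.gauss(0.0, 1.0) for i in nodes}
    return u, v
```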
The simulated ALAAMs are generated with the parameters shown in Table 2. These parameters were chosen so that
approximately 15% of nodes have a positive outcome variable. For each set of parameters in this table, a set of 100 outcomes
is sampled from the distribution with those parameters using IPNet (Wang et al., 2009). The outcomes are sampled from
a MCMC distribution with sufficient burn-in (10^6 iterations for the 500 node simulated network, 10^7 iterations for the Add
Health network, and 10^8 iterations for the Project 90 network) to ensure initialization effects are minimized, and the samples
are taken far enough apart (10^5 iterations for the simulated network and 10^7 iterations for the larger empirical networks) to
ensure that they are essentially independent. The proportion of nodes with a positive outcome variable is shown in Table 1 —
it is close to 15% on average. This proportion was selected to correspond roughly to rates of mental health conditions within a
disaster-affected population (e.g. Bryant et al. 2014).
Figure 5 shows the simulated 500 node network with nodes colored according to the outcome variable for one sample from
the ALAAM simulations.
5.1 Simulation Study 1: Random Sampling of Nodes
In the first study we investigate the effect of sampling nodes independently at random from the network, which we refer to as
random node sampling, or “simple random sampling”. The network then used for estimation is the subgraph of the original
graph induced by the selected (not omitted) set of nodes. In other words, when a node is not included, all edges connected to
that node are also not in the resulting network. We estimate the ALAAM parameters from this sampled subnetwork, comparing
the results to the estimation of the full network.
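A minimal sketch of this sampling scheme, assuming the network is stored as a dictionary from each node to the set of its neighbours (the function name is illustrative):

```python
import random


def random_node_sample(adj, n_sample, rng=None):
    """Subgraph induced by n_sample nodes drawn uniformly without replacement.

    Ties to nodes outside the sample are dropped along with those nodes.
    """
    rng = rng or random.Random()
    keep = set(rng.sample(sorted(adj), n_sample))
    return {i: adj[i] & keep for i in keep}
```

Nodes whose neighbours all fall outside the sample become isolates in the induced subgraph, which is why small random samples of a sparse network retain little structure.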
Figure 6 shows the effect of random node sampling on the root mean square error (RMSE) in ALAAM parameter estimation.
The RMSE is the square root of the mean squared difference between the estimate and the true value. (Note that in this
figure the RMSE values are on different scales for each parameter.) The RMSE decreases steadily as the sample size
increases, although for the Contagion parameter there appears to be a larger jump from 100 to 200 nodes.
Figure 7 shows the effect of random node sampling on the type II error rate, that is, the false negative rate, in ALAAM
parameter inference. This is the percentage of experiments (over the 100 simulated ALAAMs) in which the estimate has the
wrong sign or the confidence interval covers zero (so we cannot reject the null hypothesis that the parameter for the effect is
zero). With the exception of the Activity parameter, the type II error rate does not increase greatly even when only 300 nodes
(60% of the total) are sampled. Hence even if we can only sample (at random) 60% of the network, ALAAM inference still
has good power on effects other than Activity. Even this low power relative to the other effects could be because the Activity
parameter in our simulations has a small magnitude compared to the other effects.
Figure 5: The simulated network N = 500 with nodes shaped according to the binary attribute and node size
proportional to the continuous attribute. Nodes with the positive outcome variable are colored light blue/green
(16.2% of the nodes) and the others dark blue. [Legend: Outcome 0, 1; Binary attribute 0, 1; Continuous attribute −2.7 to 2.9.]
Figure 6: Effect on root mean square error of random node sampling. Simulated ERGM network N = 500. [Panels: Density, Activity, Contagion, Binary, Continuous; x-axis: Number of nodes in sample; y-axis: RMSE.]
Figure 7: Effect on type II error of network random node sampling. Simulated ERGM network N = 500. [Panels: Density, Activity, Contagion, Binary, Continuous; x-axis: Number of nodes in sample; y-axis: Type II error rate %.]
Figure 8: Effect on type II error rate of random node sampling. Project 90 network N = 4430. [Panels: Density, Activity, Contagion, Binary, Continuous; x-axis: Number of nodes in sample; y-axis: Type II error rate %.]
Figure 8 (Project 90 network) and Figure 9 (Add Health network) show the effect of increasing the random node
sample size on the type II error rate. In both networks, Density and Contagion have very low type II error rates over
the whole range of sample sizes, while the rates for Binary and Continuous decrease slowly. The behavior of the Activity
parameter, however, is quite different. For the Project 90 network (Figure 8) the type II error rate is very low for the entire
range of sample sizes, while for the Add Health network (Figure 9) the error rate is very high for less than 1000 nodes, and
rather abruptly becomes very low for larger sample sizes. This in turn is different from the behavior on the simulated network
where the type II error rate on the Activity parameter declines more smoothly with increased sample size (Figure 7).
In order to measure the type I error rate in inference (false positive rate), for an effect, we require simulated ALAAMs
which do not have that effect present (its parameter is zero). So for each of our ALAAM effects (except Density, which if zero
results in almost all nodes having a positive outcome variable), we simulate another set of 100 ALAAM outcomes in which
the corresponding parameter is set to zero. This allows us to test for false positives with respect to this zero effect, that is, the
percentage of experiments in which, for an estimate of a zero effect parameter, the confidence interval does not include zero.
The other parameters are retained at their values shown in Table 2, except in the case of the Activity effect: setting it
to zero results in fewer than 1% of the nodes having a positive outcome variable, and so the Density parameter was increased
to -4.0 for the simulated ERGM network, -7.0 for the Project 90 network, and -6.0 for the Add Health network, to obtain a
reasonable number of nodes with a positive outcome variable.
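The two error rates defined in this section can be computed from the converged estimates as sketched below. The sketch assumes that the confidence interval is taken as the estimate plus or minus two standard errors, matching the significance rule stated in Section 4; the function names and the list-of-pairs input format are illustrative.

```python
def rejects_zero(estimate, std_error):
    """True if the interval estimate +/- 2 standard errors excludes zero."""
    return abs(estimate) > 2.0 * std_error


def type_ii_error_rate(results, true_value):
    """Percentage of (estimate, std_error) pairs that miss a nonzero true effect:
    the interval covers zero, or the estimate has the wrong sign."""
    misses = sum(
        1
        for est, se in results
        if not rejects_zero(est, se) or (est > 0) != (true_value > 0)
    )
    return 100.0 * misses / len(results)


def type_i_error_rate(results):
    """Percentage of (estimate, std_error) pairs, from models simulated with the
    effect set to zero, whose interval nevertheless excludes zero."""
    false_positives = sum(1 for est, se in results if rejects_zero(est, se))
    return 100.0 * false_positives / len(results)
```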
The type I error rates obtained in this way are shown in Figure 10 for the simulated network, which shows that even when
the sample size is only 100 nodes (20% of the nodes), the type I error rate does not increase significantly on any of the ALAAM
effects tested. It is notable that the type I error rates for contagion and the other attribute predictors hardly change as the sample
becomes a smaller proportion of the network, illustrating that, at least for this example, the inference of a “significant” effect
is robust for a sample that is only 20% of the entire network.
Only 1.7% of ALAAM estimations just described do not converge, consisting almost entirely of those with very small
sample sizes.
Figure 9: Effect on type II error rate of random node sampling. Add Health network N = 2539. [Panels: Density, Activity, Contagion, Binary, Continuous; x-axis: Number of nodes in sample; y-axis: Type II error rate %.]
When using the empirical networks, however, the situation is not as encouraging. Figure 11 shows the results for the Project
90 network, and Figure 12 for the Add Health network. For effects other than Activity, the type I error rates are sometimes
higher than desirable (particularly for Contagion in the Project 90 network), but never much more than 25%. However, for the
Activity parameter, the type I error rate can be over 50% (for the Add Health network), and the relationship between sample
size and type I error rate is noticeably non-monotonic, particularly for the Add Health network.
For both the Project 90 and Add Health networks, fewer than 1% of the ALAAM estimations just described fail to converge.
Hence it seems that selecting nodes at random from an empirical network (which may have, for example, a highly skewed
degree distribution), rather than following a network sampling scheme such as snowball sampling, risks an unacceptably
high type I error rate in some parameters. Because this rate is not monotonic in sample size, it is also particularly
difficult to estimate or predict.
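Simple random node sampling of the kind examined in this study can be sketched as follows, assuming that the sample is the subgraph induced by the selected nodes and that the network is held as a dictionary mapping each node to its set of neighbours. The function name and data layout are hypothetical, chosen only for illustration.

```python
import random
from typing import Dict, Hashable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]


def random_node_sample(
    graph: Graph, n: int, rng: Optional[random.Random] = None
) -> Graph:
    """Return the subgraph induced by n nodes chosen uniformly at random."""
    if rng is None:
        rng = random.Random()
    kept = set(rng.sample(list(graph), n))
    # Keep only ties whose two endpoints are both in the sample.
    return {node: graph[node] & kept for node in kept}
```

In a network with a highly skewed degree distribution, whether or not a few high-degree nodes happen to be selected can change the retained ties considerably from one sample to the next.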
5.2 Simulation Study 2: Snowball Sampling
The second study investigates the effect on ALAAM estimation of network samples obtained via snowball sampling. Snowball
samples are obtained using 1, 2, or 3 waves. For the 2- and 3-wave snowball samples, the number of seed nodes is varied from
1 to 20. For the 1-wave snowball sample, the number of seed nodes is varied from 1 to 100: with only a single
wave, far fewer nodes are obtained in the sample, so a larger number of seed nodes may be required. For each of these
conditions, we investigate the effect of a fixed-choice snowball sample, in which only up to m ties are followed (that is, degree
censoring at m), with m = 3 or m = 5, as well as the case where there is no degree censoring (m = Inf).
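The snowball sampling design described above can be sketched in a few lines. The following is an illustrative sketch only, not the implementation used in this study: the function name and the dictionary-of-sets network representation are assumptions, seeds are assumed to be chosen uniformly at random, and `m=None` stands for no degree censoring (m = Inf). Under fixed choice, the (up to) m ties followed from each node are chosen at random. The wave in which each node is first reached is recorded, since that is the structure on which conditional estimation depends.

```python
import random
from typing import Dict, Hashable, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]


def snowball_sample(
    graph: Graph,
    num_seeds: int,
    num_waves: int,
    m: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> Dict[Hashable, int]:
    """Return {node: wave} for a snowball sample; seeds are wave 0."""
    if rng is None:
        rng = random.Random()
    seeds = rng.sample(list(graph), num_seeds)
    wave_of = {seed: 0 for seed in seeds}
    frontier = list(seeds)
    for wave in range(1, num_waves + 1):
        next_frontier = []
        for node in frontier:
            neighbours = list(graph[node])
            # Fixed choice: follow at most m ties, chosen at random.
            if m is not None and len(neighbours) > m:
                neighbours = rng.sample(neighbours, m)
            for nb in neighbours:
                if nb not in wave_of:  # first time this node is reached
                    wave_of[nb] = wave
                    next_frontier.append(nb)
        frontier = next_frontier
    return wave_of
```

For example, `snowball_sample(graph, num_seeds=10, num_waves=3)` corresponds to a three-wave sample from 10 seeds with all ties followed, while passing `m=3` caps the ties followed from each node at three.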
Figures 13, 14 and 15 show the size of the snowball samples obtained with different snowball sampling parameters for the
simulated ERGM network, the Project 90 network, and the Add Health network, respectively. Clearly, the sample size grows
faster in the number of seed nodes when more waves are used. Not using degree censoring also results in larger sample sizes,
and in larger variance in sample size.
Figure 16 shows the type II error rate, on network samples from the simulated ERGM network, obtained by snowball
sampling with different numbers of waves and seeds, and with and without fixed choice (at two different values) limiting the
maximum number of ties followed in the snowball sampling. This shows that using more waves in the snowball sampling
gives higher power (lower type II error rate). This is particularly noticeable for the Activity parameter, which has extremely
low power when using one or two waves, but can achieve the same power as using the full network with 13 seeds (when not
using fixed choice) when using three waves.
[Figure 10 panels: Activity, Contagion, Binary, Continuous (no Density parameter). x-axis: Number of nodes in sample; y-axis: Type I error rate %.]
Figure 10: Effect on type I error rate of random node sampling. Simulated ERGM network N = 500.
[Figure 11 panels: Activity, Contagion, Binary, Continuous (no Density parameter). x-axis: Number of nodes in sample; y-axis: Type I error rate %.]
Figure 11: Effect on type I error rate of random node sampling. Project 90 network N = 4430.
[Figure 12 panels: Activity, Contagion, Binary, Continuous (no Density parameter). x-axis: Number of nodes in sample; y-axis: Type I error rate %.]
Figure 12: Effect on type I error rate of random node sampling. Add Health network N = 2539.
[Figure 13 panels: Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Nodes in sample; legend: m = 3, 5, Inf.]
Figure 13: Size of the snowball sample with 1, 2, or 3 waves and degree censoring at m = 3, m = 5, or no degree
censoring (m = Inf). Simulated ERGM network N = 500.
[Figure 14 panels: Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Nodes in sample; legend: m = 3, 5, Inf.]
Figure 14: Size of the snowball sample with 1, 2, or 3 waves and degree censoring at m = 3, m = 5, or no degree
censoring (m = Inf). Project 90 network N = 4430.
[Figure 15 panels: Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Nodes in sample; legend: m = 3, 5, Inf.]
Figure 15: Size of the snowball sample with 1, 2, or 3 waves and degree censoring at m = 3, m = 5, or no degree
censoring (m = Inf). Add Health network N = 2539.
[Figure 16 panels: columns Density, Activity, Contagion, Binary, Continuous; rows Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Type II error rate %; legend: m = 3, 5, Inf.]
Figure 16: Effect on type II error rate of conditional estimation with snowball sampling with 1, 2, or 3 waves and
degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Simulated ERGM network N = 500. The
solid horizontal line is the baseline error rate, with the dashed lines showing its 95% confidence interval.
[Figure 17 panels: columns Density, Activity, Contagion, Binary, Continuous; rows Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Type II error rate %; legend: m = 3, 5, Inf.]
Figure 17: Effect on type II error rate of conditional estimation with snowball sampling with 1, 2, or 3 waves and
degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Project 90 network (N = 4430). The solid
horizontal line is the baseline error rate, with the dashed lines showing its 95% confidence interval.
It is clear that, for a given number of waves and seeds, there is higher power with a higher value of m (the number of links
to follow) in the fixed choice design, and higher still when not using a fixed choice design (that is, following all links in the
snowball sampling). For example, when using three waves, the Contagion parameter has the same power as the baseline (using
the entire network) with 10 or more seeds when not using fixed choice (m = Inf), but never achieves this same power even with
20 seeds using fixed choice with m = 3.
The results are qualitatively similar for the empirical networks, Project 90 (Figure 17) and Add Health (Figure 18). The
exceptions are that the advantage of not using degree censoring is even greater for the empirical networks than in the simulated
network (especially on the Activity, Binary, and Continuous parameters), and that, for the Activity parameter on the Add Health
network in particular, using a fixed choice design with m = 3 results in very low power even for 3 waves and 20 seeds.
To address the question of whether these results are simply due to the different sample sizes generated by the snowball
sampling parameters (waves, seeds, fixed choice size m), or if the sample structure might be relevant, Figure 19 shows the
type II error rate as a function of sample size, for both snowball sampling and simple random sampling (that is, the results
from simulation study 1, described in the previous section). These graphs show that, for a given snowball sample size, there is
(mostly) no significant difference in the type II error rate for the different snowball sampling parameters, except in some cases
where fixed choice m = 3 has significantly different power from the others (e.g. on Activity and Binary with 3 waves). Snowball
sampling has a lower type II error rate than random sampling for the Activity parameter, and a similar type II error rate for
the Contagion parameter, when two or three waves are used. However, for the Binary and Continuous parameters, snowball
sampling has a significantly higher type II error rate than random sampling over almost the entire range of sample sizes.
Figures 20 and 21 show the corresponding results for the Project 90 and Add Health networks, respectively. In general,
snowball sampling has a similar or lower type II error rate than random node sampling for the same sample size. Exceptions
to this are for very small sample sizes, and for the Binary parameter in the Add Health network.
Figure 22 shows that, in the simulated ERGM network, when using three waves, except for very small numbers of seeds,
the type I error rate is not significantly different from the baseline (using the entire network) on any of the parameters. However,
when using fewer than three waves, the type I error rate can be far higher.
Figures 23 and 24 show the corresponding results for the Project 90 and Add Health networks, respectively. As with
the results for the simulated ERGM network, using three waves results in an acceptable type I error rate, whereas using fewer
waves frequently does not. A very noticeable exception is in the Contagion parameter in the Project 90 network (Figure 23)
[Figure 18 panels: columns Density, Activity, Contagion, Binary, Continuous; rows Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Type II error rate %; legend: m = 3, 5, Inf.]
Figure 18: Effect on type II error rate of conditional estimation with snowball sampling with 1, 2, or 3 waves and
degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Add Health network (N = 2539). The solid
horizontal line is the baseline error rate, with the dashed lines showing its 95% confidence interval.
[Figure 19 panels: columns Number of waves 1, 2, 3; rows Density, Activity, Contagion, Binary, Continuous. x-axis: Number of nodes in sample; y-axis: Type II error rate %; legend: m = 3, 5, Inf, Simple random sampling.]
Figure 19: Type II error rate of conditional estimation plotted against sample size for snowball sampling with 1, 2,
or 3 waves and degree censoring at m = 3, m = 5, or no degree censoring (m = Inf), as well as for simple random
sampling. For snowball sampling, the number of seeds is varied from 1 to 100 for 1 wave, and 1 to 20 for 2 and
3 waves, and only the locally weighted polynomial regression (loess in R) curve is shown, to make the figure
clearer. Simulated ERGM network N = 500. The solid horizontal line is the baseline error rate, with the dashed
lines showing its 95% confidence interval.
[Figure 20 panels: columns Number of waves 1, 2, 3; rows Density, Activity, Contagion, Binary, Continuous. x-axis: Number of nodes in sample; y-axis: Type II error rate %; legend: m = 3, 5, Inf, Simple random sampling.]
Figure 20: Type II error rate of conditional estimation plotted against sample size for snowball sampling with 1, 2,
or 3 waves and degree censoring at m = 3, m = 5, or no degree censoring (m = Inf), as well as for simple random
sampling. For snowball sampling, the number of seeds is varied from 1 to 100 for 1 wave, and 1 to 20 for 2 and
3 waves, and only the locally weighted polynomial regression (loess in R) curve is shown, to make the figure
clearer. Project 90 network (N = 4430). The solid horizontal line is the baseline error rate, with the dashed lines
showing its 95% confidence interval.
[Figure 21 panels: columns Number of waves 1, 2, 3; rows Density, Activity, Contagion, Binary, Continuous. x-axis: Number of nodes in sample; y-axis: Type II error rate %; legend: m = 3, 5, Inf, Simple random sampling.]
Figure 21: Type II error rate of conditional estimation plotted against sample size for snowball sampling with 1, 2,
or 3 waves and degree censoring at m = 3, m = 5, or no degree censoring (m = Inf), as well as for simple random
sampling. For snowball sampling, the number of seeds is varied from 1 to 100 for 1 wave, and 1 to 20 for 2 and
3 waves, and only the locally weighted polynomial regression (loess in R) curve is shown, to make the figure
clearer. Add Health network (N = 2539). The solid horizontal line is the baseline error rate, with the dashed lines
showing its 95% confidence interval.
[Figure 22 panels: columns Activity, Contagion, Binary, Continuous; rows Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Type I error rate %; legend: m = 3, 5, Inf.]
Figure 22: Effect on type I error rate of conditional estimation of snowball sampling with 1, 2, or 3 waves and
degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Simulated ERGM network N = 500. The
solid horizontal line is the baseline error rate, with the dashed lines showing its 95% confidence interval.
[Figure 23 panels: columns Activity, Contagion, Binary, Continuous; rows Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Type I error rate %; legend: m = 3, 5, Inf.]
Figure 23: Effect on type I error rate of conditional estimation of snowball sampling with 1, 2, or 3 waves and
degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Project 90 network (N = 4430). The solid
horizontal line is the baseline error rate, with the dashed lines showing its 95% confidence interval.
where the type I error rate increases very rapidly to very high values when degree censoring is used with three waves;
however, when degree censoring is not used, the type I error rate is not significantly different from the baseline (the rate when
the full network is used).
Figures 25, 26, and 27 show the number of converged estimates for the simulated ERGM network, the Project 90 network,
and the Add Health network, respectively. Note that, unlike simulation study 1, where nodes were included in the network
sample at random, the network sample is now a snowball sample, and conditional estimation is used to estimate ALAAM
parameters conditional on the snowball sampling structure. As a result, it can be more difficult to obtain a converged
estimate, especially when fewer waves are used. In fact, when using fewer than three waves, and particularly
when using only one wave, the percentage of converged estimates can be very low, unless a very large number of seeds is used.
When using three waves, however, most estimations converge even for a relatively small number of seeds.
With this in mind, the type I error rate results (Figures 22, 23, and 24) are perhaps not as bad as they first appear.
Specifically, it would appear that in situations where the type I error rate is likely to be unacceptably high, it is quite likely that
conditional estimation will not converge. Hence, rather than risking false positive conclusions, it is likely that in fact no
estimate could be obtained at all. (Whether this is actually a preferable situation is perhaps arguable.)
6 Conclusion
ALAAM parameter inference can work well even with a relatively small sample taken from the original network, when the
sample is obtained by snowball sampling and the estimation is conditional on this sampling structure.
Relatively small amounts of data missing at random from a network may not have an adverse effect on ALAAM parameter
inference; however, the amount of missing data, and the magnitude of its effect on error rates, may be difficult to estimate. The
main recommendation from this study is to use a snowball sampling scheme to obtain a structured network sample, rather
than risk a possibly unknown amount of missing data. Further, snowball sampling should use (at least) three waves,
and degree censoring should not be used (i.e. all links should be followed, rather than capping at an arbitrary number).
The main limitations of this study are that only three networks (one simulated from an ERGM, and two empirical) were
used, and that the ALAAM is relatively simple (it includes “contagion”, probably the most important and frequently
[Figure 24 panels: columns Activity, Contagion, Binary, Continuous; rows Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Type I error rate %; legend: m = 3, 5, Inf.]
Figure 24: Effect on type I error rate of conditional estimation of snowball sampling with 1, 2, or 3 waves and
degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Add Health network (N = 2539). The solid
horizontal line is the baseline error rate, with the dashed lines showing its 95% confidence interval.
[Figure 25 panels: Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Number of converged estimations; legend: m = 3, 5, Inf.]
Figure 25: Number of converged estimates (out of 100) when using conditional estimation with snowball sampling
with 1, 2, or 3 waves and degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Simulated ERGM
network N = 500.
[Figure 26 panels: Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Number of converged estimations; legend: m = 3, 5, Inf.]
Figure 26: Number of converged estimates (out of 100) when using conditional estimation with snowball sampling
with 1, 2, or 3 waves and degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Project 90 network
(N = 4430).
[Figure 27 panels: Number of waves 1, 2, 3. x-axis: Number of seeds; y-axis: Number of converged estimations; legend: m = 3, 5, Inf.]
Figure 27: Number of converged estimates (out of 100) when using conditional estimation with snowball sampling
with 1, 2, or 3 waves and degree censoring at m = 3, m = 5, or no degree censoring (m = Inf). Add Health network
(N = 2539).
used parameter for an ALAAM, but does not include any triangular configurations, for example). Hence we cannot generalize
with any confidence about performance on other networks, or more complex models, although the fact that the results are
similar on the simulated and empirical networks of different sizes might be an encouraging sign that the results will at least
be applicable to networks “like” those tested. The methods described here could be used on the actual network collected in
empirical studies as a sensitivity analysis of ALAAM parameter inference.
Another major limitation of this work is that, although two of the networks used are empirical, in all cases the
attributes (nodal covariates) were simulated, and the binary outcome variable was itself simulated from an ALAAM. Hence,
in some sense this is the easiest case for estimation of an ALAAM, as it is known that the outcome was generated from an
ALAAM. Further work could validate ALAAM estimation and conditional ALAAM estimation from snowball samples
of networks with empirical nodal covariates and outcome variables. Such data is potentially available, for example, from the
Project 90 network.
We have assumed, in the fixed choice snowball sampling design, that the (up to) m links followed from each node are
chosen at random. In practice, however, when conducting network data collection and asking respondents to name up to m
friends, for example, the named contacts are subject to cognitive biases (Freeman et al., 1987) and hence are not random. This
could potentially lead to biases in the sample that are not considered in this study.
This study is an early step in understanding how ALAAMs and sociocentric network designs may be conducted effectively
within general community settings. In such settings, missing data, and/or the necessity to obtain data on only a sample, not
the full network, are inevitable, but the research questions are often of paramount importance. An interesting possibility raised
by these findings is whether long-established random sampling approaches could be complemented with snowball and referral
techniques for estimating network-based determinants for individual outcomes. A wider program of research is merited to
determine this conclusively.
Acknowledgments
This research was supported by a Victorian Life Sciences Computation Initiative (VLSCI) grant number VR0261 on its Peak
Computing Facility at the University of Melbourne, an initiative of the Victorian Government, Australia. We also used the
University of Melbourne ITS High Performance Computing facilities.
This research uses data from Add Health, a program project directed by Kathleen Mullan Harris and designed by J.
Richard Udry, Peter S. Bearman, and Kathleen Mullan Harris at the University of North Carolina at Chapel Hill, and funded
by grant P01-HD31921 from the Eunice Kennedy Shriver National Institute of Child Health and Human Development, with
cooperative funding from 23 other federal agencies and foundations. Special acknowledgment is due Ronald R. Rindfuss and
Barbara Entwisle for assistance in the original design. Information on how to obtain the Add Health data files is available on
the Add Health website (http://www.cpc.unc.edu/addhealth). No direct support was received from grant P01-HD31921
for this analysis.
References
L. Anselin. Some robust approaches to testing and estimation in spatial econometrics. Regional Science and Urban Economics,
20(2):141–163, 1990.
S. P. Borgatti, K. M. Carley, and D. Krackhardt. On the robustness of centrality measures under conditions of imperfect data.
Social Networks, 28(2):124–136, 2006.
R. A. Bryant, E. Waters, L. Gibbs, H. C. Gallagher, P. Pattison, D. Lusher, C. MacDougall, L. Harms, K. Block, E. Snowdon,
V. Sinnott, G. Ireton, J. Richardson, and D. Forbes. Psychological outcomes following the Victorian Black Saturday
bushfires. Australian and New Zealand Journal of Psychiatry, 48(7):634–643, 2014.
A. D. Cliff and J. K. Ord. Spatial processes: models & applications. Taylor & Francis, 1981.
J. S. Coleman. Relational analysis: the study of social organizations with survey methods. Human Organization, 17(4):28–36,
1958.
G. Csardi and T. Nepusz. The igraph software package for complex network research. InterJournal Complex Systems, 1695,
2006. URL http://igraph.sf.net.
G. Daraganova. Statistical models for social networks and network-mediated social influence processes: Theory and
application. PhD thesis, The University of Melbourne, 2009.
G. Daraganova and P. Pattison. Autologistic actor attribute model analysis of unemployment: dual importance of who you
know and where you live. In D. Lusher, J. Koskinen, and G. Robins, editors, Exponential Random Graph Models for Social
Networks, chapter 18, pages 237–247. Cambridge University Press, New York, 2013.
G. Daraganova and G. Robins. Autologistic actor attribute models. In D. Lusher, J. Koskinen, and G. Robins, editors,
Exponential Random Graph Models for Social Networks, chapter 9, pages 102–114. Cambridge University Press, New
York, 2013.
A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application. Cambridge University Press, Cambridge, 1997.
M. J. De Silva, K. McKenzie, T. Harpham, and S. R. Huttly. Social capital and mental illness: a systematic review. Journal of
Epidemiology and Community Health, 59(8):619–627, 2005.
D. Dekker, D. Krackhardt, and T. A. Snijders. Sensitivity of MRQAP tests to collinearity and autocorrelation conditions.
Psychometrika, 72(4):563–581, 2007.
A. J. Dirkzwager, L. Grievink, P. G. Van der Velden, and C. J. Yzermans. Risk factors for psychological and physical health
problems after a man-made disaster. The British Journal of Psychiatry, 189(2):144–149, 2006.
D. Dittrich, R. T. A. Leenders, and J. Mulder. Bayesian estimation of the network autocorrelation model. Social Networks, 48:
213–236, 2017.
D. Dittrich, R. T. A. Leenders, and J. Mulder. Network autocorrelation modeling: A Bayes factor approach for testing (multiple)
precise and interval hypotheses. Sociological Methods & Research, 48(3):642–676, 2019.
P. Doreian. Estimating linear models with spatially distributed data. Sociological Methodology, 12:359–388, 1981.
B. Efron. Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):171–185, 1987.
O. Frank and D. Strauss. Markov graphs. Journal of the American Statistical Association, 81(395):832–842, 1986.
L. C. Freeman, A. K. Romney, and S. C. Freeman. Cognitive structure and informant accuracy. American Anthropologist, 89
(2):310–325, 1987.
N. E. Friedkin. Social networks in structural equation models. Social Psychology Quarterly, pages 316–328, 1990.
L. Gibbs, E. Waters, R. A. Bryant, P. Pattison, D. Lusher, L. Harms, J. Richardson, C. MacDougall, K. Block, E. Snowdon,
et al. Beyond bushfires: Community, resilience and recovery — a longitudinal mixed method study of the medium to long
term impacts of bushfires on mental health and social connectedness. BMC Public Health, 13(1):1036, 2013.
L. Gibbs, S. Howell-Meurs, K. Block, D. Lusher, J. Richardson, C. MacDougall, E. Waters, L. Harms, et al. Community
wellbeing: applications for a disaster context. Australian Journal of Emergency Management, 30:20–24, 2015.
S. Goel and M. J. Salganik. Assessing respondent-driven sampling. Proceedings of the National Academy of Sciences of the
USA, 107(15):6743–6747, 2010.
L. A. Goodman. Snowball sampling. The Annals of Mathematical Statistics, pages 148–170, 1961.
L. A. Goodman. Comment: On respondent-driven sampling and snowball sampling in hard-to-reach populations and snowball
sampling not in hard-to-reach populations. Sociological Methodology, 41(1):347–353, 2011.
M. S. Handcock and K. J. Gile. Modeling social networks from sampled data. The Annals of Applied Statistics, 4(1):5–25,
2010.
M. S. Handcock and K. J. Gile. Comment: On the concept of snowball sampling. Sociological Methodology, 41(1):367–371,
2011.
K. M. Harris and R. J. Udry. National Longitudinal Study of Adolescent to Adult Health (Add Health) Wave I, 1994-1995,
2015. URL https://doi.org/10.15139/S3/11900.
D. D. Heckathorn. Comment: Snowball versus respondent-driven sampling. Sociological Methodology, 41(1):355–366, 2011.
P. W. Holland and S. Leinhardt. The structural implications of measurement error in sociometry. Journal of Mathematical
Sociology, 3(1):85–111, 1973.
J. Illenberger and G. Flotterod. Estimating network properties from snowball sampled data. Social Networks, 34(4):701–711,
2012.
Y. Kashima, S. Wilson, D. Lusher, L. J. Pearson, and C. Pearson. The acquisition of perceived descriptive norms as social
category learning in social networks. Social Networks, 35(4):711–719, 2013.
I. Kawachi, S. Takao, and S. V. Subramanian. Global perspectives on social capital and health. Springer, New York, 2013.
R. C. Kessler, G. Andrews, L. J. Colpe, D. K. Mroczek, S.-L. T. Normand, E. E. Walters, and A. M. Zaslavsky. Short screening
scales to monitor population prevalences and trends in non-specific psychological distress. Psychological Medicine, 32:
959–976, 2002.
N. Kiuru, W. J. Burk, B. Laursen, J.-E. Nurmi, and K. Salmela-Aro. Is depression contagious? A test of alternative peer
socialization mechanisms of depressive symptoms in adolescent peer networks. Journal of Adolescent Health, 50(3):
250–255, 2012.
A. S. Klovdahl, J. J. Potterat, D. E. Woodhouse, J. B. Muth, S. Q. Muth, and W. W. Darrow. Social networks and infectious
disease: The Colorado Springs study. Social Science & Medicine, 38(1):79–88, 1994.
J. H. Koskinen, G. L. Robins, P. Wang, and P. E. Pattison. Bayesian analysis for partially observed network data, missing ties,
attributes and actors. Social Networks, 35(4):514–527, 2013.
G. Kossinets. Effects of missing data in social networks. Social Networks, 28(3):247–268, 2006.
D. Krackhardt. Predicting with networks: Nonparametric multiple regression analysis of dyadic data. Social Networks, 10(4):
359–381, 1988.
M. Kurant, A. Markopoulou, and P. Thiran. Towards unbiased BFS sampling. IEEE Journal on Selected Areas in
Communications, 29(9):1799–1809, 2011.
S. H. Lee, P.-J. Kim, and H. Jeong. Statistical properties of sampled networks. Physical Review E, 73:016102, 2006.
R. T. A. Leenders. Modeling social influence through network autocorrelation: constructing the weight matrix. Social Networks,
24(1):21–47, 2002.
S. Letina. Network and actor attribute effects on the performance of researchers in two fields of social science in a small
peripheral community. Journal of Informetrics, 10(2):571–595, 2016.
S. Letina, G. Robins, and D. Maslic Sersic. Reaching out from a small scientific community: the social influence models of
collaboration across national and disciplinary boundaries for scientists in three fields of social sciences. Revija za sociologiju,
46(2):103–139, 2016.
D. Lusher, J. Koskinen, and G. Robins, editors. Exponential Random Graph Models for Social Networks. Structural Analysis
in the Social Sciences. Cambridge University Press, New York, 2013.
M. S. Mizruchi and E. J. Neuman. The effect of density on the level of bias in the network autocorrelation model. Social
Networks, 30(3):190–200, 2008.
J. Moody. Peer influence groups: Identifying dense clusters in large networks. Social Networks, 23(4):261–283, 2001.
E. J. Neuman and M. S. Mizruchi. Structure and bias in the network autocorrelation model. Social Networks, 32(4):290–300,
2010.
M. E. J. Newman. Ego-centered networks and the ripple effect. Social Networks, 25(1):83–95, 2003.
M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):
036104, 2006.
F. H. Norris, S. P. Stevens, B. Pfefferbaum, K. F. Wyche, and R. L. Pfefferbaum. Community resilience as a metaphor, theory,
set of capacities, and strategy for disaster readiness. American Journal of Community Psychology, 41(1-2):127–150, 2008.
K. Ord. Estimation methods for models of spatial interaction. Journal of the American Statistical Association, 70(349):
120–126, 1975.
P. E. Pattison, G. L. Robins, T. A. B. Snijders, and P. Wang. Conditional estimation of exponential random graph models from
snowball sampling designs. Journal of Mathematical Psychology, 57(6):284–296, 2013.
J. Potterat, D. E. Woodhouse, S. Q. Muth, R. B. Rothenberg, W. W. Darrow, A. S. Klovdahl, J. B. Muth, et al. Network
dynamism: history and lessons of the Colorado Springs study. In Network epidemiology: A handbook for survey design and
data collection. Oxford University Press, 2004.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna,
Austria, 2013. URL http://www.R-project.org.
G. Robins. Doing Social Network Research: Network-based Research Design for Social Scientists. Sage, London, 2015.
G. Robins, P. Elliott, and P. Pattison. Network models for social selection processes. Social Networks, 23(1):1–30, 2001a.
G. Robins, P. Pattison, and P. Elliott. Network models for social influence processes. Psychometrika, 66(2):161–189, 2001b.
G. Robins, P. Pattison, and J. Woolcock. Missing data in networks: exponential random graph (p∗) models for networks with
non-respondents. Social Networks, 26(3):257–283, 2004.
G. Robins, T. Snijders, P. Wang, M. Handcock, and P. Pattison. Recent developments in exponential random graph (p∗) models
for social networks. Social Networks, 29(2):192–215, 2007.
R. B. Rothenberg, D. E. Woodhouse, J. J. Potterat, S. Q. Muth, W. W. Darrow, and A. S. Klovdahl. Social networks in disease
transmission: the Colorado Springs study. NIDA research monograph, 151:3–19, 1995.
D. K. Sewell. Network autocorrelation models with egocentric data. Social Networks, 49:113–123, 2017.
M. J. Silk, D. P. Croft, R. J. Delahay, D. J. Hodgson, N. Weber, M. Boots, and R. A. McDonald. The application of statistical
network models in disease research. Methods in Ecology and Evolution, 8(9):1026–1041, 2017.
J. A. Smith and J. Moody. Structural effects of network sampling coverage I: Nodes missing at random. Social Networks, 35
(4):652–668, 2013.
J. A. Smith, J. Moody, and J. H. Morgan. Network sampling coverage II: The effect of non-random missing data on network
measurement. Social Networks, 48:78–99, 2017.
T. A. B. Snijders. Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 3
(2):1–40, 2002.
A. D. Stivala, J. H. Koskinen, D. A. Rolls, P. Wang, and G. L. Robins. Snowball sampling for estimating exponential random
graph models for large networks. Social Networks, 47:167–188, 2016.
O. Tange. GNU Parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. URL
http://www.gnu.org/s/parallel.
S. K. Thompson and O. Frank. Model-based estimation with link-tracing sampling designs. Survey Methodology, 26(1):87–98,
2000.
T. W. Valente. Social networks and health: Models, methods, and applications. Oxford University Press, New York, 2010.
P. Wang, G. Robins, and P. Pattison. PNet: program for the simulation and estimation of exponential random graph (p∗)
models. Department of Psychology, The University of Melbourne, 2009.
W. Wang, E. J. Neuman, and D. A. Newman. Statistical power of the social network autocorrelation model. Social Networks,
38:88–99, 2014.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer, New York, 2009. ISBN 978-0-387-98140-6. URL
http://had.co.nz/ggplot2/book.
E. B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical
Association, 22(158):209–212, 1927.
D. E. Woodhouse, R. B. Rothenberg, J. J. Potterat, W. W. Darrow, S. Q. Muth, A. S. Klovdahl, H. P. Zimmerman, H. L. Rogers,
T. S. Maldonado, J. B. Muth, et al. Mapping a social network of heterosexuals at high risk for HIV infection. AIDS, 8(9):
1331–1336, 1994.