Conditional Copula Inference and Efficient Approximate MCMC

by

Evgeny Levi

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Statistical Sciences
University of Toronto

© Copyright 2019 by Evgeny Levi
where $f_j$, $F_j$ are the density and, respectively, the cdf of $Y_j$, and ω denotes all the parameters and latent variables in the joint and marginal models. The copula density function is denoted by c and it depends on X through the unknown function $\theta(X) = g^{-1}(\eta(X))$.
Depending on the strength of assumptions we are willing to make about η(X), a number of
possible approaches are available. The most direct is to assume a known parametric form for
the calibration function, e.g. constant or linear, and estimate the corresponding parameters
by maximum likelihood estimation [37]. This approach relies on knowledge about the shape
of the calibration function which, in practice, can be unrealistic. A more flexible approach
uses non-parametric methods [3, 90] and estimates the calibration function using smoothing
methods. Recently, we have seen a number of developments using nonparametric Bayesian
techniques for estimating a multivariate copula using an infinite mixture of Gaussian copulas
[95], or via flexible Dirichlet process priors [96, 67]. The infinite mixture approach in [95]
was extended to estimate any conditional copula with a univariate covariate by [22], while
an alternative Bayesian approach based on a flexible cubic spline model for the calibration
functions was built by [20]. For multivariate covariates, [80], [15] and [49] avoid the curse
of dimensionality that appears even for moderate values of q, say q ≥ 5, by specifying
an additive model structure for the calibration function. Few alternatives to the additive
structure exist. One exception is [42] who used a sparse Gaussian Process (GP) prior for
estimating the calibration function and subsequently used the same construction for vine
copula estimation in [58]. However, when the dimension of the predictor space is even moderately large, the curse of dimensionality prevails, and it is expected that the q-dimensional GP used for calibration estimation will not capture important patterns for sample sizes that are not very large. Moreover, the full efficiency of the method proposed in [42] is difficult to assess since their model is built with uniform marginals, which in a general setup is
equivalent to assuming exact knowledge about the marginal distributions. In fact, when
the marginal distributions are estimated it is of paramount importance to account for the
resulting variance inflation due to error propagation in the copula estimation as reflected
by equations (1.5)-(1.8). The Bayesian model in which joint and marginal components are
simultaneously considered will appropriately handle error propagation as long as it is possible
to study the full posterior distribution of all the parameters in the model, be they involved
in the marginals or copula specification.
1.2 Brief review of Markov Chain Monte Carlo (MCMC)
Since we implement MCMC algorithms for both parts of this thesis, a brief introduction is
given here.
We start by assuming that we need to sample from some distribution with density π(x), where x ∈ 𝒳 ⊆ R^q. Moreover, we do not have access to the closed form of this density; only an unnormalized version π̃(x) can be evaluated, where
$$\pi(x) = \frac{\tilde{\pi}(x)}{C},$$
and C is some unknown normalization constant. This situation is typical in Bayesian statistics, where the posterior distribution of parameters θ given observed data y is
$$\pi(\theta \mid y) = \frac{p(\theta) f(y \mid \theta)}{\int p(\theta) f(y \mid \theta)\, d\theta};$$
here p(θ) is the prior and f(y | θ) is the model density. In this setting π̃(θ) = p(θ)f(y | θ) can be easily computed, while the normalization constant is generally not known. In these problems, where the posterior cannot be found in closed form, the objective is to draw θ^{(t)}, t = 1, . . . , M, from the posterior and then use these draws to approximate quantities of interest. By the Strong Law of Large Numbers,
$$\frac{1}{M}\sum_{t=1}^{M} g(\theta^{(t)}) \xrightarrow{a.s.} E[g(\theta)], \qquad (1.10)$$
for any measurable function g(·). When the dimension q is moderate or large it becomes increasingly difficult to draw independent samples from π; therefore, MCMC algorithms aim to simulate dependent samples by constructing a Markov chain with stationary distribution π, see [19, 14].
First we define a Markov Chain:
Definition 1.2.1. Given a state space and Borel σ-field (𝒳, ℱ), a stochastic process X^{(0)}, X^{(1)}, . . . , X^{(M)}, . . . is a Markov chain if
$$P(X^{(t)} \in A \mid X^{(0)}, \ldots, X^{(t-1)}) = P(X^{(t)} \in A \mid X^{(t-1)}) \quad \text{for all } A \in \mathcal{F}.$$
We can think about this process as ordered in time and future values depend only on
the present and not on the past. In MCMC theory we usually assume that these conditional
probabilities are homogeneous:
Definition 1.2.2. A Markov chain is homogeneous if P(X^{(t)} ∈ A | X^{(t−1)}) is the same for all t = 1, 2, . . ..
So the conditional probability does not change with time t. Therefore the joint distribution of the random vector (X^{(0)}, . . . , X^{(M)}) is
$$P(x^{(0)}, \ldots, x^{(M)}) = P(x^{(0)}) \times P(x^{(1)} \mid x^{(0)}) \times \cdots \times P(x^{(M)} \mid x^{(M-1)}),$$
so it is fully specified by the initial distribution P(x^{(0)}) and the transition probability (or kernel), which we denote by P(x, dy), so that $P(X^{(1)} \in A \mid X^{(0)} = x) = \int_A P(x, dy)$. It is also convenient to define $P^M(x, \cdot) = P(X^{(M)} \in \cdot \mid X^{(0)} = x)$, the conditional distribution of the Markov chain after M steps.
A very important concept in MCMC is the stationary (or invariant) distribution of a Markov chain, which is defined as:

Definition 1.2.3. π(x) defined on 𝒳 is called invariant for a Markov chain on (𝒳, ℱ) if $\int_{\mathcal{X}} \pi(dx) P(x, A) = \pi(A)$ for all A ∈ ℱ.

This just means that if, for example, X^{(0)} has distribution π(x), then X^{(1)} also has the same distribution, and similarly for all other random variables. Note that if we set X^{(0)} ∼ π(x) (where π(x) is invariant) then the homogeneous Markov chain becomes strongly stationary.
In MCMC we actually face the reverse problem: we are given π(x), from which we want to sample, and the first step is to construct a Markov chain (with an appropriate transition kernel) for which π(x) is invariant. Once we have it, if we start the chain by sampling from π(x), then by stationarity all X^{(1)}, X^{(2)}, . . . will follow the same target distribution and the goal is achieved. However, if we could simulate X^{(0)} from the target distribution, we would not need the Markov chain at all, so the main question is: if we sample from a different distribution at step 0, will the distribution of X^{(M)} converge to π(x) as M increases? Under several assumptions this convergence does hold. First we define the total variation distance between two measures ν_1 and ν_2 as
$$\|\nu_1(\cdot) - \nu_2(\cdot)\|_{TV} = \sup_{A} |\nu_1(A) - \nu_2(A)|.$$
The main convergence result is:
Theorem 1.2.1. If a Markov chain defined on (𝒳, ℱ) with transition kernel P(x, ·) and invariant measure π(·) is φ-irreducible and aperiodic, then for π-almost every x ∈ 𝒳 we have
$$\lim_{M \to \infty} \|P^M(x, \cdot) - \pi(\cdot)\|_{TV} = 0.$$
The importance of this theorem is that we can start a Markov chain by setting x^{(0)} to almost any value, run the chain (for which π(x) is stationary), and after a large number of iterations expect samples from the target distribution. The result depends on two assumptions that are usually satisfied in practice; see [64] for details:
Definition 1.2.4. A Markov chain is φ-irreducible if there exists a non-zero σ-finite measure φ on 𝒳 such that for all A ∈ ℱ with φ(A) > 0 and for all x ∈ 𝒳, there exists a positive integer M such that P^M(x, A) > 0.
Definition 1.2.5. A Markov chain with invariant distribution π(x) is aperiodic if there do not exist disjoint subsets A_1, . . . , A_d ∈ ℱ, d ≥ 2, with P(x, A_{i+1}) = 1 for all x ∈ A_i (1 ≤ i ≤ d − 1) and P(x, A_1) = 1 for all x ∈ A_d.
1.2.1 Metropolis-Hastings algorithm
Given the unnormalized density π̃(x) of the target distribution π(x), to apply MCMC theory we need to find a conditional distribution P(x, ·) such that the target is invariant. Metropolis-Hastings [63, 41] is probably the most frequently used algorithm to construct such transition distributions. The main idea, at iteration t, is to sample a proposal x^* from some distribution q(· | X^{(t−1)} = x) which can depend on the previous state, calculate an appropriate acceptance probability α(x, x^*), and then accept x^* with this probability; see Algorithm 1.
Algorithm 1 Metropolis-Hastings
1: Given initial x^{(0)} and required number of samples M.
2: for t = 1, . . . , M do
3:   Set x = x^{(t−1)}.
4:   Simulate x^* ∼ q(· | x), where q(· | x) is some density.
5:   Calculate $\alpha(x, x^*) = \min\left(1, \frac{\tilde{\pi}(x^*)\, q(x \mid x^*)}{\tilde{\pi}(x)\, q(x^* \mid x)}\right)$.
6:   Simulate u ∼ U(0, 1).
7:   if u ≤ α(x, x^*) then
8:     Accept: X^{(t)} = x^*.
9:   else
10:    Reject: X^{(t)} = x.
11:  end if
12: end for
Notice that α(x, x^*) depends only on the ratio between π̃(x^*) and π̃(x); therefore this algorithm can be implemented when the normalization constant C is unknown.
It is easy to see that the transition kernel for this algorithm is
$$P(x, dx^*) = \alpha(x, x^*)\, q(x^* \mid x)\, dx^* + r(x)\, \delta_x(dx^*),$$
where $r(x) = 1 - \int \alpha(x, x^*)\, q(x^* \mid x)\, dx^*$ and δ_x(·) is the point mass at x. It can be shown that this transition probability preserves the target distribution π(x).
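To make the mechanics of Algorithm 1 concrete, the following is a minimal Python sketch with a symmetric Gaussian random-walk proposal (so the q-ratio cancels); the function and variable names are illustrative conventions of ours, not the thesis code.

import numpy as np

def metropolis_hastings(log_pi_tilde, x0, M, step=0.5, rng=None):
    """Random-walk Metropolis sampler for an unnormalized log-density.
    log_pi_tilde returns log pi~(x) up to the unknown constant log C,
    which cancels in the acceptance ratio, as in Algorithm 1."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    chain = np.empty((M, x.size))
    for t in range(M):
        # symmetric proposal: q(x*|x) = q(x|x*), so the q-ratio cancels
        x_star = x + step * rng.standard_normal(x.size)
        log_alpha = min(0.0, log_pi_tilde(x_star) - log_pi_tilde(x))
        if np.log(rng.uniform()) <= log_alpha:
            x = x_star          # accept the proposal
        chain[t] = x            # on rejection the current state is repeated
    return chain

# Example: sample a standard bivariate normal; C is never needed.
draws = metropolis_hastings(lambda x: -0.5 * x @ x, np.zeros(2), M=5000)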
1.3 Bayesian Inference and Gaussian Processes
Assume we observe n independent realizations, y1, . . . , yn, of a random variable Y ∈ R
and that each observation yi corresponds to a covariate measurement xi ∈ Rq. Henceforth,
we assume that x1, . . . , xn are fixed by design. The distribution of Yi has a known form
and depends on xi through some unknown function f and parameter σ so that the joint
distribution of the data is
$$P(y \mid x_1, \ldots, x_n, \sigma) = P(y \mid f(x_1), \ldots, f(x_n), \sigma) = \prod_{i=1}^{n} P(y_i \mid f(x_i), \sigma). \qquad (1.11)$$
Usually, the main inferential goal is to estimate the unknown smooth function f : Rq → R,
while σ is a nuisance parameter. If we let x = (x1, . . . , xn)T denote the n covariate values,
then a Gaussian Process (GP) prior on the function f implies
f = (f(x1), f(x2), . . . , f(xn))T ∼ N (0, K(x,x; w)), (1.12)
where N (µ,Σ) denotes a normal distribution with mean µ and variance matrix Σ and K is a
variance matrix which depends on x and additional parameters w. Here we use the squared
exponential kernel to model the matrix K(x,x; w), i.e. its (i, j) element is
$$k(x_i, x_j; w) = e^{w_0} \exp\left[-\sum_{s=1}^{q} \frac{(x_{is} - x_{js})^2}{e^{w_s}}\right], \qquad (1.13)$$
where $x_{is}$ is the sth coordinate of the ith covariate measurement $x_i$. The unknown parameters $w = (w_0, \ldots, w_q)$ that determine the strength of dependence in (1.13) are inferred from
the data. Of interest is predicting the values of the nonlinear predictor at new observations $x^* = (x_1^*, \ldots, x_m^*)^T$, which we denote $f^* = (f(x_1^*), \ldots, f(x_m^*))^T$. When the covariate dimension q is moderately large, accurate estimation of $f^*$ will require a large
sample size, n. Unfortunately, this desideratum is hindered by the computational cost of fitting a GP model when n is large. For example, if $Y_i \sim \mathcal{N}(f(x_i), \sigma^2)$ then equations (1.12) and (1.11) yield a joint Gaussian distribution of Y = (Y_1, . . . , Y_n) and f^*. If y = (y_1, . . . , y_n) denotes the observed response, then the conditional distribution of f^* | Y = y is N(µ^*, Σ^*), where
$$\mu^* = K(x^*, x; w)\left[K(x, x; w) + \sigma^2 I_n\right]^{-1} y, \qquad (1.14)$$
$$\Sigma^* = K(x^*, x^*; w) - K(x^*, x; w)\left[K(x, x; w) + \sigma^2 I_n\right]^{-1} K(x, x^*; w), \qquad (1.15)$$
and K(x^*, x^*; w), K(x^*, x; w) and K(x, x^*; w) have their elements defined using (1.13).
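As a concrete illustration of (1.13)-(1.15), here is a minimal NumPy sketch of the kernel matrix and the Gaussian posterior predictive; the array shapes and function names are our own assumptions, not the thesis implementation.

import numpy as np

def se_kernel(X1, X2, w):
    """Squared exponential kernel (1.13): e^{w0} exp(-sum_s (x_is - x_js)^2 / e^{w_s}).
    X1: (n, q) array, X2: (m, q) array, w: (q+1,) vector (w0, w1, ..., wq)."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2 / np.exp(w[1:])
    return np.exp(w[0]) * np.exp(-d2.sum(axis=2))

def gp_posterior(X, y, X_star, w, sigma2):
    """Posterior mean and covariance of f* as in (1.14)-(1.15)."""
    Kxx = se_kernel(X, X, w) + sigma2 * np.eye(len(X))
    Ksx = se_kernel(X_star, X, w)
    mu_star = Ksx @ np.linalg.solve(Kxx, y)                    # (1.14)
    Sigma_star = se_kernel(X_star, X_star, w) - Ksx @ np.linalg.solve(Kxx, Ksx.T)  # (1.15)
    return mu_star, Sigma_star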
With the Gaussian sampling model it is clear from (1.14) and (1.15) that the MCMC sam-
pling of the posterior requires at each iteration the calculation and inversion of the matrix
$K(x, x; w) + \sigma^2 I_n \in \mathbb{R}^{n \times n}$, which becomes prohibitive when n is large. To make GP models
applicable for larger data we refer to the literature on sparse GP [74, 87, 66] in which it is
assumed that learning about f can be achieved using a smaller sample of m latent variables,
called inducing variables, which may be a subsample of the original data or can be built
using other considerations as further discussed. The intuitive idea is to use the inducing
variables to channel the information contained in the covariate values x = {x1, . . . , xn}. We
denote the inducing inputs as $\bar{x} = (\bar{x}_1, \ldots, \bar{x}_m)^T \in \mathbb{R}^{m \times q}$ and by $K(x, \bar{x}; w) \in \mathbb{R}^{n \times m}$ the matrix
$$K(x, \bar{x}; w) = \begin{pmatrix} k(x_1, \bar{x}_1; w) & \cdots & k(x_1, \bar{x}_m; w) \\ \vdots & \ddots & \vdots \\ k(x_n, \bar{x}_1; w) & \cdots & k(x_n, \bar{x}_m; w) \end{pmatrix}, \qquad (1.16)$$
where $k(x_i, \bar{x}_j; w)$ is defined as in (1.13). The ratio m/n influences the trade-off between computational efficiency and statistical efficiency, as a smaller m will favour the former and a larger m will ensure no significant loss of the latter. If the function values at the inducing points are defined as $\bar{f} = (f(\bar{x}_1), \ldots, f(\bar{x}_m))^T$, then the joint density of the response vector Y, the latent variable $\bar{f}$ and the parameter w can be expressed only in terms of the m-dimensional vector $\bar{f}$, since
$$P(y, \bar{f}, w \mid x, \bar{x}) = P(y \mid A(x, \bar{x}; w)\bar{f})\, \mathcal{N}(\bar{f}; 0, K(\bar{x}, \bar{x}; w))\, p(w), \qquad (1.17)$$
where N(x; µ, Σ) is the normal density with mean µ and covariance Σ, p(w) is the prior probability for the parameters w, and
$$A(x, \bar{x}; w) = K(x, \bar{x}; w)\, K(\bar{x}, \bar{x}; w)^{-1}. \qquad (1.18)$$
The form of $P(y \mid A(x, \bar{x}; w)\bar{f})$ is derived under the assumption that $f = A(x, \bar{x}; w)\bar{f}$ and depends on the form of the sampling model P(y | f, σ); e.g., when the latter is $\mathcal{N}(f, \sigma^2 I_n)$ we obtain $P(y \mid A(x, \bar{x}; w)\bar{f}) = \mathcal{N}(A(x, \bar{x}; w)\bar{f}, \sigma^2 I_n)$.
The posterior distribution $\pi(\bar{f}, w \mid y, x)$ is not tractable, but sampling from it will be much less expensive since $K(x, \bar{x}; w) \in \mathbb{R}^{n \times m}$ and $K(\bar{x}, \bar{x}; w) \in \mathbb{R}^{m \times m}$. While the inducing inputs $\bar{x}$ can be selected from the samples collected, we will use an alternative approach in which we group the observed covariate values x into m clusters and choose the cluster-specific covariate averages as $\bar{x}_1, \ldots, \bar{x}_m$. For instance, given a specific value k, one can use a simple k-means algorithm [12] to classify x into k clusters and estimate the clusters' means using an iterative method. Intuitively, it makes sense to have more inducing points in regions that exhibit more variation in covariate values.
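A minimal sketch of this inducing-point construction, assuming scikit-learn's KMeans is acceptable for the clustering step (the helper name is ours):

import numpy as np
from sklearn.cluster import KMeans

def inducing_inputs(X, m, seed=0):
    """Choose m inducing inputs as k-means cluster centres of the covariates.
    X: (n, q) array of observed covariates; returns an (m, q) array x_bar."""
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_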
Given a new test point x^*, we are interested in the corresponding posterior predictive distribution of f^* = f(x^*):
$$P(f^* \mid x^*, x, y) = \int P(f^* \mid \bar{f}, w, x^*, \bar{x})\, P(\bar{f}, w \mid x, y)\, d\bar{f}\, dw. \qquad (1.19)$$
In general, the integral involved in (1.19) cannot be calculated in closed form, but we can use posterior draws $(\bar{f}, w)^{(t)}$, t = 1, . . . , M, given x, y to approximate the distribution of f^* | x^*, x, y by the samples
$$(f^*)^{(t)} = A(x^*, \bar{x}; w^{(t)})\, \bar{f}^{(t)}, \quad t = 1, \ldots, M.$$
Statistical inference can be built on these samples.
Finally, in order to reduce the dimensionality of the parameter space, we assume that
$$f(x_i) = f(x_i^T \beta), \qquad (1.20)$$
and we set $\bar{f} = (f(z_1), \ldots, f(z_m))^T$, where $(z_1, \ldots, z_m)$ are inducing inputs in R, f : R → R is an unknown function of interest, and β ∈ R^q is normalized, i.e. ‖β‖ = 1. Note that without normalization the parameter β is not identifiable. Here $\{z_1, \ldots, z_m\}$ play the same role as $\{\bar{x}_1, \ldots, \bar{x}_m\}$ in the general sparse GP. They help sample the posterior latent variables much faster and should be spread over the range of $\{x_1^T\beta, \ldots, x_n^T\beta\}$. In the next chapter we show how to choose the positions of these inducing inputs. The single index
model (SIM) defined by (1.20), coupled with the sparse GP approach (henceforth denoted GP-SIM), has the advantage that it casts the original problem of estimating a general function f in q dimensions based on n observations into the estimation of a q-dimensional parameter vector β and of a one-dimensional map f based on m ≪ n inducing points. The GP-SIM approach was successfully implemented for mean regression problems [17, 39] and quantile regression [44]. It can be used for large covariate dimensions and is much more flexible than a simple linear model.
1.4 Model Selection
The conditional copula model involves two types of selection. First, one needs to choose the copula family from a set of possible candidates. Second, it is often of interest to determine whether a simple parametric form for the calibration is supported by the data. For instance, a constant calibration function indicates that the dependence structure does not vary with the covariates, a conclusion that may be of scientific interest in some applications. Let ω^{(t)} denote the vector of parameters and latent variables drawn at step t from the posterior corresponding to model M. We consider two measures of fit that can be estimated from the MCMC samples ω^{(t)}, t = 1, . . . , M. As mentioned before, the observed data set is denoted by $D = \{y_{1i}, y_{2i}, x_i\}_{i=1}^{n}$.
Cross-Validated Pseudo Marginal Likelihood
The cross-validated pseudo marginal likelihood (CVML) [33, 40] calculates the average (over
parameter values) prediction power for model M via
$$\text{CVML}(\mathcal{M}) = \sum_{i=1}^{n} \log\left(P(y_{1i}, y_{2i} \mid D_{-i}, \mathcal{M})\right), \qquad (1.21)$$
where D−i is the data set from which the ith observation has been removed. An estimate of
(1.21) can be obtained using posterior draws for all the parameters and latent variables in
the model. Specifically, if the latter are denoted by ω, then
$$E\left[P(y_{1i}, y_{2i} \mid \omega, \mathcal{M})^{-1}\right] = P(y_{1i}, y_{2i} \mid D_{-i}, \mathcal{M})^{-1}, \qquad (1.22)$$
where the expectation is with respect to the conditional distribution of ω given the full data D and the model M. Based on the posterior samples we can estimate the CVML as
$$\text{CVML}_{est}(\mathcal{M}) = -\sum_{i=1}^{n} \log\left(\frac{1}{M}\sum_{t=1}^{M} P(y_{1i}, y_{2i} \mid \omega^{(t)}, \mathcal{M})^{-1}\right). \qquad (1.23)$$
The model with the largest CVML is selected.
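For concreteness, here is a small Python sketch of the estimator (1.22)-(1.23), assuming the pointwise likelihoods at each posterior draw have been stored in an (M, n) array; the helper name and array layout are our conventions.

import numpy as np
from scipy.special import logsumexp

def cvml_estimate(loglik):
    """Estimate CVML as in (1.23) from an (M, n) array of pointwise
    log-likelihoods, loglik[t, i] = log P(y_1i, y_2i | omega^(t), model).
    The i-th term is -log((1/M) sum_t exp(-loglik[t, i])), computed
    stably in log space via the identity (1.22)."""
    M = loglik.shape[0]
    # log of the average inverse likelihood over posterior draws, per observation
    log_inv_pred = logsumexp(-loglik, axis=0) - np.log(M)
    return -log_inv_pred.sum()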
Watanabe-Akaike Information Criterion
The Watanabe-Akaike Information Criterion [WAIC, 91] is an information-based criterion
that is closely related to the CVML, as discussed in [92], [35] and [89].
The WAIC is defined as
$$\text{WAIC}(\mathcal{M}) = -2\,\text{fit}(\mathcal{M}) + 2\,p(\mathcal{M}), \qquad (1.24)$$
where the model fitness is
$$\text{fit}(\mathcal{M}) = \sum_{i=1}^{n} \log E\left[P(y_{1i}, y_{2i} \mid \omega, \mathcal{M})\right] \qquad (1.25)$$
and the penalty is
$$p(\mathcal{M}) = \sum_{i=1}^{n} \mathrm{Var}\left[\log P(y_{1i}, y_{2i} \mid \omega, \mathcal{M})\right]. \qquad (1.26)$$
The expectation in (1.25) and the variance in (1.26) are with respect to the conditional distribution of ω given the data and can be computed using Monte Carlo samples from π. For instance, the Monte Carlo estimate of the fit is
$$\widehat{\text{fit}}(\mathcal{M}) = \sum_{i=1}^{n} \log\left(\frac{1}{M}\sum_{t=1}^{M} P(y_{1i}, y_{2i} \mid \omega^{(t)}, \mathcal{M})\right), \qquad (1.27)$$
and p(M) can be estimated similarly using the posterior samples. The model with the smallest WAIC is preferred.
1.5 Simplifying Assumption
Great dimension reduction of the parameter space is achieved under the so-called simplifying
assumption (SA) that assumes Cθ(X) = C (as in (1.3)), i.e. the conditional copula is constant
[38, 21]. The SA condition can significantly simplify the vine copula estimation [for example,
see 1], but it is known to lead to bias when it is wrongly assumed [2]. Therefore, for
conditional copula models it is of practical interest to assess whether the data supports or
not SA. A first step towards a formal test for SA can be found in [4]. The reader is referred
to [23] for an excellent review of work on SA, and ideas for future developments.
If the calibration function and marginal distributions are modeled parametrically, e.g.
$$\eta(X) = \alpha_0 + \sum_{j=1}^{K} \alpha_j \Psi_j(X),$$
where the Ψ_j(X) are some basis functions and the unknown parameters are estimated using maximum likelihood estimation (MLE) [37], then we can utilize standard asymptotic theory to test α_1 = α_2 = . . . = α_K = 0 via a canonical likelihood ratio test. However, this approach relies on knowledge of the shape of the calibration function, which is unrealistic in practice. A number of research contributions address this issue for frequentist analyses, e.g. [4], [38], [23], [48]. Moreover, even if the calibration form is guessed correctly, misspecified marginals can lead to wrong conclusions about the calibration behaviour, as noted in Chapter 4 and in [57].
Our contribution belongs within the Bayesian paradigm, following the general philosophy
expounded also in [49]. In this setting, it was observed in [20] that generic model selection
criteria tend to choose a more complex model even when SA holds.
1.6 Plan
In Chapter 2 we consider Bayesian joint analysis of the marginal and copula models using
flexible GP models. Our emphasis is placed on the estimation of the calibration function
η(X) which is assumed to have a GP prior that is evaluated at βTX for some normalized
β, thus coupling the GP-prior construct with the single index model (SIM) of [17] and
[39]. The GP-SIM is more flexible than a canonical linear model and computationally more
manageable than a full GP with q variables. The proposed model can be used for large
covariate dimension q and for large samples. Both marginal means will be fitted using sparse GP approaches so that large data sets remain computationally manageable. The
dimension reduction of the SIM approach has been noted also by [30] who used two-stage
semiparametric methods to estimate the calibration function. In contrast to [30], we use a
Bayesian approach and estimate the marginal and copula parameters jointly. So far, GP-SIMs have been used mostly in regression settings where the algorithm of [39] can be used to
efficiently sample the posterior distribution. However, the GP-SIM model for conditional
copulas involves a non-Gaussian likelihood which requires important modifications of their
algorithm.
A second contribution of this work (Chapters 3 and 4) deals with model selection issues that
are particularly relevant for the conditional copula construction. We consider important both the choice of copula family and the identification of whether the simplifying assumption (SA) is supported by the data. For the former task we develop a conditional cross-validated marginal
likelihood (CCVML) criterion and also examine its relation with the Watanabe Information
Criterion [91], while for determining the data support for SA we construct a permutation-
based variant of the CVML that shows good performance in our numerical experiments.
We then identify an important link between SA and missing covariates in the conditional
copula model. To our knowledge, this connection has not been reported elsewhere. Finally
we propose two other SA testing procedures that utilize the idea of splitting the data set into two parts, fitting a flexible model on the first, and relying on the predictions made by this model on the second. We then divide the data in the second (test) set into "bins" according to the order of the predicted values. To check whether the distribution in each bin is the same, we use a permutation or chi-square test. We show with theoretical arguments that this procedure attains the required probability of Type I error, and we support them with simulation results. We then extend these ideas to other models, and show that generic tests may not be reliable when the complexity of the model is data-driven. A merit of the proposed methods is their quite general applicability, but this comes, unsurprisingly, at the expense of power.
In order to investigate whether the trade-off is reasonable we design a simulation study and
present its conclusions.
We close this part by applying the proposed methods to a real-world problem, analyzing the Wine data set in Chapter 5.
Chapter 2

Bayesian Conditional Copula using Gaussian Processes

2.1 GP-SIM for Conditional Copula
We consider a bivariate response variable (Y1, Y2) ∈ R2 together with covariate measurement
X ∈ Rq. Hence, the data D = {(y1i, y2i, xi), i = 1 . . . n} consist of triplets (y1i, y2i, xi)
where y1i, y2i ∈ R and xi ∈ Rq. For notational convenience, let y1 = (y11, . . . , y1n)T ,
y2 = (y21, . . . , y2n)T and x = (x1, . . . , xn)T . We assume that the marginal distribution
of Y_j (j = 1, 2) is Gaussian with mean f_j(X) and constant variance $\sigma_j^2$. If we let $Y_j = (Y_{j1}, \ldots, Y_{jn})^T$, j = 1, 2, and $f_j = (f_j(x_1), \ldots, f_j(x_n))^T$, we can compactly write
$$Y_j \sim \mathcal{N}(f_j, \sigma_j^2 I_n), \quad j = 1, 2. \qquad (2.1)$$
Generally, it is difficult to discern whether the copula structure varies with covariates or not,
so we consider a conditional copula to account for the more general situation. Therefore,
the likelihood function is
$$L(\omega) = \prod_{i=1}^{n} \frac{1}{\sigma_1}\,\phi\!\left(\frac{y_{1i} - f_{1i}}{\sigma_1}\right) \frac{1}{\sigma_2}\,\phi\!\left(\frac{y_{2i} - f_{2i}}{\sigma_2}\right) \times c_{\theta(x_i)}\!\left(\Phi\!\left(\frac{y_{1i} - f_{1i}}{\sigma_1}\right), \Phi\!\left(\frac{y_{2i} - f_{2i}}{\sigma_2}\right)\right), \qquad (2.2)$$
where c denotes a parametric copula density function, ω denotes all the parameters in the model, while Φ and φ are the cumulative distribution function and the density function of a standard normal distribution, respectively. The parameter of the copula depends on the unknown function $\theta(x_i) = g^{-1}(f(x_i))$, where f is assumed to take the form given in (1.20) and g is
a known invertible link function that allows an unrestricted parameter space for f . Note
that the form of the GP-SIM model used for estimating the copula parameter is invariant
to non-linear transformations. This implies that the formulation of the model is the same
whether we directly estimate the copula parameter, θ(X), Kendall’s τ(X), or other mea-
sures of dependence. However, this is not true if we use an additive model for θ(X), since
additivity is not preserved by non-linear transformations.
The GP-SIM is fully specified once we assign the GP priors to f1, f2, f and the parametric
priors for the remaining parameters, as follows:
$$\begin{aligned} f_1 &\sim \mathcal{GP}(w_1), & f_2 &\sim \mathcal{GP}(w_2), & f &\sim \mathcal{GP}(w), \\ w_1 &\sim \mathcal{N}(0, 5I_{q+1}), & w_2 &\sim \mathcal{N}(0, 5I_{q+1}), & w &\sim \mathcal{N}(0, 5I_2), \\ \beta &\sim \mathcal{U}(S^{q-1}), & \sigma_1^2 &\sim \mathcal{IG}(0.1, 0.1), & \sigma_2^2 &\sim \mathcal{IG}(0.1, 0.1). \end{aligned} \qquad (2.3)$$
Here GP(w) is a Gaussian process prior with mean zero and squared exponential kernel with parameters w, U(S^{q−1}) is the uniform distribution on the surface of the q-dimensional unit sphere, and IG(α, β) denotes the inverse gamma distribution. The above prior for w captures very wiggly functions for small values of w and almost constant functions for large values of w. The prior for the marginal variances is vague and would be conjugate in the absence of the copula term. In our experience, the results are not sensitive to the choice of hyperparameter values.
Because the focus of our work is on inference for the copula, we allow f1 and f2 to be
evaluated on Rq while f is on R. In order to avoid computational problems that affect
the GP-based inference when the sample size is large, the inference will rely on the Sparse
GP method described in the previous section. Suppose $\bar{x}_1$ are the m_1 inducing inputs for function f_1, $\bar{x}_2$ the m_2 inducing inputs for function f_2, and z the m inducing inputs for function f. The number of inducing inputs m_1, m_2 and m can all be different, but in our
applications we will choose their values equal and significantly smaller than the sample size,
n. The choice is motivated by imperative computational time restrictions, given the large
number of numerical simulations we perform to investigate empirically the performance of
the approach in terms of estimation and model selection. In practice, the analyst should
ideally use the largest number of inducing points the computing environment can support.
As suggested earlier, we define $\bar{x}_1$ and $\bar{x}_2$ as the centers of m_1 and m_2 clusters of x. If m_1 = m_2 then the inducing inputs are the same. We cannot use the same strategy for z, since we would then need the centers of the clusters of the variable x^Tβ, which are unknown. If we assume that each covariate value $x_{is}$ lies between 0 and 1 (this can be achieved easily by subtracting the minimum value and dividing by the range), then by the Cauchy-Schwarz inequality we obtain
$$|x_i^T \beta| \le \sqrt{\|x_i\|^2 \|\beta\|^2} \le \sqrt{q} \quad \text{for all } x_i, \beta.$$
Hence we can choose z to be m equally spaced points in the interval $[-\sqrt{q}, \sqrt{q}]$. Let $\bar{f}_1$ be f_1 evaluated at $\bar{x}_1$, $\bar{f}_2$ be f_2 evaluated at $\bar{x}_2$, and $\bar{f}$ be f evaluated at z. Then
the joint density of the observed data and parameters is proportional to
$$P(D, \omega) \propto L(\omega)\, p_N(\bar{f}_1; 0, K(\bar{x}_1, \bar{x}_1; w_1))\, p_N(\bar{f}_2; 0, K(\bar{x}_2, \bar{x}_2; w_2))\, p_N(\bar{f}; 0, K(z, z; w)) \times p_N(w_1)\, p_N(w_2)\, p_N(w)\, p_{IG}(\sigma_1^2)\, p_{IG}(\sigma_2^2), \qquad (2.4)$$
where $f_1 = A(x, \bar{x}_1; w_1)\bar{f}_1$, $f_2 = A(x, \bar{x}_2; w_2)\bar{f}_2$ and $f = A(x^T\beta, z; w)\bar{f}$ enter the likelihood (2.2), and $p_N$ and $p_{IG}$ are the multivariate normal and inverse gamma densities, respectively. Although here we adopt a full GP prior for the marginal models, the approach can be easily adapted to consider GP-SIM models for the marginals too.
The contribution of the conditional copula model to the joint likelihood breaks the tractability of the posterior conditional densities and complicates the design of an MCMC algorithm that can sample efficiently from the posterior distribution. The joint conditional posterior distribution of the latent variables ($\bar{f}$) and parameters (w) given the observed data D does not have a tractable form, and its study will require the use of Markov chain Monte Carlo (MCMC) sampling methods. Specifically, we use Random Walk Metropolis (RWM) within Gibbs sampling for w [19, 78, 7], while for $\bar{f}$ we use elliptical slice sampling [65], which has been designed specifically for GP-based models and does not require tuning of free parameters.
2.2 Computational Algorithm
Inference is based on the posterior distribution $\pi(\omega \mid D, \bar{x}_1, \bar{x}_2, z)$, where $\omega = (\bar{f}_1, \bar{f}_2, \bar{f}, w_1, w_2, w, \sigma_1^2, \sigma_2^2, \beta) \in \mathbb{R}^k$ represents the vector of parameters and latent variables in the model, with k = 3m + 3q + 7. Since the posterior is not mathematically tractable,
its properties will be explored via Markov chain Monte Carlo (MCMC) sampling. In this
section we provide the detailed steps of the MCMC sampler designed to sample from π.
The general form of the algorithm falls within the class of Metropolis-within-Gibbs (MwG)
samplers in which we update in turn each component of the chain by sampling from its
conditional distribution, given all the other components. The presence of the copula in the
likelihood breaks the usual conditional conjugacy of the GP models so none of the compo-
nents have conditional distributions that can be sampled directly.
Suppose we are interested in sampling from a target π(ω). A generic MwG sampler proceeds as follows:

Step I. Initialize the chain at $\omega_1^{(1)}, \omega_2^{(1)}, \ldots, \omega_k^{(1)}$.

Step R. At iteration t + 1 run iteratively the following steps for each j = 1, . . . , k:

1. Sample $\omega_j^* \sim q_j(\cdot \mid \omega_j^{(t)}, \omega_{-j}^{(t+1;t)})$, where $\omega_{-j}^{(t+1;t)} = (\omega_1^{(t+1)}, \ldots, \omega_{j-1}^{(t+1)}, \omega_{j+1}^{(t)}, \ldots, \omega_k^{(t)})$ is the most recent state of the chain with the first j − 1 components already updated (hence the superscript t + 1), the jth component removed, and the remaining k − j components having the values determined at iteration t (hence the superscript t).

2. Compute
$$r = \min\left\{1, \frac{\pi(\omega_1^{(t+1)}, \ldots, \omega_{j-1}^{(t+1)}, \omega_j^*, \omega_{j+1}^{(t)}, \ldots, \omega_k^{(t)})\, q_j(\omega_j^{(t)} \mid \omega_j^*, \omega_{-j}^{(t+1;t)})}{\pi(\omega_1^{(t+1)}, \ldots, \omega_{j-1}^{(t+1)}, \omega_j^{(t)}, \omega_{j+1}^{(t)}, \ldots, \omega_k^{(t)})\, q_j(\omega_j^* \mid \omega_j^{(t)}, \omega_{-j}^{(t+1;t)})}\right\}.$$

3. With probability r accept the proposal and set $\omega_j^{(t+1)} = \omega_j^*$; with probability 1 − r reject it and set $\omega_j^{(t+1)} = \omega_j^{(t)}$.
The proposal density $q_j(\cdot \mid \cdot)$ corresponds to the transition kernel used for the jth component. Our algorithm uses a number of proposals corresponding to Random Walk Metropolis-within-Gibbs (RWMwG), Independent Metropolis-within-Gibbs (IMwG) and Elliptical Slice Sampling within Gibbs (SSwG) moves.
At step t + 1 we use the following proposals to update the chain:

w_j: Use a RWM transition kernel: $w^* \sim \mathcal{N}(w_j^{(t)}, c_{w_j} I_{q+1})$. The constant $c_{w_j}$ is chosen so that the acceptance rate is about 30%, j = 1, 2.

w: Use RWM: $w^* \sim \mathcal{N}(w^{(t)}, c_w I_2)$. The constant $c_w$ is chosen so that the acceptance rate is about 30%.

σ_j^2: Without the copula, the conditional posterior distribution of $\sigma_j^2$ would be $\mathcal{IG}(0.1 + n/2,\; 0.1 + (y_j - A_j \bar{f}_j^{(t)})^T (y_j - A_j \bar{f}_j^{(t)}))$, where $A_j = A(x, \bar{x}_j; w_j^{(t+1)})$ for j = 1, 2. We use this distribution to build an independent Metropolis (IM) type of transition for $\sigma_j^2$, j = 1, 2. The acceptance rate is usually in the range [0.25, 0.60], and the chain mixes better than it would under RWM.

β: Since β is normalized, we use RWM on the unit sphere via the von Mises-Fisher (VMF) distribution. The VMF distribution has two parameters: µ (normalized to have norm one), which represents the mean direction, and κ, the concentration parameter. A larger κ implies that the distribution is more concentrated around µ. The density is symmetric in µ and the argument, and is proportional to
$$f_{VMF}(x; \mu, \kappa) \propto \exp(\kappa x^T \mu).$$
The proposals are generated using $\beta^* \sim \text{VMF}(\beta^{(t)}, \kappa)$, where κ is chosen so that the acceptance rate is around 30%.

f's: For $\bar{f}_j$, j = 1, 2, and $\bar{f}$ we use the elliptical slice sampling proposed by [65], which does not require the tuning of simulation parameters (a compact sketch of one such update is given below). Although not needed in our examples, we note that if the chain's mixing is sluggish, one can improve it using the parallelization strategy proposed by [68].
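The following is a minimal Python sketch of one elliptical slice update in the spirit of [65], assuming the latent vector has a N(0, K) prior with Cholesky factor chol_K; the interface (log_lik, chol_K) is our own convention, not the thesis implementation.

import numpy as np

def elliptical_slice(f, log_lik, chol_K, rng):
    """One elliptical slice sampling update for a latent vector f with
    prior N(0, K); log_lik(f) returns the log-likelihood contribution of f."""
    nu = chol_K @ rng.standard_normal(f.size)   # prior draw defining the ellipse
    log_y = log_lik(f) + np.log(rng.uniform())  # slice level
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        f_new = f * np.cos(theta) + nu * np.sin(theta)
        if log_lik(f_new) > log_y:
            return f_new                        # accepted point on the ellipse
        # shrink the bracket towards theta = 0 and retry
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)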
In our experience the efficiency of the algorithm benefits from initial values that are not too far from the posterior mode. Therefore we propose to first roughly estimate the parameters in the two independent regressions for y_1 and y_2 to get $(\bar{f}_1, w_1, \sigma_1^2)^{(1)}$ and $(\bar{f}_2, w_2, \sigma_2^2)^{(1)}$, and then to run another MCMC sampler, fixing the marginals and sampling only $(\bar{f}, w)$; this yields $(\bar{f}, w)^{(1)}$. These 3 short chains (100-200 iterations each) provide good initial values for the joint MCMC sampler. This simple approach shortens the time it would take the original chain to find the regions of high mass under the posterior. We have also found that the chain's mixing is accelerated when the initial values for w are small, thus allowing for more variation in the calibration function.
Remark: In our numerical experiments, we will fit the GP-SIM model to data with constant calibration, i.e., with true values β_i = 0 for all 1 ≤ i ≤ q. The constraint ‖β‖ = 1 forbids sampling null values for all the components of β simultaneously, and instead the MCMC draws for β's components are spread randomly in the support. However, the shape of the calibration function is correctly recovered, since the sampled values for the second component of w were large, reflecting the perfect dependence between $f(x_i^T\beta)$ and $f(x_j^T\beta)$ for any 1 ≤ i ≠ j ≤ n. This led to difficulties in identifying the SA, as discussed below, and compelled us to develop a new SA identification procedure that is described in Section 4.2.
2.3 Performance of the Algorithms
2.3.1 Simulations
The purpose of the simulation study is to assess empirically: 1) the performance of the
estimation method under the correct and misspecified models, as well as 2) the ability of
the model selection criteria to identify the correct copula structure, i.e. the copula family
and the parametric form of the calibration function. For the former aim we compute the
integrated mean square for various quantities of interest, including the Kendall’s τ . In order
to facilitate the assessment of the estimation performance across different copula families,
we estimate the calibration function on the Kendall’s τ scale. The latter is given by
τ(X) = 4
(∫∫C(u1, u2|X)c(u1, u2|X)du1du2
)− 1.
We will compare 3 copulas, Clayton, Frank and Gaussian, under the general GP-SIM model, and the Clayton with a constant calibration function. To fit the model with a constant copula, we still use MCMC, but instead of $\bar{f}$, w and β in the calibration we have a constant scalar copula parameter θ. A RWMwG transition is used to sample θ, while the proposal distributions for the marginals' parameters and latent variables remain the same. Table 2.1 shows the copula density functions (as functions of the parameter θ) for each copula family. Table 2.2 provides the inverse-link functions $g^{-1}$ used for calibration, the functional relationship between Kendall's τ and the copula parameter, and the parameter range for every copula family used in this thesis.
Table 2.1: Copula density functions for each copula family.

Copula      c(u_1, u_2 | θ)
Clayton     (1 + θ)(u_1 u_2)^{−1−θ} / A^{1/θ + 2}, where A = u_1^{−θ} + u_2^{−θ} − 1
Frank       θ(1 − e^{−θ}) e^{−θ(u_1 + u_2)} [(1 − e^{−θ}) − (1 − e^{−θu_1})(1 − e^{−θu_2})]^{−2}
Gaussian    (1 − θ²)^{−1/2} exp( −[θ²(y_1² + y_2²) − 2θ y_1 y_2] / [2(1 − θ²)] ), where y_j = Φ^{−1}(u_j)
T (ν df)    [2π√(1 − θ²)]^{−1} [d_ν(y_1) d_ν(y_2)]^{−1} (1 + [y_1² + y_2² − 2θ y_1 y_2] / [ν(1 − θ²)])^{−ν/2 − 1}, where y_j = t_ν^{−1}(u_j) and d_ν(y) is the univariate density of the T(ν df) distribution
Table 2.2: Parameter range, inverse-link functions, and the functional relationship between Kendall's τ and the copula parameter.

Copula        Range of parameter (θ)    Inverse-link function              Kendall's τ formula
Clayton       (−1, ∞) \ {0}             θ = exp(f) − 1                     τ = θ/(θ + 2)
Frank         (−∞, ∞) \ {0}             θ = f                              no closed form
Gaussian, T   (−1, 1)                   θ = (exp(f) − 1)/(exp(f) + 1)      τ = (2/π) arcsin(θ)
Gumbel        (1, ∞)                    θ = exp(f) + 1                     τ = 1 − 1/θ
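As a quick numerical companion to Table 2.2, the following Python helpers map a calibration value f to the copula parameter θ and to Kendall's τ for two of the families; the function names are ours and serve only as an illustration.

import numpy as np

def clayton_theta(f):
    """Clayton inverse link from Table 2.2: theta = exp(f) - 1, in (-1, inf)."""
    return np.exp(f) - 1.0

def clayton_tau(theta):
    """Kendall's tau for the Clayton copula: tau = theta / (theta + 2)."""
    return theta / (theta + 2.0)

def gaussian_theta(f):
    """Gaussian/T inverse link: (exp(f) - 1)/(exp(f) + 1), i.e. tanh(f/2), in (-1, 1)."""
    return np.tanh(f / 2.0)

def gaussian_tau(theta):
    """Kendall's tau for the Gaussian/T copula: tau = (2/pi) arcsin(theta)."""
    return 2.0 / np.pi * np.arcsin(theta)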
In addition to Kendall's τ, we also use the conditional mean of Y_1 given y_2 and x for assessing the estimation. Such conditional means can be useful in prediction when one of the responses is more expensive to measure than the other. The calculation is mathematically straightforward:
$$E(Y_1 \mid Y_2 = y_2, x) = f_1(x) + \sigma_1 \int_0^1 \Phi^{-1}(z)\, c_{\theta(x)}\!\left(z, \Phi\!\left(\frac{y_2 - f_2(x)}{\sigma_2}\right)\right) dz. \qquad (2.5)$$
The integral in (2.5) is usually not tractable, but can be easily estimated via numerical
integration since it is one-dimensional and defined on the closed interval [0, 1].
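As an illustration of this numerical integration, here is a short Python sketch of (2.5) using a midpoint rule; the function names and the generic copula_density argument are our own conventions, not the thesis code.

import numpy as np
from scipy.stats import norm

def cond_mean_y1(y2, f1_x, f2_x, sigma1, sigma2, copula_density, theta, n_grid=200):
    """Numerically evaluate (2.5) with the midpoint rule on [0, 1].
    copula_density(u1, u2, theta) is any bivariate copula density c_theta;
    the other arguments are scalars evaluated at the covariate value x."""
    z = (np.arange(n_grid) + 0.5) / n_grid          # midpoints avoid 0 and 1
    u2 = norm.cdf((y2 - f2_x) / sigma2)
    integrand = norm.ppf(z) * copula_density(z, u2, theta)
    return f1_x + sigma1 * integrand.mean()          # mean = sum * (1/n_grid)

# Example with the Clayton density from Table 2.1 (theta > 0)
clayton = lambda u1, u2, t: (1 + t) * (u1 * u2) ** (-1 - t) * \
    (u1 ** (-t) + u2 ** (-t) - 1) ** (-(1 / t + 2))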
2.3.2 Simulation Details
We generate samples of size n = 400 from each of the following 6 scenarios using the Clayton copula. The covariates are generated independently from the Uniform(0, 1) distribution. The covariate dimension q in Scenario 3 is 10; in all other scenarios it is 2.

Sc1: $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$, $f_2(x) = 0.6\sin(3x_1 + 5x_2)$, $\tau(x) = 0.7 + 0.15\sin(15x^T\beta)$, $\beta = (1, 3)^T/\sqrt{10}$, $\sigma_1 = \sigma_2 = 0.2$.

Sc2: $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$, $f_2(x) = 0.6\sin(3x_1 + 5x_2)$, $\tau(x) = 0.3\sin(5x^T\beta)$, $\beta = (1, 3)^T/\sqrt{10}$, $\sigma_1 = \sigma_2 = 0.2$.

Sc3: $\beta = (1, 10, -3, 6, 1, -6, 3, 7, -1, -5)^T/\sqrt{267}$, $\sigma_1 = \sigma_2 = 0.2$, $f_1(x) = \cos(x^T\beta)$, $f_2(x) = \sin(x^T\beta)$, $\tau(x) = 0.7 + 0.20\sin(5x^T\beta)$.

Sc4: $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$, $f_2(x) = 0.6\sin(3x_1 + 5x_2)$, $\tau(x) = 0.5$, $\sigma_1 = \sigma_2 = 0.2$.

Sc5: $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$, $f_2(x) = 0.6\sin(3x_1 + 5x_2)$, $\eta(x) = 1 + 0.7\sin(3x_1^3) - 0.5\cos(6x_2^2)$, $\sigma_1 = \sigma_2 = 0.2$.

Sc6: $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$, $f_2(x) = 0.6\sin(3x_1 + 5x_2)$, $\eta(x) = 1 + 0.7x_1 - 0.5x_2^2$, $\sigma_1 = \sigma_2 = 0.2$.
Sc1 and Sc2 have calibration functions for which the SIM model is true for Kendall’s τ
and, consequently, also for the copula parameter. Sc1 corresponds to large dependence
(τ greater than 0.5) while Sc2 has small dependence (τ is between −0.3 and 0.3). Sc3
also has SIM form for calibration function but the covariate dimension is q = 10, so this
scenario is important to evaluate how well the algorithms scale up with dimension. Sc4
corresponds to the covariate-free dependence (τ = 0.5) and allows us to verify the power
to detect simple parametric forms for the calibration. Scenarios Sc5 and Sc6 do not have
SIM form, but have additive calibration function [as in 80]. They are used to evaluate the
effect of model misspecification on the inference. Note that Sc6 has almost SIM calibration
when x_2 ∈ [0, 1]. From our experiments we found that with m = 30 inducing points for the marginal and calibration sparse GPs, we obtain a reasonable CPU time that allows us to perform the desired number of replications while still capturing the general form of the estimated functions. On average, one MCMC iteration (n = 400) with GP-SIM calibration takes 0.02 seconds; one iteration with constant calibration (and GPs for the marginals) takes 0.015 seconds. The MCMC samplers were run for 20,000 iterations for all scenarios.
The first half of the MCMC sample is discarded as burn-in and the second half is used
for inference. As noted earlier, starting values were found by running two GP regressions
separately to estimate marginal parameters and one MCMC sampler was run in order to
estimate calibration parameters. All three samplers were run for only 100 iterations.
2.3.3 Proof of Concept Based on One Replicate
In the absence of computable convergence bounds, we used the Gelman-Rubin [34] diag-
nostic statistics to decide the length of the chain’s run. To illustrate using Sc1, we ran
10 independent MCMC chains, each for 20, 000 iterations, that were started from different
initial values. The trace plots for the potential scale reduction factor (PSRF), computed up
to 10,000 iterations for β, $\sigma_1^2$ and $\sigma_2^2$, are displayed in Figure 2.1. The plots show that the
Figure 2.1: Sc1: Clayton copula, Gelman-Rubin MCMC diagnostic for β and the two variances. [Four panels (Beta 1, Beta 2, Sigma Squared 1, Sigma Squared 2), each showing the median and 97.5% shrink factor versus the last iteration in the chain.]
multivariate PSRF after 10, 000 iterations is 1.1. The subsequent 10, 000 samples were used
for inference.
Parameter Estimation
The simulation results show that Sc1 and Sc2 performed similarly. Since the calibration
function in Sc1 is more complicated, for the sake of reducing the chapter’s length we present
only results for that scenario. The trace-plots, autocorrelation functions and histograms of
posterior samples of β, $\sigma_1^2$ and $\sigma_2^2$ are shown in Figure 2.2 when the fitted copula belongs to the correct Clayton family (the horizontal solid red line is the true value). Next we
Figure 2.2: Sc1: Trace-plots, ACFs and histograms of parameters based on MCMC samples generated under the true Clayton family. [Panels for Beta1, Beta2, Sigma Squared 1 and Sigma Squared 2, each with a trace-plot, an ACF plot and a histogram.]
show predictions for the marginal means with 95% credible intervals. Since these are 2-dimensional, we estimate 'slices' of this surface at the values 0.2 and 0.8, so that we first fix x_1 = 0.2 then x_1 = 0.8, and similarly for x_2. The results are in Figure 2.3 (black is the truth, green is the estimate, red are the credible intervals).
One of the inferential goals is the prediction of the calibration function or, equivalently, the Kendall's τ function. In this case we are dealing with only two covariates, so their joint effect can be visualized via the Kendall's τ surface. In Figure 2.4 we show the true calibration surface on the left panel and the fitted one on the right. The accuracy is remarkable, and we are hard put to see major differences between the two panels. Since the visual comparison of the three-dimensional true and fitted surfaces may be misleading, we proceed as with the conditional marginal means and estimate one-dimensional slices at the values 0.2 and 0.8; the results, shown in Figure 2.5, confirm the accuracy of the fit.
The predictive power of the model was assessed by fixing 4 covariate points and estimating
the corresponding Kendall’s τ values: τ(0.2, 0.2), τ(0.2, 0.8), τ(0.8, 0.2), τ(0.8, 0.8). At each
MCMC iteration these predictions are calculated and histograms (Figure 2.6) are constructed (red lines are the true values of τ). The same estimates are presented in Figure 2.7 when the Gaussian copula is used for inference. One can notice that the estimates are biased in this instance, thus emphasizing the importance of identifying the right copula family. Similar
Figure 2.3: Sc1: Estimation of marginal means. The leftmost 2 columns show the accuracy for predicting E(Y1) and the rightmost 2 columns show the results for predicting E(Y2). The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family. [Eight panels: E(Y1) and E(Y2) versus the free covariate for the slices X1 = 0.2, X1 = 0.8, X2 = 0.2 and X2 = 0.8.]
Figure 2.4: Sc1: Estimation of Kendall's τ dependence surface. The true surface (left panel) is very similar to the estimated one (right panel). [Two surface plots of Kendall's τ over (X_1, X_2).]
patterns have been observed when using the Frank copula.
We also show how well the algorithm estimates the calibration function when the covariate dimension is large. Figure 2.8 shows one-dimensional slices of the Kendall's τ function for Sc3, estimated by the Clayton GP-SIM model. Each plot is produced by varying one coordinate from
Figure 2.5: Sc1: Estimation of Kendall's τ one-dimensional projections when x1 = 0.2 or 0.8 (top panels) and when x2 = 0.2 or 0.8 (bottom panels). The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
Figure 2.6: Sc1: Histograms of predicted Kendall's τ values obtained under the true Clayton copula. [Four panels: τ at (X1, X2) = (0.2, 0.2), (0.2, 0.8), (0.8, 0.2) and (0.8, 0.8).]
Figure 2.7: Sc1: Histograms of predicted τ's under the Gaussian copula model. [Same four covariate points as Figure 2.6.]
0 to 1 while fixing all other coordinates at 0.5. We observe that even in this case the estimated curves are very close to the true Kendall's τ function.
2.3.4 Multiple Replicates
So far, the results reported were based on a single implementation of the method. In order
to facilitate interpretation, we perform 50 independent replications under each of the six
scenarios described previously.
The MCMC sampler was run for 20, 000 iterations for all scenarios. As before, the first
half of iterations was ignored as a burn-in period. For each data set, 4 estimations were
done with Clayton, Frank, Gaussian and constant Clayton copulas. For Sc5 and Sc6 we
also fitted the Clayton copula with an additive model for the calibration function, as in
[80]. The marginal distribution models have the general GP form throughout the section. In order to produce overall measures of fit, we report the integrated squared bias (IBias²), variance (IVar) and mean squared error (IMSE) of Kendall's τ evaluated at the covariates $x = (x_1, \ldots, x_n)^T$. The calculation requires finding point estimates $\hat{\tau}_r(x_i)$ for each of the 1 ≤ r ≤ R independently replicated analyses and each i = 1, . . . , n. The formulas for IBias², IVar and
Figure 2.8: Sc3: Estimation of Kendall's τ one-dimensional projections for each coordinate, fixing all other coordinates at 0.5. The black and green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family. [Ten panels, one per coordinate X1-X10.]
IMSE are given by
$$\begin{aligned} \text{IBias}^2 &= \frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{R}\sum_{r=1}^{R}\hat{\tau}_r(x_i) - \tau(x_i)\right)^{2}, \\ \text{IVar} &= \frac{1}{n}\sum_{i=1}^{n} \mathrm{Var}_r\!\left(\hat{\tau}_r(x_i)\right), \\ \text{IMSE} &= \text{IBias}^2 + \text{IVar}. \end{aligned} \qquad (2.6)$$
We will apply these concepts not only to Kendall's τ but also to $E(Y_1 \mid Y_2 = y_2, X = x)$ for different combinations (x, y_2).
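For reference, here is a small Python sketch of (2.6), assuming the replicate point estimates are stored in an (R, n) array; the names are illustrative.

import numpy as np

def ibias2_ivar_imse(tau_hat, tau_true):
    """Compute (2.6) from tau_hat, an (R, n) array with
    tau_hat[r, i] = hat{tau}_r(x_i), and tau_true, the length-n truth."""
    mean_est = tau_hat.mean(axis=0)                 # average over replicates
    ibias2 = np.mean((mean_est - tau_true) ** 2)
    ivar = np.mean(tau_hat.var(axis=0, ddof=1))     # sample variance across replicates
    return ibias2, ivar, ibias2 + ivar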
Estimation
IBias2, IVar and IMSE for each scenario and each model are shown in Table 2.3 (bold values
show smallest IMSE for each scenario). Note that the smallest IMSE is produced when
fitting the correct model and copula family. The Clayton model with GP-SIM calibration
Table 3.2: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a non-constant calibration with all the other models: Frank model with non-constant calibration, Gaussian model with non-constant calibration, Clayton model with constant calibration.
accurate than correctly determining that the calibration function is constant. The latter
difficulty has been reported elsewhere [e.g., 20]. In part, this is due to the fact that the models
are flexible enough to capture the constant calibration and produce estimates that mislead a
cross-validation-based method. In Section 4.2 we return to this problem and develop a new permutation-based procedure that exhibits drastically improved performance in numerical experiments. Since Sc5 and Sc6 were simulated with Clayton additive calibration, we show how often the Clayton additive model is selected over the Clayton GP-SIM using different
Table 3.3: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a constant calibration with three models: Clayton, Frank and Gaussian, all of them assuming a GP-SIM calibration.
criteria (Table 3.4). The poor performance for Sc6 is not that surprising since the additive
calibration in this scenario has almost SIM form.
Table 3.4: The percentage of correct decisions for each selection criterion when comparing the correct additive model with the GP-SIM with non-constant calibration.

            Clayton GP-SIM
Scenario    CVML    CCVML    WAIC
Sc5         92%     94%      90%
Sc6         30%     34%      28%
3.3 Additional Simulation Results Based on Multiple Replicates
In addition to the simulations shown in Sections 2.3.2 and 3.2.2, we also simulated and analyzed
50 independent replicates from each of the following scenarios:
Sc1b: $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$, $f_2(x) = 0.6\sin(3x_1 + 5x_2)$, $\tau(x) = 0.7 + 0.15\sin(15x^T\beta)$, $\beta = (1, 3)^T/\sqrt{10}$, $\sigma_1 = \sigma_2 = 0.2$, n = 1000.

Sc7: $f_1(x) = 0.6\sin(5x_1) - 0.9\sin(2x_2)$, $f_2(x) = 0.6\sin(3x_1 + 5x_2)$, $\tau(x) = 0.7 + 0.15\sin(2x^T\beta)$, $\beta = (2, -3)^T$, $\sigma_1 = \sigma_2 = 0.2$, n = 400.
Sc1b is exactly the same as Sc1, the only difference being that the sample size is 1000 instead of 400. In Sc7 we do not assume that the generating β is normalized.

Tables 3.5 and 3.6 show IBias², IVar and IMSE for each scenario (including Sc1 for comparison) and each model, for the estimation of Kendall's τ and of $E(Y_1 \mid y_2, x)$, respectively. First we note that the Clayton with GP-SIM calibration produces the smallest IMSE in all scenarios. Another important observation is that the IMSEs for Kendall's τ and for conditional response prediction are smaller for Sc1b than for Sc1, which indicates that as the sample size increases the model produces more accurate predictions. Results for Sc7 are similar to Sc1, so even when the true generating β in the SIM is not normalized we still obtain acceptable predictions for each test value x. In fact, the posterior of β simply converges to the normalized vector $(2, -3)^T/\sqrt{13}$.
Table 3.7 shows how often the Clayton model (with non-constant calibration) is selected over the other models using CVML, CCVML and WAIC. Again we notice that these criteria perform well in distinguishing between copula families. Also, in Sc1b all criteria select the true non-constant Clayton at a higher rate than in Sc1, which is probably due to the larger sample size.
Table 3.7: The percentage of correct decisions for each selection criterion when comparing the correct Clayton model with a non-constant calibration with all the other models: Frank model with non-constant calibration, Gaussian model with non-constant calibration, Clayton model with constant calibration.
4.1 Interesting Connection between Model Misspecification and the Simplifying Assumption
Understanding whether the data support the SA or not is usually important for the subject
matter analysis, since a dependence structure that does not depend on the covariates can be
of scientific interest. The SA also has a serious impact on the statistical analysis, because it has the potential to greatly simplify the estimation of the copula. There is, however, an interesting connection between model misspecification and the SA which, as far as we know, has not been reported elsewhere.
To illustrate the point, consider a random sampling design setting with two independent
random variables, X1, X2 serving as covariates in the Clayton copula model in which SA is
satisfied, the sample size n = 1500 and
$$f_1(x) = 0.6\sin(5x_1 + x_2), \quad f_2(x) = 0.6\sin(x_1 + 5x_2), \quad \tau(x) = 0.5, \quad \sigma_1 = \sigma_2 = 0.2.$$
When we fit a GP-SIM model with the correct Clayton copula family, but with the X2
covariate omitted from both marginal and copula models, the estimated Kendall’s τ(x1)
exhibits a clear non-constant shape, as seen in Figure 4.1. The CVML, CCVML and WAIC
criteria, whose values are shown in Table 4.1, unanimously vote for a nonconstant calibration
function.
Figure 4.1: Estimation of Kendall's τ as a function of x1 when only the first covariate is used in estimation. The dotted black and solid green lines represent the true and estimated relationships, respectively. The red lines are the limits of the pointwise 95% credible intervals obtained under the true Clayton family.
Table 4.1: Missed covariate: CVML, CCVML and WAIC criteria values for the model in which the conditional copula depends on one covariate and for the model in which it is constant.

Variables    CVML    CCVML    WAIC
X1           -508    -174     1017
Constant     -570    -232     1140
While one may expect a non-constant pattern when the two covariates are dependent, this residual effect of X_1 on the copula may be surprising when X_1 and X_2 are independent. We can gain some understanding by considering a simplified example in which $Y_i \mid X_1, X_2 \sim \mathcal{N}(f_i(X_1, X_2), 1)$ for i = 1, 2, and $\mathrm{Cov}(Y_1, Y_2 \mid X_1, X_2) = \mathrm{Corr}(Y_1, Y_2 \mid X_1, X_2) = \rho$, hence constant in X_1 and X_2. Thus, for marginal models that include only X_1, yielding residuals $W_i = Y_i - E[Y_i \mid X_1]$ for i = 1, 2, we are interested in explaining the non-constant dependence of $\mathrm{Cov}(W_1, W_2 \mid X_1)$ on X_1. Standard statistical properties of covariance give
$$\mathrm{Cov}(W_1, W_2 \mid X_1) = E[\mathrm{Cov}(W_1, W_2 \mid X_1, X_2) \mid X_1] + \mathrm{Cov}(E[W_1 \mid X_1, X_2], E[W_2 \mid X_1, X_2] \mid X_1) \qquad (4.1)$$
$$= \rho + \mathrm{Cov}(f_1(X_1, X_2), f_2(X_1, X_2) \mid X_1), \qquad (4.2)$$
where the covariance in (4.2) is with respect to the marginal distribution of X_2. Hence it is
apparent that the conditional covariance Cov(W1,W2|X1) will generally not be constant in
X1. Note that if the true means have additive form, i.e. fi(X1, X2) = fi(X1) + fi(X2), for
i = 1, 2, then the covariances in (4.1) are indeed constant in X1, but the estimated value
of Cov(Y1, Y2|X1) will be biased. Although here we focused on the covariance as a measure
of dependence, the argument is extendable to copula parameters or Kendall’s tau, but the
calculations are more involved.
In conclusion, a violation of the SA may be due to the omission of important covariates from the model. This phenomenon, along with the knowledge that in general it is difficult to measure all the variables with a potential effect on the dependence pattern, suggests that a non-constant copula is a prudent choice.
4.2 A Permutation-based CVML to Detect Data Support for the Simplifying Assumption
In this section we modify the CVML and the conditional CVML (CCVML) criteria to identify data support for the SA after the copula family has been selected.
As was shown in Section 3.2.2, the CVML, CCVML and WAIC criteria yield good results in identifying the correct copula family but do not perform well in recognizing that the true calibration is constant. In other words, they have a large probability of Type I error. This is in line with [20], who also noted that traditional Bayesian model selection criteria, e.g. the Deviance information criterion (DIC) of [88], tend to prefer the more complex calibration model over a simple model with constant calibration even when the latter is actually correct. In addition to the simulations presented in the previous section, we add here that when the marginal distributions are estimated, the performance of the existing criteria worsens. To
illustrate, we have simulated 50 replicates of sample size 1500 using the Clayton copula under Sc1, Sc4 and Sc5. Each sample is fitted with the general model introduced here and with a constant Clayton copula, while the marginals are estimated using a general GP. Table 4.2 shows the proportion of correct decisions for the three scenarios and various selection criteria. These
results show that even for a large sample size, the proportion of right decisions for Sc4, i.e.
when SA holds, is quite low. One of the explanations is that the general model does a good
job at capturing the constant trend of the calibration function and yields predictions that
are not too far from the ones produced with the simpler (and correct) model. The modified
CVML we propose is inspired by two desiderata: i) to separate the set of observations used
for prediction from the set of observations used for fitting the model, and ii) to amplify the
impact of the copula-induced errors in the CCVML calculation. The former will reduce the
implicit bias one gets when the same data is used for estimation and testing, while the latter
is expected to increase the power to identify SA.
For i) we randomly partition the data into a training set D = {y1i, y2i, xi}i=1,...,n1 and a
test set D∗ = {y∗1i, y∗2i, x∗i }i=1,...,n2 . In our numerical experiments we have kept two thirds of
observations in the training set. In order to achieve ii) we note that permuting the response
indexes will not affect the copula term if SA is indeed satisfied and will perturb the pre-
diction when SA is not satisfied. However, one must cautiously implement this idea, since
the permutation λ : {1, . . . , n2} → {1, . . . , n2} will affect the marginal model fit, regardless
of the SA status, as yjλ(i) will be paired with xi, for all j = 1, 2. Below we describe the
permutation-based CVML criterion that combines i) and ii).
Assume that the fitted GP-SIM model yields posterior samples from the conditional distribution of latent variables and parameters, $\omega^{(t)} \sim \pi(\omega \mid D)$, $t = 1, \ldots, M$. Then we define the observed data criterion as the predictive log probability of the test cases, which can be easily estimated from the posterior samples as follows:

\[
\mathrm{CVML}_{\mathrm{obs}} = \sum_{i=1}^{n_2} \log P(y^*_{1i}, y^*_{2i} \mid D, x^*_i)
\approx \sum_{i=1}^{n_2} \log \left\{ \frac{1}{M} \sum_{t=1}^{M} P(y^*_{1i}, y^*_{2i} \mid \omega^{(t)}, x^*_i) \right\}
\]
\[
= \sum_{i=1}^{n_2} \log \left\{ \frac{1}{M} \sum_{t=1}^{M}
\frac{1}{\sigma_1^{(t)}} \phi\!\left( \frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma_1^{(t)}} \right)
\frac{1}{\sigma_2^{(t)}} \phi\!\left( \frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma_2^{(t)}} \right)
c_{\theta^{*(t)}_i}\!\left[ \Phi\!\left( \frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma_1^{(t)}} \right),
\Phi\!\left( \frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma_2^{(t)}} \right) \right] \right\},
\]

where $f^{*(t)}_{1i}$, $f^{*(t)}_{2i}$, $\theta^{*(t)}_i$ are the predicted values for the test cases produced by the GP-SIM model. Consider J permutations of $\{1, \ldots, n_2\}$, which we denote $\lambda_1, \ldots, \lambda_J$, and compute the J permuted CVMLs as:

\[
\mathrm{CVML}_j = \sum_{i=1}^{n_2} \log \left\{ \frac{1}{M} \sum_{t=1}^{M}
\frac{1}{\sigma_1^{(t)}} \phi\!\left( \frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma_1^{(t)}} \right)
\frac{1}{\sigma_2^{(t)}} \phi\!\left( \frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma_2^{(t)}} \right)
c_{\theta^{*(t)}_{\lambda_j(i)}}\!\left[ \Phi\!\left( \frac{y^*_{1i} - f^{*(t)}_{1i}}{\sigma_1^{(t)}} \right),
\Phi\!\left( \frac{y^*_{2i} - f^{*(t)}_{2i}}{\sigma_2^{(t)}} \right) \right] \right\}. \tag{4.3}
\]
Note that CVMLobs differs from CVMLj only in the values of the copula parameters. While
for the former we use θ(x∗i ), in the latter we use θ(x∗λj(i)) for the dependence between y∗1i and
y∗2i. If calibration is constant then CVMLobs and CVMLj should be similar, hence we define
the evidence
\[
\mathrm{EV} = 2 \times \min\left( \frac{1}{J} \sum_{j=1}^{J} 1_{\{\mathrm{CVML}_{\mathrm{obs}} < \mathrm{CVML}_j\}},\;
\frac{1}{J} \sum_{j=1}^{J} 1_{\{\mathrm{CVML}_{\mathrm{obs}} > \mathrm{CVML}_j\}} \right). \tag{4.4}
\]
Under the null model with constant calibration and known marginals, and if we assume that CVMLobs and {CVMLj : 1 ≤ j ≤ J} are iid, each term inside the min function in (4.4) has a Uniform(0, 1) limiting distribution as J → ∞. In that case it follows that P(EV < 0.05) = 0.05. In practice, the ideal situation just described is merely an approximation, since the {CVMLj : 1 ≤ j ≤ J} are not independent and we compute EV using a fixed number of permutations. Nevertheless, the ideal setup can be used to build our decision rule: when EV > 0.05 the data support SA, and otherwise they do not.
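The decision rule in (4.4) is a one-liner in practice. A minimal sketch (the inputs cvml_obs and cvml_perm stand for CVMLobs and the permuted criteria {CVMLj}, assumed already computed as above; the toy values below are hypothetical):

```python
import numpy as np

def evidence(cvml_obs: float, cvml_perm: np.ndarray) -> float:
    """EV of (4.4): twice the smaller of the two empirical tail proportions."""
    below = np.mean(cvml_obs < cvml_perm)
    above = np.mean(cvml_obs > cvml_perm)
    return 2.0 * min(below, above)

# Toy usage with J = 500 hypothetical permuted criterion values:
rng = np.random.default_rng(1)
cvml_perm = rng.normal(-510.0, 3.0, size=500)
print(evidence(-508.0, cvml_perm))   # EV > 0.05 would support SA
```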
A similar rule can be built using the CCVML criterion. For instance, its value for the test data is
\[
\mathrm{CCVML}_{\mathrm{obs}} = \frac{1}{2} \sum_{i=1}^{n_2} \log P(y^*_{1i} \mid D, x^*_i, y^*_{2i})
+ \frac{1}{2} \sum_{i=1}^{n_2} \log P(y^*_{2i} \mid D, x^*_i, y^*_{1i}). \tag{4.5}
\]
The permutation-based version of (4.5) can be obtained using the same principle as in (4.3), thus leading to the counterpart of (4.4) for CCVML.

Table 4.3 shows the proportion of correct decisions using the proposed methods, with 1000 and 500 samples in the training and test sets, respectively, and J = 500 permutations. The results, especially those for Sc4, clearly show an important improvement in the rate of making the correct selection, with only a slight decrease in the power to detect non-constant calibrations. We can also notice that CVML and CCVML performed similarly.
Table 4.2: The percentage of correct decisions for each selection criterion and scenario. GP-SIM and SA were fitted with the Clayton copula; sample size is 1500.

Table 4.3: The percentage of correct decisions for each selection criterion and scenario. Predicted CVML and CCVML values are based on n1 = 1000 training and n2 = 500 test data, respectively. The calculation of EV is based on a random sample of 500 permutations.
where the overline $\bar a$ signifies the average of the Monte Carlo draws $a^{(t)}$.
Given the vector of calibration function evaluations at the test points, $\hat\eta = (\hat\eta_1, \ldots, \hat\eta_{n_2})$, and a partition $\min(\hat\eta) = a_1 < \ldots < a_{K+1} = \max(\hat\eta)$ of the range of $\hat\eta$ into K disjoint intervals, define the set of observations in D2 that yield calibration function values between $a_k$ and $a_{k+1}$, $B_k = \{i : a_k \le \hat\eta_i < a_{k+1}\}$, $k = 1, \ldots, K$. We choose the partition such that each "bin" $B_k$ has approximately the same number of elements, $n_2/K$. Under SA, the bin-specific estimates for various measures of dependence, e.g. Kendall's τ or Spearman's ρ, computed from the samples $U_i$, are invariant to permutations, or swaps, across bins. Based on this observation, we consider the procedure described in Table 4.4 for identifying data support for SA. The distribution of the resulting test statistic in Method 1 is determined empirically, via permutations.

Table 4.4: Method 1: A permutation-based procedure for assessing data support in favour of SA

A1 Compute the kth bin-specific Kendall's tau $\hat\tau_k$ from $\{U_i : i \in B_k\}$, $k = 1, \ldots, K$.

A2 Compute the observed statistic $T^{\mathrm{obs}} = \mathrm{SD}_k(\hat\tau_k)$ (where $\mathrm{SD}_k(a_k)$ is the standard deviation of $a_k$ over the index k). Note that if SA holds, we expect the observed statistic to be close to zero.

A3 For each permutation $\lambda_j$, $j = 1, \ldots, J$:

A3.1 Compute $\hat\tau_{jk} = \tau(\{U_i : \lambda_j(i) \in B_k\})$, $k = 1, \ldots, K$.

A3.2 Compute the test statistic $T_j = \mathrm{SD}_k(\hat\tau_{jk})$. Note that if SA holds, we expect $T_j$ to be close to $T^{\mathrm{obs}}$.

A4 We consider that there is support in favour of SA at significance level α if $T^{\mathrm{obs}}$ is smaller than the (1 − α)-th empirical quantile calculated from the sample $\{T_j : 1 \le j \le J\}$.

Alternatively, one can rely on the asymptotic properties of the
bin-specific dependence parameter estimates and construct a Chi-square test. Specifically,
suppose the bin-specific Pearson correlations $\hat\rho_k$ are computed from the samples $\{U_i : i \in B_k\}$, for all $k = 1, \ldots, K$. Let $\hat\rho = (\hat\rho_1, \ldots, \hat\rho_K)^T$, and let $n = n_2/K$ be the number of points in each bin. It is known that $\hat\rho_k$ is asymptotically normally distributed for each k, so that

\[
\sqrt{n}(\hat\rho_k - \rho_k) \xrightarrow{d} \mathcal{N}\!\left(0, (1 - \rho_k^2)^2\right),
\]

where $\rho_k$ is the true correlation in bin k. If we assume that $\{\hat\rho_k : k = 1, \ldots, K\}$ are independent, and set $\rho = (\rho_1, \ldots, \rho_K)^T$ and $\Sigma = \mathrm{diag}\left((1 - \rho_1^2)^2, \ldots, (1 - \rho_K^2)^2\right)$, then we have:

\[
\sqrt{n}(\hat\rho - \rho) \xrightarrow{d} \mathcal{N}(0, \Sigma).
\]
B1 Compute the bin-specific Pearson correlation $\hat\rho_k$ from the samples $\{U_i : i \in B_k\}$, for all $k = 1, \ldots, K$. Let $\hat\rho = (\hat\rho_1, \ldots, \hat\rho_K)^T$, and $n = n_2/K$, the number of points in each bin.

B2 Set $\hat\Sigma = \mathrm{diag}\left((1 - \hat\rho_1^2)^2, \ldots, (1 - \hat\rho_K^2)^2\right)$ and $A \in \mathbb{R}^{(K-1) \times K}$ as in (4.6); then under SA we have that $\rho_1 = \ldots = \rho_K$ and

\[
n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}.
\]

Compute $T^{\mathrm{obs}} = n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho)$.

B3 Compute the p-value $= P(\chi^2_{K-1} > T^{\mathrm{obs}})$ and reject SA if the p-value $< \alpha$.

Table 4.5: Method 2: A Chi-square test for assessing data support in favour of SA
In order to combine evidence across bins, we define the matrix $A \in \mathbb{R}^{(K-1) \times K}$ as

\[
A = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
0 & 1 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & -1
\end{pmatrix}. \tag{4.6}
\]

Since under the null hypothesis SA holds, one gets $\rho_1 = \ldots = \rho_K$, implying

\[
n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}.
\]
Method 2, with its steps detailed in Table 4.5, relies on the ideas above to test SA.
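Under the stated assumptions, Method 2 reduces to a few lines of linear algebra. The sketch below is illustrative only: u1 and u2 stand for the test-set pairs $U_i$, assumed already sorted into K equal-size bins by the predicted calibration:

```python
import numpy as np
from scipy.stats import chi2

def method2_pvalue(u1: np.ndarray, u2: np.ndarray, K: int) -> float:
    """Chi-square test of SA from bin-specific Pearson correlations (Table 4.5)."""
    nbin = len(u1) // K
    rho = np.array([np.corrcoef(u1[k*nbin:(k+1)*nbin],
                                u2[k*nbin:(k+1)*nbin])[0, 1] for k in range(K)])
    Sigma = np.diag((1 - rho**2) ** 2)             # asymptotic variances
    A = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)   # difference matrix (4.6)
    d = A @ rho
    T_obs = nbin * d @ np.linalg.solve(A @ Sigma @ A.T, d)
    return 1 - chi2.cdf(T_obs, df=K - 1)           # reject SA if below alpha
```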
Method 1 evaluates the p-value using a randomization procedure [56], while the second method is based on the asymptotic normal theory of Pearson correlations. To get reliable results it is essential to assign test observations to the "correct" bins, which happens when the calibration predictions are as close as possible to the true unknown values, i.e. $\hat\eta(x_i) \approx \eta(x_i)$. The latter depends heavily on the estimation procedure and the sample size of the training set. Therefore it is advisable to apply very flexible models for the calibration function estimation and to have enough data points in the training set. We immediately see a tradeoff: the more observations are assigned to D1, the better the calibration predictions for the test set will be, at the expense of decreasing power due to a smaller sample size in D2. For our simulations we have used n1 ≈ 0.5n and n2 ≈ 0.5n, and K ∈ {2, 3}.

To get the intuition behind the proposed methods, consider an idealized example where
marginals are uniform, the true calibration is known and we have access to "infinite" data; moreover, we focus on the situation with only 2 bins. Note that if SA is true then the correlation in each bin is the same and any permutation should yield the same or a very similar correlation. On the other hand, if SA is not satisfied, and assuming for simplicity that the calibration only takes 2 values, then, since observations are assigned to bins by their calibration values, bin 1 and bin 2 will contain pairs following distributions π1(u1, u2) and π2(u1, u2) with corresponding correlations ρ1 < ρ2, respectively. Note that after a random permutation, the pairs in bins 1 and 2 will follow the mixture distributions λπ1(u1, u2) + (1 − λ)π2(u1, u2) and (1 − λ)π1(u1, u2) + λπ2(u1, u2), respectively, with λ ∈ (0, 1). It follows that the correlations of the permuted data in bins 1 and 2 are λρ1 + (1 − λ)ρ2 and (1 − λ)ρ1 + λρ2. Observe that each of these correlations is between ρ1 and ρ2, which implies that the absolute difference between the two correlations after any permutation must be less than ρ2 − ρ1, which is our observed test statistic. Of course, with real, finite data the observed statistic is not such an obvious outlier, but it should be somewhere in the tail of the distribution of the permuted statistics. In other words, this example illustrates that if SA is not satisfied then our proposed methods should reject this hypothesis (at least for a large enough sample size).
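For concreteness, a sketch of Method 1 under the same assumptions: u1 and u2 hold the (approximate) copula-scale test data and bin_id the bin assignments derived from the predicted calibration, all assumed given:

```python
import numpy as np
from scipy.stats import kendalltau

def method1_pvalue(u1, u2, bin_id, J=500, seed=0):
    """Permutation test of SA based on the SD of bin-specific Kendall's taus."""
    rng = np.random.default_rng(seed)
    K = int(bin_id.max()) + 1

    def stat(idx):
        # tau computed per bin after reassigning pairs via the permutation idx
        taus = [kendalltau(u1[idx][bin_id == k], u2[idx][bin_id == k])[0]
                for k in range(K)]
        return np.std(taus)

    base = np.arange(len(u1))
    t_obs = stat(base)                                                  # step A2
    t_perm = np.array([stat(rng.permutation(base)) for _ in range(J)])  # step A3
    return np.mean(t_perm >= t_obs)        # A4: reject SA if this is below alpha
```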
4.4 Simulations
In this section we present the performance of the proposed methods and comparisons with the generic CVML and WAIC criteria on simulated data sets. Different functional forms of the calibration function, sample sizes and magnitudes of deviation from SA will be explored.
Simulation details
We generate samples of sizes n = 500 and n = 1000 from 3 scenarios described below.
For all scenarios the Clayton copula will be used to model dependence between responses,
while covariates are independently sampled from U [0, 1]. For all scenarios the covariate
dimension is q = 2. The marginal conditional distributions Y1|X and Y2|X are modeled as Gaussian with constant variances $\sigma_1^2$, $\sigma_2^2$ and conditional means $f_1(X)$, $f_2(X)$, respectively. The model parameters must be estimated jointly with the calibration function η(X). For convenience we parametrize the calibration on the Kendall's tau τ(X) scale.
Sc1 f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
    f2(x) = 0.6 sin(3x1 + 5x2),
    τ(x) = 0.5, σ1 = σ2 = 0.2.

Sc2 f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
    f2(x) = 0.6 sin(3x1 + 5x2),
    τ(x) = δ + γ × sin(10x^T β),
    β = (1, 3)^T/√10, σ1 = σ2 = 0.2.

Sc3 f1(x) = 0.6 sin(5x1) − 0.9 sin(2x2),
    f2(x) = 0.6 sin(3x1 + 5x2),
    τ(x) = δ + γ × 2(x1 + cos(6x2) − 0.45)/3,
    σ1 = σ2 = 0.2.
Sc1 corresponds to SA since Kendall's τ is independent of the covariate level. The calibration function in Sc2 has a single index form, while in Sc3 it has an additive structure on the τ scale (generally not additive on the η scale); these simulations are useful for evaluating performance under model misspecification. We note that τ in Sc2 and Sc3 depends on the parameters δ (average correlation strength) and γ (deviation from SA), which in this study take values δ ∈ {0.25, 0.75} and γ ∈ {0.1, 0.2}, respectively.
Simulation results
For each sample size and scenario we have repeated the analysis using 250 independently replicated data sets. For each data set, the GP-SIM model suggested in Chapter 2 ([57]) is fitted. Such Bayesian non-parametric models are much more flexible than parametric ones and can effectively capture various patterns. The inference is based on 5000 MCMC samples for all scenarios, as the chains were run for 10000 iterations with 5000 samples discarded as burn-in. The number of inducing inputs was set to 30 for all GPs. For generic SA testing, GP-SIM fitting is done on the whole data set and the posterior draws are used to estimate CVML, CCVML and WAIC. Since the proposed methods require data splitting, we first randomly divide the data equally into training and testing sets. We fit GP-SIM on the training set and then use the obtained posterior draws to construct point estimates of $\hat F_1(y_{1i}|x_i)$, $\hat F_2(y_{2i}|x_i)$ and $\hat\eta(x_i)$ for every observation in the test set. In Method 1 we used 500 permutations.
Table 4.6 shows the percentage of SA rejections for generic Bayesian selection criteria. The
presented results clearly illustrate that generic methods have difficulties identifying SA. This
leads to a loss of statistical efficiency since a complex model is selected over a much simpler
one. In the context of CVML or CCVML it can be explained by observing that both
Table 4.6: Simulation results: generic criteria; proportion of rejections of SA for each scenario and sample size.
defined in Section 4.3, then canonical asymptotic results imply that if $\rho_1 = \ldots = \rho_K$ and as $n \to \infty$,

\[
T = n(A\hat\rho)^T (A\hat\Sigma A^T)^{-1} (A\hat\rho) \xrightarrow{d} \chi^2_{K-1}. \tag{4.7}
\]
Based on the model fitted on D1, we define estimates of $F_1(y_{1i}|x_i)$ and $F_2(y_{2i}|x_i)$ and set $U = \{U_i = (\hat F_1(y_{1i}|x_i), \hat F_2(y_{2i}|x_i))\}_{i=1}^{n_2}$. Note that U depends on D1 and X (the covariates in the test set). Given a fixed number of bins K and assuming, without loss of generality, equal sample sizes in each bin, $n = n_2/K$, we define a test statistic $T(U)$ as in (4.7) with $\hat\rho_j$ estimated from $\{U_{(j-1)n+1}, \ldots, U_{jn}\}$, for $1 \le j \le K$.

Note that in Method 2, test cases are assigned to "bins" based on the value of the predicted calibration function $\hat\eta(x_i)$, which is not taken into account in the generic definition of the test statistic $T(U)$ above. To close this gap, we introduce a permutation $\lambda^* : \{1, \ldots, n_2\} \to \{1, \ldots, n_2\}$ that "sorts" U from the smallest $\hat\eta(x)$ value to the largest, i.e. $U_{\lambda^*} = \{U_{\lambda^*(i)}\}_{i=1}^{n_2}$ with $\hat\eta(x_{\lambda^*(1)}) < \hat\eta(x_{\lambda^*(2)}) < \cdots < \hat\eta(x_{\lambda^*(n_2)})$. Hence, the test statistic in Method 2 has the form $T(U_{\lambda^*})$
as in (4.7), but in this case the test cases with the smallest predicted calibrations are assigned to the first group, or bin, and those with the largest calibrations to the Kth group/bin. Finally, define a test function φ with specified significance level α to test SA:

\[
\phi(U \mid \lambda^*) =
\begin{cases}
1 & \text{if } T(U_{\lambda^*}) > \chi^2_{K-1}(1-\alpha), \\
0 & \text{if } T(U_{\lambda^*}) \le \chi^2_{K-1}(1-\alpha).
\end{cases} \tag{4.8}
\]
Intuitively, if SA is false then we would expect $T(U_{\lambda^*})$ to be larger than the critical value $\chi^2_{K-1}(1-\alpha)$.
The goal is to show that this procedure has probability of Type I error equal to α, which can be written as the expectation of the test function:

\[
P(\text{Type I error}) = \int \phi(U \mid \lambda^*)\, P(\lambda^* \mid D_1, X)\, P(U \mid D_1, X)\, P(D_1)\, P(X)\, dU\, dD_1\, dX\, d\lambda^*. \tag{4.9}
\]
Note that $\lambda^*$ does not depend on U because of the data splitting into training and test sets. Also, usually $P(\lambda^* \mid D_1, X)$ is just a point mass at some particular permutation. In general the above integral cannot be evaluated; however, if we assume that for all test cases

\[
\hat F_1(y_{1i} \mid x_i) \xrightarrow{p} F_1(y_{1i} \mid x_i) \quad \text{and} \quad
\hat F_2(y_{2i} \mid x_i) \xrightarrow{p} F_2(y_{2i} \mid x_i) \quad \text{as } n \to \infty,\ \forall i, \tag{4.10}
\]

then under SA and as $n \to \infty$, $P(U \mid D_1, X) = P(U) \approx \prod_{i=1}^{n_2} c(u_{1i}, u_{2i})$, where c is a copula density, and the expectation becomes:
\[
P(\text{Type I error}) = \int \phi(U \mid \lambda^*)\, P(\lambda^* \mid D_1, X)\, P(U)\, P(D_1)\, P(X)\, dU\, dD_1\, dX\, d\lambda^*
\]
\[
= \int \left( \int \phi(U \mid \lambda^*)\, P(U)\, dU \right) P(\lambda^* \mid D_1, X)\, P(D_1)\, P(X)\, dD_1\, dX\, d\lambda^* = \alpha, \tag{4.11}
\]

since, if SA is true, $\int \phi(U \mid \lambda^*) P(U)\, dU = \alpha$ for any $\lambda^*$. Therefore, if the marginal CDF predictions for the test cases are consistent, then this procedure has the required probability of Type I error for a sufficiently large sample size.
4.6 Extensions to other models

The proposed idea of dividing the data into training and test subsets, splitting the observations in the test set into bins and then using a test to check for distributional differences between bins can be extended to other models. For example, one can use a similar construction in multiple mean regression, logistic or quantile regression problems to check SA. The proposed approaches for assessing SA can be particularly useful when f(X) (the conditional mean or quantile) is assumed to have a complex form, and flexible models such as generalized additive models, additive tree structures, or Bayesian non-parametric methods are utilized. Simulations we conducted suggest that generic testing procedures yield large Type I error probabilities for non-parametrically fitted models, a problem that is attenuated using the permutation-based ideas described in this chapter. Below we describe how to modify Methods 1 and 2 for regression problems and conduct a series of simulations to compare the performance (probability of Type I error and power) of the proposed algorithms with standard testing procedures used in the literature. Also note that, in contrast to the Bayesian view we have adopted for conditional copula problems, we use the frequentist paradigm for the regression problems below.
4.6.1 Multiple Regression
Multiple regression with Gaussian errors is used extensively in applied statistics for its simplicity and theoretical properties. Here we assume that the errors are identically distributed (no heteroskedasticity). It is frequently of interest to test whether all the predictors (or covariates) contribute to the prediction of the response. We therefore define the "full" model as the model that contains all the relevant covariates, while the "reduced", or SA, model contains only an intercept. For linear multiple regression we could use the global F test [28], whose distribution under SA is known, but, as will be shown, when more general models for the conditional mean are fitted the generic F test no longer exhibits the correct significance level.
For this set of simulations we assume that $y_i = f(X_i) + \varepsilon_i$ for $i = 1, \ldots, n$, with the $\varepsilon_i$ independent and identically distributed. We generate samples of sizes n = 500 and n = 1000 from 5 scenarios described below. For all scenarios the covariates are independently sampled
Sc1 and Sc4 correspond to SA, as the conditional means do not depend on the covariates. Sc2 and Sc3 represent nonlinear models with Gaussian errors; the former has an additive structure, the latter an interaction between covariates. Note that both depend on the parameter γ, which controls the deviation from SA. Sc4 and Sc5 include errors from the Cauchy distribution, which has much heavier tails than the Gaussian. These scenarios are useful for evaluating the performance of the testing algorithms when the assumption of normality is violated.
To apply the proposed testing procedures to this regression problem we do the following: divide the whole data set into training and test sets (half to each set, so that $n_1 = n_2 = \lfloor n/2 \rfloor$) and fit a flexible model on the training set; here we use a Generalized Additive Model (GAM) with cubic splines for each component (the penalty is also estimated in the same fitting). With the estimated parameters we compute the prediction $\hat f(x)$ of the function f on the test set (similar to the estimated calibration function in the conditional copula problem) and, based on $\hat f(x)$, split the test set into K bins so that the number of points in each bin is $n = n_2/K$. Here we can implement either the Chi-square or the permutation approach. For the Chi-square test (Method 2), let

\[
\bar{y} = (\bar y_1, \ldots, \bar y_K)^T,
\]
where $\bar y_k$ is the average of the responses in bin $k \in \{1, \ldots, K\}$. Then for regression with Gaussian noise it follows trivially from normal theory that under SA:

\[
n(A\bar{y})^T (A\hat\Sigma A^T)^{-1} (A\bar{y}) \sim \chi^2_{K-1}, \tag{4.12}
\]
where A is as in (4.6) and $\hat\Sigma = \mathrm{diag}(\hat\sigma^2, \ldots, \hat\sigma^2)$, with $\hat\sigma^2$ estimated from the GAM fit on the training set. Note that this result is approximately true for non-Gaussian noise (with finite variance) and sufficiently large n, by the Central Limit Theorem. This result can be used to assess evidence against SA. The permutation test (Method 1) is constructed by first finding the observed test statistic $T^{\mathrm{obs}} = \bar y_{(K)} - \bar y_{(1)}$ (largest minus smallest average) and then, by permuting the responses, computing the proportion of permutations with permuted test statistics $T_j$, $j = 1, \ldots, J$, greater than $T^{\mathrm{obs}}$. This proportion is an estimate of the p-value, and if it is less than the pre-specified α, SA is rejected. For all scenarios we set the significance level α = 0.05.
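A condensed sketch of this regression version, with a univariate polynomial fit standing in for the GAM used in the text (the helper name and the degree are illustrative, not the thesis implementation):

```python
import numpy as np
from scipy.stats import chi2

def regression_sa_pvalue(x_tr, y_tr, x_te, y_te, K=3, degree=5):
    coef = np.polyfit(x_tr, y_tr, degree)            # flexible "full" fit (train)
    sigma2 = np.var(y_tr - np.polyval(coef, x_tr))   # error variance estimate
    order = np.argsort(np.polyval(coef, x_te))       # bin test cases by f_hat
    nbin = len(x_te) // K
    ybar = np.array([y_te[order[k*nbin:(k+1)*nbin]].mean() for k in range(K)])
    A = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)     # as in (4.6)
    d = A @ ybar
    T = nbin * d @ np.linalg.solve(sigma2 * (A @ A.T), d)   # statistic (4.12)
    return 1 - chi2.cdf(T, df=K - 1)
```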
We compare these methods to the following generic approaches to SA testing. Since we focus on a non-linear conditional mean, the standard procedure is to fit a flexible non-parametric model (on the whole data set), find $\mathrm{SSE}_{\mathrm{full}}$ (the sum of squared residuals) and the degrees of freedom $\mathrm{DF}_{\mathrm{full}}$, and compare them to $\mathrm{SSE}_{\mathrm{red}}$ and $\mathrm{DF}_{\mathrm{red}} = 1$ of the model with only an intercept. This is an instance of the partial F statistic, which has an exact F distribution under SA if the fitted model is linear in the parameters:

\[
F^* = \frac{(\mathrm{SSE}_{\mathrm{red}} - \mathrm{SSE}_{\mathrm{full}})/(\mathrm{DF}_{\mathrm{full}} - \mathrm{DF}_{\mathrm{red}})}{\mathrm{SSE}_{\mathrm{full}}/(n - \mathrm{DF}_{\mathrm{full}})} \sim F(\mathrm{DF}_{\mathrm{full}} - \mathrm{DF}_{\mathrm{red}},\, n - \mathrm{DF}_{\mathrm{full}}), \tag{4.13}
\]
where F(a, b) represents the F-distribution with a and b degrees of freedom for the numerator and denominator, respectively. We denote this approach by (F-test). Note that the above exact distribution may not be valid when a non-parametric model such as a Generalized Additive model, Random Forest, Support Vector machine or Gaussian process is fitted. That is why we also consider the less realistic test (F-exact), where the observed test statistic F∗ is compared to the exact critical value (corresponding to the α level), which is estimated by repeatedly (M times) generating the response vector Y from SA (keeping covariates fixed), each time fitting the "full" and "reduced" models and calculating the test statistic F∗. Once we have the approximate distribution of the test statistic F∗ under SA, we can estimate the critical value by taking the empirical (1 − α) quantile. For the simulations below we set M = 200; note that this approach may not be feasible in practice when the dimensionality of the data is large and/or fitting the full model is computationally costly. We also show the performance of the AIC and BIC criteria, whereby after fitting the "full" and "reduced" models, the model with the smallest criterion value is selected. Since we assume that the errors are identically distributed, it follows that in these examples SA implies independence of covariates and response; we therefore introduce a Bootstrap test for independence, which again is usually not feasible in real-world problems due to computational inefficiency. For this test, given pairs $\{y_i, x_i\}_{i=1}^n$, we fit the "full" model and calculate some measure $T^{\mathrm{obs}}$; then we consider J permutations $\lambda_j : \{1, \ldots, n\} \to \{1, \ldots, n\}$, $j = 1, \ldots, J$. For each j we fit the "full" model to $\{y_{\lambda_j(i)}, x_i\}_{i=1}^n$ and calculate $T_j$; finally, we reject SA (or independence) if $T^{\mathrm{obs}}$ is greater than the (1 − α) quantile of $\{T_j\}_{j=1}^J$. For the discrepancy measure T we consider $-\mathrm{SSE}_{\mathrm{full}}$ (BOOT-SSE) and F∗ (BOOT-F) as in Eq. (4.13); note that these measures are "large" when SA is false. Similar to the (F-exact) test, these tests require many estimations of the complicated (or full) model; for all simulations we set J = 100.
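For reference, a sketch of the generic partial F statistic (4.13) and its BOOT-F permutation variant, again with a simple polynomial stand-in for the flexible "full" model (names and degree are illustrative):

```python
import numpy as np
from scipy.stats import f as f_dist

def partial_f(x, y, degree=5):
    coef = np.polyfit(x, y, degree)
    sse_full = np.sum((y - np.polyval(coef, x)) ** 2)
    sse_red = np.sum((y - y.mean()) ** 2)            # intercept-only model
    df_full, df_red, n = degree + 1, 1, len(y)
    F = ((sse_red - sse_full) / (df_full - df_red)) / (sse_full / (n - df_full))
    return F, 1 - f_dist.cdf(F, df_full - df_red, n - df_full)

def boot_f_pvalue(x, y, J=100, seed=0):
    rng = np.random.default_rng(seed)
    F_obs, _ = partial_f(x, y)
    F_perm = np.array([partial_f(x, rng.permutation(y))[0] for _ in range(J)])
    return np.mean(F_perm >= F_obs)                  # BOOT-F p-value estimate
```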
We set the "full" model to be a GAM with cubic splines for each covariate and estimation of the degree of smoothness. For better comparison, in addition to fitting the flexible GAM we also fit additive multiple regressions with polynomial degrees 1, 3 and 10 for each predictor. The SA is tested with the standard global (F-test) in (4.13).

For every scenario we generate 500 sets of responses (keeping covariates fixed) and test the simplifying assumption using all of the described procedures with significance level α = 0.05.
Results:
Table 4.8: Simulation results for regression: generic criteria; proportion of rejections of SA for each scenario and sample size.
4.6.2 Logistic Regression

We generate samples of sizes n = 500 and n = 1000 from these four scenarios; covariates are independently sampled from U[−1, 1], with covariate dimension q = 2 for the first two scenarios and q = 10 for Sc3 and Sc4. Sc1 and Sc3 correspond to SA, as the probability of Y = 1 is set to 0.5 and does not depend on the covariates. In Sc2 the function f(X) has an additive structure, while Sc4 represents a nonlinear, non-additive dependence of the probability on the predictors. Note that Sc2 and Sc4 have an additional parameter γ which is associated with the deviation from SA.
To apply the proposed testing procedures to this logistic problem we do the following: divide the whole data set into training and test sets (half to each set, so that $n_1 = n_2 = \lfloor n/2 \rfloor$) and fit a flexible model on the training set; here we use a Generalized Additive Model (GAM) with cubic splines for each component. With the estimated parameters we compute the prediction $\hat f(x)$ of the function f on the test set (similar to the estimated calibration function in conditional copula problems) and, based on $\hat f(x)$, split the test set into K bins so that the number of points in each bin is $n = n_2/K$. Here we can implement either the Chi-square (Method 2) or the permutation (Method 1) approach. For the Chi-square test, let

\[
\bar{p} = (\bar p_1, \ldots, \bar p_K)^T,
\]

where $\bar p_k$ is the average of the responses, or the sample proportion, in bin $k \in \{1, \ldots, K\}$. If n is large enough, then by the Central Limit Theorem we get that under SA the following approximate distributional result holds:

\[
n(A\bar{p})^T (A\hat\Sigma A^T)^{-1} (A\bar{p}) \;\overset{\cdot}{\sim}\; \chi^2_{K-1}, \tag{4.15}
\]
where A is as in Eq. (4.6) and $\hat\Sigma = \mathrm{diag}(\hat p_0(1-\hat p_0), \ldots, \hat p_0(1-\hat p_0))$, with $\hat p_0$ estimated as the sample proportion of all the responses in the test set. This result can be used to assess evidence against SA. The permutation test is performed by first finding the observed test statistic $T^{\mathrm{obs}} = \bar p_{(K)} - \bar p_{(1)}$ (largest minus smallest sample proportion) and then, by permuting the responses, computing the proportion of permutations with test statistics $T_j$, $j = 1, \ldots, J$, greater than $T^{\mathrm{obs}}$ (here J is the number of permutations). This proportion is an estimate of the p-value, and if it is less than α, SA is rejected. For all scenarios we set the significance level α = 0.05.
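A sketch of the logistic version of the Chi-square check (4.15); p_hat_te denotes the training-fit predictions on the test half and is assumed given (the helper name is illustrative):

```python
import numpy as np
from scipy.stats import chi2

def logistic_sa_pvalue(p_hat_te, y_te, K=3):
    order = np.argsort(p_hat_te)                     # bin test cases by f_hat
    nbin = len(y_te) // K
    pbar = np.array([y_te[order[k*nbin:(k+1)*nbin]].mean() for k in range(K)])
    p0 = y_te.mean()                                 # pooled proportion under SA
    A = np.eye(K - 1, K) - np.eye(K - 1, K, k=1)     # as in (4.6)
    d = A @ pbar
    T = nbin * d @ np.linalg.solve(p0 * (1 - p0) * (A @ A.T), d)  # (4.15)
    return 1 - chi2.cdf(T, df=K - 1)
```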
We compare these algorithms to the following generic approaches to SA testing. Since we focus on a non-linear function f(X), the standard procedure is to fit a flexible "full" model (on the whole data set), find $\mathrm{DEV}_{\mathrm{full}}$ (the deviance of this model) and the degrees of freedom $\mathrm{DF}_{\mathrm{full}}$, and compare them to $\mathrm{DEV}_{\mathrm{red}}$ and $\mathrm{DF}_{\mathrm{red}}$ of the model with only an intercept; see for example [27]. Here the deviance is defined as −2 times the log-likelihood, and the deviance of the "full" model must be smaller than that of the "reduced" one. This is an example of the likelihood ratio test, which has an approximate $\chi^2$ distribution under SA:

\[
T^* = \mathrm{DEV}_{\mathrm{red}} - \mathrm{DEV}_{\mathrm{full}} \;\overset{\cdot}{\sim}\; \chi^2_{\mathrm{DF}_{\mathrm{full}} - \mathrm{DF}_{\mathrm{red}}}. \tag{4.16}
\]
We denote this approach by (T-test). Note that the above result is generally used for nested models with the full model having parametric form and, as will be illustrated later, it may not be valid when a non-parametric model such as a Generalized Additive model, Random Forest or Gaussian process is fitted for f(X). That is why we also consider the less realistic test (T-exact), where the observed test statistic T∗ is compared to the exact critical value (corresponding to the α level), which is estimated by repeatedly (M times) generating the response Y from SA (keeping covariates fixed), each time fitting the "full" and "reduced" models and calculating the test statistic T∗. Once we obtain an approximate distribution of the test statistic T∗ under SA, we can estimate the critical value by taking the (1 − α) quantile. For the simulations below we set M = 200; note that this approach may not be feasible in practice when the dimensionality of the data is large and the "full" model is computationally expensive to fit. We also show the performance of the AIC and BIC criteria, whereby after fitting the "full" and "reduced" models, the model with the smallest criterion value is selected. Since we assume that the Yi are independent, it follows that in this example SA implies independence; we therefore introduce a generic bootstrap test for independence, which again may not be feasible in real-world problems. For this test, given pairs $\{y_i, x_i\}_{i=1}^n$, we fit the "full" model and calculate a measure $T^{\mathrm{obs}}$; then we consider J permutations $\lambda_j : \{1, \ldots, n\} \to \{1, \ldots, n\}$, $j = 1, \ldots, J$. For each j we fit the same model to $\{y_{\lambda_j(i)}, x_i\}_{i=1}^n$ and calculate $T_j$; finally, we reject SA (or independence) if $T^{\mathrm{obs}}$ is greater than the (1 − α) quantile of $\{T_j\}_{j=1}^J$. For the discrepancy measure T we consider $\mathrm{LogLik}_{\mathrm{full}}$ (BOOT-LL) and T∗ (BOOT-T) as in Eq. (4.16); note that these measures are "large" when SA is false. Similar to the (T-exact) test, these tests for independence require many estimations of the "full" model; for all simulations we set J = 100.
We set here the "full" model to be a GAM with cubic splines and estimation of the smoothing parameter for each predictor. For better comparison, in addition to fitting the flexible GAM we also fit simple additive models with polynomial degrees 1, 3 and 10 for every covariate. The SA is tested with the standard global (T-test) as in (4.16).

For every scenario we generate 500 sets of responses (keeping covariates fixed) and test the simplifying assumption using all of the described procedures with significance level α = 0.05.
Results:
Table 4.10 presents the proportion of rejections using the generic testing procedures described above. First we examine the probability of Type I error by looking at Sc1 and Sc3. Note that the standard likelihood ratio test (T-test) has a very large rejection rate for scenarios 1 and 3; it even exceeds 74% for Sc3. Note that the error increases with the dimensionality of the covariates. AIC also has a much larger probability of Type I error than the expected 5%. BIC, on the other hand, produces a very small rejection rate under SA, but its power in the other scenarios is the lowest among the criteria. The likelihood ratio test works quite well for the polynomial models (except the 10-degree one), as predicted by the theory. However, as with the regression example, the power of the polynomial models depends significantly on the degree, and choosing the correct degree for each
Table 4.10: Simulation results for logistic regression: generic criteria; proportion of rejections of SA for each scenario and sample size.

Table 4.11: Simulation results for logistic regression: proposed methods; proportion of rejections of SA for each scenario, sample size and number of bins.
Table 4.13: Simulation results for quantile regression: proposed methods; proportion of rejections of SA for each scenario, sample size, τ and number of bins.
which is certainly surprising. The likelihood ratio test and AIC also behave very strangely, since their rejection rate changes from 0 to 93% as τ increases from 0.1 to 0.9. This can be explained by the observation that both these measures require a likelihood function, which is not assumed to have a particular form when fitting a quantile regression. Hence we
can conclude that utilizing these built-in functions can lead to significant errors. As with the mean and logistic regressions, the F-exact and bootstrap procedures have the required 5% significance level. Another observation is that BOOT-SSE has much larger power than BOOT-$\sum \hat\beta_i^2$. The quantile value τ plays a very important role in this example. Note that for Sc2 the power of F-exact is around 100% when τ = 0.1 and then declines to 4% when τ = 0.9. The BOOT-SSE method, on the other hand, attains a power that is quite small but very stable as τ changes. Note that for the quantile regression, to get critical values for F-exact we need to simulate from the correct model under SA. This is generally not feasible, since in real-world problems we do not know the actual distribution that generated the data.
Table 4.13 shows the proportion of rejections for the permutation and Pearson χ2 tests using "bins". The probability of Type I error for both of these methods is around 5% for every τ level. Again the number of bins K does not impact the performance, and Method 1 works here a little better than Method 2. When τ = 0.1, Methods 1 and 2 have power very similar to the best generic test, F-exact (and much better than BOOT-SSE). If τ = 0.9, Methods 1 and 2 again have power similar to F-exact, but both have much less power than BOOT-SSE. When the conditional median (τ = 0.5) is estimated, the proposed methods have lower power than both F-exact and BOOT-SSE.
Based on these observations we can conclude that for quantile regression the SA assessment crucially depends on the τ level. There is no single method that always works. The two novel SA testing procedures are much faster and computationally more manageable than the F-exact or Bootstrap tests and work very well for small quantile values; however, as τ increases, BOOT-SSE should be implemented. Of course, this is only true in this example, where the noise has a $\chi^2_1$ distribution, which is skewed to the right; for other scenarios we may observe a different dependence on τ.
Chapter 5
Data Analysis
5.1 Red Wine Data
We consider the data of [18], consisting of various physicochemical tests of 1599 red variants of the Portuguese "Vinho Verde" wine. Acidity and density are properties closely associated with the quality of the wine and the grape, respectively. Of interest here is to study the dependence pattern between 'fixed acidity' (Yfa) and 'density' (Yde) and how it changes with the values of the other covariates. The plots in Figure 5.2 are constructed by varying one predictor while fixing all others at their mid-range values.
The plots clearly demonstrate that when covariates are fixed at their mid-range values, the
conditional correlation between ‘fixed acidity’ and ‘density’ increases with ‘volatile acidity’,
‘free sulfur dioxide’, ‘total sulfur dioxide’, ‘pH’, ‘sulphates’ and ‘alcohol’, and decreases with
levels of ‘citric acid’. These relationships can influence the preparation method of the wine.
In order to demonstrate the difficulty one would have in gauging the complex evolution
of dependence between two responses as a function of covariates we plot in Figure 5.3 the
response variables together as they vary with each covariate. It is clear that the model
manages to identify a pattern that would be very difficult to distinguish without the help of
Figure 5.2: Wine Data: Slices of predicted Kendall's τ as a function of the covariates. Red curves represent 95% credible intervals. (Panels: Kendall's τ against volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, pH, sulphates and alcohol.)
Figure 5.3: Wine Data: Plots of 'fixed acidity' (blue) and 'density' (red) (linearly transformed to fit on one plot) against the same nine covariates.
a flexible mathematical model.
Part II
Approximated Bayesian Methods
Chapter 6
Introduction
6.1 The Need for Simulation-Based Methods
When data $y_0 \in \mathcal{X}^n$ is observed and the sampling distribution has density function $f(y_0|\theta)$ indexed by a parameter $\theta \in \mathbb{R}^q$, Bayesian inference for functions of θ relies on the characteristics of the posterior distribution:

\[
\pi(\theta \mid y_0) = \frac{p(\theta) f(y_0 \mid \theta)}{\int_{\mathbb{R}^q} p(\theta) f(y_0 \mid \theta)\, d\theta} \propto p(\theta) f(y_0 \mid \theta), \tag{6.1}
\]
where p(θ) denotes the prior distribution.
Since the early 1990s, Bayesian statisticians have been able to operate largely free of computation-induced constraints due to the rapid development of Markov chain Monte Carlo (MCMC) sampling methods [see, for example, 19, for a recent review]. This class of methods allows one to produce samples from π in (6.1) despite its often intractable denominator. While traditional MCMC samplers such as Metropolis-Hastings or Hamiltonian MCMC [see 14, and references therein] can draw from distributions with unknown normalizing constants, they rely on a closed form for the unnormalized posterior, i.e. $p(\theta) f(y_0|\theta)$ (as was discussed in Chapter 1).
Larger data sets should yield answers to more complex problems. The latter can be tackled statistically using increasingly complex models, in which case the sampling distribution is often no longer available in closed form. In these complex settings, a much weaker assumption often holds, namely that, for any $\theta \in \mathbb{R}^q$, draws $y \sim f(y|\theta)$ can be sampled. To get a motivation for the simulation-based methods, consider for example the Hidden Markov Model:

\[
X_0 \sim P(x_0), \qquad
X_i \mid x_{i-1} \sim P(X_i \mid x_{i-1}, \theta), \quad i = 1, \ldots, n, \qquad
Y_i \mid x_i \sim P(Y_i \mid x_i, \theta), \quad i = 1, \ldots, n. \tag{6.2}
\]
Note that, except for examples with Gaussian transition and emission distributions, the marginal distribution $P(y_1, \ldots, y_n \mid \theta)$ cannot be calculated in closed form. It is possible to treat the hidden random variables $X_i$ as auxiliary and sample them along with the parameters using Particle MCMC (PMCMC) [8] or ensemble MCMC [82], but this becomes increasingly difficult as n increases. Moreover, for some financial time series models [69] (Stochastic Volatility for log returns, for example) the α-stable distribution may be useful for modelling the transition and/or emission probabilities. The challenge is that stable distributions do not have a closed-form density but can be sampled from, and hence particle and ensemble MCMC are not feasible. Other widely used examples where the likelihood function cannot be expressed analytically include various network models [51] and Markov random fields [79].
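As an illustration of this "can simulate but cannot evaluate" situation, the sketch below (illustrative parameter values only; not a model fitted in this thesis) simulates a stochastic-volatility-style HMM with α-stable emissions: forward sampling is trivial, while the marginal likelihood has no closed form because the α-stable density does not:

```python
import numpy as np
from scipy.stats import levy_stable

def simulate_sv(n, phi=0.9, sigma=0.3, alpha=1.7, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    x[0] = rng.normal(scale=sigma / np.sqrt(1 - phi**2))  # stationary start
    for i in range(1, n):
        x[i] = phi * x[i - 1] + sigma * rng.normal()      # latent AR(1) state
    # alpha-stable emissions scaled by exp(x/2): easy to sample,
    # but the emission density (and hence the likelihood) is intractable.
    return np.exp(x / 2) * levy_stable.rvs(alpha, 0.0, size=n, random_state=rng)

y0 = simulate_sv(500)
```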
In the absence of a tractable likelihood function, statisticians have turned to approximate
methods to perform Bayesian inference. Here we consider two alternative approaches that
have been proposed and gained momentum recently: the Approximate Bayesian Computa-
tion (ABC) [60, 10, 83, 25] and the Bayesian Synthetic Likelihood (BSL) [94, 26, 73]. These
algorithms are only based on pseudo-data simulations from f(y|θ) and do not require a
tractable form of the likelihood. Both algorithms are effective when they are combined with
Markov chain Monte Carlo sampling schemes to produce samples from an approximation of
π and both share the potential need for intense computational effort at each update. In the
next sections we describe in detail existing methods for Approximate Bayesian Computation
and Bayesian Synthetic Likelihood.
6.2 Approximate Bayesian Computation (ABC)
For models with intractable or computationally expensive likelihood evaluations, simulation-based algorithms such as ABC are frequently the only choice for inference. In its simplest form, ABC is an accept/reject sampler. Suppose we observe the data y0; given a user-defined summary statistic $S(y) \in \mathbb{R}^p$, the Accept/Reject ABC repeatedly samples parameters ζ∗ from the prior, each time simulates pseudo-data $y^* \sim f(y|\zeta^*)$, and then compares $S(y^*)$ with $S(y_0)$. If the two are the same, the generated ζ∗ is accepted; otherwise it is rejected. See Algorithm 2. We emphasize that a closed-form equation for the likelihood is not needed, only the ability to generate from f(y|θ) for any θ.

Algorithm 2 Accept/Reject ABC
1: Given observed y0 and required number of samples M.
2: for t = 1, · · · , M do
3:   Match = FALSE
4:   while Not Match do
5:     ζ∗ ∼ p(θ)
6:     y ∼ f(y|ζ∗)
7:     if S(y) = S(y0) then
8:       θ(t) = ζ∗.
9:       Match = TRUE
10:    end if
11:  end while
12: end for

If S(y) is a sufficient
statistic and Pr(S(y) = S(y0)) > 0, then the algorithm yields samples from the true posterior π(θ|y0). Alas, the level of complexity of the models for which ABC is needed makes it unlikely for these two conditions to hold. In order to implement ABC under more realistic assumptions, a (small) constant ε is chosen and ζ∗ is accepted whenever d(S(y), S(y0)) < ε, where d(S(y), S(y0)) is a user-defined distance function. If we denote the distribution of the accepted draws by πε(θ|S(y0)), then we obtain
\[
\lim_{\varepsilon \downarrow 0} \pi_\varepsilon(\theta \mid S(y_0)) = \pi(\theta \mid S(y_0)). \tag{6.3}
\]
In light of (6.3) one would like to have S(y) = y, but if the sample size of y0 is large, then the curse of dimensionality leads to Pr(d(y, y0) < ε) ≈ 0. Thus, getting even a moderate number of samples using ABC can be unattainable in this case. Unless S is sufficient, some information about θ is lost, so much attention is placed on finding appropriate low-dimensional summary statistics [see, for example, 29, 72]. We assume that the summary statistic function S(·) is given. While Accept/Reject ABC can be used to sample from the posterior distribution of θ, the computational cost is prohibitively large when we require closeness between pseudo and observed data, i.e. d(S(y), S(y0)) < ε. This imposes constraints on the size of the threshold ε, which ends up being selected based on the available computational power and time rather than on other factors such as precision.
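A compact sketch of Algorithm 2 with the ε-tolerance relaxation d(S(y), S(y0)) < ε; simulate, summary and prior_rvs are user-supplied placeholders, and the toy example below infers a normal mean from the sample-mean statistic:

```python
import numpy as np

def abc_reject(s0, prior_rvs, simulate, summary, eps, M, seed=0):
    rng = np.random.default_rng(seed)
    accepted = []
    while len(accepted) < M:
        zeta = prior_rvs(rng)                 # draw parameter from the prior
        s = summary(simulate(zeta, rng))      # pseudo-data -> statistic
        if np.linalg.norm(s - s0) < eps:      # accept when close to observed
            accepted.append(zeta)
    return np.array(accepted)

# Toy usage: infer a normal mean from the sample-mean statistic.
rng = np.random.default_rng(1)
y_obs = rng.normal(2.0, 1.0, 100)
draws = abc_reject(
    s0=np.array([y_obs.mean()]),
    prior_rvs=lambda r: r.normal(0.0, 5.0),
    simulate=lambda th, r: r.normal(th, 1.0, 100),
    summary=lambda y: np.array([y.mean()]),
    eps=0.1, M=200,
)
```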
Under weak or no information about the parameters in the model, the prior and the posterior may be misaligned, i.e. their regions of mass concentration do not overlap. Hence, parameter values drawn from the prior will rarely be retained, making the algorithm very inefficient. Algorithm 3 presents the ABC-MCMC algorithm of [61], which avoids sampling from the prior and instead relies on a chain with a Metropolis-Hastings (MH) transition kernel, with state space $\{(\theta, y) \in \mathbb{R}^q \times \mathcal{X}^n\}$, proposal distribution $q(\zeta|\theta) \times f(y|\zeta)$ and target distribution

\[
\pi_\varepsilon(\theta, y \mid y_0) \propto p(\theta) f(y \mid \theta)\, 1_{\{\delta(y_0, y) < \varepsilon\}}, \tag{6.4}
\]

where $\delta(y_0, y) = d(S(y), S(y_0))$. Note that the goal is the marginal distribution for θ, which is

\[
\pi_\varepsilon(\theta \mid y_0) \propto p(\theta)\, P(\delta(y_0, y) < \varepsilon \mid \theta). \tag{6.5}
\]

Therefore, if we knew the conditional probability $P(\delta(y_0, y) < \varepsilon \mid \theta)$ for every θ, then we could run an MH algorithm to sample from the approximate target given in (6.5).

Algorithm 3 ABC MCMC
1: Given y0, ε > 0 and required number of samples M.
2: Find initial θ(0) with simulated y such that d(S(y), S(y0)) < ε.
3: for t = 1, · · · , M do
4:   Generate ζ∗ ∼ q(·|θ(t−1)).
5:   Simulate y∗ ∼ f(y|ζ∗) and let δ∗ = d(S(y∗), S(y0)).
6:   Calculate α = min(1, [1{δ∗<ε} p(ζ∗) q(θ(t−1)|ζ∗)] / [p(θ(t−1)) q(ζ∗|θ(t−1))]).
7:   Generate independent U ∼ U(0, 1).
8:   if U ≤ α then
9:     θ(t) = ζ∗.
10:  else
11:    θ(t) = θ(t−1).
12:  end if
13: end for

Other MCMC designs suitable for ABC can be found in [55] and [13]. Sequential Monte Carlo (SMC) has also been successfully used for ABC (we denote it by ABC-SMC) [85, 31]. ABC-SMC requires a
specified decreasing sequence ε0 > · · · > εJ . This method uses the Particle MCMC design
[8] in which samples are updated as the target distribution evolves with ε. More specifically,
it starts by sampling $\theta_0^{(1)}, \ldots, \theta_0^{(M)}$ from $\pi_{\varepsilon_0}(\theta \mid y_0)$ using Accept/Reject ABC. Subsequently, at time t + 1 the samples available at time t are sequentially updated so that their distribution is
$\pi_{\varepsilon_{t+1}}(\theta \mid y_0)$ [see 55, for a complete description of the SMC-MCMC]. The advantage of this method is not only that it starts from a large ε but also that it generates independent draws. A comprehensive coverage of computational techniques for ABC can be found in [84] and references therein. The ABC-MCMC algorithm proposed by [54] approximates $P(\delta(y_0, y) < \varepsilon \mid \theta)$ by $J^{-1} \sum_{j=1}^{J} 1_{\{\delta(y_0, y_j) < \varepsilon\}}$, where $J \ge 1$ and each $y_j$ is simulated from $f(y|\theta)$. Note that this estimator is unbiased, and hence the stationary distribution of θ is $\pi_\varepsilon(\theta|y_0)$ as a consequence of pseudo-marginal MCMC theory [9]. Clearly, when the probability $P(\delta < \varepsilon \mid \theta)$ is very small, this method requires simulating a large number of δs (or, equivalently, ys) in order to move to a new state. Note that even when the proposed parameter θ is near the "true" unknown parameter θ0, the simulated δ(y, y0) given θ can be greater than ε due to the randomness of the conditional distribution δ|θ; in this case the chain can become sluggish to the point of being impractical. We also note a general lack of guidelines concerning the selection of ε, which is unfortunate, as the performance of ABC sampling depends heavily on its value.
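A bare-bones sketch of the ABC-MCMC kernel of Algorithm 3, using a symmetric random-walk proposal so that the q-ratio cancels (all helper names are placeholders, not part of the thesis code):

```python
import numpy as np

def abc_mcmc(theta0, s0, log_prior, simulate, summary, eps, M, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    chain = [theta0]
    for _ in range(M):
        theta = chain[-1]
        zeta = theta + step * rng.standard_normal(np.shape(theta))  # RW proposal
        delta = np.linalg.norm(summary(simulate(zeta, rng)) - s0)
        # the indicator 1{delta < eps} enters the acceptance ratio of (6.4)
        log_a = log_prior(zeta) - log_prior(theta) if delta < eps else -np.inf
        chain.append(zeta if np.log(rng.uniform()) < log_a else theta)
    return np.array(chain)
```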
Notice that the choice of proposal distribution q(·|θ) can dramatically influence the performance of ABC-MCMC. To make a fair comparison between different methods, we revise the ABC-MCMC algorithm by introducing a decreasing sequence ε0 > · · · > εJ (J is the number of "steps"), similar to ABC-SMC, and by "learning" the transition kernel during burn-in, as in Algorithm 4. The main difference is that during the burn-in period of length B the ε-sequence starts with a higher value (which makes finding initial θ(0) values much more feasible) and gradually decreases, while the proposal distribution is adapted over the same period. The adaptation of the proposal takes place only during the burn-in period. For independent MH sampling the generic proposal is Gaussian, $\mathcal{N}(\cdot \mid \hat\mu, \hat\Sigma)$, with the constant c set to 2 or 3; for the random walk sampler the standard transition probability is $\mathcal{N}(\cdot \mid \theta^{(t-1)}, \hat\Sigma)$ with $c = 2.38^2/q$, which is proven to be optimal for Gaussian posteriors [75, 76].
All the algorithms discussed so far rely on numerous generations of pseudo-data. These can be computationally costly; some attempts to reduce the computational cost of ABC are made in [93] and [45]. The approaches are based on learning the dependence between δ and θ. Therefore, instead of simulating many statistics for each proposed θ, the accelerated algorithm captures information from all simulated pairs through this functional form. Flexible regression models are used to model these unknown functional relationships, and the performance depends on the signal-to-noise ratio and on the ability to capture patterns that can be highly complex.
Algorithm 4 ABC MCMC modified (ABC-MCMC-M)
1: Given y0, sequence ε0 > · · · > εJ, constant c, burn-in period B and required number of samples M.
2: Define ε = ε0.
3: Find an initial θ(0) with simulated y such that d(S(y), S(y0)) < ε.
4: Let µ̂ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
5: Define b = ⌊B/J⌋ and the sequence (a1, · · · , aJ) = (b, 2b, · · · , Jb).
6: for t = 1, · · · , M do
7:   if t = aj for some j = 1, · · · , J then
8:     Set ε = εj.
9:     Set µ̂ to the mean of {θ(t)}, t = 1, · · · , (aj − 1), and Σ̂ = cΣ, where Σ is the covariance of {θ(t)}, t = 1, · · · , (aj − 1).
10:  end if
11:  Generate ζ∗ ∼ q(·|θ(t−1), µ̂, Σ̂).
12:  Simulate y∗ ∼ f(y|ζ∗) and let δ∗ = d(S(y∗), S(y0)).
13:  Calculate α = min(1, [1{δ∗<ε} p(ζ∗) q(θ(t−1)|ζ∗, µ̂, Σ̂)] / [p(θ(t−1)) q(ζ∗|θ(t−1), µ̂, Σ̂)]).
14:  Generate independent U ∼ U(0, 1).
15:  if U ≤ α then
16:    θ(t) = ζ∗.
17:  else
18:    θ(t) = θ(t−1).
19:  end if
20: end for
6.3 Bayesian Synthetic Likelihood
As an alternative to ABC which requires tuning of ε and selection of a distance function d(·, ·),[94] tackled the intractability of the sampling distribution by assuming that the conditional
distribution for the statistic S(y) given θ is Gaussian with mean µθ and covariance matrix
Σθ. The Synthetic Likelihood (SL) procedure assigns to each θ the likelihood SL(θ) =
N (s0;µθ,Σθ), where N (x;µ,Σ) denotes the density of a normal with mean µ and covariance
Σ, and s0 = S(y0). SL can be used for maximum likelihood estimation as in [94] or within the
Bayesian paradigm, as proposed by [26] and [73]. In Bayesian Synthetic Likelihood (BSL), [73] propose to implement a Metropolis-Hastings sampler that has $\pi(\theta|s_0) \propto p(\theta)\,\mathcal{N}(s_0; \mu_\theta, \Sigma_\theta)$ as its stationary distribution. It is clear that direct implementation is not possible, as the
conditional mean and covariance matrix are unknown. However, both can be estimated
based on m statistics (s1, · · · , sm) sampled from their conditional distribution given θ. More
precisely, after simulating $y_i \sim f(y|\theta)$ and setting $s_i = S(y_i)$, $i = 1, \ldots, m$, estimate

\[
\hat\mu_\theta = \frac{\sum_{i=1}^{m} s_i}{m}, \qquad
\hat\Sigma_\theta = \frac{\sum_{i=1}^{m} (s_i - \hat\mu_\theta)(s_i - \hat\mu_\theta)^T}{m - 1}, \tag{6.6}
\]

so that the synthetic likelihood is

\[
\widehat{SL}(\theta \mid y_0) = \mathcal{N}(S(y_0) \mid \hat\mu_\theta, \hat\Sigma_\theta). \tag{6.7}
\]
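A sketch of the synthetic-likelihood estimate (6.6)-(6.7) at a single θ; simulate and summary are again user-supplied placeholders:

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_synthetic_likelihood(theta, s0, simulate, summary, m, rng):
    S = np.array([summary(simulate(theta, rng)) for _ in range(m)])
    mu = S.mean(axis=0)                    # hat mu_theta of (6.6)
    Sigma = np.cov(S, rowvar=False)        # hat Sigma_theta (divisor m - 1)
    return multivariate_normal.logpdf(s0, mean=mu, cov=Sigma)  # log of (6.7)
```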
The pseudo-code in Algorithm 5 shows the steps involved in the BSL-MCMC sampler.
Since each Metropolis-Hastings step requires calculating the likelihood ratios between two
SLs calculated at different θs one can anticipate the heavy computational load involved in
running the chain for thousands of iterations, especially if sampling each y is expensive. Note
that even though these estimates for the conditional mean and covariance are unbiased, the
estimated value of the Gaussian likelihood is biased, and therefore pseudo-marginal MCMC theory is not applicable here. [73] present an unbiased Gaussian likelihood estimator and show empirically that the biased and unbiased estimates generally perform similarly. They also remark that the procedure is very robust to the number of simulations m, demonstrating empirically that using m = 50 to 200 produces similar results.
Algorithm 5 BSL MCMC
1: Given s0 = S(y0), number of simulations m and required number of samples M.
2: Get initial θ(0); estimate µ̂θ(0), Σ̂θ(0) by simulating m statistics given θ(0).
3: Define h(θ(0)) = N(s0; µ̂θ(0), Σ̂θ(0)).
4: for t = 1, · · · , M do
5:   Generate ζ∗ ∼ q(·|θ(t−1)).
6:   Estimate µ̂ζ∗, Σ̂ζ∗ by simulating m statistics given ζ∗.
7:   Define h(ζ∗) = N(s0; µ̂ζ∗, Σ̂ζ∗).
8:   Calculate α = min(1, [h(ζ∗) p(ζ∗) q(θ(t−1)|ζ∗)] / [h(θ(t−1)) p(θ(t−1)) q(ζ∗|θ(t−1))]).
9:   Generate independent U ∼ U(0, 1).
10:  if U ≤ α then
11:    Set θ(t) = ζ∗.
12:  else
13:    Set θ(t) = θ(t−1).
14:  end if
15: end for

The normality assumption for summary statistics is certainly a strong assumption which may not hold in practice. Following up on this, [6] relaxed the jointly Gaussian assumption
to Gaussian copula with non-parametric marginal distribution estimates (NON-PAR BSL),
which includes joint Gaussian as a special case but is much more flexible. The estimation is
based, as in the BSL framework, on m pseudo-data samples simulated for each θ.
6.4 Plan
It is evident that both ABC and BSL are computationally costly and require an enormous number of pseudo-data simulations to run even a moderate-size MCMC chain. Accelerating these algorithms is especially important for very large data sets, or for time-consuming pseudo-data simulations or summary statistic calculations. We propose to speed up these methods by storing past simulated draws and using them to approximate the unknown likelihood. While this reduces the computation time drastically, it raises the need to control the approximation error introduced when modifying the original transition kernel. The objective is to approximate $P(\delta < \varepsilon \mid \theta)$ and $\mathcal{N}(s_0; \mu_\theta, \Sigma_\theta)$ for any θ at every MCMC iteration using the past simulated (θ, δ) and (θ, s) pairs for ABC and BSL, respectively. The K-Nearest-Neighbour (kNN) method is used as a non-parametric estimation tool for the quantities described above. The main advantage of kNN is that it is uniformly strongly consistent, which guarantees that, for a large enough chain history, we can control the error between the intended stationary distribution and that of the proposed accelerated MCMC.
The structure of this part is the following: in Chapter 7 we describe the accelerated
MCMC algorithms for ABC. In Chapter 8 we extend the proposed method to BSL. The
practical impact of these algorithms is evaluated via simulations in Chapter 9 and the data
analysis involving the Stochastic Volatility model (with α-stable errors) applied to a time
series of daily log returns of Dow Jones index between Jan 1, 2010 and Dec 31, 2017 is
presented in Chapter 10. The theoretical analysis showing control of perturbation errors in
total variation norm is presented in Chapter 11.
Chapter 7
Approximated ABC (AABC)
7.1 Computational Inefficiency of ABC
The problem that we tackle in this thesis is the computational burden of standard ABC procedures, like ABC-MCMC. As was pointed out in the previous chapter, ABC algorithms require a large number of pseudo-data simulations when the threshold ε is small, which results in low computational efficiency. Letting θ0 be the parameter that generated the observed data, if $P(\delta < \varepsilon \mid \theta_0)$ is small then, even when the proposed state ζ∗ is close to θ0, there is a high probability that the generated δ∗ will be greater than ε and the proposal therefore rejected. Thus many samples are rejected not because they are in the tail of the (approximate) posterior but simply due to the variability of δ conditional on θ. Note that all ABC methods completely ignore past simulation results; we think, however, that they can provide essential information that can significantly accelerate the algorithm. This observation is the basis of the proposed method.
7.2 Approximated ABC-MCMC (AABC-MCMC)
In this section we describe a novel ABC-MCMC sampler that utilizes past pseudo-data simulations and significantly improves the performance of the chain. We mentioned in the last chapter that the objective of ABC-MCMC (given a threshold ε) is to sample from the distribution with support Θ:

\[
\pi_\varepsilon(\theta \mid y_0) \propto p(\theta)\, P(\delta(y_0, y) < \varepsilon \mid \theta), \tag{7.1}
\]
where $\delta(y_0, y) = d(S(y), S(y_0))$ with $y \sim f(y|\theta)$. If $P(\delta(y_0, y) < \varepsilon \mid \theta) = h(\theta)$ were known for every θ, then we could run an exact MH-MCMC chain with invariant distribution proportional to p(θ)h(θ). Since this function of θ is generally unknown, it is estimated by the indicator $1_{\{\delta(y_0,y) < \varepsilon\}}$, which is an unbiased estimator. Suppose now that at iteration t + 1 we have stored N − 1 past simulations $Z_{N-1} = \{\zeta_n, \delta_n\}_{n=1}^{N-1}$, where ζ denotes a θ proposal that is generated independently of the MCMC (otherwise the Markovian property of the chain is violated). Given two new independent proposals $\zeta^*, \bar\zeta^* \sim q(\cdot|\theta^{(t)})$, the first is the proposal used for the chain update and the second is used to enrich the "history". We then generate one $\bar\delta^*$ by first simulating $y^*$ (given $\bar\zeta^*$), then calculating the statistic $s^* = S(y^*)$ and finally computing the discrepancy between $s^*$ and $s_0 = S(y_0)$, $\bar\delta^* = d(s^*, s_0)$. We then combine the past samples $Z_{N-1}$ with the new pair, $Z_N = Z_{N-1} \cup \{(\bar\zeta^*, \bar\delta^*)\}$, and estimate $h(\zeta^*)$ as follows:

\[
\hat h(\zeta^*) = \frac{\sum_{n=1}^{N} W_{Nn}(\zeta^*)\, 1_{\{\delta_n < \varepsilon\}}}{\sum_{n=1}^{N} W_{Nn}(\zeta^*)}. \tag{7.2}
\]

Here the weight function $W_{Nn}(\zeta^*) = W_N(\zeta_n, \zeta^*) = W(\lVert \zeta_n - \zeta^* \rVert)$ depends on the Euclidean distance between $\zeta_n$ and $\zeta^*$, assigning more weight to the pairs that are closest. We will discuss several choices for the function W(·) below.
In other words, a non-parametric estimate of h(ζ) is produced for each ζ based on the previous simulations $Z_N$. Notice that if there is a close neighbour of ζ∗ for which the discrepancy is less than ε, then the estimate $\hat h(\zeta^*)$ will not be zero and there is a chance of moving to a different state. Intuitively, this is expected to stabilize the acceptance probability of the chain and to perform better than standard ABC-MCMC. Since the proposed weighted estimate is no longer an unbiased estimator of h(θ), a new theoretical evaluation is needed to study the effect of perturbing the transition kernel on the statistical analysis. Central to the algorithm's utility is the ability to control the total variation distance between the desired distribution of interest given in (7.1) and the modified chain's target. As will be shown in Chapter 11, we rely on three assumptions to ensure that the chain approximately samples from (7.1): 1) compactness of Θ; 2) uniform ergodicity of the chain using the true h; and 3) uniform convergence in probability of $\hat h(\theta)$ to h(θ) as N → ∞.
K-Nearest-Neighbor (kNN) regression approach [32, 11] has a property of uniform con-
sistency [16] therefore for h(θ) we employ this technique. Here we define K = g(N) (for
simulations we use g(·) =√
(·)) and let λ : {1, · · · , N} → {1, · · · , N} be a permutation of
indices that sorts {ζn} ∈ ZN = {ζn, δn}Nn=1 from closest to ζ∗ to furthest. Suppose now that
80
Algorithm 6 Approximated ABC MCMC (AABC-MCMC)
1: Given y0 with summary statistics s0, a sequence ε0 > · · · > εJ, a constant c, a burn-in period B, a required number of samples M, and initial simulations Z_N = {(ζn, δn) ; n = 1, . . . , N} with ζn ∼ p(ζ), yn ∼ f(·|ζn) and δn = d(S(yn), s0).
2: Define ε = ε0.
3: Find an initial θ(0) with simulated y such that d(S(y), s0) < ε.
4: Let µ̂ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
5: Define b = ⌊B/J⌋ and the sequence (a1, · · · , aJ) = (b, 2b, · · · , Jb).
6: for t = 1, · · · , M do
7:    if t = aj for some j = 1, · · · , J then
8:        Set ε = εj.
9:        Find µ̂ as the mean of θ(1), · · · , θ(aj−1) and Σ̂ = cΣ, where Σ is the covariance of θ(1), · · · , θ(aj−1).
10:   end if
11:   Generate ζ∗, ζ∗∗ iid∼ N(·; µ̂, Σ̂).
12:   Simulate y∗∗ ∼ f(·|ζ∗∗) and let δ∗∗ = d(S(y∗∗), s0).
13:   Add the simulated pair of parameter and discrepancy to the past set: Z_N = Z_{N−1} ∪ {(ζ∗∗, δ∗∗)} and set N = N + 1.
14:   ĥ(ζ∗) = [ ∑_{n=1}^{N} W_{Nn}(ζ∗) 1{δn<ε} ] / [ ∑_{n=1}^{N} W_{Nn}(ζ∗) ].
15:   ĥ(θ(t)) = [ ∑_{n=1}^{N} W_{Nn}(θ(t)) 1{δn<ε} ] / [ ∑_{n=1}^{N} W_{Nn}(θ(t)) ].
16:   Calculate α = min( 1, [p(ζ∗) ĥ(ζ∗) N(θ(t); µ̂, Σ̂)] / [p(θ(t)) ĥ(θ(t)) N(ζ∗; µ̂, Σ̂)] ).
17:   Generate an independent U ∼ U(0, 1).
18:   if U ≤ α then
19:       θ(t+1) = ζ∗.
20:   else
21:       θ(t+1) = θ(t).
22:   end if
23: end for
after the permutation, the past set Z_N is rearranged by distance to ζ∗, so that (ζ1, δ1) has the smallest ‖ζ1 − ζ∗‖ while (ζN, δN) has the largest distance ‖ζN − ζ∗‖. Then kNN sets W_{Nn}(ζ∗) to zero for all n > K; for n ≤ K there are several weight choices, and we focus on two:

(U) The uniform kNN with W_{Nn}(ζ∗) = 1 for all n ≤ K;

(L) The linear kNN with W_{Nn}(ζ∗) = W(‖ζn − ζ∗‖) = 1 − ‖ζn − ζ∗‖/‖ζK − ζ∗‖ for n ≤ K, so that the weight decreases from 1 to 0 as n increases from 1 to K.
Moreover, kNN theoretical arguments generally require independent pairs {(ζn, δn) ; n = 1, . . . , N}; therefore for the proposal distribution we apply an independence sampler, so that q(·|θ(t)) = q(·). As in Algorithm 4, we allow the ε-sequence to decrease gradually during the burn-in period. In all our simulations we assume the proposal is Gaussian, which of course can be changed to any other appropriate distribution (with positive density on Θ) suited to a particular problem. The entire procedure is outlined in Algorithm 6.
To conclude, at the end of a simulation of size M the MCMC samples are {θ(1), . . . , θ(M)} and the history used for updating the chain is {(ζ1, δ1), . . . , (ζM, δM)}. The two sequences are independent of one another, i.e. for any N > 0 the elements in Z_N are independent of the chain's history up to time N.
Note also that ĥ(θ) is estimated in both the numerator and the denominator of the acceptance probability at every iteration, so even for the current state this value is recalculated rather than borrowed from the previous iteration. This procedure generally improves the mixing of the chain, and it is theoretically justified as will be shown in chapter 11. The constant c controls the variability of the proposal: if it is too small, the MCMC will not explore the posterior effectively; if too large, there will be many rejections, since proposed values will frequently fall in the tails of the posterior. For all our simulations and the real data example we use c = 1.5, which was found empirically to be quite satisfactory.
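For completeness, a minimal sketch of the acceptance step (step 16 of Algorithm 6) under the Gaussian independence proposal; prior_logpdf and h_hat_fn are illustrative placeholders, with h_hat_fn standing for the kNN estimate sketched above.

import numpy as np
from scipy.stats import multivariate_normal

def aabc_accept_prob(theta, zeta_star, prior_logpdf, h_hat_fn, mu, Sigma):
    """Acceptance probability min(1, p(z)h(z)N(th; mu,Sig) / [p(th)h(th)N(z; mu,Sig)])."""
    q = multivariate_normal(mean=mu, cov=Sigma)        # independence proposal
    num = np.exp(prior_logpdf(zeta_star)) * h_hat_fn(zeta_star) * q.pdf(theta)
    den = np.exp(prior_logpdf(theta)) * h_hat_fn(theta) * q.pdf(zeta_star)
    return 1.0 if den == 0 else min(1.0, num / den)    # accept if den vanishes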
Chapter 8
Approximated BSL (ABSL)
8.1 Computational Inefficiency of BSL
Similar to ABC, BSL is computationally costly and requires many pseudo-data simulations to run even moderately long chains, since at every iteration it generates m pseudo-data sets. This m cannot be small, since otherwise the estimates of the conditional mean µθ and covariance Σθ will not be accurate, especially for moderate or large dimension p of the summary statistic. To accelerate BSL-MCMC we propose to store and utilize past simulations of (ζ, s) to approximate the conditional mean and covariance for every proposed ζ∗ (proposed parameter), making the whole procedure computationally faster. Instead of simulating m pseudo-data sets, only one is simulated and used in combination with the past simulations. The approach can trivially be extended to the NONPAR-BSL algorithm, but we do not pursue it further. The K-Nearest-Neighbor (kNN) method is used as the non-parametric estimator of the quantities described above.
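Before stating the algorithm, here is a minimal sketch of the kNN approximation of µθ and Σθ plugged into the Gaussian synthetic likelihood, under the same K = √N convention used for AABC; all names are illustrative.

import numpy as np
from scipy.stats import multivariate_normal

def knn_synthetic_loglik(zeta_star, zetas, stats, s0):
    """kNN estimates of mu_theta and Sigma_theta at zeta_star, from stored
    pairs (zetas[n], stats[n]), plugged into the Gaussian synthetic likelihood."""
    N = len(zetas)
    K = max(3, int(np.sqrt(N)))                      # K = sqrt(N) neighbours
    dist = np.linalg.norm(zetas - zeta_star, axis=1)
    s_near = stats[np.argsort(dist)[:K]]             # statistics of K nearest
    mu = s_near.mean(axis=0)                         # estimate of mu_theta
    Sigma = np.cov(s_near, rowvar=False)             # estimate of Sigma_theta
    return multivariate_normal(mean=mu, cov=Sigma, allow_singular=True).logpdf(s0)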
Algorithm 7 Approximated BSL (ABSL-MCMC)
1: Given s0 = S(y0), a constant c, a burn-in period B, a number J of adaptation points during burn-in, a required number of samples M, and initial pseudo-data simulations Z_N = {(ζn, sn) ; n = 1, . . . , N} with ζn ∼ p(ζ), yn ∼ f(·|ζn) and sn = S(yn).
2: Get an initial θ(0).
3: Let µ̂ be the expectation of the prior distribution and Σ̂ = cΣ, where Σ is the covariance of the prior p(θ).
4: Define b = ⌊B/J⌋ and the sequence (a1, · · · , aJ) = (b, 2b, · · · , Jb).
5: for i = 1, · · · , M do
6:    if i = aj for some j = 1, · · · , J then
7:        Find µ̂ as the mean of θ(1), · · · , θ(aj−1) and Σ̂ = cΣ, where Σ is the covariance of θ(1), · · · , θ(aj−1).
8:    end if
9:    Generate ζ∗, ζ∗∗ iid∼ N(·; µ̂, Σ̂).
10:   Simulate y∗∗ ∼ f(·|ζ∗∗) and let s∗∗ = S(y∗∗).
11:   Add the simulated pair of parameter and statistic to the past set: Z_N = Z_{N−1} ∪ {(ζ∗∗, s∗∗)}
sponding summary statistics (s1, . . . , s100), and set A to be the inverse of the covariance matrix of (s1, . . . , s100). This procedure (with an updated A) is repeated several times; at the final stage we calculate (δ1, . . . , δ100) and set ε0 to be the 5% quantile of these discrepancies. The numbers of simulations, 500 and 100, were chosen purely for computational convenience and are not driven by any theoretical argument. To estimate the final threshold ε15 we use the Random Walk version of Algorithm 4 with M = B = 5000 and initial threshold ε0. We add one modification by setting εj, j = 1, . . . , 15, equal to the 1% quantile of the discrepancies δ generated between adaptation points aj−1 and aj; the final threshold is set to ε15. Intuitively, the ε sequence should decrease as the chain states move closer to the "true" parameter. Note that this chain is only used to approximate the final ε and cannot be used to study the properties of the approximate posterior. Intermediate values ε1, . . . , ε14 are then computed as equidistant points on the natural log scale between ε0 and ε15.
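The log-scale interpolation itself is a one-liner; the numerical values below are placeholders standing in for the pilot-run thresholds just described.

import numpy as np

eps0, eps15 = 2.0, 0.05   # placeholders for the pilot-run thresholds
# 16 thresholds eps_0 > ... > eps_15, equidistant on the natural log scale
eps_seq = np.exp(np.linspace(np.log(eps0), np.log(eps15), 16))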
We compare the following algorithms:
(SMC) Standard Sequential Monte Carlo for ABC;
(ABC-RW) The modified ABC-MCMC algorithm which updates ε and the random walk Metropolis
transition kernel during burn-in;
(ABC-IS) The modified ABC-MCMC algorithm which updates ε and the Independent Metropolis
transition kernel during burn-in;
(BSL-RW) Modified BSL where it adapts the random walk Metropolis transition kernel during
burn-in;
(BSL-IS) Modified BSL where it adapts the independent Metropolis transition kernel during
burn-in;
(AABC-U) Approximated ABC-MCMC with independent proposals and uniform (U) weights;
(AABC-L) Approximated ABC-MCMC with independent proposals and linear (L) weights;
(ABSL-U) Approximated BSL-MCMC with independent proposals and uniform (U) weights;
(ABSL-L) Approximated BSL-MCMC with independent proposals and linear (L) weights.
(Exact) When likelihood is computable, posterior samples were generated using MCMC.
For SMC, 500 particles were used. The total number of iterations for ABC-RW, ABC-IS, AABC-U, AABC-L, ABSL-U and ABSL-L is 50000, with 10000 for burn-in. Since BSL-RW and BSL-IS are much more computationally expensive, their total number of iterations was fixed at 10000, with 2000 for burn-in and 50 simulations of y for every proposed ζ∗ (i.e. m = 50). The Exact chain was run for 5000 iterations, with 2000 for burn-in. It must be pointed out that all approximate samplers are based on the same summary statistics, the same discrepancy function and the same ε sequence, so that they all start with the same initial conditions.
9.2 Measures for Comparisons
For more reliable results we compare these sampling algorithms under data set replication; in this study we set the number of replicates R to 100, so that for each model 100 data sets were generated and each one was analyzed with the sampling methods described above. Assorted statistics and measures were calculated for every model and data set. Let θ(t)_rs denote posterior samples from replicate r = 1, · · · , R, iteration t = 1, · · · , M and parameter component s = 1, · · · , q, and similarly θ̃(t)_rs posterior samples from an exact chain (all draws are after the burn-in period). We also let θ^true_s denote the true parameter that generated the data. Moreover, let D_rs(x) and D̃_rs(x) be the estimated density functions at replicate r = 1, · · · , R and component s = 1, · · · , q for the approximate and exact chains respectively. Then the following quantities are defined:
Diff in mean (DIM) = Mean_{r,s}( |Mean_t(θ(t)_rs) − Mean_t(θ̃(t)_rs)| ),
Diff in covariance (DIC) = Mean_{r,s}( |Cov_t(θ(t)_rs) − Cov_t(θ̃(t)_rs)| ),
Total Variation (TV) = Mean_{r,s}( 0.5 ∫ |D_rs(x) − D̃_rs(x)| dx ),
Bias² = Mean_s( (Mean_{t,r}(θ(t)_rs) − θ^true_s)² ),
Var = Mean_s( Var_r(Mean_t(θ(t)_rs)) ),
MSE = Bias² + Var,
where Mean_t(a_st) is defined as the average of {a_st} over the index t, and in a similar manner Var_t(a_st) and Cov_t(a_st) represent variance and covariance respectively. The first three measures are useful in determining how close posterior draws from the different samplers are to the draws generated by the exact chain (when it is available). The last three are standard quantities that measure how close, in mean square, the posterior means are to the true parameters that generated the data. To study the efficiency of the proposed algorithms we need to take into account the CPU time it takes to run a chain as well as its auto-correlation properties. Define the auto-correlation time (ACT) for every parameter component and replicate of samples θ(t)_rs as:

ACT_rs = 1 + 2 ∑_{a=1}^{∞} ρ_a(θ(t)_rs),

where ρ_a is the auto-correlation coefficient at lag a. In practice we sum over all lags up to the first negative correlation. Letting M − B be the number of chain iterations (after burn-in) and
CPU_r the total CPU time needed to run the whole chain in replicate r, we introduce the Effective Sample Size (ESS) and Effective Sample Size per CPU (ESS/CPU) as:

ESS = Mean_{r,s}( (M − B)/ACT_rs ),
ESS/CPU = Mean_{r,s}( (M − B)/(ACT_rs · CPU_r) ).   (9.1)
Note that these indicators are averaged over parameter components and replicates. ESS can intuitively be thought of as the approximate number of "independent" samples out of the M − B draws: the higher the ESS, the more efficient the sampling algorithm. When ESS is combined with CPU time (ESS/CPU) it provides a powerful indicator of an MCMC's efficiency. Generally, the sampler with the highest ESS/CPU is preferred, as it produces the largest number of "independent" draws per unit time.
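A minimal sketch of these two indicators for a single chain and a single parameter component, truncating the ACT sum at the first negative auto-correlation as described above; all names are illustrative.

import numpy as np

def act(x):
    """Auto-correlation time 1 + 2*sum of lag auto-correlations,
    truncated at the first negative correlation."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    total = 1.0
    for a in range(1, n):
        if acf[a] < 0:
            break
        total += 2.0 * acf[a]
    return total

def ess_per_cpu(samples, cpu_seconds):
    """ESS and ESS/CPU for one chain, as in (9.1) before averaging."""
    ess = len(samples) / act(samples)
    return ess, ess / cpu_seconds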
9.3 Moving Average Model
A popular toy example for checking the performance of ABC and BSL techniques is the MA(2) model:

z_i iid∼ N(0, 1), i = −1, 0, 1, · · · , n,
y_i = z_i + θ1 z_{i−1} + θ2 z_{i−2}, i = 1, · · · , n.   (9.2)
The data are represented by the sequence y = {y1, · · · , yn}. It is well known that the Y_i follow a stationary distribution for any θ1, θ2, but conditions are required for identifiability. Hence, we impose a uniform prior on the following set:

θ1 + θ2 > −1,
θ1 − θ2 < 1,
−2 < θ1 < 2,
−1 < θ2 < 1.   (9.3)
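A minimal Python sketch of the MA(2) data-generating step (9.2) and of the summary statistic S(y) used below (sample variance and auto-covariances at lags 1 and 2); all names are illustrative.

import numpy as np

def simulate_ma2(theta1, theta2, n=200, rng=None):
    """Simulate one data set from the MA(2) model (9.2)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(n + 2)                 # z_{-1}, z_0, z_1, ..., z_n
    return z[2:] + theta1 * z[1:-1] + theta2 * z[:-2]

def ma2_summary(y):
    """S(y) = (gamma_0, gamma_1, gamma_2): variance and lag-1, lag-2 covariances."""
    yc = y - y.mean()
    n = len(y)
    g0 = np.sum(yc * yc) / n
    g1 = np.sum(yc[:-1] * yc[1:]) / n
    g2 = np.sum(yc[:-2] * yc[2:]) / n
    return np.array([g0, g1, g2])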
It is very easy to see that the joint distribution of y is multivariate Gaussian with mean 0, diagonal variances 1 + θ1² + θ2², covariances θ1 + θ1θ2 and θ2 at lags 1 and 2 respectively, and zero covariance at other lags. In this case, (Exact) sampling is feasible. For the simulations we set {θ1 = 0.6, θ2 = 0.6}, n = 200 and define the summary statistic S(y) = (γ0(y), γ1(y), γ2(y)) as the sample variance and the covariances at lags 1 and 2. First we show results based on one replicate. Figure 9.1 shows the trace plots, histograms and auto-correlation functions estimated from posterior draws of the parameters θ1 and θ2 for the AABC-U sampler. Note that only post burn-in samples are shown. Similarly, Figure 9.2 displays the behavior of the ABSL-U sampler.
Figure 9.1: MA2 model: AABC-U Sampler. Each row corresponds to parameters θ1 (top row) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.2: MA2 model: ABSL-U Sampler. Each row corresponds to parameters θ1 (top row) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.3: MA2 model: ABC-RW Sampler. Each row corresponds to parameters θ1 (top row) and θ2 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
These algorithms can be compared to the standard ABC-RW method in Figure 9.3.
In the interest of keeping the exposition within reasonable limits, we do not report the performance of the remaining algorithms, but we note that AABC-L is similar to AABC-U and ABSL-L to ABSL-U, while ABC-IS is generally less efficient than ABC-RW. From these plots it is evident that the proposed AABC-U and ABSL-U mix much better than ABC-RW. The auto-correlation function for these two methods takes quite small values because an independent proposal is implemented there, as opposed to the random walk, where the proposal depends on the current state.
To see how close the draws from the approximate samplers are to the draws from the exact chain, we plot estimated densities in Figure 9.4. Left and right plots refer to θ1 and θ2, respectively. The two upper plots compare the estimated density of the exact MCMC sampler with the ABC-based ones (SMC, ABC-RW and AABC-U), while the two lower plots compare the exact sampler with the Synthetic Likelihood based methods (BSL-IS and ABSL-U). All approximate samplers' draws deviate from the exact samples; however, the posterior distribution of AABC-U is very similar to those of SMC and ABC-RW, and similarly the distribution produced by ABSL-U is very close to that of BSL-IS. This observation is true for both components, θ1 and θ2. The difference between the approximate posterior distributions produced by the simulation-based methods and the exact posterior is probably due to the choice of summary statistic, which does not capture the information about the parameters in the most effective way.
Figure 9.4: MA model: Estimated densities for each component. First row compares Exact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-IS and ABSL-U. Columns correspond to parameter components, from left to right: θ1 and θ2.
To study the accuracy, precision and efficiency of the proposed samplers we perform a simulation study in which 100 data sets are generated and all samplers are run on every data set. The results are summarized in Table 9.1.

Table 9.1: Simulation Results (MA model): average Difference in mean, Difference in covariance, Total variation, square roots of Bias², Variance and MSE, Effective sample size and Effective sample size per CPU time, for every sampling algorithm.

Examining this table we immediately note that the ESS/CPU measure is much larger for the proposed algorithms than for the standard methods. The
improvement is very substantial: for example, the ESS/CPU for AABC-U is 12 times larger than for the best standard ABC procedure, SMC. Similar results hold for Bayesian Synthetic Likelihood. The main reason for this efficiency is the use of past draws in the decision to accept or reject a proposal. The improvement in efficiency would be of no use if the resulting posterior distributions were very different from the exact one. It is therefore essential to examine the DIM, DIC, TV and MSE quantities, which measure how close the posterior draws are to the samples generated by the exact MCMC; for all of them, the smaller the value, the better the sampler. We see that all these measures for AABC-U and AABC-L are very similar to those for SMC, ABC-RW and ABC-IS, and frequently outperform them; the same holds for the BSL approach. Another observation is that the approximated algorithms with uniform and linear weights generally perform very similarly.
9.4 Ricker’s Model
Ricker's model is analyzed very frequently to test Synthetic Likelihood procedures [94, 73]. It is a particular instance of a hidden Markov model:

x_{−49} = 1; z_i iid∼ N(0, exp(θ2)²), i = −48, · · · , n,
x_i = exp(exp(θ1)) x_{i−1} exp(−x_{i−1} + z_i), i = −48, · · · , n,
y_i ∼ Pois(exp(θ3) x_i), i = −48, · · · , n,   (9.4)
where Pois(λ) is the Poisson distribution with mean parameter λ and n = 100. Only the sequence y = (y1, · · · , yn) is observed; the first 50 values are ignored. Note that in the model all parameters θ = (θ1, θ2, θ3) are unrestricted; the prior is given as (with each parameter independent):

θ1 ∼ N(0, 1),
θ2 ∼ Unif(−2.3, 0),
θ3 ∼ N(0, 4).   (9.5)
We restrict the range of θ2 because all algorithms become unstable for θ2 outside this interval. Note that the marginal distribution of y is not available in closed form, but the transition distribution of the hidden variables, X_i|x_{i−1}, and the emission probabilities, Y_i|x_i, are known; hence we can run Particle MCMC (PMCMC) [8] or Ensemble MCMC [82] to sample from the posterior distribution π(θ|y0). Here we utilize Particle MCMC with 100 particles. As
suggested in [94], we set θ0 = (log(3.8), −0.9, 2.3) and define the summary statistic S(y) as the 14-dimensional vector whose components are:

(C1) #{i : y_i = 0},
(C2) Average of y, ȳ,
(C3:C7) Sample auto-correlations at lags 1 through 5,
(C8:C11) Coefficients β0, β1, β2, β3 of the cubic regression
(y_i − y_{i−1}) = β0 + β1 y_i + β2 y_i² + β3 y_i³ + ε_i, i = 2, . . . , n,
(C12:C14) Coefficients β0, β1, β2 of the quadratic regression
y_i^{0.3} = β0 + β1 y_{i−1}^{0.3} + β2 y_{i−1}^{0.6} + ε_i, i = 2, . . . , n.
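Since ABC and BSL only require forward simulation, a minimal sketch of the Ricker data-generating process (9.4) suffices; all names are illustrative.

import numpy as np

def simulate_ricker(theta, n=100, burn=50, rng=None):
    """Simulate y_1..y_n from Ricker's model (9.4), theta = (th1, th2, th3);
    the first `burn` observations are dropped, as in the text."""
    rng = np.random.default_rng() if rng is None else rng
    th1, th2, th3 = theta
    x = 1.0                                     # initial hidden state
    y = np.empty(n + burn)
    for i in range(n + burn):
        z = rng.normal(0.0, np.exp(th2))        # z_i ~ N(0, exp(th2)^2)
        x = np.exp(np.exp(th1)) * x * np.exp(-x + z)
        y[i] = rng.poisson(np.exp(th3) * x)     # y_i ~ Pois(exp(th3) x_i)
    return y[burn:]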
Figures 9.5, 9.6 and 9.7 show trace-plots, histograms and ACF functions for the AABC-U, ABSL-U and ABC-RW samplers for each component (red lines correspond to the true parameter). We show ABC-RW here instead of ABC-IS, since the latter has much worse performance for this model.
Figure 9.5: Ricker's model: AABC-U Sampler. Each row corresponds to parameters θ1 (top row), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
The main observation is that the mixing of AABC-U is much better than that of ABC-RW, with smaller auto-correlation values. ABSL-U has higher auto-correlations than AABC-U but still performs quite well.
Figure 9.6: Ricker's model: ABSL-U Sampler. Each row corresponds to parameters θ1 (top row), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.7: Ricker's model: ABC-RW Sampler. Each row corresponds to parameters θ1 (top row), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
To see how close the draws from the simulation-based algorithms are to the draws from the exact chain, we plot estimated densities in Figure 9.8. The upper row of plots compares the estimated density of the exact PMCMC sampler (with 100 particles) with the ABC-based
ones (SMC, ABC-RW and AABC-U), while the lower row compares the exact sampler with the Synthetic Likelihood based methods (BSL-RW and ABSL-U); here we have chosen BSL-RW over BSL-IS, since it has better general performance for this model.
Figure 9.8: Ricker's model: Estimated densities for each component. First row compares Exact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-RW and ABSL-U. Columns correspond to parameter components, from left to right: θ1, θ2 and θ3.
Observe that the ABC-based samplers (SMC, ABC-RW and AABC-U) have very similar estimated densities, and the densities of the Synthetic Likelihood methods are also similar to one another. For the second component there is a quite large difference between the exact and approximate posteriors, which may be due to non-informative summary statistics.
A more general study, in which results are averaged over 100 independent replicates, is shown in Table 9.2. Again, the proposed strategies clearly outperform in overall efficiency (ESS/CPU). For instance, AABC-U is about 10 times more efficient than standard SMC, and ABSL-U is 6 times more efficient than BSL-RW. At the same time, DIM, DIC, TV and MSE are generally smaller for the approximate methods than for the standard ones. It is therefore evident that for this model the improvement in sampler efficiency (the number of independent draws per unit CPU time) does not decrease the accuracy and precision of the posterior moments.
Table 9.2: Simulation Results (Ricker's model): average Difference in mean, Difference in covariance, Total variation, square roots of Bias², Variance and MSE, Effective sample size and Effective sample size per CPU time, for every sampling algorithm.
9.5 Stochastic Volatility Model

When analyzing stationary time series it is frequently observed that there are periods of high and periods of low volatility. This phenomenon is called volatility clustering; see for example [59]. One way to model such behaviour is through a Stochastic Volatility (SV) model, where the variances of the observed time series depend on hidden states that themselves form a stationary time series. Consider the following model, which depends on three parameters (θ1, θ2, θ3):
x1 ∼ N(0, 1/(1 − θ1²)); v_i iid∼ N(0, 1); w_i iid∼ N(0, 1), i = 1, · · · , n,
x_i = θ1 x_{i−1} + v_i, i = 2, · · · , n,
y_i = √(exp(θ2 + exp(θ3) x_i)) w_i, i = 1, · · · , n.   (9.6)
Only y = (y1, · · · , yn) is observed, while (x1, · · · , xn) are hidden states. The first parameter θ1 must lie between −1 and 1 and controls the auto-correlation of the hidden states; θ2 and θ3 are unrestricted and determine how the hidden states influence the variability of the observed series. Note that for fixed hidden states the distribution of the observed variable is normal, which might not be appropriate in some examples. We introduce the following priors, independently for each parameter:
θ1 ∼ Unif(0, 1),
θ2 ∼ N(0, 1),
θ3 ∼ N(0, 1).   (9.7)
We set the true parameters to (θ1 = 0.95, θ2 = −2, θ3 = −1) and the length of the time series to n = 500. Since the marginal distribution of y is not known in closed form, a standard MCMC strategy cannot be implemented; we use Particle MCMC (PMCMC) as the Exact sampling scheme. Since pseudo-data sets can easily be generated for every parameter value, the SV model is a good example on which to demonstrate the performance of the algorithms considered here. For summary statistics we use a 7-dimensional vector whose components are:
(C1) #{i : y_i² > quantile(y0², 0.99)},
(C2) Average of y²,
(C3) Standard deviation of y²,
(C4) Sum of the first 5 auto-correlations of y²,
(C5) Sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.1)}, i = 1, . . . , n},
(C6) Sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.5)}, i = 1, . . . , n},
(C7) Sum of the first 5 auto-correlations of {1{y_i² < quantile(y², 0.9)}, i = 1, . . . , n}.
Here quantile(y, τ) is defined as the τ-quantile of the sequence y. As was shown in [81] and [24], the auto-correlations of indicators (at different quantiles) can be very useful in characterizing a time series, which is why we have added (C5), (C6) and (C7) to the summary statistic. We focus here on y² and its auto-correlations, since the model parameters only affect the variability of y (the auto-correlation of y is zero at any lag).
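A minimal sketch of this 7-dimensional summary statistic; the names are illustrative, and y0 denotes the observed series used for the 99% cutoff in (C1).

import numpy as np

def acf_sum(x, lags=5):
    """Sum of the first `lags` sample auto-correlations of x."""
    x = x - x.mean()
    denom = np.sum(x * x)
    return sum(np.sum(x[:-a] * x[a:]) / denom for a in range(1, lags + 1))

def sv_summary(y, y0):
    """Components (C1)-(C7) for the SV model; y is pseudo-data, y0 observed."""
    y2 = y ** 2
    cut99 = np.quantile(y0 ** 2, 0.99)
    stats = [
        np.sum(y2 > cut99),            # (C1)
        y2.mean(),                     # (C2)
        y2.std(),                      # (C3)
        acf_sum(y2),                   # (C4)
    ]
    for tau in (0.1, 0.5, 0.9):        # (C5)-(C7): indicator auto-correlations
        ind = (y2 < np.quantile(y2, tau)).astype(float)
        stats.append(acf_sum(ind))
    return np.array(stats)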
Figures 9.9, 9.10 and 9.11 show trace-plots, histograms and ACF functions for the AABC-U, ABSL-U and ABC-RW samplers respectively for each component (red lines correspond to the true parameter). The major observation is that the mixing of AABC-U is much better than that of ABC-RW, with smaller auto-correlation values. ABSL-U has higher auto-correlations than AABC-U but still performs well.
well. In Figure 9.12 we compare the sample-based kernel smoothing density estimates ob-
tained from BSL-IS and BSL-RW. We note that all samples obtained from the approximate
Figure 9.9: SV model: AABC-U Sampler. Each row corresponds to parameters θ1 (top row), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.10: SV model: ABSL-U Sampler. Each row corresponds to parameters θ1 (top row), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
We note that all samples obtained from the approximate algorithms are compared with the exact posterior (produced using PMCMC with 100 particles). Generally, all ABC-based samplers perform similarly; on the other hand, ABSL-U performs worse than generic BSL-IS in this run, as it is shifted away from the exact posterior for θ1 and θ3.
Figure 9.11: SV model: ABC-RW Sampler. Each row corresponds to parameters θ1 (top row), θ2 (middle row) and θ3 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.12: SV model: Estimated densities for each component. First row compares Exact, SMC, ABC-RW and AABC-U samplers. Second row compares Exact, BSL-IS and ABSL-U. Columns correspond to parameter components, from left to right: θ1, θ2 and θ3.
To get more general conclusions, we show results averaged over 100 data replicates in Table 9.3. Again we note that the proposed algorithms outperform the benchmark samplers by a factor of about 8 in ESS/CPU. Moreover, AABC-U and AABC-L have very similar or smaller values for DIM, TV and MSE, which demonstrates that these samplers are much more efficient than the standard methods and at the same time produce parameter estimates as accurate (or more accurate) as the generic algorithms. ABSL-U and ABSL-L, on the other hand, did not perform well for this model: TV and MSE for these samplers are about 10% larger than for the generic ones.

Table 9.3: Simulation Results (SV model): average Difference in mean, Difference in covariance, Total variation, square roots of Bias², Variance and MSE, Effective sample size and Effective sample size per CPU time, for every sampling algorithm.
9.6 Stochastic Volatility with α-Stable errors
As was pointed out in the previous sub-section, the standard SV model assumes that, conditional on the hidden states, the observed variables have a normal distribution, which is a strong assumption. Frequently, in financial time series, large sudden drops occur that are very unlikely under Gaussianity. It is therefore suggested to use heavy-tailed distributions (instead of the Gaussian) to model financial data. We consider the family of distributions named α-Stable (Stab(α, β)), with two parameters, α ∈ (0, 2] (stability parameter) and β ∈ [−1, 1] (skew parameter). Two special cases are α = 1 and α = 2, which correspond to the Cauchy and Gaussian distributions respectively; note that for α < 2 the distribution has infinite variance. We define the
following SV model with α-Stable errors, with four parameters (θ1, θ2, θ3, θ4):
x1 ∼ N(0, 1/(1 − θ1²)); v_i iid∼ N(0, 1); w_i iid∼ Stab(θ4, −1), i = 1, · · · , n,
x_i = θ1 x_{i−1} + v_i, i = 2, · · · , n,
y_i = √(exp(θ2 + exp(θ3) x_i)) w_i, i = 1, · · · , n.   (9.8)
This model is very similar to the simple SV model, the only difference being that the emission errors follow an α-Stable distribution with unknown stability parameter and skew fixed at −1. We prefer a negatively skewed emission distribution in order to model large negative financial returns. As in the previous simulation example, θ2 and θ3 are unrestricted. The prior distribution for this model is (independently for each parameter):
θ1 ∼ Unif(0, 1),
θ2 ∼ N(0, 1),
θ3 ∼ N(0, 1),
θ4 ∼ Unif(1.5, 2).   (9.9)
We set the true parameters to (θ1 = 0.95, θ2 = −2, θ3 = −1, θ4 = 1.8) and the length of the time series to n = 500. The major challenge with this model is that α-Stable distributions have no closed-form densities; hence most MCMC samplers, including PMCMC and ensemble MCMC, cannot be used to sample from the posterior. However, sampling from this family of distributions is feasible, which makes the model particularly amenable to simulation-based methods like ABC and BSL (a small simulation sketch is given after the list below). For summary statistics we use a 7-dimensional vector whose components are:
components are:
(C1) #{i : y2i > quantile(y2
0, 0.99)},
(C2) Average of y2,
(C3) Standard deviation of y2,
(C4) Sum of the first 5 auto-correlations of y2,
(C5) Sum of the first 5 auto-correlations of {1{y2i<quantile(y2,0.1)}}ni=1,
(C6) Sum of the first 5 auto-correlations of {1{y2i<quantile(y2,0.5)}}ni=1,
(C7) Sum of the first 5 auto-correlations of {1{y2i<quantile(y2,0.9)}}ni=1.
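A minimal sketch of the data-generating step for model (9.8), assuming that scipy's levy_stable (with stability θ4 and skew −1) is an acceptable stand-in for Stab(θ4, −1); all names are illustrative.

import numpy as np
from scipy.stats import levy_stable

def simulate_sv_stable(theta, n=500, rng=None):
    """Simulate pseudo-data from the SV model with alpha-Stable errors (9.8);
    theta = (theta1, theta2, theta3, theta4), skew fixed at -1."""
    rng = np.random.default_rng() if rng is None else rng
    t1, t2, t3, t4 = theta
    x = np.empty(n)
    x[0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - t1 ** 2)))  # stationary start
    for i in range(1, n):
        x[i] = t1 * x[i - 1] + rng.normal()                  # hidden AR(1)
    w = levy_stable.rvs(t4, -1.0, size=n, random_state=rng)  # emission errors
    return np.sqrt(np.exp(t2 + np.exp(t3) * x)) * w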
Figures 9.13, 9.14 and 9.15 show trace-plots, histograms and ACF functions for the AABC-U, ABSL-U and ABC-RW samplers respectively for each component (red lines correspond to the true parameters). As in the previous examples, the mixing of AABC-U and ABSL-U is much better than that of ABC-RW.
Figure 9.13: SV α-Stable model: AABC-U Sampler. Each row corresponds to parameters θ1 (top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
Since exact sampling is not feasible in this example, we compare the samplers to SMC (instead of to exact samples); the estimated densities are plotted in Figure 9.16, where we have chosen BSL-IS over BSL-RW because it has better general performance for this model. Generally, all simulation-based samplers have similar densities in this example.
For more general conclusions, we show results averaged over 100 data replicates in Table 9.4. Here, to calculate DIM, DIC and TV, the samplers are compared to SMC, since exact draws cannot be obtained. As in the previous examples, the ESS/CPU values for AABC-U, AABC-L, ABSL-U and ABSL-L are roughly 8 times larger than for the benchmark algorithms. For this example, looking at DIM, DIC and TV may be misleading, since the approximated samplers are compared to another approximate sampler. Much more informative is the MSE measure, which is very similar across the ABC-based and BSL-based algorithms. We therefore conclude that the proposed samplers perform very well in this example.
Figure 9.14: SV α-Stable model: ABSL-U Sampler. Each row corresponds to parameters θ1 (top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.

Figure 9.15: SV α-Stable model: ABC-RW Sampler. Each row corresponds to parameters θ1 (top row), θ2 (second top row), θ3 (second bottom row), θ4 (bottom row) and shows in order from left to right: Trace-plot, Histogram and Auto-correlation function. Red lines represent true parameter values.
Figure 9.16: SV α-Stable model: Estimated densities for each component. First row compares SMC, ABC-RW and AABC-U samplers. Second row compares SMC, BSL-IS and ABSL-U. Columns correspond to parameter components, from left to right: θ1, θ2, θ3 and θ4.
Table 9.4: Simulation Results (SV α-Stable model): average Difference in mean, Difference in covariance, Total variation, square roots of Bias², Variance and MSE, Effective sample size and Effective sample size per CPU time, for every sampling algorithm. In DIM, DIC and TV, samplers are compared to SMC.
For a real-world example we consider the Dow-Jones index daily log returns from January 1, 2010 until December 31, 2018. The data were downloaded from the Yahoo Finance1 website. Given a time series of prices P_i, i = 1, · · · , n, log returns are calculated in the following way:

r_i = log(P_i) − log(P_{i−1}), i = 2, · · · , n.

The resulting time series has length 2262. To make the log returns more suitable for analysis, we standardize r_t by subtracting its mean and then multiply each return by 200, so that the absolute values are not too small; Figure 10.1 shows the transformed returns. This time series (y0) has mean zero by construction, and its auto-correlations and partial auto-correlations are insignificant at any lag; however, it is obvious that the variances are correlated. There are periods of low and high variability, and therefore, to analyze its properties, we apply the Stochastic Volatility model with α-Stable errors described in the previous chapter. Since the likelihood does not exist in closed form for this class of models, simulation-based methods are probably the only available tools for the inference.
10.2 Analysis
The evolution of the time series is described by equation (9.8) (note that the skew parameter of the Stable distribution is fixed at the value −1), with the parameters' priors as in equation (9.9).
1https://ca.finance.yahoo.com/
Figure 10.1: Dow Jones daily transformed log returns for the period Jan 2010 - Dec 2018.
To estimate the posterior distribution we run the AABC-U and ABSL-U samplers. The summary statistic for both methods is the 7-dimensional vector of section 9.6. Each chain was run for 100 thousand iterations, with the last 80 thousand used for inference. Figures 10.2 and 10.3 show trace-plots and histograms for the AABC-U and ABSL-U samplers respectively for each parameter. We observe that, similar to the simulation results, the mixing of AABC-U is generally better than that of ABSL-U. However, the posterior draws of ABSL-U for the first 3 components are uni-modal, symmetric and bell-shaped, very similar to Gaussian distributions, which is not surprising given that Gaussian priors are combined with a Gaussian synthetic likelihood. Table 10.1 reports the posterior mean and 95% credible interval for every parameter and for both samplers. AABC-U and ABSL-U produce similar results.
Table 10.1: Dow Jones log return stochastic volatility: 95% credible intervals and posterior averages for the 4 parameters, for the two proposed samplers (AABC-U and ABSL-U).
We see that the estimated correlation between adjacent variables in the hidden layer of the stochastic volatility model is about 0.9, and the estimated stability parameter of the α-Stable emission noise is 1.91, which can produce more extreme values than predicted by standard Gaussian noise.
Figure 10.2: Dow Jones log returns: AABC-U Sampler. Every column corresponds to a particular parameter component, from left to right: θ1, θ2, θ3, θ4, and shows a trace-plot on top and a histogram on the bottom.
Figure 10.3: Dow Jones log returns: ABSL-U Sampler. Every column corresponds to a particular parameter component, from left to right: θ1, θ2, θ3, θ4, and shows a trace-plot on top and a histogram on the bottom.
Moreover, since 0 is inside the credible interval for θ2, we can disregard this parameter. Overall, this example shows that the proposed samplers AABC-U and ABSL-U can be implemented successfully for real-world data problems.
Chapter 11
Theoretical Justifications
In this chapter we show that our novel approximated ABC MCMC and BSL samplers with independent proposals exhibit ergodic properties in the long run. In other words, we want to show that as the number of MCMC iterations increases, the marginal distribution of {θ(t)} converges in total variation to an appropriate posterior distribution, and that sample averages converge to the true expectations.
11.1 Preliminary Theorems
We start by reviewing our notation. Let p(θ) and q(θ) represent the prior and proposal distributions for θ ∈ Θ, respectively. For AABC we define the function h(θ) = P(δ < ε|θ), where δ = δ(y, y0) and y ∼ f(y|θ). Then, given a proposed ζ∗, the acceptance probability is:

a(θ, ζ∗) = min(1, α(θ, ζ∗)),   α(θ, ζ∗) = [p(ζ∗) q(θ) h(ζ∗)] / [p(θ) q(ζ∗) h(θ)].   (11.1)
This MH procedure defines an exact transition kernel, which we call P(·, ·). Since h(θ) is not available in closed form, we estimate it using the k-Nearest-Neighbor approach. Let Z_N = {(ζn, 1{δn<ε}) ; n = 1, . . . , N} represent N independent samples from q(ζ)P(1{δ<ε}|ζ) for AABC; in fact, Z_N contains the past generated samples saved before the Nth iteration. Given θ and ζ∗, we apply kNN to approximate h(θ) and h(ζ∗) by calculating local weighted averages of the 1{δn<ε} over the ζn that are close to θ or ζ∗. We denote this estimate ĥ(θ; Z_N), and the probability of proposal acceptance for this perturbed algorithm (more on perturbed MCMC
can be found in [77, 71, 46]) is:

â(θ, ζ∗; Z_N) = min(1, α̂(θ, ζ∗; Z_N)),   α̂(θ, ζ∗; Z_N) = [p(ζ∗) q(θ) ĥ(ζ∗; Z_N)] / [p(θ) q(ζ∗) ĥ(θ; Z_N)].   (11.2)
The approximate transition kernel is P̂_N(·, ·) = E_{Z_N}[ P̂_N(·, ·; Z_N) ]; the goal is to show that as N → ∞ the distance between this transition kernel and the exact one converges to zero, where the distance is defined as:

‖P̂_N(·, ·) − P(·, ·)‖ = sup_θ ‖P̂_N(θ, ·) − P(θ, ·)‖_TV,   (11.3)

the latter being the total variation distance between two measures. First we show that, under a strong uniform consistency assumption on ĥ(θ; Z_N), the perturbed kernel converges to the exact one.
Theorem 11.1.1. Suppose Θ is compact, sup_θ |ĥ(θ; Z_N) − h(θ)| → 0 with probability 1, and h(θ) > 0 for all θ ∈ Θ. Then for any ε > 0 there exists C such that for all N > C, ‖P̂_N − P‖ < ε.
Next, let P_ε = {P̃ : ‖P̃ − P‖ < ε} be the collection of perturbed kernels within ε distance of the exact kernel. For illustration, consider the case where the auxiliary set Z_N grows with the number of iterations; then at each iteration a new kernel P̂_N ∈ P_ε is used in the chain. We want to show that this procedure results in an ergodic chain with appropriate convergence properties. For most of the results presented below we refer to the work of [47] on convergence properties of perturbed kernels.
To obtain useful convergence results we need an additional Doeblin Condition assumption on the exact kernel P:

Definition 11.1.1 (Doeblin Condition). Given a kernel P, there exists 0 < α < 1 such that

sup_{(θ,ζ∗)∈Θ×Θ} ‖P(θ, ·) − P(ζ∗, ·)‖_TV < 1 − α.

We also choose ε so that α∗ = α − 2ε satisfies 0 < α∗ < 1, i.e. ε < α/2, which by Remark 2.1 in [47] guarantees that every member of P_ε satisfies the Doeblin Condition with α replaced by α∗ and has a unique invariant measure. Thus we define the following 3 assumptions:
(A1) The exact transition kernel P satisfies the Doeblin Condition;
(A2) for any P̃ ∈ P_ε, ‖P̃ − P‖ < ε;
(A3) ε < min(α/2, (1 − α)/2).
Now, let µ be the invariant measure of the exact kernel P, and let the perturbed chain θ(0), θ(1), · · · , θ(t) be a Markov chain with θ(0) ∼ ν = µ0. Also define the marginal distribution of θ(t), denoted µ_t, t = 1, 2, · · ·, and equal to µ_t = νP̃0P̃1 · · · P̃t, with each P̃t ∈ P_ε, t = 1, 2, · · ·, and P̃0 the identity transition (for convenience). First we need to examine the total variation distance between µ and the average measure ∑_{t=0}^{M−1} µ_t/M, in other words:

‖ µ − (1/M) ∑_{t=0}^{M−1} νP̃0 · · · P̃t ‖_TV, where P̃0 = I.   (11.4)
Then we have the following important convergence result:

Theorem 11.1.2. Suppose that the exact kernel P satisfies (A1), every member of P_ε satisfies (A2), and ε is chosen to satisfy (A3). Let ν be any probability measure on (Θ, F0); then

‖ µ − (1/M) ∑_{t=0}^{M−1} νP̃0 · · · P̃t ‖_TV ≤ (1 − (1 − α)^M) ‖µ − ν‖_TV / (Mα) − ε(1 − (1 − α)^M) / (Mα²) + ε/α,   (11.5)

which implies that this difference can be made arbitrarily small for sufficiently large M and small enough ε.
Next we focus on the following mean squared error (MSE):

E[ ( µf − (1/M) ∑_{t=0}^{M−1} f(θ(t)) )² ],

where f is a bounded function and µf = E_µ[f(θ)]. The main objective here is to find an upper bound for this MSE when the perturbed MCMC is used, and to see how it depends on the sample size M. To obtain the main result we introduce the following lemma:
Lemma 11.1.3. Suppose θ(0) ∼ ν with ν any distribution, and let µ_t = νP̃1 · · · P̃t be the marginal distribution of θ(t), t = 1, 2, · · ·, with P̃t ∈ P_ε and ε satisfying (A2) and (A3) respectively. Moreover, let f(θ) and g(θ) be bounded functions with |f| = sup_θ f(θ) and |g| = sup_θ g(θ); then

cov( f(θ(k)), g(θ(j)) ) ≤ 8 |f| |g| (1 − α∗)^{|k−j|}.
For the proofs we will also utilize the following two theorems: the first concerns the strong uniform consistency of kNN estimators, the second the uniform ergodicity of the Hastings algorithm with independent proposals.

Theorem 11.1.4 (Uniform Consistency of kNN - [16]). Given independent {(ζn, δn) ; n = 1, . . . , N}, let Θ be the support of the distribution of ζ, h(ζ) = E(δ|ζ) and ĥ_N(ζ) = ∑_{j=1}^{N} W_Nj δ_j the kNN estimator (here the j are permuted indices that order the distances between ζn and ζ from smallest to largest). Suppose the weights W_Nj satisfy

(i) ∑_{j=1}^{N} W_Nj = 1,
(ii) W_Nj = 0 for j > K, with K = K(N) such that K → ∞ and K/N → 0,
(iii) sup_N K max_j W_Nj < ∞.

If

(i) Θ is compact,
(ii) h(ζ) is a continuous function,
(iii) Var(δ|ζ) is bounded,
(iv) K(N) satisfies (K/√N) log(N) → ∞,

then sup_{ζ∈Θ} |ĥ_N(ζ) − h(ζ)| → 0 with probability 1.
Note that, once normalized as in (7.2), the uniform and linear weights satisfy the assumptions on W_Nj above.
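For instance, for the normalized uniform weights with K = √N the weight conditions and condition (iv) can be checked directly (a similar computation applies to the linear weights):

W_Nj = (1/K) 1{j ≤ K}:   ∑_{j=1}^{N} W_Nj = K · (1/K) = 1,   sup_N K max_j W_Nj = 1 < ∞,

and with K = √N we have K → ∞, K/N = N^{−1/2} → 0 and (K/√N) log(N) = log(N) → ∞.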
Theorem 11.1.5 (Independent Metropolis sampler - [62]). Suppose θ(t) is an MH Markov chain with invariant distribution π(θ), independent proposal q(θ) and acceptance probabilities

a(θ, ζ∗) = min( 1, [π(ζ∗) q(θ)] / [π(θ) q(ζ∗)] ).

If there exists β > 0 such that q(θ)/π(θ) > β for all θ ∈ Θ, then the algorithm is uniformly ergodic, so that ‖P^n(θ, ·) − π‖_TV < (1 − β)^n (here P^n(θ, ·) is the conditional distribution of θ(n) given θ(0) = θ).
11.2 Main Results
The next important convergence result follows (similar to Theorem 2.5 of [47]):
Theorem 11.2.1 (Approximation of MSE). Suppose P, P_ε and ε satisfy (A1), (A2) and (A3) respectively. Let µ represent the invariant measure of P, let f(θ) be a bounded function and θ(0) ∼ ν. Then

E[ ( µf − (1/M) ∑_{t=0}^{M−1} f(θ(t)) )² ]
≤ 4|f|² ( (1 − (1 − α)^M)/(Mα) − ε(1 − (1 − α)^M)/(Mα²) + ε/α )²
+ 8|f|² ( 1/M + (2/(α∗)²) ( ((1 − α∗)^{M+1} − (1 − α∗))/M² + ((1 − α∗) − (1 − α∗)²)/M ) ).   (11.6)

In other words, this expectation can be made arbitrarily small for sufficiently large M and small enough ε.
Based on these theorems we can now obtain convergence results for the AABC and ABSL algorithms.

Theorem 11.2.2 (Ergodicity of AABC). Consider the proposed AABC sampler (with threshold ε); let p(θ) represent the prior measure on Θ and Z_N the simulated pairs {(ζn, 1{δn<ε}) ; n = 1, . . . , N} (ζn ∼ q(ζ)), with the following assumptions:

(B1) Θ is a compact set.
(B2) q(θ) > 0 is the continuous density of the independent proposal distribution.
(B3) p(θ) > 0 is the continuous density of the prior distribution.
(B4) h(θ) = P(δ < ε|θ) > 0 is a continuous function of θ.
(B5) In the kNN estimation, K(N) = √N with uniform or linear weights.

Then for sufficiently large N (number of past simulations) and M (number of chain iterations), (A1)-(A3) are satisfied and the error bounds of Theorems 11.1.2 and 11.2.1 follow.
Corollary 11.2.2.1 (Ergodicity of ABSL). Consider the proposed ABSL algorithm; let p(θ) represent the prior measure on Θ, h(θ) = N(s0; µθ, Σθ), and Z_N the simulated pairs {(ζn, sn) ; n = 1, . . . , N} (ζn ∼ q(ζ), with sn the summary statistic), with the following assumptions:

(B1) Θ is a compact set.
(B2) q(θ) > 0 is the continuous density of the independent proposal distribution.
(B3) p(θ) > 0 is the continuous density of the prior distribution.
(B4) h(θ) is a continuous function of θ.
(B5) |Σθ| > a0 for some a0 > 0, where Σθ = Var(s|θ), for every θ ∈ Θ.
(B6) E[s_j|θ] and E[s_j s_k|θ] are continuous functions of θ for every 1 ≤ j, k ≤ p, with s_j representing the jth component of the summary statistic s.
(B7) Var[s_j|θ] and Var[s_j s_k|θ] are bounded functions.
(B8) In the kNN estimation, K(N) = √N with uniform or linear weights.

Then for sufficiently large N (number of past simulations) and M (number of chain iterations), (A1)-(A3) are satisfied and the error bounds of Theorems 11.1.2 and 11.2.1 follow.
11.3 Proofs of Theorems
Proof. [Proof of Theorem 11.1.1] Note that sup_θ |ĥ(θ; Z_N) − h(θ)| → 0 w.p. 1 implies that for all θ and ζ∗ in Θ:

ĥ(θ; Z_N) →p h(θ),   ĥ(ζ∗; Z_N) →p h(ζ∗);

therefore by Slutsky's theorem we obtain

ĥ(ζ∗; Z_N) / ĥ(θ; Z_N) →p h(ζ∗) / h(θ)

for all (θ, ζ∗) in Θ × Θ. Therefore

α̂(θ, ζ∗; Z_N) = [p(ζ∗) q(θ) ĥ(ζ∗; Z_N)] / [p(θ) q(ζ∗) ĥ(θ; Z_N)] →p [p(ζ∗) q(θ) h(ζ∗)] / [p(θ) q(ζ∗) h(θ)] = α(θ, ζ∗).

Since min(1, x) is a continuous function, the Continuous Mapping Theorem implies that