Stick-Breaking Autoregressive Processes
J.E. Griffin∗
School of Mathematics, Statistics and Actuarial Science,
University of Kent, U.K.
M. F. J. Steel
Department of Statistics, University of Warwick, U.K.
Abstract
This paper considers the problem of defining a time-dependent nonparametric prior for
use in Bayesian nonparametric modelling of time series. A recursive construction allows
the definition of priors whose marginals have a general stick-breaking form. The processes
with Poisson-Dirichlet and Dirichlet process marginals are investigated in some detail. We
develop a general conditional Markov Chain Monte Carlo (MCMC) method for inference
in the wide subclass of these models where the parameters of the marginal stick-breaking
process are nondecreasing sequences. We derive a generalized Polya urn scheme type
representation of the Dirichlet process construction, which allows us to develop a marginal
MCMC method for this case. We apply the proposed methods to financial data to develop
a semi-parametric stochastic volatility model with a time-varying nonparametric returns
distribution. Finally, we present two examples concerning the analysis of regional GDP
and its growth.
∗Corresponding author: Jim Griffin, School of Mathematics, Statistics and Actuarial Science, University of Kent, Canterbury, CT2 7NF, U.K. Tel.: +44-1227-823627; Fax: +44-1227-827932; Email: J.E.Griffin-[email protected]. The authors would like to acknowledge helpful comments from the Co-Editor and anonymous reviewers and from seminar audiences at the Universities of Newcastle, Nottingham, Bath, Sheffield and Bristol, Imperial College London and the Gatsby Computational Neuroscience group.
1 Introduction

Nonparametric estimation is an increasingly important element in the modern econometrician's toolbox. This paper focuses on nonparametric Bayesian models, which, in spite of
the name, involve infinitely many parameters, and are typically used to express uncertainty
in distribution spaces.1 We will concentrate on the infinite mixture model, which for an
observable y, can be written as
f(y) = ∑_{i=1}^{∞} pi k(y|θi)

where k(y|θ) is a probability density function for y, while p1, p2, p3, . . . is an infinite sequence
of positive numbers such that ∑_{i=1}^{∞} pi = 1 and θ1, θ2, θ3, . . . is an infinite sequence
of parameter values for k. The model represents the distribution of y as an infinite mixture
which can flexibly represent a wide range of distributional shapes and generalizes the stan-
dard finite mixture model. The Bayesian analysis of this model is completed by the choice
of a prior for p and θ. Typically θ1, θ2, θ3, . . . are assumed independent and identically
distributed from a distribution H . The model is often expressed more compactly in terms
of a mixing measure G,
f(y) = ∫ k(y|ϕ) dG(ϕ)     (1)

where G is a random probability measure

G = ∑_{i=1}^{∞} pi δ_{θi}

where δθ is the Dirac measure which places mass 1 on the point θ. The distribution
G is almost surely discrete and each element is often called an “atom”. Many priors have
been proposed for G including the Dirichlet process (Lo, 1984), Stick-Breaking (Ishwaran
1Surveys of Bayesian nonparametric methods are provided in Walker et al. (1999) and Muller and Quintana
(2004). Early applications in economics include autoregressive panel data models (Hirano, 2002), longitudinal
data treatment models (Chib and Hamilton, 2002) and stochastic frontiers (Griffin and Steel, 2004).
and James, 2001, 2003) (which will be discussed in section 2), and Normalized Random
Measures (James et al., 2009).
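To make the infinite mixture construction concrete, the following sketch draws a truncated stick-breaking measure with Dirichlet process weights (Vi ∼ Be(1, M)) and evaluates the resulting mixture density. The truncation level N, the standard normal base measure H and the Gaussian kernel with standard deviation sd are illustrative choices of ours, not part of the paper:

```python
import math
import random

def draw_truncated_dp(M, N, rng, H=lambda r: r.gauss(0.0, 1.0)):
    """Draw (weights, atoms) from a stick-breaking DP(M H), truncated at N atoms.

    V_i ~ Be(1, M) gives p_i = V_i * prod_{j<i} (1 - V_j); the last break is
    set to one so the truncated weights sum to exactly one.
    """
    weights, atoms, stick = [], [], 1.0
    for i in range(N):
        v = rng.betavariate(1.0, M) if i < N - 1 else 1.0  # absorb the remainder
        weights.append(v * stick)
        stick *= 1.0 - v
        atoms.append(H(rng))
    return weights, atoms

def mixture_density(y, weights, atoms, sd=0.5):
    """f(y) = sum_i p_i k(y | theta_i) with a N(theta_i, sd^2) kernel."""
    return sum(p * math.exp(-0.5 * ((y - th) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
               for p, th in zip(weights, atoms))

w, a = draw_truncated_dp(M=2.0, N=50, rng=random.Random(1))
assert abs(sum(w) - 1.0) < 1e-9
```

Setting the final break to one is a common device for finite approximations of stick-breaking priors; it guarantees a proper probability vector at any truncation level.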
If covariates are observed with y, the infinite mixture model can be extended by allow-
ing the kernel k to depend on x (Leslie et al., 2007; Chib and Greenberg, 2009), the mixing
measure G to depend on x (De Iorio et al., 2004; Muller et al., 2004; Griffin and Steel,
2006; Dunson et al., 2007) or both (Geweke and Keane, 2007). This allows posterior infer-
ence about the unknown distribution at a particular covariate value to borrow strength from
the distribution at other covariate values and so posterior estimates of the distributions at
different covariate values are smoothed. Similarly, if observations are made over time then
the mixture model can be extended to allow time dependence in the kernel k or the mix-
ing measure G. We will concentrate on the second case and consider an extended infinite
mixture model to define the distribution ft(y) at time t by
ft(y) = ∫ k(y|ϕ) dGt(ϕ)     (2)

Gt = ∑_{j=1}^{∞} pj(t) δ_{θj}

where ∑_{j=1}^{∞} pj(t) = 1 for all t. The kernels have fixed parameters but their probabilities
are allowed to change over time.
There are several models which fit into the framework of (2). One possible approach ex-
presses p1(t), p2(t), p3(t), . . . as the stick-breaking transformation of V (t) = (V1(t), V2(t), V3(t), . . . )
where 0 < Vj(t) < 1 for all j and t, so that
pj(t) = Vj(t) ∏_{i<j} (1 − Vi(t)).
The “arrivals” construction of Griffin and Steel (2006) defines a sequence of times τ1, τ2, τ3, . . .
which follow a Poisson process and increases the size of V (t) at time τj by introducing
an extra variable 0 < V*_j < 1. This process defines distributions that change in continuous
time and whose autocorrelation can be controlled by the intensity of the Poisson
process. The Time Series Dependent Dirichlet Process of Nieto-Barajas et al. (2008) defines
a stochastic process for Vj(t), independent from Vk(t) for k ≠ j, using an auxiliary
sequence of binomial random variables. This process does not include the introduction of
new atoms and so has rather different areas of application than the “arrivals” construction
and the models developed in this paper. Alternatively, Zhu et al. (2005) model the distri-
bution function of the observables directly by defining a Time-Sensitive Dirichlet Process
which generalizes the Polya urn scheme of the Dirichlet process. However, the process
is not consistent under marginalisation of the sample. A related approach is described by
Caron et al. (2007) who define a time-dependent nonparametric prior with Dirichlet pro-
cess marginals by allowing atoms to be deleted (unlike Griffin and Steel (2006)) as well
as added at each discrete time point. A careful construction of the addition and deletion
process allows a Polya urn scheme to be derived and defines a process which is consistent
under marginalisation.
In discrete time, Dunson (2006) proposes the evolution
Gt = πGt−1 + (1− π)εt
where π is a parameter and εt is a realisation from a Dirichlet process. This defines an AR-
process type model and an explicit Polya urn-type representation allows efficient inference.
This has some similarities to the “arrivals” π-DDP in discrete time, where the model is
generalized to
Gt = πt−1Gt−1 + (1− πt−1)εt−1
where εt is a discrete distribution with a Poisson-distributed number of atoms and πt is a
random variable correlated with εt. Griffin and Steel (2006) show how to ensure that the
marginal law of Gt follows a Dirichlet process for all t.
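The discrete-time evolution above can be sketched on truncated atomic measures, representing each Gt as a map from atoms to weights; the convex combination preserves total mass one. The truncation level and the DP draws for the innovation are our own illustrative choices, not the authors' implementation:

```python
import random

def draw_truncated_dp(M, N, rng):
    """Truncated stick-breaking draw from DP(M H), with H = standard normal."""
    measure, stick = {}, 1.0
    for i in range(N):
        v = rng.betavariate(1.0, M) if i < N - 1 else 1.0
        measure[rng.gauss(0.0, 1.0)] = v * stick
        stick *= 1.0 - v
    return measure

def evolve(G_prev, pi, eps):
    """One step of G_t = pi * G_{t-1} + (1 - pi) * eps_t on atomic measures."""
    G = {atom: pi * w for atom, w in G_prev.items()}
    for atom, w in eps.items():
        G[atom] = G.get(atom, 0.0) + (1.0 - pi) * w
    return G

rng = random.Random(0)
G = draw_truncated_dp(2.0, 30, rng)
for _ in range(5):  # five time steps of the AR-type evolution
    G = evolve(G, pi=0.8, eps=draw_truncated_dp(2.0, 30, rng))
assert abs(sum(G.values()) - 1.0) < 1e-9
```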
An alternative approach is to assume that pj(t) = pj for all t and the time dependence is
instead introduced through the atoms θj ; in that case we obtain an infinite mixture of time-
series model in the framework of single-p DDP models. Rodriguez and ter Horst (2008)
develop methods in this direction. Other time-dependent nonparametric models have been
developed as generalisations of hidden Markov models. Fox et al. (2008) develop a hidden
Markov model with an infinite number of states and apply their method to regime-switching
in financial time series. Taddy and Kottas (2009) propose a Markov-switching model where
there are a finite number of states each of which is associated with a nonparametric distri-
bution.
In this paper we develop a process of distributions {Gt} which evolve by adding new
atoms in continuous time and can be used in (2). An infinite mixture model, as opposed
to a finite mixture model, is natural in this context since it can be thought of as being
generated by the introduction of atoms from infinitely far in the past. The processes are
strictly stationary and have the same stick-breaking process prior marginally for all t. The
specific cases where the marginal is a Dirichlet process and a Poisson-Dirichlet2 process are
discussed (and represent two special cases of the construction). A Polya urn-type scheme is
derived for the important special case of a Dirichlet process marginal. Markov chain Monte
Carlo (MCMC) schemes for inference with both the general process and the special cases
are described. This generalizes the work of Griffin and Steel (2006) from Dirichlet process
marginals to more general stick-breaking processes.
The paper is organised in the following way: Section 2 briefly discusses stick-breaking
processes and Section 3 describes the link between stick-breaking processes and time-
varying nonparametric models and presents a method for constructing stationary processes
with a given marginal stick-breaking process. This section also discusses two important
special cases: with Dirichlet process and Poisson-Dirichlet marginals. Section 4 briefly
describes the proposed computational methods. In Section 5 we explore three econometric
examples using the two leading processes. In particular, we develop a stochastic volatility
model with a time-varying nonparametric returns distribution, and analyse models for re-
gional GDP and growth of regional GDP. Proofs of all Theorems are grouped in Appendix
A, Appendix B presents details of the MCMC algorithms and Appendix C compares the
marginal and the conditional algorithms for the case with Dirichlet process marginals.
2 Stick-Breaking Processes
The choice of prior for G defines the mixture model. We will concentrate on stick-breaking
processes which are defined as follows for the static case:
Definition 1 Suppose that a = (a1, a2, . . . ) and b = (b1, b2, . . . ) are sequences of positive
real numbers. A random probability measure G follows a stick-breaking process if
G =_d ∑_{i=1}^{∞} pi δ_{θi}

pi = Vi ∏_{j<i} (1 − Vj)
where θ1, θ2, θ3, . . . are i.i.d. from H and V1, V2, V3, . . . is a sequence of independent random
variables with Vi ∼ Be(ai, bi) for all i. This process will be denoted by G ∼ Π(a, b, H).
2This process is also sometimes called a Pitman-Yor process, after Pitman and Yor (1997).
Ishwaran and James (2001) show that the process is well-defined (i.e. ∑_{i=1}^{∞} pi = 1 almost
surely) if

∑_{i=1}^{∞} log(1 + ai/bi) = ∞.
Note that the ordering of the atoms θi in a stick-breaking process matters, since the mean
weight E[pi] is decreasing with i; thus the atoms later in the ordering will tend to have less
weight.
Two specific stick-breaking priors have been well-studied: the Dirichlet process where
ai = 1 and bi = M, denoted by G ∼ DP(MH), and the Poisson-Dirichlet process where
ai = 1 − a and bi = M + ia (with 0 ≤ a < 1, M > −a).
The Dirichlet process can be expressed as a special case of the Poisson-Dirichlet process
where a = 0. The Poisson-Dirichlet process is an attractive generalisation of the Dirichlet
process where E[pi] can decay at a sub-geometric rate (in contrast to the geometric rate
associated with the Dirichlet process). This also affects the distribution of the number of
clusters3 in a sample of n values. The mean for the Dirichlet process is approximately
M log n whereas the mean for the Poisson-Dirichlet process is S_{a,M} n^a (Pitman, 2003),
where S_{a,M} depends on a and M. The parameter a also affects the shape of the distribution of the number of clusters.

Figure 1: Number of clusters in n = 20 observations, using M = 1 and different values of a for the
Poisson-Dirichlet process (a = 0 corresponds to the Dirichlet process)

Figure 1 shows this distribution for different values of a with M
fixed at 1. Clearly, the variance of the distribution increases with a. This extra flexibility
allows a wide range of possible variances for a given prior mean number of clusters.
3An important property of these Bayesian nonparametric models is that the same atom can be assigned to more
than one observation; in other words, they induce clustering of the n observations into typically less than n groups.
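The contrast between the two priors can be illustrated by simulating the number of clusters with the generalized (Pitman-Yor) Chinese restaurant scheme, under which a new cluster is opened with probability (M + a·k)/(M + n) after n observations in k clusters; a = 0 recovers the Dirichlet process. This is a standard predictive scheme rather than code from the paper, and the simulation sizes are arbitrary:

```python
import random

def num_clusters(n, M, a, rng):
    """Number of clusters in n draws under the Poisson-Dirichlet (Pitman-Yor)
    generalized Chinese-restaurant scheme; a = 0 gives the Dirichlet process."""
    sizes = []
    for i in range(n):
        k = len(sizes)
        if rng.random() < (M + a * k) / (M + i):
            sizes.append(1)  # open a new cluster
        else:
            # join existing cluster j with probability (n_j - a)/(M + i)
            u = rng.random() * (i - a * k)
            acc = 0.0
            for j, nj in enumerate(sizes):
                acc += nj - a
                if u < acc:
                    sizes[j] += 1
                    break
    return len(sizes)

rng = random.Random(42)
reps = 20000
mean_dp = sum(num_clusters(20, 1.0, 0.0, rng) for _ in range(reps)) / reps
# For the DP, E[K] = sum_{i=0}^{19} M/(M+i), the harmonic sum when M = 1
assert abs(mean_dp - sum(1.0 / (1 + i) for i in range(20))) < 0.1
```

Repeating the simulation with a > 0 and the same M spreads the cluster-count distribution out, in line with Figure 1.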
3 Stick-Breaking Autoregressive Processes
In this section, we will construct {Gt}t∈[0,T ] which are strictly stationary with stick-breaking
marginals and can be used in the time-dependent mixture model in (2). We start with a
process in continuous time defined by

Gt = G̃_{N(t)}

where N(t) is a Poisson process with intensity λ and G̃s is defined by the recursion

G̃s = Vs G̃s−1 + (1 − Vs) δ_{θs}     (3)
where θs ∼ H and Vs ∼ Be(1, M). The process Gt evolves through jumps in the dis-
tribution at the arrival times of the Poisson process. This occurs in a specific way: a new
atom θs is introduced into Gt (so a new cluster is introduced into the mixture model for y
in (2)) and the previous atoms are discounted by Vs (so all old clusters are downweighted
in the mixture model). Griffin and Steel (2006) show that both G̃ and G are strictly stationary
processes and that G̃s and Gt follow a Dirichlet process with mass parameter M and
centring distribution H. They also show that for any set B
Corr(Gt(B), Gt+k(B)) = exp{−λk/(M + 1)} ≡ ρ^k     (4)
and so the dependence between measures on any set decreases exponentially at the same
rate.
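On a truncated atomic representation, one step of the recursion in (3) simply shrinks the existing weights by Vs and appends a new atom with weight 1 − Vs, while the autocorrelation in (4) is available in closed form. A minimal sketch (the normal base measure and starting state are our choices):

```python
import math
import random

def dp_recursion_step(atoms, weights, M, rng):
    """One step of G_s = V_s G_{s-1} + (1 - V_s) delta_{theta_s}:
    shrink the old weights by V_s ~ Be(1, M) and add theta_s ~ N(0, 1)."""
    v = rng.betavariate(1.0, M)
    weights = [v * w for w in weights]
    atoms = list(atoms) + [rng.gauss(0.0, 1.0)]
    weights.append(1.0 - v)
    return atoms, weights

def autocorrelation(k, lam, M):
    """Corr(G_t(B), G_{t+k}(B)) = exp(-lam * k / (M + 1)), equation (4)."""
    return math.exp(-lam * k / (M + 1.0))

rng = random.Random(3)
atoms, weights = [0.0], [1.0]          # start from a single atom
for _ in range(100):
    atoms, weights = dp_recursion_step(atoms, weights, M=1.0, rng=rng)
assert abs(sum(weights) - 1.0) < 1e-9  # each step preserves total mass one
assert autocorrelation(0, 0.5, 1.0) == 1.0
```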
The form of (3) defines a simple recursion but it also restricts the form of the marginal
Gt to a Dirichlet or two-parameter beta process (Ishwaran and Zarepour, 2000). The pro-
cess Gt is stationary and so the choice of marginal process will control the distribution of
clusters over time. The Dirichlet process has a relatively small variance of the number of
clusters for a fixed mean. Therefore, the number of clusters will be relatively stable over
time. Alternative marginal processes, such as a Poisson-Dirichlet process as defined in the
previous section, offer the potential for larger variances and so more flexibility in modelling
the number of clusters over time.
The distribution G̃s defined by (3) can be expressed in stick-breaking form as

G̃s = ∑_{i=−∞}^{s} Vi ∏_{j=i+1}^{s} (1 − Vj) δ_{θi}.     (5)
The stick-breaking is applied “backwards in time” which, combined with the ordering of
the expected weights, means that weights tend to be larger for atoms introduced at later
times.
Because the new atoms which “refresh” the distribution are placed at the front of the
ordering (i.e. with the highest expected weight), the Vi’s in (5) are associated with a differ-
ent place in the ordering whenever a new atom appears. If the distribution of Vi depends
on its place in this ordering (as is the case for the Poisson-Dirichlet marginals), this needs
to be taken into account. Thus, for more general marginals for Gt, we need to consider the
more general model
G̃s = ∑_{i=−∞}^{s} Vi,s−i+1 ∏_{j=i+1}^{s} (1 − Vj,s−j+1) δ_{θi},     (6)
where Vj,1, Vj,2, . . . denote the values of the break introduced at time j as it evolves over
time and Vj,t represents its value at time j + t. The stochastic processes {Vi,t}_{t≥1} and
{Vj,t}_{t≥1} must be independent for i ≠ j if the process Gt is to have a stick-breaking form.
If we want Gt to follow a stick-breaking process with parameters a and b (see Definition
1) then Vi,t has to be marginally distributed as Be(at, bt) for all i and t
and clearly we need time dependence of the breaks. The following theorem allows us to
define a reversible stochastic process Vi,t which has the correct distributions at all time
points to define a stationary process whose marginal process has a given stick-breaking
form. In the following theorem we adopt the convention that if X ∼ Be(a, 0) then X = 1
with probability 1 and if X ∼ Be(0, b) then X = 0 with probability 1.
Theorem 1 Suppose that at+1 ≥ at and bt+1 ≥ bt. If

Vj,t+1 = wj,t+1 Vj,t + (1 − wj,t+1) εj,t+1

where wj,t+1 ∼ Be(at + bt, at+1 + bt+1 − at − bt), εj,t+1 ∼ Be(at+1 − at, bt+1 − bt) and
wj,t+1 is independent of εj,t+1, then the marginal distributions of Vj,t and Vj,t+1 are
Vj,t ∼ Be(at, bt) and Vj,t+1 ∼ Be(at+1, bt+1).
The application of this theorem allows us to construct stochastic processes with the cor-
rect margins for any stick-breaking process for which a1, a2, . . . and b1, b2, . . . form non-
decreasing sequences. Other choices can be accommodated, but nondecreasing sequences
allow for the important case of Poisson-Dirichlet marginals. In addition, they are the most
computationally convenient since the transitions of the stochastic process are mixtures, for
which we have well-developed simulation techniques. In this case, Theorem 1 leads to a
stick-breaking representation for Vj,t, which will be formalized in a continuous-time setting
in the following definition for our general class of nonparametric models.
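The transition in Theorem 1 can be checked numerically: starting from Vj,t ∼ Be(at, bt) and applying the mixture update should produce draws whose mean matches at+1/(at+1 + bt+1). The parameter values below are arbitrary nondecreasing choices of ours:

```python
import random

def beta_transition(v_t, a_t, b_t, a_next, b_next, rng):
    """Theorem 1 transition: V_{t+1} = w V_t + (1 - w) eps, with
    w ~ Be(a_t + b_t, a_next + b_next - a_t - b_t) and
    eps ~ Be(a_next - a_t, b_next - b_t), drawn independently."""
    w = rng.betavariate(a_t + b_t, a_next + b_next - a_t - b_t)
    eps = rng.betavariate(a_next - a_t, b_next - b_t)
    return w * v_t + (1.0 - w) * eps

rng = random.Random(11)
a_t, b_t, a_next, b_next = 1.0, 1.0, 1.5, 2.5   # nondecreasing sequences
n = 100_000
draws = [beta_transition(rng.betavariate(a_t, b_t), a_t, b_t, a_next, b_next, rng)
         for _ in range(n)]
mean = sum(draws) / n
# If the theorem holds, V_{t+1} ~ Be(1.5, 2.5), so E[V_{t+1}] = 1.5/4 = 0.375
assert abs(mean - a_next / (a_next + b_next)) < 0.01
```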
Definition 2 Assume that τ1, τ2, τ3, . . . follow a homogeneous Poisson process with inten-
sity λ and the count mj(t) is the number of arrivals of the point process between τj and t,
so that
mj(t) = #{k|τj < τk < t}.
Let Π be a stick-breaking process (see Def. 1) with parameters a,b,H where a and b are
nondecreasing sequences. Define
Gt = ∑_{j=1}^{∞} pj(t) δ_{θj}

where θ1, θ2, θ3, . . . are i.i.d. from H and

pj(t) = 0 if τj > t,  pj(t) = Vj(t) ∏_{{k|τj<τk<t}} (1 − Vk(t)) if τj < t,

Vj(t) = ∑_{l=1}^{mj(t)+1} εj,l (1 − wj,l) ∏_{i=l+1}^{mj(t)+1} wj,i

with εj,1 ∼ Be(a1, b1), wj,1 = 0 and, for m ≥ 2, wj,m ∼ Be(am−1 + bm−1, am + bm − am−1 − bm−1) and εj,m ∼ Be(am − am−1, bm − bm−1). Then we call Gt a Π-AR (for
Π-autoregressive) process, denoted as Π-AR(a,b,H; λ).
This process is strictly stationary and hence mean-reverting. A new atom (or a
new cluster in the mixture model) is introduced at each jump point of the Poisson process
and other atoms are downweighted (as with the process with Dirichlet process marginal).
However, the values of Vj,t will also decay as t increases, in line with Theorem 1.
The mixture with the Π-AR as the mixing distribution defines a standard change-point
model (Carlin et al., 1992) if Gt is a single atom for all t. Therefore the new hierarchical
model defines a generalisation of change-point models that allows observations to be drawn
from a mixture distribution where components will change according to a point process.
Alternatively, as λ → 0, the process tends to the corresponding static nonparametric model
and as λ → ∞ the process becomes uncorrelated in time. We have restricted attention to
nondecreasing sequences since efficient inferential methods can be developed. This defi-
nition could be extended to define priors where the marginal process is stick-breaking but
one or both parameter sequences are decreasing4.
4Details can be found in a working paper version entitled “Time-Dependent Stick-Breaking Processes” which
is freely available at http://www.warwick.ac.uk/go/crism/research/2009/paper09-05.
3.1 Special cases
3.1.1 Dirichlet process marginal
The Dirichlet process is the most commonly used nonparametric prior for the mixing dis-
tribution which arises when aj = 1 and bj = M for all j. So here we do not need to
apply Theorem 1. The Π-AR with a Dirichlet process marginal (denoted as DPAR model)
has a simple form with εj,1 ∼ Be(1,M) and wj,m = 1, εj,m = 0 for all m ≥ 2 so that
Vj(t) = εj,1 for all t. Writing Vj = εj,1 motivates the following definition.
Definition 3 Let τ1, τ2, . . . follow a homogeneous Poisson process with intensity λ and
V1, V2, . . . be i.i.d. Be(1, M). Then we say that {Gt}_{t=−∞}^{∞} follows a DPAR(M, H; λ) if

Gt = ∑_{j=1}^{∞} pj(t) δ_{θj}

where θ1, θ2, . . . are i.i.d. from H and

pj(t) = 0 if τj > t,  pj(t) = Vj ∏_{{k|τj<τk<t}} (1 − Vk) if τj < t.
An important property of the Dirichlet process is the availability of a Polya urn scheme
representation of a sample drawn from a distribution with a Dirichlet process prior. This
representation implies that if x1, x2, x3, . . . are i.i.d. with distribution G and G ∼ DP(MH),
then xn+1 | x1, x2, . . . , xn is drawn from the probability measure

M/(M + n) H + ∑_{i=1}^{n} 1/(M + n) δ_{xi}.
Thus xn+1 will either share its value with one of the previously obtained
xi’s or will be a new value drawn from the distribution H . Importantly, this is a mixture
distribution with a finite number of components, and this representation typically leads to
a considerable simplification in our computations, since it allows us to marginalize with
respect to G. We shall now show that the DPAR model can also be represented sequentially
through mixture distributions with a finite number of components.
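The predictive above can be sampled directly: with probability M/(M + n) the next value is fresh from H, otherwise it copies a uniformly chosen earlier value. A sketch with H taken as standard normal (our choice):

```python
import random

def polya_urn_sample(n, M, rng):
    """Draw x_1, ..., x_n sequentially from the DP(M H) Polya urn:
    x_{k+1} | x_1..x_k ~ M/(M+k) H + sum_i 1/(M+k) delta_{x_i}."""
    xs = []
    for k in range(n):
        if rng.random() < M / (M + k):
            xs.append(rng.gauss(0.0, 1.0))   # new value from H
        else:
            xs.append(rng.choice(xs))        # tie with an earlier draw
    return xs

xs = polya_urn_sample(50, 2.0, random.Random(5))
assert len(xs) == 50
assert len(set(xs)) < 50  # ties occur, so fewer distinct values than draws
```

The number of distinct values among the n draws is exactly the number of clusters discussed in Section 2.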
Let x1, x2, . . . , xn be a sequence of values drawn at times t1, t2, . . . , tn respectively.
Suppose that xi ∼ Gti and {Gt} ∼ DPAR(M,H; λ). We will construct a representation
of the process by sequentially drawing xi conditional on x1, x2, . . . , xi−1 and t1, t2, . . . , ti.
The DPAR model assumes that each value x1, x2, . . . , xn is introduced into Gt at particular
time points and the set of these times is Tn. The size of Tn will be denoted by kn. It will
be useful to prove the following lemma before deriving the representation. This concerns the effect
of the previous observations on the distribution of T = (τ1, τ2, τ3, . . . ). We will work with
allocation variables s1, s2, . . . , sn which are chosen so that xi = θsi and define an active
set,
An(t) = {i|ti ≥ t and τsi < t for 1 ≤ i ≤ n},
which contains the observations available to be allocated at time t. If all observations are
made at the same time, the model reduces to the standard Dirichlet process and An(t) will
be an increasing process. However, once observations are made at different times, the
process will be increasing between observation times but can decrease at those times.
To derive a convenient representation of the DPAR it is useful to consider the following
lemma which derives the distribution of τ conditional on Tn and a set of m ≥ n times
t1, t2, . . . , tm.
Lemma 1 Let Sn,m = Tn ∪ {t1, . . . , tm}, where m ≥ n, which has size ln,m, and let T^C_n be
the set difference of T and Tn. If φ1 < φ2 < · · · < φ_{ln,m} are the elements of Sn,m, the
distribution of T^C_n conditional on s1, s2, . . . , sn and Sn,m is an inhomogeneous Poisson
process with a piecewise constant intensity f(·), where

f(u) = λ for −∞ < u ≤ φ1,
f(u) = λ M/(M + An(φi)) for φi−1 < u ≤ φi, 2 ≤ i ≤ ln,m,
f(u) = λ for φ_{ln,m} < u < ∞.
This lemma implies that the intensity of the underlying Poisson process falls as An(φi)
increases with larger values of M associated with smaller decreases. Intuitively, the knowl-
edge that xi arrives in Gt at time τsi reduces the chance of observing new values between
τsi and ti. The following theorem now gives a finite representation of a sample from a
DPAR prior.
Theorem 2 Let τ*_1 < τ*_2 < · · · < τ*_{kn} be the elements of Tn and φ1 < φ2 < · · · < φ_{ln,m+1}
be the elements of Sn,m+1. We define

Ci = exp{ −λ M² (φi+1 − φi) / [(M + An(φi+1))² (1 + M + An(φi+1))] }
   = ρ^{M²(M+1)(φi+1−φi) / [(M+An(φi+1))²(1+M+An(φi+1))]},  1 ≤ i < ln,m+1

and

Di = (M + An(τ*_i)) / (1 + ηi + M + An(τ*_i)),  1 ≤ i ≤ kn
where ηi = ∑_{j=1}^{n} I(sj = i). Let φp = tm+1 and let τ*_q be the largest element of Tn smaller
than tm+1. The distribution of sn+1 given Sn,m, s1, . . . , sn, tn+1 can be represented by
A Proofs

A.1 Proof of Theorem 1

We will use the notation Ga(a) to denote a gamma distribution with shape a and unit scale. A standard property of beta random variables implies that Vj,t = qj,t/(qj,t + rj,t), where qj,t ∼ Ga(at), rj,t ∼ Ga(bt) and qj,t and rj,t are independent for all j and t. Let qj,t+1 = qj,t + xj,t+1 and rj,t+1 = rj,t + zj,t+1, where xj,t+1 ∼ Ga(at+1 − at), zj,t+1 ∼ Ga(bt+1 − bt) and xj,t+1 and zj,t+1 are independent; then qj,t+1 ∼ Ga(at+1), rj,t+1 ∼ Ga(bt+1) and qj,t+1 and rj,t+1 are independent. Writing

Vj,t+1 = qj,t+1/(qj,t+1 + rj,t+1) = wj,t+1 Vj,t + (1 − wj,t+1) εj,t+1

where wj,t+1 = (qj,t + rj,t)/(qj,t + rj,t + xj,t+1 + zj,t+1) and εj,t+1 = xj,t+1/(xj,t+1 + zj,t+1) implies that Vj,t+1 is beta distributed with the correct parameters. Standard properties of beta and gamma distributions show that xj,t+1 + zj,t+1 is independent of εj,t+1, that wj,t+1 is independent of εj,t+1, and that wj,t+1 ∼ Be(at + bt, at+1 + bt+1 − at − bt) and εj,t+1 ∼ Be(at+1 − at, bt+1 − bt).
A.2 Proof of Lemma 1
Let ki = #{j | φi < τj < φi+1} for 1 ≤ i ≤ ln,m − 1.

which shows that the number of points in (φi, φi+1) is Poisson distributed with mean
(M/(M + An(φi+1))) λ(φi+1 − φi). The positions of the points are unaffected by the likelihood and
so the posterior is a Poisson process. There is no likelihood contribution on the intervals
(−∞, φ1) and (φ_{ln,m}, ∞). Since the Poisson process has independent increments, the
posterior distribution on these intervals is also a Poisson process with intensity λ.
A.3 Proof of Theorem 2
In order to calculate the predictive distribution we need to calculate the probability of generating the sample s1, s2, . . . , sn, which is given by
Finally, since τ*_{kn+1} is uniformly distributed on (φi−1, φi) if τ*_{kn+1} ∈ (φi−1, φi), then

p(τ*_{kn+1} = φi − x | τ*_{kn+1} ∈ (φi−1, φi))

∝ E[ (M/(M + An(φi)))^{#{j|φi−1<τj<φi}} ]

∝ exp{ −λM² (φi − τ*_{kn+1}) / [(M + An(φi))² (M + An(φi) + 1)] }

which implies that x = φi − τ*_{kn+1} follows the distribution given in the theorem.
B Computational Details

We will write the times in reverse time-order T > τ1 > τ2 > τ3 > · · · > τk, where T = max{ti} and k = max{sj}. Let k−i = max_{j≠i}{sj}. We will use the notation from Definition 2, mj(t) = #{k | τj < τk < t}.
B.1 General sampler
Updating s
We update si using a retrospective sampler (see Papaspiliopoulos and Roberts, 2008). Let ∆ = {θ, w, ε, τ}. This method proposes a new value of (si, ∆), which will be referred to as (s′i, ∆′), that is either accepted or rejected in a Metropolis-Hastings step. The proposal is made in the following way. Let θ′j = θj, ε′j = εj, w′j = wj and τ′j = τj for 1 ≤ j ≤ k−i, let α = max_{j≤k−i} {k(yi|θ′j)} and define

qj = p(si = j) k(yi|θ′j) / [ α(1 − ∑_{l=1}^{k−i} p(si = l)) + ∑_{l=1}^{k−i} p(si = l) k(yi|θ′l) ],  1 ≤ j ≤ k−i.

Simulate u ∼ U(0, 1). If u < ∑_{j=1}^{k−i} qj then find the m for which ∑_{j=1}^{m−1} qj < u < ∑_{j=1}^{m} qj. Otherwise, we simulate in the following way. Let

qj = α p(si = j),  j > k−i

and sequentially simulate ∆′_{k−i+1}, ∆′_{k−i+2}, . . . , ∆′_{m} until we meet the condition that u < ∑_{j=1}^{m} qj. We can simulate ∆′_j given ∆′_{j−1} using the relation τ′_j = τ′_{j−1} − νj, where νj ∼ Ex(λ), and by simulating θ′_j, ε′_j and w′_j from their priors. The new state (s′i, ∆′) is accepted with probability

1 if m ≤ k−i,  and min{1, k(yi|θ′m)/α} if m > k−i.
Updating ζ

If j = si, then the full conditional distribution of ζijk is given by

p(ζi,si,si = l) ∝ εsi,l (1 − wj,l) ∏_{h=l+1}^{mj(ti)} wj,h

p(ζi,si,k = l) = (1 − εsi,l)(1 − wj,l) ∏_{h=l+1}^{mj(ti)} wj,h,  k < si.

Otherwise ζijk is sampled from its prior distribution.
Updating ε
The full conditional distribution of εj,k is

Be( ak − ak−1 + ∑_{{i|si=j}} I(ζijj = k),  bk − bk−1 + ∑_{{i|si<j and τj<ti}} ∑_{p=si+1}^{ri} I(ζijp = k) ).
Updating w
The full conditional distribution of wj,l is Be(a*, b*), where

a* = al−1 + bl−1 + ∑_{i=1}^{n} ∑_{j=si}^{ri} ∑_{k=j}^{ri} I(ζijk + 1 ≤ l ≤ mj(ti))

and

b* = al + bl − al−1 − bl−1 + ∑_{i=1}^{n} ∑_{h=si}^{ri} I(h ≤ j ≤ ri) I(ζihj = l).
Updating τ
The point process τ can be updated using a reversible jump MCMC step. We have three possible moves: 1) add a point to the process, 2) delete a point from the process and 3) move a point. The first two moves are each proposed with probability qCHANGE (where qCHANGE < 0.5) and the third move is proposed with probability 1 − 2qCHANGE. The Add move proposes the addition of a point to the process by uniformly sampling τk+1 from (min{τi}, max{ti}), drawing θk+1 ∼ H and simulating the necessary extra ε's, w's and ζ's from their priors. To improve acceptance rates we also update some allocations s. A point, j*, is chosen uniformly at random from {1, . . . , k} and we propose new values s′i if si = j*; p′ is calculated using the proposal and p using the current state.
The Delete move proposes to remove a point of the process by uniformly selecting two distinct points: j1 from {1, 2, . . . , k}\{i|τi ≤ τj for all j} and j2 from {1, 2, . . . , k}. We propose to remove τj1, θj1 and the vectors wj1 and εj1. For each point τi < τj1, we propose new vectors ε′i and w′i by deleting the element εi,m, where m = #{j|τj > τi}, from εi and the element wi,m from wi. Finally, we set s′i = j2 if si = j1. The acceptance probability is zero if τj2 > ti for any i such that si = j1. Otherwise, the acceptance probability is

min{ 1, [λ(max(ti) − min(τi))/k] ∏_{{i|s′i≠si}} [k(yi|θj2)/k(yi|θj1)] ∏_{i=1}^{n} [p′(s′i)/p(si)] q* }

where

q* = ∏_{{i|si=j1 or si=j2}} (1/qi,1)^{I(s′i=j1)} (1/qi,2)^{I(s′i=j2)}.

The reverse proposals qi,1 and qi,2 are calculated as
The Move step uses a Metropolis-Hastings random walk proposal. A point is chosen at random from the set {1, 2, . . . , k}\{i|τi ≤ τj for all j}, say j*, and a new value τ′_{j*} = τ_{j*} + ε is proposed, where ε ∼ N(0, σ²_PROP). The move is rejected if τ′_{j*} < min(τj), τ′_{j*} > max(ti) or τ′_{j*} > ti for any i such that si = j*. Otherwise, the acceptance probability is

min{ 1, ∏_{i=1}^{n} p′(s′i)/p(si) }.
Updating λ
The parameter λ can be updated in the following way. Let τ(old) and λ(old) be the current values in the Markov chain. Simulate λ from the distribution proportional to

p(λ) λ^{#{i|τi>min{ti}}} exp{−λ(max(ti) − min(ti))}

and set τi = min(ti) − (λ(old)/λ)(min(ti) − τ(old)_i) if τ(old)_i < min(ti).
Updating θ
The parameter θj can be updated from its full conditional distribution, which is proportional to

h(θj) ∏_{{i|si=j}} k(yi|θj).
B.2 Poisson-Dirichlet process
In this case the general sampler can be simplified to a method that generalizes the computational approach described by Dunson et al. (2007) for a process where each break is formed by the product of two beta random variables. In our more general case we can write

Vj(t) = εj ∏_{h=2}^{mj(t)+1} wj,h.
We introduce latent variables rijk, which take values 0 or 1, where

p(rij1 = 1) = εj,  p(rijk = 1) = wj,k for k = 2, . . . , mj(ti) + 1,

and which are linked to the usual allocation variables si by the relationship

si = min{j | rijk = 1 for all 1 ≤ k ≤ mj(ti) + 1}.
Thus

p(si = j) = p(rijk = 1 for all 1 ≤ k ≤ mj(ti) + 1) ∏_{l<j} p(there exists k such that rilk = 0)

= p(rijk = 1 for all 1 ≤ k ≤ mj(ti) + 1) ∏_{l<j} (1 − p(rilk = 1 for all 1 ≤ k ≤ ml(ti) + 1))

= εj ∏_{i=2}^{mj(ti)+1} wj,i ∏_{l<j} ( 1 − εl ∏_{i=2}^{ml(ti)+1} wl,i ).
Conditional on r, the full conditional distributions of ε and w will be beta distributions and any hyperparameters of the stick-breaking process can be updated using standard techniques. Updating of the other parameters proceeds by marginalising over r but conditioning on s.
Updating s
We could update s using the method in Appendix B.1 but we find that this can run very slowly in the Poisson-Dirichlet case. This is because at each update of si we potentially simulate a proposed value s′i which is much bigger than max{si} (due to the slow decay of Poisson-Dirichlet processes) and generate very many values for w. This section describes an alternative approach which updates in two steps: 1) update si marginalising over any new ε and w vectors and 2) simulate the new ε and w vectors conditional on the new value of si. The algorithm is much more efficient since many proposed values of si are rejected at stage 1) and extensive simulation is avoided. We make the following changes to the algorithm
\[
q_j = \alpha\,\frac{1-b}{1+a+b(j-1)} \prod_{k_{-i} < l < j} \frac{a+bl}{1+a+b(l-1)} \prod_{l \le k_{-i}} (1 - V_l(t_i)), \qquad j > k_{-i}
\]
and sequentially simulate (θ′_{k_{-i}+1}, τ′_{k_{-i}+1}), (θ′_{k_{-i}+2}, τ′_{k_{-i}+2}), …, (θ′_m, τ′_m) in the same way as before until we meet the condition that u < Σ_{j=1}^{m} q_j. The new state (θ′, τ′, s′_i) is accepted with probability
\[
\begin{cases}
1 & \text{if } \max\{s_i\} \le k \\
k(y_i \mid \theta'_m)/\alpha & \text{if } \max\{s_i\} > k.
\end{cases}
\]
If the move is accepted, we simulate ε_j and w_j for j > k_{-i} in the following way. Simulate r_{ijk}, where k = #{l | τ_j < τ_l < t_i}, for k_{-i} < j ≤ s_i, and simulate ε_j and w_j using the method for updating ε and w.
Updating ε and w
We can generate r conditional on s using the following scheme. For the i-th observation, we can simulate r_{ij1}, r_{ij2}, …, r_{ijk}, where k = #{l | τ_j < τ_l < t_i}, sequentially. Initially,
\[
p(r_{ij1} = 1) = \varepsilon_j\,\frac{1 - \prod_{h=1}^{k} w_{j,h}}{1 - \varepsilon_j \prod_{h=1}^{k} w_{j,h}}.
\]
To simulate r_{ijl}: if r_{ijh} = 1 for all 1 ≤ h < l, then
\[
p(r_{ijl} = 1) = w_{j,l}\,\frac{1 - \prod_{h=l+1}^{k} w_{j,h}}{1 - \prod_{h=l}^{k} w_{j,h}}.
\]
Otherwise p(r_{ijl} = 1) = w_{j,l}. Finally, we set r_{i(k+1)1} = 1, …, r_{i(k+1)k} = 1. Then the full conditional distribution of ε_j is
\[
\text{Be}\!\left(1 - b + \sum_{\{i \mid 1 < \#\{\tau_l \mid \tau_j \le \tau_l < t_i\}\}} r_{ij1},\;\; a + b + \sum_{\{i \mid 1 < \#\{\tau_l \mid \tau_j \le \tau_l < t_i\}\}} (1 - r_{ij1})\right)
\]
and the full conditional distribution of w_{j,k} for k ≥ 1 is
\[
\text{Be}\!\left(1 + a + (k-2)b + \sum_{\{i \mid k < \#\{\tau_l \mid \tau_j \le \tau_l < t_i\}\}} r_{ijk},\;\; b + \sum_{\{i \mid k < \#\{\tau_l \mid \tau_j \le \tau_l < t_i\}\}} (1 - r_{ijk})\right).
\]
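The two steps above (simulating r given s, then drawing ε and w from their beta full conditionals) can be sketched as follows. In `simulate_r`, the indicators for a component j with s_i ≠ j are treated as Bernoulli draws conditioned on not all equalling one, with success probabilities p = (ε_j, w_{j,·}); the array/dict layout is our assumption, not the paper's:

```python
import numpy as np

def simulate_r(p, rng=None):
    """Simulate indicators r_1..r_K with success probabilities p, conditional
    on the event that not all of them equal one (which is what s_i != j
    encodes). While every earlier indicator is 1, the success probability is
    tilted by the chance that some later indicator fails; after the first
    zero, the remaining indicators are unconditional Bernoulli draws."""
    rng = rng or np.random.default_rng()
    p = np.asarray(p, dtype=float)
    K = len(p)
    tail = np.ones(K + 1)            # tail[l] = prod_{h >= l} p[h]
    for l in range(K - 1, -1, -1):
        tail[l] = p[l] * tail[l + 1]
    r = np.zeros(K, dtype=int)
    all_ones = True
    for l in range(K):
        prob = p[l] * (1.0 - tail[l + 1]) / (1.0 - tail[l]) if all_ones else p[l]
        r[l] = int(rng.random() < prob)
        all_ones = all_ones and r[l] == 1
    return r

def update_eps_w(r1, rk, a, b, rng=None):
    """Draw eps_j and w_{j,k} from the beta full conditionals above.
    r1 holds the r_{ij1} values over the relevant observations; rk maps each
    k >= 2 to the array of r_{ijk} values (our bookkeeping convention)."""
    rng = rng or np.random.default_rng()
    eps = rng.beta(1.0 - b + r1.sum(), a + b + (1.0 - r1).sum())
    w = {k: rng.beta(1.0 + a + (k - 2) * b + v.sum(), b + (1.0 - v).sum())
         for k, v in rk.items()}
    return eps, w
```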
Updating τ, θ and λ
These can be updated using the methods in Appendix B.1.
B.3 Dirichlet process - Marginal method
Updating s
We can update s_j conditional on s_1, …, s_{j−1}, s_{j+1}, …, s_n. We define A(t) to be the active set defined using the allocations s_1, …, s_{j−1}, s_{j+1}, …, s_n, T = {τ_{s_l}
The probability that τ*_i ∈ (φ_j, φ_{j+1}) is proportional to
\[
A_j \prod_{h=j+1}^{p} \frac{P'_h}{P_h} \prod_{\{i \mid \phi_j < \tau^\star_i \le \phi_p\}} \frac{Q'_i}{Q_i}, \qquad j \le K,
\]
and the probability that τ*_i < φ_1 is proportional to
\[
\frac{\Gamma(M+1)\,\eta_i}{\Gamma(\eta_i + M)} \prod_{h=1}^{p} \frac{P'_h}{P_h} \prod_{i=1}^{q} \frac{Q'_i}{Q_i}.
\]
This distribution is finite and discrete, and draws can be simply simulated. Conditional on the atom being allocated to the region (φ_{j−1}, φ_j), the simulated value is τ*_i = φ_j − x, where
\[
x \sim \text{TEx}_{(0,\,\phi_j - \phi_{j-1})}\!\left(\lambda(\phi_j - \phi_{j-1})\,\frac{M}{M + A(\phi_j)}\,\frac{\eta_i}{M + A(\phi_j) + \eta_i}\right),
\]
and if τ*_i < φ_1 then τ*_i = φ_1 − x where x ∼ Ex(λ/(M + 1)).
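Sampling from the truncated exponential TEx can be done by inverting its CDF; a minimal sketch, where `rate` stands for whatever the bracketed parameter above evaluates to and `L` is the truncation length:

```python
import numpy as np

def sample_trunc_exp(rate, L, rng=None):
    """Draw x from an Exponential(rate) distribution truncated to (0, L) by
    inverting the CDF F(x) = (1 - exp(-rate*x)) / (1 - exp(-rate*L))."""
    rng = rng or np.random.default_rng()
    u = rng.random()
    return -np.log(1.0 - u * (1.0 - np.exp(-rate * L))) / rate
```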
Updating λ and M
To update these parameters from their full conditional distributions, we first simulate the number of atoms, c_i, between (φ_i, φ_{i+1}) from a Poisson distribution with mean
\[
\frac{M\lambda}{M + A_n(\phi_{i+1})}(\phi_{i+1} - \phi_i)
\]
and V_1, V_2, …, V_{k_n} where V_i ∼ Be(1 + η_i, M + A_n(τ_i)). The full conditional distribution of λ is proportional to
\[
p(\lambda)\,\lambda^{k_n + \sum_{i=1}^{k_n} c_i} \exp\left\{-\lambda(\max\{t_i\} - \min\{\tau_i\})\right\}.
\]
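If p(λ) is an exponential density, as in the comparison of Appendix C, this full conditional is a gamma distribution; a minimal sketch (the function and argument names are ours):

```python
import numpy as np

def update_lambda(kn, c, t_max, tau_min, prior_rate=1.0, rng=None):
    """Gamma draw from the full conditional of lambda above, assuming an
    exponential(prior_rate) prior: the shape is 1 + kn + sum(c) and the rate
    is prior_rate + (max t_i - min tau_i)."""
    rng = rng or np.random.default_rng()
    shape = 1.0 + kn + np.sum(c)
    rate = prior_rate + (t_max - tau_min)
    return rng.gamma(shape, 1.0 / rate)  # numpy parameterises by scale
```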
The full conditional distribution of M is proportional to
\[
p(M)\,M^{k_n + \sum_{i=1}^{k_n} c_i} \prod_{i=1}^{k_n} \frac{M + A_n(\tau_i)}{M + A_n(\tau_i) + 1 + \eta_i} \prod_{i=1}^{k_n+1} \left(\frac{M + A_n(\phi_{i+1})}{1 + M + A_n(\phi_{i+1})}\right)^{c_i}.
\]
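This density is not of standard form in M, so a Metropolis step is one option. Below is a sketch using a random-walk proposal on log M under an exponential prior; both the proposal and the prior are our illustrative choices, not prescribed by the paper:

```python
import numpy as np

def log_target_M(M, kn, c, eta, A_tau, A_phi, prior_rate=1.0):
    """Unnormalized log full conditional of M from the expression above, with
    an exponential(prior_rate) prior (an illustrative assumption).
    A_tau[i] = A_n(tau_i); A_phi[i] = A_n(phi_{i+1}) for the region that
    holds c[i] latent atoms."""
    lp = -prior_rate * M + (kn + np.sum(c)) * np.log(M)
    lp += np.sum(np.log(M + A_tau) - np.log(M + A_tau + 1.0 + eta))
    lp += np.sum(c * (np.log(M + A_phi) - np.log(1.0 + M + A_phi)))
    return lp

def update_M(M, step, rng=None, **kw):
    """One random-walk Metropolis step on log M (our choice of proposal)."""
    rng = rng or np.random.default_rng()
    M_new = M * np.exp(step * rng.normal())
    # accept/reject, including the Jacobian log(M_new/M) of the log-scale walk
    log_acc = log_target_M(M_new, **kw) - log_target_M(M, **kw) + np.log(M_new / M)
    return M_new if np.log(rng.random()) < log_acc else M
```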
C Comparison of MCMC algorithms

We compare the marginal (see Section B.3) and the general conditional algorithms with Dirichlet process marginals by analysing three simulated data sets and looking at the behaviour of the chain for the two parameters λ and M. The integrated autocorrelation time is used as a measure of the mixing of the two chains, since an effective sample size can be estimated by the sample size divided by the integrated autocorrelation time (Liu, 2001). We introduce three simple, simulated datasets to compare performance over a range of possible data.
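A simple estimator of the integrated autocorrelation time truncates the sum of sample autocorrelations at the first non-positive term; this is one common convention, and the paper does not spell out its estimator. A sketch:

```python
import numpy as np

def integrated_autocorrelation_time(x, max_lag=None):
    """Estimate IAT = 1 + 2 * sum_k rho_k for a chain x, truncating the sum
    at the first non-positive sample autocorrelation. An effective sample
    size is then roughly len(x) / IAT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = np.dot(x, x) / n
    iat = 1.0
    for k in range(1, max_lag or n // 3):
        rho = np.dot(x[:-k], x[k:]) / (n * var)  # lag-k autocorrelation
        if rho <= 0.0:
            break
        iat += 2.0 * rho
    return iat
```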
In all cases, we make a single observation at each time point for t = 1, 2, …, 100. The data sets are simulated from the following models. The first model has a single change point at time 50
Table 2: The integrated autocorrelation times for M and λ using the two sampling schemes
The second model has a linear trend over time:
\[
p(y_i) = \text{N}\!\left(\frac{40(i-1)}{99} - 20,\; 1\right).
\]
The third model has a linear trend before time 40 and then follows a mixture of three regressions after time 40:
\[
p(y_i) =
\begin{cases}
\text{N}\!\left(\dfrac{40(i-1)}{99} - 20,\; 1\right) & \text{if } i < 40 \\[8pt]
\dfrac{3}{10}\,\text{N}\!\left(\dfrac{40(i-1)}{99} - 20,\; 1\right) + \dfrac{2}{5}\,\text{N}(-4, 1) + \dfrac{3}{10}\,\text{N}\!\left(12 - \dfrac{40(i-1)}{99},\; 1\right) & \text{if } i \ge 40.
\end{cases}
\]
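The second and third models can be simulated directly; a sketch for these two (the helper names are ours):

```python
import numpy as np

def simulate_dataset2(rng=None):
    """Model 2: y_i ~ N(40(i-1)/99 - 20, 1) for i = 1, ..., 100."""
    rng = rng or np.random.default_rng()
    i = np.arange(1, 101)
    return 40.0 * (i - 1) / 99.0 - 20.0 + rng.normal(size=100)

def simulate_dataset3(rng=None):
    """Model 3: the linear trend before time 40, then a 3/10, 2/5, 3/10
    mixture of the rising trend, N(-4, 1) and the falling trend."""
    rng = rng or np.random.default_rng()
    i = np.arange(1, 101)
    up = 40.0 * (i - 1) / 99.0 - 20.0
    y = up + rng.normal(size=100)                       # trend everywhere first
    late = i >= 40
    comp = rng.choice(3, size=100, p=[0.3, 0.4, 0.3])   # mixture component
    means = np.column_stack([up, np.full(100, -4.0),
                             12.0 - 40.0 * (i - 1) / 99.0])
    y[late] = means[late, comp[late]] + rng.normal(size=late.sum())
    return y
```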
These data sets are fitted by a mixture of normals model
\[
y_t \sim \text{N}(\mu_t, 1), \qquad \mu_t \sim G_t, \qquad G_t \sim \text{DPAR}(M, H; \lambda),
\]
where H(µ) = N(µ|0, 100). Table 2 shows the results for the three data sets, using exponential priors with unit mean for M and λ. The mixing of λ is much better using the marginal sampler for each dataset (particularly data sets 2 and 3). The mixing of M is similar for the first two datasets but better for dataset 3.