Gaussian Process Structural Equation Models with Latent Variables

Ricardo Silva
Department of Statistical Science, University College London

Robert B. Gramacy
Statistical Laboratory, University of Cambridge

[email protected]@statslab.cam.ac.uk

Summary
- A Bayesian approach for graphical models with measurement error
- Model: nonparametric DAG + linear measurement model
- Related literature: structural equation models (SEMs), errors-in-variables regression
- Applications: dimensionality reduction, density estimation, causal inference
- Evaluation: social sciences/marketing data, biological domain
- Approach: Gaussian process prior + MCMC; Bayesian pseudo-inputs model + space-filling priors
An Overview of Measurement Error Problems

Measurement Error Problems
[Diagram: Calorie intake → Weight]

Measurement Error Problems
[Diagram: Calorie intake (latent) → Weight, Reported calorie intake (observed); legend marks latent vs. observed nodes]
Error-in-variables Regression
[Diagram: Calorie intake (latent) → Weight, Reported calorie intake]
- Reported calorie intake = Calorie intake + error
- Weight = f(Calorie intake) + error
- Task: estimate the error and f(·)
- Error estimation can be treated separately
- Caveat emptor: outrageously hard in theory. If errors are Gaussian, the best (!) rate of convergence is O((1/log N)^2), N the sample size (Fan and Truong, 1993)
- Don't panic
Error in Response/Density Estimation
[Diagram: Calorie intake (latent) → Reported calorie intake; Weight (latent) → Reported weight]

Multiple Indicator Models
[Diagram: Calorie intake → Self-reported calorie intake, Assisted report of calorie intake; Weight → Weight recorded in the morning, Weight recorded in the evening]
Chains of Measurement Error
- Widely studied as Structural Equation Models (SEMs) with latent variables (Bollen, 1989)
[Diagram: a chain of latent variables (Calorie intake → Weight → Well-being), each with its own indicators (Reported calorie intake, Reported weight, Reported time to fall asleep)]

Quick Sidenote: Visualization
[Diagram: Industrialization Level 1960 → Democratization Level 1960 → Democratization Level 1965, with indicators such as GNP etc. and Fairness of elections etc.]
(Palomo et al., 2007)
Non-parametric SEM: Model and Inference

Traditional SEM
- Some assumptions:
  - assume a DAG structure
  - assume (for simplicity only) that no observed variable has children in the graph
- Linear functional relationships:
    X_i = λ_{i0} + X_{P(i)}^T B_i + ε_i
    Y_j = λ_{j0} + X_{P(j)}^T Λ_j + ε_j
- Parentless vertices ~ Gaussian
Notation corner: X denotes latent variables, Y observed ones
Our Nonparametric SEM: Likelihood
- Functional relationships:
    X_i = f_i(X_{P(i)}) + ε_i
    Y_j = λ_{j0} + X_{P(j)}^T Λ_j + ε_j
  where each f_i(·) belongs to some functional space
- Parentless latent variables follow a mixture of Gaussians; error terms are Gaussian:
    ε_i ~ N(0, v_i), ε_j ~ N(0, v_j)
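To make the likelihood concrete, here is a minimal generative sketch (Python rather than the paper's MATLAB; the toy graph X_1 → X_2, the nonlinearity, and all coefficient values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Structural model: X_i = f_i(X_{P(i)}) + eps_i, with eps_i ~ N(0, v_i).
# (A parentless latent follows a mixture of Gaussians; one component here.)
x1 = rng.normal(0.0, 1.0, N)
f2 = lambda x: np.sin(2.0 * x)                   # assumed f_2
x2 = f2(x1) + rng.normal(0.0, np.sqrt(0.05), N)

# Measurement model: Y_j = lambda_j0 + lambda_j1 * X_{P(j)} + eps_j (linear).
def indicators(x, lams, v):
    return np.stack([l0 + l1 * x + rng.normal(0.0, np.sqrt(v), N)
                     for (l0, l1) in lams], axis=1)

Y = np.hstack([indicators(x1, [(0.0, 1.0), (0.5, 0.8), (-0.5, 1.2)], 0.2),
               indicators(x2, [(0.0, 1.0), (1.0, 1.5), (0.0, 0.7)], 0.2)])
```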
Related Ideas
- GP networks (Friedman and Nachman, 2000): reduces to our likelihood when Y_i = X_i
- Gaussian process latent variable model (Lawrence, 2005)
- Module networks (Segal et al., 2005): shared non-linearities, e.g., Y_4 = λ_{40} + λ_{41} f(IL) + error, Y_5 = λ_{50} + λ_{51} f(IL) + error
- Dynamic models (e.g., Ko and Fox, 2009): functions between different data points, symmetry
Identifiability Conditions
- Given the observed marginal M(Y) and the DAG, are M(X), {Λ}, {v} unique?
- Relevant for causal inference and embedding:
  - Embedding: problematic MCMC for latent variable interpretation if unidentifiable
  - Causal effect estimation: not resolved from data
- Note: barring possible MCMC problems, identifiability is not essential for prediction
- Illustration:
    Y_j = X_1 + error, for j = 1, 2, 3
    Y_j = 2X_2 + error, for j = 4, 5, 6
    X_2 = 4X_1^2 + error
Identifiable Model: Walkthrough
[Assumed structure and posterior samples; in this model, the regression coefficients are fixed for Y_1 and Y_4]

Non-Identifiable Model: Walkthrough
[Assumed structure and posterior samples; nothing is fixed, and all Y freely depend on both X_1 and X_2]
The Identifiability Zoo
- Many roads to identifiability via different sets of assumptions
- We will ignore estimation issues in this discussion!
- One generic approach boils down to a reduction to multivariate deconvolution,
    Y = X + error,
  so that the density of X can be uniquely obtained from the (observable) density of Y and the (given) density of the error (Hazelton and Turlach, 2009)
- But we have to nail the measurement error identification problem first
Our Path in The Identifiability Zoo
- The assumption of three or more pure indicators:
  [Diagram: X_i → Y_{1i}, Y_{2i}, Y_{3i}]
- Scale, location and sign of X_i are arbitrary, so fix Y_{1i} = X_i + ε_{1i}
- It follows that the remaining linear coefficients in Y_{ji} = λ_{0ji} + λ_{1ji} X_i + ε_{ji} are identifiable, and so is the variance of each error term (Bollen, 1989)
Our Path in The Identifiability Zoo
- Select one pure indicator per latent variable to form the sets Y_1 = (Y_{11}, Y_{12}, ..., Y_{1L}) and E_1 = (ε_{11}, ε_{12}, ..., ε_{1L})
- From
    Y_1 = X + E_1
  obtain the density of X, since the Gaussian assumption for the error terms means the density of E_1 is known
- Notice: since the density of X is identifiable, identifiability of the directionality X_i → X_j vs. X_j → X_i is achievable in theory (Hoyer et al., 2008)
Quick Sidenote: Other Paths
- Three pure indicators per variable might not be reasonable
- Alternatives:
  - two pure indicators plus non-zero correlation between the latent variables
  - repeated measurements (e.g., Schennach, 2004):
      X* = X + error
      X** = X + error
      Y = f(X) + error
- Also related: results on detecting the presence of measurement error (Janzing et al., 2009)
- For more: Econometrica, etc.
Priors: Parametric Components
- Measurement model: standard linear regression priors, e.g., Gaussian prior for coefficients, inverse gamma for conditional variances
- Could use the standard normal-gamma priors so that the measurement model parameters Θ are marginalized, sampling with P(Y | X, f(X)) p(X, f(X)) instead of P(Y | X, f(X), Θ) p(X, f(X)) p(Θ)
- In the experiments we won't use such normal-gamma priors, though, because we want to evaluate mixing in general
Priors: Nonparametric Components
- Function f_i(X_{Pa(i)}): Gaussian process prior, i.e.,
    f_i(X_{Pa(i)}^(1)), f_i(X_{Pa(i)}^(2)), ..., f_i(X_{Pa(i)}^(N)) ~ jointly Gaussian with a particular kernel function
- Computational issues:
  - scales as O(N^3), N being the sample size
  - standard MCMC might converge poorly due to the high conditional association between the latent variables
The Pseudo-Inputs Model
- A hierarchical approach
- Recall: in a standard GP, from {X^(1), X^(2), ..., X^(N)} we obtain a distribution over {f(X^(1)), f(X^(2)), ..., f(X^(N))}
- Predictions of future observations f(X_*^(1)), f(X_*^(2)), ... are jointly conditionally Gaussian too
- Idea:
  - imagine you see a pseudo training set X̄
  - your actual training set {f(X^(1)), f(X^(2)), ..., f(X^(N))} is conditionally Gaussian given X̄
  - however, drop all off-diagonal elements of the conditional covariance matrix
(Snelson and Ghahramani, 2006; Banerjee et al., 2008)
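A minimal sketch of that diagonalized conditional, assuming a unit-variance squared exponential kernel (the function names here are illustrative, not the authors' code):

```python
import numpy as np

def sq_exp_kernel(A, B, l=1.0):
    # Squared exponential kernel exp(-|a - b|^2 / l), unit signal variance.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / l)

def pseudo_inputs_conditional(X, Xbar, fbar, l=1.0, jitter=1e-8):
    """Conditional of f(X) given pseudo-functions fbar at pseudo-inputs Xbar,
    with all off-diagonal elements of the conditional covariance dropped."""
    Kmm = sq_exp_kernel(Xbar, Xbar, l) + jitter * np.eye(len(Xbar))
    Knm = sq_exp_kernel(X, Xbar, l)
    A = np.linalg.solve(Kmm, Knm.T).T        # K_nm K_mm^{-1}
    mean = A @ fbar                          # E[f(X) | fbar]
    var = 1.0 - np.sum(A * Knm, axis=1)      # diag(K_nn - K_nm K_mm^{-1} K_mn)
    return mean, np.maximum(var, 0.0)
```

Dropping the off-diagonals is what reduces the O(N^3) cost while keeping an exact, if cruder, probabilistic model.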
The Pseudo-Inputs Model: SEM Context
[Diagram: graphical representations of the standard model and of the pseudo-inputs model]
Bayesian Pseudo-Inputs Treatment
- Snelson and Ghahramani (2006): empirical Bayes estimator for the pseudo-inputs
- Pseudo-inputs quickly amount to many more free parameters, sometimes prone to overfitting
- Here: a space-filling prior
  - let the pseudo-inputs X̄ have bounded support
  - set p(X̄_i) ∝ det(D), where D is some kernel matrix
  - a priori, this spreads the points in some hyper-cube
- No fitting: the pseudo-inputs are sampled too
- Essentially no (asymptotic) extra cost, since we have to sample the latent variables anyway
- Possible mixing problems?
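A minimal sketch of the unnormalized log-density of such a space-filling prior (the kernel choice and jitter are assumptions):

```python
import numpy as np

def sq_exp_kernel(A, B, l=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / l)

def log_space_filling_prior(Xbar, l=1.0):
    """Unnormalized log p(Xbar), with p(Xbar) proportional to det(D) for a
    kernel matrix D over the pseudo-inputs: det(D) grows as the points
    spread apart and vanishes as any two of them coincide."""
    D = sq_exp_kernel(Xbar, Xbar, l) + 1e-8 * np.eye(len(Xbar))
    _, logdet = np.linalg.slogdet(D)
    return logdet
```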
Demonstration
- Squared exponential kernel with hyperparameter l: k(x_i, x_j) = exp(-|x_i - x_j|^2 / l)
- 1-dimensional pseudo-input space, 2 pseudo-data points X̄^(1), X̄^(2)
- Fix X̄^(1) at zero, sample X̄^(2)
- NOT independent: it should differ from the uniform distribution to different degrees according to l
[Plots: sampled distribution of X̄^(2) for different values of l]
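This demonstration can be roughly reproduced with a short Metropolis-Hastings loop (using log_space_filling_prior from the sketch above; the [-1, 1] support and the uniform proposal are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
l, x2, trace = 0.5, 0.5, []
for _ in range(20000):
    prop = rng.uniform(-1.0, 1.0)            # uniform proposal on [-1, 1]
    cur_X = np.array([[0.0], [x2]])          # pseudo-input 1 fixed at zero
    prop_X = np.array([[0.0], [prop]])
    if np.log(rng.uniform()) < (log_space_filling_prior(prop_X, l)
                                - log_space_filling_prior(cur_X, l)):
        x2 = prop
    trace.append(x2)
# A histogram of `trace` is depleted around 0 (repelled from the fixed
# pseudo-input), with the deviation from uniformity growing with l.
```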
More on Priors and Pseudo-Points
- Having a prior mitigates overfitting:
  - it blurs the pseudo-inputs, which theoretically leads to a bigger coverage
  - if the number of pseudo-inputs is insufficient, it might provide some edge over models with fixed pseudo-inputs, but care should be exercised
- Example: synthetic data with a quadratic relationship
Predictive Samples
- Sampling 150 latent points from the predictive distribution, 2 fixed pseudo-inputs
  [Plot; average predictive log-likelihood: -4.28]
- Sampling 150 latent points from the predictive distribution, 2 fixed pseudo-inputs
  [Plot; average predictive log-likelihood: -4.47]
- Sampling 150 latent points from the predictive distribution, 2 free pseudo-inputs with priors
  [Plot; average predictive log-likelihood: -3.89]
- With 3 free pseudo-inputs
  [Plot; average predictive log-likelihood: -3.61]
MCMC Updates
- Metropolis-Hastings; low parent dimensionality (≤ 3 parents in our examples)
- Mostly standard. Main points:
  - It is possible to integrate away the pseudo-functions.
  - Sampling the function values {f_j(X^(1)), ..., f_j(X^(N))} is done in two stages (sketched below):
    1. sample the pseudo-functions for X̄_j conditioned on everything but the function values, using the conditional covariance of the pseudo-functions with the true functions marginalized
    2. then sample {f_j(X^(1)), ..., f_j(X^(N))} (all conditionally independent)
(N = number of training points, M = number of pseudo-points)
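A minimal sketch of this two-stage update, assuming a single Gaussian-noise child x_child ~ N(f_i(parents), v) (the real updates also involve the regression weights and any other factors touching f_i):

```python
import numpy as np

def two_stage_sample(x_child, Kmm, Knm, v, rng):
    """Kmm: M x M kernel over the pseudo-inputs; Knm: N x M cross-kernel
    between the parents' values and the pseudo-inputs."""
    A = np.linalg.solve(Kmm, Knm.T).T                 # K_nm K_mm^{-1}
    c = np.maximum(1.0 - np.sum(A * Knm, 1), 1e-12)   # conditional variances
    # Stage 1: sample the pseudo-functions with the true function values
    # marginalized out: x_child ~ N(A fbar, diag(c) + v I), fbar ~ N(0, Kmm).
    w = 1.0 / (c + v)
    Sigma = np.linalg.inv(np.linalg.inv(Kmm) + (A.T * w) @ A)
    fbar = rng.multivariate_normal(Sigma @ (A.T @ (w * x_child)), Sigma)
    # Stage 2: given fbar, each f^(n) is conditionally independent, combining
    # the prior N((A fbar)_n, c_n) with the likelihood x_n ~ N(f^(n), v).
    prec = 1.0 / c + 1.0 / v
    mean = ((A @ fbar) / c + x_child / v) / prec
    return mean + rng.standard_normal(len(x_child)) / np.sqrt(prec)
```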
MCMC Updates
- When sampling a pseudo-input variable X̄_{Pa(i)}^(d), the relevant factors are the pseudo-functions and the regression weights
- Metropolis-Hastings step: accept or reject by the ratio of these factors at the proposed vs. the current pseudo-input
- Warning: for a large number of pseudo-points, p(f̄_i^(d) | f̄_i^(\d), X̄) can be highly peaked
- Alternative: propose and sample f̄_i^(d)(·) jointly
MCMC Updates
- To calculate this ratio iteratively, fast submatrix updates of the relevant kernel matrices are necessary, giving O(NM) cost per pseudo-point, i.e., a total of O(NM^2)
Experiments

Setup
- Evaluation of Markov chain behaviour
- Objective model evaluation via predictive log-likelihood
- Quick details:
  - squared exponential kernel
  - prior for a (and b): mixture of Gamma(1, 20) + Gamma(20, 20) (see the sketch below)
  - M = 50
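Read as (shape, rate) pairs, that hyperparameter prior can be sketched as follows; the equal mixture weights are an assumption, since the slides do not state them:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_kernel_hyper(n):
    """Mixture of Gamma(1, 20) and Gamma(20, 20): mass both near zero
    and around 1. The 50/50 weighting is assumed for illustration."""
    shapes = np.where(rng.uniform(size=n) < 0.5, 1.0, 20.0)
    return rng.gamma(shapes, 1.0 / 20.0)   # numpy takes scale = 1 / rate
```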
Synthetic Example
- Our old friend:
    Y_j = X_1 + error, for j = 1, 2, 3
    Y_j = 2X_2 + error, for j = 4, 5, 6
    X_2 = 4X_1^2 + error
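A sketch of this generating process (the error scales are assumptions; Python rather than the paper's MATLAB):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
noise = lambda: rng.normal(0.0, 0.1, N)     # assumed error scale

x1 = rng.normal(0.0, 1.0, N)                # parentless latent
x2 = 4.0 * x1**2 + noise()                  # X_2 = 4 X_1^2 + error
Y = np.stack([x1 + noise() for _ in range(3)] +
             [2.0 * x2 + noise() for _ in range(3)], axis=1)   # Y_1..Y_6
```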
Synthetic Example
- Visualization: comparison against the GPLVM (Lawrence, 2005)
- GPLVM: nonparametric factor analysis, with independent Gaussian marginals for the latent variables
[Plots: latent embeddings recovered by each model]
MCMC Behaviour
- Example: consumer data (Bartholomew et al., 2008)
- Goal: identify the factors that affect willingness to pay more to consume environmentally friendly products
- 16 indicators of environmental beliefs and attitudes, measuring 4 hidden variables:
  - X1: pollution beliefs
  - X2: buying habits
  - X3: consumption habits
  - X4: willingness to spend more
- 333 datapoints
- Latent structure: X1 → X2, X1 → X3, X2 → X3, X3 → X4
MCMC Behaviour
[Trace plots for the consumer-data model]

MCMC Behaviour
[Trace plots: unidentifiable model vs. sparse Model 1.1]
Predictive Log-likelihood Experiment
- Goal: compare the predictive log-likelihood of the pseudo-input GPSEM against linear and quadratic polynomial models, the GPLVM, and a subsampled full GPSEM
- Dataset 1: Consumer data
- Dataset 2: Abalone (also found in UCI). Postulate two latent variables, Size and Weight. Size has as indicators the length, diameter and height of each abalone specimen, while Weight has as indicators the four weight variables. 3000+ points.
- Dataset 3: Housing (also found in UCI). Includes indicators about features of suburbs in Boston that are relevant for the housing market. 3 latent variables, ~400 points.
Abalone: Example
[Plots: model fit on the Abalone data]

Housing: Example
[Plots: model fit on the Housing data]
Results
[Table: average predictive log-likelihoods per model and dataset]
- The pseudo-input GPSEM is at least an order of magnitude faster than the full GPSEM model (which is infeasible on Housing). Even when subsampled to 300 points, the full GPSEM is still slower.

Predictive Samples
[Plots: predictive samples]
Conclusion and Future Work
- Even Metropolis-Hastings does a somewhat decent job (for sparse models)
- Potential problems with ordinal/discrete data
- Evaluation of high-dimensional models
- Structure learning
- Hierarchical models
- Comparisons against random projection approximations; mixtures of Gaussian processes with limited mixture size
- Full MATLAB code available

Acknowledgements
Thanks to Patrik Hoyer, Ed Snelson and Irini Moustaki.
Extra References (not in the paper)
- S. Banerjee, A. Gelfand, A. Finley and H. Sang (2008). Gaussian predictive process models for large spatial data sets. JRSS B.
- D. Janzing, J. Peters, J. M. Mooij and B. Schölkopf (2009). Identifying confounders using additive noise models. UAI.
- M. Hazelton and B. Turlach (2009). Nonparametric density deconvolution by weighted kernel estimators. Statistics and Computing.
- S. Schennach (2004). Estimation of nonlinear models with measurement error. Econometrica 72.