Gaussian Process Structural Equation Models with Latent Variables

Ricardo Silva
Department of Statistical Science, University College London

Robert B. Gramacy
Statistical Laboratory, University of Cambridge

[email protected]@statslab.cam.ac.uk

Summary
- A Bayesian approach for graphical models with measurement error
- Model: nonparametric DAG + linear measurement model
- Related literature: structural equation models (SEMs), errors-in-variables regression
- Applications: dimensionality reduction, density estimation, causal inference
- Evaluation: social sciences/marketing data, biological domain
- Approach: Gaussian process prior + MCMC; Bayesian pseudo-inputs model + space-filling priors
An Overview of Measurement Error Problems

Measurement Error Problems
[Diagram: Calorie intake → Weight]

Measurement Error Problems
[Diagram: Calorie intake (latent) → Weight, Reported calorie intake (observed); legend marks latent vs. observed nodes]
Error-in-variables Regression
[Diagram: Calorie intake (latent) → Weight, Reported calorie intake]
- Reported calorie intake = Calorie intake + error
- Weight = f(Calorie intake) + error
- Task: estimate the error and f(·)
- Error estimation can be treated separately
- Caveat emptor: outrageously hard in theory. If errors are Gaussian, the best (!) rate of convergence is O((1/log N)^2), N the sample size (Fan and Truong, 1993)
- Don't panic
Error in Response/Density Estimation
[Diagram: Calorie intake (latent) → Reported calorie intake; Weight (latent) → Reported weight]

Multiple Indicator Models
[Diagram: Calorie intake → Self-reported calorie intake, Assisted report of calorie intake; Weight → Weight recorded in the morning, Weight recorded in the evening]
Chains of Measurement Error
- Widely studied as Structural Equation Models (SEMs) with latent variables (Bollen, 1989)
[Diagram: a chain of latent variables (Calorie intake → Weight → Well-being), each with its own indicators (Reported calorie intake, Reported weight, Reported time to fall asleep)]

Quick Sidenote: Visualization
[Diagram: Industrialization Level 1960 → Democratization Level 1960 → Democratization Level 1965, with indicators such as GNP etc. and Fairness of elections etc.]
(Palomo et al., 2007)
Non-parametric SEM: Model and Inference

Traditional SEM
- Some assumptions:
  - assume a DAG structure
  - assume (for simplicity only) that no observed variable has children in the graph
- Linear functional relationships:
    X_i = λ_{i0} + X_{P(i)}^T B_i + ε_i
    Y_j = λ_{j0} + X_{P(j)}^T Λ_j + ε_j
- Parentless vertices ~ Gaussian
Notation corner: X denotes latent variables, Y observed ones
Our Nonparametric SEM: Likelihood
- Functional relationships:
    X_i = f_i(X_{P(i)}) + ε_i
    Y_j = λ_{j0} + X_{P(j)}^T Λ_j + ε_j
  where each f_i(·) belongs to some functional space
- Parentless latent variables follow a mixture of Gaussians; error terms are Gaussian:
    ε_i ~ N(0, v_i), ε_j ~ N(0, v_j)
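To make the likelihood concrete, here is a minimal generative sketch (Python rather than the paper's MATLAB; the toy graph X_1 → X_2, the nonlinearity, and all coefficient values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

# Structural model: X_i = f_i(X_{P(i)}) + eps_i, with eps_i ~ N(0, v_i).
# (A parentless latent follows a mixture of Gaussians; one component here.)
x1 = rng.normal(0.0, 1.0, N)
f2 = lambda x: np.sin(2.0 * x)                   # assumed f_2
x2 = f2(x1) + rng.normal(0.0, np.sqrt(0.05), N)

# Measurement model: Y_j = lambda_j0 + lambda_j1 * X_{P(j)} + eps_j (linear).
def indicators(x, lams, v):
    return np.stack([l0 + l1 * x + rng.normal(0.0, np.sqrt(v), N)
                     for (l0, l1) in lams], axis=1)

Y = np.hstack([indicators(x1, [(0.0, 1.0), (0.5, 0.8), (-0.5, 1.2)], 0.2),
               indicators(x2, [(0.0, 1.0), (1.0, 1.5), (0.0, 0.7)], 0.2)])
```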
Related Ideas
- GP networks (Friedman and Nachman, 2000): reduces to our likelihood when Y_i = X_i
- Gaussian process latent variable model (Lawrence, 2005)
- Module networks (Segal et al., 2005): shared non-linearities, e.g., Y_4 = λ_{40} + λ_{41} f(IL) + error, Y_5 = λ_{50} + λ_{51} f(IL) + error
- Dynamic models (e.g., Ko and Fox, 2009): functions between different data points, symmetry
Identifiability Conditions
- Given the observed marginal M(Y) and the DAG, are M(X), {Λ}, {v} unique?
- Relevant for causal inference and embedding:
  - Embedding: problematic MCMC for latent variable interpretation if unidentifiable
  - Causal effect estimation: not resolved from data
- Note: barring possible MCMC problems, identifiability is not essential for prediction
- Illustration:
    Y_j = X_1 + error, for j = 1, 2, 3
    Y_j = 2X_2 + error, for j = 4, 5, 6
    X_2 = 4X_1^2 + error
Identifiable Model: Walkthrough
[Assumed structure and posterior samples; in this model, the regression coefficients are fixed for Y_1 and Y_4]

Non-Identifiable Model: Walkthrough
[Assumed structure and posterior samples; nothing is fixed, and all Y freely depend on both X_1 and X_2]
The Identifiability Zoo
- Many roads to identifiability via different sets of assumptions
- We will ignore estimation issues in this discussion!
- One generic approach boils down to a reduction to multivariate deconvolution,
    Y = X + error,
  so that the density of X can be uniquely obtained from the (observable) density of Y and the (given) density of the error (Hazelton and Turlach, 2009)
- But we have to nail the measurement error identification problem first
Our Path in The Identifiability Zoo
- The assumption of three or more pure indicators:
  [Diagram: X_i → Y_{1i}, Y_{2i}, Y_{3i}]
- Scale, location and sign of X_i are arbitrary, so fix Y_{1i} = X_i + ε_{1i}
- It follows that the remaining linear coefficients in Y_{ji} = λ_{0ji} + λ_{1ji} X_i + ε_{ji} are identifiable, and so is the variance of each error term (Bollen, 1989)
Our Path in The Identifiability Zoo
- Select one pure indicator per latent variable to form the sets Y_1 = (Y_{11}, Y_{12}, ..., Y_{1L}) and E_1 = (ε_{11}, ε_{12}, ..., ε_{1L})
- From
    Y_1 = X + E_1
  obtain the density of X, since the Gaussian assumption for the error terms means the density of E_1 is known
- Notice: since the density of X is identifiable, identifiability of the directionality X_i → X_j vs. X_j → X_i is achievable in theory (Hoyer et al., 2008)
Quick Sidenote: Other Paths
- Three pure indicators per variable might not be reasonable
- Alternatives:
  - two pure indicators plus non-zero correlation between the latent variables
  - repeated measurements (e.g., Schennach, 2004):
      X* = X + error
      X** = X + error
      Y = f(X) + error
- Also related: results on detecting the presence of measurement error (Janzing et al., 2009)
- For more: Econometrica, etc.
Priors: Parametric Components
- Measurement model: standard linear regression priors, e.g., Gaussian prior for coefficients, inverse gamma for conditional variances
- Could use the standard normal-gamma priors so that the measurement model parameters Θ are marginalized, sampling with P(Y | X, f(X)) p(X, f(X)) instead of P(Y | X, f(X), Θ) p(X, f(X)) p(Θ)
- In the experiments we won't use such normal-gamma priors, though, because we want to evaluate mixing in general
Priors: Nonparametric Components
- Function f_i(X_{Pa(i)}): Gaussian process prior, i.e.,
    f_i(X_{Pa(i)}^(1)), f_i(X_{Pa(i)}^(2)), ..., f_i(X_{Pa(i)}^(N)) ~ jointly Gaussian with a particular kernel function
- Computational issues:
  - scales as O(N^3), N being the sample size
  - standard MCMC might converge poorly due to the high conditional association between the latent variables
The Pseudo-Inputs Model
- A hierarchical approach
- Recall: in a standard GP, from {X^(1), X^(2), ..., X^(N)} we obtain a distribution over {f(X^(1)), f(X^(2)), ..., f(X^(N))}
- Predictions of future observations f(X_*^(1)), f(X_*^(2)), ... are jointly conditionally Gaussian too
- Idea:
  - imagine you see a pseudo training set X̄
  - your actual training set {f(X^(1)), f(X^(2)), ..., f(X^(N))} is conditionally Gaussian given X̄
  - however, drop all off-diagonal elements of the conditional covariance matrix
(Snelson and Ghahramani, 2006; Banerjee et al., 2008)
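A minimal sketch of that diagonalized conditional, assuming a unit-variance squared exponential kernel (the function names here are illustrative, not the authors' code):

```python
import numpy as np

def sq_exp_kernel(A, B, l=1.0):
    # Squared exponential kernel exp(-|a - b|^2 / l), unit signal variance.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / l)

def pseudo_inputs_conditional(X, Xbar, fbar, l=1.0, jitter=1e-8):
    """Conditional of f(X) given pseudo-functions fbar at pseudo-inputs Xbar,
    with all off-diagonal elements of the conditional covariance dropped."""
    Kmm = sq_exp_kernel(Xbar, Xbar, l) + jitter * np.eye(len(Xbar))
    Knm = sq_exp_kernel(X, Xbar, l)
    A = np.linalg.solve(Kmm, Knm.T).T        # K_nm K_mm^{-1}
    mean = A @ fbar                          # E[f(X) | fbar]
    var = 1.0 - np.sum(A * Knm, axis=1)      # diag(K_nn - K_nm K_mm^{-1} K_mn)
    return mean, np.maximum(var, 0.0)
```

Dropping the off-diagonals is what reduces the O(N^3) cost while keeping an exact, if cruder, probabilistic model.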
The Pseudo-Inputs Model: SEM Context
[Diagram: graphical representations of the standard model and of the pseudo-inputs model]
Bayesian Pseudo-Inputs Treatment
- Snelson and Ghahramani (2006): empirical Bayes estimator for the pseudo-inputs
- Pseudo-inputs quickly amount to many more free parameters, sometimes prone to overfitting
- Here: a space-filling prior
  - let the pseudo-inputs X̄ have bounded support
  - set p(X̄_i) ∝ det(D), where D is some kernel matrix
  - a priori, this spreads the points in some hyper-cube
- No fitting: the pseudo-inputs are sampled too
- Essentially no (asymptotic) extra cost, since we have to sample the latent variables anyway
- Possible mixing problems?
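A minimal sketch of the unnormalized log-density of such a space-filling prior (the kernel choice and jitter are assumptions):

```python
import numpy as np

def sq_exp_kernel(A, B, l=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / l)

def log_space_filling_prior(Xbar, l=1.0):
    """Unnormalized log p(Xbar), with p(Xbar) proportional to det(D) for a
    kernel matrix D over the pseudo-inputs: det(D) grows as the points
    spread apart and vanishes as any two of them coincide."""
    D = sq_exp_kernel(Xbar, Xbar, l) + 1e-8 * np.eye(len(Xbar))
    _, logdet = np.linalg.slogdet(D)
    return logdet
```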
Demonstration
- Squared exponential kernel with hyperparameter l: k(x_i, x_j) = exp(-|x_i - x_j|^2 / l)
- 1-dimensional pseudo-input space, 2 pseudo-data points X̄^(1), X̄^(2)
- Fix X̄^(1) at zero, sample X̄^(2)
- NOT independent: it should differ from the uniform distribution to different degrees according to l
[Plots: sampled distribution of X̄^(2) for different values of l]
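This demonstration can be roughly reproduced with a short Metropolis-Hastings loop (using log_space_filling_prior from the sketch above; the [-1, 1] support and the uniform proposal are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
l, x2, trace = 0.5, 0.5, []
for _ in range(20000):
    prop = rng.uniform(-1.0, 1.0)            # uniform proposal on [-1, 1]
    cur_X = np.array([[0.0], [x2]])          # pseudo-input 1 fixed at zero
    prop_X = np.array([[0.0], [prop]])
    if np.log(rng.uniform()) < (log_space_filling_prior(prop_X, l)
                                - log_space_filling_prior(cur_X, l)):
        x2 = prop
    trace.append(x2)
# A histogram of `trace` is depleted around 0 (repelled from the fixed
# pseudo-input), with the deviation from uniformity growing with l.
```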
More on Priors and Pseudo-Points
- Having a prior mitigates overfitting:
  - it blurs the pseudo-inputs, which theoretically leads to a bigger coverage
  - if the number of pseudo-inputs is insufficient, it might provide some edge over models with fixed pseudo-inputs, but care should be exercised
- Example: synthetic data with a quadratic relationship
Predictive Samples
- Sampling 150 latent points from the predictive distribution, 2 fixed pseudo-inputs
  [Plot; average predictive log-likelihood: -4.28]
- Sampling 150 latent points from the predictive distribution, 2 fixed pseudo-inputs
  [Plot; average predictive log-likelihood: -4.47]
- Sampling 150 latent points from the predictive distribution, 2 free pseudo-inputs with priors
  [Plot; average predictive log-likelihood: -3.89]
- With 3 free pseudo-inputs
  [Plot; average predictive log-likelihood: -3.61]
MCMC Updates
- Metropolis-Hastings; low parent dimensionality (≤ 3 parents in our examples)
- Mostly standard. Main points:
  - It is possible to integrate away the pseudo-functions.
  - Sampling the function values {f_j(X^(1)), ..., f_j(X^(N))} is done in two stages (sketched below):
    1. sample the pseudo-functions for X̄_j conditioned on everything but the function values, using the conditional covariance of the pseudo-functions with the true functions marginalized
    2. then sample {f_j(X^(1)), ..., f_j(X^(N))} (all conditionally independent)
(N = number of training points, M = number of pseudo-points)
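A minimal sketch of this two-stage update, assuming a single Gaussian-noise child x_child ~ N(f_i(parents), v) (the real updates also involve the regression weights and any other factors touching f_i):

```python
import numpy as np

def two_stage_sample(x_child, Kmm, Knm, v, rng):
    """Kmm: M x M kernel over the pseudo-inputs; Knm: N x M cross-kernel
    between the parents' values and the pseudo-inputs."""
    A = np.linalg.solve(Kmm, Knm.T).T                 # K_nm K_mm^{-1}
    c = np.maximum(1.0 - np.sum(A * Knm, 1), 1e-12)   # conditional variances
    # Stage 1: sample the pseudo-functions with the true function values
    # marginalized out: x_child ~ N(A fbar, diag(c) + v I), fbar ~ N(0, Kmm).
    w = 1.0 / (c + v)
    Sigma = np.linalg.inv(np.linalg.inv(Kmm) + (A.T * w) @ A)
    fbar = rng.multivariate_normal(Sigma @ (A.T @ (w * x_child)), Sigma)
    # Stage 2: given fbar, each f^(n) is conditionally independent, combining
    # the prior N((A fbar)_n, c_n) with the likelihood x_n ~ N(f^(n), v).
    prec = 1.0 / c + 1.0 / v
    mean = ((A @ fbar) / c + x_child / v) / prec
    return mean + rng.standard_normal(len(x_child)) / np.sqrt(prec)
```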
MCMC Updates
- When sampling a pseudo-input variable X̄_{Pa(i)}^(d), the relevant factors are the pseudo-functions and the regression weights
- Metropolis-Hastings step: accept or reject by the ratio of these factors at the proposed vs. the current pseudo-input
- Warning: for a large number of pseudo-points, p(f̄_i^(d) | f̄_i^(\d), X̄) can be highly peaked
- Alternative: propose and sample f̄_i^(d)(·) jointly
MCMC Updates
- To calculate this ratio iteratively, fast submatrix updates of the relevant kernel matrices are necessary, giving O(NM) cost per pseudo-point, i.e., a total of O(NM^2)
Experiments

Setup
- Evaluation of Markov chain behaviour
- Objective model evaluation via predictive log-likelihood
- Quick details:
  - squared exponential kernel
  - prior for a (and b): mixture of Gamma(1, 20) + Gamma(20, 20) (see the sketch below)
  - M = 50
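Read as (shape, rate) pairs, that hyperparameter prior can be sketched as follows; the equal mixture weights are an assumption, since the slides do not state them:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_kernel_hyper(n):
    """Mixture of Gamma(1, 20) and Gamma(20, 20): mass both near zero
    and around 1. The 50/50 weighting is assumed for illustration."""
    shapes = np.where(rng.uniform(size=n) < 0.5, 1.0, 20.0)
    return rng.gamma(shapes, 1.0 / 20.0)   # numpy takes scale = 1 / rate
```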
Synthetic Example
- Our old friend:
    Y_j = X_1 + error, for j = 1, 2, 3
    Y_j = 2X_2 + error, for j = 4, 5, 6
    X_2 = 4X_1^2 + error
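A sketch of this generating process (the error scales are assumptions; Python rather than the paper's MATLAB):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
noise = lambda: rng.normal(0.0, 0.1, N)     # assumed error scale

x1 = rng.normal(0.0, 1.0, N)                # parentless latent
x2 = 4.0 * x1**2 + noise()                  # X_2 = 4 X_1^2 + error
Y = np.stack([x1 + noise() for _ in range(3)] +
             [2.0 * x2 + noise() for _ in range(3)], axis=1)   # Y_1..Y_6
```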
Synthetic Example
- Visualization: comparison against the GPLVM (Lawrence, 2005)
- GPLVM: nonparametric factor analysis, with independent Gaussian marginals for the latent variables
[Plots: latent embeddings recovered by each model]
MCMC Behaviour
- Example: consumer data (Bartholomew et al., 2008)
- Goal: identify the factors that affect willingness to pay more to consume environmentally friendly products
- 16 indicators of environmental beliefs and attitudes, measuring 4 hidden variables:
  - X1: pollution beliefs
  - X2: buying habits
  - X3: consumption habits
  - X4: willingness to spend more
- 333 datapoints
- Latent structure: X1 → X2, X1 → X3, X2 → X3, X3 → X4
MCMC Behaviour
[Trace plots for the consumer-data model]

MCMC Behaviour
[Trace plots: unidentifiable model vs. sparse Model 1.1]
Predictive Log-likelihood Experiment
- Goal: compare the predictive log-likelihood of the pseudo-input GPSEM against linear and quadratic polynomial models, the GPLVM, and a subsampled full GPSEM
- Dataset 1: Consumer data
- Dataset 2: Abalone (also found in UCI). Postulate two latent variables, Size and Weight. Size has as indicators the length, diameter and height of each abalone specimen, while Weight has as indicators the four weight variables. 3000+ points.
- Dataset 3: Housing (also found in UCI). Includes indicators about features of suburbs in Boston that are relevant for the housing market. 3 latent variables, ~400 points.
Abalone: Example
[Plots: model fit on the Abalone data]

Housing: Example
[Plots: model fit on the Housing data]
Results
[Table: average predictive log-likelihoods per model and dataset]
- The pseudo-input GPSEM is at least an order of magnitude faster than the full GPSEM model (which is infeasible on Housing). Even when subsampled to 300 points, the full GPSEM is still slower.

Predictive Samples
[Plots: predictive samples]
Conclusion and Future Work
- Even Metropolis-Hastings does a somewhat decent job (for sparse models)
- Potential problems with ordinal/discrete data
- Evaluation of high-dimensional models
- Structure learning
- Hierarchical models
- Comparisons against random projection approximations; mixtures of Gaussian processes with limited mixture size
- Full MATLAB code available

Acknowledgements
Thanks to Patrik Hoyer, Ed Snelson and Irini Moustaki.
Extra References (not in the paper)
- S. Banerjee, A. Gelfand, A. Finley and H. Sang (2008). Gaussian predictive process models for large spatial data sets. JRSS B.
- D. Janzing, J. Peters, J. M. Mooij and B. Schölkopf (2009). Identifying confounders using additive noise models. UAI.
- M. Hazelton and B. Turlach (2009). Nonparametric density deconvolution by weighted kernel estimators. Statistics and Computing.
- S. Schennach (2004). Estimation of nonlinear models with measurement error. Econometrica 72.