Heterogeneous Multi-output Gaussian Process Prediction

Pablo Moreno-Muñoz (1), Antonio Artés-Rodríguez (1), Mauricio A. Álvarez (2)
(1) Universidad Carlos III de Madrid, Spain; (2) University of Sheffield, UK
{pmoreno, antonio}@tsc.uc3m.es, mauricio.alvarez@sheffield.ac.uk

Introduction

A novel extension of multi-output Gaussian processes (MOGPs) for handling heterogeneous outputs (binary, real, categorical, ...). Each output has its own likelihood distribution, and we use a MOGP prior to jointly model the parameters of all likelihoods as latent functions. We obtain tractable variational bounds amenable to stochastic variational inference (SVI).

Multi-output GPs

We use a linear model of coregionalisation (LMC) type of covariance function to express correlations between the latent parameter functions (LPFs) $f_{d,j}(\mathbf{x})$. Each LPF is a linear combination of independent latent functions $\mathcal{U} = \{u_q(\mathbf{x})\}_{q=1}^{Q}$, where each $u_q(\mathbf{x})$ is assumed to be drawn from a GP prior, $u_q(\cdot) \sim \mathcal{GP}(0, k_q(\cdot, \cdot))$, with $k_q$ any valid covariance function:

$$f_{d,j}(\mathbf{x}) = \sum_{q=1}^{Q} \sum_{i=1}^{R_q} a^{i}_{d,j,q} \, u^{i}_{q}(\mathbf{x}).$$

We assume $R_q = 1$, meaning that the coregionalisation matrices are rank-one. In the literature, this model is known as the semiparametric latent factor model (Teh et al., 2005).

Heterogeneous Likelihood Model

Consider a set of output functions $\mathcal{Y} = \{y_d(\mathbf{x})\}_{d=1}^{D}$, with $\mathbf{x} \in \mathbb{R}^{p}$, that we want to jointly model using GPs. Let $\mathbf{y}(\mathbf{x}) = [y_1(\mathbf{x}), y_2(\mathbf{x}), \dots, y_D(\mathbf{x})]^{\top}$ be a vector-valued function. If the outputs are conditionally independent given the vector of parameters $\boldsymbol{\theta}(\mathbf{x}) = [\theta_1(\mathbf{x}), \theta_2(\mathbf{x}), \dots, \theta_D(\mathbf{x})]^{\top}$, we may define

$$p(\mathbf{y}(\mathbf{x}) \,|\, \boldsymbol{\theta}(\mathbf{x})) = p(\mathbf{y}(\mathbf{x}) \,|\, \mathbf{f}(\mathbf{x})) = \prod_{d=1}^{D} p(y_d(\mathbf{x}) \,|\, \theta_d(\mathbf{x})) = \prod_{d=1}^{D} p(y_d(\mathbf{x}) \,|\, \widetilde{\mathbf{f}}_d(\mathbf{x})),$$

where $\widetilde{\mathbf{f}}_d(\mathbf{x}) = [f_{d,1}(\mathbf{x}), \dots, f_{d,J_d}(\mathbf{x})]^{\top} \in \mathbb{R}^{J_d \times 1}$ is the set of LPFs that specifies the parameters in $\theta_d(\mathbf{x})$, for an arbitrary number $D$ of likelihood functions.

Examples of likelihoods and their linked parameters:

Likelihood      Linked parameters
Gaussian        $\mu(\mathbf{x}) = f$ ($\sigma$ fixed)
Het. Gaussian   $\mu(\mathbf{x}) = f_1$, $\sigma(\mathbf{x}) = \exp(f_2)$
Bernoulli       $\rho(\mathbf{x}) = \exp(f) / (1 + \exp(f))$
Categorical     $\rho_k(\mathbf{x}) = \exp(f_k) / (1 + \sum_{k'=1}^{K-1} \exp(f_{k'}))$
Poisson         $\lambda(\mathbf{x}) = \exp(f)$
Gamma           $a(\mathbf{x}) = \exp(f_1)$, $b(\mathbf{x}) = \exp(f_2)$

Variational Bounds

Sparse approximations in MOGPs: We define the set of $M$ inducing variables per latent function $u_q(\mathbf{x})$ as $\mathbf{u}_q = [u_q(\mathbf{z}_1), \dots, u_q(\mathbf{z}_M)]^{\top}$, evaluated at a set of inducing inputs $\mathbf{Z} = \{\mathbf{z}_m\}_{m=1}^{M} \in \mathbb{R}^{M \times p}$. We also define $\mathbf{u} = [\mathbf{u}_1^{\top}, \dots, \mathbf{u}_Q^{\top}]^{\top} \in \mathbb{R}^{QM \times 1}$. We approximate the posterior $p(\mathbf{f}, \mathbf{u} \,|\, \mathbf{y}, \mathbf{X})$ as follows:

$$p(\mathbf{f}, \mathbf{u} \,|\, \mathbf{y}, \mathbf{X}) \approx q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \,|\, \mathbf{u}) q(\mathbf{u}) = \prod_{d=1}^{D} \prod_{j=1}^{J_d} p(\mathbf{f}_{d,j} \,|\, \mathbf{u}) \prod_{q=1}^{Q} q(\mathbf{u}_q).$$

Variational inference: Exact posterior inference is intractable in our model due to the presence of an arbitrary number of non-Gaussian likelihoods. We use variational inference to compute a lower bound $\mathcal{L}$ on the marginal log-likelihood $\log p(\mathbf{y})$ and to approximate the posterior distribution $p(\mathbf{f}, \mathbf{u} \,|\, \mathcal{D})$:

$$\mathcal{L} = \sum_{d=1}^{D} \sum_{n=1}^{N} \mathbb{E}_{q(\widetilde{\mathbf{f}}_d)}\!\left[\log p(y_d(\mathbf{x}_n) \,|\, \widetilde{\mathbf{f}}_d)\right] - \sum_{q=1}^{Q} \mathrm{KL}\!\left(q(\mathbf{u}_q) \,\|\, p(\mathbf{u}_q)\right).$$
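To make the LMC construction concrete, the following NumPy sketch draws $Q$ independent latent functions from GP priors with a squared-exponential kernel and mixes them with rank-one coregionalisation weights to obtain LPFs. This is a minimal illustration of the $R_q = 1$ case, not code from the HetMOGP repository; the kernel choice, input grid, and all names (rbf, A, U, F) are assumptions for illustration.

```python
import numpy as np

def rbf(X1, X2, lengthscale=0.1, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
Q = 3                        # number of independent latent functions u_q
J = 4                        # total number of LPFs f_{d,j} across all outputs
X = np.linspace(0, 1, 200)   # input grid

# Draw each u_q ~ GP(0, k_q) at the inputs X (jitter for numerical stability).
U = np.stack([rng.multivariate_normal(np.zeros(len(X)),
                                      rbf(X, X) + 1e-8 * np.eye(len(X)))
              for _ in range(Q)])            # shape (Q, N)

# Rank-one coregionalisation weights a_{d,j,q}: each LPF mixes all u_q linearly.
A = rng.normal(size=(J, Q))                  # shape (J, Q)
F = A @ U                                    # f_{d,j}(x) = sum_q a_{d,j,q} u_q(x)
```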
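The link functions in the likelihood table turn LPFs into valid likelihood parameters, and conditional independence across outputs makes the joint log-likelihood a sum. Below is a hedged sketch of how a few of these mappings could be written with SciPy; the function names and the convention that f stacks a likelihood's LPFs row-wise are illustrative, not the repository's API.

```python
import numpy as np
from scipy import stats

def bernoulli_logpdf(y, f):
    """Bernoulli with logistic link: rho(x) = exp(f) / (1 + exp(f))."""
    rho = 1.0 / (1.0 + np.exp(-f[0]))
    return stats.bernoulli.logpmf(y, rho)

def het_gaussian_logpdf(y, f):
    """Heteroscedastic Gaussian: mu(x) = f1, sigma(x) = exp(f2)."""
    return stats.norm.logpdf(y, loc=f[0], scale=np.exp(f[1]))

def poisson_logpdf(y, f):
    """Poisson with rate lambda(x) = exp(f)."""
    return stats.poisson.logpmf(y, np.exp(f[0]))

def joint_loglik(likelihoods, ys, fs):
    """Sum of per-output log-likelihoods under conditional independence.

    ys[d] holds the observations of output d; fs[d] stacks its LPFs
    row-wise, shape (num_lpfs, num_points).
    """
    return sum(lik(y, f).sum() for lik, y, f in zip(likelihoods, ys, fs))
```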
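Because $\mathcal{L}$ sums over data points while the KL terms do not depend on the data, the bound admits unbiased minibatch estimates, which is what makes SVI applicable. The sketch below shows that structure, assuming Gaussian $q(\mathbf{u}_q) = \mathcal{N}(\mathbf{m}_q, \mathbf{S}_q)$; the Monte Carlo draws f_samples stand in for LPF samples that a real implementation would obtain from the sparse GP predictive equations.

```python
import numpy as np

def kl_gaussian(m, S, K):
    """KL( N(m, S) || N(0, K) ) between variational and prior Gaussians."""
    M = len(m)
    Kinv = np.linalg.inv(K)
    return 0.5 * (np.trace(Kinv @ S) + m @ Kinv @ m - M
                  + np.linalg.slogdet(K)[1] - np.linalg.slogdet(S)[1])

def elbo_minibatch(loglik_fn, y_batch, f_samples, scale, q_params, K_priors):
    """Unbiased minibatch estimate of the bound L.

    f_samples: Monte Carlo draws of the LPFs at the minibatch inputs;
               loglik_fn(y_batch, f) must return a scalar log-likelihood.
    scale:     N / batch_size, rescaling the data term to the full dataset.
    q_params:  list of (m_q, S_q) pairs, one per latent function u_q.
    K_priors:  list of prior covariances K_q at the inducing inputs.
    """
    # Expected log-likelihood, approximated by averaging over MC samples.
    ell = np.mean([loglik_fn(y_batch, f) for f in f_samples])
    # One KL divergence per latent function u_q.
    kl = sum(kl_gaussian(m, S, K) for (m, S), K in zip(q_params, K_priors))
    return scale * ell - kl
```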
Results

Code → github.com/pmorenoz/HetMOGP

Missing Gap Prediction: We predict observations in one output (binary classification) using training information from another one (Gaussian regression). Multi-output test-NLPD: $(32.5 \pm 0.2) \times 10^{-2}$; single-output test-NLPD: $(40.51 \pm 0.08) \times 10^{-2}$.

[Figure: three panels over a real input in [0, 1] — Output 1: Gaussian regression; Output 2: binary classification; Single output: binary classification.]

London House Price: Complete register of properties sold in the Greater London County during 2017. All property addresses were translated to latitude-longitude points. For each spatial input, we considered two observations, one binary (property type) and one real (sale price).

[Figure: four maps over Greater London (longitude vs. latitude) — observed property type (flat/other), observed sale price (79K£ to 1.5M£), predicted probability of a flat, and predicted log-price variance.]

Test-NLPD (London):

                Bernoulli       Heteroscedastic   Global
HetMOGP         6.38 ± 0.46     10.05 ± 0.64      16.44 ± 0.01
ChainedGP       6.75 ± 0.25     10.56 ± 1.03      17.31 ± 1.06

Human Behavior Data: We model human behavior in psychiatric patients. Our data come from a medical study that uses the eB2 monitoring app.

[Figure: one week (Monday to Sunday) of three outputs — Output 1: binary presence/absence at home; Output 2: log-distance from home (km); Output 3: binary use/non-use of WhatsApp.]

Conclusions

We present a MOGP model for handling heterogeneous observations that is able to work on large-scale datasets. Experimental results show relevant improvements with respect to independent learning.

Acknowledgements: PMM is supported by a doctoral FPI grant (BES2016-077626) under the project Macro-ADOBE (TEC2015-67719-P), MINECO, Spain. AAR acknowledges the projects ADVENTURE (TEC2015-69868-C2-1-R), AID (TEC2014-62194-EXP) and CASI-CAM-CM (S2013/ICE-2845). MAA has been partially financed by the Engineering and Physical Sciences Research Council (EPSRC) research projects EP/N014162/1 and EP/R034303/1.

References

Y. W. Teh et al. Semiparametric latent factor models. AISTATS, 2005.
M. A. Álvarez et al. Sparse convolved Gaussian processes for multi-output regression. NIPS, 2008.
J. D. Hadfield. MCMC methods for multi-response GLMMs. JSS, 2010.
J. Hensman et al. Gaussian processes for big data. UAI, 2013.
A. Saul et al. Chained Gaussian processes. AISTATS, 2016.
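The test-NLPD figures quoted in the results measure the average negative log predictive density on held-out points. The metric is not defined on the poster; the sketch below is one common Monte Carlo version of it, reusing per-point log-likelihood functions in the style sketched earlier, and is an assumption rather than the evaluation code used for the reported numbers.

```python
import numpy as np

def test_nlpd(logpdf, y_test, f_samples):
    """Negative log predictive density, averaged over test points.

    logpdf:    per-point log-likelihood, e.g. bernoulli_logpdf above,
               returning one value per test point.
    f_samples: posterior draws of the LPFs at the test inputs,
               shape (num_samples, num_lpfs, num_test).
    """
    # log p(y*) ~= log mean_s p(y* | f^(s)), computed stably in log space.
    log_p = np.stack([logpdf(y_test, f) for f in f_samples])   # (S, N_test)
    log_pred = np.logaddexp.reduce(log_p, axis=0) - np.log(len(f_samples))
    return -np.mean(log_pred)
```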