Data Fusion based on Coupled Matrix and Tensor Factorizations Evrim Acar University of Copenhagen March 11, 2014, Ilulissat, Greenland
[Figure: coupled data sets for the recommender example, with modes labeled users, activities, locations, and features]
Data Fusion Applications of Interest:
Recommender Systems. Goal: To recommend to users activities that they may be interested in doing at various locations.
Missing Data Estimation
[Figure: coupled metabolomics data sets (NMR, EEM, LC-MS) sharing the samples mode; other axis labels: chemical shifts, excitation, peaks]
Metabolomics. Goal: To capture underlying patterns uniquely so that these patterns can be used for biomarker discovery.
Capturing underlying structures accurately and uniquely
Coupled Matrix and Tensor Factorizations (CMTF)
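The problem can be formulated as:
[Figure: the tensor modeled as a sum of rank-one components (Tensor Factorization: CANDECOMP/PARAFAC (CP)) and the matrix modeled as a sum of rank-one components (Matrix Factorization)]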
Joint analysis of heterogeneous data from multiple sources can be formulated as a coupled matrix and tensor factorization problem. In CMTF, higher-order tensors and matrices are simultaneously factorized by fitting a CP model to higher-order tensors and factorizing matrices in a coupled manner.
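A minimal sketch of such a coupled objective, for one third-order tensor X and one matrix Y that share their first mode. The symbols A, B, C, V, the 1/2 scaling, and the function names are assumptions made for illustration, not notation taken from the slide; the structure (a CP fit for the tensor plus a matrix factorization fit, with the first-mode factor A shared) follows the description above.

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Rebuild a third-order tensor from CP factors: sum over r of a_r o b_r o c_r."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def cmtf_objective(X, Y, A, B, C, V):
    """Assumed form: 0.5*||X - [[A,B,C]]||^2 + 0.5*||Y - A V^T||^2, with A shared."""
    tensor_residual = X - cp_reconstruct(A, B, C)
    matrix_residual = Y - A @ V.T
    return 0.5 * np.sum(tensor_residual ** 2) + 0.5 * np.sum(matrix_residual ** 2)

# Toy coupled data built from known factors (sizes are arbitrary).
rng = np.random.default_rng(0)
I, J, K, M, R = 6, 5, 4, 7, 2
A, B, C, V = (rng.standard_normal((n, R)) for n in (I, J, K, M))
X = cp_reconstruct(A, B, C)
Y = A @ V.T

f_true = cmtf_objective(X, Y, A, B, C, V)             # evaluated at the generating factors
f_perturbed = cmtf_objective(X, Y, A + 0.1, B, C, V)  # evaluated away from them
```

Because A appears in both terms, moving A changes the fit of both data sets at once; that is the coupling.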
All-at-once Optimization for CMTF [Acar, Dunlavy and Kolda, KDD Workshop MLG, 2011]
CMTF-OPT is a gradient-based optimization approach for joint factorization of coupled matrices and higher-order tensors.
Step 1: Define the objective function.
Step 2: Compute the gradient (using matricization and the Khatri-Rao product), then vectorize and concatenate the partials.
Step 3: Pick a first-order optimization method, e.g., Nonlinear Conjugate Gradient (NCG) or Limited-memory BFGS (L-BFGS) from the Poblano Toolbox [Dunlavy, Kolda and Acar, 2010].
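The gradient computation can be sketched as follows, assuming a third-order tensor X coupled with a matrix Y in the first mode and a squared-error objective with 1/2 scaling. The helper names (`unfold`, `khatri_rao`, `unpack`, `cmtf_value_and_gradient`) are hypothetical and this is not the CMTF Toolbox code. The sketch only checks the analytic gradient against central finite differences; a first-order optimizer would consume the same (value, gradient) pair.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization: rows index `mode`; the remaining modes are
    flattened with the earliest remaining mode varying fastest."""
    return np.reshape(np.moveaxis(X, mode, 0), (X.shape[mode], -1), order="F")

def khatri_rao(P, Q):
    """Khatri-Rao product: column r equals kron(P[:, r], Q[:, r])."""
    return np.einsum("ir,jr->ijr", P, Q).reshape(-1, P.shape[1])

def unpack(x, shapes):
    """Split one long parameter vector back into factor matrices."""
    mats, start = [], 0
    for rows, cols in shapes:
        mats.append(x[start:start + rows * cols].reshape(rows, cols))
        start += rows * cols
    return mats

def cmtf_value_and_gradient(x, X, Y, shapes):
    A, B, C, V = unpack(x, shapes)
    kr_a, kr_b, kr_c = khatri_rao(C, B), khatri_rao(C, A), khatri_rao(B, A)
    res_x = A @ kr_a.T - unfold(X, 0)        # tensor residual, mode-1 unfolding
    res_y = A @ V.T - Y                      # matrix residual
    f = 0.5 * np.sum(res_x ** 2) + 0.5 * np.sum(res_y ** 2)
    grad_A = res_x @ kr_a + res_y @ V        # A receives terms from both data sets
    grad_B = (B @ kr_b.T - unfold(X, 1)) @ kr_b
    grad_C = (C @ kr_c.T - unfold(X, 2)) @ kr_c
    grad_V = res_y.T @ A
    # Vectorize and concatenate the partials.
    grad = np.concatenate([g.ravel() for g in (grad_A, grad_B, grad_C, grad_V)])
    return f, grad

rng = np.random.default_rng(1)
I, J, K, M, R = 4, 3, 5, 6, 2
X = rng.standard_normal((I, J, K))
Y = rng.standard_normal((I, M))
shapes = [(I, R), (J, R), (K, R), (M, R)]
x0 = rng.standard_normal(sum(rows * cols for rows, cols in shapes))
f0, g0 = cmtf_value_and_gradient(x0, X, Y, shapes)

# Central finite differences as an independent check of the gradient.
eps = 1e-6
g_fd = np.zeros_like(x0)
for n in range(x0.size):
    step = np.zeros_like(x0)
    step[n] = eps
    f_plus, _ = cmtf_value_and_gradient(x0 + step, X, Y, shapes)
    f_minus, _ = cmtf_value_and_gradient(x0 - step, X, Y, shapes)
    g_fd[n] = (f_plus - f_minus) / (2 * eps)
max_gradient_error = np.max(np.abs(g0 - g_fd))
```

Returning the value and the stacked gradient from one function matches the interface most first-order solvers expect, which is why all factor matrices are packed into a single vector.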
CMTF easily extends to incomplete data sets
We fit the model only to the known data entries and ignore the missing entries (in higher-order tensors and/or matrices).
Our objective: [equation on slide]
Gradient: [equations on slide]; vectorize and concatenate the partials.
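A minimal sketch of fitting only to known entries, assuming a binary indicator tensor W (1 = known, 0 = missing) that masks the tensor residual. W, the 1/2 scaling, and the function name are illustrative assumptions; only the tensor is taken to have missing entries here.

```python
import numpy as np

def masked_cmtf_objective(X, W, Y, A, B, C, V):
    """Squared error over known tensor entries plus the full matrix fit."""
    model = np.einsum("ir,jr,kr->ijk", A, B, C)
    # Keep the residual only where W marks an entry as known; missing cells
    # (which may hold NaN) contribute exactly zero.
    tensor_residual = np.where(W > 0, X - model, 0.0)
    matrix_residual = Y - A @ V.T
    return 0.5 * np.sum(tensor_residual ** 2) + 0.5 * np.sum(matrix_residual ** 2)

rng = np.random.default_rng(2)
I, J, K, M, R = 6, 5, 4, 7, 2
A, B, C, V = (rng.standard_normal((n, R)) for n in (I, J, K, M))
X_full = np.einsum("ir,jr,kr->ijk", A, B, C)
Y = A @ V.T

W = (rng.random(X_full.shape) > 0.5).astype(float)  # randomly set entries to missing
X_observed = np.where(W > 0, X_full, np.nan)         # missing cells hold NaN

f_masked = masked_cmtf_objective(X_observed, W, Y, A, B, C, V)

# Missing entries are estimated by reading them off the fitted model.
model = np.einsum("ir,jr,kr->ijk", A, B, C)
X_filled = np.where(W > 0, X_observed, model)
```

The gradient changes in the same way: the masked residual replaces the full residual in each partial, so no imputation step is needed during optimization.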
Coupling improves missing data estimation!
Metabolomics: We have plasma samples measured using different analytical techniques, i.e., NMR and Fluorescence Spectroscopy.
Randomly set tensor entries to missing.
[Figure: coupled data sets sharing the samples mode, with axes labeled chemical shifts and emission; plot of Missing Data Recovery Error]
Missing data recovery error is lower using the coupled approach at high amounts of missing data.
[Acar et al., Chemometrics and Intelligent Lab. Systems, 2013]
Coupling can handle structured missing data!
[Figure: coupled data sets with modes labeled activities, users, locations, and features]
We face the cold-start problem when a new user starts using an application, e.g., a location-activity recommender system. This corresponds to a missing slice for the new user.
We cannot use a low-rank approximation of the tensor alone to fill in the missing slice. However, we can make use of additional sources of information through the coupled model.
For the missing slice i (for i = 1, 2, ..., I):
[Figure: original values vs. values estimated using CMTF; average ROC curve for I = 146 users]
For coupled analysis of this data set using different tensor models and loss functions, see [Ermis, Acar and Cemgil, Data Mining and Knowledge Discovery, 2013].
When to fuse vs. Missing data estimation performance of CMTF
The missing data estimation performance of CMTF can only partially answer the question "when to fuse": even when there are shared factors, CMTF may perform worse in terms of missing data estimation.
[Figure: coupled data sets with dimensions labeled I, J, K, and M; plot of Missing Data Recovery Error]
[Acar et al., Chemometrics and Intelligent Lab. Systems, 2013]
Data Fusion Applications of Interest:
Recommender Systems. Goal: To recommend to users activities that they may be interested in doing at various locations. We have the data mining tools for this application.
Metabolomics. Goal: To capture underlying patterns uniquely so that these patterns can be used for biomarker discovery. This is still a challenge!
CMTF assumes that all factors are shared!
Generate factor matrices, construct coupled data sets, and minimize f using CMTF-OPT.
Example 1: All components are common. True structure captured!
In real applications, coupled data sets often have both shared and unshared factors. However, CMTF formulation focuses on modeling only the shared factors and fails to identify shared/unshared factors.
CMTF fails to identify shared/unshared factors!
Generate factor matrices, construct coupled data sets, and minimize f using CMTF-OPT.
Example 2: One shared and one unshared component in each data set. Fails to identify shared and unshared components!
[Figure: plot with horizontal axis labeled components]
Reformulating CMTF...
We reformulate coupled matrix and tensor factorizations by normalizing the columns of the factor matrices to unit norm and taking the weights out.
[Figure: the tensor and the matrix each written as a weighted sum of R rank-one components]
Component weights are important as they can indicate what is common and what is not common!
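Pulling the weights out can be sketched as below for a tensor with factors A, B, C coupled with a matrix with factors A, V. The names `lam` and `sigma` for the two weight vectors and the function `pull_out_weights` are assumptions for illustration, and every factor column is assumed to be nonzero.

```python
import numpy as np

def pull_out_weights(A, B, C, V):
    """Normalize factor columns to unit norm and collect per-component weights."""
    na, nb, nc, nv = (np.linalg.norm(F, axis=0) for F in (A, B, C, V))
    lam = na * nb * nc    # weight of each component in the tensor
    sigma = na * nv       # weight of each component in the matrix
    return lam, sigma, A / na, B / nb, C / nc, V / nv

rng = np.random.default_rng(3)
I, J, K, M, R = 6, 5, 4, 7, 3
A, B, C, V = (rng.standard_normal((n, R)) for n in (I, J, K, M))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
Y = A @ V.T

lam, sigma, An, Bn, Cn, Vn = pull_out_weights(A, B, C, V)

# The weighted, unit-norm form reproduces the same data.
X_again = np.einsum("r,ir,jr,kr->ijk", lam, An, Bn, Cn)
Y_again = (An * sigma) @ Vn.T
```

With unit-norm columns the scale of each component lives entirely in `lam` and `sigma`, so a weight near zero in one data set signals that the component is not present there.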
ACMTF: Structure-Revealing CMTF
Goal: To jointly analyze coupled data sets and identify their shared and unshared components [Acar, Lawaetz, Rasmussen, and Bro, IEEE EMBC, 2013].
Optimization Problem: [equation on slide; it includes l1-norm sparsity penalties]
• Define the objective function: add the constraints as quadratic penalty terms.
• Smooth approximation: replace the sparsity penalties with differentiable approximations.
• Compute the gradient and pick a first-order optimization method: Nonlinear Conjugate Gradient from the Poblano Toolbox [Dunlavy, Kolda and Acar, 2010].
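One way such a structure-revealing objective could look is sketched below. Several choices are assumptions rather than details from the slides: the sparsity penalties are placed on the component weights `lam` and `sigma`, |x| is approximated by sqrt(x^2 + eps), the unit-norm condition on factor columns enters as a quadratic penalty with weight `alpha`, and the fit terms are unscaled. The values of `beta`, `alpha`, and `eps` are hypothetical.

```python
import numpy as np

def acmtf_objective(X, Y, lam, sigma, A, B, C, V, beta=1e-3, alpha=1.0, eps=1e-8):
    """Data fit + smoothed l1 penalty on weights + quadratic unit-norm penalty."""
    fit_x = np.sum((X - np.einsum("r,ir,jr,kr->ijk", lam, A, B, C)) ** 2)
    fit_y = np.sum((Y - (A * sigma) @ V.T) ** 2)
    # Differentiable stand-in for the l1-norm of each weight vector.
    sparsity = beta * (np.sum(np.sqrt(lam ** 2 + eps)) + np.sum(np.sqrt(sigma ** 2 + eps)))
    # Quadratic penalty pushing every factor column toward unit norm.
    unit_norm = alpha * sum(
        np.sum((np.linalg.norm(F, axis=0) - 1.0) ** 2) for F in (A, B, C, V)
    )
    return fit_x + fit_y + sparsity + unit_norm

rng = np.random.default_rng(4)
I, J, K, M, R = 6, 5, 4, 7, 3

def unit_columns(rows):
    F = rng.standard_normal((rows, R))
    return F / np.linalg.norm(F, axis=0)

A, B, C, V = (unit_columns(n) for n in (I, J, K, M))
# One shared and one unshared component in each data set:
lam_true = np.array([1.0, 1.0, 0.0])    # third component is absent from the tensor
sigma_true = np.array([1.0, 0.0, 1.0])  # second component is absent from the matrix
X = np.einsum("r,ir,jr,kr->ijk", lam_true, A, B, C)
Y = (A * sigma_true) @ V.T

f_at_truth = acmtf_objective(X, Y, lam_true, sigma_true, A, B, C, V)
f_all_shared = acmtf_objective(X, Y, np.ones(R), np.ones(R), A, B, C, V)
```

In this toy setting the true weight pattern scores lower than the pattern that treats every component as shared, which illustrates how zeros in the weights can mark unshared components.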
Sparsity penalties enable us to capture the true structure!
Generate factor matrices, construct coupled data sets, and minimize the CMTF and ACMTF objectives.
One shared and one unshared component in each data set:
CMTF: Fails to identify shared/unshared components!
ACMTF: True structure captured!
[Figure: two panels, CMTF and ACMTF, with horizontal axis labeled components (1 to 3) and vertical axis from 0 to 3]
Joint Analysis of LC-MS and NMR measurements of chemical mixtures
[Figure: DOSY-NMR tensor (29 mixtures × 1591 chemical shifts × 8 gradient levels) coupled with an LC-MS matrix (29 mixtures × 168 peaks)]
Goal: To identify shared/unshared factors in each data set.
Data: 29 chemical mixtures measured using DOSY-NMR and LC-MS. Mixtures are prepared using the following five chemicals:
• Val-Tyr-Val
• Trp-Gly
• Phe
• Maltoheptaose
• Propanol
Four chemicals are expected to show up in both types of measurements (NMR and LC-MS); the remaining one can only be captured by NMR.
[Acar, Papalexakis, Gurdeniz, Rasmussen, Lawaetz, Nilsson, Bro, Submitted, 2014]
ACMTF can capture the chemicals!
ACMTF can capture the chemicals and the design!
ACMTF can identify shared/unshared components!
[Figure: ACMTF model of the NMR and LC-MS data sets; components labeled Val-Tyr-Val, Trp-Gly, Phe, Maltoheptaose, Propanol, and Noise]
Values shown on the figure: 0.20 0.28 0.15 0.19 0.17 0.24 0.16 0.77 0.06 0.18 0.65 0.07
Component weights indicate what is common and what is not!
ISSUES: • UNIQUENESS PROPERTIES! • PARAMETER TUNING!
Summary
• Coupled matrix and tensor factorization is effective at incorporating additional sources of information for missing data estimation.
  Challenges:
  • A systematic approach for weighting may improve the missing data estimation performance even further.
• We have reformulated coupled matrix and tensor factorizations in order to capture shared/unshared factors accurately/uniquely.
  Challenges:
  • Need better understanding of the uniqueness properties
  • Parameter tuning
Thank you!
Evrim Acar www.models.life.ku.dk/~acare
Project Webpage http://www.models.life.ku.dk/joda
Data Fusion based on Coupled Matrix and Tensor Factorizations:
All-at-once Optimization Approach: E. Acar, T. G. Kolda, and D. M. Dunlavy. All-at-once Optimization for Coupled Matrix and Tensor Factorizations. KDD Workshop on Mining and Learning with Graphs, 2011 (arXiv:1105.3422) (Code available with MATLAB CMTF Toolbox v.1.0)
Shared/Unshared Factors:
• E. Acar, A. J. Lawaetz, M. A. Rasmussen, and R. Bro, Structure Revealing Data Fusion Model with Applications in Metabolomics, IEEE EMBC, 2013. (Code available with MATLAB CMTF Toolbox v.1.0)
• E. Acar, E. E. Papalexakis, G. Gurdeniz, M. A. Rasmussen, A. J. Lawaetz, M. Nilsson, and R. Bro, Structure Revealing Data Fusion, Submitted for Publication, 2014.
Missing Data Estimation: E. Acar, M. A. Rasmussen, F. Savorani, T. Næs, and R. Bro. Understanding Data Fusion within the Framework of Coupled Matrix and Tensor Factorizations, Chemometrics and Intelligent Laboratory Systems, 129: 53-63, 2013.
Link Prediction: B. Ermis, E. Acar, and A. T. Cemgil. Link Prediction in Heterogeneous Data via Generalized Coupled Tensor Factorization, Data Mining and Knowledge Discovery, 2013.