
Data Fusion based on Coupled Matrix and Tensor Factorizations

Evrim Acar

University of Copenhagen

March 11, 2014, Ilulissat, Greenland

[Figure: coupled data sets with modes labeled users, activities, locations, and features]

Data Fusion Applications of Interest:

Recommender Systems Goal: To recommend users activities that they may be interested in doing at various locations

Missing Data Estimation

[Figure: coupled data sets from NMR, EEM, and LC-MS measurements, with modes labeled samples, peaks, excitation, and chemical shifts]

Metabolomics Goal: To capture underlying patterns uniquely so that these patterns can be used for biomarker discovery.

Capturing underlying structures accurately and uniquely

Coupled Matrix and Tensor Factorizations (CMTF)

The problem can be formulated as:

Tensor factorization: CANDECOMP/PARAFAC (CP)

Matrix factorization

[Figure: the tensor and the matrix each written as a sum of components]

Joint analysis of heterogeneous data from multiple sources can be formulated as a coupled matrix and tensor factorization problem. In CMTF, higher-order tensors and matrices are simultaneously factorized by fitting a CP model to higher-order tensors and factorizing matrices in a coupled manner.
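For a third-order tensor coupled with a matrix in the first mode, the formulation in the cited all-at-once paper [Acar, Dunlavy and Kolda, 2011] can be written as follows. The symbols A, B, C, V and the number of components R are notation assumed here, since the slide equations did not survive extraction:

```latex
\begin{aligned}
\mathcal{X} &\approx [\![\mathbf{A},\mathbf{B},\mathbf{C}]\!]
  = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad
\mathbf{Y} \approx \mathbf{A}\mathbf{V}^{\mathsf T}
  = \sum_{r=1}^{R} \mathbf{a}_r \mathbf{v}_r^{\mathsf T}, \\
f(\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{V}) &=
  \tfrac12 \bigl\| \mathcal{X} - [\![\mathbf{A},\mathbf{B},\mathbf{C}]\!] \bigr\|^2
  + \tfrac12 \bigl\| \mathbf{Y} - \mathbf{A}\mathbf{V}^{\mathsf T} \bigr\|^2 .
\end{aligned}
```

The factor matrix A appears in both terms, which is what couples the two data sets.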

CMTF-OPT is a gradient-based optimization approach for joint factorization of coupled matrices and higher-order tensors.

Step 1: Define the objective function

Step 2: Compute the gradient: vectorize and concatenate the partials

Step 3: Pick a first-order optimization method, e.g., Nonlinear Conjugate Gradient (NCG) or Limited-memory BFGS (L-BFGS) from the Poblano Toolbox [Dunlavy, Kolda and Acar, 2010]

All-at-once Optimization for CMTF [Acar, Dunlavy and Kolda, KDD Workshop MLG, 2011]

Matricization

Khatri-Rao Product
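A minimal NumPy sketch of the all-at-once steps, assuming a third-order tensor X coupled with a matrix Y in the first mode and the squared-error objective f = 1/2 ||X - [[A,B,C]]||^2 + 1/2 ||Y - A V^T||^2. The function names are illustrative and are not taken from the CMTF Toolbox; the unfolding flattens the remaining modes in C order, which differs from the column ordering used in some tensor references but is consistent with the Khatri-Rao helper below.

```python
import numpy as np

def khatri_rao(P, Q):
    """Column-wise Kronecker product of P (J x R) and Q (K x R) -> (J*K x R)."""
    return np.einsum("jr,kr->jkr", P, Q).reshape(-1, P.shape[1])

def unfold(T, mode):
    """Mode-n matricization, with the remaining modes flattened in C order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cmtf_loss_grad(A, B, C, V, X, Y):
    """Return f and the vectorized, concatenated partials of f."""
    Ex = np.einsum("ir,jr,kr->ijk", A, B, C) - X    # tensor residual
    Ey = A @ V.T - Y                                 # matrix residual
    f = 0.5 * np.sum(Ex ** 2) + 0.5 * np.sum(Ey ** 2)
    gA = unfold(Ex, 0) @ khatri_rao(B, C) + Ey @ V   # A is shared: both terms contribute
    gB = unfold(Ex, 1) @ khatri_rao(A, C)
    gC = unfold(Ex, 2) @ khatri_rao(A, B)
    gV = Ey.T @ A
    return f, np.concatenate([g.ravel() for g in (gA, gB, gC, gV)])

# Small synthetic check: at the true factors the loss and gradient vanish.
rng = np.random.default_rng(0)
A, B, C, V = (rng.standard_normal((n, 2)) for n in (4, 3, 5, 6))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
Y = A @ V.T
f, g = cmtf_loss_grad(A, B, C, V, X, Y)
print(f, np.abs(g).max())
```

The pair (f, g) is the form a first-order optimizer expects. The slides use NCG or L-BFGS from the Poblano Toolbox; in Python, a routine such as `scipy.optimize.minimize` with `jac=True` could play that role, which is a substitution and not the authors' setup.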

We fit the model only to the known data entries and ignore the missing entries (in higher-order tensors and/or matrices)

Our objective:

Gradient: Let

Vectorize and concatenate the partials

CMTF easily extends to incomplete data sets
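Fitting only the known entries can be sketched by weighting the tensor residual with a binary indicator W (1 = observed, 0 = missing), giving f_W = 1/2 ||W * (X - [[A,B,C]])||^2 + 1/2 ||Y - A V^T||^2. The code below is a hedged NumPy illustration under that assumption, with missing values only in the tensor; the function name and the NaN convention for missing entries are choices made here, not the toolbox's API.

```python
import numpy as np

def masked_cmtf_loss_grad(A, B, C, V, X, W, Y):
    """Loss and partials when only tensor entries with W == 1 are fitted."""
    X0 = np.where(W > 0, X, 0.0)                        # missing entries may hold NaN
    Ex = W * (np.einsum("ir,jr,kr->ijk", A, B, C) - X0)  # residual on observed entries only
    Ey = A @ V.T - Y
    f = 0.5 * np.sum(Ex ** 2) + 0.5 * np.sum(Ey ** 2)
    grads = {
        "A": np.einsum("ijk,jr,kr->ir", Ex, B, C) + Ey @ V,
        "B": np.einsum("ijk,ir,kr->jr", Ex, A, C),
        "C": np.einsum("ijk,ir,jr->kr", Ex, A, B),
        "V": Ey.T @ A,
    }
    return f, grads

rng = np.random.default_rng(0)
A, B, C, V = (rng.standard_normal((n, 2)) for n in (5, 4, 3, 6))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
Y = A @ V.T
W = (rng.random(X.shape) > 0.5).astype(float)  # roughly half of the entries observed
X_obs = np.where(W > 0, X, np.nan)             # unknown entries are never read
f, grads = masked_cmtf_loss_grad(A, B, C, V, X_obs, W, Y)
print(f)
```

Because W is binary, the masked residual can be reused directly in the gradient; a non-binary weighting would need W applied twice.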

Metabolomics: We have plasma samples measured using different analytical techniques, i.e., NMR and Fluorescence Spectroscopy.

Coupling improves missing data estimation!

Randomly set tensor entries to missing.

[Figure: coupled data sets with modes labeled samples, chemical shifts, and emission; plot of missing data recovery error]

Missing data recovery error is lower using the coupled approach at high amounts of missing data.

[Acar et al., Chemometrics and Intelligent Lab. Systems, 2013]
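The slides do not spell out how the recovery error in the plot is defined. One common choice, assumed here, is the relative error restricted to the missing entries, with W again marking observed entries by 1 and missing entries by 0:

```python
import numpy as np

def recovery_error(X_true, X_hat, W):
    """Relative error over the entries marked missing (W == 0)."""
    miss = (W == 0)
    return np.linalg.norm((X_true - X_hat)[miss]) / np.linalg.norm(X_true[miss])

X_true = np.arange(1.0, 25.0).reshape(2, 3, 4)
W = np.ones_like(X_true)
W[0, 1, :] = 0                                            # hide a few entries
print(recovery_error(X_true, X_true, W))                  # perfect recovery
print(recovery_error(X_true, np.zeros_like(X_true), W))   # predicting all zeros
```

A value near 0 means the hidden entries are reconstructed well, while a value of 1 is what an all-zero estimate would score.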

Coupling can handle structured missing data!

[Figure: coupled data sets with modes labeled users, activities, locations, and features]

We cannot use low-rank approximation of a tensor to fill in the missing slice. However, we can make use of additional sources of information through the coupled model.

We face the cold-start problem when a new user starts using an application, e.g., a location-activity recommender system. This corresponds to a missing slice for the new user.
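A small synthetic illustration of why a missing slice defeats a tensor-only model but not a coupled one. It assumes the first tensor mode indexes users and, to keep the sketch short, holds the other factor matrices at their true values; neither assumption comes from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, M, R = 6, 5, 4, 7, 2
A, B, C, V = (rng.standard_normal((n, R)) for n in (I, J, K, M))
X = np.einsum("ir,jr,kr->ijk", A, B, C)   # user x location x activity (assumed layout)
Y = A @ V.T                               # side information coupled in the user mode

W = np.ones_like(X)
new_user = 3
W[new_user] = 0.0                         # the whole slice of the new user is missing

A0 = rng.standard_normal((I, R))          # an arbitrary current estimate of A
resid = W * (np.einsum("ir,jr,kr->ijk", A0, B, C) - X)
grad_tensor = np.einsum("ijk,jr,kr->ir", resid, B, C)  # tensor term of the gradient
grad_matrix = (A0 @ V.T - Y) @ V                        # coupled matrix term

print(grad_tensor[new_user])  # all zeros: the tensor carries no information on this row
print(grad_matrix[new_user])  # nonzero: the coupled matrix does

# With V fixed, the new user's row follows from Y by least squares,
# and the missing slice can then be filled in.
a_new, *_ = np.linalg.lstsq(V, Y[new_user], rcond=None)
slice_hat = np.einsum("r,jr,kr->jk", a_new, B, C)
print(np.abs(slice_hat - X[new_user]).max())
```

In the tensor-only objective the row of A for the new user never receives a gradient, so it stays wherever it was initialized; the coupled term is the only source of information about it.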

For the missing slice i (for i=1,2,...,I):

Original values vs. values estimated using CMTF

Average ROC curve for I=146 users

For coupled analysis of this data set using different tensor models and loss functions, see [Ermis, Acar and Cemgil, Data Mining and Knowledge Discovery, 2013].

When to fuse vs. Missing data estimation performance of CMTF

Missing data estimation of CMTF can be used to answer the question "when to fuse" only partially, because even when there are shared factors, CMTF may perform worse in terms of missing data estimation.

[Figure: coupled data sets with mode sizes I, J, K, and M; plot of missing data recovery error]

[Acar et al., Chemometrics and Intelligent Lab. Systems, 2013]


Data Fusion Applications of Interest:

Recommender Systems Goal: To recommend users activities that they may be interested in doing at various locations

Metabolomics Goal: To capture underlying patterns uniquely so that these patterns can be used for biomarker discovery.

We have the data mining tools for this application

This is still a challenge!


CMTF assumes that all factors are shared!

Generate factor matrices. Construct coupled data sets. Minimize f using CMTF-OPT.

Example 1: All components are common:

True structure captured!

In real applications, coupled data sets often have both shared and unshared factors. However, CMTF formulation focuses on modeling only the shared factors and fails to identify shared/unshared factors.

Construct coupled data sets. Minimize f using CMTF-OPT.

Example 2: One shared and one unshared component in each data set:

Fails to identify shared and unshared components!
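Coupled data sets like those in Example 2 could be constructed along these lines. The sizes, the random factors, and the weight vectors `lam` and `sigma` are illustrative choices, not the values used for the slides; a zero weight switches a component off in one data set.

```python
import numpy as np

def unit_columns(rng, n, R):
    """Random factor matrix with columns scaled to unit norm."""
    F = rng.standard_normal((n, R))
    return F / np.linalg.norm(F, axis=0)

rng = np.random.default_rng(0)
I, J, K, M, R = 50, 30, 40, 20, 3
A, B, C, V = (unit_columns(rng, n, R) for n in (I, J, K, M))

lam = np.array([1.0, 1.0, 0.0])    # tensor weights: component 3 is absent from X
sigma = np.array([1.0, 0.0, 1.0])  # matrix weights: component 2 is absent from Y

X = np.einsum("r,ir,jr,kr->ijk", lam, A, B, C)  # one shared, one unshared component
Y = (A * sigma) @ V.T                           # one shared, one unshared component

print(np.linalg.matrix_rank(X.reshape(I, -1)), np.linalg.matrix_rank(Y))
```

Component 1 is shared, component 2 lives only in the tensor, and component 3 only in the matrix. This sketch is noise-free; noise could be added to X and Y before fitting.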

CMTF fails to identify shared/unshared factors!


[Figure: each data set modeled with R components]

Component weights are important as they can indicate what is common and what is not common!

Reformulating CMTF...

We reformulate coupled matrix and tensor factorizations by normalizing the columns of the factor matrices to unit norm and taking the weights out.
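Written out, with the weights collected in vectors λ (tensor) and σ (matrix), the reformulated model takes the following form. The notation is assumed here, following the cited structure-revealing paper, because the slide equation was not recoverable:

```latex
\mathcal{X} \approx [\![\boldsymbol{\lambda};\mathbf{A},\mathbf{B},\mathbf{C}]\!]
  = \sum_{r=1}^{R} \lambda_r\, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad
\mathbf{Y} \approx \mathbf{A}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf T}
  = \sum_{r=1}^{R} \sigma_r\, \mathbf{a}_r \mathbf{v}_r^{\mathsf T},
\qquad
\|\mathbf{a}_r\| = \|\mathbf{b}_r\| = \|\mathbf{c}_r\| = \|\mathbf{v}_r\| = 1 .
```

A component r with λ_r ≠ 0 and σ_r ≠ 0 is shared; a component with one of the two weights equal to zero is present in only one data set.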


ACMTF: Structure-Revealing CMTF

Optimization Problem:

Define the objective function: add the unit-norm constraints as quadratic penalty terms

Smooth Approximation: replace sparsity penalties with differentiable approximations

Compute the gradient and pick a first-order optimization method, e.g., Nonlinear Conjugate Gradient (NCG) from the Poblano Toolbox [Dunlavy, Kolda and Acar, 2010]

l1-norm

Goal: To jointly analyze coupled data sets and identify their shared and unshared components [Acar, Lawaetz, Rasmussen, and Bro, IEEE EMBC, 2013]
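The two ingredients that make the ACMTF objective differentiable can be sketched as below. The smooth l1 surrogate sqrt(w^2 + eps) and the penalty (||f_r|| - 1)^2 on each factor column are the forms assumed here; the value of `eps` and the function names are illustrative, and in a full objective each term would be scaled by its own penalty parameter.

```python
import numpy as np

def smooth_l1(w, eps=1e-8):
    """Differentiable surrogate for ||w||_1 and its gradient."""
    s = np.sqrt(w ** 2 + eps)
    return s.sum(), w / s

def unit_norm_penalty(F):
    """Sum over columns of (||f_r|| - 1)^2 and its gradient with respect to F."""
    norms = np.linalg.norm(F, axis=0)
    dev = norms - 1.0
    return np.sum(dev ** 2), 2.0 * dev * F / norms

w = np.array([3.0, -4.0, 0.0])
value, grad = smooth_l1(w)
print(value, grad)                     # close to 7 and to sign(w)

F = 2.0 * np.eye(3)[:, :2]             # two columns, each of norm 2
print(unit_norm_penalty(F))
```

Unlike the absolute value, the surrogate has a well-defined gradient at zero, so the weights can be optimized together with the factor matrices by NCG.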

Construct coupled data sets. Minimize the ACMTF objective.

One shared and one unshared component in each data set:

Sparsity penalties enable us to capture the true structure!

CMTF: Fails to identify shared/unshared components!

ACMTF: True structure captured!

[Figure: plots over components 1 to 3 for CMTF and ACMTF, vertical axis from 0 to 3]

Joint Analysis of LC-MS and NMR measurements of chemical mixtures

[Figure: NMR tensor (29 mixtures × 1591 chemical shifts × 8 gradient levels) coupled with LC-MS matrix (29 mixtures × 168 peaks)]

Goal: To identify shared/unshared factors in each data set

Data: 29 chemical mixtures measured using DOSY-NMR and LC-MS. Mixtures are prepared using the following five chemicals:

• Val-Tyr-Val
• Trp-Gly
• Phe
• Maltoheptaose
• Propanol

NMR LC-MS

Four chemicals are expected to show up in both types of measurements!

Can only be captured by NMR

[Acar, Papalexakis, Gurdeniz, Rasmussen, Lawaetz, Nilsson, Bro, Submitted, 2014]


CMTF fails to capture the true underlying structure!

ACMTF can capture the chemicals!

[Figure: ACMTF components for the NMR and LC-MS data sets; values shown on the slide: 0.20 0.28 0.15 0.19 0.17 0.24 0.16 0.77 0.06 0.18 0.65 0.07]

ACMTF can capture the chemicals and the design!

[Figure: NMR and LC-MS components labeled Val-Tyr-Val, Trp-Gly, Phe, Maltoheptaose, Propanol, and Noise]

ACMTF can identify shared/unshared components!
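Reading shared and unshared components off the estimated weights can be sketched as a simple thresholding step. The weights and the tolerance below are made-up illustrative values, not results from the slides, and choosing the tolerance is itself a tuning decision.

```python
def classify_components(lam, sigma, tol=0.1):
    """Label each component by the data sets in which its weight is non-negligible."""
    labels = []
    for l, s in zip(lam, sigma):
        in_tensor, in_matrix = abs(l) > tol, abs(s) > tol
        if in_tensor and in_matrix:
            labels.append("shared")
        elif in_tensor:
            labels.append("tensor only")
        elif in_matrix:
            labels.append("matrix only")
        else:
            labels.append("inactive")
    return labels

# Hypothetical weights for three components (tensor weights, matrix weights).
print(classify_components([1.0, 0.9, 0.01], [1.1, 0.0, 0.8]))
```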

Component weights indicate what is common and what is not!

ISSUES: • UNIQUENESS PROPERTIES! • PARAMETER TUNING!

Summary

• Coupled matrix and tensor factorizations are effective at incorporating additional sources of information for missing data estimation.

Challenges:
• A systematic approach for weighting may improve the missing data estimation performance even further.

• We have reformulated coupled matrix and tensor factorizations in order to capture shared/unshared factors accurately and uniquely.

Challenges:
• Need better understanding of the uniqueness properties
• Parameter tuning


Thank you!

Evrim Acar www.models.life.ku.dk/~acare

Project Webpage http://www.models.life.ku.dk/joda

Data Fusion based on Coupled Matrix and Tensor Factorizations:

All-at-once Optimization Approach: E. Acar, T. G. Kolda, and D. M. Dunlavy. All-at-once Optimization for Coupled Matrix and Tensor Factorizations. KDD Workshop on Mining and Learning with Graphs, 2011 (arXiv:1105.3422) (Code available with MATLAB CMTF Toolbox v.1.0)

Shared/Unshared Factors:

• E. Acar, A. J. Lawaetz, M. A. Rasmussen, and R. Bro, Structure Revealing Data Fusion Model with Applications in Metabolomics, IEEE EMBC, 2013. (Code available with MATLAB CMTF Toolbox v.1.0)

• E. Acar, E. E. Papalexakis, G. Gurdeniz, M. A. Rasmussen, A. J. Lawaetz, M. Nilsson, and R. Bro, Structure Revealing Data Fusion, Submitted for Publication, 2014.

Missing Data Estimation: E. Acar, M. A. Rasmussen, F. Savorani, T. Næs, and R. Bro. Understanding Data Fusion within the Framework of Coupled Matrix and Tensor Factorizations, Chemometrics and Intelligent Laboratory Systems, 129: 53-63, 2013.

Link Prediction: B. Ermis, E. Acar, and A. T. Cemgil. Link Prediction in Heterogeneous Data via Generalized Coupled Tensor Factorization, Data Mining and Knowledge Discovery, 2013.


http://www.models.life.ku.dk/joda/CMTF_Toolbox
