Data Fusion based on Coupled Matrix and Tensor Factorizations Evrim Acar University of Copenhagen March 11, 2014, Ilulissat, Greenland

Page 1: Data Fusion based on Coupled Matrix and Tensor ... - ku

Data Fusion based on Coupled Matrix and Tensor Factorizations

Evrim Acar

University of Copenhagen

March 11, 2014, Ilulissat, Greenland

Page 2: Data Fusion based on Coupled Matrix and Tensor ... - ku

Data Fusion Applications of Interest:

Recommender Systems

Goal: To recommend activities that users may be interested in doing at various locations

Missing Data Estimation

[Figure: coupled data sets with modes labeled users, activities, locations, and features]

Metabolomics

Goal: To capture underlying patterns uniquely so that these patterns can be used for biomarker discovery.

Capturing underlying structures accurately and uniquely

[Figure: NMR, EEM, and LC-MS data sets with modes labeled samples, peaks, excitation, and chemical shifts]

Page 3: Data Fusion based on Coupled Matrix and Tensor ... - ku

Coupled Matrix and Tensor Factorizations (CMTF)

The problem can be formulated as:

[Figure: a higher-order tensor and a matrix, each written as a sum of rank-one components. Tensor factorization: CANDECOMP/PARAFAC (CP); matrix factorization]

Joint analysis of heterogeneous data from multiple sources can be formulated as a coupled matrix and tensor factorization problem. In CMTF, higher-order tensors and matrices are simultaneously factorized by fitting a CP model to higher-order tensors and factorizing matrices in a coupled manner.
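The formula itself is not preserved in this transcript. As a sketch, for a third-order tensor coupled with a matrix in its first mode, the CMTF objective is commonly written as below. The symbols are notation assumed here for illustration: a tensor X of size I x J x K, a matrix Y of size I x M, and factor matrices A, B, C, V with R columns each, where A is the shared factor.

```latex
f(\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{V})
  = \tfrac{1}{2}\,\bigl\| \mathcal{X} - [\![\mathbf{A},\mathbf{B},\mathbf{C}]\!] \bigr\|^{2}
  + \tfrac{1}{2}\,\bigl\| \mathbf{Y} - \mathbf{A}\mathbf{V}^{\mathsf{T}} \bigr\|^{2},
\qquad
[\![\mathbf{A},\mathbf{B},\mathbf{C}]\!]
  = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r ,
\qquad
\mathbf{A}\mathbf{V}^{\mathsf{T}} = \sum_{r=1}^{R} \mathbf{a}_r \mathbf{v}_r^{\mathsf{T}} .
```

The first term is the CP model of the tensor and the second is the matrix factorization; the coupling enters only through the shared factor matrix A.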

Page 4: Data Fusion based on Coupled Matrix and Tensor ... - ku

All-at-once Optimization for CMTF [Acar, Kolda and Dunlavy, KDD Workshop MLG, 2011]

CMTF-OPT is a gradient-based optimization approach for joint factorization of coupled matrices and higher-order tensors.

Step 1: Define the objective function

Step 2: Compute the gradient, then vectorize and concatenate the partials

Step 3: Pick a first-order optimization method, e.g., Nonlinear Conjugate Gradient (NCG) or Limited-memory BFGS (L-BFGS) from the Poblano Toolbox [Dunlavy, Kolda and Acar, 2010]

Matricization

Khatri-Rao Product
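Steps 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration, not the CMTF Toolbox code: the function names, the row-major unfolding, and the test dimensions are assumptions made here. Because the unfolding is row-major, the Khatri-Rao factors appear in the order (B, C) rather than the (C, B) order of the column-major convention.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; row (u, v) of the result is U[u, :] * V[v, :]."""
    return np.einsum("ur,vr->uvr", U, V).reshape(-1, U.shape[1])

def unfold(T, mode):
    """Mode-n matricization: rows index `mode`, columns run over the other modes (row-major)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cmtf_fg(X, Y, A, B, C, V):
    """Objective value and concatenated gradient for tensor X and matrix Y coupled through A."""
    E = np.einsum("ir,jr,kr->ijk", A, B, C) - X   # tensor residual
    F = A @ V.T - Y                               # matrix residual
    f = 0.5 * np.sum(E**2) + 0.5 * np.sum(F**2)
    gA = unfold(E, 0) @ khatri_rao(B, C) + F @ V  # A appears in both terms
    gB = unfold(E, 1) @ khatri_rao(A, C)
    gC = unfold(E, 2) @ khatri_rao(A, B)
    gV = F.T @ A
    # Vectorize and concatenate the partials, as a first-order solver expects.
    return f, np.concatenate([g.ravel() for g in (gA, gB, gC, gV)])

# Hypothetical test problem: exact rank-2 data built from known factors.
rng = np.random.default_rng(0)
I, J, K, M, R = 4, 3, 5, 6, 2
A, B, C, V = (rng.standard_normal((n, R)) for n in (I, J, K, M))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
Y = A @ V.T

f_true, g_true = cmtf_fg(X, Y, A, B, C, V)
print(f_true, np.abs(g_true).max())  # both vanish at the generating factors
```

The pair returned by `cmtf_fg` is the kind of input a solver such as NCG or L-BFGS consumes in Step 3; the solver itself is left out of this sketch.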

Page 5: Data Fusion based on Coupled Matrix and Tensor ... - ku

CMTF easily extends to incomplete data sets

We fit the model only to the known data entries and ignore the missing entries (in higher-order tensors and/or matrices).

Our objective:

Gradient: Let

Vectorize and concatenate the partials
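The objective and gradient on this slide are not preserved in the transcript. A sketch for the case where only the tensor has missing entries, assuming a binary indicator tensor W (1 for known entries, 0 for missing ones), Hadamard product *, and the column-major unfolding convention in which X_(1) is approximated by A(C ⊙ B)^T; all symbol names are assumptions made here:

```latex
f_{\mathcal{W}}(\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{V})
  = \tfrac{1}{2}\,\bigl\| \mathcal{W} \ast \bigl( \mathcal{X} - [\![\mathbf{A},\mathbf{B},\mathbf{C}]\!] \bigr) \bigr\|^{2}
  + \tfrac{1}{2}\,\bigl\| \mathbf{Y} - \mathbf{A}\mathbf{V}^{\mathsf{T}} \bigr\|^{2}

\text{Let } \mathcal{Z} = [\![\mathbf{A},\mathbf{B},\mathbf{C}]\!]. \text{ Then}

\frac{\partial f_{\mathcal{W}}}{\partial \mathbf{A}}
  = \bigl( \mathcal{W} \ast \mathcal{Z} - \mathcal{W} \ast \mathcal{X} \bigr)_{(1)} (\mathbf{C} \odot \mathbf{B})
  + (\mathbf{A}\mathbf{V}^{\mathsf{T}} - \mathbf{Y})\,\mathbf{V}

\frac{\partial f_{\mathcal{W}}}{\partial \mathbf{B}}
  = \bigl( \mathcal{W} \ast \mathcal{Z} - \mathcal{W} \ast \mathcal{X} \bigr)_{(2)} (\mathbf{C} \odot \mathbf{A})

\frac{\partial f_{\mathcal{W}}}{\partial \mathbf{C}}
  = \bigl( \mathcal{W} \ast \mathcal{Z} - \mathcal{W} \ast \mathcal{X} \bigr)_{(3)} (\mathbf{B} \odot \mathbf{A})

\frac{\partial f_{\mathcal{W}}}{\partial \mathbf{V}}
  = (\mathbf{A}\mathbf{V}^{\mathsf{T}} - \mathbf{Y})^{\mathsf{T}} \mathbf{A}
```

Since W is binary, W * W = W, so the missing entries drop out of both the objective and the gradient. A matrix with missing entries would be handled the same way with its own indicator matrix.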

Page 6: Data Fusion based on Coupled Matrix and Tensor ... - ku

Metabolomics: We have plasma samples measured using different analytical techniques, i.e., NMR and Fluorescence Spectroscopy.

Coupling improves missing data estimation!

Randomly set tensor entries to missing

[Figure: coupled data sets sharing the samples mode, with further modes labeled chemical shifts and emission; plot of missing data recovery error]

Missing data recovery error is lower using the coupled approach at high amounts of missing data.

[Acar et al., Chemometrics and Intelligent Lab. Systems, 2013]
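The experiment described here (hide random tensor entries, fit, then score the hidden entries) can be sketched as below. The transcript does not preserve how the recovery error is defined; the relative error over the missing entries used in `recovery_error` is one common choice and is an assumption, as are the function names and dimensions.

```python
import numpy as np

def masked_cmtf(X, W, Y, A, B, C, V):
    """Objective and gradients when the tensor is fit only where W == 1."""
    E = W * (np.einsum("ir,jr,kr->ijk", A, B, C) - X)  # residual on known entries only
    F = A @ V.T - Y
    f = 0.5 * np.sum(E**2) + 0.5 * np.sum(F**2)
    gA = np.einsum("ijk,jr,kr->ir", E, B, C) + F @ V
    gB = np.einsum("ijk,ir,kr->jr", E, A, C)
    gC = np.einsum("ijk,ir,jr->kr", E, A, B)
    gV = F.T @ A
    return f, (gA, gB, gC, gV)

def recovery_error(X_true, X_hat, W):
    """Relative error restricted to the missing entries (assumed definition)."""
    miss = 1.0 - W
    return np.linalg.norm(miss * (X_true - X_hat)) / np.linalg.norm(miss * X_true)

# Hypothetical coupled data: rank-2 tensor and matrix sharing the first mode.
rng = np.random.default_rng(0)
I, J, K, M, R = 6, 5, 4, 7, 2
A, B, C, V = (rng.standard_normal((n, R)) for n in (I, J, K, M))
X = np.einsum("ir,jr,kr->ijk", A, B, C)
Y = A @ V.T

# Randomly set roughly half of the tensor entries to missing.
W = (rng.random(X.shape) > 0.5).astype(float)

# Whatever value sits in a missing position has no effect on the fit.
A0 = rng.standard_normal(A.shape)
X_junk = np.where(W == 1, X, 1e6)
f_obs, g_obs = masked_cmtf(X, W, Y, A0, B, C, V)
f_junk, g_junk = masked_cmtf(X_junk, W, Y, A0, B, C, V)
print(np.isclose(f_obs, f_junk))

# A perfect estimate scores 0; an all-zero estimate scores 1.
print(recovery_error(X, X, W), recovery_error(X, np.zeros_like(X), W))
```

In a full experiment the factors would be fitted by minimizing `masked_cmtf` with a first-order method, and `recovery_error` would be evaluated on the reconstructed tensor at increasing amounts of missing data.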

Page 7: Data Fusion based on Coupled Matrix and Tensor ... - ku

Coupling can handle structured missing data!

[Figure: coupled data sets with modes labeled users, activities, locations, and features]

Page 8: Data Fusion based on Coupled Matrix and Tensor ... - ku

Coupling can handle structured missing data!

[Figure: coupled data sets with modes labeled users, activities, locations, and features]

We face the cold-start problem when a new user starts using an application, e.g., a location-activity recommender system. This corresponds to a missing slice for the new user.

We cannot use a low-rank approximation of the tensor to fill in the missing slice. However, we can make use of additional sources of information through the coupled model.

For the missing slice i (for i = 1, 2, ..., I):

[Figure: original values and values estimated using CMTF; average ROC curve for I = 146 users]

For coupled analysis of this data set using different tensor models and loss functions, see [Ermis, Acar and Cemgil, Data Mining and Knowledge Discovery, 2013].

Page 9: Data Fusion based on Coupled Matrix and Tensor ... - ku

When to fuse vs. Missing data estimation performance of CMTF

The missing data estimation performance of CMTF can answer the question "when to fuse" only partially: even when there are shared factors, CMTF may perform worse in terms of missing data estimation.

[Figure: coupled data sets with dimensions I, J, K, and M; plot of missing data recovery error]

[Acar et al., Chemometrics and Intelligent Lab. Systems, 2013]

Page 10: Data Fusion based on Coupled Matrix and Tensor ... - ku

[Figure: NMR, EEM, and LC-MS data sets with modes labeled samples, peaks, excitation, and chemical shifts; coupled data sets with modes labeled users, activities, locations, and features]

Data Fusion Applications of Interest:

Recommender Systems Goal: To recommend users activities that they may be interested in doing at various locations

Metabolomics Goal: To capture underlying patterns uniquely so that these patterns can be used for biomarker discovery.

We have the data mining tools for this application

This is still a challenge!

Page 11: Data Fusion based on Coupled Matrix and Tensor ... - ku


CMTF assumes that all factors are shared!

Generate factor matrices

Construct coupled data sets

Minimize f using CMTF-OPT

Example 1: All components are common:

True structure captured!

In real applications, coupled data sets often have both shared and unshared factors. However, the CMTF formulation models only shared factors and fails to identify which factors are shared and which are unshared.

Page 12: Data Fusion based on Coupled Matrix and Tensor ... - ku


CMTF fails to identify shared/unshared factors!

Generate factor matrices

Construct coupled data sets

Minimize f using CMTF-OPT

Example 2: One shared and one unshared component in each data set:

Fails to identify shared and unshared components!

In real applications, coupled data sets often have both shared and unshared factors. However, the CMTF formulation models only shared factors and fails to identify which factors are shared and which are unshared.
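The data in Example 2 can be mimicked with a short NumPy sketch. The exact construction used on the slides is not preserved, so everything here is an assumption for illustration: three unit-norm components, with per-data-set weights `lam` (tensor) and `sig` (matrix) that switch a component off in one data set.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, M, R = 10, 8, 6, 12, 3

def unit_columns(rows, cols):
    """Random factor matrix whose columns are scaled to unit norm."""
    F = rng.standard_normal((rows, cols))
    return F / np.linalg.norm(F, axis=0)

# Generate factor matrices.
A, B, C, V = (unit_columns(n, R) for n in (I, J, K, M))

# Hypothetical weights: component 0 is shared, component 1 lives only
# in the tensor, component 2 lives only in the matrix.
lam = np.array([1.0, 1.0, 0.0])
sig = np.array([1.0, 0.0, 1.0])

# Construct coupled data sets; A is the factor matrix of the shared mode.
X = np.einsum("r,ir,jr,kr->ijk", lam, A, B, C)
Y = (A * sig) @ V.T

print(X.shape, Y.shape)
```

Each data set then has rank two, with one shared and one unshared component, which is the situation a model without explicit weights cannot describe.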

Page 13: Data Fusion based on Coupled Matrix and Tensor ... - ku

Reformulating CMTF...

We reformulate coupled matrix and tensor factorizations by normalizing the columns of the factor matrices to unit norm and taking the weights out.

[Figure: each data set written as a weighted sum of R components]

Component weights are important as they can indicate what is common and what is not common!
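In symbols, the reformulation on this slide can be sketched as follows, assuming a third-order tensor X coupled with a matrix Y through the factor matrix A. The weight names λ (tensor) and σ (matrix) are notation assumed here.

```latex
\mathcal{X} \approx [\![\boldsymbol{\lambda};\mathbf{A},\mathbf{B},\mathbf{C}]\!]
  = \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r ,
\qquad
\mathbf{Y} \approx \mathbf{A}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf{T}}
  = \sum_{r=1}^{R} \sigma_r \, \mathbf{a}_r \mathbf{v}_r^{\mathsf{T}} ,
\qquad
\boldsymbol{\Sigma} = \operatorname{diag}(\boldsymbol{\sigma}),

\|\mathbf{a}_r\| = \|\mathbf{b}_r\| = \|\mathbf{c}_r\| = \|\mathbf{v}_r\| = 1,
\quad r = 1,\dots,R .
```

With unit-norm columns, the size of each component is carried entirely by its weights: a component with λ_r near zero is absent from the tensor, and one with σ_r near zero is absent from the matrix.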

Page 14: Data Fusion based on Coupled Matrix and Tensor ... - ku

ACMTF: Structure-Revealing CMTF

Goal: To jointly analyze coupled data sets and identify their shared and unshared components [Acar, Lawaetz, Rasmussen, and Bro, IEEE EMBC, 2013]

Optimization Problem:

Define the objective function: Add the unit-norm constraints as quadratic penalty terms

Smooth Approximation: replace the sparsity penalties (l1-norm) with differentiable approximations

Compute the gradient and pick a first-order optimization method: Nonlinear Conjugate Gradient from the Poblano Toolbox [Dunlavy, Kolda and Acar, 2010]
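The formulas on this slide are not preserved in the transcript. A sketch of the optimization problem and of its smooth, unconstrained counterpart is given below, assuming a third-order tensor X coupled with a matrix Y through A. The parameter names (β for the sparsity penalty, α for the quadratic penalty, ε for the smoothing constant), the constant factors in front of each term, and the exact form of the penalties are assumptions made here.

```latex
\min_{\boldsymbol{\lambda},\boldsymbol{\sigma},\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{V}} \;
  \bigl\| \mathcal{X} - [\![\boldsymbol{\lambda};\mathbf{A},\mathbf{B},\mathbf{C}]\!] \bigr\|^{2}
  + \bigl\| \mathbf{Y} - \mathbf{A}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf{T}} \bigr\|^{2}
  + \beta \|\boldsymbol{\lambda}\|_{1} + \beta \|\boldsymbol{\sigma}\|_{1}
\quad \text{s.t.} \quad
  \|\mathbf{a}_r\| = \|\mathbf{b}_r\| = \|\mathbf{c}_r\| = \|\mathbf{v}_r\| = 1,
  \; r = 1,\dots,R

% Quadratic penalties replace the constraints; a smooth surrogate replaces the l1-norm.
f = \bigl\| \mathcal{X} - [\![\boldsymbol{\lambda};\mathbf{A},\mathbf{B},\mathbf{C}]\!] \bigr\|^{2}
  + \bigl\| \mathbf{Y} - \mathbf{A}\boldsymbol{\Sigma}\mathbf{V}^{\mathsf{T}} \bigr\|^{2}
  + \beta \sum_{r=1}^{R} \sqrt{\lambda_r^{2} + \epsilon}
  + \beta \sum_{r=1}^{R} \sqrt{\sigma_r^{2} + \epsilon}
  + \alpha \sum_{r=1}^{R} \Bigl[ (\|\mathbf{a}_r\| - 1)^{2} + (\|\mathbf{b}_r\| - 1)^{2}
      + (\|\mathbf{c}_r\| - 1)^{2} + (\|\mathbf{v}_r\| - 1)^{2} \Bigr]
```

For small ε the square-root term is close to the absolute value but differentiable at zero, which is what allows a gradient-based method such as NCG to be applied.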

Page 15: Data Fusion based on Coupled Matrix and Tensor ... - ku

One shared and one unshared component in each data set:

Generate factor matrices

Construct coupled data sets

Minimize

Sparsity penalties enable us to capture the true structure!

CMTF: Fails to identify shared/unshared components!

ACMTF: True structure captured!

[Figure: component weights for components 1 to 3 under CMTF and under ACMTF; vertical axis from 0 to 3]

Page 16: Data Fusion based on Coupled Matrix and Tensor ... - ku

Joint Analysis of LC-MS and NMR measurements of chemical mixtures

Goal: To identify shared/unshared factors in each data set

Data: 29 chemical mixtures measured using DOSY-NMR and LC-MS. Mixtures are prepared using the following five chemicals:

• Val-Tyr-Val

• Trp-Gly

• Phe

• Maltoheptaose

• Propanol

[Figure: NMR data (29 mixtures, 1591 chemical shifts, 8 gradient levels) coupled with LC-MS data (29 mixtures, 168 peaks)]

Four chemicals are expected to show up in both types of measurements! The remaining one can only be captured by NMR.

[Acar, Papalexakis, Gurdeniz, Rasmussen, Lawaetz, Nilsson, Bro, Submitted, 2014]

Page 17: Data Fusion based on Coupled Matrix and Tensor ... - ku


CMTF fails to capture the true underlying structure!

Page 18: Data Fusion based on Coupled Matrix and Tensor ... - ku

ACMTF can capture the chemicals!

[Figure: NMR and LC-MS data sets written as sums of components, with component weights 0.20 0.28 0.15 0.19 0.17 0.24 0.16 0.77 0.06 0.18 0.65 0.07]

Page 19: Data Fusion based on Coupled Matrix and Tensor ... - ku

ACMTF can capture the chemicals and the design!

[Figure: NMR and LC-MS components labeled Val-Tyr-Val, Trp-Gly, Phe, Maltoheptaose, Propanol, and Noise; component weights 0.20 0.28 0.15 0.19 0.17 0.24 0.16 0.77 0.06 0.18 0.65 0.07]

Page 20: Data Fusion based on Coupled Matrix and Tensor ... - ku

ACMTF can identify shared/unshared components!

[Figure: NMR and LC-MS components labeled Val-Tyr-Val, Trp-Gly, Phe, Maltoheptaose, Propanol, and Noise; component weights 0.20 0.28 0.15 0.19 0.17 0.24 0.16 0.77 0.06 0.18 0.65 0.07]

Component weights indicate what is common and what is not!

ISSUES:

• UNIQUENESS PROPERTIES!

• PARAMETER TUNING!

Page 21: Data Fusion based on Coupled Matrix and Tensor ... - ku

Summary

• Coupled matrix and tensor factorization is effective at incorporating additional sources of information for missing data estimation.

Challenges:

• A systematic approach for weighting may improve the missing data estimation performance even further.

• We have reformulated coupled matrix and tensor factorizations in order to capture shared/unshared factors accurately and uniquely.

Challenges:

• Need better understanding of the uniqueness properties

• Parameter tuning

Page 22: Data Fusion based on Coupled Matrix and Tensor ... - ku

Thank you!

Evrim Acar www.models.life.ku.dk/~acare

Project Webpage http://www.models.life.ku.dk/joda

Data Fusion based on Coupled Matrix and Tensor Factorizations:

All-at-once Optimization Approach: E. Acar, T. G. Kolda, and D. M. Dunlavy. All-at-once Optimization for Coupled Matrix and Tensor Factorizations. KDD Workshop on Mining and Learning with Graphs, 2011 (arXiv:1105.3422) (Code available with MATLAB CMTF Toolbox v.1.0)

Shared/Unshared Factors:

• E. Acar, A. J. Lawaetz, M. A. Rasmussen, and R. Bro, Structure Revealing Data Fusion Model with Applications in Metabolomics, IEEE EMBC, 2013. (Code available with MATLAB CMTF Toolbox v.1.0)

• E. Acar, E. E. Papalexakis, G. Gurdeniz, M. A. Rasmussen, A. J. Lawaetz, M. Nilsson, and R. Bro, Structure Revealing Data Fusion, Submitted for Publication, 2014.

Missing Data Estimation: E. Acar, M. A. Rasmussen, F. Savorani, T. Næs, and R. Bro. Understanding Data Fusion within the Framework of Coupled Matrix and Tensor Factorizations, Chemometrics and Intelligent Laboratory Systems, 129: 53-63, 2013.

Link Prediction: B. Ermis, E. Acar, and A. T. Cemgil. Link Prediction in Heterogeneous Data via Generalized Coupled Tensor Factorization, Data Mining and Knowledge Discovery, 2013.