Transcript
Page 1: Estimation of low-rank tensors via convex optimization

Ryota Tomioka¹, Kohei Hayashi², Hisashi Kashima¹

¹The University of Tokyo  ²Nara Institute of Science and Technology

2011/3/23 @ TU Berlin

Page 2: Convex low-rank tensor completion

[Figure: a sensors × time × features tensor decomposed into a core (interactions) of size r1 × r2 × r3 and factor matrices (loadings) of sizes n1 × r1, n2 × r2, n3 × r3.]

Tucker decomposition:
$$X_{ijk} = \sum_{a=1}^{r_1} \sum_{b=1}^{r_2} \sum_{c=1}^{r_3} C_{abc}\, U^{(1)}_{ia} U^{(2)}_{jb} U^{(3)}_{kc}$$
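As a concrete illustration (ours, not from the slides), the Tucker reconstruction above is a single einsum in numpy; the sizes below match the numerical experiment later in the talk (50×50×20, rank 7×8×9):

```python
import numpy as np

# Sketch of X_ijk = sum_{a,b,c} C_abc U1_ia U2_jb U3_kc.
n1, n2, n3 = 50, 50, 20      # tensor size
r1, r2, r3 = 7, 8, 9         # Tucker rank
rng = np.random.default_rng(0)
C = rng.standard_normal((r1, r2, r3))   # core (interactions)
U1 = rng.standard_normal((n1, r1))      # factors (loadings)
U2 = rng.standard_normal((n2, r2))
U3 = rng.standard_normal((n3, r3))

X = np.einsum('abc,ia,jb,kc->ijk', C, U1, U2, U3)  # X = C x_1 U1 x_2 U2 x_3 U3
```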

Page 3: Conventional formulation (nonconvex)

$$\underset{C,\, U_1, U_2, U_3}{\text{minimize}}\ \|\Omega \circ (Y - C \times_1 U_1 \times_2 U_2 \times_3 U_3)\|_F^2 + \text{regularization}$$
(×_k: mode-k product; Y: observation), or equivalently
$$\underset{X}{\text{minimize}}\ \|\Omega \circ (Y - X)\|_F^2 \quad \text{s.t.}\ \operatorname{rank}(X) \le (r_1, r_2, r_3).$$

• Alternate minimization
• Have to fix the rank beforehand

Page 4: Our approach

Matrix: estimation of a low-rank matrix (hard) → trace norm minimization (tractable) [Fazel, Hindi, Boyd 01].
Tensor: estimation of a low-rank tensor (hard; rank defined in the sense of the Tucker decomposition) → extended trace norm minimization (tractable), a generalization of the matrix case.

Page 5: Trace norm (nuclear norm) regularization

For $X \in \mathbb{R}^{R \times C}$ and $m = \min(R, C)$, the trace norm is the sum of the singular values:
$$\|X\|_* = \sum_{j=1}^{m} \sigma_j(X)$$

• Roughly speaking, L1 regularization on the singular values.
• Stronger regularization → more zero singular values → low rank.
• Not obvious for tensors (no singular values for tensors).
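As a quick numerical illustration of the definition (a sketch, not part of the slides), numpy's SVD gives the singular values directly:

```python
import numpy as np

# Trace (nuclear) norm = sum of singular values.
X = np.random.default_rng(1).standard_normal((30, 20))
trace_norm = np.linalg.svd(X, compute_uv=False).sum()   # ||X||_*
```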

Page 6: Spectral soft-threshold operation

When everything is observed and X is a matrix, there is an analytic solution:
$$\mathrm{softth}_\lambda(X) = \operatorname*{argmin}_{Z \in \mathbb{R}^{R \times C}} \left( \frac{1}{2}\|Z - X\|_F^2 + \lambda \|Z\|_* \right) = U \max(S - \lambda, 0)\, V^\top,$$
where $X = U S V^\top$.

[Figure: singular values vs. SV index; original spectrum and thresholded spectrum.]
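The analytic solution is a few lines of numpy. A minimal sketch (the name softth follows the slides; the implementation is ours):

```python
import numpy as np

def softth(X, lam):
    """Spectral soft-threshold: argmin_Z 0.5*||Z - X||_F^2 + lam*||Z||_*,
    solved analytically as U max(S - lam, 0) V^T where X = U S V^T."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(S - lam, 0.0)) @ Vt   # scale columns of U by thresholded S
```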

Page 7: Mode-k unfolding (matricization)

[Figure: an I1 × I2 × I3 tensor rearranged into its mode-1 unfolding X_(1) of size I1 × I2·I3 and its mode-2 unfolding X_(2) of size I2 × I3·I1.]
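A minimal numpy sketch of unfolding and its inverse (our convention for ordering the trailing modes; the Kronecker identities on the next page assume a particular ordering, so treat this as illustrative):

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding X_(k): bring axis k to the front, flatten the rest."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

def fold(M, k, shape):
    """Inverse of unfold: reshape, then move the first axis back to position k."""
    rest = [s for i, s in enumerate(shape) if i != k]
    return np.moveaxis(M.reshape([shape[k]] + rest), 0, k)
```

With this pairing, fold(unfold(X, k), k, X.shape) recovers X for any mode k.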

Page 8: Low-rank tensor is a low-rank matrix

For $X = C \times_1 U_1 \times_2 U_2 \times_3 U_3$ (core $C$ of size $r_1 \times r_2 \times r_3$, factors $U_k$ of size $n_k \times r_k$), the unfoldings satisfy
$$X_{(1)} = U_1 C_{(1)} (U_3 \otimes U_2)^\top \quad (\operatorname{rank} \le r_1),$$
$$X_{(2)} = U_2 C_{(2)} (U_1 \otimes U_3)^\top \quad (\operatorname{rank} \le r_2),$$
$$X_{(3)} = U_3 C_{(3)} (U_2 \otimes U_1)^\top \quad (\operatorname{rank} \le r_3).$$
The rank of X_(k) is no more than the rank of C_(k).
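The rank bounds are easy to verify numerically; a self-contained sketch:

```python
import numpy as np

# Each unfolding of a random rank-(7, 8, 9) Tucker tensor has matrix rank
# at most (7, 8, 9); with random factors, equality holds almost surely.
rng = np.random.default_rng(0)
C = rng.standard_normal((7, 8, 9))
U = [rng.standard_normal((n, r)) for n, r in [(50, 7), (50, 8), (20, 9)]]
X = np.einsum('abc,ia,jb,kc->ijk', C, *U)
for k in range(3):
    Xk = np.moveaxis(X, k, 0).reshape(X.shape[k], -1)   # mode-k unfolding
    print(k, np.linalg.matrix_rank(Xk))                 # prints 7, 8, 9
```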

Page 9: Low-rank matrix is a low-rank tensor

• Given a low-rank matrix $X = U S V^\top$,
• define $C = S V^\top$, $U_1 = U$, $U_2 = I_{n_2}$, $U_3 = I_{n_3}$.

Then $X = C \times_1 U_1 \times_2 U_2 \times_3 U_3$ is low-rank (at least for mode 1).

Page 10: What it means

• We can use the trace norm of an unfolding of a tensor X to learn a low-rank X.

Tensor X is low-rank (∃k: r_k < I_k) ⇔ the unfolding X_(k) is a low-rank matrix (matricization in one direction, tensorization in the other).

Page 11: Approach 1: As a matrix

• Pick a mode k, and hope that the tensor to be learned is low rank in mode k.
$$\underset{X \in \mathbb{R}^{I_1 \times \cdots \times I_K}}{\text{minimize}}\ \frac{1}{2\lambda} \|\Omega \circ (Y - X)\|_F^2 + \|X_{(k)}\|_*$$

Pro: basically a matrix problem → theoretical guarantee (Candès & Recht 09).
Con: have to be lucky to pick the right mode.

Page 12: Approach 2: Constrained optimization

• Constrain so that every unfolding of X is simultaneously low rank.
$$\underset{X \in \mathbb{R}^{I_1 \times \cdots \times I_K}}{\text{minimize}}\ \frac{1}{2\lambda} \|\Omega \circ (Y - X)\|_F^2 + \sum_{k=1}^{K} \gamma_k \|X_{(k)}\|_*$$
(γ_k: tuning parameter, usually set to 1)

Pro: jointly regularizes every mode.
Con: strong constraint.
(See also Signoretto et al., 10; Gandy et al., 11.)

Page 13: Approach 3: Mixture of low-rank tensors

• Each mixture component Z_k is regularized to be low-rank only in mode k.
$$\underset{Z_1, \ldots, Z_K}{\text{minimize}}\ \frac{1}{2\lambda} \Bigl\| \Omega \circ \Bigl( Y - \sum_{k=1}^{K} Z_k \Bigr) \Bigr\|_F^2 + \sum_{k=1}^{K} \gamma_k \|Z_{k(k)}\|_*$$

Pro: each Z_k takes care of one mode.
Con: the sum is not low-rank.

Page 15: Optimization via Alternating Direction Method of Multipliers (ADMM) (Gabay & Mercier 76)

• Useful when we have a linear operation (here, the permutation $X \mapsto X_{(k)}$) inside the sparsity penalty:
$$\underset{X \in \mathbb{R}^{n_1 \times \cdots \times n_K}}{\text{minimize}}\ \frac{1}{2\lambda} \|\Omega(X) - y\|^2 + \sum_{k=1}^{K} \gamma_k \|X_{(k)}\|_*$$

Compare total-variation image reconstruction ($D_j$: 2D derivative at the j-th pixel):
$$\underset{x \in \mathbb{R}^n}{\text{minimize}}\ \frac{1}{2\lambda} \|\Omega(x) - y\|^2 + \sum_{j=1}^{n} \|D_j x\|$$

• Split Bregman Iteration (Goldstein & Osher) is also an ADMM.

Page 21: ADMM preliminaries

• Problem ($A$: linear operation):
$$\underset{x}{\text{minimize}}\ f(x) + g(Ax)$$

• Step 1: Split & augment:
$$\underset{x, z}{\text{minimize}}\ f(x) + g(z) + \frac{\eta}{2}\|Ax - z\|^2 \quad \text{subject to}\ z = Ax$$

• Step 2: Augmented Lagrangian function:
$$L_\eta(x, z, \alpha) = \underbrace{f(x) + g(z) + \alpha^\top (Ax - z)}_{\text{ordinary Lagrangian}} + \underbrace{\frac{\eta}{2}\|Ax - z\|^2}_{\text{augmented}}$$

Page 23: ADMM algorithm (Gabay & Mercier 76)

• Minimize the AL function w.r.t. x:
$$x^{t+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} L_\eta(x, z^t, \alpha^t)$$

• Minimize the AL function w.r.t. z:
$$z^{t+1} = \operatorname*{argmin}_{z \in \mathbb{R}^m} L_\eta(x^{t+1}, z, \alpha^t)$$

• Update the multiplier vector:
$$\alpha^{t+1} = \alpha^t + \eta (A x^{t+1} - z^{t+1})$$

Every limit point of ADMM is a minimizer of the original problem. [Eckstein & Bertsekas 92]
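To make the three updates concrete, here is a toy instance of our own (not from the slides): ADMM for the 1D analogue of the total-variation example on Page 15, with f(x) = ½‖x − y‖², g = λ‖·‖₁, and A = D the first-difference matrix.

```python
import numpy as np

# Toy ADMM: minimize_x 0.5*||x - y||^2 + lam * sum_j |(Dx)_j|  (1D total variation)
# f(x) = 0.5*||x - y||^2, g = lam*||.||_1, A = D.
def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0, -0.5], 50) + 0.1 * rng.standard_normal(150)
n = y.size
D = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)   # (Dx)_j = x_{j+1} - x_j
lam, eta = 0.5, 1.0
x, z, alpha = y.copy(), D @ y, np.zeros(n - 1)
M = np.eye(n) + eta * D.T @ D                  # x-step: (I + eta*D'D) x = y + D'(eta*z - alpha)
for t in range(200):
    x = np.linalg.solve(M, y + D.T @ (eta * z - alpha))  # argmin_x L_eta(x, z, alpha)
    z = soft(D @ x + alpha / eta, lam / eta)             # argmin_z L_eta (prox of g)
    alpha = alpha + eta * (D @ x - z)                    # multiplier update
```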

Page 24: For approach "Constraint"

• Move the permutation out of the regularizer:
$$\underset{X,\, Z_1, \ldots, Z_K}{\text{minimize}}\ \frac{1}{2\lambda}\|\Omega(X) - y\|^2 + \sum_{k=1}^{K} \gamma_k \|Z_k\|_*,$$
$$\text{subject to}\quad X_{(k)} = Z_k \quad (k = 1, \ldots, K).$$

• Augmented Lagrangian:
$$L_\eta(X, \{Z_k\}_{k=1}^{K}, \{A_k\}_{k=1}^{K}) = \frac{1}{2\lambda}\|\Omega(X) - y\|^2 + \sum_{k=1}^{K} \gamma_k \|Z_k\|_* + \eta \sum_{k=1}^{K} \Bigl( \langle A_k, X_{(k)} - Z_k \rangle + \frac{1}{2}\|X_{(k)} - Z_k\|_F^2 \Bigr).$$

Page 25: ADMM for "Constraint"

• Minimize the AL function w.r.t. X (in the limit λ → 0):
$$\Omega(X^{t+1}) = y \quad \text{(observed elem.)},$$
$$\bar{\Omega}(X^{t+1}) = \bar{\Omega}\Bigl( \frac{1}{K} \sum_{k=1}^{K} \operatorname{tensor}_k(Z_k^t - A_k^t) \Bigr) \quad \text{(unobserved elem.)}$$

• Minimize the AL function w.r.t. Z:
$$Z_k^{t+1} = \operatorname{softth}_{\gamma_k / \eta}\bigl( X^{t+1}_{(k)} + A_k^t \bigr) \quad (k = 1, \ldots, K)$$

• Update multipliers:
$$A_k^{t+1} = A_k^t + \bigl( X^{t+1}_{(k)} - Z_k^{t+1} \bigr) \quad (k = 1, \ldots, K)$$

Page 26: Numerical experiment

• True tensor: size 50×50×20, rank 7×8×9. No noise (λ=0).
• Random train/test split.

[Figure: generalization error (log scale, 10⁻⁴ to 10²) vs. fraction of observed elements, for As a Matrix (modes 1, 2, 3), Constraint, Mixture, Tucker (large), Tucker (exact), and the optimization tolerance. Tucker = EM algorithm (nonconvex).]

Page 27: Computation time

• The convex formulation is also fast.

[Figure: computation time (s) vs. fraction of observed elements, for As a Matrix, Constraint, Mixture, Tucker (large), Tucker (exact).]

Page 28: Phase transition behaviour

[Figure: fraction of entries required for perfect reconstruction vs. sum of true ranks; points labeled by true rank triples [10 12 13], [10 12 9], [16 6 8], [17 13 15], [20 20 5], [20 3 2], [30 12 10], [32 3 4], [40 20 6], [40 3 2], [40 9 7], [4 42 3], [5 4 3], [7 8 15], [7 8 9], [8 5 6].]

• Sum of true ranks = min(r₁, r₂r₃) + min(r₂, r₃r₁) + min(r₃, r₁r₂).

Page 29: Phase transition (vs. Schatten-1 norm)

[Figure: fraction required for reconstruction vs. Schatten-1 norm, same rank triples as on Page 28; fitted curve y = 0.0463 x^0.4958.]

Page 30: "Mixture" is sometimes better

• True tensor: size 50×50×20, rank 50×50×5. No noise (λ=0).

[Figure: generalization error vs. fraction of observed elements, same methods as on Page 26.]

Page 31: Amino acid fluorescence data [Bro & Andersson]

• Size 201×61×5.
• Five solutions with different amounts of three amino acids (tyrosine, tryptophan, phenylalanine).
• Rank-3 PARAFAC is correct.
• Interested in:
  - generalization performance
  - number of components
  - interpretation

Page 32: Amino acid: Generalization performance

• "Constraint" performs comparably to PARAFAC with the correct rank.

[Figure: generalization error vs. fraction of observed elements, for As a Matrix (modes 1, 2, 3), Constraint, Mixture, Tucker [4 4 4], Tucker [3 3 3], PARAFAC(3), PARAFAC(4).]

Page 33: Amino acid: Singular-value spectra

[Figure: true vs. estimated singular-value spectra of the mode-1, mode-2, and mode-3 unfoldings.]

The spectra estimated from half of the entries are almost identical to the truth.

Page 34: Improving interpretability

• Apply PARAFAC to the core (4×4×5) obtained by the proposed "Constraint" approach.
• This separates the imputation problem from the interpretation problem.

$$X = C \times_1 U_1 \times_2 U_2 \times_3 U_3 = \llbracket A^{(1)}, A^{(2)}, A^{(3)} \rrbracket \times_1 U_1 \times_2 U_2 \times_3 U_3 = \llbracket U_1 A^{(1)}, U_2 A^{(2)}, U_3 A^{(3)} \rrbracket,$$
where $\llbracket A^{(1)}, A^{(2)}, A^{(3)} \rrbracket$ denotes the PARAFAC (sum of rank-one terms) of the core $C$.
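The last identity is easy to sanity-check numerically. In this sketch A1, A2, A3 stand in for the factors a PARAFAC of the small core would return (random placeholders here); absorbing the Tucker factors as U_k A^(k) yields a PARAFAC of X itself:

```python
import numpy as np

def kruskal(A1, A2, A3):
    # [[A1, A2, A3]]_ijk = sum_r A1[i,r] * A2[j,r] * A3[k,r]
    return np.einsum('ir,jr,kr->ijk', A1, A2, A3)

rng = np.random.default_rng(0)
R = 3                                                   # PARAFAC rank
A1, A2, A3 = (rng.standard_normal((r, R)) for r in (4, 4, 5))           # core factors
U1, U2, U3 = (rng.standard_normal((n, r)) for n, r in [(201, 4), (61, 4), (5, 5)])

C = kruskal(A1, A2, A3)                                 # core with PARAFAC structure
X_tucker = np.einsum('abc,ia,jb,kc->ijk', C, U1, U2, U3)
X_parafac = kruskal(U1 @ A1, U2 @ A2, U3 @ A3)
print(np.allclose(X_tucker, X_parafac))                 # True
```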

Page 35: Obtained factors

[Figure: emission loadings, excitation loadings, and sample loadings of the factors obtained by Proposed(4), PARAFAC(4), and PARAFAC(3).]

Page 36: Summary

• Low-rank tensor completion can be computed as a convex optimization problem using trace norm regularization.
  - No need to specify the rank beforehand.
• The convex formulation is more accurate and faster than the conventional EM-based Tucker decomposition.
• A curious "phase transition" was found → a compressive-sensing-type analysis is ongoing work.
• The combination of the proposed approach and PARAFAC is useful.
• Code:
  - http://www.ibis.t.u-tokyo.ac.jp/RyotaTomioka/Softwares/Tensor

Page 37: Acknowledgment

• This work was supported in part by MEXT KAKENHI 22700138, 80545583, JST PRESTO, and NTT Communication Science Laboratories.

Page 39: ADMM convergence

• Step 1: ADMM is equivalent to Douglas-Rachford splitting in the dual:
$$\alpha^{t+1} = \operatorname{prox}_{g^*}\bigl( \operatorname{prox}_{f^*(-A^\top \cdot)}(\alpha^t - z^t) + z^t \bigr),$$
$$z^{t+1} = \operatorname{prox}_{g}\bigl( \operatorname{prox}_{f^*(-A^\top \cdot)}(\alpha^t - z^t) + z^t \bigr).$$