Estimation of low-rank tensors via convex optimization
Ryota Tomioka (1), Kohei Hayashi (2), Hisashi Kashima (1)
(1) The University of Tokyo, (2) Nara Institute of Science and Technology
2011/3/23 @ TU Berlin
Convex low-rank tensor completion
[Figure: a Sensors × Time × Features data tensor of size n1 × n2 × n3 and its Tucker decomposition into a core of interactions (size r1 × r2 × r3) and factor matrices of loadings (sizes n1 × r1, n2 × r2, n3 × r3).]
Tucker decomposition:

$X_{ijk} = \sum_{a=1}^{r_1} \sum_{b=1}^{r_2} \sum_{c=1}^{r_3} C_{abc}\, U^{(1)}_{ia} U^{(2)}_{jb} U^{(3)}_{kc}$
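A minimal numpy sketch of this reconstruction (the helper name `mode_product` and all variable names are mine, not from the slides): the mode-k product multiplies a factor matrix into one axis of the tensor, and chaining the three mode products reproduces the triple sum above.

```python
import numpy as np

def mode_product(T, M, mode):
    """Mode-`mode` product: multiply matrix M into axis `mode` of tensor T."""
    T = np.moveaxis(T, mode, 0)              # bring the target axis to the front
    out = np.tensordot(M, T, axes=(1, 0))    # contract M's columns with that axis
    return np.moveaxis(out, 0, mode)         # restore the original axis order

# Random core and factors: a rank-(r1, r2, r3) Tucker tensor of size n1 x n2 x n3
r1, r2, r3 = 2, 3, 4
n1, n2, n3 = 5, 6, 7
core = np.random.randn(r1, r2, r3)
U1, U2, U3 = (np.random.randn(n, r) for n, r in [(n1, r1), (n2, r2), (n3, r3)])

X = mode_product(mode_product(mode_product(core, U1, 0), U2, 1), U3, 2)

# The same tensor written as the triple sum X_ijk = sum_abc C_abc U1_ia U2_jb U3_kc
X_sum = np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)
assert np.allclose(X, X_sum)
```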
Conventional formulation (nonconvex)

$\min_{\mathcal{C}, U_1, U_2, U_3} \;\|\Omega \circ (\mathcal{Y} - \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3)\|_F^2 + \text{regularization}$

($\times_k$: mode-k product; $\Omega$: observation mask)

$\min_{\mathcal{X}} \;\|\Omega \circ (\mathcal{Y} - \mathcal{X})\|_F^2 \quad \text{s.t. } \operatorname{rank}(\mathcal{X}) \le (r_1, r_2, r_3)$

• Alternating minimization
• Have to fix the rank beforehand
Our approach
• Matrix: estimation of a low-rank matrix (hard) → trace norm minimization (tractable) [Fazel, Hindi, Boyd 01]
• Tensor: estimation of a low-rank tensor (hard; rank defined in the sense of the Tucker decomposition) → extended trace norm minimization (tractable), obtained as a generalization of the matrix case
Trace norm (nuclear norm) regularization
For $X \in \mathbb{R}^{R \times C}$, $m = \min(R, C)$:

$\|X\|_* = \sum_{j=1}^{m} \sigma_j(X)$ (the linear sum of singular values)

• Roughly speaking, L1 regularization on the singular values.
• Stronger regularization → more zero singular values → low rank.
• Not obvious for tensors (no singular values for tensors).
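As a quick illustration (my own snippet, not from the talk), the trace norm is computable directly from the singular values:

```python
import numpy as np

def trace_norm(X):
    """Trace (nuclear) norm: the sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

# A rank-2 matrix has exactly two nonzero singular values
A = np.random.randn(8, 2) @ np.random.randn(2, 6)
print(trace_norm(A), np.linalg.matrix_rank(A))  # rank prints 2
```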
Spectral soft-threshold operation

When everything is observed and the unknown is a matrix, there is an analytic solution:

$\operatorname{softth}_\lambda(X) = \operatorname*{argmin}_{Z \in \mathbb{R}^{R \times C}} \left( \tfrac{1}{2}\|Z - X\|_F^2 + \lambda \|Z\|_* \right) = U \max(S - \lambda, 0)\, V^\top$, where $X = U S V^\top$.

[Figure: original vs. thresholded singular-value spectrum, plotted against the singular-value index.]
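A direct numpy transcription of this operator (a sketch; the name `softth` follows the slide, the rest is mine):

```python
import numpy as np

def softth(X, lam):
    """Spectral soft-threshold: shrink singular values of X by lam, floor at 0."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

# Thresholding kills small singular values, so the result has lower rank
X = np.random.randn(10, 8)
Z = softth(X, lam=1.0)
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(Z))
```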
Mode-k unfolding (matricization)
[Figure: a tensor of size $I_1 \times I_2 \times I_3$ rearranged into the mode-1 unfolding $X_{(1)}$ (an $I_1 \times I_2 I_3$ matrix) and the mode-2 unfolding $X_{(2)}$ (an $I_2 \times I_3 I_1$ matrix) by slicing along the respective mode.]
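A compact numpy version of unfolding and its inverse (my own helpers; note that numpy's C-order reshape arranges the remaining modes in their original order, which differs from the $(U_3 \otimes U_2)$ ordering on the next slide but has no effect on the rank):

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` unfolding: an (n_mode x prod(other dims)) matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold: rebuild a tensor of the given shape."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

T = np.random.randn(4, 5, 6)
for k in range(3):
    assert np.allclose(fold(unfold(T, k), k, T.shape), T)
```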
Low-rank tensor is a low-rank matrix
$\mathcal{X} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3$, with core $\mathcal{C} \in \mathbb{R}^{r_1 \times r_2 \times r_3}$ and factors $U_k \in \mathbb{R}^{n_k \times r_k}$. The unfoldings factor as:

$X_{(1)} = U_1 C_{(1)} (U_3 \otimes U_2)^\top$ (rank ≤ r1)
$X_{(2)} = U_2 C_{(2)} (U_1 \otimes U_3)^\top$ (rank ≤ r2)
$X_{(3)} = U_3 C_{(3)} (U_2 \otimes U_1)^\top$ (rank ≤ r3)

The rank of $X_{(k)}$ is no more than the rank of $C_{(k)}$.
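This rank bound is easy to verify numerically (a sketch reusing the hypothetical `mode_product` and `unfold` helpers from earlier):

```python
import numpy as np

# Build a random rank-(2, 3, 4) Tucker tensor of size 10 x 11 x 12
r, n = (2, 3, 4), (10, 11, 12)
core = np.random.randn(*r)
X = core
for k in range(3):
    X = mode_product(X, np.random.randn(n[k], r[k]), k)

# Each mode-k unfolding has matrix rank at most r_k
print([np.linalg.matrix_rank(unfold(X, k)) for k in range(3)])  # almost surely [2, 3, 4]
```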
Low-rank matrix is a low-rank tensor
• Given a low-rank matrix $X = U S V^\top$, viewed as the mode-1 unfolding of a tensor,
• define the core $\mathcal{C}$ by folding $S V^\top$ along mode 1, and set $U_1 = U$, $U_2 = I_{n_2}$, $U_3 = I_{n_3}$.
• Then $\mathcal{X} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3$ is low-rank (at least for mode 1).
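A numerical sanity check of this construction (a sketch; it reuses the hypothetical `fold`, `unfold`, and `mode_product` helpers defined earlier):

```python
import numpy as np

n1, n2, n3, rank = 6, 4, 5, 2
# A rank-2 matrix of size n1 x (n2*n3), viewed as the mode-1 unfolding of a tensor
M = np.random.randn(n1, rank) @ np.random.randn(rank, n2 * n3)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
C = fold(np.diag(s) @ Vt, 0, (n1, n2, n3))   # core: tensorize S V^T along mode 1

X = mode_product(C, U, 0)                    # U2 = I, U3 = I change nothing
assert np.allclose(unfold(X, 0), M)
print(np.linalg.matrix_rank(unfold(X, 0)))   # 2: low-rank in mode 1
```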
What it means
• We can use the trace norm of an unfolding of a tensor $\mathcal{X}$ to learn a low-rank $\mathcal{X}$:

Tensor $\mathcal{X}$ is low-rank ($\exists k,\ r_k < I_k$) ⇔ the unfolding $X_{(k)}$ is a low-rank matrix
(matricization in one direction, tensorization in the other)
Approach 1: As a matrix
• Pick a mode k, and hope that the tensor to be learned is low-rank in mode k.

$\min_{\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_K}} \;\frac{1}{2\lambda} \|\Omega \circ (\mathcal{Y} - \mathcal{X})\|_F^2 + \|X_{(k)}\|_*$

Pro: basically a matrix problem → theoretical guarantee (Candès & Recht 09).
Con: have to be lucky to pick the right mode.
Approach 2: Constrained optimization
• Constrain so that each unfolding of $\mathcal{X}$ is simultaneously low-rank.

$\min_{\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_K}} \;\frac{1}{2\lambda} \|\Omega \circ (\mathcal{Y} - \mathcal{X})\|_F^2 + \sum_{k=1}^{K} \gamma_k \|X_{(k)}\|_*$

($\gamma_k$: tuning parameter, usually set to 1)

Pro: jointly regularizes every mode.
Con: strong constraint.
(See also Signoretto et al., 2010; Gandy et al., 2011.)
Approach 3: Mixture of low-rank tensors
• Each mixture component $\mathcal{Z}_k$ is regularized to be low-rank only in mode k.

$\min_{\mathcal{Z}_1, \ldots, \mathcal{Z}_K} \;\frac{1}{2\lambda} \Big\|\Omega \circ \Big(\mathcal{Y} - \sum_{k=1}^{K} \mathcal{Z}_k\Big)\Big\|_F^2 + \sum_{k=1}^{K} \gamma_k \|Z_{k(k)}\|_*$

Pro: each $\mathcal{Z}_k$ takes care of one mode.
Con: the sum is not low-rank.
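To make the three formulations concrete, here is a minimal numpy sketch of their objective values (function names are mine; `mask` is a 0/1 observation tensor, and `trace_norm` and `unfold` are the hypothetical helpers from earlier):

```python
import numpy as np

def obj_as_matrix(X, Y, mask, k, lam):
    """Approach 1: trace norm on a single chosen unfolding."""
    loss = 0.5 / lam * np.sum((mask * (Y - X)) ** 2)
    return loss + trace_norm(unfold(X, k))

def obj_constraint(X, Y, mask, lam, gammas):
    """Approach 2: trace norms of all unfoldings of one tensor."""
    loss = 0.5 / lam * np.sum((mask * (Y - X)) ** 2)
    return loss + sum(g * trace_norm(unfold(X, k)) for k, g in enumerate(gammas))

def obj_mixture(Zs, Y, mask, lam, gammas):
    """Approach 3: each component Z_k is penalized only in mode k."""
    loss = 0.5 / lam * np.sum((mask * (Y - sum(Zs))) ** 2)
    return loss + sum(g * trace_norm(unfold(Z, k))
                      for k, (Z, g) in enumerate(zip(Zs, gammas)))
```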
Optimization via the Alternating Direction Method of Multipliers (ADMM) (Gabay & Mercier 76)

• Useful when we have a linear operation inside the sparsity penalty.

Total-variation image reconstruction ($D_j$: 2D derivative at the j-th pixel):

$\min_{x \in \mathbb{R}^n} \;\frac{1}{2\lambda} \|\Phi(x) - y\|^2 + \sum_{j=1}^{n} \|D_j x\|$

Tensor completion (the unfolding $X_{(k)}$ is a permutation of the entries of $\mathcal{X}$):

$\min_{\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_K}} \;\frac{1}{2\lambda} \|\Omega(\mathcal{X}) - y\|_F^2 + \sum_{k=1}^{K} \gamma_k \|X_{(k)}\|_*$

• Split Bregman iteration (Goldstein & Osher) is also an ADMM.
ADMM preliminaries

• Problem (A: a linear operation):

$\min_{x} \;f(x) + g(Ax)$

• Step 1: Split & augment:

$\min_{x, z} \;f(x) + g(z) + \frac{\eta}{2} \|Ax - z\|^2 \quad \text{subject to } z = Ax$

• Step 2: Augmented Lagrangian function:

$L_\eta(x, z, \alpha) = \underbrace{f(x) + g(z) + \alpha^\top (Ax - z)}_{\text{ordinary Lagrangian}} + \underbrace{\frac{\eta}{2} \|Ax - z\|^2}_{\text{augmented}}$
ADMM algorithm (Gabay & Mercier 76)

• Minimize the AL function w.r.t. x:  $x^{t+1} = \operatorname*{argmin}_{x \in \mathbb{R}^n} L_\eta(x, z^t, \alpha^t)$
• Minimize the AL function w.r.t. z:  $z^{t+1} = \operatorname*{argmin}_{z \in \mathbb{R}^m} L_\eta(x^{t+1}, z, \alpha^t)$
• Update the multiplier vector:  $\alpha^{t+1} = \alpha^t + \eta (A x^{t+1} - z^{t+1})$

Every limit point of ADMM is a minimizer of the original problem. [Eckstein & Bertsekas 92]
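To make the three steps concrete, here is a minimal numpy sketch of ADMM on a one-dimensional analogue of the total-variation example above (all names are mine; f(x) = ½‖x − y‖², g = λ‖·‖₁, and A is the first-difference matrix D):

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-threshold."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_tv1d(y, lam, eta=1.0, n_iter=200):
    """ADMM for min_x 0.5*||x - y||^2 + lam*||D x||_1 (D: first differences)."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)           # (n-1) x n difference matrix
    x, z, alpha = y.copy(), D @ y, np.zeros(n - 1)
    H = np.eye(n) + eta * D.T @ D            # normal equations for the x-step
    for _ in range(n_iter):
        x = np.linalg.solve(H, y - D.T @ alpha + eta * D.T @ z)  # x-step
        z = soft(D @ x + alpha / eta, lam / eta)                 # z-step
        alpha = alpha + eta * (D @ x - z)                        # multiplier step
    return x

# A noisy piecewise-constant signal comes back nearly piecewise constant
y = np.concatenate([np.zeros(50), np.ones(50)]) + 0.1 * np.random.randn(100)
x = admm_tv1d(y, lam=1.0)
```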
For approach “Constraint”

• Move the permutation out of the regularizer:

$\min_{\mathcal{X}, Z_1, \ldots, Z_K} \;\frac{1}{2\lambda} \|\Omega(\mathcal{X}) - y\|^2 + \sum_{k=1}^{K} \gamma_k \|Z_k\|_*$
$\text{subject to } X_{(k)} = Z_k \quad (k = 1, \ldots, K)$

• Augmented Lagrangian:

$L_\eta(\mathcal{X}, \{Z_k\}_{k=1}^K, \{A_k\}_{k=1}^K) = \frac{1}{2\lambda} \|\Omega(\mathcal{X}) - y\|^2 + \sum_{k=1}^{K} \gamma_k \|Z_k\|_* + \eta \sum_{k=1}^{K} \Big( \langle A_k, X_{(k)} - Z_k \rangle + \frac{1}{2} \|X_{(k)} - Z_k\|_F^2 \Big)$
ADMM for “Constraint”

• Minimize the AL function w.r.t. $\mathcal{X}$ (in the limit $\lambda \to 0$):

$\Omega(\mathcal{X}^{t+1}) = y$ (observed elements)
$\bar{\Omega}(\mathcal{X}^{t+1}) = \bar{\Omega}\Big( \frac{1}{K} \sum_{k=1}^{K} \operatorname{tensor}_k(Z_k^t - A_k^t) \Big)$ (unobserved elements)

• Minimize the AL function w.r.t. Z:

$Z_k^{t+1} = \operatorname{softth}_{\gamma_k / \eta}\big( X^{t+1}_{(k)} + A_k^t \big) \quad (k = 1, \ldots, K)$

• Update the multipliers:

$A_k^{t+1} = A_k^t + \big( X^{t+1}_{(k)} - Z_k^{t+1} \big) \quad (k = 1, \ldots, K)$
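The updates above transcribe almost line-for-line into numpy (a sketch of the λ → 0 noiseless case shown on the slide, reusing the hypothetical `unfold`, `fold`, `softth`, and `mode_product` helpers from earlier; this is not the authors' released code):

```python
import numpy as np

def tensor_admm_constraint(Y, mask, gammas, eta=1.0, n_iter=500):
    """ADMM for the 'Constraint' approach, exact interpolation (lambda -> 0)."""
    K = Y.ndim
    X = np.where(mask, Y, 0.0)
    Z = [unfold(X, k) for k in range(K)]
    A = [np.zeros_like(Zk) for Zk in Z]
    for _ in range(n_iter):
        # X-step: keep observed entries, average the folded (Z_k - A_k) elsewhere
        avg = sum(fold(Z[k] - A[k], k, Y.shape) for k in range(K)) / K
        X = np.where(mask, Y, avg)
        # Z-step: spectral soft-threshold of each unfolding plus its multiplier
        Z = [softth(unfold(X, k) + A[k], gammas[k] / eta) for k in range(K)]
        # Multiplier update
        A = [A[k] + (unfold(X, k) - Z[k]) for k in range(K)]
    return X

# Example: recover a random rank-(2, 2, 2) tensor from 70% of its entries
r, shape = 2, (15, 15, 15)
T = np.random.randn(r, r, r)
for k in range(3):
    T = mode_product(T, np.random.randn(shape[k], r), k)
mask = np.random.rand(*shape) < 0.7
X_hat = tensor_admm_constraint(T, mask, gammas=[1.0, 1.0, 1.0])
print(np.linalg.norm(X_hat - T) / np.linalg.norm(T))  # small if recovery succeeds
```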
• True tensor: size 50×50×20, rank 7×8×9. No noise (λ = 0).
• Random train/test split.

[Figure: generalization error (log scale, 10^-4 to 10^2) vs. fraction of observed elements, comparing As a Matrix (modes 1, 2, 3), Constraint, Mixture, Tucker (large), Tucker (exact), and the optimization tolerance. Tucker = EM algorithm (nonconvex).]
Computation time
• The convex formulation is also fast.

[Figure: computation time (s) vs. fraction of observed elements for As a Matrix, Constraint, Mixture, Tucker (large), and Tucker (exact).]
Phase transition behaviour
[Figure: fraction of observations required for perfect reconstruction vs. the sum of true ranks; each point is labeled with its true rank triple, e.g. [7 8 9], [20 20 5], [40 3 2].]

• Sum of true ranks = min(r1, r2 r3) + min(r2, r3 r1) + min(r3, r1 r2)
Phase transition (vs. Schatten-1 norm)

[Figure: fraction of observations required for reconstruction vs. the Schatten-1 norm of the true tensor, with the same rank-triple labels as above; fitted curve y = 0.0463 x^0.4958.]
“Mixture” is sometimes better
• True tensor: size 50×50×20, rank 50×50×5. No noise (λ = 0).

[Figure: generalization error vs. fraction of observed elements, same legend as the first experiment; Mixture attains the lowest error in this setting.]
Amino acid fluorescence data [Bro & Andersson]
• Size 201×61×5.
• Five solutions with different amounts of three amino acids (tyrosine, tryptophan, phenylalanine).
• Rank-3 PARAFAC is correct.
• Interested in:
  ‐ generalization performance
  ‐ number of components
  ‐ interpretation
Amino acid: Generalization performance
• “Constraint” performs comparably to PARAFAC with the correct rank.

[Figure: generalization error vs. fraction of observed elements for As a Matrix (modes 1, 2, 3), Constraint, Mixture, Tucker [4 4 4], Tucker [3 3 3], PARAFAC(3), and PARAFAC(4).]
Amino acid: Singular-value spectra
[Figure: singular-value spectra of the mode-1, mode-2, and mode-3 unfoldings; true spectra (top row) vs. estimated spectra (bottom row).]

The spectra estimated from half of the entries are almost identical to the truth.
Improving Interpretability

• Apply PARAFAC to the core (4×4×5) obtained by the proposed “Constraint” approach.
• Separate the imputation problem from the interpretation problem.

$\mathcal{X} = \mathcal{C} \times_1 U_1 \times_2 U_2 \times_3 U_3$
$\;\; = [\![ A^{(1)}, A^{(2)}, A^{(3)} ]\!] \times_1 U_1 \times_2 U_2 \times_3 U_3$  (PARAFAC applied to the core: $\mathcal{C} = [\![ A^{(1)}, A^{(2)}, A^{(3)} ]\!]$)
$\;\; = [\![ U_1 A^{(1)}, U_2 A^{(2)}, U_3 A^{(3)} ]\!]$
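A sketch of this recombination (the naive `cp_als` below is my own minimal alternating-least-squares PARAFAC, not the authors' implementation; it assumes `core` holds the estimated 4×4×5 core and `U1, U2, U3` the corresponding Tucker factors, and reuses the hypothetical `unfold` helper):

```python
import numpy as np

def cp_als(T, rank, n_iter=200):
    """Naive PARAFAC/CP decomposition via alternating least squares (3-way)."""
    A = [np.random.randn(T.shape[k], rank) for k in range(3)]
    for _ in range(n_iter):
        for k in range(3):
            i, j = [m for m in range(3) if m != k]
            # Khatri-Rao product (column-wise Kronecker) of the other factors
            kr = np.einsum('ir,jr->ijr', A[i], A[j]).reshape(-1, rank)
            gram = (A[i].T @ A[i]) * (A[j].T @ A[j])
            A[k] = unfold(T, k) @ kr @ np.linalg.pinv(gram)
    return A

# PARAFAC on the small estimated core, then absorb the Tucker factors:
# X = [[A1, A2, A3]] x1 U1 x2 U2 x3 U3 = [[U1 A1, U2 A2, U3 A3]]
A1, A2, A3 = cp_als(core, rank=3)
loadings = [U @ Ak for U, Ak in zip((U1, U2, U3), (A1, A2, A3))]
```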
Obtained factors

[Figure: emission loadings, excitation loadings, and sample loadings obtained by Proposed(4), PARAFAC(4), and PARAFAC(3).]
Summary
• Low-rank tensor completion can be formulated as a convex optimization problem using trace norm regularization.
  ‐ No need to specify the rank beforehand.
• The convex formulation is more accurate and faster than the conventional EM-based Tucker decomposition.
• A curious “phase transition” was found → a compressive-sensing-type analysis is ongoing work.
• The combination of proposed + PARAFAC is useful.
• Code:
  ‐ http://www.ibis.t.u-tokyo.ac.jp/RyotaTomioka/Softwares/Tensor
Acknowledgment
• This work was supported in part by MEXT KAKENHI 22700138, 80545583, JST PRESTO, and NTT Communication Science Laboratories.