Source: lcsl.mit.edu/ldr-workshop/Slides/Recht_LDR_MIT_112313.pdf
Latent Structure Beyond Sparse Codes
Benjamin Recht, Department of EECS and Statistics, University of California, Berkeley
Sparse Codes
[Figure 1. Learned dictionaries. Each panel shows 100 basis functions selected at random from the dictionary of a given overcompleteness ratio (1.25x, 2.5x, 5x, 10x).]
resulting in dictionaries containing more specialized elements such as straight contours, blobs, local curvature, and gratings. The specialized elements are better matched to the structures occurring in natural images, as evidenced by the fact that they yield lower L1-norm representations, steeper coefficient decay, and better denoising. It seems plausible that they may also result in improved image compression, though this remains to be seen.
These results are of relevance to neuroscience because the input layer of V1 is thought to be at least 100x redundant.
Which mathematical representations can be learned robustly?
robustness and sparsity
Gabor-like thingies...
Sparse Approximation
• Use the fact that images are sparse in a wavelet basis to reduce the number of measurements required for signal acquisition.
[Diagrams: pixels → large wavelet coefficients; wideband signal samples → large Gabor coefficients in the time-frequency plane.]
Compressed Sensing
• n_patients ≪ n_peaks
• If very few are needed for diagnosis, search for a sparse set of markers
Lasso
Cardinality Minimization
• PROBLEM: Find the vector of lowest cardinality that satisfies/approximates the underdetermined linear system $\Phi x = y$, where $\Phi : \mathbb{R}^p \to \mathbb{R}^n$.
• NP-HARD:
  – Reduce to EXACT-COVER
  – Hard to approximate
  – Known exact algorithms require enumeration
• HEURISTIC: Replace cardinality with the ℓ1 norm (see the sketch below).
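A minimal sketch of the ℓ1 heuristic (basis pursuit) on synthetic data; cvxpy, the Gaussian Φ, and all problem sizes here are illustrative assumptions, not part of the slides:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 40, 100, 5                      # measurements, dimension, sparsity
Phi = rng.standard_normal((n, p)) / np.sqrt(n)
x_true = np.zeros(p)
x_true[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
y = Phi @ x_true

# Heuristic: minimize ||x||_1 subject to Phi x = y
x = cp.Variable(p)
cp.Problem(cp.Minimize(cp.norm1(x)), [Phi @ x == y]).solve()
print(np.linalg.norm(x.value - x_true))   # near 0 when n is large enough
```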
Rank of: Data Matrix (Recommender Systems)
Rank of: Density Matrix (Quantum Tomography)
Rank of: Gram Matrix (Geometric Structure, Seismic Imaging)
Rank of: Unfolded Tensor
Affine Rank Minimization
• PROBLEM: Find the matrix of lowest rank that satisfies/approximates the underdetermined linear system $\Phi(X) = y$, where $\Phi : \mathbb{R}^{p_1 \times p_2} \to \mathbb{R}^n$.
• NP-HARD:
  – Reduce to solving polynomial equations
  – Hard to approximate
  – Exact algorithms are awful
• HEURISTIC: Replace rank with the nuclear norm (a sketch follows).
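A minimal sketch of the nuclear-norm heuristic for a matrix-completion-style Φ (observing random entries); cvxpy and the problem sizes are assumptions for illustration:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
p1, p2, r, n = 20, 20, 2, 240             # sizes, rank, measurements
M = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))
idx = rng.choice(p1 * p2, size=n, replace=False)
rows, cols = np.unravel_index(idx, (p1, p2))

# Heuristic: minimize ||X||_* subject to Phi(X) = y (here: observed entries)
X = cp.Variable((p1, p2))
constraints = [X[int(i), int(j)] == M[i, j] for i, j in zip(rows, cols)]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
print(np.linalg.norm(X.value - M))        # small when n >> r(p1 + p2)
```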
IDEA: Replace rank with the nuclear norm:

  minimize $\|X\|_*$ subject to $\Phi(X) = b$

Some guy on livejournal, 2006. Fazel, Parrilo, and Recht, 2007. Candès and Recht, 2008.
Succeeds when the number of samples is $\tilde{O}(r(p_1 + p_2))$.

Heuristic: Gradient Descent
[Diagram: factor M = LR*, where M is p1 × p2, L is p1 × r, and R* is r × p2.]
• Step 1: Pick (i, j) and compute the residual: $e = L_i R_j^T - M_{ij}$
• Step 2: Take a mixture of the current model and the corrected model (α, β > 0):
  $L_i \leftarrow \alpha L_i - \beta e R_j$
  $R_j \leftarrow \alpha R_j - \beta e L_i$
(A NumPy sketch follows.)
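A minimal sketch of this stochastic-gradient heuristic in NumPy; the step parameters α and β, the iteration count, and the initialization scale are assumed values, not from the slides:

```python
import numpy as np

def sgd_complete(M, mask, r, alpha=0.999, beta=0.05, iters=100_000, seed=0):
    rng = np.random.default_rng(seed)
    p1, p2 = M.shape
    L = 0.1 * rng.standard_normal((p1, r))
    R = 0.1 * rng.standard_normal((p2, r))
    obs = np.argwhere(mask)                     # observed (i, j) pairs
    for _ in range(iters):
        i, j = obs[rng.integers(len(obs))]      # Step 1: pick (i, j) ...
        e = L[i] @ R[j] - M[i, j]               # ... and compute the residual
        Li_old = L[i].copy()
        L[i] = alpha * L[i] - beta * e * R[j]   # Step 2: mix current and
        R[j] = alpha * R[j] - beta * e * Li_old # corrected models
    return L @ R.T                              # completed low-rank estimate
```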
System Identification: find a dynamical model that agrees with time series data
• All linear systems are combinations of single-pole filters.
• Leverage this structure for new algorithms and analysis.
• Observe a time series $y_1, y_2, \ldots, y_T$ driven by the input $u_1, u_2, \ldots, u_T$.
What is a principled way to build a parsimonious model for the input-output responses?
Na et al, 2012
Shah, Bhaskar, Tang, and Recht 2012
Linear Inverse Problems
• Find me a solution of $y = \Phi x$, where Φ is n × p with n < p.
• Of the infinite collection of solutions, which one should we pick?
• Leverage structure: sparsity, rank, smoothness, symmetry.
• How do we design algorithms to solve underdetermined systems with priors?
Sparsity

$\|x\|_1 = \sum_{i=1}^{p} |x_i|$

• 1-sparse vectors of Euclidean norm 1
• Convex hull is the unit ball of the ℓ1 norm
[Figure: the ℓ1 unit ball in the (x1, x2) plane, with vertices at ±1 on each axis, touching the affine space Φx = y at a sparse point.]

  minimize $\|x\|_1$ subject to $\Phi x = y$
Compressed Sensing: Candes, Romberg, Tao, Donoho, Tanner, Etc...
Rank
• 2×2 matrices plotted in 3d
• rank 1: $x^2 + z^2 + 2y^2 = 1$
• Convex hull: the unit ball of the nuclear norm

$\|X\|_* = \sum_i \sigma_i(X)$

Nuclear Norm Heuristic
Fazel 2002. Recht, Fazel, and Parrilo 2007.
Rank Minimization/Matrix Completion
$\|X\|_* = \sum_i \sigma_i(X)$
Integer Programming
• Integer solutions: all components of x are ±1
• Convex hull is the unit ball of the ℓ∞ norm
[Figure: the square with vertices (1,1), (1,-1), (-1,1), (-1,-1) in the (x1, x2) plane, with the affine space Φx = y touching it at a vertex.]

  minimize $\|x\|_\infty$ subject to $\Phi x = y$

Donoho and Tanner 2008. Mangasarian and Recht 2009.
Parsimonious Models
• Search for the best linear combination of the fewest atoms
• "rank" = fewest atoms needed to describe the model
[Diagram: model = Σ (model weights) × (atoms); the number of atoms used plays the role of rank.]
Atomic Norms
• Given a basic set of atoms $\mathcal{A}$, define the function
  $\|x\|_{\mathcal{A}} = \inf\Big\{ \sum_{a \in \mathcal{A}} |c_a| : x = \sum_{a \in \mathcal{A}} c_a a \Big\} = \inf\{ t > 0 : x \in t\,\mathrm{conv}(\mathcal{A}) \}$
• When $\mathcal{A}$ is centrosymmetric, we get a norm (see the sketch below)
• When can we compute this? When does this work?

IDEA: minimize $\|z\|_{\mathcal{A}}$ subject to $\Phi z = y$
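For a finite atom set, the definition above is a small linear program over the weights $c_a$. A minimal sketch, with cvxpy assumed and hypothetical atoms $\pm e_i$ (which recover the ℓ1 norm):

```python
import cvxpy as cp
import numpy as np

def atomic_norm(x, atoms):
    """atoms: (k, p) array whose rows are the elements of the set A."""
    c = cp.Variable(atoms.shape[0])
    # ||x||_A = inf { sum |c_a| : x = sum c_a a }
    prob = cp.Problem(cp.Minimize(cp.norm1(c)), [atoms.T @ c == x])
    prob.solve()
    return prob.value

p = 4
A = np.vstack([np.eye(p), -np.eye(p)])    # atoms: +/- standard basis vectors
print(atomic_norm(np.array([1.0, -2.0, 0.0, 0.5]), A))  # 3.5 = l1 norm
```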
Hierarchical dictionary for image patches
Union of Subspaces
• X has structured sparsity: linear combination of elements from a set of subspaces {Ug}.
• Atomic set: unit norm vectors living in one of the Ug
Permutations and Rankings
• X a sum of a few permutation matrices
• Examples: Multiobject Tracking, Ranked elections, BCS
• Convex hull of permutation matrices: doubly stochastic matrices.
• Moments: convex hull of [1, t, t², t³, t⁴, ...], t ∈ T, some basic set.
• System Identification, Image Processing, Numerical Integration, Statistical Inference
• Solve with semidefinite programming
• Cut-matrices: sums of rank-one sign matrices.
• Collaborative Filtering, Clustering in Genetic Networks, Combinatorial Approximation Algorithms
• Approximate with semidefinite programming
• Low-rank Tensors: sums of rank-one tensors
• Computer Vision, Image Processing, Hyperspectral Imaging, Neuroscience
• Approximate with alternating least-squares
Atomic norms in sparse approximation
• Greedy approximations
• Best n-term approximation to a function f in the convex hull of $\mathcal{A}$:
  $\|f - f_n\|_{L_2} \le \frac{c_0 \|f\|_{\mathcal{A}}}{\sqrt{n}}$
• Maurey, Jones, and Barron (1980s-90s); DeVore and Temlyakov (1996); Random Feature Heuristics (Rahimi and Recht, 2007)
Tangent Cones

  minimize $\|z\|_{\mathcal{A}}$ subject to $\Phi z = y$

• The set of directions that decrease the norm from x forms a cone:
  $T_{\mathcal{A}}(x) = \{ d : \|x + \alpha d\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}} \text{ for some } \alpha > 0 \}$
• x is the unique minimizer if the intersection of this cone with the null space of Φ equals {0}.
[Figure: the sublevel set $\{z : \|z\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}\}$ and the affine space $\Phi z = y$ meeting at x, with the tangent cone $T_{\mathcal{A}}(x)$.]
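As a worked instance (my example, not from the slides): take $\mathcal{A} = \{\pm e_i\}$, so $\|\cdot\|_{\mathcal{A}} = \|\cdot\|_1$, and the 1-sparse point $x = e_1$. For small $\alpha > 0$,

$$\|e_1 + \alpha d\|_1 = 1 + \alpha\Big(d_1 + \sum_{i \ge 2} |d_i|\Big),$$

so the tangent cone is $T_{\mathcal{A}}(e_1) = \{ d : \sum_{i \ge 2} |d_i| \le -d_1 \}$, a thin cone. The sparser x is, the narrower this cone, and the easier it is for a random null space to miss it.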
Mean Width

Support function: $S_C(d) = \sup_{x \in C} d^T x$

$S_C(d) + S_C(-d)$ measures the width of C when projected onto the span of d.

Mean width: $w(C) = \int_{S^{p-1}} S_C(u)\, du$
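A minimal Monte Carlo sketch of these definitions (my illustration): for the ℓ1 unit ball the support function is $S_C(u) = \|u\|_\infty$, so the mean width is the spherical average of $\|u\|_\infty$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 50, 20_000
U = rng.standard_normal((trials, p))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform samples on S^{p-1}
# S_C(u) = ||u||_inf for C = l1 unit ball; the average approximates w(C)
print(np.abs(U).max(axis=1).mean())
```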
• When does a random subspace U in $\mathbb{R}^p$ intersect a convex cone C only at the origin?
• Gordon (1988): with high probability if
  $\mathrm{codim}(U) \ge p\, w(C \cap S^{p-1})^2$
  where $w(C \cap S^{p-1}) = \int_{S^{p-1}} S_C(u)\, du$ is the mean width.
• Corollary: For inverse problems, if Φ is a random Gaussian matrix with n rows, we need
  $n \ge p\, w(T_{\mathcal{A}}(x) \cap S^{p-1})^2$
  for exact recovery of x.
Rates
• Hypercube: $n \ge p/2$
• Sparse vectors (p-vector, sparsity s): $n \ge 2s \log\!\big(\tfrac{p}{s}\big) + \tfrac{5s}{4}$ (evaluated numerically below)
• Block sparse (M groups, possibly overlapping, maximum group size B, k active groups): $n \ge k\big(\sqrt{2\log(M-k)} + \sqrt{B}\big)^2 + kB$
• Low-rank matrices ($p_1 \times p_2$, $p_1 < p_2$, rank r): $n \ge 3r(p_1 + p_2 - r)$
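To make two of these rates concrete, a quick numeric check (the sizes are my examples):

```python
import numpy as np

p, s = 1000, 10
print(2 * s * np.log(p / s) + 5 * s / 4)   # sparse: ~104.6 measurements
p1, p2, r = 100, 200, 5
print(3 * r * (p1 + p2 - r))               # low-rank: 4425 measurements
```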
Robust Recovery (deterministic)
• Suppose we observe $y = \Phi x + w$ with $\|w\|_2 \le \delta$, and solve (see the sketch below)
  minimize $\|z\|_{\mathcal{A}}$ subject to $\|\Phi z - y\| \le \delta$
• If $\hat{x}$ is an optimal solution, then $\|x - \hat{x}\| \le \frac{2\delta}{\epsilon}$ provided that
  $n \ge \frac{p\, w(T_{\mathcal{A}}(x) \cap S^{p-1})^2}{(1 - \epsilon)^2}$
[Figure: the sublevel set $\{z : \|z\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}}\}$ and the tube $\|\Phi z - y\| \le \delta$.]
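A minimal sketch of this constrained estimator with the ℓ1 atomic norm; cvxpy and all problem sizes are illustrative assumptions:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, p, s, delta = 60, 100, 5, 0.1
Phi = rng.standard_normal((n, p)) / np.sqrt(n)
x0 = np.zeros(p); x0[:s] = 1.0
w = rng.standard_normal(n)
w *= delta / np.linalg.norm(w)            # noise with ||w||_2 <= delta
y = Phi @ x0 + w

z = cp.Variable(p)
cp.Problem(cp.Minimize(cp.norm1(z)),
           [cp.norm(Phi @ z - y, 2) <= delta]).solve()
print(np.linalg.norm(z.value - x0))       # bounded by 2*delta/epsilon
```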
Robust Recovery (statistical)
• Suppose we observe $y = \Phi x + w$ and solve (a sketch follows)
  minimize $\|\Phi z - y\|^2 + \mu \|z\|_{\mathcal{A}}$
• If $\hat{x}$ is an optimal solution, then $\|\Phi x - \Phi \hat{x}\|_2 \le \sqrt{\mu \|x\|_{\mathcal{A}}}$ provided that $\mu \ge \mathbb{E}_w[\|\Phi^* w\|_{\mathcal{A}}^*]$
• And under an additional "cone condition" involving $\mathrm{cone}\{u : \|x + u\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}} + \delta \|u\|\}$:
  $\|x - \hat{x}\|_2 \le \eta(x, \mathcal{A}, \Phi, \delta)\, \mu$

Bhaskar, Tang, and Recht 2011
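A minimal sketch of the penalized estimator with the ℓ1 atomic norm, whose dual atomic norm is ℓ∞, so µ is set by a Monte Carlo estimate of $\mathbb{E}_w[\|\Phi^T w\|_\infty]$; cvxpy, the sizes, and the noise level are illustrative assumptions:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, p, s, sigma = 100, 200, 5, 0.1
Phi = rng.standard_normal((n, p)) / np.sqrt(n)
x0 = np.zeros(p); x0[:s] = 1.0
y = Phi @ x0 + sigma * rng.standard_normal(n)

# mu >= E_w[ ||Phi^T w||_inf ], estimated by sampling fresh noise
mu = np.mean([np.abs(Phi.T @ (sigma * rng.standard_normal(n))).max()
              for _ in range(100)])

z = cp.Variable(p)
cp.Problem(cp.Minimize(cp.sum_squares(Phi @ z - y) + mu * cp.norm1(z))).solve()
print(np.linalg.norm(z.value - x0))
```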
Denoising Rates (re-derivations)
• Sparse vectors (p-vector, sparsity s):
  $\frac{1}{p} \|\hat{x} - x^\star\|_2^2 = O\!\left( \frac{\sigma^2 s \log(p)}{p} \right)$
• Low-rank matrices ($p_1 \times p_2$, $p_1 < p_2$, rank r):
  $\frac{1}{p_1 p_2} \|\hat{x} - x^\star\|_F^2 = O\!\left( \frac{\sigma^2 r}{p_1} \right)$
Atomic Norm Minimization
• Generalizes existing, powerful methods
• Rigorous formula for developing new analysis algorithms
• Tightest bounds on number of measurements needed for model recovery in all common models
• One algorithm prototype for many data-mining applications

IDEA: minimize $\|z\|_{\mathcal{A}}$ subject to $\Phi z = y$
Chandrasekaran, Recht, Parrilo, and Willsky 2010
Learning representations
• ASSUME:
  – very sparse vectors: $s < N^{1/2} / \log(N)$
  – very incoherent dictionary (much more than RIP): $|\langle \Phi x, \Phi z \rangle| \approx |\langle x, z \rangle|$
  – number of observations is much bigger than N
• Gram matrix of the y vectors indicates overlapping support
• Use graph algorithms to identify one dictionary element at a time (see the sketch below)
Arora, Ge, and Moitra. Agarwal, Anandkumar, and Netrapalli.
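A minimal sketch of the first step as described above; the threshold tau and the raw Gram-matrix test are my assumptions about how the overlap graph would be built, not the cited algorithms' exact rules:

```python
import numpy as np

def support_graph(Y, tau):
    """Y: (N, m) array whose columns are the observed vectors y."""
    G = np.abs(Y.T @ Y)          # Gram matrix of the observed vectors
    np.fill_diagonal(G, 0.0)
    return G > tau               # edge (i, j): supports likely overlap
```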
Extended representations

$C = \pi(K \cap L)$: a convex body expressed as a linear map applied to the intersection of a cone with an affine space.

• The regular hexagon is the projection of a 3-dimensional slice of $\mathbb{R}^5_+$:
  $\{ y \in \mathbb{R}^5_+ : y_1 + y_2 + y_3 + y_5 = 2,\ y_3 + y_4 + y_5 = 1 \}$
• This non-regular hexagon only has the trivial LP-lift. [Figure: regular vs. non-regular hexagon.]
Examples of lifts $C = \pi(K \cap L)$:

• ℓ1 ball: $K = \mathbb{R}^{2d}_+$, $\pi = [\,I \ \ -I\,]$, $L = \{ y : \sum_{i=1}^{2d} y_i = 1 \}$ (checked numerically below)
• ℓ∞ ball (in 2d, the square with vertices (1,1), (1,-1), (-1,1), (-1,-1)): $K = \mathbb{R}^{2d}_+$, $\pi = [\,I \ \ -I\,]$, $L = \{ y : y_i + y_{i+d} = 1,\ 1 \le i \le d \}$
• Nuclear norm ball: $K = S^{d_1 + d_2}_+$, $\pi\!\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = B$, $L = \{ Z : \mathrm{trace}(Z) = 1 \}$
• Moment/Toeplitz lift: $K = S^{d+1}_+$, $\pi\!\left( \begin{bmatrix} T & x \\ x^T & u \end{bmatrix} \right) = x$, $L = \left\{ Z = \begin{bmatrix} T & x \\ x^T & u \end{bmatrix} : T \text{ Toeplitz},\ T_{11} = u = 1 \right\}$
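A quick numerical check (my illustration) that the first lift maps onto the ℓ1 ball: sample points of $K \cap L$, apply $\pi = [\,I \ \ -I\,]$, and confirm $\|x\|_1 \le 1$:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
y = rng.dirichlet(np.ones(2 * d), size=1000)  # y >= 0, sum(y) = 1: K ∩ L
x = y[:, :d] - y[:, d:]                       # pi = [I  -I]
print(np.abs(x).sum(axis=1).max() <= 1.0)     # True: every image is in B_1
```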
Extended representations

Polar body: $C^\circ = \{ y : \langle x, y \rangle \le 1 \text{ for all } x \in C \}$

C has a lift into K if there are maps $A : C \to K$ and $B : C^\circ \to K^*$ such that
$1 - \langle x, y \rangle = \langle A(x), B(y) \rangle$
for all extreme points $x \in C$ and $y \in C^\circ$.

Gouveia, Parrilo, and Thomas
Learning extended representations?

$C = \pi(K \cap L)$: convex body = linear map applied to (cone ∩ affine space)

Representation learning becomes matrix factorization.
• Learning representations through NMF?
• Ties immediately with Gaussian width analysis
• Could obviate graph-structured arguments
• What are the right features?