Convolutional Neural Networks Analyzed via Convolutional Sparse Coding

Vardan Papyan* (The Computer Science Department), Yaniv Romano* (The Electrical Engineering Department), Michael Elad (The Computer Science Department)
Technion – Israel Institute of Technology (*Contributed equally)
{vardanp,yromano,elad}@{campus,tx,cs}.technion.ac.il
This research was supported by the European Research Council under the EU's 7th Framework Program, ERC Grant agreement no. 320649.

Motivation and Goals

- Convolutional neural networks (CNN) lead to remarkable results in many fields, yet a clear and profound theoretical understanding of them is still lacking.
- Sparse representation is a powerful model that enjoys a vast theoretical study supporting its success.
- Recently, convolutional sparse coding (CSC) has also been analyzed thoroughly.
- There seems to be a relation between CSC and CNN:
  - Both have a convolutional structure.
  - Both use a data-driven approach for training the model.
  - The most popular non-linearity employed in CNN, the ReLU, is known to be connected to sparsity (and shrinkage).

Background: Convolutional Sparse Coding

In the CSC model a global signal $X$ is written as $X = D\Gamma$, where $D$ is a convolutional dictionary and $\Gamma$ is a sparse vector. Every patch of the signal is generated by a local stripe-dictionary multiplying the corresponding stripe-vector $\gamma_i$ of $\Gamma$. Sparsity is measured locally via the $\ell_{0,\infty}$ norm,
$$\|\Gamma\|_{0,\infty}^s \triangleq \max_i \|\gamma_i\|_0,$$
where $\gamma_i$ is the $i$-th stripe of $\Gamma$, and the associated pursuit problems are
$$(P_{0,\infty}): \ \min_\Gamma \|\Gamma\|_{0,\infty}^s \ \text{s.t.} \ X = D\Gamma, \qquad (P_{0,\infty}^\epsilon): \ \min_\Gamma \|\Gamma\|_{0,\infty}^s \ \text{s.t.} \ \|Y - D\Gamma\|_2 \le \epsilon.$$
A globally sparse vector is likely if it can represent every patch in the signal sparsely [Papyan et al. ('16)].

Background: Convolutional Neural Networks

The forward pass algorithm (without pooling) computes, for a two-layer network,
$$f\big(X; \{W_i, b_i\}\big) = \text{ReLU}\Big(b_2 + W_2^T\, \text{ReLU}\big(b_1 + W_1^T X\big)\Big),$$
where the $W_i$ are the (convolutional) filter matrices and the $b_i$ are the bias vectors.

Training stage: Consider the task of classification, for example. Given a set of signals $\{X_j\}_j$ and their corresponding labels $\{h(X_j)\}_j$, the CNN learns an end-to-end mapping by solving
$$\min_{\{W_i\},\{b_i\}, U} \ \sum_j \ell\Big(h(X_j),\, U,\, f\big(X_j; \{W_i, b_i\}\big)\Big),$$
where the output of the last layer is fed to a classifier $U$ and $\ell$ is a loss comparing the prediction to the true label.

Going Deeper: Multi-Layer CSC Model

Convolutional sparsity assumes an inherent structure is present in natural signals. Similarly, the representations themselves could be assumed to have such a structure. This leads to an extension of the classic sparse representation that models signals as hierarchical sparse representations: the representation of each layer is itself a sparse composition over the next dictionary. A related construction appears in deconvolutional networks [Zeiler et al. '10].

Deep Coding Problem $(DCP_\lambda)$: Find a set of representations satisfying
$$X = D_1\Gamma_1, \ \|\Gamma_1\|_{0,\infty}^s \le \lambda_1; \quad \Gamma_1 = D_2\Gamma_2, \ \|\Gamma_2\|_{0,\infty}^s \le \lambda_2; \quad \ldots \quad \Gamma_{K-1} = D_K\Gamma_K, \ \|\Gamma_K\|_{0,\infty}^s \le \lambda_K.$$

Deep Learning Problem $(DLP_\lambda)$:
$$\min_{\{D_i\}_{i=1}^K,\, U} \ \sum_j \ell\Big(h(X_j),\, U,\, \Gamma_K^\star\big(X_j, \{D_i\}_{i=1}^K\big)\Big),$$
where $\Gamma_K^\star$ is the deepest representation obtained by solving the DCP.

Layered Thresholding

The simplest pursuit for this model applies the thresholding algorithm layer by layer: estimate $\Gamma_1$ via thresholding of $D_1^T X$, then estimate $\Gamma_2$ via thresholding of $D_2^T \hat{\Gamma}_1$, i.e.
$$\hat{\Gamma}_2 = \mathcal{S}_{\beta_2}\big(D_2^T\, \mathcal{S}_{\beta_1}(D_1^T X)\big).$$
The layered (soft nonnegative) thresholding and the forward pass algorithm are the same! The filters play the role of the dictionaries and the biases play the role of the negated thresholds. Consequently, the problem solved by the training stage of CNN and the DLP are equal, assuming that the DCP is approximated via the layered thresholding algorithm. A small numerical illustration of this equivalence is given below.
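The following is a minimal NumPy sketch (not the authors' code) of the equivalence stated above: the layered soft nonnegative thresholding pursuit coincides with the CNN forward pass once the biases are set to the negated thresholds, $b_i = -\beta_i$. The dense random matrices, the two-layer depth, and the threshold values are illustrative assumptions standing in for trained convolutional dictionaries.

```python
# Sketch: layered soft nonnegative thresholding == CNN forward pass (assumed setup).
import numpy as np

def soft_nonneg_threshold(v, beta):
    # S^+_beta(v) = max(v - beta, 0): soft thresholding restricted to nonnegative values.
    return np.maximum(v - beta, 0.0)

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(0)
N, m1, m2 = 64, 128, 256              # illustrative signal / representation sizes
D1 = rng.standard_normal((N, m1))     # stands in for the convolutional dictionary D_1
D2 = rng.standard_normal((m1, m2))    # stands in for D_2
beta1, beta2 = 0.5, 0.3               # thresholds (negated biases)
X = rng.standard_normal(N)

# Layered thresholding pursuit: estimate Gamma_1, then Gamma_2 from it.
G1_lt = soft_nonneg_threshold(D1.T @ X, beta1)
G2_lt = soft_nonneg_threshold(D2.T @ G1_lt, beta2)

# CNN forward pass with W_i = D_i and b_i = -beta_i.
G1_cnn = relu(D1.T @ X - beta1)
G2_cnn = relu(D2.T @ G1_cnn - beta2)

assert np.allclose(G2_lt, G2_cnn)     # the two computations coincide exactly
```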
Uniqueness of $DCP_\lambda$

Mutual coherence: $\mu(D) = \max_{i \ne j} |d_i^T d_j|$, where the $d_i$ are the atoms of $D$.

Background [Papyan et al. ('16)]: If a solution is found for the first layer such that $\|\Gamma\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$, then it is necessarily the unique solution.

Theorem: If a set of solutions $\{\Gamma_i\}_{i=1}^K$ is found for $(DCP_\lambda)$ such that
$$\|\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right),$$
then these are necessarily the unique solution to this problem. In other words, the feature maps CNN aims to recover are unique.

Handling Noise: Stability of $DCP_\lambda^{\mathcal{E}}$

So far we have assumed an ideal signal $X$. In practice, however, we usually have $Y = X + E$, where $E$ is due to noise or model deviations.

$(DCP_\lambda^{\mathcal{E}})$: Find a set of representations satisfying
$$\|Y - D_1\Gamma_1\|_2 \le \mathcal{E}_0, \ \|\Gamma_1\|_{0,\infty}^s \le \lambda_1; \quad \|\Gamma_1 - D_2\Gamma_2\|_2 \le \mathcal{E}_1, \ \|\Gamma_2\|_{0,\infty}^s \le \lambda_2; \quad \ldots \quad \|\Gamma_{K-1} - D_K\Gamma_K\|_2 \le \mathcal{E}_{K-1}, \ \|\Gamma_K\|_{0,\infty}^s \le \lambda_K.$$

Background [Papyan et al. ('16)]: If the true representation of the first layer satisfies $\|\Gamma\|_{0,\infty}^s = k < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$, then a solution of $(P_{0,\infty}^\epsilon)$ must be close to it:
$$\|\hat{\Gamma} - \Gamma\|_2^2 \le \frac{4\epsilon^2}{1 - (2k - 1)\mu(D)}.$$

Theorem: If the true representations $\{\Gamma_i\}_{i=1}^K$ satisfy $\|\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, and the error thresholds for $(DCP_\lambda^{\mathcal{E}})$ are chosen as
$$\mathcal{E}_0^2 = \|E\|_2^2, \qquad \mathcal{E}_i^2 = \frac{4\,\mathcal{E}_{i-1}^2}{1 - \big(2\|\Gamma_i\|_{0,\infty}^s - 1\big)\mu(D_i)},$$
then the set of solutions $\{\hat{\Gamma}_i\}_{i=1}^K$ obtained by solving this problem must be close to the true ones: $\|\hat{\Gamma}_i - \Gamma_i\|_2^2 \le \mathcal{E}_i^2$. The problem CNN aims to solve is therefore stable under these conditions.

Stability of Layered Thresholding

Thus far, our analysis relied on the local sparsity of the underlying solution, enforced through the $\ell_{0,\infty}$ norm, while the noise was treated globally. In what follows, we present stability guarantees for the forward pass of CNN that also depend on the local energy in the noise vector $E$. This is enforced via the $\ell_{2,\infty}$ norm, defined as
$$\|E\|_{2,\infty}^p = \max_i \|\mathcal{R}_i E\|_2,$$
where $\mathcal{R}_i$ is the operator extracting the $i$-th patch (the superscript $p$ denotes a measure over patches rather than stripes).

Theorem: If
$$\|\Gamma_i\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)} \cdot \frac{|\Gamma_i^{\min}|}{|\Gamma_i^{\max}|}\right) - \frac{1}{\mu(D_i)} \cdot \frac{\varepsilon_L^{i-1}}{|\Gamma_i^{\max}|},$$
then the layered soft thresholding (with correctly chosen thresholds $\beta_i$) will
1. find the correct supports;
2. satisfy $\|\hat{\Gamma}_i^{LT} - \Gamma_i\|_{2,\infty}^p \le \varepsilon_L^i$,
where we have defined $\varepsilon_L^0 = \|E\|_{2,\infty}^p$ and
$$\varepsilon_L^i = \sqrt{\|\Gamma_i\|_{0,\infty}^p}\,\Big(\varepsilon_L^{i-1} + \mu(D_i)\big(\|\Gamma_i\|_{0,\infty}^s - 1\big)|\Gamma_i^{\max}| + \beta_i\Big).$$
In the case of hard thresholding, the threshold $\beta_i$ does not increase the error $\varepsilon_L^i$. The stability of the forward pass is thus guaranteed if the underlying representations are locally sparse and the noise is locally bounded.

Limitations of the Forward Pass

- Even in the noiseless case, it is incapable of recovering the solution of the DCP.
- Its success depends on the ratio $|\Gamma_i^{\min}| / |\Gamma_i^{\max}|$; this is a direct consequence of the forward pass relying on the simple thresholding operator.
- The distance between the true sparse vector and the estimated one increases exponentially as a function of the layer depth.

What's Next? Layered Basis Pursuit (BP)

Thresholding is the simplest pursuit known in the field of sparsity. What about replacing the thresholding with the basis pursuit? In its constrained form the layered BP solves, layer by layer,
$$\hat{\Gamma}_1^{LBP} = \arg\min_{\Gamma_1} \|\Gamma_1\|_1 \ \text{s.t.} \ X = D_1\Gamma_1, \qquad \hat{\Gamma}_2^{LBP} = \arg\min_{\Gamma_2} \|\Gamma_2\|_1 \ \text{s.t.} \ \hat{\Gamma}_1^{LBP} = D_2\Gamma_2,$$
and in its Lagrangian form,
$$\hat{\Gamma}_1^{LBP} = \arg\min_{\Gamma_1} \tfrac{1}{2}\|Y - D_1\Gamma_1\|_2^2 + \xi_1\|\Gamma_1\|_1, \qquad \hat{\Gamma}_2^{LBP} = \arg\min_{\Gamma_2} \tfrac{1}{2}\|\hat{\Gamma}_1^{LBP} - D_2\Gamma_2\|_2^2 + \xi_2\|\Gamma_2\|_1.$$

Guarantee for the Success of Layered BP

Background [Papyan et al. ('16)]: If a solution of the first layer satisfies $\|\Gamma\|_{0,\infty}^s < \frac{1}{2}\left(1 + \frac{1}{\mu(D)}\right)$, then global BP is guaranteed to find it.

Theorem: If a set of solutions $\{\Gamma_i\}_{i=1}^K$ of $(DCP_\lambda)$ satisfies $\|\Gamma_i\|_{0,\infty}^s \le \lambda_i < \frac{1}{2}\left(1 + \frac{1}{\mu(D_i)}\right)$, then the layered BP is guaranteed to find them. In contrast to the forward pass, the layered BP can retrieve the representations exactly in the noiseless case, and its success does not depend on the ratio $|\Gamma_i^{\min}| / |\Gamma_i^{\max}|$.

Stability of Layered Basis Pursuit

Background [Papyan et al. ('16)]: For the first layer, if $\xi = 4\|E\|_{2,\infty}^p$ and $\|\Gamma\|_{0,\infty}^s < \frac{1}{3}\left(1 + \frac{1}{\mu(D)}\right)$, then:
1. the support of $\hat{\Gamma}_{BP}$ is contained in that of $\Gamma$;
2. $\|\hat{\Gamma}_{BP} - \Gamma\|_\infty \le 7.5\,\|E\|_{2,\infty}^p$;
3. every entry greater than $7.5\,\|E\|_{2,\infty}^p$ will be found;
4. $\hat{\Gamma}_{BP}$ is unique.

Theorem: Assuming that $\|\Gamma_i\|_{0,\infty}^s < \frac{1}{3}\left(1 + \frac{1}{\mu(D_i)}\right)$, then (for correctly chosen $\{\xi_i\}_{i=1}^K$) we are guaranteed that
1. the support of $\hat{\Gamma}_i^{LBP}$ is contained in that of $\Gamma_i$;
2. $\|\hat{\Gamma}_i^{LBP} - \Gamma_i\|_{2,\infty}^p \le \varepsilon_L^i$;
3. every entry in $\Gamma_i$ greater than $\varepsilon_L^i / \sqrt{\|\Gamma_i\|_{0,\infty}^p}$ will be found,
where $\varepsilon_L^i = 7.5^i\,\|E\|_{2,\infty}^p \prod_{j=1}^i \sqrt{\|\Gamma_j\|_{0,\infty}^p}$.

Layered Iterative Thresholding

The layered BP can be solved via iterative soft thresholding, applied layer by layer:
$$\Gamma_1^t = \mathcal{S}_{\xi_1/c_1}\Big(\Gamma_1^{t-1} + \tfrac{1}{c_1} D_1^T\big(Y - D_1\Gamma_1^{t-1}\big)\Big), \qquad \Gamma_2^t = \mathcal{S}_{\xi_2/c_2}\Big(\Gamma_2^{t-1} + \tfrac{1}{c_2} D_2^T\big(\hat{\Gamma}_1 - D_2\Gamma_2^{t-1}\big)\Big).$$
This can be seen as a recurrent neural network [Gregor & LeCun '10]; a small sketch of this pursuit is given below.
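Below is a minimal NumPy sketch of the layered iterative soft thresholding update written above, which solves the Lagrangian layered BP one layer at a time; unrolling the inner loop is what gives the recurrent-network interpretation of Gregor & LeCun. The dense random matrices stand in for the convolutional dictionaries, the step sizes $c_i$ are set to the spectral norm squared of $D_i$, and the $\xi_i$ values and iteration count are illustrative assumptions, not the authors' settings.

```python
# Sketch: layered basis pursuit solved with iterative soft thresholding (ISTA), per layer.
import numpy as np

def soft_threshold(v, tau):
    # The soft thresholding (shrinkage) operator S_tau.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def layer_bp_ista(prev, D, xi, n_iters=200):
    # Minimize 0.5*||prev - D G||_2^2 + xi*||G||_1 over G via ISTA.
    c = np.linalg.norm(D, 2) ** 2          # step-size constant: largest eigenvalue of D^T D
    G = np.zeros(D.shape[1])
    for _ in range(n_iters):
        G = soft_threshold(G + (D.T @ (prev - D @ G)) / c, xi / c)
    return G

def layered_bp(X, dictionaries, xis, n_iters=200):
    # Run the pursuit layer by layer: the estimate of Gamma_{i-1} becomes
    # the "signal" decomposed by the i-th dictionary.
    estimates, prev = [], X
    for D, xi in zip(dictionaries, xis):
        prev = layer_bp_ista(prev, D, xi, n_iters)
        estimates.append(prev)
    return estimates

# Illustrative usage with random (untrained) dictionaries:
rng = np.random.default_rng(0)
D1 = rng.standard_normal((64, 128))
D2 = rng.standard_normal((128, 256))
X = rng.standard_normal(64)
G1_hat, G2_hat = layered_bp(X, [D1, D2], xis=[0.5, 0.3])
```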
Conclusion

- The relation between CNN and the sparse-land model has been defined via our multi-layer convolutional sparse coding model.
- We have shown that the forward pass of CNN is in fact a pursuit algorithm that aims to decompose signals belonging to our model into their building blocks.
- Leveraging this connection, we were able to attribute to the CNN architecture theoretical claims such as uniqueness of the representations (feature maps) throughout the network and their stable estimation, all guaranteed under local sparsity conditions.
- Standing on these theoretical grounds, we have proposed a better pursuit that is shown to be theoretically superior to the forward pass.
- In summary, we link deep learning and sparse representations, providing mathematical grounds to analyze CNN theoretically. In addition, we propose new insights and architectures with clear theoretical foundations.

References

Vardan Papyan, Jeremias Sulam, and Michael Elad, "Working locally thinking globally - Part I: Theoretical guarantees for convolutional sparse coding," submitted to IEEE-TSP, 2016.
Vardan Papyan, Jeremias Sulam, and Michael Elad, "Working locally thinking globally - Part II: Stability and algorithms for convolutional sparse coding," submitted to IEEE-TSP, 2016.
Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus, "Deconvolutional networks," CVPR, 2010.
Karol Gregor and Yann LeCun, "Learning fast approximations of sparse coding," ICML, 2010.
Michael Elad, "Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing," Springer, 2010.