-
5 years from now, everyone will learn their features
(you might as well start now)
Yann LeCun
Courant Institute of Mathematical Sciences and Center for Neural Science, New York University
-
I Have a Terrible Confession to Make
I'm interested in vision, but no more in vision than in audition or in other perceptual modalities.
I'm interested in perception (and in control).
I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities.
Nature seems to have found one.
Almost all natural perceptual signals have a local structure (in space and time) similar to images and videos:
Heavy correlation between neighboring variables. Local patches of variables have structure, and are representable by feature vectors.
I like vision because it's challenging, it's useful, it's fun, and we have data. The image recognition community is not yet stuck in a deep local minimum like the speech recognition community.
-
The Unity of Recognition Architectures
-
Most Recognition Systems Are Built on the Same Architecture
First stage: dense SIFT, HOG, GIST, sparse coding, RBM,
auto-encoders.....
Second stage: K-means, sparse coding, LCC....
Pooling: average, L2, max, max with bias (elastic
templates).....
Convolutional Nets: same architecture, but everything is
trained.
[Diagram: Filter Bank -> Non-Linearity -> feature Pooling -> Normalization -> Classifier, and the two-stage version: Filter Bank -> Non-Lin -> Norm -> Pool -> Filter Bank -> Non-Lin -> Norm -> Pool -> Classifier]
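As a concrete (hypothetical) illustration of this shared architecture, here is a minimal PyTorch sketch of the two-stage pipeline with every stage trainable, as in a ConvNet. The filter counts, kernel sizes, and the use of tanh, max pooling, and local response normalization are illustrative assumptions, not the exact configuration from the slides.

```python
# Hypothetical sketch: Filter Bank -> Non-Linearity -> Normalization -> Pooling,
# repeated twice, followed by a trainable classifier.
import torch
import torch.nn as nn

class TwoStageRecognizer(nn.Module):
    def __init__(self, n_classes=101):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9),   # filter bank (trained)
            nn.Tanh(),                         # non-linearity
            nn.LocalResponseNorm(5),           # stand-in for local contrast normalization
            nn.MaxPool2d(2),                   # pooling / subsampling
        )
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 256, kernel_size=9),
            nn.Tanh(),
            nn.LocalResponseNorm(5),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, n_classes)
        )

    def forward(self, x):
        return self.classifier(self.stage2(self.stage1(x)))
```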
-
Filter Bank + Non-Linearity + Pooling + Normalization
This model of a feature extraction stage is biologically inspired ...whether you like it or not (just ask David Lowe).
Inspired by [Hubel and Wiesel 1962].
The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60's).
[Diagram: Filter Bank -> Non-Linearity -> Spatial Pooling]
-
How well does this work?
Some results on C101 (I know, I know....)
SIFT -> K-means -> Pyramid pooling -> SVM with intersection kernel: >65% [Lazebnik et al. CVPR 2006]
SIFT -> Sparse coding on blocks -> Pyramid pooling -> SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]
SIFT -> Local sparse coding on blocks -> Pyramid pooling -> SVM: >77% [Boureau et al. ICCV 2011]
(Small) supervised ConvNet with sparsity penalty: >71% [rejected from CVPR, ICCV, etc.] REAL TIME
[Diagram: SIFT mapped onto the same architecture: Oriented Edges (filter bank) -> Winner Takes All (non-linearity) -> Histogram sum (feature pooling); then K-means or sparse coding -> Pyramid histogram / elastic parts models -> SVM or another simple classifier]
-
Convolutional Networks (ConvNets) fit that model
-
Why do two stages work better than one stage?
The second stage extracts mid-level features
Having multiple stages helps the selectivity-invariance
dilemma
[Diagram: Filter Bank -> Non-Lin -> Norm -> Pool -> Filter Bank -> Non-Lin -> Norm -> Pool -> Classifier]
-
Learning Hierarchical Representations
I agree with David Lowe: we should learn the features.
It worked for speech, handwriting, NLP.....
In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features:
Mutation: tweak existing features in many different ways
Selection: publish the best ones at CVPR
Reproduction: combine several features from the last CVPR
Iterate. Problem: Moore's law works against you.
[Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation]
-
Sometimes, Biology gives you good hints. Example: contrast normalization
-
Harsh Non-Linearity + Contrast Normalization + Sparsity
THIS IS ONE STAGE OF THE CONVNET
C: Convolutions (filter bank)
N: Soft Thresholding + Abs, Subtractive and Divisive Local Normalization
P: Pooling / downsampling layer: average or max?
[Diagram: Convolutions -> Thresholding -> Rectification -> subtractive + divisive contrast normalization -> Pooling, subsampling]
-
Soft Thresholding Non-Linearity
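A minimal sketch of a soft-thresholding (shrinkage) non-linearity of the kind the title refers to; the exact parameterization used in the talk is not shown here, so the standard form below is an assumption.

```python
import numpy as np

def soft_threshold(x, theta):
    """Soft thresholding (shrinkage): shrinks x toward zero by theta
    and zeroes out anything with magnitude below theta."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```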
-
Local Contrast Normalization
Performed on the state of every layer, including the input.
Subtractive Local Contrast Normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (high-pass filter).
Divisive Local Contrast Normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
Subtractive + Divisive LCN performs a kind of approximate whitening.
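An illustrative numpy/scipy sketch of the subtractive + divisive local contrast normalization described above; the Gaussian width, the averaging across feature maps, and the floor on the divisor are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(x, sigma=2.0, eps=1e-6):
    """x: feature maps of shape (n_maps, H, W).
    Subtractive step: remove a Gaussian-weighted local mean (high-pass filter).
    Divisive step: divide by the local standard deviation taken over space
    and across all feature maps."""
    # Gaussian-weighted local mean over space, averaged over feature maps
    local_mean = gaussian_filter(x, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True)
    centered = x - local_mean
    # Local standard deviation over space and over all feature maps
    local_var = gaussian_filter(centered ** 2, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True)
    local_std = np.sqrt(local_var)
    return centered / np.maximum(local_std, eps)
```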
-
C101 Performance (I know, I know)
Small network: 64 features at stage-1, 256 features at stage-2:
Tanh non-linearity, no rectification, no normalization: 29%
Tanh non-linearity, rectification, normalization: 65%
Shrink non-linearity, rectification, normalization, sparsity penalty: 71%
-
Results on Caltech101 with sigmoid non-linearity
(like HMAX model)
-
Feature Learning Works Really Well on everything but C101
-
C101 is very unfavorable to learning-based systems
Because it's so small. We are switching to ImageNet
Some results on NORB
[Plot of NORB results for: random filters (no normalization), unsupervised filters (no normalization), supervised filters, and unsupervised + supervised filters]
-
Sparse Auto-Encoders
Inference by gradient descent starting from the encoder output:
Z^i = argmin_Z E(Y^i, Z; W)
E(Y^i, Z) = ||Y^i - W_d Z||^2 + ||Z - g_e(W_e, Y^i)||^2 + λ Σ_j |z_j|
[Diagram: INPUT Y feeds a decoder W_d Z with reconstruction cost ||Y^i - W_d Z||^2, an encoder g_e(W_e, Y^i) with prediction cost ||Z - g_e(W_e, Y^i)||^2, and a sparsity cost Σ_j |z_j| on the FEATURES Z]
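A small numpy sketch of this energy and of inference by gradient descent from the encoder output; the tanh form of the encoder g_e, the step size, and the number of iterations are assumptions made for illustration.

```python
import numpy as np

def psd_energy(y, z, Wd, We, lam):
    """E(Y, Z) = ||Y - Wd Z||^2 + ||Z - ge(We, Y)||^2 + lambda * sum_j |z_j|"""
    z_pred = np.tanh(We @ y)              # assumed encoder form g_e(We, Y)
    return (np.sum((y - Wd @ z) ** 2)
            + np.sum((z - z_pred) ** 2)
            + lam * np.sum(np.abs(z)))

def infer_code(y, Wd, We, lam=0.1, lr=0.01, n_steps=100):
    """Z* = argmin_Z E(Y, Z; W): gradient descent starting from the encoder output."""
    z_pred = np.tanh(We @ y)              # encoder prediction (starting point)
    z = z_pred.copy()
    for _ in range(n_steps):
        grad = (-2 * Wd.T @ (y - Wd @ z)  # gradient of the reconstruction term
                + 2 * (z - z_pred)        # gradient of the prediction term
                + lam * np.sign(z))       # subgradient of the L1 sparsity term
        z -= lr * grad
    return z
```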
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
[Diagram: first-layer PSD module with decoder W_d Z, encoder g_e(W_e, Y^i), and sparsity on the FEATURES Z]
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
[Diagram: encoder g_e(W_e, Y^i) followed by |z_j| producing the FEATURES]
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
[Diagram: first-layer encoder + absolute value feeding a second-layer PSD module (decoder, encoder, sparsity)]
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
[Diagram: two stacked encoder + absolute value stages producing the FEATURES]
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
Phase 5: train a supervised classifier on top
Phase 6 (optional): train the entire system with supervised back-propagation
[Diagram: two stacked encoder + absolute value stages feeding a classifier]
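A toy, self-contained numpy sketch of these phases: each train_psd_layer call alternates code inference with decoder/encoder updates, and encoder + absolute value is used as the feature extractor between layers. Dimensions, learning rates, and the tanh encoder are illustrative assumptions; phases 5 and 6 are only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_psd_layer(X, n_features, lam=0.1, lr=0.01, n_epochs=5):
    """Toy PSD training: alternate code inference (gradient steps on the energy)
    with gradient updates of the decoder Wd and encoder We."""
    n_inputs = X.shape[1]
    Wd = rng.normal(scale=0.1, size=(n_inputs, n_features))
    We = rng.normal(scale=0.1, size=(n_features, n_inputs))
    for _ in range(n_epochs):
        for y in X:
            z_pred = np.tanh(We @ y)                  # encoder prediction
            z = z_pred.copy()
            for _ in range(20):                       # inference: minimize E over z
                grad_z = (-2 * Wd.T @ (y - Wd @ z) + 2 * (z - z_pred)
                          + lam * np.sign(z))
                z -= lr * grad_z
            Wd += lr * np.outer(y - Wd @ z, z)        # decoder update
            We += lr * np.outer((z - z_pred) * (1 - z_pred ** 2), y)  # encoder update
    return Wd, We

def encode(We, X):
    """Phase 2/4 feature extractor: encoder + absolute value."""
    return np.abs(np.tanh(X @ We.T))

# Phases 1-4: greedy, layer-wise unsupervised training of a two-stage hierarchy.
X = rng.normal(size=(100, 81))            # stand-in for 9x9 input patches
Wd1, We1 = train_psd_layer(X, n_features=64)
F1 = encode(We1, X)
Wd2, We2 = train_psd_layer(F1, n_features=256)
F2 = encode(We2, F1)
# Phase 5 would train a supervised classifier on F2;
# Phase 6 would fine-tune the whole stack with supervised back-propagation.
```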
-
Learned Features on natural patches: V1-like receptive fields
-
Using PSD Features for Object Recognition
64 filters on 9x9 patches trained with PSD with
Linear-Sigmoid-Diagonal Encoder
-
Convolutional Sparse Coding
[Kavukcuoglu et al. NIPS 2010]: convolutional PSD
[Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network
[Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine
[Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine
[Chen, Sapiro, Dunson, Carin, Preprint 2010]: Deconvolutional Network with automatic adjustment of code dimension
-
Convolutional Training
Problem: with patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector. But when the filters are used convolutionally, neighboring feature vectors will be highly redundant.
Patch-level training produces lots of filters that are shifted versions of each other.
-
Convolutional Sparse Coding
Replace the dot products with the dictionary elements by convolutions.
Input Y is a full image.
Each code component Z_k is a feature map (an image).
Each dictionary element W_k is a convolution kernel.
Regular sparse coding: Y = Σ_k W_k z_k
Convolutional sparse coding: Y = Σ_k W_k * Z_k
"deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010]
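An illustrative numpy/scipy sketch of the convolutional reconstruction Y = Σ_k W_k * Z_k, where each code component is a full feature map and each dictionary element a small kernel; the shapes and the sparsity pattern below are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_reconstruct(Z, W):
    """Convolutional sparse coding reconstruction: Y = sum_k W_k * Z_k.
    Z: code of shape (K, H, W), one feature map per dictionary element.
    W: dictionary of shape (K, h, w), one convolution kernel per element."""
    Y = np.zeros_like(convolve2d(Z[0], W[0], mode='same'))
    for Zk, Wk in zip(Z, W):
        Y += convolve2d(Zk, Wk, mode='same')
    return Y

# Example: a sparse code with 8 feature maps and 5x5 kernels
rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 32, 32)) * (rng.random((8, 32, 32)) < 0.05)  # mostly zeros
W = rng.normal(size=(8, 5, 5))
Y = conv_reconstruct(Z, W)   # reconstructed 32x32 image
```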
-
Convolutional PSD: Encoder with a soft sh() Function
Convolutional formulation: extend sparse coding from PATCH to IMAGE
[Figure: PATCH-based learning vs. CONVOLUTIONAL learning]
-
Cifar-10 Dataset
Dataset of tiny images: 32x32 color images, 10 object categories, 50000 training and 10000 test images.
[Figure: example images]
-
Comparative Results on Cifar-10 Dataset
* Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.
** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third-order Boltzmann machine. CVPR 2010.
-
Road Sign Recognition Competition
GTSRB Road Sign Recognition Competition (phase 1): 32x32 images.
13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA. No. 6 is humans!
-
Pedestrian Detection (INRIA Dataset)
[Sermanet et al., rejected from ICCV 2011]
-
Pedestrian Detection: Examples
[Kavukcuoglu et al. NIPS 2010]
-
Learning Invariant Features
-
Why just pool over space? Why not over orientation?
Using an idea from Hyvarinen: topographic square pooling (subspace ICA)
1. Apply filters on a patch (with suitable non-linearity)
2. Arrange filter outputs on a 2D plane
3. Square filter outputs
4. Minimize sqrt of sum of blocks of squared filter outputs
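A small numpy sketch of the pooling penalty described in steps 1-4: filter outputs arranged on a 2D grid, squared, summed over overlapping blocks, square-rooted and accumulated. The grid size and block size are assumptions.

```python
import numpy as np

def topographic_group_sparsity(z, grid_shape=(10, 10), block=3):
    """z: filter outputs for one patch, arranged on a 2D grid (step 2).
    For every overlapping block of the grid, take sqrt(sum of squared
    outputs) and add it up (steps 3-4). Minimizing this pulls similar
    filters into the same pool."""
    grid = z.reshape(grid_shape)
    H, W = grid_shape
    penalty = 0.0
    for i in range(H - block + 1):
        for j in range(W - block + 1):
            penalty += np.sqrt(np.sum(grid[i:i + block, j:j + block] ** 2))
    return penalty

# Example: 100 filter outputs arranged on a 10x10 grid
z = np.random.default_rng(0).normal(size=100)
print(topographic_group_sparsity(z))
```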
-
Why just pool over space? Why not over orientation?
The filters arrange themselves spontaneously so that similar filters enter the same pool.
The pooling units can be seen as complex cells.
They are invariant to local transformations of the input. For some pools it's translations, for others rotations, or other transformations.
-
Pinwheels?
Does that look pinwheely to you?
-
Sparsity through Lateral Inhibition
-
Invariant Features: Lateral Inhibition
Replace the L1 sparsity term by a lateral inhibition matrix
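The exact form of the lateral-inhibition term is not given on the slide; one plausible form, sketched below as an assumption, couples the magnitudes of code units through a fixed interaction matrix S, so that a zero entry in S lets two units be active together.

```python
import numpy as np

def lateral_inhibition_penalty(z, S):
    """Assumed replacement for the L1 term: units inhibit each other through
    a fixed matrix S, penalty = sum_ij S_ij |z_i| |z_j|.
    A zero entry S_ij means no inhibition between units i and j."""
    a = np.abs(z)
    return a @ S @ a

# Example with a hypothetical S: full inhibition except between a few pairs
n = 8
S = np.ones((n, n)) - np.eye(n)
S[0, 1] = S[1, 0] = 0.0      # units 0 and 1 may be co-active
S[2, 3] = S[3, 2] = 0.0      # units 2 and 3 may be co-active
z = np.random.default_rng(0).normal(size=n)
print(lateral_inhibition_penalty(z, S))
```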
-
Invariant Features: Lateral Inhibition
Zeros in the S matrix have a tree structure
-
Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology.
Input patches are high-pass filtered.
-
Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology.
Left: no high-pass filtering of input. Right: patch-level mean removal.
-
Invariant Features: Short-Range Lateral Excitation + L1
-
Disentangling the Explanatory Factors of Images
-
Separating
I used to think that recognition was all about eliminating irrelevant information while keeping the useful information:
Building invariant representations
Eliminating irrelevant variabilities
I now think that recognition is all about disentangling independent factors of variation:
Separating what and where
Separating content from instantiation parameters
Hinton's capsules; Karol Gregor's what-where auto-encoders
-
Invariant Features through Temporal Constancy
An object is the cross-product of object type and instantiation parameters [Hinton 1981]
[Figure: object type vs. object size (small, medium, large)] [Karol Gregor et al.]
-
Invariant Features through Temporal Constancy
[Architecture diagram: inputs S_t, S_{t+1}, S_{t+2} pass through encoders f_{W1}; codes C1_t, C1_{t+1}, C1_{t+2} vary from frame to frame while code C2_t is shared across the sequence; decoders with weights W1 and W2 produce the predicted input and the predicted code from the inferred codes]
-
Invariant Features through Temporal Constancy
[Figure: learned codes C1 (where) and C2 (what)]
-
Yann LeCun
Input
Generating from the NetworkGenerating from the Network
-
What is the right criterion to train hierarchical feature extraction architectures?
-
Flattening the Data Manifold?
The manifold of all images of ... is low-dimensional and highly curvy.
Feature extractors should flatten the manifold.
-
Flattening the Data Manifold?
-
The Ultimate Recognition System
Bottom-up and top-down information:
Top-down: complex inference and disambiguation
Bottom-up: learns to quickly predict the result of the top-down inference
Integrated supervised and unsupervised learning:
Capture the dependencies between all observed variables
Compositionality:
Each stage has latent instantiation variables
[Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation]