-
5 years from now, everyone will learn their features
(you might as well start now)
Yann LeCun
Courant Institute of Mathematical Sciences and Center for Neural Science, New York University
-
I Have a Terrible Confession to Make
I'm interested in vision, but no more in vision than in audition or in other perceptual modalities.
I'm interested in perception (and in control).
I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities.
Nature seems to have found one.
Almost all natural perceptual signals have a local structure (in space and time) similar to images and videos:
Heavy correlation between neighboring variables. Local patches of variables have structure, and are representable by feature vectors.
I like vision because it's challenging, it's useful, it's fun, and we have data. The image recognition community is not yet stuck in a deep local minimum like the speech recognition community.
-
The Unity of Recognition Architectures
-
Most Recognition Systems Are Built on the Same Architecture
First stage: dense SIFT, HOG, GIST, sparse coding, RBM,
auto-encoders.....
Second stage: K-means, sparse coding, LCC....
Pooling: average, L2, max, max with bias (elastic
templates).....
Convolutional Nets: same architecture, but everything is
trained.
[Diagram: Filter Bank -> Non-Linearity -> feature Pooling -> Normalization -> Classifier, and the two-stage version: Filter Bank -> Non-Lin -> Norm -> Pool -> Filter Bank -> Non-Lin -> Norm -> Pool -> Classifier]
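As a concrete (hypothetical) illustration of this shared architecture, here is a minimal PyTorch sketch of the two-stage pipeline with every stage trainable, as in a ConvNet. The filter counts, kernel sizes, and the use of tanh, max pooling, and local response normalization are illustrative assumptions, not the exact configuration from the slides.

```python
# Hypothetical sketch: Filter Bank -> Non-Linearity -> Normalization -> Pooling,
# repeated twice, followed by a trainable classifier.
import torch
import torch.nn as nn

class TwoStageRecognizer(nn.Module):
    def __init__(self, n_classes=101):
        super().__init__()
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9),   # filter bank (trained)
            nn.Tanh(),                         # non-linearity
            nn.LocalResponseNorm(5),           # stand-in for local contrast normalization
            nn.MaxPool2d(2),                   # pooling / subsampling
        )
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 256, kernel_size=9),
            nn.Tanh(),
            nn.LocalResponseNorm(5),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, n_classes)
        )

    def forward(self, x):
        return self.classifier(self.stage2(self.stage1(x)))
```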
-
Filter Bank + Non-Linearity + Pooling + Normalization
This model of a feature extraction stage is biologically inspired ...whether you like it or not (just ask David Lowe).
Inspired by [Hubel and Wiesel 1962].
The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60's).
[Diagram: Filter Bank -> Non-Linearity -> Spatial Pooling]
-
How well does this work?
Some results on C101 (I know, I know....)
SIFT -> K-means -> Pyramid pooling -> SVM with intersection kernel: >65% [Lazebnik et al. CVPR 2006]
SIFT -> Sparse coding on blocks -> Pyramid pooling -> SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]
SIFT -> Local sparse coding on blocks -> Pyramid pooling -> SVM: >77% [Boureau et al. ICCV 2011]
(Small) supervised ConvNet with sparsity penalty: >71% [rejected from CVPR, ICCV, etc.] REAL TIME
[Diagram: SIFT mapped onto the same architecture: Oriented Edges (filter bank) -> Winner Takes All (non-linearity) -> Histogram sum (feature pooling); then K-means or sparse coding -> Pyramid histogram / elastic parts models -> SVM or another simple classifier]
-
Convolutional Networks (ConvNets) fit that model
-
Why do two stages work better than one stage?
The second stage extracts mid-level features
Having multiple stages helps the selectivity-invariance
dilemma
[Diagram: Filter Bank -> Non-Lin -> Norm -> Pool -> Filter Bank -> Non-Lin -> Norm -> Pool -> Classifier]
-
Learning Hierarchical Representations
I agree with David Lowe: we should learn the features.
It worked for speech, handwriting, NLP.....
In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features:
Mutation: tweak existing features in many different ways
Selection: publish the best ones at CVPR
Reproduction: combine several features from the last CVPR
Iterate. Problem: Moore's law works against you.
[Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation]
-
Sometimes, Biology gives you good hints. Example: contrast normalization
-
Harsh Non-Linearity + Contrast Normalization + Sparsity
THIS IS ONE STAGE OF THE CONVNET
C: Convolutions (filter bank)
N: Soft Thresholding + Abs, Subtractive and Divisive Local Normalization
P: Pooling / downsampling layer: average or max?
[Diagram: Convolutions -> Thresholding -> Rectification -> subtractive + divisive contrast normalization -> Pooling, subsampling]
-
Soft Thresholding Non-Linearity
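A minimal sketch of a soft-thresholding (shrinkage) non-linearity of the kind the title refers to; the exact parameterization used in the talk is not shown here, so the standard form below is an assumption.

```python
import numpy as np

def soft_threshold(x, theta):
    """Soft thresholding (shrinkage): shrinks x toward zero by theta
    and zeroes out anything with magnitude below theta."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```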
-
Local Contrast Normalization
Performed on the state of every layer, including the input.
Subtractive Local Contrast Normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (high-pass filter).
Divisive Local Contrast Normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
Subtractive + Divisive LCN performs a kind of approximate whitening.
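An illustrative numpy/scipy sketch of the subtractive + divisive local contrast normalization described above; the Gaussian width, the averaging across feature maps, and the floor on the divisor are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(x, sigma=2.0, eps=1e-6):
    """x: feature maps of shape (n_maps, H, W).
    Subtractive step: remove a Gaussian-weighted local mean (high-pass filter).
    Divisive step: divide by the local standard deviation taken over space
    and across all feature maps."""
    # Gaussian-weighted local mean over space, averaged over feature maps
    local_mean = gaussian_filter(x, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True)
    centered = x - local_mean
    # Local standard deviation over space and over all feature maps
    local_var = gaussian_filter(centered ** 2, sigma=(0, sigma, sigma)).mean(axis=0, keepdims=True)
    local_std = np.sqrt(local_var)
    return centered / np.maximum(local_std, eps)
```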
-
C101 Performance (I know, I know)
Small network: 64 features at stage-1, 256 features at stage-2:
Tanh non-linearity, no rectification, no normalization: 29%
Tanh non-linearity, rectification, normalization: 65%
Shrink non-linearity, rectification, normalization, sparsity penalty: 71%
-
Results on Caltech101 with sigmoid non-linearity
(like HMAX model)
-
Feature Learning Works Really Well on everything but C101
-
C101 is very unfavorable to learning-based systems
Because it's so small. We are switching to ImageNet
Some results on NORB
[Plot of NORB results for: random filters (no normalization), unsupervised filters (no normalization), supervised filters, and unsupervised + supervised filters]
-
Sparse Auto-Encoders
Inference by gradient descent starting from the encoder output:
Z^i = argmin_Z E(Y^i, Z; W)
E(Y^i, Z) = ||Y^i - W_d Z||^2 + ||Z - g_e(W_e, Y^i)||^2 + λ Σ_j |z_j|
[Diagram: INPUT Y feeds a decoder W_d Z with reconstruction cost ||Y^i - W_d Z||^2, an encoder g_e(W_e, Y^i) with prediction cost ||Z - g_e(W_e, Y^i)||^2, and a sparsity cost Σ_j |z_j| on the FEATURES Z]
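A small numpy sketch of this energy and of inference by gradient descent from the encoder output; the tanh form of the encoder g_e, the step size, and the number of iterations are assumptions made for illustration.

```python
import numpy as np

def psd_energy(y, z, Wd, We, lam):
    """E(Y, Z) = ||Y - Wd Z||^2 + ||Z - ge(We, Y)||^2 + lambda * sum_j |z_j|"""
    z_pred = np.tanh(We @ y)              # assumed encoder form g_e(We, Y)
    return (np.sum((y - Wd @ z) ** 2)
            + np.sum((z - z_pred) ** 2)
            + lam * np.sum(np.abs(z)))

def infer_code(y, Wd, We, lam=0.1, lr=0.01, n_steps=100):
    """Z* = argmin_Z E(Y, Z; W): gradient descent starting from the encoder output."""
    z_pred = np.tanh(We @ y)              # encoder prediction (starting point)
    z = z_pred.copy()
    for _ in range(n_steps):
        grad = (-2 * Wd.T @ (y - Wd @ z)  # gradient of the reconstruction term
                + 2 * (z - z_pred)        # gradient of the prediction term
                + lam * np.sign(z))       # subgradient of the L1 sparsity term
        z -= lr * grad
    return z
```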
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
[Diagram: first-layer PSD module with decoder W_d Z, encoder g_e(W_e, Y^i), and sparsity on the FEATURES Z]
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
[Diagram: encoder g_e(W_e, Y^i) followed by |z_j| producing the FEATURES]
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
[Diagram: first-layer encoder + absolute value feeding a second-layer PSD module (decoder, encoder, sparsity)]
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
[Diagram: two stacked encoder + absolute value stages producing the FEATURES]
-
Using PSD to Train a Hierarchy of Features
Phase 1: train first layer using PSD
Phase 2: use encoder + absolute value as feature extractor
Phase 3: train the second layer using PSD
Phase 4: use encoder + absolute value as 2nd feature extractor
Phase 5: train a supervised classifier on top
Phase 6 (optional): train the entire system with supervised back-propagation
[Diagram: two stacked encoder + absolute value stages feeding a classifier]
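A toy, self-contained numpy sketch of these phases: each train_psd_layer call alternates code inference with decoder/encoder updates, and encoder + absolute value is used as the feature extractor between layers. Dimensions, learning rates, and the tanh encoder are illustrative assumptions; phases 5 and 6 are only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_psd_layer(X, n_features, lam=0.1, lr=0.01, n_epochs=5):
    """Toy PSD training: alternate code inference (gradient steps on the energy)
    with gradient updates of the decoder Wd and encoder We."""
    n_inputs = X.shape[1]
    Wd = rng.normal(scale=0.1, size=(n_inputs, n_features))
    We = rng.normal(scale=0.1, size=(n_features, n_inputs))
    for _ in range(n_epochs):
        for y in X:
            z_pred = np.tanh(We @ y)                  # encoder prediction
            z = z_pred.copy()
            for _ in range(20):                       # inference: minimize E over z
                grad_z = (-2 * Wd.T @ (y - Wd @ z) + 2 * (z - z_pred)
                          + lam * np.sign(z))
                z -= lr * grad_z
            Wd += lr * np.outer(y - Wd @ z, z)        # decoder update
            We += lr * np.outer((z - z_pred) * (1 - z_pred ** 2), y)  # encoder update
    return Wd, We

def encode(We, X):
    """Phase 2/4 feature extractor: encoder + absolute value."""
    return np.abs(np.tanh(X @ We.T))

# Phases 1-4: greedy, layer-wise unsupervised training of a two-stage hierarchy.
X = rng.normal(size=(100, 81))            # stand-in for 9x9 input patches
Wd1, We1 = train_psd_layer(X, n_features=64)
F1 = encode(We1, X)
Wd2, We2 = train_psd_layer(F1, n_features=256)
F2 = encode(We2, F1)
# Phase 5 would train a supervised classifier on F2;
# Phase 6 would fine-tune the whole stack with supervised back-propagation.
```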
-
Learned Features on natural patches: V1-like receptive fields
-
Using PSD Features for Object Recognition
64 filters on 9x9 patches trained with PSD with
Linear-Sigmoid-Diagonal Encoder
-
Convolutional Sparse Coding
[Kavukcuoglu et al. NIPS 2010]: convolutional PSD
[Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network
[Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine
[Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine
[Chen, Sapiro, Dunson, Carin, Preprint 2010]: Deconvolutional Network with automatic adjustment of code dimension
-
Convolutional Training
Problem: with patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector. But when the filters are used convolutionally, neighboring feature vectors will be highly redundant.
Patch-level training produces lots of filters that are shifted versions of each other.
-
Convolutional Sparse Coding
Replace the dot products with the dictionary elements by convolutions.
Input Y is a full image.
Each code component Z_k is a feature map (an image).
Each dictionary element W_k is a convolution kernel.
Regular sparse coding: Y = Σ_k W_k z_k
Convolutional sparse coding: Y = Σ_k W_k * Z_k
"deconvolutional networks" [Zeiler, Taylor, Fergus CVPR 2010]
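An illustrative numpy/scipy sketch of the convolutional reconstruction Y = Σ_k W_k * Z_k, where each code component is a full feature map and each dictionary element a small kernel; the shapes and the sparsity pattern below are assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_reconstruct(Z, W):
    """Convolutional sparse coding reconstruction: Y = sum_k W_k * Z_k.
    Z: code of shape (K, H, W), one feature map per dictionary element.
    W: dictionary of shape (K, h, w), one convolution kernel per element."""
    Y = np.zeros_like(convolve2d(Z[0], W[0], mode='same'))
    for Zk, Wk in zip(Z, W):
        Y += convolve2d(Zk, Wk, mode='same')
    return Y

# Example: a sparse code with 8 feature maps and 5x5 kernels
rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 32, 32)) * (rng.random((8, 32, 32)) < 0.05)  # mostly zeros
W = rng.normal(size=(8, 5, 5))
Y = conv_reconstruct(Z, W)   # reconstructed 32x32 image
```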
-
Convolutional PSD: Encoder with a soft sh() Function
Convolutional formulation: extend sparse coding from PATCH to IMAGE
[Figure: PATCH-based learning vs. CONVOLUTIONAL learning]
-
Cifar-10 Dataset
Dataset of tiny images: 32x32 color images, 10 object categories, 50000 training and 10000 test images.
[Figure: example images]
-
Comparative Results on Cifar-10 Dataset
* Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.
** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third-order Boltzmann machine. CVPR 2010.
-
Road Sign Recognition Competition
GTSRB Road Sign Recognition Competition (phase 1): 32x32 images.
13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA. No. 6 is humans!
-
Pedestrian Detection (INRIA Dataset)
[Sermanet et al., rejected from ICCV 2011]
-
Pedestrian Detection: Examples
[Kavukcuoglu et al. NIPS 2010]
-
Learning Invariant Features
-
Why just pool over space? Why not over orientation?
Using an idea from Hyvarinen: topographic square pooling (subspace ICA)
1. Apply filters on a patch (with suitable non-linearity)
2. Arrange filter outputs on a 2D plane
3. Square filter outputs
4. Minimize sqrt of sum of blocks of squared filter outputs
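A small numpy sketch of the pooling penalty described in steps 1-4: filter outputs arranged on a 2D grid, squared, summed over overlapping blocks, square-rooted and accumulated. The grid size and block size are assumptions.

```python
import numpy as np

def topographic_group_sparsity(z, grid_shape=(10, 10), block=3):
    """z: filter outputs for one patch, arranged on a 2D grid (step 2).
    For every overlapping block of the grid, take sqrt(sum of squared
    outputs) and add it up (steps 3-4). Minimizing this pulls similar
    filters into the same pool."""
    grid = z.reshape(grid_shape)
    H, W = grid_shape
    penalty = 0.0
    for i in range(H - block + 1):
        for j in range(W - block + 1):
            penalty += np.sqrt(np.sum(grid[i:i + block, j:j + block] ** 2))
    return penalty

# Example: 100 filter outputs arranged on a 10x10 grid
z = np.random.default_rng(0).normal(size=100)
print(topographic_group_sparsity(z))
```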
-
Why just pool over space? Why not over orientation?
The filters arrange themselves spontaneously so that similar filters enter the same pool.
The pooling units can be seen as complex cells.
They are invariant to local transformations of the input. For some pools it's translations, for others rotations, or other transformations.
-
Pinwheels?
Does that look pinwheely to you?
-
Sparsity through Lateral Inhibition
-
Invariant Features: Lateral Inhibition
Replace the L1 sparsity term by a lateral inhibition matrix
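The exact form of the lateral-inhibition term is not given on the slide; one plausible form, sketched below as an assumption, couples the magnitudes of code units through a fixed interaction matrix S, so that a zero entry in S lets two units be active together.

```python
import numpy as np

def lateral_inhibition_penalty(z, S):
    """Assumed replacement for the L1 term: units inhibit each other through
    a fixed matrix S, penalty = sum_ij S_ij |z_i| |z_j|.
    A zero entry S_ij means no inhibition between units i and j."""
    a = np.abs(z)
    return a @ S @ a

# Example with a hypothetical S: full inhibition except between a few pairs
n = 8
S = np.ones((n, n)) - np.eye(n)
S[0, 1] = S[1, 0] = 0.0      # units 0 and 1 may be co-active
S[2, 3] = S[3, 2] = 0.0      # units 2 and 3 may be co-active
z = np.random.default_rng(0).normal(size=n)
print(lateral_inhibition_penalty(z, S))
```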
-
Invariant Features: Lateral Inhibition
Zeros in the S matrix have a tree structure
-
Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology.
Input patches are high-pass filtered.
-
Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology.
Left: no high-pass filtering of input. Right: patch-level mean removal.
-
Invariant Features: Short-Range Lateral Excitation + L1
-
Disentangling the Explanatory Factors of Images
-
Separating
I used to think that recognition was all about eliminating irrelevant information while keeping the useful information:
Building invariant representations
Eliminating irrelevant variabilities
I now think that recognition is all about disentangling independent factors of variation:
Separating what and where
Separating content from instantiation parameters
Hinton's capsules; Karol Gregor's what-where auto-encoders
-
Invariant Features through Temporal Constancy
An object is the cross-product of object type and instantiation parameters [Hinton 1981]
[Figure: object type vs. object size (small, medium, large)] [Karol Gregor et al.]
-
Invariant Features through Temporal Constancy
[Architecture diagram: inputs S_t, S_{t+1}, S_{t+2} pass through encoders f_{W1}; codes C1_t, C1_{t+1}, C1_{t+2} vary from frame to frame while code C2_t is shared across the sequence; decoders with weights W1 and W2 produce the predicted input and the predicted code from the inferred codes]
-
Invariant Features through Temporal Constancy
[Figure: learned codes C1 (where) and C2 (what)]
-
Yann LeCun
Input
Generating from the NetworkGenerating from the Network
-
What is the right criterion to train hierarchical feature extraction architectures?
-
Flattening the Data Manifold?
The manifold of all images of ... is low-dimensional and highly curvy.
Feature extractors should flatten the manifold.
-
Flattening the Data Manifold?
-
The Ultimate Recognition System
Bottom-up and top-down information:
Top-down: complex inference and disambiguation
Bottom-up: learns to quickly predict the result of the top-down inference
Integrated supervised and unsupervised learning:
Capture the dependencies between all observed variables
Compositionality:
Each stage has latent instantiation variables
[Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation]