1. 5 years from now, everyone will learn their features
(you might as well start now)
Yann LeCun
Courant Institute of Mathematical Sciences and Center for Neural Science, New York University
2. I Have a Terrible Confession to Make
I'm interested in vision, but no more in vision than in audition or in other perceptual modalities. I'm interested in perception (and in control).
I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities. Nature seems to have found one.
Almost all natural perceptual signals have a local structure (in space and time) similar to images and videos: heavy correlation between neighboring variables; local patches of variables have structure and are representable by feature vectors.
I like vision because it's challenging, it's useful, it's fun, and we have data. The image recognition community is not yet stuck in a deep local minimum like the speech recognition community.
3. The Unity of Recognition Architectures
4. Most Recognition Systems Are Built on the Same Architecture
Filter Bank -> Non-Linearity -> Feature Pooling -> Normalization -> Classifier
(often two such stages: [Filter Bank -> Non-Lin -> Pooling -> Norm] x2 -> Classifier)
First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders...
Second stage: K-means, sparse coding, LCC...
Pooling: average, L2, max, max with bias (elastic templates)...
Convolutional Nets: same architecture, but everything is trained.
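As a concrete reading of that pipeline, here is a minimal sketch in PyTorch (my own illustration, not from the talk; the layer sizes are arbitrary, and LocalResponseNorm stands in for the contrast normalization discussed later):

```python
import torch
import torch.nn as nn

# Minimal sketch of the generic two-stage recognition pipeline:
# filter bank -> non-linearity -> pooling -> normalization, twice,
# followed by a linear classifier. All sizes are illustrative.
stage1 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=9),   # filter bank
    nn.Tanh(),                         # non-linearity
    nn.MaxPool2d(2),                   # feature pooling
    nn.LocalResponseNorm(5),           # normalization (stand-in for LCN)
)
stage2 = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=9),
    nn.Tanh(),
    nn.MaxPool2d(2),
    nn.LocalResponseNorm(5),
)
classifier = nn.Sequential(nn.Flatten(), nn.LazyLinear(101))  # e.g. 101 classes

x = torch.randn(1, 3, 96, 96)          # dummy image batch
logits = classifier(stage2(stage1(x)))
```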
5. Filter Bank + Non-Linearity + Pooling + Normalization
This model of a feature extraction stage is biologically-inspired... whether you like it or not (just ask David Lowe).
Inspired by [Hubel and Wiesel 1962].
The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60s).
6. How well does this work?
Pipeline: Filter Bank -> Non-Linearity -> Feature Pooling -> Classifier.
Examples of each box: oriented edges, K-means, sparse coding, or SIFT for the filter bank; winner-takes-all for the non-linearity; histogram (sum), pyramid histogram, or elastic parts models for the pooling; an SVM or another simple classifier on top.
Some results on C101 (I know, I know...):
SIFT -> K-means -> Pyramid pooling -> SVM with intersection kernel: >65% [Lazebnik et al. CVPR 2006]
SIFT -> Sparse coding on blocks -> Pyramid pooling -> SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]
SIFT -> Local sparse coding on blocks -> Pyramid pooling -> SVM: >77% [Boureau et al. ICCV 2011]
(Small) supervised ConvNet with sparsity penalty: >71% [rejected from CVPR, ICCV, etc.], running in REAL TIME
7. Convolutional Networks (ConvNets) fit that model
8. Why do two stages work better than one stage?
Pipeline: [Filter Bank -> Non-Lin -> Pooling -> Norm] x2 -> Classifier.
The second stage extracts mid-level features.
Having multiple stages helps with the selectivity-invariance dilemma.
9. Learning Hierarchical Representations
Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation.
I agree with David Lowe: we should learn the features. It worked for speech, handwriting, NLP...
In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features:
Mutation: tweak existing features in many different ways.
Selection: publish the best ones at CVPR.
Reproduction: combine several features from the last CVPR.
Iterate.
Problem: Moore's law works against you.
10. Sometimes, Biology Gives You Good Hints
Example: contrast normalization.
11. Harsh Non-Linearity + Contrast Normalization + Sparsity
C: Convolutions (filter bank)
Rectification: soft thresholding + absolute value
N: Subtractive and divisive local normalization
P: Pooling/downsampling layer: average or max?
THIS IS ONE STAGE OF THE CONVNET.
12. Soft Thresholding Non-Linearity
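Soft thresholding (shrinkage) zeroes small inputs and shrinks the rest toward zero, which is what makes the codes sparse. A minimal illustration with PyTorch's built-in softshrink; the threshold 0.5 is arbitrary:

```python
import torch
import torch.nn.functional as F

# Soft thresholding (shrinkage): sign(x) * max(|x| - lambda, 0).
# Values in [-lambda, lambda] map to exactly zero; the rest shrink toward zero.
x = torch.linspace(-2.0, 2.0, steps=9)
y = F.softshrink(x, lambd=0.5)
print(y)  # small activations become exactly zero -> sparsity
```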
13. Local Contrast Normalization
Performed on the state of every layer, including the input.
Subtractive local contrast normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (high-pass filter).
Divisive local contrast normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
Subtractive + divisive LCN performs a kind of approximate whitening.
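A sketch of the two LCN steps, assuming PyTorch; the kernel size and Gaussian width are illustrative, and this follows the Jarrett et al. 2009 formulation as I understand it:

```python
import torch
import torch.nn.functional as F

def local_contrast_norm(x, ksize=9, eps=1e-6):
    """Sketch of subtractive + divisive LCN. x: (batch, channels, h, w)."""
    b, c, h, w = x.shape
    # 2D Gaussian kernel, normalized to sum to 1 over space and channels.
    coords = torch.arange(ksize, dtype=x.dtype) - ksize // 2
    g = torch.exp(-coords ** 2 / (2 * 2.0 ** 2))          # 1D Gaussian, std=2
    k2d = torch.outer(g, g)
    weight = (k2d / (k2d.sum() * c)).expand(1, c, -1, -1).contiguous()
    # Subtractive: remove the Gaussian-weighted local mean (high-pass).
    mean = F.conv2d(x, weight, padding=ksize // 2)
    v = x - mean                                          # broadcasts over channels
    # Divisive: divide by the local std over space and all feature maps,
    # floored at its global mean so flat regions are not blown up.
    sigma = F.conv2d(v * v, weight, padding=ksize // 2).sqrt()
    return v / (torch.maximum(sigma, sigma.mean()) + eps)

y = local_contrast_norm(torch.randn(2, 3, 32, 32))
```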
14. C101 Performance (I know, I know)
Small network: 64 features at stage-1, 256 features at stage-2.
Tanh non-linearity, no rectification, no normalization: 29%
Tanh non-linearity, rectification, normalization: 65%
Shrink non-linearity, rectification, normalization, sparsity penalty: 71%
15. Results on Caltech101 with sigmoid non-linearity (like the HMAX model)
16. Feature Learning Works Really Well on everything but C101
17. C101 is very unfavorable to learning-based systems
Because it's so small. We are switching to ImageNet.
Some results on NORB, comparing: no normalization, random filters, unsupervised filters, supervised filters, and unsupervised + supervised filters.
18. Sparse Auto-Encoders
Inference by gradient descent starting from the encoder output:
$E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \|Z - g_e(W_e, Y^i)\|^2 + \lambda \sum_j |z_j|$
$Z^* = \arg\min_Z E(Y^i, Z; W)$
Diagram: the INPUT $Y$ feeds both the decoder reconstruction $W_d Z$ and the encoder prediction $g_e(W_e, Y)$; $Z$ holds the FEATURES.
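A minimal sketch of this inference step, assuming PyTorch. The encoder here is a plain tanh (the full PSD encoder also carries diagonal gain terms), and the sizes, step count, and lambda are illustrative:

```python
import torch

# Sketch of PSD inference: minimize the energy over the code Z by
# gradient descent, starting from the encoder's prediction.
def psd_infer(y, W_d, W_e, lam=0.5, steps=50, lr=0.1):
    z = torch.tanh(W_e @ y).detach().requires_grad_(True)  # start at g_e(W_e, y)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        energy = ((y - W_d @ z) ** 2).sum() \
               + ((z - torch.tanh(W_e @ y)) ** 2).sum() \
               + lam * z.abs().sum()
        energy.backward()
        opt.step()
    return z.detach()

# Toy usage with random weights (illustrative sizes).
y = torch.randn(81)                 # a 9x9 patch, flattened
W_d = torch.randn(81, 64) * 0.1     # decoder dictionary
W_e = torch.randn(64, 81) * 0.1     # encoder filters
z_star = psd_infer(y, W_d, W_e)
```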
19. Using PSD to Train a Hierarchy of Features
Phase 1: train the first layer using PSD.
20. Using PSD to Train a Hierarchy of Features
Phase 2: use encoder + absolute value as feature extractor.
21. Using PSD to Train a Hierarchy of Features
Phase 3: train the second layer using PSD.
22. Using PSD to Train a Hierarchy of Features
Phase 4: use encoder + absolute value as the 2nd feature extractor.
23. Using PSD to Train a Hierarchy of Features
Phase 5: train a supervised classifier on top.
Phase 6 (optional): train the entire system with supervised back-propagation.
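The phases above amount to greedy layer-wise training. A runnable toy sketch under stated assumptions: `train_psd` and `encoder_features` are my names, not the slides', and `train_psd` here just returns random weights so the pipeline executes end-to-end; a real run would alternate the `psd_infer` step above with dictionary updates.

```python
import torch

def train_psd(data, n_features):
    # Stand-in for PSD training of one layer; returns encoder weights W_e.
    d = data.shape[1]
    return torch.randn(n_features, d) * 0.1

def encoder_features(data, W_e):
    # Phases 2 and 4: feed-forward encoder + absolute value.
    return torch.tanh(data @ W_e.T).abs()

patches = torch.randn(1000, 81)                # raw 9x9 input patches
W_e1 = train_psd(patches, 64)                  # Phase 1
z1 = encoder_features(patches, W_e1)           # Phase 2
W_e2 = train_psd(z1, 256)                      # Phase 3
z2 = encoder_features(z1, W_e2)                # Phase 4
# Phase 5: train any supervised classifier on z2.
# Phase 6 (optional): fine-tune the whole stack with back-propagation.
```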
24. Learned Features on natural patches: V1-like receptive fields
25. Using PSD Features for Object Recognition
64 filters on 9x9 patches, trained with PSD with a Linear-Sigmoid-Diagonal encoder.
26. Convolutional Sparse Coding
[Kavukcuoglu et al. NIPS 2010]: convolutional PSD
[Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network
[Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine
[Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine
[Chen, Sapiro, Dunson, Carin, Preprint 2010]: Deconvolutional Network with automatic adjustment of code dimension
27. Convolutional Training
Problem: with patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector. But when the filters are used convolutionally, neighboring feature vectors are highly redundant.
Patch-level training produces lots of filters that are shifted versions of each other.
28. Convolutional Sparse Coding
Replace the dot products with the dictionary elements by convolutions: the input Y is a full image, each code component $Z_k$ is a feature map (an image), and each dictionary element $W_k$ is a convolution kernel.
Regular sparse coding: $Y = \sum_k Z_k W_k$ (each $Z_k$ a scalar coefficient).
Convolutional sparse coding: $Y = \sum_k W_k \ast Z_k$.
See also deconvolutional networks [Zeiler, Taylor, Fergus CVPR 2010].
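The reconstruction $\sum_k W_k \ast Z_k$ maps directly onto a transposed convolution; a minimal sketch (my illustration, with arbitrary sizes):

```python
import torch
import torch.nn.functional as F

# Convolutional reconstruction Y = sum_k W_k * Z_k: each code component
# Z_k is a feature map, each dictionary element W_k a convolution kernel.
K, ksize, H, W = 16, 9, 64, 64
kernels = torch.randn(1, K, ksize, ksize)    # dictionary: K kernels
z = torch.randn(1, K, H, W)                  # code: K feature maps
# conv_transpose2d sums the kernel-weighted feature maps, producing a
# reconstruction slightly larger than each Z_k ("full" convolution).
y_hat = F.conv_transpose2d(z, kernels.transpose(0, 1).contiguous())
print(y_hat.shape)  # (1, 1, H + ksize - 1, W + ksize - 1)
```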
29. Convolutional PSD: Encoder with a Soft sh() Function
Convolutional formulation: extend sparse coding from the PATCH level to the whole IMAGE, i.e. from patch-based to convolutional learning.
30. CIFAR-10 Dataset
Dataset of tiny images: 32x32 color images, 10 object categories, with 50000 training and 10000 testing samples.
(Example images.)
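For reference, the same 50000/10000 split can be loaded with torchvision (a modern convenience, not tooling from the original talk):

```python
import torchvision

# CIFAR-10: 32x32 color images, 10 classes, 50000 train / 10000 test.
train = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
test = torchvision.datasets.CIFAR10(root="./data", train=False, download=True)
print(len(train), len(test))                          # 50000 10000
print(train[0][0].size, train.classes[train[0][1]])   # (32, 32) and a label name
```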
31. Comparative Results on CIFAR-10 Dataset
(Results table.)
* Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.
** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third order Boltzmann machine. CVPR 2010.
32. Road Sign Recognition Competition
GTSRB Road Sign Recognition Competition (phase 1), 32x32 images.
13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA.
No. 6 is humans!
33. Pedestrian Detection (INRIA Dataset)
[Sermanet et al., rejected from ICCV 2011]
34. Pedestrian Detection: Examples
[Kavukcuoglu et al. NIPS 2010]
35. Learning Invariant Features
36. Why just pool over space? Why not over orientation?
Using an idea from Hyvarinen: topographic square pooling (subspace ICA).
1. Apply filters on a patch (with a suitable non-linearity).
2. Arrange the filter outputs on a 2D plane.
3. Square the filter outputs.
4. Minimize the square root of the sum of blocks of squared filter outputs (sketched below).
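A minimal sketch of steps 2-4, assuming non-overlapping blocks on the 2D plane (real topographic layouts typically use overlapping neighborhoods):

```python
import torch

# Topographic square pooling: filter outputs laid out on a 2D grid are
# pooled per block by sqrt-of-sum-of-squares. Sizes are illustrative.
outputs = torch.randn(16, 16)          # filter outputs on a 2D plane
blocks = outputs.reshape(4, 4, 4, 4)   # 4x4 grid of 4x4 blocks
pooled = (blocks ** 2).sum(dim=(1, 3)).sqrt()  # one L2 value per block
sparsity_penalty = pooled.sum()        # this is what training minimizes
```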
37. Why just pool over space? Why not over orientation?
The filters arrange themselves spontaneously so that similar filters enter the same pool.
The pooling units can be seen as complex cells.
They are invariant to local transformations of the input: for some it's translations, for others rotations or other transformations.
38. Pinwheels?
Does that look pinwheely to you?
39. Sparsity through Lateral Inhibition
40. Invariant Features: Lateral Inhibition
Replace the L1 sparsity term by a lateral inhibition matrix (one way to write this is sketched below).
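The slides give no formula, so the following is my assumption, in the spirit of structured sparsity: replace $\lambda \sum_j |z_j|$ with an interaction term $|z|^\top S |z|$, so that zero entries of S let the corresponding unit pairs be co-active at no cost:

```python
import torch

# Sketch (my formulation, not from the slides): lateral-inhibition
# sparsity term |z|^T S |z|, where S_ij >= 0 is how strongly units i
# and j inhibit each other. Zero entries in S (e.g. along a tree or a
# 2D-ring topology, as in the next slides) cost nothing when both
# units are active together.
n = 8
S = torch.rand(n, n)
S = (S + S.T) / 2                 # symmetric inhibition strengths
S.fill_diagonal_(0.0)             # no self-inhibition
z = torch.randn(n)
penalty = z.abs() @ S @ z.abs()   # replaces the plain L1 term
```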
41. Invariant Features: Lateral Inhibition
Zeros in the S matrix have a tree structure.
42. Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology. Input patches are high-pass filtered.
43. Invariant Features: Lateral Inhibition
Non-zero values in S form a ring in a 2D topology. Left: no high-pass filtering of the input. Right: patch-level mean removal.
44. Invariant Features: Short-Range Lateral Excitation + L1
45. Disentangling the Explanatory Factors of Images
46. Separating
I used to think that recognition was all about eliminating irrelevant information while keeping the useful part: building invariant representations, eliminating irrelevant variabilities.
I now think that recognition is all about disentangling independent factors of variation: separating what from where; separating content from instantiation parameters.
Hinton's capsules; Karol Gregor's what-where auto-encoders.
47. Invariant Features through Temporal Constancy
An object is the cross-product of object type and instantiation parameters [Hinton 1981].
Figure: object type x object size (small, medium, large) [Karol Gregor et al.]
48. Invariant Features through Temporal Constancy
Architecture diagram: for each input frame $S^t, S^{t-1}, S^{t-2}$, an encoder infers a per-frame code $C_1^t$ plus a code $C_2$ shared across frames; a decoder with weights $W_1$ (applied to each $C_1^t$) and $W_2$ (applied to $C_2$) produces the predicted inputs.
49. Invariant Features through Temporal Constancy
C1 (where); C2 (what).
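A toy sketch of this what/where split under my own simplifying assumptions (not Gregor et al.'s exact architecture): each frame gets its own "where" code playing the role of C1, one "what" code shared across the short sequence plays the role of C2, and the decoder must rebuild every frame from the pair:

```python
import torch
import torch.nn as nn

# Toy what-where auto-encoder over a 3-frame sequence. Sizes illustrative.
d_in, d_where, d_what = 256, 16, 64
enc_where = nn.Linear(d_in, d_where)
enc_what = nn.Linear(d_in, d_what)
dec = nn.Linear(d_where + d_what, d_in)

frames = torch.randn(3, d_in)                    # S^t, S^{t-1}, S^{t-2}
c1 = enc_where(frames)                           # one "where" code per frame
c2 = enc_what(frames).mean(dim=0, keepdim=True)  # shared "what" code
recon = dec(torch.cat([c1, c2.expand(3, -1)], dim=1))
loss = ((recon - frames) ** 2).mean()            # temporal-constancy signal
```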
50. Generating from the Network
(Figure: input.)
51. What is the right criterion to train hierarchical feature extraction architectures?
52. Flattening the Data Manifold?
The manifold of all images of [...] is low-dimensional and highly curvy. Feature extractors should flatten the manifold.
53. Flattening the Data Manifold?
54. The Ultimate Recognition System
Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation.
Bottom-up and top-down information. Top-down: complex inference and disambiguation. Bottom-up: learns to quickly predict the result of the top-down inference.
Integrated supervised and unsupervised learning: capture the dependencies between all observed variables.
Compositionality: each stage has latent instantiation variables.