How a Part of the Brain Might or Might Not Work: A New Hierarchical Model of Object Recognition

by

Maximilian Riesenhuber
Diplom-Physiker, Universität Frankfurt, 1995

Submitted to the Department of Brain and Cognitive Sciences in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computational Neuroscience

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2000

© Massachusetts Institute of Technology 2000. All rights reserved.

Author: Department of Brain and Cognitive Sciences, May 2, 2000

Certified by: Tomaso Poggio, Uncas and Helen Whitaker Professor, Thesis Supervisor

Accepted by: Earl K. Miller, Co-Chair, Department Graduate Committee
How a Part of the Brain Might or Might Not Work: A New Hierarchical Model of Object Recognition

by

Maximilian Riesenhuber

Submitted to the Department of Brain and Cognitive Sciences on May 2, 2000, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computational Neuroscience

Abstract
The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated representations, extending in a natural way the model of simple to complex cells of Hubel and Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to explore the biological feasibility of this class of models to explain higher-level visual processing, such as object recognition in cluttered scenes. We describe a new hierarchical model, HMAX, that accounts well for this complex visual task, is consistent with several recent physiological experiments in inferotemporal cortex, and makes testable predictions. Key to achieving invariance and robustness to clutter is a MAX-like response function of some model neurons, which selects (an approximation to) the maximum activity over all the afferents, with interesting connections to “scanning” operations used in recent computer vision algorithms.
We then turn to the question of object recognition in natural (“continuous”) object classes, such as faces, which recent physiological experiments have suggested are represented by a sparse distributed population code. We performed two psychophysical experiments in which subjects were trained to perform subordinate-level discrimination in a continuous object class — images of computer-rendered cars — created using a 3D morphing system. By comparing the recognition performance of trained and untrained subjects, we could estimate the effects of viewpoint-specific training and infer properties of the object class-specific representation learned as a result of training. We then compared the experimental findings to simulations in HMAX, to investigate the computational properties of a population-based object class representation. We find experimental evidence, supported by modeling results, that training builds a viewpoint- and class-specific representation that supplements a pre-existing representation with lower shape discriminability but greater viewpoint invariance.
Finally, we show how HMAX can be extended in a straightforward fashion to perform object categorization and to support arbitrary class hierarchies. We demonstrate the capability of our scheme, called “Categorical Basis Functions” (CBF), with the example domain of cat/dog categorization, and apply it to study some recent findings in categorical perception.
Thesis Supervisor: Tomaso Poggio
Title: Uncas and Helen Whitaker Professor
Acknowledgments
Thanks are due to quite a few people who have contributed in
different ways to the gestation of
this thesis. First, of course, are my parents. I am extremely
grateful for their untiring support and
encouragement over the years — especially to my father for
convincing me to major in physics.
Second is Hans-Ulrich Bauer, who advised my Diplom thesis at the
Institute for Theoretical
Physics of the University of Frankfurt. Without Hans-Ulrich and
his urging to get my PhD in
computational neuroscience in the US I would never have applied
to MIT and would now probably
be a bitter physics PhD working for McKinsey. To him this thesis
is dedicated.
At MIT, I first want to thank my advisor, Tommy Poggio, for
warning me about the smog at
Caltech, and for being the best advisor I could imagine: providing a lot of independence while
being very encouraging and supportive. I consider exposure to his way of doing science one
of the biggest assets of my PhD training.
Then there are the people on my thesis committee, all of whom have provided valuable input
and guidance along the way. Special thanks are due to Peter
Dayan for a fine collaboration during
my first year that introduced me to quite a few new ideas. To
Earl Miller for being so open to collaborations with wacky theorists. To Mike Tarr for a nice News
& Views commentary [96] on our paper
[82], and a very stimulating visit which might lead to an even
more stimulating collaboration. . .
To Pawan Sinha and Gadi Geiger for introducing me to the
wonderful world of psychophysics
and providing invaluable advice when I ran my first experiment.
To David Freedman and Andreas Tolias for introducing me to the wonderful world of monkey
physiology and being great
collaborators. To Christian Shelton for the “mother of all
correspondence algorithms” that was so
instrumental in many parts of this thesis and beyond. To Valerie
Pires, without whose help the
psychophysics in chapter 4 would have been nothing more than
“suggestions for future work”,
and to Mary Pat Fitzgerald, “the mother of CBCL”, whose help in
dealing with the subtleties of the
MIT administration was (and continues to be) greatly
appreciated.
Then there is our fine department that really has it all “under
one roof”. I could not have
imagined a better place to get my PhD.
Last but not least I want to gratefully acknowledge the generous
support provided by a Gerald
J. and Marjorie J. Burnett Fellowship (1996–1998) and a
Merck/MIT Fellowship in Bioinformatics
(1998–2000) that enabled me to pursue the studies described in
this thesis.
Contents

1 Introduction 8

2 Hierarchical Models of Object Recognition in Cortex 11
  2.1 Introduction 11
  2.2 Results 15
  2.3 Discussion 21
  2.4 Methods 23
  2.5 Acknowledgments 25

3 Are Cortical Models Really Bound by the “Binding Problem”? 26
  3.1 Introduction: Visual Object Recognition 27
  3.2 Models of Visual Object Recognition and the Binding Problem 27
  3.3 A Hierarchical Model of Object Recognition in Cortex 29
  3.4 Binding without a problem 31
    3.4.1 Recognition of multiple objects 33
    3.4.2 Recognition in clutter 34
  3.5 Discussion 36
  3.6 Acknowledgments 39

4 The Individual is Nothing, the Class Everything: Psychophysics and Modeling of Recognition in Object Classes 40
  4.1 Introduction 41
  4.2 Experiment 1 43
    4.2.1 Methods 44
    4.2.2 Results 48
    4.2.3 Discussion 49
  4.3 Modeling: Representing Continuous Object Classes in HMAX 50
    4.3.1 The HMAX Model of Object Recognition in Cortex 51
    4.3.2 View-Dependent Object Recognition in Continuous Object Classes 51
    4.3.3 Discussion 59
  4.4 Experiment 2 60
    4.4.1 Methods 60
    4.4.2 Results 61
    4.4.3 The Model Revisited 63
    4.4.4 Discussion 63
  4.5 General Discussion 64

5 A note on object class representation and categorical perception 66
  5.1 Introduction 66
  5.2 Chorus of Prototypes (COP) 67
  5.3 A Novel Scheme: Categorical Basis Functions (CBF) 68
    5.3.1 An Example: Cat/Dog Classification 69
    5.3.2 Introduction of parallel categorization schemes 72
  5.4 Interactions between categorization and discrimination: Categorical Perception 72
    5.4.1 Categorical Perception in CBF 73
    5.4.2 Categorization with and without Categorical Perception 75
  5.5 COP or CBF? — Suggestion for Experimental Tests 78
  5.6 Conclusions 78

6 General Conclusions and Future Work 81
List of Figures

2-1 Invariance properties of one IT neuron 14
2-2 Sketch of the model 15
2-3 Illustration of the highly nonlinear shape tuning properties of the MAX mechanism 17
2-4 Response of a sample model neuron 19
2-5 Average neuronal responses to scrambled stimuli 21

3-1 Cartoon of the Poggio and Edelman model of view-based object recognition 30
3-2 Sketch of our hierarchical model of object recognition in cortex 32
3-3 Recognition of two objects simultaneously 34
3-4 Stimulus/background example 35
3-5 Model performance: Recognition in clutter 36

4-1 Natural objects, and artificial objects used in previous object recognition studies 42
4-2 The eight prototype cars used in the 8 car system 45
4-3 Training task for Experiments 1 and 2 46
4-4 Illustration of match/nonmatch pairs for Experiment 1 47
4-5 Testing task for Experiments 1 and 2 47
4-6 Average performance of the trained subjects on the test task of Experiment 1 49
4-7 Average performance of untrained subjects on the test task of Experiment 1 50
4-8 Our model of object recognition in cortex 52
4-9 Recognition performance of the model on the eight car morph space 53
4-10 Dependence of average (one-sided) rotation invariance on SSCU tuning width 55
4-11 Dependence of invariance range on the number of afferents to each SSCU 55
4-12 Dependence of invariance range on the number of SSCUs 56
4-13 Effect of addition of noise to the SSCU representation 57
4-14 Cat/dog prototypes 57
4-15 The “Other Class” effect 57
4-16 Car object class-specific features 58
4-17 Performance of the two-layer model 58
4-18 The 15 prototypes used in the 15 car system 61
4-19 Average performance of trained subjects in Experiment 2 62
4-20 Average performance of untrained subjects in Experiment 2 63

5-1 Cartoon of the CBF categorization scheme 70
5-2 Illustration of the cat/dog stimulus space 71
5-3 Response of the cat/dog categorization unit 71
5-4 Sketch of the model to explain the influence of experience with categorization tasks on object discrimination 74
5-5 Average responses over all morph lines for the two networks 76
5-6 Comparison of Euclidean distances of activation patterns 77
5-7 Output of the categorization unit trained on the cat/dog categorization task 80
5-8 Same as Fig. 5-7, but for a representation based on 30 units chosen by k-means 80
Chapter 1
Introduction
Tell your hairdresser that you are working on vision and he will
likely say “Vision? But that’s
easy!” Indeed, the apparent ease with which we perform object
recognition even in cluttered scenes
and under difficult viewing conditions belies the amazing
complexity of the visual system. This
became apparent with the groundbreaking studies of Hubel and
Wiesel [36, 37] in the primary
visual cortices of cats and monkeys in the late 50s and 60s.
Subsequently, many other visual areas
were discovered, with recent surveys listing over 30 visual
areas linked in an intricate and still
unresolved pattern [22, 35]. This complex connection scheme can be coarsely divided into two
pathways: the “What” pathway (the ventral stream), running from primary visual cortex, V1, via
V2 and V4 to inferotemporal cortex (IT), and the “Where” pathway (the dorsal stream), from V1 to V2,
(dorsal stream) from V1 to V2,
V3, MT, MST and other parietal areas [103]. In this framework,
the “What” pathway is specialized
for object recognition whereas the “Where” pathway is concerned
with spatial vision. Looking at
the ventral stream in more detail, Kobatake et al. [41] have
reported a progression of the complexity
of cells’ preferred features and receptive field sizes as one
progresses along the stream. While
neurons in V1 are tuned to oriented bars and have small
receptive fields, cells in IT appear to prefer
more complex visual patterns such as faces (for a review on face
cells, see [15]), and respond over
a wide range of positions and scales, pointing to a crucial role
of IT cortex in object recognition (as
confirmed by a great number of physiological, lesion and
neuropsychological studies [50, 94]).
These findings naturally prompted the question of how cells
tuned to views of complex objects
showing invariance to size and position changes in IT could
arise from small bar-tuned receptive
fields in V1, and how ultimately this neural substrate could be
used to perform object recognition.
In humans, psychophysical experiments had given rise to two main competing theories of object
recognition: the structural-description theory and the view-based theory (see [98] for a review of the
two theories and their experimental evidence). The former approach, whose main representative is
the recognition-by-components theory of Biederman [5], holds that object recognition proceeds by
decomposing an object into a view-independent, part-based description, while in the view-based
theory object recognition is based on the viewpoints in which objects have actually appeared.
Experimental evidence ([8, 48, 49, 97, 100], see also chapter 2) and computational considerations [19]
appear to favor the view-based theory. However, several challenges for view-based models remained;
Tarr and Bülthoff [98] very recently listed the following problems:*
1. tolerance to changes in viewing condition — while the system should allow fine shape discriminations, it should not require a new representation for every change in viewing condition;

2. class generalization and representation of categories — the system should generalize from familiar exemplars of a class to unfamiliar ones, and also be able to support categorization schemes.
This thesis presents a simple hierarchical model of object recognition in cortex (chapter 5 shows
how the model can be extended to object categorization in a straightforward way), HMAX [96],
that addresses these challenges in a biologically plausible system. In particular, chapter 2 (a reprint
of [82], copyright Nature America Inc., reprinted by permission) introduces HMAX and compares
the visual properties of view-tuned units in the model, especially with respect to translation, scaling
and rotation in depth of the visual scene, to those of neurons in inferotemporal cortex recorded
from in various experiments. Chapter 3 (a reprint of [81], copyright Cell Press, reprinted by
permission) shows that HMAX can even perform recognition in cluttered scenes, without having to
resort to special segmentation processes, which is of special interest in connection with the so-called
“Binding Problem”.
While the first two papers focus on “paperclip” objects that have been used extensively in
psychophysical [8, 48, 91], physiological [49] and computational studies [68], this object class has
several disadvantages (such as not being “nice” [107]) that make it unsuitable as a basis for
investigating recognition in natural object classes — the topic of chapter 4 (a reprint of [84]) —
where objects have similar 3D shape, such as faces. Instead, in chapters 4 and 5, stimuli for model
and experiment were generated using a novel 3D morphing system developed by Christian Shelton
[90] that allows us to generate morphed objects drawn from a space spanned by a set of prototype
objects, for instance cars (chapter 4), or cats and dogs (chapters 4 and 5). We show that the
recognition results obtainable in natural object classes represented using a population code, where
the activity over a group of units codes for the identity of an object (as suggested by recent
physiology studies [112, 115]), are quite comparable to those for individual objects represented by
“grandmother” units (in chapter 2); that is, the performance of HMAX does not appear to be
specific to a certain object class. In addition, simulations show that a population-based object
representation provides several computational advantages over a “grandmother” representation.
We further present experimental results from a psychophysical study in which we trained subjects,
using a discrimination paradigm, to build a representation of a novel object class. This
representation was then probed by examining how discrimination performance was affected by
viewpoint changes. We find experimental evidence, supported by the modeling results, that training
builds a viewpoint- and class-specific representation that supplements a pre-existing representation
with lower shape discriminability but greater viewpoint invariance. Chapter 5 (a reprint of [83])
finally shows how HMAX can be extended in a straightforward way to perform object
categorization, and to support arbitrary object categorization schemes, with interesting
opportunities for interactions between discrimination and categorization as observed in categorical
perception.

*They also listed the need for a simple mechanism to measure perceptual similarity, in order “to generalize between exemplars or between views” ([98], p. 9), which thus appears to be a corollary of solving the two problems listed above.
Chapter 2
Hierarchical Models of Object
Recognition in Cortex
Abstract
The classical model of visual processing in cortex is a
hierarchy of increasingly sophisticated
representations, extending in a natural way the model of simple
to complex cells of Hubel and
Wiesel. Somewhat surprisingly, little quantitative modeling has
been done in the last 15 years to
explore the biological feasibility of this class of models to
explain higher level visual processing,
such as object recognition. We describe a new hierarchical model
that accounts well for this
complex visual task, is consistent with several recent
physiological experiments in inferotemporal
cortex and makes testable predictions. The model is based on a
novel MAX-like operation on the
inputs to certain cortical neurons which may have a general role
in cortical function.
2.1 Introduction
The recognition of visual objects is a fundamental cognitive
task performed effortlessly by the brain
countless times every day while satisfying two essential
requirements: invariance and specificity.
In face recognition, for example, we can recognize a specific
face among many, while being rather
tolerant to changes in viewpoint, scale, illumination, and
expression. The brain performs this and
similar object recognition and detection tasks fast [101] and
well. But how?
Early studies [7] of macaque inferotemporal cortex (IT), the
highest purely visual area in the
ventral visual stream thought to have a key role in object
recognition [103] reported cells tuned to
views of complex objects such as a face, i.e., the cells
discharged strongly to the view of a face but
very little or not at all to other objects. A hallmark of these
cells was the robustness of their firing
to stimulus transformations such as scale and position
changes.
This finding presented an interesting question: How could these
cells show strongly differing
responses to similar stimuli (as, e.g., two different faces),
that activate the retinal photoreceptors in
similar ways, while showing response constancy to scaled and
translated versions of the preferred
stimulus that cause very different activation patterns on the
retina?
This puzzle was similar to one faced by Hubel and Wiesel on a
much smaller scale two decades
earlier when they recorded from simple and complex cells in cat
striate cortex [36]: both cell types
responded strongly to oriented bars, but whereas simple cells
exhibited small receptive fields with
a strong phase dependence, that is, with distinct excitatory and
inhibitory subfields, complex cells
had larger receptive fields and no phase dependence. This led
Hubel and Wiesel to propose a
model in which simple cells with their receptive fields in
neighboring parts of space feed into
the same complex cell, thereby endowing that complex cell with a
phase-invariant response. A
straightforward (but highly idealized) extension of this scheme
would lead all the way from simple
cells to “higher order hypercomplex cells” [37].
Starting with the Neocognitron [25] for translation-invariant object recognition, several
hierarchical models of shape processing in the visual system have subsequently been proposed to
explain how transformation-invariant cells tuned to complex objects can arise from simple cell
inputs [64, 111]. Those models, however, were not quantitatively specified or were not compared
with specific experimental data. Alternative models for translation- and scale-invariant object
recognition have been proposed, based on a controlling signal that either appropriately reroutes
incoming signals, as in the “shifter” circuit [2] and its extension [62], or modulates neuronal
responses, as in the “gain-field” models for invariant recognition [78, 88]. While recent
experimental studies [14, 56] have indicated that cells in macaque area V4 can show an
attention-controlled shift or modulation of their receptive field in space, there is still little evidence
that this mechanism is used to perform translation-invariant object recognition, or whether a
similar mechanism applies to other transformations (such as scaling) as well.
The basic idea of the hierarchical model sketched by Perrett and
Oram [64] was that invariance
to any transformation (not just image-plane transformations as
in the case of the Neocognitron
[25]) could be built up by pooling over afferents tuned to
various transformed versions of the
same stimulus. Indeed it was shown earlier [68] that
viewpoint-invariant object recognition was
possible using such a pooling mechanism. A (Gaussian RBF)
learning network was trained with
individual views (rotated around one axis in 3D space) of
complex, paperclip-like objects to achieve
3D rotation-invariant recognition of these objects. In the network the resulting view-tuned units
fed into a view-invariant unit; they effectively represented
prototypes between which the learning
network interpolated to achieve viewpoint-invariance.
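The pooling-and-interpolation idea behind this (Gaussian RBF) network can be sketched in a few lines. This is an illustrative toy, not the original implementation: the feature vectors stand in for view-derived measurements, and all dimensions and parameter values are made up.

```python
import numpy as np

def gaussian_rbf(x, center, sigma):
    """Response of a view-tuned unit: Gaussian tuning around a stored view."""
    return np.exp(-np.sum((x - center) ** 2) / (2 * sigma ** 2))

# Hypothetical feature vectors standing in for four training views of one
# object (in the actual network these were derived from paperclip images).
rng = np.random.default_rng(0)
stored_views = [rng.normal(size=10) for _ in range(4)]
sigma = 2.0

def view_invariant_unit(x):
    """View-invariant output: a sum over view-tuned units, which in effect
    interpolates between the stored prototype views."""
    return sum(gaussian_rbf(x, v, sigma) for v in stored_views)

# A stimulus near one stored view drives the invariant unit strongly,
# while an unrelated pattern drives it only weakly.
near = stored_views[1] + 0.1 * rng.normal(size=10)
far = 3 * rng.normal(size=10)
assert view_invariant_unit(near) > view_invariant_unit(far)
```

The summation over Gaussian units is what lets such a network respond to views it was never trained on, as long as they fall between stored prototypes.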
There is now quantitative psychophysical [8, 48, 95] and
physiological evidence [6, 42, 49] for
the hypothesis that units tuned to full or partial views are
probably created by a learning process
and also some hints that the view-invariant output is in some
cases explicitly represented by (a
small number of) individual neurons [6, 49, 66].
A recent experiment [48, 49] required monkeys to perform an object recognition task using
novel “paperclip” stimuli the monkeys had never seen before. Here, the monkeys were required
to recognize views of “target” paperclips rotated in depth among views of a large number of
“distractor” paperclips of very similar structure, after being trained on a restricted set of views of
each target object. Following very extensive training on a set of paperclip objects, neurons were
found in anterior IT that selectively responded to the object views seen during training.
This design avoided two problems associated with previous physiological studies investigating
the mechanisms underlying view-invariant object recognition: First, by training the monkey to
recognize novel stimuli with which it had not had any visual experience, instead of objects (e.g.,
faces) with which the monkey was quite familiar, it was possible to estimate the degree of
view-invariance derived from just one object view. Moreover, the use of a large number of
distractor objects made it possible to define view-invariance with respect to the distractor objects.
This is a key point, since only by comparing the response of a neuron to transformed versions of
its preferred stimulus with the neuron’s response to a range of (similar) distractor objects can the
VTU’s (view-tuned unit’s) invariance range be determined — just measuring the tuning curve is
not sufficient.
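This definition of invariance, where a transformed target view only counts as recognized if it still evokes a stronger response than the best distractor, can be made concrete with toy numbers (all values below are invented for illustration, not data from [49]):

```python
# Toy tuning curve of a view-tuned unit over rotation in depth (degrees,
# training view at 0), plus its responses to a set of distractor objects.
# All numbers are made up for illustration.
rotations = [-60, -40, -20, 0, 20, 40, 60]
target_response = [2, 8, 25, 40, 30, 10, 3]       # spikes/s to rotated target
distractor_responses = [5, 7, 4, 9, 6]            # spikes/s to distractors

best_distractor = max(distractor_responses)

# Invariance range: rotations at which the transformed target still beats
# the best distractor. The tuning curve alone cannot yield this range;
# the distractor baseline is essential.
invariant_views = [r for r, resp in zip(rotations, target_response)
                   if resp > best_distractor]
print(invariant_views)  # -> [-20, 0, 20, 40]
```

Note that the invariance range is asymmetric and narrower than the tuning curve's full extent: views that evoke some response but less than the best distractor do not count as invariantly recognized.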
The study [49] established (Fig. 2-1) that after training with
just one object view there are cells
showing some degree of limited invariance to 3D rotation around
the training view, consistent
with the view-interpolation model [68]. Moreover, the cells also
exhibit significant invariance to
translation and scale changes, even though the object was only
previously presented at one scale
and position.
These data put in sharp focus, and in quantitative terms, the question of the circuitry
underlying the properties of the view-tuned cells. While the original model [68] described how
VTUs could be used to build view-invariant units, it did not specify how the view-tuned units
could come about. The key problem is thus to explain, in terms of biologically plausible
mechanisms, the VTUs’ invariance to translation and scaling obtained from just one object view,
which arises from a trade-off between selectivity to a specific object and relative tolerance (i.e.,
robustness of firing) to position and scale changes. Here, we describe a model that conforms to the
main anatomical and physiological constraints, reproduces the invariance data described above,
and makes predictions for experiments on the view-tuned subpopulation of IT cells. Interestingly,
the model is also consistent with recent data from several other experiments regarding recognition
in context [54], or the presence of multiple objects in a cell’s receptive field [89].
[Figure 2-1: data panels. (a) Spike rate vs. rotation around the y-axis (60–180 degrees). (b) Spike rate for the 10 best distractors. (c) Response vs. stimulus size in degrees of visual angle (1.90–5.60; asterisk marks the training size). (d) Response vs. position (azimuth and elevation, x = 2.25 degrees), plotted as (target response)/(mean of best distractors).]
Figure 2-1: Invariance properties of one neuron (modified from Logothetis et al. [49]). The figure shows the response of a single cell found in anterior IT after training the monkey to recognize paperclip-like objects. The cell responded selectively to one view of a paperclip and showed limited invariance around the training view to rotation in depth, along with significant invariance to translation and size changes, even though the monkey had only seen the stimulus at one position and scale during training. (a) shows the response of the cell to rotation in depth around the preferred view. (b) shows the cell’s response to the 10 distractor objects (other paperclips) that evoked the strongest responses. The lower plots show the cell’s response to changes in stimulus size, (c) (asterisk shows the size of the training view), and position, (d) (using the 1.9° size), resp., relative to the mean of the 10 best distractors. Defining “invariance” as yielding a higher response to transformed views of the preferred stimulus than to distractor objects, neurons exhibit an average rotation invariance of 42° (during training, stimuli were actually rotated by ±10° in depth to provide full 3D information to the monkey; therefore, the invariance obtained from a single view is likely to be smaller), and translation and scale invariance on the order of ±2° and ±1 octave around the training view, resp. (J. Pauls, personal communication).
[Figure 2-2: schematic of the hierarchy — simple cells (S1), complex cells (C1), “composite feature” cells (S2), “complex composite” cells (C2), and view-tuned cells — with “weighted sum” and “MAX” pooling stages indicated.]
Figure 2-2: Sketch of the model. The model is a hierarchical extension of the classical paradigm [36] of building complex cells from simple cells. It consists of a hierarchy of layers with linear ("S" units in the notation of Fukushima [25], performing template matching, solid lines) and non-linear operations ("C" pooling units [25], performing a "MAX" operation, dashed lines). The non-linear MAX operation — which selects the maximum of the cell's inputs and uses it to drive the cell — is key to the model's properties and is quite different from the basically linear summation of inputs usually assumed for complex cells. These two types of operations respectively provide pattern specificity and invariance (to translation, by pooling over afferents tuned to different positions, and to scale (not shown), by pooling over afferents tuned to different scales).
2.2 Results
The model is based on a simple hierarchical feedforward
architecture (Fig. 2-2). Its structure reflects
the assumption that invariance to position and scale on the one
hand and feature specificity on
the other hand must be built up through separate mechanisms: to
increase feature complexity, a
suitable neuronal transfer function is a weighted sum over
afferents coding for simpler features,
i.e., a template match. But is summing over differently weighted
afferents also the right way to
increase invariance?
From the computational point of view, the pooling mechanism should produce robust feature detectors, i.e., measure the presence of specific features without being confused by clutter and context in the receptive field. Consider a complex cell, as found in primary visual cortex, whose preferred stimulus is a bar of a certain orientation to which the cell responds in a phase-invariant way [36]. Along the lines of the original complex cell model [36], one could think of the complex cells as receiving input from an array of simple cells at different locations, pooling over which results in
the position-invariant response of the complex cell.
Two alternative idealized pooling mechanisms are: linear summation ("SUM") with equal weights (to achieve an isotropic response) and a nonlinear maximum operation ("MAX"), where the strongest afferent determines the response of the postsynaptic unit. In both cases, if only one bar is present in the receptive field, the response of a model complex cell is position invariant. The response level would signal how similar the stimulus is to the afferents' preferred feature. Consider now the case of a complex stimulus, e.g., a paperclip, in the visual field. In the linear summation case, complex cell response would still be invariant (as long as the stimulus stays in the cell's receptive field), but the response level would no longer allow one to infer whether there actually was a bar of the preferred orientation somewhere in the complex cell's receptive field, as the output signal is a sum over all the afferents. That is, feature specificity is lost. In the MAX case, however, the response would be determined by the most strongly activated afferent and hence would signal the best match of any part of the stimulus to the afferents' preferred feature. This idealized example suggests that the MAX mechanism is capable of providing a more robust response in the case of recognition in clutter or with multiple stimuli in the receptive field (cf. below). Note that a SUM response with saturating nonlinearities on the inputs seems too brittle, since it requires a case-by-case adjustment of the parameters, depending on the activity level of the afferents.
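The contrast between the two pooling rules can be made concrete with a toy simulation. The afferent activation values below are invented for illustration; only the SUM/MAX definitions follow the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical afferent activations: sixteen position-tuned afferents, with
# the preferred bar producing a strong match at exactly one position.
bar_alone = np.zeros(16)
bar_alone[5] = 1.0                      # preferred bar at one position
clutter = rng.uniform(0.0, 0.4, 16)     # many weakly activated afferents
clutter[5] = 1.0                        # preferred bar still present

def sum_pool(x):
    """Linear pooling with equal weights (the SUM mechanism)."""
    return float(x.sum())

def max_pool(x):
    """Pooling by the strongest afferent (the MAX mechanism)."""
    return float(x.max())

# SUM: the response changes drastically once clutter is added, so the output
# level no longer signals whether the preferred bar is present at all.
print(sum_pool(bar_alone), sum_pool(clutter))

# MAX: the best-matching afferent dominates in both cases, so the response
# still signals the presence of the preferred bar (feature specificity kept).
print(max_pool(bar_alone), max_pool(clutter))
```

With the clutter display, the SUM output is several times the bar-alone output, while the MAX output is identical in both cases.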
Equally critical is the inability of the SUM mechanism to achieve size invariance: Suppose that the afferents to a "complex" cell (which now could be a cell in V4 or IT, for instance) show some degree of size and position invariance. If the "complex" cell were now stimulated with the same object but at successively increasing sizes, an increasing number of afferents would become excited by the stimulus (unless the afferents showed no overlap in space or scale), and consequently the excitation of the "complex" cell would increase along with the stimulus size, even though the afferents themselves show size invariance (this is borne out in simulations using a simplified two-layer model [79])! For the MAX mechanism, however, cell response would show little variation even as stimulus size increased, since the cell's response would be determined just by the best-matching afferent.
These considerations (supported by quantitative simulations of the model, described below) suggest that a sensible way of pooling responses to achieve invariance is via a nonlinear MAX function, that is, by implicitly scanning (see Discussion) over afferents of the same type that differ in the parameter of the transformation to which the response should be invariant (e.g., feature size for scale invariance), and then selecting the best-matching of those afferents. Note that these considerations apply to the case where different afferents to a pooling cell, e.g., those looking at different parts of space, are likely to be responding to different objects (or different parts of the same object) in the visual field (as is the case with cells in lower visual areas with their broad shape tuning). Here, pooling by combining afferents would mix up signals caused by different stimuli. However, if the afferents are specific enough to only respond to one pattern, as one expects in the
[Figure 2-3 bar plots: panels (a) and (b); y-axis: response, 0–1; bar groups labeled MAX, expt., and SUM.]
Figure 2-3: Illustration of the highly nonlinear shape tuning properties of the MAX mechanism. (a) Experimentally observed responses of IT cells obtained using a "simplification procedure" [113] designed to determine "optimal" features (responses normalized so that the response to the preferred stimulus is equal to 1). In that experiment, the cell originally responds quite strongly to the image of a "water bottle" (leftmost object). The stimulus is then "simplified" to its monochromatic outline, which increases the cell's firing, and further to a paddle-like object, consisting of a bar supporting an ellipse. While this object evokes a strong response, the bar or the ellipse alone produce almost no response at all (figure used by permission). (b) Comparison of experiment and model. Green bars show the responses of the experimental neuron from (a). Blue and red bars show the response of a model neuron tuned to the stem-ellipsoidal base transition of the preferred stimulus. The model neuron is at the top of a simplified version of the model shown in Fig. 2-2, where there are only two types of S1 features at each position in the receptive field, tuned to the left and right side of the transition region, resp., which feed into C1 units that pool using a MAX function (blue bars) or a SUM function (red bars). The model neuron is connected to these C1 units so that its response is maximal when the experimental neuron's preferred stimulus is in its receptive field.
final stages of the model, then pooling by using a weighted sum, as in the RBF network [68], where VTUs tuned to different viewpoints were combined to interpolate between the stored views, is advantageous.

MAX-like mechanisms at some stages of the circuitry appear to be compatible with recent neurophysiological data. For instance, it has been reported [89] that when two stimuli are brought into the receptive field of an IT neuron, that neuron's response appears to be dominated by the stimulus that produces a higher firing rate when presented in isolation to the cell — just as expected if a MAX-like operation is performed at the level of this neuron or its afferents. Theoretical investigations into possible pooling mechanisms for V1 complex cells also support a maximum-like pooling mechanism (K. Sakai & S. Tanaka, Soc. Neurosci. Abs., 23, 453, 1997). Additional indirect support for a MAX mechanism comes from studies using a "simplification procedure" [113] or "complexity reduction" [47] to determine the preferred features of IT cells, i.e., the stimulus components that are responsible for driving the cell. These studies commonly find a highly nonlinear tuning of IT cells (Fig. 2-3 (a)). Such tuning is compatible with the MAX response function (Fig. 2-3 (b), blue bars). Note that a linear model (Fig. 2-3 (b), red bars) cannot reproduce this strong response change for small changes in the input image.
In our model of view-tuned units (Fig. 2-2), the two types of operations, scanning and template matching, are combined in a hierarchical fashion to build up complex, invariant feature detectors from small, localized, simple cell-like receptive fields in the bottom layer, which receive input from the model "retina." There need not be a strict alternation of these two operations: connections can skip levels in the hierarchy, as in the direct C1→C2 connections of the model in Fig. 2-2.
The question remains whether the proposed model can indeed
achieve response selectivity and
invariance compatible with the results from physiology. To
investigate this question, we looked at
the invariance properties of 21 view-tuned units in the model,
each tuned to a view of a different,
randomly selected paperclip, as used in the experiment [49].
Figure 2-4 shows the response of one model view-tuned unit to 3D rotation, scaling and translation around its preferred view (see Methods). The unit responds maximally to the training view, with the response gradually falling off as the stimulus is transformed away from the training view. As in the experiment, we can determine the invariance range of the VTU by comparing the response to the preferred stimulus to the responses to the 60 distractors. The invariance range is then defined as the range over which the model unit's response is greater than to any of the distractor objects. Thus, the model VTU shown in Fig. 2-4 shows rotation invariance of 24°, scale invariance of 2.6 octaves and translation invariance of 4.7° of visual angle. Averaging over all 21 units, we obtain average rotation invariance over 30.9°, scale invariance over ��� octaves and translation invariance over 4.6°.
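The invariance-range computation can be made explicit with a toy tuning curve. The Gaussian fall-off and distractor level below are invented (chosen so that the toy unit reproduces the 24° rotation range of the example unit); only the 50° to 130° test range in 4° steps follows the Methods:

```python
import numpy as np

# Tested viewpoints: 50 deg to 130 deg in steps of 4 deg (training view: 90 deg).
angles = np.arange(50, 131, 4)

# Illustrative tuning curve and distractor level (invented values): a Gaussian
# fall-off around the training view, plus the best response over the distractors.
target_resp = np.exp(-((angles - 90) / 15.0) ** 2)
best_distractor = 0.45

# Invariance range: the span of viewpoints over which the unit responds more
# strongly to its preferred object than to any distractor.
invariant = angles[target_resp > best_distractor]
rotation_invariance = int(invariant.max() - invariant.min())
print(rotation_invariance)   # 24 (78 deg to 102 deg)
```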
Units show invariance around the training view, of a range in good agreement with the experimentally observed values. Some units (5/21), an example of which is given in Fig. 2-4 (d), also show tuning for pseudo-mirror views (obtained by rotating the preferred paperclip by 180° in depth, which produces a pseudo-mirror view of the object due to the paperclips' minimal self-occlusion), as observed in some experimental neurons [49].
While the simulation and experimental data presented so far dealt with object recognition settings in which one object was presented in isolation, this is rarely the case in normal object recognition settings. More commonly, the object to be recognized is situated in front of some background or appears together with other objects, all of which are to be ignored if the object is to be recognized successfully. More precisely, in the case of multiple objects in the receptive field, the responses of the afferents feeding into a VTU tuned to a certain object should be affected as little as possible by the presence of other "clutter objects."
The MAX response function posited above for the pooling
mechanism to achieve invariance has
the right computational properties to perform recognition in
clutter: If the VTU’s preferred object
strongly activates the VTU’s afferents, then it is unlikely that
other objects will interfere, as they
tend to activate the afferents less and hence will not usually
influence the response due to the MAX
response function. In some cases (such as when there are
occlusions of the preferred feature, or one
of the “wrong” afferents has a higher activation) clutter, of
course, can affect the value provided
[Figure 2-4 plots: panel (a) response vs. stimulus size (16–160 pixels, with distractor-response inset), panel (b) response vs. viewing angle (50°–130°), panel (c) response vs. x and y translation (deg), panel (d) response vs. viewing angle (0°–270°); all responses on a 0–1 scale.]
Figure 2-4: Responses of a sample model neuron to different transformations of its preferred stimulus. The different panels show the same neuron's response to (a) varying stimulus sizes (inset shows response to 60 distractor objects, selected randomly from the paperclips used in the physiology experiments [49]), (b) rotation in depth and (c) translation. Training size was 64 × 64 pixels, corresponding to 2° of visual angle. (d) shows another neuron's response to pseudo-mirror views (cf. text), with the dashed line indicating the neuron's response to the "best" distractor.
by the MAX mechanism, thereby reducing the quality of the match at the final stage and thus the strength of the VTU response. It is clear that to achieve the highest robustness to clutter, a VTU should only receive input from cells that are strongly activated by its preferred stimulus (i.e., from cells that are relevant to the definition of the object).
In the version of the model described so far, the penultimate layer contained only 10 cells corresponding to 10 different features, which turned out to be sufficient to achieve invariance properties as found in the experiment. Each VTU in the top layer was connected to all the afferents, and hence robustness to clutter is expected to be relatively low. Note that in order to connect a VTU to only the subset of the intermediate feature detectors it receives strong input from, the number of afferents should be large enough to achieve the desired response specificity.
The straightforward solution is to increase the number of features. Even with a fixed number of different features in S1, the dictionary of S2 features can be expanded by increasing the number and type of afferents to individual S2 cells (see Methods). In this "many feature" version of the model, the invariance ranges for a low number of afferents are already comparable to the experimental ranges — if each VTU is connected to the 40 (out of 256) C2 cells that are most strongly excited by its preferred stimulus, model VTUs show an average scale invariance over ��� octaves, rotation invariance over 36.2° and translation invariance over 4.4°. For the maximum of 256 afferents to each cell, cells are rotation invariant over an average of 47°, scale invariant over 2.4 octaves and translation invariant over 4.7°.
Simulations show [81] that this model is capable of performing recognition in context: Using input displays that contain the neuron's preferred clip as well as another, distractor, clip, the model is able to correctly recognize the preferred clip in 90% of the cases (for 40/256 afferents to each neuron; the maximum rate is 94% for 18 afferents, dropping to 55% for 256/256 afferents, compared to 40% in the original version of the model with 10 C2 units). That is, the addition of the second clip interfered with the activation caused by the first clip alone so much that in 10% of the cases the response to the two-clip display containing the preferred clip fell below the response to one of the distractor clips. This reduction of the response to the two-stimulus display compared to the response to the stronger stimulus alone has also been found in experimental studies [86, 89].
The question of object recognition in the presence of a background object was explored experimentally in a recent study [54], where a monkey had to discriminate (polygonal) foreground objects irrespective of the (polygonal) background they appeared with. Recordings of IT neurons showed that for the stimulus/background condition, neuronal response on average was reduced to a quarter of the response to the foreground object alone, while the monkey's behavioral performance dropped much less. This is compatible with simulations in the model [81] that show that even though a unit's firing rate is strongly affected by the addition of the background pattern, it is still in most cases well above the firing rate evoked by distractor objects, allowing the foreground
[Figure 2-5, panel (b): average response (0–1) vs. number of tiles (1, 4, 16, 64, 256); panel (a): example scrambled stimulus.]
Figure 2-5: Average responses of neurons in the many feature version of the model to scrambled stimuli. (a) Example of a scrambled stimulus. The images (��� × ��� pixels) were created by subdividing the preferred stimulus of each neuron into 4, 16, 64, and 256, resp., "tiles" and randomly shuffling the tiles to create a scrambled image. (b) Average response of the 21 model neurons (with 40/256 afferents, as above) to the scrambled stimuli (solid blue curve), in comparison to the average normalized responses of IT neurons to scrambled stimuli (scrambled pictures of trees) reported in a very recent study [108] (dashed green curve).
object to be recognized successfully.
Our model relies on decomposing images into features. Should it then be fooled into confusing a scrambled image with the unscrambled original? Superficially, one may be tempted to guess that scrambling an image into pieces larger than the features should indeed fool the model. Simulations (see Fig. 2-5) show that this is not the case. The reason lies in the large dictionary of filters/features used, which makes it practically impossible to scramble the image in such a way that all features are preserved, even for a low number of features. Responses of model units drop precipitously as the image is scrambled into progressively finer pieces, as confirmed very recently in a physiology experiment [108] of which we became aware after obtaining this prediction from the model.
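The scrambling manipulation itself is simple to reproduce. A sketch, with the tile counts of Fig. 2-5 and an image size assumed to be 128 × 128 purely for illustration:

```python
import numpy as np

def scramble(img, n_tiles, rng):
    """Subdivide a square image into n_tiles square tiles and shuffle them."""
    n = int(np.sqrt(n_tiles))            # tiles per side (4 -> 2x2, 16 -> 4x4, ...)
    h = img.shape[0] // n                # tile edge length in pixels
    tiles = [img[i*h:(i+1)*h, j*h:(j+1)*h].copy()
             for i in range(n) for j in range(n)]
    rng.shuffle(tiles)                   # random permutation of the tiles
    out = np.empty_like(img)
    for k, t in enumerate(tiles):
        i, j = divmod(k, n)
        out[i*h:(i+1)*h, j*h:(j+1)*h] = t
    return out

rng = np.random.default_rng(1)
img = rng.random((128, 128))             # stand-in for a preferred stimulus
scrambled = {n: scramble(img, n, rng) for n in (4, 16, 64, 256)}

# The pixel content is preserved exactly; only the spatial arrangement is
# destroyed, which is what breaks the model's composite features.
print(all(np.isclose(s.sum(), img.sum()) for s in scrambled.values()))  # True
```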
2.3 Discussion
We briefly outline the computational roots of the hierarchical model we described, discuss how the MAX operation could be implemented by cortical circuits, and remark on the role of features and invariances in the model.
A key operation in several recent computer vision algorithms for the recognition and classification of objects [87, 92] is to scan a window across an image, through both position and scale, in order to analyze at each step a subimage – for instance by providing it to a classifier that decides whether the subimage represents the object of interest. Such algorithms have been successful in achieving invariance to image plane transformations such as translation and scale. In addition, this brute-force scanning strategy eliminates the need to segment the object of interest before recognition: segmentation, even in complex and cluttered images, is routinely achieved as a byproduct of
recognition. The computational assumption that originally
motivated the model described in this
paper was indeed that a MAX-like operation may represent the
cortical equivalent of the “window
of analysis” in machine vision to scan through and select input
data. Unlike a centrally controlled
sequential scanning operation, a mechanism like the MAX
operation that locally and automatically
selects a relevant subset of inputs seems biologically
plausible. A basic and pervasive operation
in many computational algorithms — not only in computer vision —
is the search and selection
of a subset of data. Thus it is natural to speculate that a
MAX-like operation may be replicated
throughout the cortex.
Simulations of a simplified two-layer version of the model [79], using soft-maximum approximations to the MAX operation (see Methods) in which the strength of the nonlinearity can be adjusted by a parameter, show that its basic properties are preserved and structurally robust. But how is an approximation of the MAX operation realized by neurons? It seems that it could be implemented by several different, biologically plausible circuitries [1, 13, 17, 32, 44]. The most likely hypothesis is that the MAX operation arises from cortical microcircuits of lateral, possibly recurrent, inhibition between neurons in a cortical layer. An example is provided by the circuit proposed for gain control and relative motion detection in the visual system of the fly [76], based on feedforward (or recurrent) shunting presynaptic (or postsynaptic) inhibition by "pool" cells. One of its key elements, in addition to shunting inhibition (an equivalent operation may be provided by linear inhibition deactivating NMDA receptors), is a nonlinear transformation of the individual signals due to synaptic nonlinearities or to active membrane properties. The circuit performs a gain control operation and — for certain values of the parameters — a MAX-like operation. "Softmax" circuits have been proposed in several recent studies [34, 45, 61] to account for similar cortical functions. Together with adaptation mechanisms (underlying very short-term depression [1]), the circuit may be capable of pseudo-sequential search in addition to selection.
Our novel claim here is that a MAX-like operation is a key mechanism for object recognition in the cortex. The model described in this paper — including the stage from view-tuned to view-invariant units [68] — is a purely feedforward hierarchical model. Backprojections – well known to exist abundantly in cortex and to play a key role in other models of cortical function [59, 75] – are not needed for its basic performance, but they are probably essential for the learning stage and for known top-down effects — including attentional biases [77] — on visual recognition, which can be naturally grafted onto the inhibitory softmax circuits (see [61]) described earlier.
In our model, recognition of a specific object is invariant over a range of scales (and positions) after training with a single view at one scale, because its representation is based on features invariant to these transformations. View invariance, on the other hand, requires training with several views [68], because individual features sharing the same 2D appearance can transform very differently under 3D rotation, depending on the 3D structure of the specific object. Simulations show that the
model's performance is not specific to the class of paperclip objects: recognition results are similar, e.g., for computer-rendered images of cars (and other objects).
From a computational point of view, the class of models we have described can be regarded as a hierarchy of conjunctions and disjunctions. The key aspect of our model is to identify the disjunction stage with the build-up of invariances and to implement it through a MAX-like operation. At each conjunction stage the complexity of the features increases, and at each disjunction stage so does their invariance. At the last level – the C2 layer in this paper – it is only the presence and strength of individual features, and not their relative geometry in the image, that matters. The dictionary of features at that stage is overcomplete, so that the activities of the units measuring each feature's strength, independently of their precise location, can still yield a unique signature for each visual pattern (cf. the SEEMORE system [52]).
The architecture we have described shows that this approach is consistent with available experimental data and embeds it in a class of models that is a natural extension of the hierarchical models first proposed by Hubel and Wiesel.
2.4 Methods
Basic model parameters. Patterns on the model "retina" (of 160 × 160 pixels — which corresponds to a 5° receptive field size (the literature [41] reports an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) are first filtered through a layer (S1) of simple cell-like receptive fields (first derivatives of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90° and 135°, with standard deviations of 1.75 to 7.25 pixels in steps of 0.5 pixels; S1 filter responses were rectified dot products with the image patch falling into their receptive field, i.e., the output s¹_j of an S1 cell with preferred stimulus w_j whose receptive field covers an image patch I_j is s¹_j = |w_j · I_j|). Receptive field (RF) centers densely sample the input retina. Cells in the next (C1) layer each pool S1 cells of the same orientation over eight pixels of the visual field in each dimension and over all scales, using the MAX response function, i.e., the output c¹_i of a C1 cell with afferents s¹_j is c¹_i = max_j s¹_j.
This pooling range was chosen for simplicity — invariance properties of cells were robust for different choices of pooling ranges (cf. below). Different C1 cells were then combined in higher layers, either by combining C1 cells tuned to different features to give S2 cells responding to co-activations of C1 cells tuned to different orientations, or to yield C2 cells responding to the same feature as the C1 cells but with bigger receptive fields. In the simple version illustrated here, the S2 layer contains six features (all pairs of orientations of C1 cells looking at the same part of space) with a Gaussian transfer function (σ = 1, centered at 1, i.e., the response s²_k of an S2 cell receiving input from C1 cells c¹_m, c¹_n with receptive fields in the same location but responding to different orientations is s²_k = exp(−((c¹_m − 1)² + (c¹_n − 1)²)/(2σ²))), yielding a total of 10 cells in the C2 layer.
Here, C2 units feed into the view-tuned units, but in principle,
more layers of S and C units are possible.
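The S1→C1→S2→C2 pipeline just described can be sketched in NumPy. This is a toy reconstruction, not the thesis code: the 7 × 7 filter size, Gaussian envelope width, single filter scale, and small input image are invented for brevity, while the rectified template match (S1), MAX pooling (C1), Gaussian tuning to orientation pairs (S2), global MAX, and the total of 10 C2 cells follow the Methods.

```python
import numpy as np

def oriented_filter(theta, size=7):
    """First derivative of a Gaussian at orientation theta (zero-sum, norm 1)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) + y * np.sin(theta)
    w = u * np.exp(-(x ** 2 + y ** 2) / 8.0)
    w -= w.mean()                        # zero-sum
    return w / np.linalg.norm(w)         # square-normalized to 1

def s1_layer(image, filters):
    """S1: rectified dot product of each filter with every image patch."""
    size = filters.shape[1]
    n = image.shape[0] - size + 1
    out = np.zeros((len(filters), n, n))
    for f, w in enumerate(filters):
        for i in range(n):
            for j in range(n):
                out[f, i, j] = abs(np.sum(w * image[i:i + size, j:j + size]))
    return out

def c1_layer(s1, pool=8):
    """C1: MAX over a pool x pool spatial neighborhood, per orientation."""
    F, H, W = s1.shape
    out = np.zeros((F, H // pool, W // pool))
    for i in range(H // pool):
        for j in range(W // pool):
            out[:, i, j] = s1[:, i * pool:(i + 1) * pool,
                              j * pool:(j + 1) * pool].max(axis=(1, 2))
    return out

def s2_layer(c1, sigma=1.0):
    """S2: Gaussian tuning (centered at 1) on pairs of orientations at a spot."""
    F = c1.shape[0]
    pairs = [(m, n) for m in range(F) for n in range(m + 1, F)]  # 6 pairs, F=4
    out = np.zeros((len(pairs),) + c1.shape[1:])
    for k, (m, n) in enumerate(pairs):
        out[k] = np.exp(-((c1[m] - 1) ** 2 + (c1[n] - 1) ** 2)
                        / (2 * sigma ** 2))
    return out

def c2_layer(s2, c1):
    """C2: global MAX over position; 6 pair features + 4 orientations = 10."""
    return np.concatenate([s2.max(axis=(1, 2)), c1.max(axis=(1, 2))])

filters = np.stack([oriented_filter(t) for t in np.deg2rad([0, 45, 90, 135])])
image = np.random.default_rng(0).random((39, 39))   # toy "retina"
c1 = c1_layer(s1_layer(image, filters))
c2 = c2_layer(s2_layer(c1), c1)
print(c2.shape)   # (10,)
```

The resulting 10-dimensional C2 activation vector is what feeds the view-tuned units.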
In the version of the model we have simulated, object-specific learning occurs only at the level of the synapses on the view-tuned cells at the top. More complete simulations will have to account for the effect of visual experience on the exact tuning properties of other cells in the hierarchy.
Testing the invariance of model units. View-tuned units in the model were generated by recording the activity of the units in the C2 layer feeding into the VTUs for each one of the 21 paperclip views and then setting the connecting weights of each VTU, i.e., the center of the Gaussian associated with the unit, to the corresponding activation. For rotation, viewpoints from 50° to 130° were tested (the training view was arbitrarily set to 90°) in steps of 4°. For scale, stimulus sizes from 16 to 160 pixels in half-octave steps (except for the last step, which was from 128 to 160 pixels), and for translation, independent translations of ±80 pixels along each axis in steps of 16 pixels (i.e., exploring a plane of 160 × 160 pixels) were used.
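A VTU can be sketched as a Gaussian radial basis function over the C2 activation vector, with the stored C2 pattern of the training view acting as its synaptic weights. The σ value and the 10-dimensional toy vectors below are illustrative assumptions, not the thesis's parameters:

```python
import numpy as np

def vtu_response(c2, weights, sigma=0.5):
    """VTU as a Gaussian RBF unit: maximal when the current C2 activation
    pattern matches the pattern stored for the training view."""
    return float(np.exp(-np.sum((c2 - weights) ** 2) / (2 * sigma ** 2)))

rng = np.random.default_rng(2)
weights = rng.random(10)      # C2 pattern recorded for the training view

training_view = weights                                  # exact match
nearby_view = weights + 0.05 * rng.standard_normal(10)   # small transformation
distractor = rng.random(10)                              # unrelated C2 pattern

print(vtu_response(training_view, weights))   # 1.0 (maximal at training view)
print(vtu_response(nearby_view, weights), vtu_response(distractor, weights))
```

The graded fall-off of the Gaussian around the stored pattern is what produces the tuning curves of Fig. 2-4.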
"Many feature" version. To increase the robustness to clutter of model units, the number of features in S2 was increased: Instead of the previous maximum of two afferents of different orientation looking at the same patch of space as in the version described above, each S2 cell now received input from four neighboring C1 units (in a 2 × 2 arrangement) of arbitrary orientation, giving a total of 4⁴ = 256 different S2 types and finally 256 C2 cells as potential inputs to each view-tuned cell (in simulations, top-level units were sparsely connected to a subset of C2 layer units to gain robustness to clutter, cf. Results). As S2 cells now combined C1 afferents with receptive fields at different locations, and features a certain distance apart at one scale change their separation as the scale changes, pooling at the C1 level was now done in several scale bands, each of roughly a half-octave width in scale space (filter standard deviation ranges were 1.75–2.25, 2.75–3.75, 4.25–5.25, and 5.75–7.25 pixels, resp.), with the spatial pooling range in each scale band chosen accordingly (over neighborhoods of 4 × 4, 6 × 6, 9 × 9, and 12 × 12, respectively — note that system performance was robust with respect to the pooling ranges; simulations with neighborhoods of twice the linear size in each scale band produced comparable results, with a slight drop in the recognition of overlapping stimuli, as expected), as a simple way to improve the scale-invariance of composite feature detectors in the C2 layer. Also, centers of C1 cells were chosen so that RFs overlapped by half a RF size in each dimension. A more principled way
would be to learn the invariant feature detectors, e.g., using
the trace rule [23]. The straightforward connection
patterns used here, however, demonstrate that even a simple
model shows tuning properties comparable to
the experiment.
Softmax approximation. In a simplified two-layer version of the model [79], we investigated the effects of approximations to the MAX operation on recognition performance. The model contained only one pooling stage, C1, where the strength of the pooling nonlinearity could be controlled by a parameter p. There, the output c¹_i of a C1 cell with afferents x_j was

c¹_i = Σ_j [ exp(p |x_j|) / Σ_k exp(p |x_k|) ] · x_j,

which performs a linear summation (scaled by the number of afferents) for p = 0 and the MAX operation for p → ∞.
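A minimal sketch of this soft-maximum pooling rule (the afferent values are arbitrary; the max-subtraction for numerical stability is an implementation detail not in the thesis):

```python
import numpy as np

def softmax_pool(x, p):
    """Soft-maximum pooling: weights each afferent x_j by
    exp(p*|x_j|) / sum_k exp(p*|x_k|)."""
    a = p * np.abs(x)
    w = np.exp(a - a.max())        # subtract a.max() for numerical stability
    return float(np.sum(w / w.sum() * x))

x = np.array([0.2, 0.9, 0.4])
print(softmax_pool(x, 0.0))    # p = 0: equal weights, i.e. the mean (0.5)
print(softmax_pool(x, 100.0))  # large p: approaches max(x) = 0.9
```

Intermediate values of p interpolate smoothly between the SUM-like and MAX-like regimes, which is what allows the strength of the nonlinearity to be varied as a single parameter.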
2.5 Acknowledgments
Supported by grants from ONR, DARPA, NSF, ATR, and Honda. M.R. is supported by a Merck/MIT
Fellowship in Bioinformatics. T.P. is supported by the Uncas and
Helen Whitaker Chair at the
Whitaker College, MIT. We are grateful to H. Bülthoff, F.
Crick, B. Desimone, R. Hahnloser, C. Koch,
N. Logothetis, E. Miller, J. Pauls, D. Perrett, J. Reynolds, T.
Sejnowski, S. Seung, and R. Vogels for
very useful comments and for reading earlier versions of this
manuscript. We thank J. Pauls for
analyzing the average invariance ranges of his IT neurons and K.
Tanaka for the permission to
reproduce Fig. 2-3 (a).
Chapter 3
Are Cortical Models Really Bound by
the “Binding Problem”?
Abstract
The usual description of visual processing in cortex is an
extension of the simple to complex
hierarchy postulated by Hubel and Wiesel — a feedforward
sequence of more and more complex
and invariant features. The capability of this class of models
to perform higher level visual
processing such as viewpoint-invariant object recognition in
cluttered scenes has been questioned
in recent years by several researchers, who in turn proposed an
alternative class of models based
on the synchronization of large assemblies of cells, within and
across cortical areas. The main
implicit argument for this novel and controversial view was the
assumption that hierarchical
models cannot deal with the computational requirements of high
level vision and suffer from
the so-called “binding problem”. We review the present situation
and discuss theoretical and
experimental evidence showing that these perceived weaknesses of hierarchical models are unfounded.
In particular, we show that recognition of multiple objects in
cluttered scenes, arguably among
the most difficult tasks in vision, can be done in a
hierarchical feedforward model.
3.1 Introduction: Visual Object Recognition
Two problems make object recognition difficult:
1. The segmentation problem: Visual scenes normally contain multiple objects. To recognize individual objects, features must be isolated from the surrounding clutter and extracted from the image, and the feature set must be parsed so that the different features are assigned to the correct object. The latter problem is commonly referred to as the "Binding Problem" [110].
2. The invariance problem: Objects have to be recognized under varying viewpoints, lighting conditions, etc.
Interestingly, the human brain solves both problems quickly and with ease. Thorpe et al. [101] report that visual processing in an object detection task in complex visual scenes can be achieved in under 150 ms, which is on the order of the latency of signal transmission from the retina to inferotemporal cortex (IT), the highest area in the ventral visual stream and one thought to play a key role in object recognition [103]; see also [72]. This impressive processing speed presents a strong constraint for any model of object recognition.
3.2 Models of Visual Object Recognition and the Binding Problem
Hubel and Wiesel [37] were the first to postulate a model of
visual object representation and recog-
nition. They recorded from simple and complex cells in the
primary visual cortices of cats and mon-
keys and found that while both types preferentially responded to
bars of a certain orientation, the
former had small receptive fields with a phase-dependent
response while the latter had bigger re-
ceptive fields and showed no phase-dependence. This observation
led them to hypothesize that
complex cells receive input from several simple cells.
Continuing this model in a straightforward
fashion, they suggested [36] that the visual system is composed
of a hierarchy of visual areas, from
simple cells all the way up to “higher order hypercomplex
cells.”
Later studies [7] of macaque inferotemporal cortex (IT)
described neurons tuned to views of
complex objects such as a face, i.e., the cells discharged
strongly to a face seen from a specific
viewpoint but very little or not at all to other objects. A key
property of these cells was their
scale and translation invariance, i.e., the robustness of their
firing to stimulus transformations such as
changes in size or position in the visual field.
These findings inspired various models of visual object
recognition such as Fukushima’s Neocog-
nitron [25] or, later, Perrett and Oram’s [64] outline of a
model of shape processing, and Wallis and
Rolls’ VisNet [111], all of which share the basic idea of the
visual system as a feedforward process-
ing hierarchy where invariance ranges and complexity of
preferred features grow as one ascends
through the levels.
Models of this type prompted von der Malsburg [109] to formulate
the binding problem. His
claim was that visual representations based on spatially
invariant feature detectors were ambigu-
ous: “As generalizations are performed independently for each
feature, information about neigh-
borhood relations and relative position, size and orientation is
lost. This lack of information can
lead to the inability to distinguish between patterns that are
composed of the same set of invariant
features. . . ” [110]. Moreover, as a visual scene containing
multiple objects is represented by a set of
feature activations, a second problem lies in “singling out
appropriate groups from the large back-
ground of possible combinations of active neurons” [110]. These
problems would manifest them-
selves in various phenomena such as hallucinations (the feature
sets activated by objects actually
present in the visual scene combine to yield the activation
pattern characteristic of another object)
and the figure-ground problem (the inability to correctly assign
image features to foreground ob-
ject and background), leading von der Malsburg to postulate the
necessity of a special mechanism,
the synchronous oscillatory firing of ensembles of neurons, to
bind features belonging to one object
together.
One approach to avoid these problems was presented by Olshausen
et al. [62]: Instead of trying
to process all objects simultaneously, processing is limited to
one object in a certain part of space
at a time, e.g., through “focusing attention” on a region of
interest in the visual field, which is
then routed through to higher visual areas, ignoring the
remainder of the visual field. The control
signal for the input selection in this model is thought to be
provided in the form of the output of a
“blob-search” system that identifies possible candidates in the
visual scene for closer examination.
While this top-down approach to circumventing the binding problem
has intuitive appeal and is compatible with physiological studies
that report top-down attentional modulation of receptive field
properties (see the article by Reynolds & Desimone in this issue,
or the recent study by Connor et al. [14]), such a sequential
approach is difficult to reconcile with the apparent speed with
which object recognition can proceed even in very complex scenes
containing many objects [72, 101]. It is also incompatible with
reports of parallel processing of visual scenes, as observed in
pop-out experiments [102], suggesting that object recognition does
not depend solely on explicit top-down selection in all situations.
A more head-on approach to the binding problem was taken in
other studies that have called
into question the assumption that representations based on sets
of spatially invariant feature detec-
tors are inevitably ambiguous. Starting with Wickelgren [114] in
the context of speech recognition,
several studies have proposed that coding an object through a set
of intermediate features, made
up of local arrangements of simpler features (e.g., using letter
pairs, or higher order combinations,
instead of individual letters to code words — for instance, the
word “tomaso” could be confused
with the word “somato” if both are coded by the sets of letters
they are made up of; this ambiguity
is resolved, however, if they are represented through letter
pairs) can sufficiently constrain the rep-
resentation to uniquely code complex objects without retaining
global positional information (see
Mozer [58] for an elaboration of this idea and an implementation
in the context of word recogni-
tion). The capabilities of such a representation based on
spatially-invariant receptive fields were
recently analyzed in detail by Mel & Fiser [53] for the
example domain of English text.
In the visual domain, Mel [52] recently presented a model to
perform invariant recognition of
a large number (100) of objects of different types, using a
representation based on many feature channels. While the model
performed surprisingly well for a variety of transformations,
recognition performance depended strongly on color cues and did
not seem as robust to scale changes as that of experimental
neurons [49]. Perrett & Oram [65]
have recently outlined a conceptual
model based on very similar ideas of how a representation based
on feature combinations could in
theory avoid the “Binding Problem”, e.g., by coding a face
through a set of detectors for combina-
tions of face parts such as eye-nose or eyebrow-hairline. What
has been lacking so far, however, is
a computational implementation quantitatively demonstrating that
such a model can actually per-
form “real-world” subordinate visual object recognition to the
extent observed in behavioral and
physiological experiments [48, 49, 54, 89], where effects such
as scale changes, occlusion and over-
lap pose additional problems not found in an idealized text
environment. In particular, unlike in
the text domain where the input consists of letter strings and
the extraction of features (letter com-
binations) from the input is therefore trivial, the crucial task
of invariant feature extraction from
the image is nontrivial for scenes containing complex shapes,
especially when multiple objects are
present.
We have developed a hierarchical feedforward model of object
recognition in cortex, described
in [82], as a plausibility proof that such a model can account
for several properties of IT cells, in
particular the invariance properties of IT cells found by
Logothetis et al. [49]. In the following,
we will show that such a simple model can perform invariant
recognition of complex objects in
cluttered scenes and is compatible with recent physiological
studies. This is a plausibility proof
that complex oscillation-based mechanisms are not necessarily
required for these tasks and that the
binding problem seems to be a problem for only some models of
object recognition.
3.3 A Hierarchical Model of Object Recognition in Cortex
Studies of receptive field properties along the ventral visual
stream in the macaque, from primary
visual cortex, V1, to anterior IT report an overall trend of an
increase of average feature complexity
and receptive field size throughout the stream [41]. While
simple cells in V1 have small localized
[Figure 3-1 appears here.]
Figure 3-1: (a) Cartoon of the Poggio and Edelman model [68] of
view-based object recognition. The gray ovals correspond to
view-tuned units that feed into a view-invariant unit (open
circle). (b) Tuning curves of the view-tuned (gray) and the
view-invariant (black) units as a function of view angle.
receptive fields and respond preferentially to simple shapes
like bars, cells in anterior IT have been
found to respond to views of complex objects while showing great
tolerance to scale and position
changes. Moreover, some IT cells seem to respond to objects in a
view-invariant manner [6, 49, 66].
Our model follows this general framework. Previously, Poggio and
Edelman [68] presented
a model of how view-invariant cells could arise from view-tuned
cells (Fig. 3-1). However, they
did not describe any model of how the view-tuned units (VTUs)
could come about. We have re-
cently developed a hierarchical model that closes this gap and
shows how VTUs tuned to complex
features can arise from simple cell-like inputs. A detailed
description of our model can be found
in [82] (for preliminary accounts refer to [79, 80], and also to
[43]). We briefly review here some
of its main properties. The central idea of the model is that
invariance to scaling and translation
and robustness to clutter on one hand, and feature complexity on
the other hand require different
transfer functions, i.e., mechanisms by which a neuron combines
its inputs to arrive at an output
value: While for the latter a weighted sum of different
features, which makes the neuron respond
preferentially to a specific activity pattern over its
afferents, is a suitable transfer function, increas-
ing invariance requires a different transfer function that pools
over different afferents tuned to the
same feature but transformed to different degrees (e.g., at
different scales to achieve scale invari-
ance). A suitable pooling function (for a computational
justification, see [82]) is a so-called MAX
function, where the output of the neuron is determined by the
strongest afferent, thus performing a “scan-
ning” operation over afferents tuned to different positions and
scales. This is similar to the original
Hubel and Wiesel model of a complex cell receiving input from
simple cells at different locations
to achieve phase-invariance.
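The contrast between the two transfer functions can be sketched as follows (a minimal illustration with made-up activity values, not the model implementation):

```python
def weighted_sum(afferents, weights):
    """Template match: tuned to a specific pattern of co-activation
    over the afferents, increasing feature complexity."""
    return sum(a * w for a, w in zip(afferents, weights))

def max_pool(afferents):
    """MAX pooling: the output follows the single strongest afferent,
    effectively scanning over units tuned to the same feature at
    different positions or scales."""
    return max(afferents)

# Moving a feature to a different position activates a different
# afferent, but leaves the MAX response unchanged:
feature_at_left = [0.9, 0.1, 0.0]
feature_at_right = [0.0, 0.1, 0.9]
assert max_pool(feature_at_left) == max_pool(feature_at_right)
```

The weighted sum, by contrast, would respond differently to the two activity patterns above, which is exactly why it is suited to building feature selectivity rather than invariance.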
In our model of object recognition in cortex (Fig. 3-2), the two
types of operations, selection and
template matching, are combined in a hierarchical fashion to
build up complex, invariant feature
detectors from small, localized, simple cell-like receptive
fields in the bottom layer. In particular,
patterns on the model “retina” (of 160 × 160 pixels, which
corresponds to a 5° receptive field size ([41] report an average
V4 receptive field size of 4.4°) if we equate 32 pixels with 1°)
are first filtered through a layer (S1, adopting Fukushima’s
nomenclature [25] of referring to feature-building cells as “S”
cells and pooling cells as “C” cells) of simple cell-like
receptive fields (first derivatives of Gaussians, zero-sum,
square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with
standard deviations of 1.75 to 4.75 pixels in steps of 0.5
pixels). S1 filter responses are absolute values of the image
“filtered” through the units’ receptive fields (more precisely,
the rectified dot product of a cell’s receptive field with the
corresponding image patch). Receptive field centers densely sample
the input retina. Cells in the next layer (C1) each pool S1 cells
of the same orientation over a range of scales and positions.
Filters were grouped into four bands of neighboring sizes;
sampling over position was done over square patches whose linear
dimensions grew with the filter band (the exact pooling ranges are
given in [82]), with patches overlapping by half in each
direction, to obtain more invariant cells responding to the same
features as the S1 cells. Different C1 cells were then combined in
higher layers; the figure illustrates two possibilities: combining
C1 cells tuned to different features to give S2 cells responding
to co-activations of C1 cells tuned to different orientations, or
pooling C1 cells tuned to the same feature to yield C2 cells
responding to the same feature as the C1 cells but with bigger
receptive fields (i.e., the hierarchy does not have to be a strict
alternation of S and C layers). In the version described in this
paper, there were no direct C1 → C2 connections, and each S2 cell
received input from four neighboring C1 units (in a 2 × 2
arrangement) of arbitrary orientation, yielding a total of
4^4 = 256 different S2 cell types. S2 transfer functions were
Gaussian (σ = 1, centered at 1). C2 cells then pooled inputs from
all S2 cells of the same type, producing invariant feature
detectors tuned to complex shapes. Top-level view-tuned units had
Gaussian response functions, and each VTU received inputs from a
subset of C2 cells (see below).
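The S/C alternation just described can be caricatured in a few lines of NumPy (an illustrative sketch with arbitrary toy dimensions and activity values; the actual filter sizes, pooling ranges, and S2 types are those given in [82]):

```python
import numpy as np

def s_response(afferents, template, sigma=1.0):
    """'S' cell: Gaussian template match over its afferent vector,
    maximally active when the afferents equal the preferred pattern."""
    d2 = np.sum((afferents - template) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def c_response(responses):
    """'C' cell: MAX over afferents tuned to the same feature at
    different positions (or scales)."""
    return np.max(responses, axis=0)

rng = np.random.default_rng(0)
c1 = rng.random((5, 4))                    # 4 toy C1 channels at 5 positions
template = np.array([1.0, 0.0, 1.0, 0.0])  # one toy S2 co-activation pattern
s2 = s_response(c1, template)              # S2 response at each position
c2 = c_response(s2)                        # position-invariant C2 response

# Permuting the positions (moving the best-matching patch around)
# leaves the C2 response unchanged:
shifted = np.roll(c1, 2, axis=0)
assert np.isclose(c2, c_response(s_response(shifted, template)))
```

The same two functions, stacked, would give the S1 → C1 → S2 → C2 progression: template matches build more complex preferred features, MAX stages extend the invariance range.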
This model had originally been developed to account for the
transformation tolerance of view-
tuned units in IT as recorded by Logothetis et al. [49]. It
turns out, however, that the model
also has interesting implications for the binding problem.
3.4 Binding without a problem
To correctly recognize multiple objects in clutter, two problems
must be solved: i) features must be
robustly extracted, and ii) based on these features, a decision
has to be made about which objects
are present in the visual scene. The MAX operation can perform
robust feature extraction (cf. [82]):
A MAX pooling cell that receives inputs from cells tuned to the
same feature at, e.g., different lo-
cations, will select the most strongly activated afferent, i.e.,
its response will be determined by the
afferent with the closest match to its preferred feature in its
receptive field. Thus, the MAX mech-
anism effectively isolates the feature of interest from the
surrounding clutter. Hence, to achieve
[Figure 3-2 appears here: the model hierarchy, from simple cells
(S1) through complex cells (C1), “composite feature” cells (S2),
and “complex composite” cells (C2) up to view-tuned cells,
connected by weighted-sum and MAX operations.]
Figure 3-2: Diagram of our hierarchical model [82] of object
recognition in cortex. It consists of layers of linear units that
perform a template match over their afferents (blue arrows), and
of non-linear units that perform a “MAX” operation over their
inputs, where the output is determined by the strongest afferent
(green arrows). While the former operation serves to increase
feature complexity, the latter increases invariance by effectively
scanning over afferents tuned to the same feature but at different
positions (to increase translation invariance) or scales (to
increase scale invariance; not shown). In the version described in
this paper, learning only occurred at the connections from the C2
units to the top-level view-tuned units.
robustness to clutter, a VTU should only receive input from
cells that are strongly activated by the
VTU’s preferred stimulus (i.e., those features that are relevant
to the definition of the object) and
thus less affected by clutter (which will tend to activate the
afferents less and will therefore be ig-
nored by the MAX response function). Also, in such a scheme, two
view-tuned neurons receiving
input from a common afferent feature detector will both tend to
have strong connections to this feature detector. Thus, there will
be little interference even if, due to its MAX response function,
the common feature detector responds to only one (the stronger) of
the two stimuli in its receptive field. Note that the situation
would be hopeless for a
response function that pools over all affer-
ents through, for example, a linear sum function: The response
would always change when another
object is introduced in the visual field, making it impossible
to disentangle the activations caused
by the individual stimuli without an additional mechanism such
as, for instance, an attentional
sculpting of the receptive field or some kind of segmentation
process.
In the following two sections we will show simulations that
support these theoretical consider-
ations, and we will compare them to recent physiological
experiments.
3.4.1 Recognition of multiple objects
The ability of the model neurons to perform recognition of
multiple, non-overlapping objects was
investigated in the following experiment: 21 model neurons, each
tuned to a view of a randomly
selected paperclip object (as used in theoretical [68],
psychophysical [8, 48], and physiological [49]
studies on object recognition), were each presented with 21
displays consisting of that neuron’s
preferred clip combined with each of the 21 preferred clips (in
the upper left and lower right corner
of the model retina, respectively; see Fig. 3-3 (a)), yielding
21 × 21 = 441 two-clip displays. Recognition performance was
evaluated by comparing the neuron’s response to these
displays with its responses to 60
other, randomly chosen “distractor” paperclip objects (cf. Fig.
3-3). Following the studies on view-
invariant object recognition [8, 48, 82], an object is said to
be recognized if the neuron’s response
to the two-clip displays (containing its preferred stimulus) is
greater than to any of the distractor
objects. For 40 afferents to each view-tuned cell (i.e., the 40
C2 units excited most strongly by the
neuron’s preferred stimulus; this choice produced top-level
neurons with tuning curves similar to
the experimental neurons [82]), we find that on average in 90%
of the cases recognition of the neu-
ron’s preferred clip is still possible, indicating that there is
little interference between the activations
caused by the two stimuli in the visual field. The maximum
recognition ra