    How a Part of the Brain Might or Might Not Work: A New

    Hierarchical Model of Object Recognition

    by

    Maximilian Riesenhuber

    Diplom-Physiker, Universität Frankfurt, 1995

    Submitted to the Department of Brain and Cognitive Sciences in partial fulfillment of the requirements for the degree of

    Doctor of Philosophy in Computational Neuroscience

    at the

    MASSACHUSETTS INSTITUTE OF TECHNOLOGY

    May 2000

    © Massachusetts Institute of Technology 2000. All rights reserved.

    Author: Department of Brain and Cognitive Sciences, May 2, 2000

    Certified by: Tomaso Poggio, Uncas and Helen Whitaker Professor, Thesis Supervisor

    Accepted by: Earl K. Miller, Co-Chair, Department Graduate Committee

    How a Part of the Brain Might or Might Not Work: A New Hierarchical Model

    of Object Recognition

    by

    Maximilian Riesenhuber

    Submitted to the Department of Brain and Cognitive Sciences on May 2, 2000, in partial fulfillment of the

    requirements for the degree of Doctor of Philosophy in Computational Neuroscience

    Abstract

    The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated representations, extending in a natural way the model of simple to complex cells of Hubel and Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to explore the biological feasibility of this class of models to explain higher level visual processing, such as object recognition in cluttered scenes. We describe a new hierarchical model, HMAX, that accounts well for this complex visual task, is consistent with several recent physiological experiments in inferotemporal cortex and makes testable predictions. Key to achieving invariance and robustness to clutter is a MAX-like response function of some model neurons which selects (an approximation to) the maximum activity over all the afferents, with interesting connections to “scanning” operations used in recent computer vision algorithms.

    We then turn to the question of object recognition in natural (“continuous”) object classes, such as faces, which recent physiological experiments have suggested are represented by a sparse distributed population code. We performed two psychophysical experiments in which subjects were trained to perform subordinate level discrimination in a continuous object class — images of computer-rendered cars — created using a 3D morphing system. By comparing the recognition performance of trained and untrained subjects we could estimate the effects of viewpoint-specific training and infer properties of the object class-specific representation learned as a result of training. We then compared the experimental findings to simulations in HMAX, to investigate the computational properties of a population-based object class representation. We find experimental evidence, supported by modeling results, that training builds a viewpoint- and class-specific representation that supplements a pre-existing representation with lower shape discriminability but greater viewpoint invariance.

    Finally, we show how HMAX can be extended in a straightforward fashion to perform object categorization and to support arbitrary class hierarchies. We demonstrate the capability of our scheme, called “Categorical Basis Functions” (CBF), with the example domain of cat/dog categorization, and apply it to study some recent findings in categorical perception.

    Thesis Supervisor: Tomaso Poggio
    Title: Uncas and Helen Whitaker Professor


    Acknowledgments

    Thanks are due to quite a few people who have contributed in different ways to the gestation of

    this thesis. First, of course, are my parents. I am extremely grateful for their untiring support and

    encouragement over the years — especially to my father for convincing me to major in physics.

    Second is Hans-Ulrich Bauer, who advised my Diplom thesis at the Institute for Theoretical

    Physics of the University of Frankfurt. Without Hans-Ulrich and his urging to get my PhD in

    computational neuroscience in the US I would never have applied to MIT and would now probably

    be a bitter physics PhD working for McKinsey. To him this thesis is dedicated.

    At MIT, I first want to thank my advisor, Tommy Poggio, for warning me about the smog at

    Caltech, and for being the best advisor I could imagine: Providing a lot of independence while

    being very encouraging and supportive. Being exposed to his way of doing science I consider one

    of the biggest assets of my PhD training.

    Then there are the people on my thesis committee, all of whom have provided valuable input

    and guidance along the way. Special thanks are due to Peter Dayan for a fine collaboration during

    my first year that introduced me to quite a few new ideas. To Earl Miller for being so open to collab-

    orations with wacky theorists. To Mike Tarr for a nice News & Views commentary [96] on our paper

    [82], and a very stimulating visit which might lead to an even more stimulating collaboration. . .

    To Pawan Sinha and Gadi Geiger for introducing me to the wonderful world of psychophysics

    and providing invaluable advice when I ran my first experiment. To David Freedman and An-

    dreas Tolias for introducing me to the wonderful world of monkey physiology and being great

    collaborators. To Christian Shelton for the “mother of all correspondence algorithms” that was so

    instrumental in many parts of this thesis and beyond. To Valerie Pires, without whose help the

    psychophysics in chapter 4 would have been nothing more than “suggestions for future work”,

    and to Mary Pat Fitzgerald, “the mother of CBCL”, whose help in dealing with the subtleties of the

    MIT administration was (and continues to be) greatly appreciated.

    Then there is our fine department that really has it all “under one roof”. I could not have

    imagined a better place to get my PhD.

    Last but not least I want to gratefully acknowledge the generous support provided by a Gerald

    J. and Marjorie J. Burnett Fellowship (1996–1998) and a Merck/MIT Fellowship in Bioinformatics

    (1998–2000) that enabled me to pursue the studies described in this thesis.


    Contents

    1 Introduction 8

    2 Hierarchical Models of Object Recognition in Cortex 11

    2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    2.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    2.5 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    3 Are Cortical Models Really Bound by the “Binding Problem”? 26

    3.1 Introduction: Visual Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    3.2 Models of Visual Object Recognition and the Binding Problem . . . . . . . . . . . . . 27

    3.3 A Hierarchical Model of Object Recognition in Cortex . . . . . . . . . . . . . . . . . . 29

    3.4 Binding without a problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    3.4.1 Recognition of multiple objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    3.4.2 Recognition in clutter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    3.6 Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    4 The Individual is Nothing, the Class Everything:

    Psychophysics and Modeling of Recognition in Object Classes 40

    4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

    4.2 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    4.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

    4.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

    4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

    4.3 Modeling: Representing Continuous Object Classes in HMAX . . . . . . . . . . . . . 50

    4.3.1 The HMAX Model of Object Recognition in Cortex . . . . . . . . . . . . . . . 51


    4.3.2 View-Dependent Object Recognition in Continuous Object Classes . . . . . . 51

    4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

    4.4 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    4.4.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    4.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

    4.4.3 The Model Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4.5 General Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    5 A note on object class representation and categorical perception 66

    5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

    5.2 Chorus of Prototypes (COP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    5.3 A Novel Scheme: Categorical Basis Functions (CBF) . . . . . . . . . . . . . . . . . . . 68

    5.3.1 An Example: Cat/Dog Classification . . . . . . . . . . . . . . . . . . . . . . . 69

    5.3.2 Introduction of parallel categorization schemes . . . . . . . . . . . . . . . . . 72

    5.4 Interactions between categorization and discrimination: Categorical Perception . . . 72

    5.4.1 Categorical Perception in CBF . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

    5.4.2 Categorization with and without Categorical Perception . . . . . . . . . . . . 75

    5.5 COP or CBF? — Suggestion for Experimental Tests . . . . . . . . . . . . . . . . . . . . 78

    5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

    6 General Conclusions and Future Work 81


    List of Figures

    2-1 Invariance properties of one IT neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    2-2 Sketch of the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    2-3 Illustration of the highly nonlinear shape tuning properties of the MAX mechanism 17

    2-4 Response of a sample model neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    2-5 Average neuronal responses to scrambled stimuli . . . . . . . . . . . . . . . . . . . . 21

    3-1 Cartoon of the Poggio and Edelman model of view-based object recognition . . . . . 30

    3-2 Sketch of our hierarchical model of object recognition in cortex . . . . . . . . . . . . . 32

    3-3 Recognition of two objects simultaneously . . . . . . . . . . . . . . . . . . . . . . . . 34

    3-4 Stimulus/background example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    3-5 Model performance: Recognition in clutter . . . . . . . . . . . . . . . . . . . . . . . . 36

    4-1 Natural objects, and artificial objects used in previous object recognition studies . . . 42

    4-2 The eight prototype cars used in the 8 car system . . . . . . . . . . . . . . . . . . . . . 45

    4-3 Training task for Experiments 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    4-4 Illustration of match/nonmatch pairs for Experiment 1 . . . . . . . . . . . . . . . . . 47

    4-5 Testing task for Experiments 1 and 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    4-6 Average performance of the trained subjects on the test task of Experiment 1 . . . . . 49

    4-7 Average performance of untrained subjects on the test task of Experiment 1 . . . . . 50

    4-8 Our model of object recognition in cortex . . . . . . . . . . . . . . . . . . . . . . . . . 52

    4-9 Recognition performance of the model on the eight car morph space . . . . . . . . . 53

    4-10 Dependence of average (one-sided) rotation invariance, �r, on SSCU tuning width, � 55

    4-11 Dependence of invariance range on the number of afferents to each SSCU . . . . . . 55

    4-12 Dependence of invariance range on the number of SSCUs . . . . . . . . . . . . . . . . 56

    4-13 Effect of addition of noise to the SSCU representation . . . . . . . . . . . . . . . . . . 57

    4-14 Cat/dog prototypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

    4-15 The “Other Class” effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

    4-16 Car object class-specific features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58


    4-17 Performance of the two-layer model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

    4-18 The 15 prototypes used in the 15 car system . . . . . . . . . . . . . . . . . . . . . . . . 61

    4-19 Average performance of trained subjects in Experiment 2 . . . . . . . . . . . . . . . . 62

    4-20 Average performance of untrained subjects in Experiment 2 . . . . . . . . . . . . . . 63

    5-1 Cartoon of the CBF categorization scheme . . . . . . . . . . . . . . . . . . . . . . . . . 70

    5-2 Illustration of the cat/dog stimulus space . . . . . . . . . . . . . . . . . . . . . . . . . 71

    5-3 Response of the cat/dog categorization unit . . . . . . . . . . . . . . . . . . . . . . . . 71

    5-4 Sketch of the model to explain the influence of experience with categorization tasks

    on object discrimination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

    5-5 Average responses over all morph lines for the two networks . . . . . . . . . . . . . . 76

    5-6 Comparison of Euclidean distances of activation patterns . . . . . . . . . . . . . . . . 77

    5-7 Output of the categorization unit trained on the cat/dog categorization task . . . . . 80

    5-8 Same as Fig. 5-7, but for a representation based on 30 units chosen by k-means . . . 80


    Chapter 1

    Introduction

    Tell your hairdresser that you are working on vision and he will likely say “Vision? But that’s

    easy!” Indeed, the apparent ease with which we perform object recognition even in cluttered scenes

    and under difficult viewing conditions belies the amazing complexity of the visual system. This

    became apparent with the groundbreaking studies of Hubel and Wiesel [36, 37] in the primary

    visual cortices of cats and monkeys in the late 50s and 60s. Subsequently, many other visual areas

    were discovered, with recent surveys listing over 30 visual areas linked in an intricate and still

    unresolved pattern [22, 35]. This complex connection scheme can be coarsely divided up into two

    pathways, the “What” pathway (the ventral stream) running from primary visual cortex, V1, over

    V2 and V4 to inferotemporal cortex (IT), and the “Where” pathway (dorsal stream) from V1 to V2,

    V3, MT, MST and other parietal areas [103]. In this framework, the “What” pathway is specialized

    for object recognition whereas the “Where” pathway is concerned with spatial vision. Looking at

    the ventral stream in more detail, Kobatake et al. [41] have reported a progression of the complexity

    of cells’ preferred features and receptive field sizes as one progresses along the stream. While

    neurons in V1 are tuned to oriented bars and have small receptive fields, cells in IT appear to prefer

    more complex visual patterns such as faces (for a review on face cells, see [15]), and respond over

    a wide range of positions and scales, pointing to a crucial role of IT cortex in object recognition (as

    confirmed by a great number of physiological, lesion and neuropsychological studies [50, 94]).

    These findings naturally prompted the question of how cells tuned to views of complex objects

    showing invariance to size and position changes in IT could arise from small bar-tuned receptive

    fields in V1, and how ultimately this neural substrate could be used to perform object recognition.

    In humans, psychophysical experiments had given rise to two main competing theories of ob-

    ject recognition: the structural description and the view-based theory (see [98] for a review of the

    two theories and their experimental evidence). The former approach, its main protagonist being

    the recognition-by-components theory of Biederman [5], holds that object recognition proceeds by


    decomposing an object into a view-independent part-based description while in the view-based

    theory object recognition is based on the viewpoints objects had actually appeared in. Experimen-

    tal evidence ([8, 48, 49, 97, 100], see also chapter 2) and computational considerations [19] appear

    to favor the view-based theory. However, several challenges for view-based models remained; Tarr

    and Bülthoff [98] very recently listed the following problems:¹

    1. tolerance to changes in viewing condition — while the system should allow fine shape dis-

    criminations it should not require a new representation for every change in viewing condi-

    tion;

    2. class generalization and representation of categories — the system should generalize from

    familiar exemplars of a class to unfamiliar ones, and also be able to support categorization

    schemes.

    This thesis presents a simple hierarchical model of object recognition in cortex (chapter 5 shows

    how the model can be extended to object categorization in a straightforward way), HMAX [96],

    that addresses these challenges in a biologically plausible system. In particular, chapter 2 (a reprint

    of [82], copyright Nature America Inc., reprinted by permission) introduces HMAX and compares

    the visual properties of view-tuned units in the model, especially with respect to translation, scal-

    ing and rotation in depth of the visual scene, to those of neurons in inferotemporal cortex recorded

    from in various experiments. Chapter 3 (a reprint of [81], copyright Cell Press, reprinted by per-

    mission) shows that HMAX can even perform recognition in cluttered scenes, without having to

    resort to special segmentation processes, which is of special interest in connection with the so-called

    “Binding Problem”.

    While the first two papers focus on “paperclip” objects that have been used extensively in psy-

    chophysical [8, 48, 91], physiological [49] and computational studies [68], this object class has sev-

    eral disadvantages (such as not being “nice” [107]) that make it unsuitable as a basis to investigate

    recognition in natural object classes — the topic of chapter 4 (a reprint of [84]) — where objects

    have similar 3D shape, such as faces. Instead, in chapters 4 and 5, stimuli for model and experi-

    ment were generated using a novel 3D morphing system developed by Christian Shelton [90] that

    allows us to generate morphed objects drawn from a space spanned by a set of prototype objects,

    for instance cars (chapter 4), or cats and dogs (chapters 4 and 5). We show that the recognition

    results obtainable in natural object classes represented using a population code where the activity

    over a group of units codes for the identity of an object (as suggested by recent physiology studies

    [112, 115]) are quite comparable to those for individual objects represented by “grandmother” units

    (in chapter 2), that is, the performance of HMAX does not appear to be special to a certain object

    ¹They also listed the need for a simple mechanism to measure perceptual similarity, in order “to generalize between exemplars or between views” ([98], p. 9), which thus appears to be a corollary of solving the two problems listed above.


    class. In addition, simulations show that a population-based object representation provides several

    computational advantages over a “grandmother” representation. We further present experimental

    results from a psychophysical study in which we trained subjects using a discrimination paradigm

    to build a representation of a novel object class. This representation was then probed by exam-

    ining how discrimination performance was affected by viewpoint changes. We find experimental

    evidence, supported by the modeling results, that training builds a viewpoint- and class-specific

    representation that supplements a pre-existing representation with lower shape discriminability

    but greater viewpoint invariance. Chapter 5 (a reprint of [83]) finally shows how HMAX can be

    extended in a straightforward way to perform object categorization, and to support arbitrary object

    categorization schemes, with interesting opportunities for interactions between discrimination and

    categorization as observed in categorical perception.


    Chapter 2

    Hierarchical Models of Object

    Recognition in Cortex

    Abstract

    The classical model of visual processing in cortex is a hierarchy of increasingly sophisticated

    representations, extending in a natural way the model of simple to complex cells of Hubel and

    Wiesel. Somewhat surprisingly, little quantitative modeling has been done in the last 15 years to

    explore the biological feasibility of this class of models to explain higher level visual processing,

    such as object recognition. We describe a new hierarchical model that accounts well for this

    complex visual task, is consistent with several recent physiological experiments in inferotemporal

    cortex and makes testable predictions. The model is based on a novel MAX-like operation on the

    inputs to certain cortical neurons which may have a general role in cortical function.

    2.1 Introduction

    The recognition of visual objects is a fundamental cognitive task performed effortlessly by the brain

    countless times every day while satisfying two essential requirements: invariance and specificity.

    In face recognition, for example, we can recognize a specific face among many, while being rather

    tolerant to changes in viewpoint, scale, illumination, and expression. The brain performs this and

    similar object recognition and detection tasks fast [101] and well. But how?

    Early studies [7] of macaque inferotemporal cortex (IT), the highest purely visual area in the

    ventral visual stream, thought to have a key role in object recognition [103], reported cells tuned to

    views of complex objects such as a face, i.e., the cells discharged strongly to the view of a face but

    very little or not at all to other objects. A hallmark of these cells was the robustness of their firing


    to stimulus transformations such as scale and position changes.

    This finding presented an interesting question: How could these cells show strongly differing

    responses to similar stimuli (as, e.g., two different faces), that activate the retinal photoreceptors in

    similar ways, while showing response constancy to scaled and translated versions of the preferred

    stimulus that cause very different activation patterns on the retina?

    This puzzle was similar to one faced by Hubel and Wiesel on a much smaller scale two decades

    earlier when they recorded from simple and complex cells in cat striate cortex [36]: both cell types

    responded strongly to oriented bars, but whereas simple cells exhibited small receptive fields with

    a strong phase dependence, that is, with distinct excitatory and inhibitory subfields, complex cells

    had larger receptive fields and no phase dependence. This led Hubel and Wiesel to propose a

    model in which simple cells with their receptive fields in neighboring parts of space feed into

    the same complex cell, thereby endowing that complex cell with a phase-invariant response. A

    straightforward (but highly idealized) extension of this scheme would lead all the way from simple

    cells to “higher order hypercomplex cells” [37].

    Starting with the Neocognitron [25] for translation invariant object recognition, several hierar-

    chical models of shape processing in the visual system have subsequently been proposed to ex-

    plain how transformation-invariant cells tuned to complex objects can arise from simple cell inputs

    [64, 111]. Those models, however, were not quantitatively specified or were not compared with

    specific experimental data. Alternative models for translation- and scale-invariant object recogni-

    tion have been proposed, based on a controlling signal that either appropriately reroutes incoming

    signals, as in the “shifter” circuit [2] and its extension [62], or modulates neuronal responses, as

    in the “gain-field” models for invariant recognition [78, 88]. While recent experimental studies

    [14, 56] have indicated that in macaque area V4 cells can show an attention-controlled shift or mod-

    ulation of their receptive field in space, there is still little evidence that this mechanism is used to

    perform translation-invariant object recognition and whether a similar mechanism applies to other

    transformations (such as scaling) as well.

    The basic idea of the hierarchical model sketched by Perrett and Oram [64] was that invariance

    to any transformation (not just image-plane transformations as in the case of the Neocognitron

    [25]) could be built up by pooling over afferents tuned to various transformed versions of the

    same stimulus. Indeed it was shown earlier [68] that viewpoint-invariant object recognition was

    possible using such a pooling mechanism. A (Gaussian RBF) learning network was trained with

    individual views (rotated around one axis in 3D space) of complex, paperclip-like objects to achieve

    3D rotation-invariant recognition of this object. In the network the resulting view-tuned units fed

    into a view-invariant unit; they effectively represented prototypes between which the learning

    network interpolated to achieve viewpoint-invariance.
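    To make this pooling scheme concrete, the following sketch (illustrative only; the variable names, the equal pooling weights and the Gaussian width are assumptions, not values taken from [68]) treats view-tuned units as Gaussian RBF units centered on stored views of one object, feeding a view-invariant unit that sums their activities:

    import numpy as np

    def view_tuned_responses(x, stored_views, sigma=1.0):
        # Gaussian RBF activation of one view-tuned unit per stored (training) view
        d2 = np.sum((stored_views - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def view_invariant_response(x, stored_views, weights, sigma=1.0):
        # view-invariant unit: weighted sum over the view-tuned units,
        # i.e., interpolation between the stored prototype views
        return float(weights @ view_tuned_responses(x, stored_views, sigma))

    stored_views = np.random.rand(3, 10)          # three stored views, as 10-dim feature vectors
    weights = np.ones(3) / 3.0                    # equal pooling weights (an assumption)
    novel_view = stored_views[0] + 0.05 * np.random.randn(10)
    print(view_invariant_response(novel_view, stored_views, weights))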

    There is now quantitative psychophysical [8, 48, 95] and physiological evidence [6, 42, 49] for


    the hypothesis that units tuned to full or partial views are probably created by a learning process

    and also some hints that the view-invariant output is in some cases explicitly represented by (a

    small number of) individual neurons [6, 49, 66].

    A recent experiment [48, 49] required monkeys to perform an object recognition task using

    novel “paperclip” stimuli the monkeys had never seen before. Here, the monkeys were required

    to recognize views of “target” paperclips rotated in depth among views of a large number of “dis-

    tractor” paperclips of very similar structure, after being trained on a restricted set of views of each

    target object. Following very extensive training on a set of paperclip objects, neurons were found

    in anterior IT that selectively responded to the object views seen during training.

    This design avoided two problems associated with previous physiological studies investigat-

    ing the mechanisms underlying view-invariant object recognition: First, by training the monkey

    to recognize novel stimuli with which the monkey had not had any visual experience instead of

    objects (e.g., faces) with which the monkey was quite familiar, it was possible to estimate the de-

    gree of view-invariance derived from just one object view. Moreover, the use of a large number of

    distractor objects made it possible to define view-invariance with respect to the distractor objects. This is a

    key point, since only by being able to compare the response of a neuron to transformed versions of

    its preferred stimulus with the neuron’s response to a range of (similar) distractor objects can the

    VTU’s (view-tuned unit’s) invariance range be determined — just measuring the tuning curve is

    not sufficient.

    The study [49] established (Fig. 2-1) that after training with just one object view there are cells

    showing some degree of limited invariance to 3D rotation around the training view, consistent

    with the view-interpolation model [68]. Moreover, the cells also exhibit significant invariance to

    translation and scale changes, even though the object was only previously presented at one scale

    and position.

    These data put in sharp focus and in quantitative terms the question of the circuitry under-

    lying the properties of the view-tuned cells. While the original model [68] described how VTUs

    could be used to build view-invariant units, they did not specify how the view-tuned units could

    come about. The key problem is thus to explain in terms of biologically plausible mechanisms the

    VTUs’ invariance to translation and scaling obtained from just one object view, which arises from

    a trade-off between selectivity to a specific object and relative tolerance (i.e., robustness of firing) to

    position and scale changes. Here, we describe a model that conforms to the main anatomical and

    physiological constraints, reproduces the invariance data described above and makes predictions

    for experiments on the view-tuned subpopulation of IT cells. Interestingly, the model is also con-

    sistent with recent data from several other experiments regarding recognition in context [54], or the

    presence of multiple objects in a cell’s receptive field [89].


    Figure 2-1: Invariance properties of one neuron (modified from Logothetis et al. [49]). The figure shows the response of a single cell found in anterior IT after training the monkey to recognize paperclip-like objects. The cell responded selectively to one view of a paperclip and showed limited invariance around the training view to rotation in depth, along with significant invariance to translation and size changes, even though the monkey had only seen the stimulus at one position and scale during training. (a) shows the response of the cell to rotation in depth around the preferred view. (b) shows the cell’s response to the 10 distractor objects (other paperclips) that evoked the strongest responses. The lower plots show the cell’s response to changes in stimulus size, (c) (asterisk shows the size of the training view), and position, (d) (using the 1.9° size), resp., relative to the mean of the 10 best distractors. Defining “invariance” as yielding a higher response to transformed views of the preferred stimulus than to distractor objects, neurons exhibit an average rotation invariance of 42° (during training, stimuli were actually rotated by ���� in depth to provide full 3D information to the monkey; therefore, the invariance obtained from a single view is likely to be smaller), translation and scale invariance on the order of ��� and �� octave around the training view, resp. (J. Pauls, personal communication).

    Figure 2-2: Sketch of the model (layers, from bottom to top: simple cells (S1), complex cells (C1), “composite feature” cells (S2), “complex composite” cells (C2), and view-tuned cells). The model is an hierarchical extension of the classical paradigm [36] of building complex cells from simple cells. It consists of a hierarchy of layers with linear (“S” units in the notation of Fukushima [25], performing template matching, solid lines) and non-linear operations (“C” pooling units [25], performing a “MAX” operation, dashed lines). The non-linear MAX operation — which selects the maximum of the cell’s inputs and uses it to drive the cell — is key to the model’s properties and is quite different from the basically linear summation of inputs usually assumed for complex cells. These two types of operations respectively provide pattern specificity and invariance (to translation, by pooling over afferents tuned to different positions, and scale (not shown), by pooling over afferents tuned to different scales).

    2.2 Results

    The model is based on a simple hierarchical feedforward architecture (Fig. 2-2). Its structure reflects

    the assumption that invariance to position and scale on the one hand and feature specificity on

    the other hand must be built up through separate mechanisms: to increase feature complexity, a

    suitable neuronal transfer function is a weighted sum over afferents coding for simpler features,

    i.e., a template match. But is summing over differently weighted afferents also the right way to

    increase invariance?
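    In symbols (a schematic rendering of the two response functions discussed in this section, with x_j the activities of a unit's afferents and w_j its synaptic weights; the notation is ours, not taken from the original figures):

    \[
      y_{S} \;=\; f\Big(\sum_j w_j\, x_j\Big) \quad \text{(“S” units: template match)},
      \qquad
      y_{C} \;=\; \max_j\, x_j \quad \text{(“C” units: invariance pooling)},
    \]

    where f denotes the unit's (possibly sigmoidal) transfer function.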

    From the computational point of view, the pooling mechanism should produce robust feature

    detectors, i.e., measure the presence of specific features without being confused by clutter and con-

    text in the receptive field. Consider a complex cell, as found in primary visual cortex, whose pre-

    ferred stimulus is a bar of a certain orientation to which the cell responds in a phase-invariant way

    [36]. Along the lines of the original complex cell model [36], one could think of the complex cells

    as receiving input from an array of simple cells at different locations, pooling over which results in


    the position-invariant response of the complex cell.

    Two alternative idealized pooling mechanisms are: linear summation (“SUM”) with equal weights

    (to achieve an isotropic response) and a nonlinear maximum operation (“MAX”), where the strongest

    afferent determines the response of the postsynaptic unit. In both cases, if only one bar is present

    in the receptive field, the response of a model complex cell is position invariant. The response

    level would signal how similar the stimulus is to the afferents’ preferred feature. Consider now

    the case of a complex stimulus, like e.g., a paperclip, in the visual field. In the linear summation

    case, complex cell response would still be invariant (as long as the stimulus stays in the cell’s re-

    ceptive field), but the response level now would not allow one to infer whether there actually was a

    bar of the preferred orientation somewhere in the complex cell’s receptive field, as the output sig-

    nal is a sum over all the afferents. That is, feature specificity is lost. In the MAX case, however,

    the response would be determined by the most strongly activated afferent and hence would signal

    the best match of any part of the stimulus to the afferents’ preferred feature. This ideal example

    suggests that the MAX mechanism is capable of providing a more robust response in the case of

    recognition in clutter or with multiple stimuli in the receptive field (cf. below). Note that a SUM re-

    sponse with saturating nonlinearities on the inputs seems too brittle since it requires a case-by-case

    adjustment of the parameters, depending on the activity level of the afferents.

    Equally critical is the inability of the SUM mechanism to achieve size invariance: Suppose that

    the afferents to a “complex” cell (which now could be a cell in V4 or IT, for instance) show some

    degree of size and position invariance. If the “complex” cell were now stimulated with the same

    object but at subsequently increasing sizes, an increasing number of afferents would become ex-

    cited by the stimulus (unless the afferents showed no overlap in space or scale) and consequently

    the excitation of the “complex” cell would increase along with the stimulus size, even though the

    afferents show size invariance (this is borne out in simulations using a simplified two-layer model

    [79])! For the MAX mechanism, however, cell response would show little variation even as stimulus

    size increased since the cell’s response would be determined just by the best-matching afferent.
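    The difference between the two pooling rules can be seen in a few lines of code (a toy sketch with made-up afferent activations, not a simulation from this thesis): with MAX pooling the pooled response still reports the best local match when clutter is added, whereas the equal-weight SUM changes with everything in the receptive field:

    import numpy as np

    def pool_sum(afferents):            # idealized linear pooling with equal weights
        return afferents.mean()

    def pool_max(afferents):            # idealized MAX pooling: best-matching afferent drives the cell
        return afferents.max()

    rng = np.random.default_rng(0)
    bar_alone = np.zeros(16)            # 16 position-tuned simple-cell afferents
    bar_alone[7] = 0.9                  # preferred bar present at one position
    bar_in_clutter = rng.uniform(0.0, 0.4, 16)
    bar_in_clutter[7] = 0.9             # same bar, now embedded in clutter

    for name, stim in [("bar alone", bar_alone), ("bar in clutter", bar_in_clutter)]:
        print(f"{name:>15}:  SUM = {pool_sum(stim):.2f}   MAX = {pool_max(stim):.2f}")
    # MAX stays at 0.90 in both cases; SUM changes with the clutter level, so the
    # presence of the preferred feature can no longer be read off the pooled response.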

    These considerations (supported by quantitative simulations of the model, described below)

    suggest that a sensible way of pooling responses to achieve invariance is via a nonlinear MAX

    function, that is, by implicitly scanning (see discussion) over afferents of the same type that differ

    in the parameter of the transformation to which the response should be invariant (e.g., feature

    size for scale invariance), and then selecting the best-matching of those afferents. Note that these

    considerations apply to the case where different afferents to a pooling cell, e.g., those looking at

    different parts of space, are likely to be responding to different objects (or different parts of the

    same object) in the visual field (as is the case with cells in lower visual areas with their broad shape

    tuning). Here, pooling by combining afferents would mix up signals caused by different stimuli.

    However, if the afferents are specific enough to only respond to one pattern, as one expects in the


    Figure 2-3: Illustration of the highly nonlinear shape tuning properties of the MAX mechanism. (a) Experimentally observed responses of IT cells obtained using a “simplification procedure” [113] designed to determine “optimal” features (responses normalized so that the response to the preferred stimulus is equal to 1). In that experiment, the cell originally responds quite strongly to the image of a “water bottle” (leftmost object). The stimulus is then “simplified” to its monochromatic outline, which increases the cell’s firing, and further to a paddle-like object, consisting of a bar supporting an ellipse. While this object evokes a strong response, the bar or the ellipse alone produce almost no response at all (figure used by permission). (b) Comparison of experiment and model. Green bars show the responses of the experimental neuron from (a). Blue and red bars show the response of a model neuron tuned to the stem-ellipsoidal base transition of the preferred stimulus. The model neuron is at the top of a simplified version of the model shown in Fig. 2-2, where there are only two types of S1 features at each position in the receptive field, tuned to the left and right side of the transition region, resp., which feed into C1 units that pool using a MAX function (blue bars) or a SUM function (red bars). The model neuron is connected to these C1 units so that its response is maximal when the experimental neuron’s preferred stimulus is in its receptive field.

    final stages of the model, then pooling by using a weighted sum, as in the RBF network [68], where

    VTUs tuned to different viewpoints were combined to interpolate between the stored views, is

    advantageous.

    MAX-like mechanisms at some stages of the circuitry appear to be compatible with recent neu-

    rophysiological data. For instance, it has been reported [89] that when two stimuli are brought into

    the receptive field of an IT neuron, that neuron’s response appears to be dominated by the stimulus

    that produces a higher firing rate when presented in isolation to the cell — just as expected if a

    MAX-like operation is performed at the level of this neuron or its afferents. Theoretical investiga-

    tions into possible pooling mechanisms for V1 complex cells also support a maximum-like pooling

    mechanism (K. Sakai & S. Tanaka, Soc. Neurosci. Abs., 23, 453, 1997). Additional indirect support

    for a MAX mechanism comes from studies using a “simplification procedure” [113] or “complexity

    reduction” [47] to determine the preferred features of IT cells, i.e., the stimulus components that are

    responsible for driving the cell. These studies commonly find a highly nonlinear tuning of IT cells

    (Fig. 2-3 (a)). Such tuning is compatible with the MAX response function (Fig. 2-3 (b), blue bars).

    Note that a linear model (Fig. 2-3 (b), red bars) cannot reproduce this strong response change for

    small changes in the input image.

    In our model of view-tuned units (Fig. 2-2), the two types of operations, scanning and template


    matching, are combined in a hierarchical fashion to build up complex, invariant feature detectors

    from small, localized, simple cell-like receptive fields in the bottom layer which receive input from

    the model “retina.” There need not be a strict alternation of these two operations: connections can

    skip levels in the hierarchy, as in the direct C1→C2 connections of the model in Fig. 2-2.

    The question remains whether the proposed model can indeed achieve response selectivity and

    invariance compatible with the results from physiology. To investigate this question, we looked at

    the invariance properties of 21 view-tuned units in the model, each tuned to a view of a different,

    randomly selected paperclip, as used in the experiment [49].

    Figure 2-4 shows the response of one model view-tuned unit to 3D rotation, scaling and trans-

    lation around its preferred view (see Methods). The unit responds maximally to the training view,

    with the response gradually falling off as the stimulus is transformed away from the training view.

    As in the experiment, we can determine the invariance range of the VTU by comparing the response

    to the preferred stimulus to the responses to the 60 distractors. The invariance range is then defined

    as the range over which the model unit’s response is greater than to any of the distractor objects.

    Thus, the model VTU shown in Fig. 2-4 shows rotation invariance of 24°, scale invariance of 2.6 octaves and translation invariance of 4.7° of visual angle. Averaging over all 21 units, we obtain average rotation invariance over 30.9°, scale invariance over ��� octaves and translation invariance over 4.6°.
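    The invariance criterion used here is easy to state operationally (a sketch with invented numbers; the actual tuning curves and distractor responses are those of Fig. 2-4): a transformed view lies inside the invariance range as long as the unit responds to it more strongly than to its best distractor:

    import numpy as np

    def invariance_range(transform_values, responses, distractor_responses):
        # width of the contiguous region around the peak in which the unit's
        # response exceeds its response to the best (most activating) distractor
        threshold = float(np.max(distractor_responses))
        above = responses > threshold
        lo = hi = int(np.argmax(responses))
        while lo > 0 and above[lo - 1]:
            lo -= 1
        while hi < len(responses) - 1 and above[hi + 1]:
            hi += 1
        return transform_values[hi] - transform_values[lo]

    angles = np.arange(-60, 61, 4)                       # rotation away from the training view (deg)
    responses = np.exp(-(angles / 20.0) ** 2)            # hypothetical VTU tuning curve
    distractors = np.full(60, 0.3)                       # hypothetical responses to 60 distractors
    print(invariance_range(angles, responses, distractors), "degrees")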

    Units show invariance around the training view, over a range in good agreement with the exper-

    imentally observed values. Some units (5/21), an example of which is given in Fig. 2-4 (d), show

    tuning also for pseudo-mirror views (obtained by rotating the preferred paperclip by 180° in depth,

    which produces a pseudo-mirror view of the object due to the paperclips’ minimal self-occlusion),

    as observed in some experimental neurons [49].

    While the simulation and experimental data presented so far dealt with object recognition set-

    tings in which one object was presented in isolation, this is rarely the case in normal object recogni-

    tion settings. More commonly, the object to be recognized is situated in front of some background

    or appears together with other objects, all of which are to be ignored if the object is to be recognized

    successfully. More precisely, in the case of multiple objects in the receptive field, the responses of

    the afferents feeding into a VTU tuned to a certain object should be affected as little as possible by

    the presence of other “clutter objects.”

    The MAX response function posited above for the pooling mechanism to achieve invariance has

    the right computational properties to perform recognition in clutter: If the VTU’s preferred object

    strongly activates the VTU’s afferents, then it is unlikely that other objects will interfere, as they

    tend to activate the afferents less and hence will not usually influence the response due to the MAX

    response function. In some cases (such as when there are occlusions of the preferred feature, or one

    of the “wrong” afferents has a higher activation) clutter, of course, can affect the value provided


    Figure 2-4: Responses of a sample model neuron to different transformations of its preferred stimulus. The different panels show the same neuron’s response to (a) varying stimulus sizes (inset shows response to 60 distractor objects, selected randomly from the paperclips used in the physiology experiments [49]), (b) rotation in depth and (c) translation. Training size was 64 × 64 pixels, corresponding to 2° of visual angle. (d) shows another neuron’s response to pseudo-mirror views (cf. text), with the dashed line indicating the neuron’s response to the “best” distractor.


    by the MAX mechanism, thereby reducing the quality of the match at the final stage and thus

    the strength of the VTU response. It is clear that to achieve the highest robustness to clutter, a

    VTU should only receive input from cells that are strongly activated (i.e., that are relevant to the

    definition of the object) by its preferred stimulus.

    In the version of the model described so far, the penultimate layer contained only 10 cells corre-

    sponding to 10 different features, which turned out to be sufficient to achieve invariance properties

    as found in the experiment. Each VTU in the top layer was connected to all the afferents and hence

    robustness to clutter is expected to be relatively low. Note that in order to connect a VTU to only the

    subset of the intermediate feature detectors it receives strong input from, the number of afferents

    should be large enough to achieve the desired response specificity.

    The straightforward solution is to increase the number of features. Even with a fixed number of

    different features in S1, the dictionary of S2 features can be expanded by increasing the number and

    type of afferents to individual S2 cells (see Methods). In this “many feature” version of the model,

    the invariance ranges for a low number of afferents are already comparable to the experimental

    ranges — if each VTU is connected to the 40 (out of 256) C2 cells that are most strongly excited

    by its preferred stimulus, model VTUs show an average scale invariance over ��� octaves, rotation invariance over 36.2° and translation invariance over 4.4°. For the maximum of 256 afferents to each cell, cells are rotation invariant over an average of 47°, scale invariant over 2.4 octaves and translation invariant over 4.7°.
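    A minimal sketch of this “many feature” wiring (illustrative; the Gaussian tuning width and the random C2 activities are assumptions, not model parameters from this chapter): each VTU is connected only to the k C2 units that respond most strongly to its preferred stimulus, and is tuned to that stored activity pattern:

    import numpy as np

    def select_vtu_afferents(c2_to_preferred, k=40):
        # indices of the k C2 units most strongly excited by the VTU's preferred stimulus
        return np.argsort(c2_to_preferred)[::-1][:k]

    def vtu_response(c2_activity, afferent_idx, stored_pattern, sigma=0.2):
        # Gaussian (RBF-like) tuning of the VTU to its stored pattern of C2 activities
        d2 = np.sum((c2_activity[afferent_idx] - stored_pattern) ** 2)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    rng = np.random.default_rng(1)
    c2_preferred = rng.uniform(size=256)                 # C2 activities for the preferred clip
    idx = select_vtu_afferents(c2_preferred, k=40)
    stored = c2_preferred[idx]                           # pattern the VTU is tuned to
    print(vtu_response(c2_preferred, idx, stored))       # 1.0 for the preferred stimulus itself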

    Simulations show [81] that this model is capable of performing recognition in context: Using

    displays as inputs that contain the neuron’s preferred clip as well as another, distractor, clip, the

    model is able to correctly recognize the preferred clip in 90% of the cases (for 40/256 afferents to

    each neuron, the maximum rate is 94% for 18 afferents, dropping to 55% for 256/256 afferents,

    compared to 40% in the original version of the model with 10 C2 units), i.e., the addition of the

    second clip interfered with the activation caused by the first clip alone so much that in 10% of the

    cases the response to the two clip display containing the preferred clip fell below the response to

    one of the distractor clips. This reduction of the response to the two-stimulus display compared to

    the response to the stronger stimulus alone has also been found in experimental studies [86, 89].

    The question of object recognition in the presence of a background object was explored ex-

    perimentally in a recent study [54], where a monkey had to discriminate (polygonal) foreground

    objects irrespective of the (polygonal) background they appeared with. Recordings of IT neurons

    showed that for the stimulus/background condition, neuronal response on average was reduced

    to a quarter of the response to the foreground object alone, while the monkey’s behavioral perfor-

    mance dropped much less. This is compatible with simulations in the model [81] that show that

    even though a unit’s firing rate is strongly affected by the addition of the background pattern, it is

    still in most cases well above the firing rate evoked by distractor objects, allowing the foreground


    Figure 2-5: Average neuronal responses of neurons in the many feature version of the model to scrambled stimuli. (a) Example of a scrambled stimulus. The images (��� � ��� pixels) were created by subdividing the preferred stimulus of each neuron into 4, 16, 64, and 256, resp., “tiles” and randomly shuffling the tiles to create a scrambled image. (b) Average response of the 21 model neurons (with 40/256 afferents, as above) to the scrambled stimuli (solid blue curve), in comparison to the average normalized responses of IT neurons to scrambled stimuli (scrambled pictures of trees) reported in a very recent study [108] (dashed green curve).

    object to be recognized successfully.

    Our model relies on decomposing images into features. Should it then be fooled into confusing

    a scrambled image with the unscrambled original? Superficially, one may be tempted to guess that

    scrambling an image in pieces larger than the features should indeed fool the model. Simulations

    (see Fig. 2-5) show that this is not the case. The reason lies in the large dictionary of filters/features

    used that makes it practically impossible to scramble the image in such a way that all features are

    preserved, even for a low number of features. Responses of model units drop precipitously as

    the image is scrambled into progressively finer pieces, as confirmed very recently in a physiology

    experiment [108] of which we became aware after obtaining this prediction from the model.
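    The scrambling procedure itself is simple (a sketch; the tile counts follow Fig. 2-5, while the image size and the stimulus array are placeholders):

    import numpy as np

    def scramble(image, n_tiles, rng=None):
        # cut a square image into n_tiles equal square tiles (n_tiles must be a square
        # number whose root divides the image side) and reassemble them in random order
        rng = rng or np.random.default_rng()
        side = int(round(np.sqrt(n_tiles)))
        h = image.shape[0] // side
        tiles = [image[r*h:(r+1)*h, c*h:(c+1)*h] for r in range(side) for c in range(side)]
        order = rng.permutation(len(tiles))
        rows = [np.hstack([tiles[order[r*side + c]] for c in range(side)]) for r in range(side)]
        return np.vstack(rows)

    stimulus = np.random.rand(160, 160)                  # stand-in for a preferred stimulus image
    for n in (4, 16, 64, 256):
        print(n, "tiles ->", scramble(stimulus, n).shape)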

    2.3 Discussion

    We briefly outline the computational roots of the hierarchical model we described, how the MAX

    operation could be implemented by cortical circuits and remark on the role of features and invari-

    ances in the model.

    A key operation in several recent computer vision algorithms for the recognition and classifi-

    cation of objects [87, 92] is to scan a window across an image, through both position and scale, in

    order to analyze at each step a subimage – for instance by providing it to a classifier that decides

    whether the subimage represents the object of interest. Such algorithms have been successful in

    achieving invariance to image plane transformations such as translation and scale. In addition, this

    brute force scanning strategy eliminates the need to segment the object of interest before recogni-

    tion: segmentation, even in complex and cluttered images, is routinely achieved as a byproduct of


    recognition. The computational assumption that originally motivated the model described in this

    paper was indeed that a MAX-like operation may represent the cortical equivalent of the “window

    of analysis” in machine vision to scan through and select input data. Unlike a centrally controlled

    sequential scanning operation, a mechanism like the MAX operation that locally and automatically

    selects a relevant subset of inputs seems biologically plausible. A basic and pervasive operation

    in many computational algorithms — not only in computer vision — is the search and selection

    of a subset of data. Thus it is natural to speculate that a MAX-like operation may be replicated

    throughout the cortex.

    Simulations of a simplified two-layer version of the model [79] using soft-maximum approxima-

    tions to the MAX operation (see Methods) where the strength of the nonlinearity can be adjusted

    by a parameter show that its basic properties are preserved and structurally robust. But how is an

    approximation of the MAX operation realized by neurons? It seems that it could be implemented

    by several different, biologically plausible circuitries [1, 13, 17, 32, 44]. The most likely hypothesis

    is that the MAX operation arises from cortical microcircuits of lateral, possibly recurrent, inhibition

    between neurons in a cortical layer. An example is provided by the circuit proposed for the gain-

    control and relative motion detection in the visual system of the fly [76], based on feedforward

    (or recurrent) shunting presynaptic (or postsynaptic) inhibition by “pool” cells. One of its key

    elements, in addition to shunting inhibition (an equivalent operation may be provided by linear

    inhibition deactivating NMDA receptors), is a nonlinear transformation of the individual signals

    due to synaptic nonlinearities or to active membrane properties. The circuit performs a gain control

    operation and — for certain values of the parameters — a MAX-like operation. “Softmax” circuits

    have been proposed in several recent studies [34, 45, 61] to account for similar cortical functions.

    Together with adaptation mechanisms (underlying very short-term depression [1]), the circuit may

    be capable of pseudo-sequential search in addition to selection.

    Our novel claim here is that a MAX-like operation is a key mechanism for object recognition

    in the cortex. The model described in this paper — including the stage from view-tuned to view-

invariant units [68] — is a purely feedforward hierarchical model. Backprojections – well known to exist abundantly in cortex and to play a key role in other models of cortical function [59, 75] –

    are not needed for its basic performance but are probably essential for the learning stage and for

    known top-down effects — including attentional biases [77] — on visual recognition, which can be

    naturally grafted into the inhibitory softmax circuits (see [61]) described earlier.

    In our model, recognition of a specific object is invariant for a range of scales (and positions) af-

    ter training with a single view at one scale, because its representation is based on features invariant

    to these transformations. View invariance on the other hand requires training with several views

    [68] because individual features sharing the same 2D appearance can transform very differently

    under 3D rotation, depending on the 3D structure of the specific object. Simulations show that the


model’s performance is not specific to the class of paperclip objects: recognition results are similar for, e.g., computer-rendered images of cars (and other objects).

    From a computational point of view the class of models we have described can be regarded

    as a hierarchy of conjunctions and disjunctions. The key aspect of our model is to identify the

    disjunction stage with the build-up of invariances and to do it through a MAX-like operation. At

    each conjunction stage the complexity of the features increases and at each disjunction stage so does

    their invariance. At the last level – of the C2 layer in the paper – it is only the presence and strength

    of individual features and not their relative geometry in the image that matters. The dictionary

    of features at that stage is overcomplete, so that the activities of the units measuring each feature

    strength, independently of their precise location, can still yield a unique signature for each visual

    pattern (cf. the SEEMORE system [52]).

    The architecture we have described shows that this approach is consistent with available exper-

    imental data and maps it into a class of models that is a natural extension of the hierarchical models

    first proposed by Hubel and Wiesel.

    2.4 Methods

Basic model parameters. Patterns on the model “retina” (of 160 × 160 pixels — which corresponds to a 5° receptive field size (the literature [41] reports an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) — are first filtered through a layer (S1) of simple cell-like receptive fields (first derivative of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with standard deviations of 1.75 to 7.25 pixels in steps of 0.5 pixels; S1 filter responses were rectified dot products with the image patch falling into their receptive field, i.e., the output $s^1_j$ of an S1 cell with preferred stimulus $w_j$ whose receptive field covers an image patch $I_j$ is $s^1_j = |w_j \cdot I_j|$). Receptive field (RF) centers densely sample the input retina. Cells in the next (C1) layer each pool S1 cells (using the MAX response function, i.e., the output $c^1_i$ of a C1 cell with afferents $s^1_j$ is $c^1_i = \max_j s^1_j$) of the same orientation over eight pixels of the visual field in each dimension and all scales.

    This pooling range was chosen for simplicity — invariance properties of cells were robust for different choices

    of pooling ranges (cf. below). Different C1 cells were then combined in higher layers, either by combining

    C1 cells tuned to different features to give S2 cells responding to co-activations of C1 cells tuned to different

    orientations or to yield C2 cells responding to the same feature as the C1 cells but with bigger receptive fields.

In the simple version illustrated here, the S2 layer contains six features (all pairs of orientations of C1 cells looking at the same part of space) with Gaussian transfer function ($\sigma = 1$, centered at 1), i.e., the response $s^2_k$ of an S2 cell receiving input from C1 cells $c^1_m$, $c^1_n$ with receptive fields in the same location but responding to different orientations is $s^2_k = \exp\!\left(-\left((c^1_m - 1)^2 + (c^1_n - 1)^2\right)/(2\sigma^2)\right)$, yielding a total of 10 cells in the C2 layer.

    Here, C2 units feed into the view-tuned units, but in principle, more layers of S and C units are possible.
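
For concreteness, the unit response functions just defined can be written out compactly. The following Python sketch covers only the single-unit operations (spatial wiring, filter construction and pooling ranges are omitted) and is an illustration under these simplifications rather than the original simulation code:

```python
import numpy as np

def s1_response(w, patch):
    """S1 'simple cell': rectified dot product of the filter w with the image patch, s1 = |w . I|."""
    return abs(float(np.sum(w * patch)))

def c1_response(s1_values):
    """C1 'complex cell': MAX over afferent S1 cells of the same orientation
    at different positions and scales, c1 = max_j s1_j."""
    return float(np.max(s1_values))

def s2_response(c1_m, c1_n, sigma=1.0):
    """S2 'composite feature' cell: Gaussian of two C1 afferents tuned to different
    orientations, centered at 1 (maximal when both afferents are fully active)."""
    return float(np.exp(-((c1_m - 1.0) ** 2 + (c1_n - 1.0) ** 2) / (2.0 * sigma ** 2)))

def c2_response(s2_values):
    """C2 cell: MAX over S2 cells of the same type at different positions and scales."""
    return float(np.max(s2_values))

# Tiny usage example with a zero-sum, square-normalized 3 x 3 filter.
w = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
w /= np.linalg.norm(w)
print(s1_response(w, w))                              # ~1.0: the patch matches the filter
print(s2_response(1.0, 1.0), s2_response(0.2, 0.1))   # 1.0 vs. a much weaker response
```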

    In the version of the model we have simulated, object specific learning occurs only at the level of the

    synapses on the view-tuned cells at the top. More complete simulations will have to account for the effect of

    visual experience on the exact tuning properties of other cells in the hierarchy.

    Testing the invariance of model units. View-tuned units in the model were generated by recording

    the activity of units in the C2 layer feeding into the VTUs to each one of the 21 paperclip views and then

    setting the connecting weights of each VTU, i.e., the center of the Gaussian associated with the unit, resp., to

the corresponding activation. For rotation, viewpoints from 50° to 130° were tested (the training view was arbitrarily set to 90°) in steps of 4°. For scale, stimulus sizes from 16 to 160 pixels in half-octave steps (except for the last step, which was from 128 to 160 pixels) and, for translation, independent translations along each axis in steps of 16 pixels were used.
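
Schematically, the construction of a view-tuned unit and the recognition criterion used in these invariance tests can be summarized as follows. This is an illustrative sketch only: the Gaussian width and all names are placeholders, and the C2 activity vectors are assumed to be supplied by the lower layers of the model.

```python
import numpy as np

def make_vtu(c2_training_view):
    """A view-tuned unit is a Gaussian RBF unit whose center is set to the C2
    activation pattern evoked by the single training view (no iterative learning)."""
    return np.asarray(c2_training_view, dtype=float)

def vtu_response(c2_vector, center, sigma=0.3):
    # Gaussian tuning around the stored pattern; the width is a placeholder value.
    d2 = float(np.sum((np.asarray(c2_vector, dtype=float) - center) ** 2))
    return float(np.exp(-d2 / (2.0 * sigma ** 2)))

def fraction_recognized(c2_transformed_views, c2_distractors, center, sigma=0.3):
    """A transformed view of the training object counts as recognized if it drives the
    VTU more strongly than the best distractor object does."""
    threshold = max(vtu_response(d, center, sigma) for d in c2_distractors)
    hits = sum(vtu_response(v, center, sigma) > threshold for v in c2_transformed_views)
    return hits / len(c2_transformed_views)
```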

    “Many feature” version. To increase the robustness to clutter of model units, the number of features in

    S2 was increased: Instead of the previous maximum of two afferents of different orientation looking at the

    same patch of space as in the version described above, each S2 cell now received input from four neighboring

C1 units (in a 2 × 2 arrangement) of arbitrary orientation, giving a total of $4^4 = 256$ different S2 types and

    finally 256 C2 cells as potential inputs to each view-tuned cell (in simulations, top level units were sparsely

    connected to a subset of C2 layer units to gain robustness to clutter, cf. Results). As S2 cells now combined C1

    afferents with receptive fields at different locations, and features a certain distance apart at one scale change

    their separation as the scale changes, pooling at the C1 level was now done in several scale bands, each of

    roughly a half-octave width in scale space (filter standard deviation ranges were 1.75–2.25, 2.75–3.75, 4.25–

    5.25, and 5.75–7.25 pixels, resp.) and the spatial pooling range in each scale band chosen accordingly (over

square neighborhoods whose linear size increased with the scale band — note that system performance was robust

    with respect to the pooling ranges, simulations with neighborhoods of twice the linear size in each scale

    band produced comparable results, with a slight drop in the recognition of overlapping stimuli, as expected),

    as a simple way to improve scale-invariance of composite feature detectors in the C2 layer. Also, centers

    of C1 cells were chosen so that RFs overlapped by half a RF size in each dimension. A more principled way


would be to learn the invariant feature detectors, e.g., using the trace rule [23]. The straightforward connection

    patterns used here, however, demonstrate that even a simple model shows tuning properties comparable to

    the experiment.
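
The count of 256 S2 types follows from assigning one of the four C1 orientations to each of the four positions in the 2 × 2 neighborhood; a short, purely illustrative check:

```python
from itertools import product

orientations = (0, 45, 90, 135)   # the four C1 orientations, in degrees
# One S2 type per assignment of an orientation to each cell of the 2 x 2 C1 neighborhood.
s2_types = list(product(orientations, repeat=4))
assert len(s2_types) == 4 ** 4 == 256   # one C2 cell per S2 type
```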

    Softmax approximation. In a simplified two-layer version of the model [79] we investigated the effects of

    approximations to the MAX operations on recognition performance. The model contained only one pooling

    stage, C1, where the strength of the pooling nonlinearity could be controlled by a parameter, p. There, the

output $c^1_i$ of a C1 cell with afferents $x_j$ was

$$c^1_i = \sum_j \frac{\exp(p\,|x_j|)}{\sum_k \exp(p\,|x_k|)}\, x_j,$$

which performs a linear summation (scaled by the number of afferents) for $p = 0$ and the MAX operation for $p \to \infty$.
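
A small numerical sketch of this pooling function and its two limiting cases (an illustration with made-up activity values, not the original simulation code):

```python
import numpy as np

def soft_max_pool(x, p):
    """Softmax pooling of afferent activities x: a weighted sum in which each afferent's
    weight is exp(p * |x_j|), normalized over the pool. p = 0 gives the average of the
    afferents (the sum scaled by the number of afferents); large p approaches the MAX."""
    x = np.asarray(x, dtype=float)
    w = np.exp(p * np.abs(x))
    return float(np.sum(w / w.sum() * x))

acts = [0.1, 0.4, 0.9]
print(soft_max_pool(acts, 0.0))    # ~0.47: the scaled linear sum
print(soft_max_pool(acts, 50.0))   # ~0.90: essentially the MAX of the afferents
```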

    2.5 Acknowledgments

    Supported by grants from ONR, Darpa, NSF, ATR, and Honda. M.R. is supported by a Merck/MIT

    Fellowship in Bioinformatics. T.P. is supported by the Uncas and Helen Whitaker Chair at the

    Whitaker College, MIT. We are grateful to H. Bülthoff, F. Crick, B. Desimone, R. Hahnloser, C. Koch,

    N. Logothetis, E. Miller, J. Pauls, D. Perrett, J. Reynolds, T. Sejnowski, S. Seung, and R. Vogels for

    very useful comments and for reading earlier versions of this manuscript. We thank J. Pauls for

    analyzing the average invariance ranges of his IT neurons and K. Tanaka for the permission to

    reproduce Fig. 2-3 (a).


Chapter 3

    Are Cortical Models Really Bound by

    the “Binding Problem”?

    Abstract

    The usual description of visual processing in cortex is an extension of the simple to complex

    hierarchy postulated by Hubel and Wiesel — a feedforward sequence of more and more complex

    and invariant features. The capability of this class of models to perform higher level visual

    processing such as viewpoint-invariant object recognition in cluttered scenes has been questioned

    in recent years by several researchers, who in turn proposed an alternative class of models based

    on the synchronization of large assemblies of cells, within and across cortical areas. The main

    implicit argument for this novel and controversial view was the assumption that hierarchical

    models cannot deal with the computational requirements of high level vision and suffer from

    the so-called “binding problem”. We review the present situation and discuss theoretical and

experimental evidence showing that the perceived weaknesses of hierarchical models do not hold.

    In particular, we show that recognition of multiple objects in cluttered scenes, arguably among

    the most difficult tasks in vision, can be done in a hierarchical feedforward model.


3.1 Introduction: Visual Object Recognition

    Two problems make object recognition difficult:

    1. The segmentation problem: Visual scenes normally contain multiple objects. To recognize in-

    dividual objects, features must be isolated from the surrounding clutter and extracted from

    the image, and the feature set must be parsed so that the different features are assigned to the

    correct object. The latter problem is commonly referred to as the “Binding Problem” [110].

    2. The invariance problem: Objects have to be recognized under varying viewpoints, lighting

    conditions etc.

Interestingly, the human brain solves these problems quickly and with ease. Thorpe et al. [101]

    report that visual processing in an object detection task in complex visual scenes can be achieved

    in under 150 ms, which is on the order of the latency of the signal transmission from the retina

    to inferotemporal cortex (IT), the highest area in the ventral visual stream thought to have a key

    role in object recognition [103]; see also [72]. This impressive processing speed presents a strong

    constraint for any model of object recognition.

3.2 Models of Visual Object Recognition and the Binding Problem

    Hubel and Wiesel [37] were the first to postulate a model of visual object representation and recog-

    nition. They recorded from simple and complex cells in the primary visual cortices of cats and mon-

    keys and found that while both types preferentially responded to bars of a certain orientation, the

    former had small receptive fields with a phase-dependent response while the latter had bigger re-

    ceptive fields and showed no phase-dependence. This observation led them to hypothesize that

    complex cells receive input from several simple cells. Continuing this model in a straightforward

    fashion, they suggested [36] that the visual system is composed of a hierarchy of visual areas, from

    simple cells all the way up to “higher order hypercomplex cells.”

    Later studies [7] of macaque inferotemporal cortex (IT) described neurons tuned to views of

    complex objects such as a face, i.e., the cells discharged strongly to a face seen from a specific

    viewpoint but very little or not at all to other objects. A key property of these cells was their

    scale and translation invariance, i.e., the robustness of their firing to stimulus transformations such as

    changes in size or position in the visual field.

    These findings inspired various models of visual object recognition such as Fukushima’s Neocog-

    nitron [25] or, later, Perrett and Oram’s [64] outline of a model of shape processing, and Wallis and


Rolls’ VisNet [111], all of which share the basic idea of the visual system as a feedforward process-

    ing hierarchy where invariance ranges and complexity of preferred features grow as one ascends

    through the levels.

    Models of this type prompted von der Malsburg [109] to formulate the binding problem. His

    claim was that visual representations based on spatially invariant feature detectors were ambigu-

    ous: “As generalizations are performed independently for each feature, information about neigh-

    borhood relations and relative position, size and orientation is lost. This lack of information can

    lead to the inability to distinguish between patterns that are composed of the same set of invariant

    features. . . ” [110]. Moreover, as a visual scene containing multiple objects is represented by a set of

    feature activations, a second problem lies in “singling out appropriate groups from the large back-

    ground of possible combinations of active neurons” [110]. These problems would manifest them-

    selves in various phenomena such as hallucinations (the feature sets activated by objects actually

    present in the visual scene combine to yield the activation pattern characteristic of another object)

    and the figure-ground problem (the inability to correctly assign image features to foreground ob-

    ject and background), leading von der Malsburg to postulate the necessity of a special mechanism,

    the synchronous oscillatory firing of ensembles of neurons, to bind features belonging to one object

    together.

    One approach to avoid these problems was presented by Olshausen et al. [62]: Instead of trying

    to process all objects simultaneously, processing is limited to one object in a certain part of space

    at a time, e.g., through “focussing attention” on a region of interest in the visual field, which is

    then routed through to higher visual areas, ignoring the remainder of the visual field. The control

signal for the input selection in this model is thought to be provided in the form of the output of a

    “blob-search” system that identifies possible candidates in the visual scene for closer examination.

    While this top-down approach to circumvent the binding problem has intuitive appeal and is com-

    patible with physiological studies that report top-down attentional modulation of receptive field

properties (see the article by Reynolds & Desimone in this issue, or the recent study by Connor

    et al. [14]), such a sequential approach seems to be difficult to reconcile with the apparent speed

    with which object recognition can proceed even in very complex scenes containing many objects

    [72, 101], and is also incompatible with reports of parallel processing of visual scenes, as observed

in pop-out experiments [102], suggesting that object recognition does not depend only on

    explicit top-down selection in all situations.

    A more head-on approach to the binding problem was taken in other studies that have called

    into question the assumption that representations based on sets of spatially invariant feature detec-

    tors are inevitably ambiguous. Starting with Wickelgren [114] in the context of speech recognition,

    several studies have proposed how coding an object through a set of intermediate features, made

    up of local arrangements of simpler features (e.g., using letter pairs, or higher order combinations,


instead of individual letters to code words — for instance, the word “tomaso” could be confused

    with the word “somato” if both are coded by the sets of letters they are made up of; this ambiguity

    is resolved, however, if they are represented through letter pairs) can sufficiently constrain the rep-

    resentation to uniquely code complex objects without retaining global positional information (see

    Mozer [58] for an elaboration of this idea and an implementation in the context of word recogni-

    tion). The capabilities of such a representation based on spatially-invariant receptive fields were

    recently analyzed in detail by Mel & Fiser [53] for the example domain of English text.
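
The word example can be checked directly: a code based only on the set of letters cannot distinguish “tomaso” from “somato”, whereas a code based on adjacent letter pairs can. A minimal illustration (the helper names are ours):

```python
def letter_set(word):
    return set(word)

def letter_pairs(word):
    # Adjacent letter pairs ("bigrams"): the simplest local feature combination.
    return {word[i:i + 2] for i in range(len(word) - 1)}

a, b = "tomaso", "somato"
print(letter_set(a) == letter_set(b))      # True: the single-letter code confuses the two words
print(letter_pairs(a) == letter_pairs(b))  # False: the letter-pair code tells them apart
```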

    In the visual domain, Mel [52] recently presented a model to perform invariant recognition of

    a high number (100) of objects of different types, using a representation based on a large number

    of feature channels. While the model performed surprisingly well for a variety of transformations,

recognition performance depended strongly on color cues, and did not seem as robust to scale

    changes as experimental neurons [49]. Perrett & Oram [65] have recently outlined a conceptual

    model based on very similar ideas of how a representation based on feature combinations could in

    theory avoid the “Binding Problem”, e.g., by coding a face through a set of detectors for combina-

    tions of face parts such as eye-nose or eyebrow-hairline. What has been lacking so far, however, is

    a computational implementation quantitatively demonstrating that such a model can actually per-

    form “real-world” subordinate visual object recognition to the extent observed in behavioral and

    physiological experiments [48, 49, 54, 89], where effects such as scale changes, occlusion and over-

    lap pose additional problems not found in an idealized text environment. In particular, unlike in

    the text domain where the input consists of letter strings and the extraction of features (letter com-

    binations) from the input is therefore trivial, the crucial task of invariant feature extraction from

    the image is nontrivial for scenes containing complex shapes, especially when multiple objects are

    present.

    We have developed a hierarchical feedforward model of object recognition in cortex, described

    in [82], as a plausibility proof that such a model can account for several properties of IT cells, in

    particular the invariance properties of IT cells found by Logothetis et al. [49]. In the following,

    we will show that such a simple model can perform invariant recognition of complex objects in

    cluttered scenes and is compatible with recent physiological studies. This is a plausibility proof

    that complex oscillation-based mechanisms are not necessarily required for these tasks and that the

    binding problem seems to be a problem for only some models of object recognition.

    3.3 A Hierarchical Model of Object Recognition in Cortex

    Studies of receptive field properties along the ventral visual stream in the macaque, from primary

    visual cortex, V1, to anterior IT report an overall trend of an increase of average feature complexity

    and receptive field size throughout the stream [41]. While simple cells in V1 have small localized


Figure 3-1: (a) Cartoon of the Poggio and Edelman model [68] of view-based object recognition. The gray ovals correspond to view-tuned units that feed into a view-invariant unit (open circle). (b) Tuning curves of the view-tuned (gray) and the view-invariant units (black).

    receptive fields and respond preferentially to simple shapes like bars, cells in anterior IT have been

    found to respond to views of complex objects while showing great tolerance to scale and position

    changes. Moreover, some IT cells seem to respond to objects in a view-invariant manner [6, 49, 66].

    Our model follows this general framework. Previously, Poggio and Edelman [68] presented

    a model of how view-invariant cells could arise from view-tuned cells (Fig. 3-1). However, they

    did not describe any model of how the view-tuned units (VTUs) could come about. We have re-

    cently developed a hierarchical model that closes this gap and shows how VTUs tuned to complex

    features can arise from simple cell-like inputs. A detailed description of our model can be found

    in [82] (for preliminary accounts refer to [79, 80], and also to [43]). We briefly review here some

    of its main properties. The central idea of the model is that invariance to scaling and translation

    and robustness to clutter on one hand, and feature complexity on the other hand require different

    transfer functions, i.e., mechanisms by which a neuron combines its inputs to arrive at an output

    value: While for the latter a weighted sum of different features, which makes the neuron respond

    preferentially to a specific activity pattern over its afferents, is a suitable transfer function, increas-

    ing invariance requires a different transfer function that pools over different afferents tuned to the

    same feature but transformed to different degrees (e.g., at different scales to achieve scale invari-

    ance). A suitable pooling function (for a computational justification, see [82]) is a so-called MAX

    function, where the output of the neuron is determined by the strongest afferent, thus performing a “scan-

    ning” operation over afferents tuned to different positions and scales. This is similar to the original

    Hubel and Wiesel model of a complex cell receiving input from simple cells at different locations

    to achieve phase-invariance.

    In our model of object recognition in cortex (Fig. 3-2), the two types of operations, selection and

    template matching, are combined in a hierarchical fashion to build up complex, invariant feature

    detectors from small, localized, simple cell-like receptive fields in the bottom layer. In particular,


patterns on the model “retina” (of 160 × 160 pixels — which corresponds to a 5° receptive field size ([41] report an average V4 receptive field size of 4.4°) if we set 32 pixels = 1°) are first fil-

    tered through a layer (S1, adopting Fukushima’s nomenclature [25] of referring to feature-building

    cells as “S” cells and pooling cells as “C” cells) of simple cell-like receptive fields (first derivative

of Gaussians, zero-sum, square-normalized to 1, oriented at 0°, 45°, 90°, and 135°, with standard devia-

    tions of 1.75 to 4.75 pixels in steps of 0.5 pixels. S1 filter responses are absolute values of the image

    “filtered” through the units’ receptive fields (more precisely, the rectified dot product of the cells’

    receptive field with the corresponding image patch). Receptive field centers densely sample the

    input retina. Cells in the next layer (C1) each pool S1 cells of the same orientation over a range of

scales and positions. Filters were grouped in four bands of neighboring filter sizes, and sampling over position was done over patches whose linear dimensions increased with the filter band; patches overlapped by half in each direction to obtain more

    invariant cells responding to the same features as the S1 cells. Different C1 cells were then com-

    bined in higher layers — the figure illustrates two possibilities: either by combining C1 cells tuned

    to different features to give S2 cells responding to co-activations of C1 cells tuned to different orien-

    tations or to yield C2 cells responding to the same feature as the C1 cells but with bigger receptive

    fields (i.e., the hierarchy does not have to be a strict alternation of S and C layers). In the version

described in this paper, there were no direct C1 → C2 connections, and each S2 cell received input from four neighboring C1 units (in a 2 × 2 arrangement) of arbitrary orientation, yielding a total of $4^4 = 256$ different S2 cell types. S2 transfer functions were Gaussian ($\sigma = 1$, centered at 1). C2

    cells then pooled inputs from all S2 cells of the same type, producing invariant feature detectors

    tuned to complex shapes. Top-level view-tuned units had Gaussian response functions and each

    VTU received inputs from a subset of C2 cells (see below).

    This model had originally been developed to account for the transformation tolerance of view-

    tuned units in IT as recorded from by Logothetis et al. [49]. It turns out, however, that the model

    also has interesting implications for the binding problem.

    3.4 Binding without a problem

    To correctly recognize multiple objects in clutter, two problems must be solved: i) features must be

    robustly extracted, and ii) based on these features, a decision has to be made about which objects

    are present in the visual scene. The MAX operation can perform robust feature extraction (cf. [82]):

    A MAX pooling cell that receives inputs from cells tuned to the same feature at, e.g., different lo-

    cations, will select the most strongly activated afferent, i.e., its response will be determined by the

    afferent with the closest match to its preferred feature in its receptive field. Thus, the MAX mech-

    anism effectively isolates the feature of interest from the surrounding clutter. Hence, to achieve


Figure 3-2: Diagram of our hierarchical model [82] of object recognition in cortex. It consists of layers of linear units that perform a template match over their afferents (blue arrows), and of non-linear units that perform a “MAX” operation over their inputs, where the output is determined by the strongest afferent (green arrows). While the former operation serves to increase feature complexity, the latter increases invariance by effectively scanning over afferents tuned to the same feature but at different positions (to increase translation invariance) or scale (to increase scale invariance, not shown). In the version described in this paper, learning only occurred at the connections from the C2 units to the top-level view-tuned units.


robustness to clutter, a VTU should only receive input from cells that are strongly activated by the

    VTU’s preferred stimulus (i.e., those features that are relevant to the definition of the object) and

    thus less affected by clutter (which will tend to activate the afferents less and will therefore be ig-

    nored by the MAX response function). Also, in such a scheme, two view-tuned neurons receiving

    input from a common afferent feature detector will tend to both have strong connections to this fea-

ture detector. Thus, there will be only little interference even if the common feature detector only

    responded to one (the stronger) of the two stimuli in its receptive field due to its MAX response

    function. Note that the situation would be hopeless for a response function that pools over all affer-

    ents through, for example, a linear sum function: The response would always change when another

    object is introduced in the visual field, making it impossible to disentangle the activations caused

    by the individual stimuli without an additional mechanism such as, for instance, an attentional

    sculpting of the receptive field or some kind of segmentation process.
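
A toy numerical illustration of this point (the activity values are made up and not taken from the simulations): MAX pooling leaves the feature detector's output untouched by a weakly activating second object, whereas a linear sum changes as soon as anything else enters the receptive field.

```python
import numpy as np

# Activities of one feature detector's afferents (same feature, different positions).
target_alone = np.array([0.05, 0.90, 0.10])   # strong activation where the preferred object sits
with_clutter = np.array([0.40, 0.90, 0.35])   # a second object weakly activates other positions

print(np.max(target_alone), np.max(with_clutter))   # 0.9 -> 0.9: MAX response unchanged by clutter
print(np.sum(target_alone), np.sum(with_clutter))   # 1.05 -> 1.65: linear sum changes with clutter
```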

    In the following two sections we will show simulations that support these theoretical consider-

    ations, and we will compare them to recent physiological experiments.

    3.4.1 Recognition of multiple objects

    The ability of the model neurons to perform recognition of multiple, non-overlapping objects was

    investigated in the following experiment: 21 model neurons, each tuned to a view of a randomly

    selected paperclip object (as used in theoretical [68], psychophysical [8, 48], and physiological [49]

    studies on object recognition), were each presented with 21 displays consisting of that neuron’s

    preferred clip combined with each of the 21 preferred clips (in the upper left and lower right corner

of the model retina, resp., see Fig. 3-3 (a)), yielding 21 × 21 = 441 two-clip displays. Recognition perfor-

    mance was evaluated by comparing the neuron’s response to these displays with its responses to 60

    other, randomly chosen “distractor” paperclip objects (cf. Fig. 3-3). Following the studies on view-

    invariant object recognition [8, 48, 82], an object is said to be recognized if the neuron’s response

    to the two-clip displays (containing its preferred stimulus) is greater than to any of the distractor

    objects. For 40 afferents to each view-tuned cell (i.e., the 40 C2 units excited most strongly by the

    neuron’s preferred stimulus; this choice produced top-level neurons with tuning curves similar to

    the experimental neurons [82]), we find that on average in 90% of the cases recognition of the neu-

    ron’s preferred clip is still possible, indicating that there is little interference between the activations

    caused by the two stimuli in the visual field. The maximum recognition ra