
Neural Networks 24 (2011) 148–158


Acquisition of nonlinear forward optics in generative models: Two-stage ‘‘downside-up’’ learning for occluded vision

Satohiro Tajima a,b,∗, Masataka Watanabe c

a Nagano Station, Japan Broadcasting Corporation, 210-2, Inaba, Nagano-City, 380-8502, Japan
b Graduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwa-no-ha, Kashiwa-shi, Chiba 277-8561, Japan
c Faculty of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033, Japan

Article info

Article history:
Received 13 October 2009
Received in revised form 13 October 2010
Accepted 14 October 2010

Keywords:
Vision
Generative model
Predictive coding
Neural network
Occlusion
Learning
Developmental stage

Abstract

We propose a two-stage learning method which implements occluded visual scene analysis into a generative model, a type of hierarchical neural network with bi-directional synaptic connections. Here, top-down connections simulate forward optics to generate predictions for sensory driven low-level representation, whereas bottom-up connections function to send the prediction error, the difference between the sensory based and the predicted low-level representation, to higher areas. The prediction error is then used to update the high-level representation to obtain better agreement with the visual scene. Although the actual forward optics is highly nonlinear and the accuracy of simulated forward optics is crucial for these types of models, the majority of previous studies have only investigated linear and simplified cases of forward optics. Here we take occluded vision as an example of nonlinear forward optics, where an object in front completely masks out the object behind. We propose a two-staged learning method inspired by the staged development of infant visual capacity. In the primary learning stage, a minimal set of object basis is acquired within a linear generative model using the conventional unsupervised learning scheme. In the secondary learning stage, an auxiliary multi-layer neural network is trained to acquire nonlinear forward optics by supervised learning. The important point is that the high-level representation of the linear generative model serves as the input and the sensory driven low-level representation provides the desired output. Numerical simulations show that occluded visual scene analysis can indeed be implemented by the proposed method. Furthermore, considering the format of input to the multi-layer network and analysis of hidden-layer units leads to the prediction that whole object representation of partially occluded objects, together with complex intermediate representation as a consequence of nonlinear transformation from non-occluded to occluded representation, may exist in the low-level visual system of the brain.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Visual images are generated by photons which strike the retina after many reflections and refractions in an object-filled, three-dimensional world. This is a forward causal process that maps three-dimensional visual environments onto a two-dimensional retinal image (Marr, 1982). In contrast, recognition is a process which attempts to solve an inverse problem, i.e., to infer the three-dimensional world given the two-dimensional retinal image. To understand how the brain deals with this inverse problem is a fundamental challenge in vision science.

∗ Corresponding author at: Nagano Station, Japan Broadcasting Corporation, 210-2, Inaba, Nagano-City, 380-8502, Japan. Tel.: +81 26 291 5249, +81 4 7136 3914.

E-mail addresses: [email protected] (S. Tajima), [email protected] (M. Watanabe).

0893-6080/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.neunet.2010.10.004

One promising approach is the generative model framework (Dayan, Hinton, Neal, & Zemel, 1995; Hinton, Dayan, Frey, & Neal, 1995; Kawato, Hayakawa, & Inui, 1993; Mumford, 1992; Rao & Ballard, 1997, 1999). Here, top-down neural projections function to generate predictions about the two-dimensional input image given the high-level abstract representation. The difference between the top-down predicted input image and the actual sensory input, i.e. the prediction error, is then fed back via bottom-up connections to update the higher level representation and produce better agreement with the visual scene. These top-down and bottom-up processes are termed forward- and inverse-optics models, respectively (Kawato et al., 1993). Although the forward optics of the real environment is highly nonlinear, previous studies have not thoroughly investigated how nonlinear forward optics should be implemented in the generative model framework and how it can be learned from visual experiences.

In this paper, we introduce a learning method for occluded vision based on the generative model framework inspired by studies of human visual development. Occlusion is a typical example of nonlinear forward optics where objects in the front completely mask out overlapping objects behind. Developmental studies on infant vision suggest that the ability to recognize occluded visual scenes is acquired during postnatal development. Very young infants cannot infer the hidden parts of stationary occluded objects (Otsuka, Kanazawa, & Yamaguchi, 2006; Slater et al., 1990) and cannot comprehend that objects may occlude other objects (Csibra, 2001), although they are fully capable of recognizing individual objects. Multi-staged development can also be seen in other aspects of vision, such as motion and shape perception (e.g. Atkinson, 2000).

In this study, we adopt a two-stage approach to implement occluded vision analysis in a generative model. In the primary stage a linear generative model acquires the basic object features using a conventional unsupervised learning scheme (e.g. Rao & Ballard, 1997, 1999). In the secondary stage, the system acquired in the primary stage is utilized to train an auxiliary multi-layer network for acquisition of nonlinear processing in visual occlusion analysis under supervised learning. The high-level abstract representation in the linear generative model serves as the input, while the sensory driven low-level representation serves as the desired output to the auxiliary network (Fig. 1). The amount of change in synaptic efficacies is derived by the steepest descent of prediction errors. This is equivalent to applying the error backpropagation method to a neural network placed downside-up in the visual hierarchy. In this paper we focus on the secondary stage, since unsupervised learning of object basis in linear generative models has already been established (Rao & Ballard, 1997).

The remaining part of this paper is organized as follows: Section 2 describes the mathematical foundation of acquiring nonlinear forward optics. In Sections 3 and 4, we introduce our proposed model and present the results of numerical simulations that investigated its learning and recognition properties. Additionally, we analyze the hidden-layer representation of the multi-layer neural network for comparison with electrophysiological studies of the visual cortex. Section 5 discusses the biological plausibility of the current model and its relation to other theoretical studies.

2. Extension of the generative model framework for nonlinear forward optics

2.1. Overview of the linear generative model: unsupervised learning and inference

We begin with the simplest formulation. Suppose that an input image I, in a vector representation of p pixels, is explained by q causal parameters. Let r be a q-dimensional vector that represents the set of causes. Recognition of a visual scene is equivalent to the maximization of the posterior probability of cause r for a given input image I. In the Bayesian framework, the posterior probability distribution corresponds to the likelihood and the prior distribution of causal parameters:

\[ P(r \mid I; G) \propto P(I \mid r; G)\,P(r) \propto e^{-E}, \tag{1} \]

where G is the generative model of the visual input. Here we assume that the posterior is represented by an exponential family probability distribution. A simple form of a linear generative model (Rao & Ballard, 1997) can be denoted by,

\[ I^G = G(r; U) = Ur, \tag{2} \]

where I^G is the reconstructed image, r is the causal parameter that represents the visual scene with the combination of basis images, and U is a set of q basis images (a p × q matrix). In the above case of linear forward optics, the generated image is simply a linear superposition of basis images, i.e., when objects overlap in space, they become transparent.

The objective function to minimize for optimal inference is given by the following equation,

\[ E = |I - I^G|^2 + R(r). \tag{3} \]

The first term on the right-hand side is the sum of squared reconstruction errors, which is equivalent to the log-likelihood of cause r under the assumption that the input image contains Gaussian noise. The second term R(r) is the logarithm of the prior distribution, which belongs to an exponential family, and represents the constraint on cause r (Olshausen & Field, 1996; Rao & Ballard, 1999). Examples of constraints include the sparse code constraint (Olshausen & Field, 1996) or the minimum description length constraint (Rao & Ballard, 1999).

During the estimation of causes, i.e., the recognition phase, the elements of r are updated by performing a gradient descent on E:

\[ \frac{dr_l}{dt} = -\frac{k_r}{2}\frac{\partial E}{\partial r_l} = k_r \sum_i (I_i - I_i^G)\, U_{il}, \tag{4} \]

where t denotes time; I_i and I_i^G are the ith elements (pixel luminance) of I and I^G, r_l is the lth element of r, and k_r is the updating rate. The right-hand side of the equation is the product of the prediction error (I_i − I_i^G) propagating from the lower to the higher level and the acquired basis images.

In the learning phase, the basis image matrix U is optimized by a similar rule with an updating rate k_U:

\[ \frac{dU_{il}}{dt} = -\frac{k_U}{2}\frac{\partial E}{\partial U_{il}} = k_U (I_i - I_i^G)\, r_l. \tag{5} \]

The above equation corresponds to the learning of an internal model by minimizing the prediction error and optimizing the basis images. The right-hand side follows the form of Hebbian learning, and if one assumes that the learning takes place at the synaptic sites of top-down projections, (I_i − I_i^G) and r_l correspond to postsynaptic and presynaptic activity, respectively (Rao & Ballard, 1997, 1999).
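The recognition and learning loops of Eqs. (2), (4) and (5) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the dimensions, updating rates, iteration count and random toy data are all invented for the example, and the prior term R(r) is omitted for brevity.

```python
# Sketch of the linear generative model: recognition (Eq. 4) and
# learning (Eq. 5) both driven by the prediction error I - I^G.
import numpy as np

rng = np.random.default_rng(0)
p, q = 64, 8                      # pixels, causes (assumed sizes)
U = rng.normal(0, 0.1, (p, q))    # basis images, one column per cause
I = rng.random(p)                 # a toy input image
r = np.zeros(q)                   # causal parameters
k_r, k_U = 0.05, 0.01             # illustrative updating rates

for _ in range(200):
    I_G = U @ r                   # linear forward optics, Eq. (2)
    err = I - I_G                 # prediction error (I_i - I_i^G)
    r += k_r * (U.T @ err)        # recognition update, Eq. (4)
    U += k_U * np.outer(err, r)   # Hebbian-like basis update, Eq. (5)

print("residual error:", np.linalg.norm(I - U @ r))
```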

2.2. Generative model for nonlinear forward optics

2.2.1. Model description and implementation; supervised learning of a multi-layer neural network

Next we extend the generative model to a general form so that it can be applied to nonlinear forward optics. Function G denotes the generative model and the reconstruction of the input image, and takes the form,

\[ I^G = G(r, x^1, x^2, \ldots; U, W^1, W^2, \ldots), \tag{6} \]

where x^1, x^2, … are additional environmental parameters (spatial location of objects, lighting, etc.), {r, x^1, x^2, …} is the set of causes, {W^1, W^2, …} are the model parameters controlling the environmental parameters x^1, x^2, …, respectively, and {U, W^1, W^2, …} is the set of hyperparameters that determine the generative model. For the extended generative model, the posterior distribution to be maximized is,

\[ P(r, x^1, x^2, \ldots \mid I; G) \propto P(I \mid r, x^1, x^2, \ldots; U, W^1, W^2, \ldots)\, P(r, x^1, x^2, \ldots \mid U, W^1, W^2, \ldots) \propto e^{-E}. \tag{7} \]

Similar to Eq. (3), the objective function E is given as the sum of squared error and the constraints on the environmental parameters x^1, x^2, …:

\[ E = |I - I^G|^2 + R(r, x^1, x^2, \ldots). \tag{8} \]


Fig. 1. Illustration of the proposed learning scheme. (a) The network architecture. The rectangles and the arrows represent the neuronal layers and the connections between them, respectively. The top and bottom layers interact through both linear and nonlinear systems. (b) Flow of the two-staged development of the network. The solid and dashed arrows respectively indicate fixed and learned interlayer connections for each stage. In the primary learning stage (left), the basic object examples are acquired through the training of the linear connections. In the secondary learning stage (middle), the objects are inferred by the acquired linear circuit; simultaneously, the nonlinear system is trained in order to accurately predict the visual input from the higher representation. The bright dashed arrows represent the propagated errors of nonlinear prediction. After the nonlinear system is established (right), it is used for online inference. The top-down predictions are given by the newly acquired nonlinear system; the prediction errors are propagated by the connections of the nonlinear system or, in some cases, by the remaining bottom-up connections of the linear system (see Section 3 for a concrete example).

The constraint term R also becomes a function of the additional environmental parameters. Following the formulation of the linear model, the gradient descent of E in the recognition phase can be formulated as:

\[ \frac{dr_l}{dt} = -\frac{k_r}{2}\frac{\partial E}{\partial r_l} = k_r \left[ \sum_i (I_i - I_i^G)\,\frac{\partial G_i}{\partial r_l} - \frac{\partial R}{\partial r_l} \right], \]
\[ \frac{dx_m^1}{dt} = -\frac{k_{x^1}}{2}\frac{\partial E}{\partial x_m^1} = k_{x^1} \left[ \sum_i (I_i - I_i^G)\,\frac{\partial G_i}{\partial x_m^1} - \frac{\partial R}{\partial x_m^1} \right], \]
\[ \frac{dx_n^2}{dt} = -\frac{k_{x^2}}{2}\frac{\partial E}{\partial x_n^2} = k_{x^2} \left[ \sum_i (I_i - I_i^G)\,\frac{\partial G_i}{\partial x_n^2} - \frac{\partial R}{\partial x_n^2} \right], \qquad \vdots \tag{9} \]

The learning phase is formulated as:

\[ \frac{dU_\alpha}{dt} = -\frac{k_U}{2}\frac{\partial E}{\partial U_\alpha} = k_U \sum_i (I_i - I_i^G)\,\frac{\partial G_i}{\partial U_\alpha}, \]
\[ \frac{dW_\beta^1}{dt} = -\frac{k_{W^1}}{2}\frac{\partial E}{\partial W_\beta^1} = k_{W^1} \sum_i (I_i - I_i^G)\,\frac{\partial G_i}{\partial W_\beta^1}, \]
\[ \frac{dW_\gamma^2}{dt} = -\frac{k_{W^2}}{2}\frac{\partial E}{\partial W_\gamma^2} = k_{W^2} \sum_i (I_i - I_i^G)\,\frac{\partial G_i}{\partial W_\gamma^2}, \qquad \vdots \tag{10} \]

The suffixes l, m, n, … and α, β, γ, … indicate the elements of r, x^1, x^2, … and U, W^1, W^2, …, respectively.

Given that the properties of G are unknown, nonparametric fitting of the function is required. In a general sense, the function G can be approximated by a multi-layered neural network, where model parameters are replaced with synaptic weights:

\[ I_i^G = G_i(y^{\{1\}}; W) = f\!\left( \sum_j W_{ij}^{\{3\}\{2\}}\, y_j^{\{2\}} \right) = f\!\left( \sum_j W_{ij}^{\{3\}\{2\}}\, f\!\left( \sum_k W_{jk}^{\{2\}\{1\}}\, y_k^{\{1\}} \right) \right), \]
\[ y^{\{1\}} = (r^T, x^{1T}, x^{2T}, \ldots)^T, \qquad W = \{ W^{\{2\}\{1\}}, W^{\{3\}\{2\}} \}. \tag{11} \]

The suffixes {1}, {2} and {3} denote indices of the input, hidden and output layers, respectively. For example, W_{ij}^{\{3\}\{2\}} is the connectivity weight from the jth unit in hidden layer y_j^{\{2\}} to the ith unit in output layer y_i^{\{3\}}. The vector y^{\{1\}} denotes the set of input variables.

Consequently, the partial differentiation terms of G in Eq. (9) are equivalent to those that appear in a backpropagation algorithm for a three-layer network:

\[ \frac{dW_{ij}^{\{3\}\{2\}}}{dt} = k_W\, (I_i - I_i^G)\, y_j^{\{2\}}, \]
\[ \frac{dW_{jk}^{\{2\}\{1\}}}{dt} = k_W \sum_i (I_i - I_i^G)\, W_{ij}^{\{3\}\{2\}}\, f'(y_i^{\{3\}})\, y_k^{\{1\}}. \tag{12} \]

In these equations, k_W is the learning rate of the synaptic weights, f denotes the activation function of neurons and f′ is the derivative of f. The input to this network is the high-level representation {r, x^1, x^2, …} and the desired output is the visual input I. As mentioned in the introduction, the pre-trained linear generative model provides the high-level representation input to the nonlinear neural network during the secondary learning phase.
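The following sketch illustrates this ‘‘downside-up’’ supervised stage of Eqs. (11)–(12): the high-level representation y^{1} serves as the network input and the sensory image I as the teaching signal. It is an assumption-laden toy, not the paper's code: layer sizes, the logistic activation, the learning rate and the random data are invented, and the updates are written in the standard backpropagation form with explicit derivative factors (Eq. (12) as printed folds the derivative terms slightly differently).

```python
# Sketch of the "downside-up" supervised stage: input = high-level
# causes from the linear model, desired output = the sensory image I.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 10, 20, 16            # assumed layer sizes
W21 = rng.normal(0, 0.5, (n_hid, n_in))    # W{2}{1}
W32 = rng.normal(0, 0.5, (n_out, n_hid))   # W{3}{2}
f = lambda x: 1.0 / (1.0 + np.exp(-x))     # logistic activation (assumed)
k_W = 0.1                                  # illustrative learning rate

y1 = rng.random(n_in)      # high-level causes (r, x1, x2, ...)
I = rng.random(n_out)      # sensory input acting as the teaching signal

for _ in range(500):
    y2 = f(W21 @ y1)                       # hidden layer
    I_G = f(W32 @ y2)                      # predicted image, Eq. (11)
    err = I - I_G                          # prediction error
    delta3 = err * I_G * (1 - I_G)         # output-layer error term
    W32 += k_W * np.outer(delta3, y2)      # cf. first line of Eq. (12)
    delta2 = (W32.T @ delta3) * y2 * (1 - y2)
    W21 += k_W * np.outer(delta2, y1)      # cf. second line of Eq. (12)
```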

Given a visual scene with occlusion, the linear generative model cannot infer the causes completely, since it assumes that objects of different visual depth superimpose on each other. However, up to a certain level of complexity in the occluded visual scene, the linear generative model is capable of providing a crude estimate of the existing objects (as shown by Rao & Ballard, 1997), and this proves to be a sufficient input to train the nonlinear forward-optics model (see Section 3.2 for further details). On the other hand, to disambiguate the depth order of objects during the secondary learning phase, we assume that other modality cues such as visual disparity or tactile information are available.

2.2.2. Inference

Since we have the explicit form of the generative model (see Eq. (11)), it is possible to write out the differentiation of G with respect to the causal parameters r, x^1, x^2, … by applying the chain rule. However, causal parameters are often discrete values which cannot be differentiated. In the case of occlusion, parameters such as x^1 would denote the alignment depth order of objects with discrete values such as 0, 1, 2, …. To infer discrete environmental variables, methods other than direct differentiation are needed. One approach is to prepare a lookup table covering all possibilities, and then select the optimal combination minimizing the objective function E in Eq. (7).
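As an illustration of the lookup-table approach, the sketch below enumerates candidate depth orders, synthesizes each candidate scene, and keeps the one minimizing the squared-error part of E. The synthesize() routine is a hypothetical stand-in for the learned forward model G, painting objects back to front; all names and the toy shapes are assumptions.

```python
# Sketch of discrete inference by exhaustive enumeration: try every
# depth order, reconstruct, and keep the minimum-error combination.
import itertools
import numpy as np

def synthesize(images, order):
    """Paint objects back-to-front; a stand-in for the learned G."""
    canvas = np.zeros_like(images[0])
    for idx in order:                    # order[0] = deepest object
        mask = images[idx] > 0
        canvas[mask] = images[idx][mask]
    return canvas

def infer_depth_order(I, images):
    best, best_E = None, np.inf
    for order in itertools.permutations(range(len(images))):
        E = np.sum((I - synthesize(images, order)) ** 2)
        if E < best_E:
            best, best_E = order, E
    return best

A = np.zeros((4, 4)); A[:2, :] = 0.5     # toy "object" images
B = np.zeros((4, 4)); B[:, :2] = 1.0
print(infer_depth_order(synthesize([A, B], (0, 1)), [A, B]))  # -> (0, 1)
```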


Another conventional approach is to use the variational method, in which we approximate a tractable probability distribution of the variables to be inferred (Dayan et al., 1995; Hinton et al., 1995; Jaakkola, Saul, & Jordan, 1996; Saul, Jaakkola, & Jordan, 1996). For example, we may approximate a distribution Q instead of the exact posterior distribution of causal parameters with the following equation:

\[ P(r, x^1, x^2, \ldots \mid I; G) \approx Q(r, x^1, x^2, \ldots \mid \varphi, \mu, \lambda, \ldots), \tag{13} \]

where ϕ, µ, λ, … are the complementary parameters, or inner state parameters, for each causal variable. To infer the causal parameters of the visual input, the complementary parameters are optimized so that the Kullback–Leibler divergence is minimized between the approximated distribution Q and the exact posterior probability distribution P:

\[ D(\varphi, \mu, \lambda, \ldots) = \sum_r \sum_{x^1, x^2, \ldots} Q(r, x^1, x^2, \ldots \mid \varphi, \mu, \lambda, \ldots)\, \log \frac{Q(r, x^1, x^2, \ldots \mid \varphi, \mu, \lambda, \ldots)}{P(r, x^1, x^2, \ldots \mid I; U, W^1, W^2, \ldots)} \]
\[ = \sum_r \sum_{x^1, x^2, \ldots} Q(r, x^1, x^2, \ldots \mid \varphi, \mu, \lambda, \ldots) \left[ \log Q(r, x^1, x^2, \ldots \mid \varphi, \mu, \lambda, \ldots) + |I - I^G|^2 + R(r, x^1, x^2, \ldots) \right] + \text{const}. \tag{14} \]

Similar to Eq. (8), the optimization can be performed on D via a gradient descent method. We may determine the formulation of Q given the specific problem.

The mathematical description so far can be applied to any nonlinear forward-optics model, although there remain domain-specific issues such as: (a) the constraint R on additional variables; (b) the format of input and output for the nonlinear multi-layer network; and (c) the optimization of variables in the recognition phase. These issues will be addressed in the following sections with regard to visual occlusion.

3. Model for visual occlusion

In this section and the subsequent sections, we explain the proposed model for visual occlusion. Fig. 2 illustrates the model together with the training scheme of the multi-layer neural network for nonlinear forward optics. In this section, we first formulate the nonlinear forward-optics function, and then describe the entire system.

3.1. Forward-optics model for occlusion

We consider a generative model capable of reconstructing input images with occlusion:

\[ I^G = G(r, z; U, W), \tag{15} \]

where z is the depth parameter and W is the set of connectivity weights of the multi-layer neural network. In this section and the subsequent sections of this paper, we refer to G as the occlusion synthesizer. For simplicity, we consider the case in which two objects are presented. The elements of r and z take the following binary values¹:

\[ r_l = \begin{cases} 1 & (\text{for } l\text{th basis: seen}) \\ 0 & (\text{for } l\text{th basis: not seen}), \end{cases} \tag{16} \]
\[ z_l = \begin{cases} 1 & (\text{for } l\text{th basis: foreground}) \\ 0 & (\text{for } l\text{th basis: background}). \end{cases} \tag{17} \]

¹ Here, we should note that there is a constraint on the number of foreground objects such that it cannot be greater than one.

Particularly in the case of occlusion, the occlusion synthesizer can be described as

\[ G(r, z; U, W) = \sum_l \left\{ U_l r_l - M_l(U_1 r_1, \ldots, U_q r_q, z; W) \right\}, \tag{18} \]

where U_l is the basis image of the lth object and M_l is the mask for basis l. Consequently, the occlusion synthesizer becomes a function of U_l, r_l and z:

\[ G = G(U_1 r_1, \ldots, U_n r_n, z; W). \tag{19} \]

In the case of two objects, Eq. (19) expands to:

\[ G = G(U_a r_a, U_b r_b, z_a, z_b; W), \tag{20} \]

where a and b are the pair of basis image indices used to describe the visual scene at hand.
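The input–output behavior that the occlusion synthesizer must learn can be stated compactly in code. The sketch below hand-codes the target mapping of Eq. (20) for two objects, with the mask term of Eq. (18) realized by overwriting the overlap region with the foreground surface. The array shapes and the nonzero-pixel overlap rule are illustrative assumptions; in the model this mapping is learned by the multi-layer network rather than hand-coded.

```python
# Sketch of the target mapping of Eq. (20): compose two rendered
# object images given their binary depth flags (z_a, z_b).
import numpy as np

def occlusion_target(Ua_ra, Ub_rb, za, zb):
    fg, bg = (Ua_ra, Ub_rb) if za == 1 else (Ub_rb, Ua_ra)
    out = fg + bg                   # linear superposition as in Eq. (2)
    overlap = (fg > 0) & (bg > 0)   # region where the two objects overlap
    out[overlap] = fg[overlap]      # mask term M deletes the hidden surface
    return out
```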

Given appropriate assumptions, the occlusion synthesizer can be implemented by training a multi-layered neural network. The assumptions we made in the current model are: (i) simple basis images of objects and their contour representations are acquired in a non-occluded state during the primary learning stage; (ii) simple objects are recognizable during occlusion from their visible parts²; (iii) information on the object depth order can be detected from cues other than occlusion (e.g. disparity, tactile, motion discontinuity). Given the above assumptions, U can be acquired by a linear generative model in the primary learning stage, whereas W is tuned during the secondary learning stage, while U, r and z are fixed. In the secondary stage the visual input acts as the teaching signal for the occlusion synthesizer, as depicted in Fig. 2(b).

Next we discuss the format of the input to the nonlinear multi-layer network. A straightforward method is to directly use the high-level symbolic representation of the linear generative model, r and z, as input. We refer to this process as the symbolic training design. However, in this case the acquired occlusion synthesizer becomes basis-image specific and cannot be generalized to newly acquired basis images. One simple solution to this problem is to render the symbolic representation using U so that the input becomes {U_a r_a, U_b r_b, z_a, z_b} (a, b = 1, …, n). This corresponds to a pixel-wise representation of the object image predicted by the linear generative model. We refer to this as the pixel-wise training design (see Fig. 2(b)). In this case, the synthesizer acquires occlusion processing for each pixel, meaning that the final occlusion synthesizer does not depend on the basis images used for learning. Thus, additional learning or re-learning of the synthesizer is not required for newly acquired basis images. The pixel-wise training design is more general and efficient in comparison to the symbolic training design, and hence we adopt it for our numerical simulation. A biologically plausible interpretation of the design would be to assume that the occlusion synthesizer as a whole sits in the lower visual system, where it receives the output of the linear generative model.
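A sketch of how one training pair might be assembled under the pixel-wise training design: the two rendered basis images and the two depth units form the network input, and the occluded image provides the desired output. The function name and the simple flattening scheme are assumptions for illustration; Appendix A describes a richer per-pixel multi-unit luminance code.

```python
# Sketch of assembling one (input, target) pair for the pixel-wise
# training design of the occlusion synthesizer.
import numpy as np

def make_training_pair(Ua, ra, Ub, rb, za, zb, occluded_image):
    x = np.concatenate([(Ua * ra).ravel(),   # rendered prediction of object a
                        (Ub * rb).ravel(),   # rendered prediction of object b
                        [za, zb]])           # depth-order units
    y = occluded_image.ravel()               # desired output (teaching signal)
    return x, y
```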

3.2. Inverse-optics model and inference

Now we turn to the recognition system for inferring occluded vision. Recognition of objects (r) and their depth order (z) are dependent on each other, because the shapes of objects are not determined unless the occluded parts are known, and vice versa. This means that we need to estimate the parameters r and z based on their joint probability distribution P(r, z|I). However,

² We assume that object recognition is completely managed by the linear system during the secondary learning stage, but this does not necessarily mean that the nonlinear forward-optics model is not needed for object recognition. As we will see in the subsequent section, the linear model is not sufficient for recognizing complex objects under occlusion.


Fig. 2. Inference and learning system for visual recognition with occlusion. (a) The algorithm for inference. r_a and r_b are the causal parameters representing which object is seen, and z_a and z_b are their depth order; µ_a, µ_b, ϕ_a and ϕ_b are the inner state parameters for those four variables. The occlusion solver consists of a bottom-up process (the left-hand part of the circuit) and a top-down process (the right-hand part of the circuit) containing a synthesis process (G) of occluded images (I^G). The error signals (I_{C,err}) are fed back to the variables in the higher area, filtered by the contour representations of the basis images (U_C). The figure above shows the case in which object a is estimated to be in front of object b, conflicting with the true visual input. (b) The learning model of the occlusion synthesizer G with a multi-layered neural network.

direct calculation of such a joint probability distribution is often intractable in complex situations. A previous study showed that an approximated inverse-optics model combined with an accurate forward-optics model effectively works for recognition (Kawato et al., 1993). Here, we use two separately approximated probability distributions for r and z, and optimize each distribution independently. We define the distributions of r and z as follows:

\[ Q_r(r_l = 1, r_m = 1, r_{\text{otherwise}} = 0 \mid \varphi) = \frac{\varphi_l\, \varphi_m}{\sum_{l' < m'} \varphi_{l'}\, \varphi_{m'}}, \qquad \varphi = \{\varphi_1, \ldots, \varphi_q\}, \tag{21} \]
\[ Q_z(z_l = 1, z_m = 0 \mid \mu) = \frac{1}{1 + \exp(-(\mu_l - \mu_m)/2)}, \qquad \mu = \{\mu_l, \mu_m\}. \tag{22} \]

These equations provide the probability distribution for the case where two arbitrary objects (corresponding to the lth object basis in the foreground and the mth in the background) are presented. Here, we introduce real-valued complementary parameters, or inner state parameters, ϕ and µ for r and z, respectively, to parameterize the distributions. The rationale behind the functions introduced in Eqs. (21) and (22) is as follows. In Eq. (21), an object basis with a larger ϕ value has a higher probability of being inferred as a recognized object (i.e., its r value takes 1). Since we consider the case in which two objects are presented, we compare the pair-wise products of ϕ values. The pair having a larger product of ϕ values is more likely to be regarded as the recognized object pair. Eq. (22) determines the probability of the depth order for the two objects in a pair. The object basis with the higher µ value is more often inferred as the foreground object. The constraint on variables in Eq. (7) applies here for defining the probability distributions of Eqs. (21) and (22), which are used as variational approximations of the exact posterior distribution of causal parameters.
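Eqs. (21) and (22) translate directly into code. The sketch below computes Q_r for a chosen object pair and Q_z for a chosen depth order; the toy values of ϕ and µ are illustrative assumptions.

```python
# Sketch of the approximate distributions Q_r (Eq. 21) and Q_z (Eq. 22).
import itertools
import numpy as np

def Q_r(phi, l, m):
    """Probability that the pair (l, m) is the recognized object pair."""
    pairs = itertools.combinations(range(len(phi)), 2)
    Z = sum(phi[a] * phi[b] for a, b in pairs)   # sum over l' < m'
    return phi[l] * phi[m] / Z

def Q_z(mu, l, m):
    """Probability that object l is in front of object m."""
    return 1.0 / (1.0 + np.exp(-(mu[l] - mu[m]) / 2.0))

phi = np.array([0.2, 1.5, 0.3, 1.1])   # inner states for object identity
mu = np.array([0.0, 0.8, 0.0, -0.4])   # inner states for depth order
print(Q_r(phi, 1, 3), Q_z(mu, 1, 3))
```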

In the recognition phase, we optimize the inner state parameters so as to minimize the Kullback–Leibler divergence between the approximated distribution Q and the exact posterior probability distribution:

\[ D(\varphi, \mu) = \sum_r \sum_z Q_r Q_z \log \frac{Q_r Q_z}{P(r, z \mid I)}. \tag{23} \]

The simplest means to optimize the inner states is to perform a gradient descent on D(ϕ, µ). However, the precise formulation of the derivatives of D(ϕ, µ) consists of complicated nonlinear terms, which cannot be easily implemented in a neural network. Instead of directly calculating the steepest descent of D, here we make further approximations by using contour representations of object images (Fig. 2(a)). First, we transform both the original visual input and its reconstruction into contour representations, I_C and I^G_C. Next, we feed back the dot product of the error (I_C − I^G_C) and the contour basis images (U_C) to the higher level:

\[ \frac{d\varphi_l}{dt} = k_\varphi \sum_i \left( I_{Ci} - I_{Ci}^G \right) U_{C\,il}, \tag{24} \]
\[ \frac{d\mu_l}{dt} = k_\mu \sum_i \left( I_{Ci} - I_{Ci}^G \right) U_{C\,il}. \tag{25} \]

Here k_ϕ and k_µ are update coefficients, and the suffix C denotes the contour representation. The right-hand sides of Eqs. (24) and (25) approximate the exact gradients in terms of their sign, i.e., given a reconstructed input image with an incorrect depth order, the error between outlines will contain both positive-valued and negative-valued parts (see I_{C,err} in Fig. 2(a)), thus facilitating one object and suppressing the other so that the inner state µ is modified toward the correct value. This modification works similarly when the selection of an object is incorrect. This approximation of the inverse-optics model is represented as G# in Fig. 2(a). In summary, the occlusion solver synthesizes reconstruction images with surface information and analyzes them with the contour information.
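The contour-based feedback of Eqs. (24) and (25) might be sketched as follows. The contour() transform is a hypothetical stand-in (a simple gradient magnitude); the paper does not specify the contour operator. The update coefficients follow the values quoted in Section 4.2, and the 28 × 28 image size follows Appendix A.

```python
# Sketch of Eqs. (24)-(25): project the contour prediction error onto
# the contour bases U_C to update the inner states phi and mu.
import numpy as np

def contour(img_flat, shape=(28, 28)):
    """Crude contour transform: gradient magnitude of the image."""
    img = img_flat.reshape(shape)
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy).ravel()

def update_inner_states(I, I_G, U_C, phi, mu, k_phi=0.1, k_mu=0.005):
    err_C = contour(I) - contour(I_G)    # contour error (I_C - I_C^G)
    drive = U_C.T @ err_C                # filter the error by contour bases
    return phi + k_phi * drive, mu + k_mu * drive
```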


Fig. 3. Transition of training error and output samples during the secondary learning phase. (a) An example pair of training data with overlapping rectangle and triangle. (b) The evolution of the error averaged over the whole training data. Learning rate = 0.001. (c) Demonstration of a top-down network (occlusion synthesizer) that gradually acquires the properties to predict the visual image containing surface occlusion; corresponding to the visual image shown in panel (a), the depth order becomes apparent as the training epoch proceeds.

4. Results

4.1. Acquisition of occlusion synthesizer

Fig. 3 shows the simulation results of training a multi-layer neural network for visual occlusion. A set of pixel-based images with randomly sized overlapping rectangles and triangles was used for training. Fig. 3(a) shows an example pair of visual inputs, where a rectangle and a triangle overlap at different positions. Hence, the occlusion synthesizer described in the previous section should reach the two solutions, ‘‘a rectangle on a triangle’’ and ‘‘a triangle on a rectangle’’.

Here we assumed a pre-trained linear generative model which had triangles and rectangles of various sizes/positions as the acquired basis images. This redundancy of basis images is necessary since the current model does not implement a position/size invariant representation or an accompanying top-down mechanism to transform and reroute the information to deal with variability in input size and position. The pixel-wise training design described in Section 3.1 allows us to render the high-level symbolic representation of the linear generative model using U, supporting the use of pixel-based images as input to the occlusion synthesizer (see Appendix A for details on the network architecture and the training data set). Fig. 3(b) shows the evolution of the training error averaged over the whole training data and Fig. 3(c) depicts the evolving outputs of the occlusion synthesizer. The initial network weights were set such that the network produced the linear summation of the two basis images (Fig. 3(c), the leftmost pair; see Appendix A for the details of the initial network settings). Connectivity efficacies were iteratively updated with the conventional gradient descent method of error backpropagation learning. The update coefficient of the connectivity weights was fixed to k_W = 0.01. Numerical simulation results showed that visual occlusion is gradually acquired, i.e., the foreground object masked the overlapped regions of the background object as learning proceeded (Fig. 3(c)), resulting in a decrease in prediction error (Fig. 3(b)). The results demonstrate that the nonlinear forward optics of occlusion can indeed be embedded in a multi-layer neural network by the proposed framework.

4.2. Simultaneous inference of object and depth

Next, we investigated how the whole system (i.e., the combination of linear inverse optics and nonlinear forward optics) performed, given a more complex occluded scene. Here we focused on cases where the linear generative model was inadequate in estimating the true combination of objects. Moreover, we tested the generalization performance of the acquired nonlinear forward optics by presenting objects which were not used during the training phase (as described in the previous section). The novel objects were only embedded as basis images in the linear forward-optics system.

Fig. 4 shows the numerical simulation results. For the visual input, we used occluded scenes (Fig. 4(b)) which were combinations of the four basis images: cross, capital-A, square, and four-circles (Fig. 4(a)). Note that none of these, except for the square, were part of the training data set in Section 4.1. Despite the fact that the trained network remained somewhat incomplete (i.e., containing noise in its output as shown in Fig. 3(c), unlike hand-coded models), we see that correct inferences were achieved.

The inner state parameters ϕ and µ were iteratively updated for 1500 estimation steps; each estimation step consisted of an update of ϕ followed by 50 updates of µ. The updating rules generally followed Eqs. (24) and (25), although due to a restriction of machine power, we used prediction errors (I_C − I^G_C) averaged over ϕ and µ, instead of conducting full stochastic sampling. The updating rates for ϕ and µ were set at k_ϕ = 0.1 and k_µ = 0.005. Fig. 4(c) and (d) depict the transitions of the values of ϕ and µ, respectively. The final values of ϕ and µ, which were initially set to zero, indicate that the objects and their depth orders were correctly estimated in all 4 cases.
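The alternating update schedule described above can be made explicit in a short sketch; step_phi and step_mu are hypothetical callbacks standing in for single applications of Eqs. (24) and (25).

```python
# Sketch of the alternating schedule used in the simulation: each of
# the 1500 estimation steps is one phi update followed by 50 mu updates.
def run_inference(phi, mu, step_phi, step_mu, n_steps=1500, mu_per_phi=50):
    for _ in range(n_steps):
        phi = step_phi(phi, mu)       # one update of phi, cf. Eq. (24)
        for _ in range(mu_per_phi):
            mu = step_mu(phi, mu)     # fifty updates of mu, cf. Eq. (25)
    return phi, mu
```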

The results in Fig. 4 demonstrate several important characteristics of the proposed model. First, they show that the occlusion synthesizer can be generalized to untrained objects: it had no problems with the cross, capital-A, and four-circles, which were not part of the initial training data set. We also confirmed that the system performs well for other novel test objects of different shapes. This is due to the pixel-wise training design, where the network acquired visual occlusion at a very local level.

Second, the true object pair increased their ϕ values from the initial phase of inference in three out of four test samples (Data 1–3). This reflects the fact that in these cases, object recognition was possible without any knowledge of depth order. In the previous section, we assumed that correct object recognition was possible for easy cases of occlusion before the occlusion synthesizer was optimized, thus using the information for supervised training. The current results justify the above assumption.

Third, the results demonstrate that the recognition of occlusion was essential for accurate object identification in complex occluded scenes. Here we focus on the case of Data 4, in which


Fig. 4. Transition of parameters during inference of objects and depth order. (a) Individual object symbols. (b) Visual inputs. (c) Transition of ϕ; the larger the value of ϕ, the more strongly the symbol is activated. The change in the order of activation is observed around 500 steps in Data 4 (indicated by the arrow). (d) Transition of µ; the larger the value of µ, the more accurately the symbol is recognized. (Parameter values: k_ϕ = 0.4, k_µ = 0.4.)

the order of the ϕ values switched at approximately 500 steps of iteration (indicated by the downward arrow in Fig. 4(c), Data 4). In this particular case, a square partly occluded by four circles was initially identified as a combination of a cross and four circles. This happened because, in the absence of any occlusion information, the contour of a hidden square fit better with a cross than with a square. If the system did not have an inner model of the nonlinear forward optics of occlusion, as in the case of linear generative models (Rao & Ballard, 1997), the final result of inference would have remained cross and four-circles. However, as the effects of the occlusion synthesizer took hold, the interpretation of a square occluded by four circles minimized the prediction error and thus dominated the final estimation of the given visual scene. The current results prove that the nonlinear forward optics of occlusion enables the system to select bases that better explain the visual scene.

4.3. Characteristics of hidden-layer neurons

In the former subsections we have seen that a multi-layer neural network is capable of acquiring visual occlusion, and that the occlusion synthesizer plays a key role in recognition given a complex occluded visual scene. In this section we discuss the characteristics of the hidden-unit responses. Although the backpropagation algorithm per se is not considered to be a biologically plausible learning method, an examination of the acquired properties can provide insight into the structure of computation at the neural level (Zipser & Anderson, 1988).

The data presented in this section were generated by the same training of the network as in Section 4.1, except that here we used 5 × 5 pixels, five-step luminance, and 200 hidden units (eight hidden units per pixel). We first examine the individual trends of the hidden-layer units, followed by the trend of the population. Fig. 5 demonstrates the eight hidden neurons that account for a representative pixel. The input variables for each pixel were the depth information of the objects a and b [(z_a, z_b) = (1, 0) or (0, 1)] and their luminance (1–5 on the horizontal axis represents the luminance; 0 means no object is on the pixel). The plot labels (a) and (b) correspond to which object (a or b) is in the foreground: plots (a) show the cases of (z_a, z_b) = (1, 0) whereas plots (b) show the cases of (z_a, z_b) = (0, 1). In inference, each cell selected one surface from (a) and (b) according to the depth order of the two objects (see Appendix B for further explanation). All eight units in Fig. 5 had non-zero connectivity toward the output-layer units, implying that all of them had some role in resolving occlusion.

For instance, the cell labeled No. 133 in Fig. 5 drastically changed its activity when object a was in the foreground (panel a), but not when a was in the background (panel b). The cell acted according to the luminance of object a only when there was no occlusion, i.e., when the luminance of object b was zero. We interpret the activity of cell No. 133 as mainly conveying information about object a, but also as being modulated by depth order. Other units similarly showed dependency on the luminance of both a and b. The majority of the units in Fig. 5 varied their characteristics with the depth order, indicating that there exist ‘‘if-statement’’-like structures in the neural network. This is not surprising, given that conventional computer graphics algorithms utilize if-branches for pixel-wise processing.

To further examine the population characteristics of the hidden layer, we defined indices to quantify the properties of hidden-layer neurons. The indices used here represent the averaged dynamic ranges for varied surface luminance of basis a or b:

\[ X_j^a = \left\langle \max_{u_a} h_j(u_a, u_b) - \min_{u_a} h_j(u_a, u_b) \right\rangle_{u_b}, \]
\[ X_j^b = \left\langle \max_{u_b} h_j(u_a, u_b) - \min_{u_b} h_j(u_a, u_b) \right\rangle_{u_a}. \tag{26} \]

Here, u_a and u_b are the input luminance levels of objects a and b, respectively, while h_j(u_a, u_b) is the activity of the jth hidden neuron for each combination of input luminances. These indices reflect which luminance, of basis a or b, had more influence on neuronal activity. For instance, the larger X_j^a was and the smaller X_j^b was, the more influence object a had on this particular neuron. Depth order could also affect neuronal activity. For this reason, two markers were used in Fig. 6 to discriminate the two depth orders; each pair of + and dot connected with a line represents the characteristics of one unit.
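Given a table of hidden-unit activities over all luminance combinations, the indices of Eq. (26) reduce to two lines of array arithmetic. The activity tensor below is a random placeholder; its axes (unit, u_a, u_b) and its size are assumptions.

```python
# Sketch of computing the dynamic-range indices of Eq. (26) from a
# (hypothetical) table h[j, u_a, u_b] of hidden-unit activities.
import numpy as np

h = np.random.default_rng(2).random((200, 6, 6))  # units x u_a x u_b

# Range over u_a (axis 1), averaged over u_b -> influence of object a.
X_a = (h.max(axis=1) - h.min(axis=1)).mean(axis=1)
# Range over u_b (axis 2), averaged over u_a -> influence of object b.
X_b = (h.max(axis=2) - h.min(axis=2)).mean(axis=1)
print(X_a.shape, X_b.shape)   # one (X_a, X_b) point per hidden unit
```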

The distribution of nodes showed that the trained hidden neurons were separated into a number of categories (Fig. 6(a)). The nodes were largely clustered near (X_j^a, X_j^b) = (0.0, 0.0), (1.2, 0.0) and (0.0, 1.2). These clusters corresponded to the type of neuron modulated by neither object a nor b [(X_j^a, X_j^b) = (0.0, 0.0)] and the types modulated by one object but not both [(X_j^a, X_j^b) = (1.2, 0.0), (0.0, 1.2)]. In addition to these types, there were several small


Fig. 5. Variety of neural activities in the hidden layer. Each unit showed two different characteristics depending on the depth order of the objects. (a) Responses toward object luminance in the network input while object a is in the foreground and b is in the background. (b) Responses toward object luminance in the network input while object a is in the background and b is in the foreground. The thick solid line in each panel shows the neural activities when U_a or U_b is zero, which means that only one or no object is presented within the receptive visual field of the neuron.

clusters that showed a cross-effect from both objects (Fig. 6(a), the area far from the axes). This reflects the fact that there are several steady regions in the solution space for coding occlusion.

Furthermore, the distribution of links between the nodes represents trends at the neuronal level, whereas the node distribution itself represents trends at the activity level. Clusters were also observed at the link level. An interesting fact here is that the majority of cells drastically varied their response properties with the depth order. Fig. 6(b) depicts the development of the distribution during the training process. As the training proceeded, the cells not only began to form clusters, but also increasingly developed their depth-order-variant properties (thus increasing the length of the link).

5. Discussion

In this paper we propose a method to incorporate nonlinear forward optics in a generative model framework. Nonlinear forward optics were implemented in a multi-layered neural network with an error backpropagation learning method that utilized the higher representation of the pre-trained linear generative model as network input and lower representations of the visual input as the desired output. We analytically derived the learning scheme based on the steepest descent of an error function, which resulted in the ‘‘downside-up’’ placement and training of a multi-layered neural network. We conducted numerical simulations on establishing an occlusion solver and analyzed the neural properties of the hidden


Fig. 6. A scatter plot of the indices reflecting hidden-neuron dependencies on the luminance of input objects a and b. (a) A plot for the trained network (training epoch = 20,000). (b) Snapshots of the distribution during training, at 0, 10, 100 and 1000 epochs, respectively.

layer for comparison with physiological studies. Here we discuss how our proposal and results relate to the conventional machine learning framework of generative models, to postnatal visual development in humans and, finally, to the neural mechanisms of occluded object recognition.

In the framework of generative models, a number of studies have used multi-layered neural networks to model nonlinear top-down and bottom-up visual computations (e.g. Dayan et al., 1995; Hinton et al., 1995; Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006). The main purpose of these studies was to acquire efficient representations of images under unsupervised learning. However, nonlinear forward optics, such as occlusion, has not been extensively studied in this context. In this paper we focused on acquiring nonlinear forward optics under a supervised learning scheme that utilizes a pre-acquired linear generative model. In other words, it is a two-stage learning method. First the linear generative model is trained to acquire a minimal set of object bases under unsupervised learning, and second a multi-layer neural network is trained to attain nonlinear forward optics by using the higher level representation of the linear generative model. Here, the first stage needs only to acquire a minimal set of object bases, since the pixel-wise input rendered from the higher abstract representation leads to the generalization of nonlinear forward optics to untrained objects. Furthermore, learning new object bases may continue after establishing nonlinear forward optics, where the generalized nonlinear forward optics improves the efficiency of learning.

The advantage of the aforementioned staged learning, as well as its similarity to postnatal development, can be viewed from other contexts; for example, Gori (2009) describes a similarity between the pre-operational stage in child development and a methodology of machine learning that neglects constraint penalty terms. For complex nonlinear forward-optics simulation, the corresponding objective function is not convex, and there are difficulties in reaching the globally optimal state. By segmenting the learning into multiple stages, and at the same time limiting the primary unsupervised learning to linear forward optics, we converted the acquisition of object bases to a convex problem (e.g., the objective function defined by Eq. (3) is a quadratic function of U, and thus converges to a global minimum).

The successful acquisition of nonlinear forward optics in the second stage of learning leads to the prediction that the top-down projections in the nervous system may mature in the late stages of postnatal neuronal development. In line with this prediction, an anatomical study of human infants shows that the structure of top-down connections from area V2 to V1 is still immature at 4 months of age, whereas the structure of bottom-up connections matures at a relatively early age, suggesting possible relationships to some late-onset functional developments in infant vision (Burkhalter, 1993).

Particularly with regard to acquiring the ability to recognize occluded objects, the idea of staged learning agrees with the development of human infant visual recognition. A line of studies in developmental psychology has shown that infants of age less than 8 months cannot appropriately recognize partially occluded stationary objects, and do not understand that a surface can function as an occluder (Csibra, 2001; Slater et al., 1990; but see also Otsuka et al., 2006, for discussion on the precise identification of the critical age). On the other hand, if the objects are presented with movement, 4-month-old infants are capable of distinguishing the shapes of background objects (Craton, 1996). These findings can be implemented by the proposed two-staged learning. It is possible that newborn infants do not have a complete model of nonlinear forward optics, and that it is learned after minimal object bases are acquired. Although not dealt with in the current paper, the primary and the secondary stages may proceed in parallel, as in the likely case of neural development. First, bases may be acquired under simplistic conditions (e.g. without occlusion). Once a minimal number of bases are stored, training of the nonlinear forward-optics model may proceed while new bases are acquired under simplistic viewing conditions. Parallel learning may continue until the nonlinear forward-optics model is fully established. From then on, new object image bases can be obtained under complex viewing conditions.

There are various models that have dealt with visual occlusion (Fukushima, 2005; Grossberg & Mingolla, 1985; Heitger, von der Heydt, & Kubler, 1994; Sajda & Finkel, 1995). One strategy toward occluded vision is to complete the hidden regions of objects at lower levels (Grossberg & Mingolla, 1985; Heitger et al., 1994; Sajda & Finkel, 1995). In another set of models, the occluded regions are assumed to be known, and the background object is recognized by eliminating the effect of the occluder (e.g., Fukushima, 2005). Our approach is based on analysis-by-synthesis, where high-level abstract representations are transformed into low-level representations via top-down connections and compared with the actual visual input. Visual occlusion is dealt with by the acquisition of nonlinear forward optics, which mimics the actual process of occlusion to enhance the performance of synthesis.

Interestingly, we find similarities between the inference process of the occlusion synthesizer and the human observer. The simulation result of Data 4 in Fig. 4(c) predicts that, if the human cognitive system adopts a strategy similar to what is implemented in the current model, perceptual changes can occur when objects are presented in specific configurations. Recently, an interesting illusion termed the ‘‘bar-cross-ellipse illusion’’ has been reported by Tse and Caplovitz (2006), which has similarity to the above simulation results. The visual stimulus consisted of a black rotating ellipse occluded by four white squares, as in Fig. 4(a). Without any prior knowledge, the subjects perceived a ‘‘black cross’’ morphing in size and shape; it took many seconds for the observer to recognize that an occluded ellipse was actually rotating. This switch of percept can be interpreted as a delayed correction of the visual scene interpretation by a nonlinear forward-optics model, as we observed in our numerical simulation results.

Our model also makes a prediction about the neuronal substrates of visual occlusion. Physiological research on contextual modulation in early visual cortices has revealed interesting facts related to occlusion: neural responses corresponding to amodal completion of occluded contours are found in the primary visual area of monkeys (Bakin, Nakayama, & Gilbert, 2000; Lee & Nguyen, 2001; Sugita, 1999), and fMRI activation in the early visual cortex of humans (Ban et al., 2004). In relation to our model, these responses may be interpreted as activity originating from high-level object-based representation, which may act as input to the occlusion synthesizer. Furthermore, our model predicts that not only the contour but also the surface, which is amodally completed, is represented in early visual cortices. To the best of our knowledge, the neuronal correlates of amodal completion of occluded surfaces have yet to be found in electrophysiological studies. We expect that the following physiological experiment may clarify whether occluded surface representations actually exist in vivo. In the early visual cortex, cells with a preference for the physical and/or perceptual surface luminance of stimuli have been reported (Bartlett & Doty, 1974; Hung, Ramsden, Chen, & Roe, 2001; Hung, Ramsden, & Roe, 2007; Kayama, Riso, & Chu, 1980; Kinoshita & Komatsu, 2001; MacEvoy, Kim, & Paradiso, 1998; MacEvoy & Paradiso, 2001; Roe, Lu, & Hung, 2005; Rossi & Paradiso, 1999; Rossi, Rittenhouse, & Paradiso, 1996; Squatrito, Trotter, & Poggio, 1990). If cells exist whose activity is influenced by the luminance of an occluded surface (Fig. 7), their responses may be quite complex, as in the hidden-layer analysis of our model.

We next discuss the use of backpropagation as a learning method for acquiring nonlinear forward optics. In spite of the fact that it is not biologically plausible, we adopted it because it is the best available method to embed complex nonlinear input–output relationships in a multi-layered neural network. It is likely that the brain's complex circuitry has its own way to adjust the synaptic efficacies of a hierarchical multi-layer neuronal structure with numerous hidden layers. Therefore we discuss the properties of the hidden-layer neurons, as in the work of Zipser and Anderson (1988), under the assumption that the hidden-layer representations are comparable despite the difference in the learning strategies. In relation, another challenge in the field of supervised learning is the availability of vector teaching signals in the brain. Unlike reinforcement learning, output neurons need to be ‘‘taught’’ how exactly they should activate given a certain input. This fundamental challenge in supervised learning is solved in our model by assuming that a multi-layer neural network is


Fig. 7. An example of visual stimuli for testing the predicted neuronal property with single-unit recordings. The current model predicts the existence of neurons whose responses are modulated by the presence of the occluded object surface within their receptive fields, even though the visible surface brightness of the foreground object is not affected. The dark square behind the bright circle falls within the targeted neuronal receptive field in panel (a) but not in panel (b).

placed downside-up in the visual hierarchy, and the sensory driven low-level representation undertakes the role of the vector teaching signal.

Finally, by generalizing our current results, we discuss how other types of nonlinear forward optics may be implemented in the brain. Forward optics in the real world contains strong nonlinearities such as object occlusion, shading, shadow projection, three-dimensional rotations, and so on. For a generative model to function in the real world, top-down reconstruction processes need to simulate every aspect of these nonlinear effects in forward optics. The proposed scheme of staged learning can be generalized to acquire other types of forward optics through experience, leading to predictions similar to those in the case of visual occlusion. For example, implementation of viewpoint invariance and shading invariance would lead to low- and mid-level representations of non-rotated and non-shaded objects, together with complex hidden-layer representations as a consequence of the transformation from the above non-processed representation to the fully processed prediction. The present model may provide new viewpoints on postnatal development and neural representation in the visual cortex.

Acknowledgements

The second author’s work is supported by grants from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), No. 17022015. We thank the reviewers for many suggestions to improve the paper. We also thank Dr. T. Melano for discussion and comments on the earlier version of the manuscript.

Appendix A. Design and parameter settings for the training of the multi-layer network

We trained a three-layer neural network; the input, hidden and output layers respectively had 4706, 3136 and 2352 units. The activation function of each unit was defined by a sigmoid function,

f(x) = -\alpha + \frac{1 + 2\alpha}{1 + \exp\left[-\beta(x - 0.5)\right]},

where α = 0.1 and β = 2 ln((1 + α)/α), such that f(0) = 0 and f(1) = 1 are satisfied. The input layer had the pixel-based representations of two object basis images (Ua and Ub) as well as their depth information (za and zb; see Fig. 2(b)). The network is trained so that the output layer provides the pixel-based image of the two objects with visual occlusion. The connections between the layers were set pixel-by-pixel; units corresponding to each image pixel diverged to four units in the hidden layer. Those four units connected to three output units, which represented the three-step pixel luminosity.

Also, two depth-representing units and a bias unit in the input layer had connections to all of the hidden units (see Appendix B for details of the depth and luminance representation). Although simplified, these limitations on connections accurately reflect the topography of inter-cortical connections and the limited range of lateral connections in the lower visual cortex. The training data set consisted of 100 pairs of randomly sized overlapping rectangle or triangle images (28 × 28 pixels) whose surface luminance varied in three steps. We assumed that all possible scaled and translated versions of each object were already stored as image bases through the primary learning of the linear generative model.
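
As a numerical check of the activation function defined above, the following minimal Python sketch (an illustration, not the simulation code) implements f and verifies that the stated choice of β indeed yields f(0) = 0 and f(1) = 1.

```python
import numpy as np

# Activation function from Appendix A: a shifted, rescaled sigmoid with
# alpha = 0.1 and beta = 2 * ln((1 + alpha) / alpha).
ALPHA = 0.1
BETA = 2.0 * np.log((1.0 + ALPHA) / ALPHA)

def f(x):
    # f(x) = -alpha + (1 + 2*alpha) / (1 + exp(-beta * (x - 0.5)))
    return -ALPHA + (1.0 + 2.0 * ALPHA) / (1.0 + np.exp(-BETA * (x - 0.5)))

# Boundary conditions stated in the text.
assert np.isclose(f(0.0), 0.0)
assert np.isclose(f(1.0), 1.0)
```

The check goes through because exp(β/2) = (1 + α)/α, so the sigmoid's value at x = 0 is exactly α above its lower asymptote −α, and symmetrically at x = 1.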

Initially, the network connectivity weights were set to output the linear summation of the two basis images regardless of occlusion. This can be achieved as follows: for each pixel, the three pairs of activities of the luminance-representing units (which represent the three-step surface luminance of objects a and b; see Appendix B) in the input layer were respectively summed with the same weights by three hidden-layer units, and those hidden-unit activities were then respectively copied to the three units in the output layer. The remaining hidden unit received and emitted almost no input and output signals at the beginning of the learning. In the actual simulation, the connectivity weights were perturbed with a small amount of Gaussian noise to prevent the network from being trapped at a singularity during the subsequent learning stage. In the learning stage, the connections from the input layer to the hidden layer and from the hidden layer to the output layer were optimized to reduce the sum of squared prediction errors using the backpropagation algorithm.
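
The per-pixel initialization just described can be sketched as follows; this is an illustrative reconstruction under the stated connectivity (the noise scale is a hypothetical placeholder, not the value used in the simulation).

```python
import numpy as np

# Per pixel, the input is [a1, a2, a3, b1, b2, b3]: the three-step luminance
# codes of objects a and b (see Appendix B). Hidden unit k initially sums
# step k of both objects with equal weights, and output unit k copies hidden
# unit k, so the untrained network emits the linear sum of the basis images.
rng = np.random.default_rng(0)
NOISE = 1e-3                               # hypothetical perturbation scale

W_in_hid = np.zeros((4, 6))                # 4 hidden units per pixel
for k in range(3):
    W_in_hid[k, k] = 1.0                   # luminance step k of object a
    W_in_hid[k, 3 + k] = 1.0               # luminance step k of object b
# The fourth hidden unit (row 3) starts at zero weights: almost silent.

W_hid_out = np.zeros((3, 4))               # 3 output units per pixel
for k in range(3):
    W_hid_out[k, k] = 1.0                  # copy hidden unit k to output unit k

# Small Gaussian perturbation to escape the singular starting point before
# backpropagation training of both weight matrices.
W_in_hid += NOISE * rng.standard_normal(W_in_hid.shape)
W_hid_out += NOISE * rng.standard_normal(W_hid_out.shape)
```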

Appendix B. Representation of depth order and surface luminance

Each depth-representing unit was given a value of 0 or 1, indicating whether the corresponding object was in the background or the foreground (0 for background, 1 for foreground); the two depth-representing units were never given the same value. The luminance of each pixel in the input and output layers was represented by an array of three units with stepwise responses at different threshold luminance values (i.e., one unit was activated to represent the darkest surface, and all three units were activated for the brightest surface; when none of the three units was activated, there was no object surface within that pixel). The step function was used primarily for simplicity; it should be noted that luminance-selective neurons in reality show piecewise-linear responses (Bartlett & Doty, 1974; Kinoshita & Komatsu, 2001).
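
For concreteness, a minimal sketch of this coding scheme follows (an illustration under the assumptions above; the integer luminance levels and thresholds are hypothetical placeholders for the actual stimulus values).

```python
import numpy as np

# Three-step stepwise ("thermometer") luminance code: zero active units means
# no surface, one unit the darkest surface, three units the brightest.
THRESHOLDS = np.array([1, 2, 3])           # hypothetical three-step thresholds

def encode_luminance(level):
    """Map a level in {0, 1, 2, 3} (0 = no surface) to three stepwise units."""
    return (level >= THRESHOLDS).astype(float)

def encode_depth(object_a_in_front):
    """Depth units (z_a, z_b): 1 = foreground, 0 = background, never equal."""
    return (1.0, 0.0) if object_a_in_front else (0.0, 1.0)

assert encode_luminance(0).tolist() == [0.0, 0.0, 0.0]  # no object surface
assert encode_luminance(1).tolist() == [1.0, 0.0, 0.0]  # darkest surface
assert encode_luminance(3).tolist() == [1.0, 1.0, 1.0]  # brightest surface
```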

References

Atkinson, J. (2000). The developing visual brain. USA: Oxford University Press.

Bakin, J. S., Nakayama, K., & Gilbert, C. D. (2000). Visual responses of monkey areas V1 and V2 to three-dimensional surface configurations. The Journal of Neuroscience, 20(21), 8188–8198.

Ban, H., Nakagoshi, A., Yamamoto, H., Tanaka, C., Umeda, M., & Ejima, Y. (2004). The representation of occluded objects in low-level visual regions: an fMRI study. Technical report of IEICE, HIP, 104(100) (pp. 1–6).

Bartlett, J. R., & Doty, R. W. (1974). Response of units in striate cortex of squirrel monkeys to visual and electrical stimuli. Journal of Neurophysiology, 37(4), 621–641.

Burkhalter, A. (1993). Development of forward and feedback connections between areas V1 and V2 of human visual cortex. Cerebral Cortex, 3(5), 476–487.

Craton, L. G. (1996). The development of perceptual completion abilities: infants’ perception of stationary, partially occluded objects. Child Development, 67(3), 890–904.

Csibra, G. (2001). Illusory contour figures are perceived as occluding surfaces by 8-month-old infants. Developmental Science, 4(4), F7–F11.

Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.

Fukushima, K. (2005). Restoring partly occluded patterns: a neural network model. Neural Networks, 18(1), 33–43.

Gori, M. (2009). Semantic-based regularization and Piaget’s cognitive stages. Neural Networks, 22(7), 1035–1036.

Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: textures, boundaries, and emergent segmentations. Perception & Psychophysics, 38(2), 141–171.

Heitger, F., von der Heydt, R., & Kubler, O. (1994). A computational model of neural contour processing: figure-ground segregation and illusory contours. In IEEE proceedings, from perception to action conference, 1994, 18(7) (pp. 181–192).

Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214), 1158–1160.

Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

Hung, C. P., Ramsden, B. M., Chen, L. M., & Roe, A. W. (2001). Building surfaces from borders in areas 17 and 18 of the cat. Vision Research, 41(10–11), 1389–1407.

Hung, C. P., Ramsden, B. M., & Roe, A. W. (2007). A functional circuitry for edge-induced brightness perception. Nature Neuroscience, 10(9), 1185–1190.

Jaakkola, T., Saul, L. K., & Jordan, M. I. (1996). Fast learning by bounding likelihoods in sigmoid type belief networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems: Vol. 8. Cambridge, MA: MIT Press.

Kawato, M., Hayakawa, H., & Inui, T. (1993). A forward-inverse optics model of reciprocal connections between visual cortical areas. Network, 4(4), 415–422.

Kayama, Y., Riso, R. R., & Chu, F. C. (1980). Luxotonic response of units in macaque striate cortex. Journal of Neurophysiology, 42(6), 1495–1517.

Kinoshita, M., & Komatsu, H. (2001). Neural representation of the luminance and brightness of a uniform surface in the macaque primary visual cortex. Journal of Neurophysiology, 86(5), 2559–2570.

Lee, T. S., & Nguyen, M. (2001). Dynamics of subjective contour formation in the early visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 98(4), 1907–1911.

MacEvoy, S. P., Kim, W., & Paradiso, M. A. (1998). Integration of surface information in primary visual cortex. Nature Neuroscience, 1(7), 616–620.

MacEvoy, S. P., & Paradiso, M. A. (2001). Lightness constancy in primary visual cortex. Proceedings of the National Academy of Sciences of the United States of America, 98(15), 8827–8831.

Marr, D. (1982). Vision: a computational investigation into the human representation and processing of visual information. New York: Freeman.

Mumford, D. (1992). On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biological Cybernetics, 66(3), 241–251.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.

Otsuka, Y., Kanazawa, S., & Yamaguchi, M. K. (2006). Development of modal and amodal completion in infants. Perception, 35(9), 1251–1264.

Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(4), 721–763.

Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79–87.

Roe, A. W., Lu, H. D., & Hung, C. P. (2005). Cortical processing of a brightness illusion. Proceedings of the National Academy of Sciences of the United States of America, 102(10), 3869–3874.

Rossi, A. F., & Paradiso, M. A. (1999). Neural correlates of perceived brightness in the retina, lateral geniculate nucleus, and striate cortex. The Journal of Neuroscience, 19(14), 6145–6156.

Rossi, A. F., Rittenhouse, C. D., & Paradiso, M. A. (1996). The representation of brightness in primary visual cortex. Science, 273(5278), 1104–1107.

Sajda, P., & Finkel, L. H. (1995). Intermediate-level visual representations and the construction of surface perception. Journal of Cognitive Neuroscience, 7(2), 267–291.

Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.

Slater, A., Morison, V., Somers, M., Mattock, A., Brown, E., & Taylor, D. (1990). Newborn and older infants’ perception of partly occluded objects. Infant Behavior and Development, 19(1), 145–148.

Squatrito, S., Trotter, Y., & Poggio, G. F. (1990). Influences of uniform and textured backgrounds on the impulse activity of neurons in area V1 of the alert macaque. Brain Research, 536(1–2), 261–270.

Sugita, Y. (1999). Grouping of image fragments in primary visual cortex. Nature, 401(6735), 269–272.

Tse, P. U., & Caplovitz, G. P. (2006). The bar-cross-ellipse illusion: alternating percepts of rigid and nonrigid motion based on contour ownership and trackable feature assignment. Perception, 35(7), 993–997.

Zipser, D., & Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331(6158), 679–684.