
VISION-BASED LEARNING

Visual Learning with Navigation as an Example

Juyang Weng and Shaoyun Chen, Michigan State University


The success of autonomous navigation (or any sensor-based control system) depends largely on how much we can control the environment. In a controlled environment, we can define a few known landmarks (transponders, visual markers, and so on) before system design, and the navigation system can employ landmark detectors.

Such navigation systems typically employ a model-based design method. However, as we show, these methods have difficulties dealing with learning in complex, changing environments. Model-free design methods present a potentially powerful alternative.

To overcome the limitations of model-based methods, we have developed Shoslif (Self-organizing Hierarchical Optimal Subspace Learning and Inference Framework), a model-free, learning-based approach. Shoslif belongs to the class of appearance-based methods, which apply statistical tools directly to normalized image pixels.1,2

Shoslif introduces mechanisms such as automatic feature derivation, a self-organizing tree structure to reach a very low logarithmic time complexity, one-instance learning, and incremental learning without forgetting prior memorized information. In addition, we've created a state-based version of Shoslif that lets humans teach robots to use past history (state) and local views that are useful for disambiguation.

Shoslif-N, which we started developing in 1994, is a prototype autonomous navigation system using Shoslif. We have tested Shoslif-N primarily indoors, because our robot's Helpmate drivebase runs only on flat surfaces. Indoor navigation encounters fewer lighting changes than outdoor navigation. However, it offers other, considerable challenges for vision-based navigation. Such challenges include a large variation in scenes, the lack of stable prominent local features, the lack of stable global contrast regions, and other optical phenomena such as the ubiquitous specularity (mirror-like surface) of waxed floors and painted walls. Shoslif-N has shown that it can navigate in real time reliably in an unaltered indoor environment for an extended amount of time and distance, without any special image-processing hardware.

The problem with model-based methods

As an example of the application of a model-based method, consider road following. For a road vehicle to automatically follow a road or a marked lane, the system designer must first model the driving environment (world). For example, you can model a single-lane road as a stretch of uniform intensity of a certain shape with a high contrast between the road and nonroad areas. This modeling task becomes extremely challenging when the driving environment is less controlled. For instance, the environment might include multiple lanes, a driving surface marked with traffic signs, roads with tree shadows, or rainy or snowy conditions.

THE STATE-BASED LEARNING METHOD PRESENTED HERE IS APPLICABLE TO VIRTUALLY ANY VISION-BASED CONTROL PROBLEM. THE AUTHORS USE NAVIGATION AS AN EXAMPLE. THE SHOSLIF-N NAVIGATION SYSTEM AUTOMATICALLY DERIVES, DURING TRAINING, THE VISUAL FEATURES THAT ARE BEST SUITED FOR NAVIGATION. USING SYSTEM STATES ENABLES SHOSLIF-N TO DISREGARD UNRELATED SCENE PARTS AND ACHIEVE BETTER GENERALIZATION.


Other driving tasks require more than just a road-following capability, such as driving a forklift in a warehouse, a cart in an airport building, or a wheelchair during everyday indoor and outdoor use. Manually modeling every possible case and designing a reliable case detector for all possible cases become intractable.

In addition, the human designer must analyze all the visual scenes that the vehicle might encounter and then develop a model for each scene. Each model characterizes the corresponding type of scene content and defines a set of type-specific parameters to be determined through experiments. For example, two straight lines in the image plane model a straight single-lane road, and two quadratic curves model a smooth-turning single-lane road. The lines' and curves' parameters are the roads' parameters.

Figure 1 illustrates such a model-based approach. As the figure indicates, unless the scene with which we're dealing is very restricted, a single model is insufficient. So, this approach uses multiple models, which requires a model selector. The model selector determines which model is the most appropriate given the current scene. To avoid catastrophic accidents due to unpredicted visual scenes, this approach requires an applicability checker. The applicability checker determines whether the current scene fits one of the models with sufficient confidence.

The model-based approach has these advantages:

• The system design uses specific human knowledge about the environment, so the system can be efficient for predictable cases.

• Because a human designer explicitly writes the control algorithms, humans can interpret them more easily. (For example, the programmed behavior “When the reading from the front range detector is less than 0.5 meters, back up” is easier to interpret than an artificial neural network's weights.)

• Computational requirements are lower because, given a scene model, this approach processes only a relatively small part of an image in a specific way.

It has these disadvantages:

• The human-designed models are restrictive; they are not always applicable in a more general setting.

• The applicability checker, which is crucial for a safe application, is difficult to develop because it must be able to deal with virtually any scene.

• This approach requires a potentially huge number of models, unless the scene is very well known. An everyday real-world traffic situation might require tens of thousands of complex models. Designing and fully testing all these models takes considerable effort. Furthermore, the model selector and applicability checker become increasingly complex, unreliable, and slow as the number and complexity of the models increase.

Model-free methods

Model-free methods lack any predefined model of the driving environment during system design. In other words, a human programmer does not specify any environment model, either explicit or implicit; the system itself must automatically derive the model.


Related work

Several experimental road-following systems have used model-free design methods. Two such systems are Carnegie Mellon's Alvinn1 and the University of Maryland's Robin.2 For its learning architecture, Alvinn used a multilayer feed-forward network (MFFN), trained by a back-propagation learning algorithm. Robin used a radial basis function network (RBFN) with hand-selected centers for the radial basis functions. RBFN reportedly achieved smoother behavior than MFFN in a road-following experiment.2

Both systems used offline batch learning (see “Incremental online learning” in the main article). MFFN uses the training set to tune the network's size parameters, including the number of hidden nodes. RBFN uses the training set to tune the number of radial basis functions and the center of each radial basis function. Each additional environment situation requires retraining the entire network and retuning the manually selected parameters. Both MFFN and RBFN are stateless networks.

Unlike Shoslif, MFFN and RBFN are not appearance-based methods because they do not use statistical tools to derive feature subspaces. For example, they do not use the covariance of the input vector.

References

1. D.A. Pomerleau, “Efficient Training of Artificial Neural Networks for Autonomous Navigation,” Neural Computation, Vol. 3, No. 1, Jan. 1991, pp. 88–97.

2. M. Rosenblum and L.S. Davis, “An Improved Radial Basis Function Network for Visual Autonomous Road Following,” IEEE Trans. Neural Networks, Vol. 7, No. 5, June 1996, pp. 1111–1120.

[Figure 1 diagram: sensory input and control signal connected through a model selector, models 1 through n, model-specific rules, and model-specific control, with an applicability checker.]

Figure 1. The model-based approach to navigation control. Dashed lines indicate that the human must create the corresponding module’s content during system design and development.


In model-free methods, we consider the vision-based navigation controller as the function

Yt+1 = f(Xt). (1)

Xt is a vector consisting of all the sensory inputs at time t, with possibly other navigational instructions at time t (such as the desired heading direction at an intersection). Yt+1 is a vector consisting of the desired vehicle control signal at time t + 1. Without loss of generality, we have assumed discrete time instances: t = 0, 1, 2, ....

The challenge of model-free learning is to construct an approximate function f̂ so that f̂(X) is very close to the desired output f(X) for all the possible X inputs. To accomplish this goal, we use a set of training samples L = {(Xi, f(Xi)) | i = 0, 1, ..., n}, which come from either synthetically generated inputs or real sensory inputs. We use this set to train a system implemented by some learning architecture.
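To make Equation 1 and the training set concrete, here is a minimal Python sketch (our illustration, not part of Shoslif) of approximating f from stored (Xi, f(Xi)) pairs with a plain nearest-neighbor lookup. The linear scan is only a stand-in for the tree-based retrieval described later, and all names are hypothetical.

import numpy as np

def nearest_neighbor_controller(train_inputs, train_outputs):
    """Return f_hat: for a new input X, output the control vector of the
    closest stored training input (a plain O(n) linear scan)."""
    X = np.asarray(train_inputs, dtype=float)   # shape (n, d): one row per sample
    Y = np.asarray(train_outputs, dtype=float)  # shape (n, k): desired control vectors

    def f_hat(x_new):
        dists = np.linalg.norm(X - x_new, axis=1)  # distance to every stored sample
        return Y[np.argmin(dists)]                 # control vector of the nearest one

    return f_hat

# Toy usage: two fake flattened "images" mapped to heading commands.
f_hat = nearest_neighbor_controller(
    train_inputs=[[0.0, 0.1, 0.2], [0.9, 0.8, 0.7]],
    train_outputs=[[5.0], [-5.0]],  # for example, a heading increment in degrees
)
print(f_hat(np.array([0.05, 0.12, 0.18])))  # -> [5.]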

For a look at two other model-free methods, see the “Related work” sidebar.

Learning-based navigation control

Because a learning-based approach (see Figure 2) such as Shoslif uses a model-free design method, it relieves the human designer from hand-developing world models. (A method that uses some learning techniques is not necessarily free of world models. For example, a line-fitting method estimates the predefined line parameters based on a set of observed image points. However, a line model for an image is a specific aspect of a world model.)

In this approach, the learning system deals with raw image input directly, so its image representation must be very general. We consider a digital image X of r rows and c columns as a point in an rc-dimensional space S (see the sidebar, “Vector representation for an image”). For example, for images of size 30 × 40, S has a dimensionality of 30 × 40 = 1,200. The control signal vector Y, which might contain navigation signals such as heading direction and speed, is a point in the output space C. We perform a normalization so that the average intensity is zero and the variance of the image's pixels is one; the system uses the resulting image pixel array. However, the system makes no assumption that the world is 2D. By the same token, the human retina is 2D, but humans perceive a 3D world. What matters is how the 2D information is mapped to perception and actions. So, as we explained earlier, the learning task is to efficiently approximate the function f, which maps space S into space C, using an appropriate architecture.

Building trees. Shoslif uses a tree approximator; Figure 3 illustrates the concept. In that figure, the bounded space S is two-dimensional. To characterize the distribution of the training samples shown in Figure 3a, the tree approximator automatically generates the recursive partition tree (RPT) in Figure 3b.3 Statistical researchers call such a tree that outputs numerical values a regression tree.4 In the tree, each leaf node represents a single training sample (X, Y) = (X, f(X)) (or a set of nearby training samples in S that have the same control vector Y).


[Figure 2 diagrams: sensory input, learner, memory, and control signal blocks in (a) the learning mode and (b) the performance mode.]

Figure 2. The learning-based approach to navigation control: (a) the learning phase; (b) the performance phase. The dashed lines indicate that the human must create the corresponding module's content during learning.


Figure 3. The Shoslif tree approximator: (a) a 2D view of the partition. A label indicates a cell's center. Because of position overlap, the figure does not show the child's label if the parent covers the child. A number pair i,j marks each training sample X in space S; i is the level number and j the child number in (b) the corresponding resulting recursive partition tree.

Vector representation for an image

We can represent a digital image with r pixel rows and c pixel columns by a vector in (rc)-dimensional space S without any information loss. For example, we can write the set of image pixels {I(i, j) | 0 ≤ i < r, 0 ≤ j < c} as a vector X = (x1, x2, ..., xd)t, where xri+j+1 = I(i, j) and d = rc. The actual mapping from the 2D position of every pixel to a component of the d-dimensional vector X is not essential, but it is fixed once it is selected. Because the pixels of all practical images can only take values in a finite range, we can view S as bounded. If we consider X as a random vector in S, each element of the covariance matrix Σx of X represents a cross-pixel covariance. Using this vector representation, our appearance-based method considers the correlation between any two pixels in Σx, not just between neighboring pixels.
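As a small illustration of this representation (ours, not from the article; the function name and the row-major ordering are assumptions), the following Python fragment flattens an image into a vector and applies the zero-mean, unit-variance normalization described in the main text.

import numpy as np

def image_to_normalized_vector(image):
    """Flatten an r x c image into a d = r*c vector using one fixed ordering,
    then rescale it so the mean intensity is zero and the pixel variance is one."""
    x = np.asarray(image, dtype=float).reshape(-1)  # fixed 2D-to-1D mapping (row-major)
    x = x - x.mean()                                # removes absolute brightness
    std = x.std()
    if std > 0:                                     # guard against a constant image
        x = x / std                                 # removes global contrast
    return x

# A 30 x 40 image becomes a point in the 1,200-dimensional space S.
vec = image_to_normalized_vector(np.random.rand(30, 40))
print(vec.shape)  # (1200,)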


Given an unknown input image X′, Shoslif uses the RPT to find the nearest X among all the training samples. It then uses the corresponding control vector Y = f(X) stored in the corresponding leaf node as the next control vector. In practice, we preprocess an image represented by a vector X. First, we perform the normalization mentioned earlier by transforming the value of every pixel by a linear function. This reduces the effect of the lighting's absolute brightness and the global contrast.

Time complexity motivated us to use a tree for finding the match. Given an unknown input, finding the nearest neighbor from a set of n samples using linear search requires O(n) time, which is impractical for real-time navigation with even a moderately large n. The time complexity for retrieving the output from an RPT with n stored samples is O(log n). This scheme tends to build trees that are roughly balanced, as we see in the section “Automatic feature derivation.” Also, we do not need to store every training sample, as we see in the section “Incremental online learning.”

The learning phase. We discuss the batch learning method here; we look at incremental learning later. Batch learning requires that all the training samples are available before the training starts. In the learning phase, the approximator generates an RPT based on the training data set. The RPT serves as the memory in Figure 2. The RPT's root represents the entire input space S. The approximator analyzes the training set L, dividing the space S into b (b > 1) cells, each represented by a child of the root. We define the mean of the samples falling into each cell as that cell's center, as marked in Figure 3a. The analysis involves automatic derivation of linear features, each of which corresponds to the normal of a hyperplane in Figure 3a. (We discuss the feature derivation for each nonleaf node in “Automatic feature derivation.”) The analysis proceeds this way recursively, based on the samples falling into each cell, to subdivide the cell into smaller cells. Such a recursive partition proceeds for each node until the resulting node contains only one sample or several samples that all have virtually the same output vector Y.

The performance phase. In this phase, the learning-based approach grabs the current input image X′ and uses it to retrieve the control signal from the RPT. To do this, it examines the center of every child of the root. In the same way, it further explores the cell whose center is closest to X′, recursively, until it reaches a leaf node. Then, it uses the corresponding Y vector stored in the leaf node as the resulting control vector for X′.

The faster our approach can search the RPT this way, the faster it will update the control parameters. To speed up the processing, we use a binary RPT (b = 2), taking into account the computation in each node and the tree's depth. To reduce the chance of missing the best-matched leaf node, we explore k > 1 paths down the tree, instead of the single-path exploration we described earlier. At each level of the tree, our modified approach explores the top k cells and compares their children, kb of them in general, to find the top k nearest centers. Finally, it finds k leaf nodes. The output control vector is the weighted sum of all the corresponding Y vectors of these leaf nodes. A faraway leaf node uses a smaller weight than the near ones. So, faraway leaf nodes do not affect the output vector much.
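As a rough illustration of the construction and retrieval just described, here is a compact Python sketch (our own, with assumed names, stopping test, and inverse-distance weighting; it is not the Shoslif implementation). Each cell is split by a hyperplane whose normal is the first principal component of the cell's samples (the MEF splitter discussed in the next section; the MDF variant would substitute an LDA direction), and retrieval explores the k nearest cells per level and averages the leaf outputs.

import numpy as np

class Node:
    """One cell of a binary recursive partition tree (RPT)."""
    def __init__(self, X, Y):
        self.center = X.mean(axis=0)           # cell center: mean of its samples
        self.leaf_output = None                # control vector stored at a leaf
        self.normal, self.offset = None, None  # splitting hyperplane of an inner node
        self.left, self.right = None, None

        # Stop when one sample remains or all outputs are (nearly) identical.
        if len(X) == 1 or np.allclose(Y, Y[0]):
            self.leaf_output = Y.mean(axis=0)
            return

        # First principal component of the cell's samples = hyperplane normal.
        _, _, vt = np.linalg.svd(X - self.center, full_matrices=False)
        self.normal = vt[0]
        self.offset = float(self.center @ self.normal)  # plane through the centroid

        side = X @ self.normal > self.offset
        if side.all() or not side.any():       # degenerate split: make this a leaf
            self.leaf_output = Y.mean(axis=0)
            return
        self.left = Node(X[~side], Y[~side])
        self.right = Node(X[side], Y[side])

def query(root, x, k=2):
    """Explore the k nearest cells at each level; return a distance-weighted
    average of the control vectors stored in the leaves that are reached."""
    frontier, leaves = [root], []
    while frontier:
        children = []
        for node in frontier:
            if node.leaf_output is not None:
                leaves.append(node)
            else:
                children.extend((node.left, node.right))
        if not children:
            break
        children.sort(key=lambda n: np.linalg.norm(x - n.center))
        frontier = children[:k]               # keep only the k closest cells

    dists = np.array([np.linalg.norm(x - leaf.center) for leaf in leaves])
    weights = 1.0 / (dists + 1e-9)            # faraway leaves get smaller weights
    weights /= weights.sum()
    return sum(w * leaf.leaf_output for w, leaf in zip(weights, leaves))

# Toy usage: 100 random 1,200-dimensional "images" with 2D control vectors.
X = np.random.rand(100, 1200)
Y = np.random.rand(100, 2)
tree = Node(X, Y)
print(query(tree, X[0], k=2))                 # close to Y[0]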

Automatic feature derivation. Automatic feature derivation is different from feature selection or feature extraction. Feature selection selects a few features from a set of human-defined features. Feature extraction extracts selected features (for example, edges and color) from images. Automatic feature derivation must derive features from high-dimensional raw vector inputs. The programmers program rules for deriving features, but not the actual features.

The dimensionality of the image space S is large, typically larger than the number of training images. For effective space partition, finding the subspace S′ in which the training samples lie is useful. Principal-component analysis5 is appropriate for this purpose.

Figure 4a shows how we use PCA to partition the space S recursively on the basis of the training sample points. In Figure 4a, various symbols in the entire space S represent samples. Suppose we use n samples, from which PCA computes n principal component vectors V1, V2, ..., Vn in S, ranked by decreasing eigenvalues (see the sidebar, “Principal-component analysis”). Given a number m (m ≤ n), we call the top m vectors the most expressive features because they best explain the variation of the sample distribution. Each MEF vector corresponds to an image. The first principal component vector V1 indicates the direction with the most significant variation for the samples in S. We need only to compute a single MEF, V1, for the binary RPT construction.

Put geometrically, we determine a hyperplane that has V1 as its normal vector and that goes through the centroid of the samples. This plane defines the level-0 splitter that partitions the corresponding cell (the root). The samples that fall on one side of the splitter are assigned to the root's left child; the samples falling on the other side go to the right child. In Figure 4a, the thickest line indicates the hyperplane, partitioning the entire space into two cells. The PCA for each child uses the samples that fall into the cell represented by that child.

However, if we know the class information of the training samples, we can generally do better. The symbols of the same type in Figure 4a denote input images associated with the same control vector, that is, belonging to the same class. The MEF-based recursive partition in Figure 4a does not give an ideal partition, because the partition's boundaries do not cut along the class boundary. For example, the normal of the thickest line does indicate the sample cluster's long axis, but the long axis does not necessarily align with most class boundaries, as Figure 4a indicates. Consequently, the resulting tree is larger than necessary. The main reason for this phenomenon is that such an MEF-based recursive partition does not use the class information in the training samples.



Principal-component analysis

This method computes an orthonormal basis from a set of sample vectors.1 Computationally, it computes the eigenvectors and the associated eigenvalues of the sample covariance matrix Γ of the samples. When the number of samples n is smaller than the dimensionality d of the sample vectors, we can compute the eigenvectors and eigenvalues of a smaller n × n matrix instead of a large d × d matrix.
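A minimal sketch of this trick, assuming NumPy and hypothetical names: the eigenvectors of the small n × n matrix of the centered samples are lifted back to d dimensions to obtain the principal components (the MEFs).

import numpy as np

def principal_components(samples):
    """PCA for n samples of dimensionality d with n much smaller than d:
    solve the n x n eigenproblem A A^T instead of the d x d covariance,
    then map each eigenvector u back to d dimensions as v = A^T u."""
    A = np.asarray(samples, dtype=float)
    A = A - A.mean(axis=0)                       # center the samples
    eigvals, U = np.linalg.eigh(A @ A.T)         # n x n eigenproblem, ascending order
    order = np.argsort(eigvals)[::-1]            # rank by decreasing eigenvalue
    eigvals, U = eigvals[order], U[:, order]
    V = A.T @ U                                  # lift back to d dimensions
    V = V / (np.linalg.norm(V, axis=0) + 1e-12)  # unit-length columns
    return eigvals / len(A), V                   # columns of V are the MEFs

# 50 images of 30 x 40 pixels: a 50 x 50 eigenproblem instead of 1,200 x 1,200.
eigenvalues, mefs = principal_components(np.random.rand(50, 1200))
print(mefs.shape)  # (1200, 50); column 0 is the first most expressive feature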

Reference

1. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, New York, 1990.


How do we define classes for navigation control? For example, we can predivide the training samples into different classes, according to the scene's type (for instance, a straight passageway or a corner) and the desired control signal in each type (for instance, heading increments of 10°, 5°, 0°, −5°, or −10°). We can use these preclassified training samples to more effectively partition the sample space (see Figure 4b), using multiclass, multidimensional linear discriminant analysis (see the related sidebar). LDA computes the basis of a linear subspace of a given dimensionality so that, projected onto the linear subspace,

• samples of different classes are as far apart as possible, and

• the average size of clusters of samples from the same class is constant.

We call the top basis vectors the most discriminating features (MDFs).

As in the MEF case, the binary RPT needs only a single MDF vector; this vector gives the hyperplane's normal. First, we compute the first MDF as the splitter of the root, given all the preclassified samples. Then, also as in the MEF case, the samples that fall on one side of the hyperplane go to the left child and the other samples go to the right child. We perform LDA for each child on the basis of the samples each has received. This process continues until each cell contains only samples of the same class.
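For illustration, the following Python sketch (ours; the names are hypothetical) computes MDF vectors the way the linear discriminant analysis sidebar describes, as eigenvectors of W^{-1}B; it uses a pseudoinverse for simplicity, whereas the article performs LDA inside the PCA subspace when W is degenerate.

import numpy as np

def mdf_vectors(samples, labels, num_features=1):
    """Most discriminating features: eigenvectors of W^{-1} B with the largest
    eigenvalues, where W and B are the within-class and between-class scatter
    matrices of the preclassified samples."""
    X = np.asarray(samples, dtype=float)
    y = np.asarray(labels)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    W, B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        W += (Xc - mean_c).T @ (Xc - mean_c)          # within-class scatter
        diff = (mean_c - mean_all)[:, None]
        B += len(Xc) * (diff @ diff.T)                # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(W) @ B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:num_features]].real      # columns are the MDF vectors

# The first MDF would serve as the normal of the root splitter in an MDF-based RPT.
X = np.vstack([np.random.randn(20, 5) + 3.0, np.random.randn(20, 5)])
labels = [0] * 20 + [1] * 20
v1 = mdf_vectors(X, labels, num_features=1)[:, 0]
print(v1.shape)  # (5,)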

Figure 4b shows the recursive partition represented by the MDF RPT. The partition boundary tends to cut along the class boundary, resulting in a tree much smaller than the MEF RPT in Figure 4a. Such a concise tree not only enables faster retrieval but also tends to approximate the class boundary better.

Figure 5 shows sample training images collected by Michigan State University's ROME (robotic mobile experiment) robot inside the MSU Engineering Building. In one of our tests, we used 210 images along a straight corridor and 108 images at a corner, grouped into six classes. The first five classes are straight corridors classified according to the next heading direction needed to recover the correct heading: Class 0 for 10°, Class 1 for 5°, Class 2 for 0°, Class 3 for −5°, and Class 4 for −10°. Class 5 consists of the 108 corner images.

Figures 6a and 6b show the first five of the computed MEFs and MDFs. The MEFs mainly record large areas of contrast, while the MDFs record locations of edges with increasing spatial resolutions. These MEFs and MDFs play the same role that traditional edge detectors (for example, the Laplacian-of-Gaussian operator) do. However, they are much better because they are not local (not just single edges, but a combination of edges) and are optimal (expressiveness for MEFs or discriminativeness for MDFs). Figure 6c shows the training samples projected onto the subspace spanned by the first two MEFs, and Figure 6d shows those corresponding to the first two MDFs. In the MEF subspace, each class's samples spread out widely, and the samples of different classes tend to mix together. But in the MDF subspace, each class's samples are clustered more tightly, and the samples from different classes are farther apart. From Figure 6, we can see that for classifying an unknown image from the same environment using the nearest-neighbor rule, the MEF subspace is not as good as the MDF subspace.

Shoslif with states

In the previous section, we explained a stateless system. Because the visual variation among views can be very large, such a stateless system is sufficient only for relatively benign scenes.

Why states? In principle, the nearest-neighbor rule will perform well as long as we use enough samples to cover all the cases with sufficient density. MDFs can disregard unrelated parts of the image, if enough of the provided images show that the parts are indeed unrelated. However, in practice, providing all the necessary images is impossible. For example, if the navigation scene contains a big poster on a wall and that poster changes every day, training the system for all the possible posters is impractical. (Vision-based navigation is much more difficult than range-based navigation that uses, for example, laser range finders.) Therefore, a monolithic treatment of an input image significantly limits the generalization power.



Figure 4. Recursive partitions represented by (a) the MEF (most expressive feature) binary tree and (b) the MDF (most discriminating feature) binary tree, which is smaller. Symbols of the same type indicate samples of the same class.

Linear discriminant analysis

Ronald Aylmer Fisher developed linear discriminant analysis to discriminate two classes; researchers have extended the result to more than two classes. LDA defines two matrices from the preclassified training samples: the within-class scatter matrix W and the between-class scatter matrix B. It computes a basis of a linear subspace of a given dimensionality in S so as to maximize the ratio of the between-class scatter over the within-class scatter in the subspace. This basis defines the MDF (most discriminating feature) vectors. Computationally, these vectors are the eigenvectors of W−1B associated with the largest eigenvalues.1 When W is a degenerate matrix, we can perform LDA in the subspace of principal-component analysis (see the related sidebar).

Reference

1. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, New York, 1990.


So, our objective is to train the system to actively pay attention to critical parts of the scene (landmarks), according to each situation, and to disregard unrelated parts in the scene. For learning-based methods, using landmarks is not trivial. We must systematically incorporate the attention selection mechanism into the learning system instead of explicitly programming it into a navigation program. This is because of human limitations in deriving rules for complex attention selection.

States. We define a system state vector St at each time t. This state keeps information about the context needed for task execution. For example, the information about attention action is a part of the state. In scenes that require visual attention, the system moves into the particular state defined for that attention need. At the next time instant, the system directs its visual attention to a part of the view, as a learned behavior depending on that state. The attention's relative position can also be associated with the state.


Figure 5. Sample learning images: (a) a straight corridor; (b) a left turn.


Figure 6. The difference between MEFs and MDFs in representing learning samples from a straight corridor and a corner: the first five (a) MEFs and (b) MDFs of the learning set. Each MEF is a vector in the space S, the space of all possible images. Each MEF and MDF is a weighted sum of all the training images and is automatically derived. The bottom graphs illustrate the learning samples projected onto the subspace spanned by the first two (c) MEFs and (d) MDFs. The numbers in the plot space are the class labels of the learning samples.




Symbolically, at time t, the system is at state St and observes image Xt. It gives control vector Yt+1 and enters the next state St+1. So, we can represent the corresponding mapping as

(St+1, Yt+1) = f(St, Xt) (2)

where Yt+1 includes the attention control parameters. The formulation in Equation 2 would correspond to a finite-state machine if the number of states were finite. However, this is not the case, because the state space and the sensory input space are both continuous. The programmer designs the vector representation of each state in the state space, but Shoslif automatically derives the composite features that it uses to analyze the combined space of St, Xt. So, we consider state St and input Xt as random, to take into account the uncertainty in these observations and estimates. This results in an observation-driven Markov model (ODMM).6

Computational considerations. However, we must avoid the computationally intractable task of estimating the high-dimensional distribution density conditioned on the current state and observation. Fortunately, Shoslif effectively approximates the mean of (St+1, Yt+1) given the estimated (St, Xt) using training samples. We can still consider the mapping in Equation 2 as a mapping from input (St, Xt) to output (St+1, Yt+1). The fundamental difference between this state-based Shoslif and the stateless version is that state-based Shoslif uses a part of the output St+1 at time t as a part of the input at time t + 1.
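A minimal sketch of this input-output convention (our illustration; the concatenation, dimensions, and names are assumptions): the state vector and the image vector are joined into one input, the trained approximator over the combined space is queried, and its output is split back into the next state and the control vector fed to the vehicle.

import numpy as np

def make_step(f_hat, state_dim):
    """Wrap a learned approximator f_hat over the combined space so that part
    of its output (the next state) is fed back as input at time t + 1, as in
    Equation 2: (S_{t+1}, Y_{t+1}) = f(S_t, X_t)."""
    def step(state, image_vector):
        z = np.concatenate([state, image_vector])  # combined input (S_t, X_t)
        out = f_hat(z)                             # query the trained approximator
        next_state = out[:state_dim]               # S_{t+1}: context, attention need, ...
        control = out[state_dim:]                  # Y_{t+1}: steering, speed, attention window
        return next_state, control
    return step

# Toy usage with a dummy approximator that always returns zeros.
state_dim, image_dim, control_dim = 3, 1200, 2
dummy_f = lambda z: np.zeros(state_dim + control_dim)
step = make_step(dummy_f, state_dim)
state = np.zeros(state_dim)
state, control = step(state, np.random.rand(image_dim))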

Figure 7 shows several types of corridor segments on the third floor of MSU's Engineering Building. For our experiments, the trainer created several states corresponding to some of these segment types. A special state A (ambiguous) indicates that local visual attention is needed. The trainer defined this state for a segment right before a turn. In this segment, the image area that revealed the visual difference between different turn types (that is, a landmark) was mainly in a small part of the scene. The local attention action in A directs the system to look at such landmarks through a prespecified image subwindow so that the system issues the correct steering action before it is too late. Figure 8 shows the states and their transitions, which the trainer teaches Shoslif incrementally and online.

Incremental online learning. Learning methods fall into two categories: batch or incremental.

As we mentioned before, a batch learning method requires that all the training data are available at the same time when the system learns. Figuring out beforehand how many and what kinds of training images we need to reach a required performance level is difficult. So, a batch learning method requires repeated cycles of collecting data, training, and testing. The limited space available to store training images and the need for more images for better performance are two conflicting factors. Thus, experience has taught us that collecting a sufficiently rich yet small set of training samples is tedious. Furthermore, batch learning must occur offline, because processing the entire training set takes considerable time.

With an incremental learning method, training samples are available only one at a time. This method discards each sample as soon as it is used for training. The required stream of training samples is typically so long that the number of samples is virtually infinite. In incremental learning, many similar samples are not stored at all if they are found very close to a stored element. In other words, the system can look at a spot for a long time, but the memory does not grow. With a batch method, all those similar images of the same spot must be stored first. So, batch learning is not practical for very long training streams.



Figure 7. Different types of corridor segments in an indoor environment.


Figure 8. An observation-driven Markov model for indoor vision-based navigation. The label of the form x/y denotes a state x with associated attention action y. Actions a and g denote local-view and global-view attentions, respectively. At each state, the transition to the next state depends on the image view, either local or global, indicated by the attention action. Each arrow corresponds to a very complex set of image appearances learned by Shoslif-N. In our experiments, we taught our robot to turn in a specific direction at each intersection. However, the robot must disambiguate each current situation from many other perceptually similar but physically different situations, based on the current visual input and estimated state.



We have developed an incremental-learning version of Shoslif. This version builds the RPT incrementally online. During training, the human operator controls the navigator through a joystick. At each time instant, Shoslif grabs an image from the camera and uses it to query the current RPT. If the difference between the current RPT's output and the desired control signal (given by the human operator) is beyond a prespecified tolerance, Shoslif learns the current sample to update the RPT. Otherwise, it rejects the image case without learning it. This selective-learning mechanism effectively prevents redundant learning, keeping the RPT small. This incremental-learning mode also makes learning convenient. As soon as the system rejects most of the recent training samples at a given location, we know that we can move to other locations. Little time is wasted. As long as the system has rejected most samples at all locations, we can set the system free to navigate on its own.
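Here is a minimal sketch of that selective update loop (our illustration; the flat nearest-neighbor memory stands in for the RPT, and the tolerance value and method names are assumptions, not Shoslif's actual interface).

import numpy as np

class NearestNeighborMemory:
    """Stand-in for the RPT: a flat memory queried by nearest neighbor.
    (The real system organizes the samples in a tree for fast retrieval.)"""
    def __init__(self):
        self.inputs, self.outputs = [], []
    def predict(self, x):
        if not self.inputs:
            return None
        d = [np.linalg.norm(x - s) for s in self.inputs]
        return self.outputs[int(np.argmin(d))]
    def insert(self, x, y):
        self.inputs.append(np.asarray(x, dtype=float))
        self.outputs.append(np.asarray(y, dtype=float))

def incremental_training_step(memory, image_vector, teacher_control, tolerance=0.1):
    """Learn the sample only when the current approximator is noticeably wrong."""
    predicted = memory.predict(image_vector)
    if predicted is None or np.linalg.norm(predicted - teacher_control) > tolerance:
        memory.insert(image_vector, teacher_control)
        return True          # sample accepted and learned
    return False             # redundant sample rejected; the memory stays small

# Presenting the same view twice: the second presentation is rejected.
mem = NearestNeighborMemory()
x, y = np.random.rand(1200), np.array([5.0, 0.2])  # image vector; control (heading, speed)
print(incremental_training_step(mem, x, y))  # True
print(incremental_training_step(mem, x, y))  # False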

Experimental results

A major issue of visual learning is how well Shoslif compares with other constrained-search methods such as regression trees and neural networks. We discuss our comparison experiments before presenting the results of our navigation tests.

Shoslif versus other methods. AI research on methods using classification and regression trees has been extensive. The book by Leo Breiman and his colleagues gives a balanced exposition.4 However, in statistics, available samples for statistical decisions are typically human-provided feature vectors where each component corresponds to the measurement of a meaningful parameter. Consequently, each node there uses only one component of the input vector. We face the challenging problems of dealing with high-dimensional input such as images or image sequences, where each component corresponds to a pixel. Univariate trees (based on a single component at a time), such as CART and C5.0, or multivariate trees for lower-dimensional data, such as OC1, certainly were not designed for high-dimensional, highly correlated data. Our method uses all input components at each node to derive features that utilize the information of correlation among input components, not just a single component.

In our tests using highly correlated Feret image data sets (5,632-dimensional), we obtained these error rates: CART, 41%; C5.0, 41%; OC1, 56%; and our incremental Shoslif tree, 0.00% (perfect). The error rates of our prior versions of the Shoslif tree range from 3% to 7%. (The error rates of CART, C5.0, and OC1 are comparable with Shoslif only in tests for lower-dimensional data sets.) Because the Shoslif RPTs perform automatic feature derivation, they are more than just classification and regression trees.

In our previous work, we compared Shoslif with feedforward neural networks (FFNs) and radial basis function networks (RBFs) for approximating stateless appearance-based navigation systems.7 We selected the best FFN from 100 FFNs, each with a random initial guess. FFNs performed the worst. RBF was slightly better than FFN, but only when humans, using human vision, manually selected its centers. Shoslif performed significantly better than FFNs and RBFs.

Navigation tests. We have tested the state-based Shoslif-N navigator on our ROME robot (see Figure 9) with a single video camera, without using any other sensor. A slow, onboard Sun Sparc-1 workstation with a SunVideo frame grabber performed the computation, without any special-purpose image-processing hardware. We trained ROME online and interactively, on the third floor of our Engineering Building. After the system accepts 272 samples to update the tree, it can navigate on its own. It successfully extended its experience to many hallway sections where we had not trained it. The refresh rate is 6 Hz, meaning that the system processes six image frames and performs six state transitions with the trained RPT per second.

We have conducted many test runs to observe the trained robot's performance stability. Moving at 40 to 50 millimeters per second, ROME finished one large loop in our building in approximately 20 minutes. Figure 10 shows a few frames from a video of the robot roaming during a test. As Figures 5 and 10 illustrate, the scene presents very severe specularities and a wide variety of floor tile patterns. The corridor structure changes drastically from one spot to another. Our prior stateless system could only navigate correctly through approximately one-quarter of the long loop. The state-based system can now reliably cover the entire loop. ROME continuously roamed for longer than five hours, until the onboard batteries became low. It has performed flawlessly in dozens of such tests, in the natural presence of passers-by. Our experiment also demonstrated that if we do not use states, the robot will run into walls during complex turns.

As we've shown, our learning method does not require human programmers to program information that is specific to the environment (such as indoors or outdoors) and the actions (such as heading direction or arm joint increments). In fact, we have also used this method for other tasks, such as face recognition, hand gesture recognition, speech recognition, and vision-based robot arm action learning.3 These tasks can be learned reliably if the environment, although very complex, does not vary too much. If the number of environment types increases, the number of required states will also increase. So, defining internal states for training can become overwhelming for humans.




The resulting issue is how to enable the machine learner to automatically generate internal representation online in real time; that is, to fully automate the learning process. Some work in this direction has been reported.8

Acknowledgments

National Science Foundation grant IRI 9410741 and Office of Naval Research grant N00014-95-1-0637 supported this work. We thank Yuntao Cui, Sally Howden, and Dan Swets for discussions and development of Shoslif subsystems.

References

1. M. Kirby and L. Sirovich, “Application of the Karhunen-Loève Procedure for the Characterization of Human Faces,” IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 12, No. 1, Jan. 1990, pp. 103–108.

2. M. Turk and A. Pentland, “Eigenfaces for Recognition,” J. Cognitive Neuroscience, Vol. 3, No. 1, Jan. 1991, pp. 71–86.

3. J. Weng, “Cresceptron and Shoslif: Toward Comprehensive Visual Learning,” Early Visual Learning, S.K. Nayar and T. Poggio, eds., Oxford Univ. Press, New York, 1996, pp. 183–214.

4. L. Breiman et al., Classification and Regression Trees, Chapman & Hall, New York, 1993.

5. K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, New York, 1990.

6. D.R. Cox, “Statistical Analysis of Time Series: Some Recent Developments,” Scandinavian J. Statistics, Vol. 8, No. 2, June 1981, pp. 93–115.

7. J. Weng and S. Chen, “Vision-Guided Navigation Using Shoslif,” Neural Networks, Vol. 11, Nos. 7–8, Oct./Nov. 1998, pp. 1511–1529.

8. J. Weng et al., “Developmental Humanoids: Humanoids that Develop Skills Automatically,” Proc. First IEEE-RAS Int'l Conf. Humanoid Robotics (CD-ROM), IEEE Press, Piscataway, N.J., 2000.

Juyang Weng is an associate professor in the Department of Computer Science at Michigan State University. His research interests include mental development, computer vision, autonomous navigation, and human–machine interfaces using vision, speech, gestures, and actions. He is an originator and proponent of a new research direction called developmental robots: robots that can autonomously develop cognitive and behavioral capabilities through online, real-time interactions with their environments through their sensors and effectors. He received his BS from Fudan University, Shanghai, China, and his MS and PhD from the University of Illinois, Urbana-Champaign, all in computer science. Contact him at 3115 Eng. Bldg., Dept. of Computer Science and Eng., Michigan State Univ., East Lansing, MI 48824-1226; [email protected]; www.cse.msu.edu/~weng.

Shaoyun Chen works on wafer inspection at KLA-Tencor. Previously, he worked on active vision and iris recognition at Sensar. His major interests lie in computer vision, image processing, pattern recognition, and their applications, including fingerprint recognition and iris recognition. He received his BS from Xiamen University, his ME from Tsinghua University, China, and his PhD from Michigan State University, all in computer science. Contact him at KLA-Tencor Corp., Bldg. I, Mailstop I-1009, 160 Rio Robles, San Jose, CA 95134; [email protected].


Figure 10. Controlled by the state-based Shoslif-N, ROME navigates autonomously: (a)–(e) behind ROME; (f)–(h) in front of ROME.


Figure 9. The ROME (robotic mobile experiment) robot, which Shoslif-N controls.