Eyes from Eyes: New Cameras for Structure from Motion

Jan Neumann, Cornelia Fermüller and Yiannis Aloimonos
Center for Automation Research
University of Maryland, College Park, MD 20742-3275, USA
{jneumann, fer, yiannis}@cfar.umd.edu

Abstract

We investigate the relationship between camera design and the problem of recovering the motion and structure of a scene from video data. The visual information that could possibly be obtained is described by the plenoptic function. A camera can be viewed as a device that captures a subset of this function, that is, it measures some of the light rays in some part of the space. The information contained in the subset determines how difficult it is to solve subsequent interpretation processes. By examining the differential structure of the time-varying plenoptic function we relate different known and new camera models to the spatio-temporal structure of the observed scene. This allows us to define a hierarchy of camera designs, where the order is determined by the stability and complexity of the computations necessary to estimate structure and motion. At the low end of this hierarchy is the standard planar pinhole camera for which the structure from motion problem is non-linear and ill-posed. At the high end is a new camera, which we call the full field of view polydioptric camera, for which the problem is linear and stable. In between are multiple-view cameras with large fields of view which we have built, as well as catadioptric panoramic sensors and other omni-directional cameras. We develop design suggestions for the polydioptric camera, and based upon this new design we propose a linear algorithm for ego-motion estimation, which in essence combines differential motion estimation with differential stereo.

1 Introduction

When we think about vision, we usually think of interpreting the images taken by (two) eyes such as our own - that is, images acquired by planar eyes. These are the so-called camera-type eyes based on the pinhole principle on which commercially available cameras are based. One considers a point in space and the light rays passing through that point. Then the rays are cut with a plane, and a subset of them forms an image. But these are not the only types of eyes that exist; the biological world reveals a large variety of eye designs. It has been estimated that eyes have evolved no fewer than forty times, independently, in diverse parts of the animal kingdom.

An eye or camera is a mechanism that forms images by focusing light onto a light sensitive surface (retina, film, CCD array, etc.). Throughout this paper we will use the term "eye" both for cameras and biological eyes. Different eyes are obtained by varying three elements: (1) the geometry of the surface, (2) the geometric distribution and optical properties of the photoreceptors, and (3) the way light is collected and projected onto the surface (single or multiple lenses, or tubes as in compound eyes). Vision systems process these images to recognize, navigate, and generally interact with the environment. How advanced this interaction is depends both on the value of the information collected by the eyes and on how difficult it is to create intelligent behavior from such information. Evolutionary considerations tell us that the design of a system's eye is related to the visual tasks the system has to solve. The way images are acquired determines how difficult it is to perform a task, and since systems have limited resources, their eyes should be designed to optimize subsequent image processing as it relates to particular tasks. We would like to gain insight into the relationship of eye design and task performance.

Technological advances make it possible to construct integrated imaging devices using electronic and mechanical micro-assembly, micro-optics, and advanced data buses, not only of the kind that exists in nature such as log-polar retinas [7], but also of many other kinds. Thus answers to our question are not only relevant for explaining biological systems but also have immediate impact on technology.

We wish to evaluate and compare different eye designs in a scientific sense by using mathematical criteria. More specifically, we want to determine how we ought to collect images of a (dynamic) scene to best recover the scene's shapes and actions from video sequences. This problem has wide implications for a variety of applications not only in vision and recognition, but also in navigation, virtual reality, tele-immersion, and graphics. At the core of this capability is the celebrated module of structure from motion, and so our question becomes: What eye should we use, for collecting video, so that we can subsequently facilitate the structure from motion problem in the best possible way?

Camera Design                                   Recovery of 3D motion and structure
(a) planar camera                               nonlinear and ill-posed
(b) spherical camera or
(c) Argus eye and omni-directional cameras      nonlinear, but stable, robust
(d) small field-of-view polydioptric camera     linear, but not stable
(e) spherical polydioptric camera               linear and stable

Table 1. Hierarchy of cameras with respect to the structure from motion problem

To classify cameras, we will study the most complete visual representation of the scene, namely the plenoptic function as it changes differentially over time. In free space the plenoptic function amounts to the 5D space of time-varying light rays. Any imaging device captures a subset of the plenoptic function. We would like to know how, by considering different subsets of the light field, the problem of structure from motion becomes easier or harder. The problem of structure from motion amounts to estimating the rigid motion which the camera undergoes on the basis of the rays captured by the camera; then, with knowledge of the camera's location at two (or more) time instants, through triangulation of rays originating from the same scene point, recovery of the scene structure is achieved.
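To make the triangulation step concrete, here is a minimal sketch (not the authors' implementation; numpy is assumed and the helper name is hypothetical) that recovers a scene point from two rays with known origins by taking the midpoint of the shortest segment between them:

import numpy as np

def midpoint_triangulate(c1, r1, c2, r2):
    """Triangulate the scene point observed along two rays.

    c1, c2: 3-vectors, camera centers at the two time instants.
    r1, r2: unit 3-vectors, directions of the rays that observe
            the same scene point from c1 and c2.
    Returns the midpoint of the shortest segment joining the two rays.
    """
    r1, r2 = r1 / np.linalg.norm(r1), r2 / np.linalg.norm(r2)
    # Solve for ray parameters (s1, s2) minimizing |c1 + s1*r1 - (c2 + s2*r2)|
    A = np.stack([r1, -r2], axis=1)                     # 3x2 system
    b = c2 - c1
    (s1, s2), *_ = np.linalg.lstsq(A, b, rcond=None)
    return 0.5 * ((c1 + s1 * r1) + (c2 + s2 * r2))

# Example: cameras at the origin and at (1, 0, 0), both observing the point (0, 0, 5).
# midpoint_triangulate(np.zeros(3), np.array([0.0, 0.0, 1.0]),
#                      np.array([1.0, 0.0, 0.0]),
#                      np.array([-1.0, 0.0, 5.0]) / np.linalg.norm([-1.0, 0.0, 5.0]))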

A theoretical model for a camera that captures the plenoptic function in some part of the space is a surface S that has at every point a pinhole camera. We call this camera a polydioptric camera¹. With such a camera we observe every point in the scene in view from many different viewpoints (theoretically, from every point on S) and thus we capture many rays emanating from that point. A parameterization for these general cameras has been introduced recently in [13].

¹A "plenoptic camera" had been proposed in [2], but since no physical device can capture the true time-varying plenoptic function, we prefer the term polydioptric to emphasize the difference between the theoretical concept and the implementation.

Standard single-pinhole cameras capture only one ray from each point in space, and from this ray at different times the structure and motion must be estimated. This makes estimation of the viewing geometry a non-linear problem. The additional information in the polydioptric camera (multiple rays from the same scene point) makes estimation of the viewing geometry linear. There is another factor that affects the estimation, namely the surface S on which light is captured. It has long been known that there is an ambiguity in the estimation of the motion parameters for small field of view cameras, but only recently has it been noticed that this ambiguity disappears for a full field of view camera. We will explore this issue in more detail in Section 2.

Thus there are two principles relating camera design to performance in structure from motion – the field of view and the linearity of the estimation. These principles are summarized in Table 1.

For a planar camera, which by construction has a limited field of view, the problem is nonlinear and ill-posed. If, however, the field of view approaches 360°, that is, the pencil of light rays is cut by a sphere, then the problem becomes well-posed and stable, although still nonlinear. It is currently technologically impossible to implement a high-resolution spherical camera. Catadioptric sensors have been used to capture a full field of view [8, 11], but they do not provide the resolution necessary for model building. One can, however, approximate such a camera by using several conventional cameras, capable of synchronized recording and arranged on a surface so that they capture a number of light ray pencils simultaneously. We use the name "Argus Eye" for such a device [4]. When the cameras are arranged on a sphere (or any other surface that enables sampling of the full field of view), then we obtain a spherical Argus eye.

A polydioptric camera can be obtained if we arrange ordinary cameras very close to each other (Figs. 1 and 2). This camera has an additional property arising from the proximity of the individual cameras: it can form a very large number of orthographic images, in addition to the perspective ones. Indeed, consider a direction r in space and then consider in each individual camera the captured ray parallel to r. All these rays together, one from each camera, form an image with rays that are parallel. Furthermore, for different directions r a different orthographic image can be formed. For example, Fig. 1 shows that we can select one appropriate pixel in each camera to form an orthographic image that looks to one side (blue rays) or another (red rays). Fig. 2 shows all the captured rays, thus illustrating that each individual camera collects conventional pinhole images.
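As an illustration of this pixel-selection idea (a sketch with hypothetical array names, assuming all cameras share the same orientation and calibrated per-pixel ray directions), one orthographic image per direction r can be assembled by picking, in every camera, the pixel whose ray is most nearly parallel to r:

import numpy as np

def orthographic_image(images, ray_dirs, r):
    """Assemble an orthographic view from a grid of pinhole cameras.

    images:   array (Ny, Nx, H, W) of intensities, one image per camera.
    ray_dirs: array (H, W, 3) of unit ray directions, assumed identical
              for all cameras (same orientation and intrinsics).
    r:        3-vector, the desired common viewing direction.
    Returns an (Ny, Nx) image with one sample per camera.
    """
    r = r / np.linalg.norm(r)
    dots = ray_dirs @ r                              # (H, W) alignment with r
    v, u = np.unravel_index(np.argmax(dots), dots.shape)
    return images[:, :, v, u]                        # one pixel from each camera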

Figure 1. Design of a polydioptric camera capturing parallel rays.

Figure 2. ...and simultaneously capturing a pencil of rays.

Thus, a polydioptric camera has the unique property that it captures, simultaneously, a large number of perspective and affine images (projections). We will demonstrate that it also makes the structure from motion problem linear. A polydioptric spherical camera is therefore the ultimate camera since it combines the stability of full field of view motion estimation with linearity of the problem, as well as the ability to reconstruct scene models with minimal reconstruction errors.

The outline of this paper is as follows. We will analyze the effect of the field of view of a camera on motion estimation (Section 2), before we use the framework of plenoptic video geometry (Section 3) to show how a polydioptric camera makes structure from motion estimation a linear problem. Then we relate the information in the polydioptric camera to the information in ordinary pinhole cameras and differential stereo setups, and also examine the relation between the plenoptic derivatives and their relative scales. Based on the insights gained, we propose a feedback algorithm to accurately compute the structure and motion using all the plenoptic derivatives, and we conclude with suggestions about how to implement and construct polydioptric and Argus eyes.

2 Ambiguities due to the Field of View

For standard cameras with only one imaging surface and one pinhole there is only information about the change of intensity with direction (∇_r E).

If the imaging surface is a sphere of unit radius centered at the origin of the fiducial coordinate system (x = 0), we consider a parameterization of the imaging surface by the directional coordinates r, where R is the scene point projected on the imaging surface at r, and thus r = R/|R|.

Assuming the intensity at corresponding image points to be the same, we obtain the image brightness constraint equation

E_t + ∇_r E · dr/dt = 0   (1)

which we relate to the motion parameters as follows:

-E_t = ∇_r E · dr/dt = ∇_r E · ( (1/|R|) t + (ω × r) ).   (2)

The most common approach to motion estimation on the basis of this input proceeds in two computational steps. First, on the basis of the image derivatives (E_t, ∇_r E) one estimates an approximation to the motion field dr/dt, the so-called optical flow field. To do so, one has to make assumptions about the flow which in essence amount to assumptions about the scene in view; usually the flow field is modeled as varying smoothly. In a second step one solves for the parameters of the rigid motion, that is, the direction of the translational velocity t/|t| and the rotational velocity ω. This is accomplished by minimizing deviation from the epipolar constraint. This constraint for both the plane and the sphere takes the form (dr/dt − ω × r) · (t × r) = 0. As can be seen, this equation is non-linear in the motion parameters. Other approaches, often called direct approaches, relate the image derivatives directly to the 3D motion parameters. Without making any assumptions about the scene in view, the only constraint that can be used is the depth positivity constraint, that is, the depth has to have positive value. Algorithms that implement this constraint search (in appropriate subspaces) for the 3D motion parameters which yield the smallest number of negative depth values.
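For later comparison with the linear polydioptric formulation, the following sketch (illustrative only; it assumes a precomputed spherical flow field and uses scipy.optimize with a simple spherical-angle parameterization) implements the second of these two steps by minimizing the summed squared epipolar residual over the translation direction and the rotation:

import numpy as np
from scipy.optimize import minimize

def estimate_motion_epipolar(r, flow):
    """Estimate (t/|t|, omega) from spherical optical flow.

    r:    (N, 3) unit viewing directions on the image sphere.
    flow: (N, 3) measured flow vectors dr/dt (tangent to the sphere).
    Minimizes sum_i [ (flow_i - omega x r_i) . (t x r_i) ]^2 ,
    the deviation from the epipolar constraint; nonlinear in the unknowns.
    """
    def unpack(p):
        theta, phi = p[0], p[1]
        t = np.array([np.sin(theta) * np.cos(phi),
                      np.sin(theta) * np.sin(phi),
                      np.cos(theta)])                 # unit translation direction
        return t, p[2:5]                              # (t, omega)

    def cost(p):
        t, omega = unpack(p)
        resid = np.einsum('ij,ij->i', flow - np.cross(omega, r), np.cross(t, r))
        return np.sum(resid ** 2)

    res = minimize(cost, x0=np.array([0.5, 0.5, 0.0, 0.0, 0.0]), method='Nelder-Mead')
    return unpack(res.x)

For a small field of view this cost function has exactly the valley-shaped minimum discussed next, which is what makes this nonlinear step fragile in practice.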

Solving accurately for 3D motion parameters using conventional small field of view cameras turned out to be a very difficult problem. The main reason for this has to do with the apparent confusion between translation and rotation in the image displacements. This really is a geometric problem and it exists for both small and large baselines between the views, that is, for the case of continuous motion as well as in the case of discrete displacements of the cameras. The basic understanding of these difficulties has attracted only a few investigators over the years [3, 9, 14, 16].

Figure 3. Schematic illustration of the error function in the space of the direction of translation. (a) A valley for a planar surface with a limited field of view. (b) A clearly defined minimum for a spherical field of view.

Intuitively speaking, for imaging surfaces with small fields of view the minima of the error functions lie in a valley. This is a cause for inherent instability because, in a real situation, any point of that valley or flat area could serve as the minimum, thus introducing errors in the computation (see Fig. 3a). For imaging surfaces with a large field of view, on the other hand, the functions have a well-defined minimum, as shown in Fig. 3b, and thus there is no ambiguity in the solution.

To give some geometric intuition, we write the motion constraint on the sphere (Eq. 2) as

-E_t = (1/|R|) ∇_r E · t + (r × ∇_r E) · ω.   (3)

Since ∇_r E is perpendicular to (r × ∇_r E), for a small field of view (r varies very little) and little variation in depth, a translational error t_ε can be compensated by a rotational error ω_ε without violating the constraint in Eq. 3, as long as the errors have the following relationship:

(1/|R|) r × t_ε = −r × (r × ω_ε).   (4)

That is, the projections of the translational and rotational errors on the tangent plane to the sphere at r need to be perpendicular. We call this the orthogonality constraint. If we now increase the field of view, the constraint on the errors in Eq. 4 cannot be satisfied for all r, thus the confusion disappears.

There is another ambiguity. Looking at the first term in Eq. 3, that is (1/|R|) ∇_r E · t, we see that the component of t parallel to r does not factor into the equation (since ∇_r E · r = 0) and therefore cannot be recovered from the projection onto the gradients for a small field of view. We call this the line constraint on the plane, because the projections of the actual t (FOE) and the estimated translation t + λr, λ ∈ R, onto the image plane lie on a line through the image center. Again an increase in the field of view will eliminate this ambiguity, since then measurements at other image locations enable us to estimate the component of t parallel to r.
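A small numerical check makes the rotation-translation confusion tangible (a self-contained sketch under simplified assumptions: constant depth, random tangent-plane gradients, noise-free temporal derivatives). For viewing directions confined to a narrow cone around the optical axis r0, pairing a tangential translational error with the rotational error that compensates it at r0 leaves Eq. 3 almost satisfied; over a full sphere of directions it does not:

import numpy as np

rng = np.random.default_rng(0)

def rms_violation(r, depth, t, omega, t_err, omega_err):
    """RMS violation of Eq. 3 over the rays r when (t, omega) are perturbed.

    r: (N, 3) unit viewing directions; depth: constant scene depth |R|;
    the gradients grad_r E are drawn at random in the tangent plane at each r.
    """
    g = rng.normal(size=r.shape)
    g -= np.sum(g * r, axis=1, keepdims=True) * r          # enforce grad_r E . r = 0
    E_t = -np.sum(g * (t / depth + np.cross(omega, r)), axis=1)   # exact Eq. 3
    resid = E_t + np.sum(g * ((t + t_err) / depth
                              + np.cross(omega + omega_err, r)), axis=1)
    return np.sqrt(np.mean(resid ** 2))

def sample_cap(n, half_angle):
    """n random unit directions within half_angle of the +z axis."""
    z = rng.uniform(np.cos(half_angle), 1.0, size=n)
    phi = rng.uniform(0.0, 2.0 * np.pi, size=n)
    s = np.sqrt(1.0 - z ** 2)
    return np.stack([s * np.cos(phi), s * np.sin(phi), z], axis=1)

r0 = np.array([0.0, 0.0, 1.0])                 # optical axis of the narrow-FOV camera
depth, t, omega = 5.0, np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.1, 0.0])
t_err = np.array([0.0, 0.3, 0.0])              # translational error tangent to r0
omega_err = -np.cross(r0, t_err) / depth       # rotational error compensating t_err at r0

print(rms_violation(sample_cap(5000, np.deg2rad(5)), depth, t, omega, t_err, omega_err))  # small
print(rms_violation(sample_cap(5000, np.pi), depth, t, omega, t_err, omega_err))          # much larger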

3 Plenoptic Video Geometry: The Differential Structure of the Space of Light Rays

3.1 Photometric Properties of the Scene

Let the scene surrounding the image sensor be modeled by the signed distance function f(x; t) : R^3 × R^+ → R. The insides of the objects in the scene are defined by f(x; t) ≤ 0; therefore, the iso-surface f(x; t) = 0 is a representation of the surfaces in the scene. At each location x in free space (f(x; t) > 0), the radiance, that is, the light intensity or color observed at x from a given direction r at time t, can be measured by the plenoptic function E(r; x; t); E : S^2 × R^3 × R^+ → R^d, where d = 1 for intensity, d = 3 for color images, and S^2 is the unit sphere of directions in R^3 [1]. Since a transparent medium such as air does not change the color of the light, we have a constant radiance along the view direction r:

∇_x E · r = ∇_r E · r = 0   ∀ x ∈ R^3, f(x; t) > 0,   (5)

where ∇_r = (∂/∂r_1, ∂/∂r_2, ∂/∂r_3)^T and ∇_x = (∂/∂x_1, ∂/∂x_2, ∂/∂x_3)^T.

Therefore, the plenoptic function in free space reduces to five dimensions – the time-varying space of directed lines, for which many representations have been presented (for an overview see Camahort and Fussell [6]).

A sensor element at location x on the imaging surface S captures a light ray φ from direction r, so φ(λ) = x + λr, λ ∈ R. If we assume that in the neighborhood of the ray φ the radiance is varying continuously (e.g. smoothly varying reflectance and albedo, and φ is not tangent to a scene surface), then we can develop the plenoptic function in the neighborhood of ray φ, that is E(r; x; t), into a Taylor series

E(r + dr; x + dx; t + dt) = E(r; x; t) + E_t dt + ∇_r E · dr + ∇_x E · dx + O(‖(dr, dx, dt)‖^2)   (6)

where E_t = ∂E/∂t is the temporal derivative of the radiance at the sensor element. Disregarding the higher-order terms, we have a linear function which relates a local change in view ray position and direction to the differential brightness structure of the plenoptic function at the sensor element.

The camera moves in a static world; therefore, we assume that the intensity of a light ray leaving the scene surface remains constant over consecutive time instants. This allows us to use the spatio-temporal brightness derivatives of the light rays captured by an imaging surface to constrain the plenoptic ray flow, that is, the change in position and orientation between the two rays captured by the imaging element at consecutive time instants, by generalizing the well-known Image Brightness Constancy Constraint to the Plenoptic Brightness Constancy Constraint:

(d/dt) E(r; x; t) = E_t + ∇_r E · dr/dt + ∇_x E · dx/dt = 0.   (7)

3.2 Plenoptic Motion Equations

In this section we will relate the motion of an imaging sensor to the plenoptic brightness constancy constraint (Eq. 7). Assuming that the imaging sensor undergoes a rigid motion with instantaneous translation t and rotation ω around the origin of the fiducial coordinate system, we can define the plenoptic ray flow for the ray captured by the imaging element located at location x and looking in direction r as

dr/dt = ω × r   and   dx/dt = ω × x + t.   (8)

Combining Eqs. 7 and 8 leads to the plenoptic motion constraint

-E_t = ∇_x E · (ω × x + t) + ∇_r E · (ω × r)   (9)
     = ∇_x E · t + (x × ∇_x E + r × ∇_r E) · ω   (10)

which is a linear constraint in the motion parameters and relates them to all the differential image information that a sensor can capture. This equation will be our main tool for comparing the different camera models. To our knowledge, this is the first time that the temporal properties of the plenoptic function have been related to the structure from motion problem. In previous work, the plenoptic function has mostly been studied in the context of image-based rendering in computer graphics under the names light field [15] and lumigraph [12], and only the 4D subspace of the static plenoptic function corresponding to the light rays in free space was examined.
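Because Eq. 10 is linear in t and ω, stacking one constraint per captured ray yields an ordinary least-squares problem. The sketch below (illustrative, with hypothetical variable names; it assumes the plenoptic derivatives have already been measured) solves it directly:

import numpy as np

def plenoptic_motion(x, r, grad_x, grad_r, E_t):
    """Linear least-squares estimate of (t, omega) from Eq. 10.

    x, r:            (N, 3) positions and unit directions of the captured rays.
    grad_x, grad_r:  (N, 3) positional and directional plenoptic derivatives.
    E_t:             (N,)   temporal derivatives.
    Each ray contributes one row of
        grad_x . t + (x x grad_x + r x grad_r) . omega = -E_t .
    """
    A = np.hstack([grad_x, np.cross(x, grad_x) + np.cross(r, grad_r)])   # (N, 6)
    sol, *_ = np.linalg.lstsq(A, -E_t, rcond=None)
    return sol[:3], sol[3:]                              # t, omega

In practice, rows with unreliable local gradients would be weighted down or discarded before solving.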

It is important to realize that the derivatives ∇_r E and ∇_x E can be obtained from the image information captured by a polydioptric camera. Recall that a polydioptric camera can be envisioned as a surface where every point corresponds to a pinhole camera (see Fig. 2). ∇_r E, the plenoptic derivative with respect to direction, is the derivative with respect to the image coordinates that one finds in a traditional pinhole camera. One keeps the position and time constant and changes direction (Fig. 2). The second plenoptic derivative, ∇_x E, is obtained by keeping the direction of the ray constant and changing the position along the surface (Fig. 1). Thus, one captures the change of intensity between parallel rays. This is similar to computing the derivatives in an affine or orthographic camera. In Section 1 we mentioned that a polydioptric camera captures perspective and affine images. ∇_r E is found from the perspective images and ∇_x E from the affine images. The ability to compute all the plenoptic derivatives depends on the ability to capture light at multiple viewpoints coming from multiple directions. This corresponds to the ability to incorporate stereo information into motion estimation, since multiple rays observe the same part of the world. For single-viewpoint cameras this is inherently impossible, and thus it necessitates nonlinear estimation over both structure and motion to compensate for this lack of multi-view (or equivalently depth) information. This will amplify the systematic errors in the estimated motion, as we described in Section 2.
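For a planar polydioptric array this distinction is easy to state in code. Assuming, purely for illustration, that the captured data are stored as a 5D array E[t, y, x, v, u], where (y, x) index the camera on the surface and (v, u) the pixel within each camera, finite differences along the inner axes approximate the directional derivatives and along the camera axes the positional derivatives:

import numpy as np

def plenoptic_derivatives(E, dt=1.0, dx=1.0, dr=1.0):
    """Finite-difference plenoptic derivatives from a polydioptric image stack.

    E: array (T, Ny, Nx, H, W); (Ny, Nx) index the pinhole cameras on the
       surface, (H, W) the pixels (ray directions) within each camera.
    Returns (E_t, grad_x, grad_r): grad_x differentiates across cameras at a
    fixed pixel (parallel rays, 'affine' images), grad_r differentiates across
    pixels within a camera ('perspective' images).
    """
    E_t = np.gradient(E, dt, axis=0)
    grad_x = np.stack(np.gradient(E, dx, axis=(1, 2)), axis=-1)   # d/dy, d/dx
    grad_r = np.stack(np.gradient(E, dr, axis=(3, 4)), axis=-1)   # d/dv, d/du
    return E_t, grad_x, grad_r

The returned derivatives are taken with respect to the array's own coordinates; mapping them to the 3D vectors ∇_x E and ∇_r E of Eq. 6 requires the calibrated geometry of the imaging surface and its pixels.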

Although the estimation of structure and motion for a single-viewpoint spherical camera is stable and robust, it is still non-linear, and the algorithms which give the most accurate results are search techniques, and thus rather elaborate. Therefore, we will now focus on the advantages of having access to both directional (∇_r E) as well as positional plenoptic derivatives (∇_x E) by analyzing the relationship between the plenoptic intensity derivatives. It will be shown how the depth information is encoded in the relations between the plenoptic derivatives ∇_x E, ∇_r E, E_t, and the camera motion parameters t and ω. The relationship between ∇_x E and ∇_r E had been previously utilized in differential stereo [10] and epipolar plane image analysis [5], while ∇_r E, E_t, t, and ω are used in differential motion estimation algorithms. Our work for the first time integrates differential motion information and differential stereo in a plenoptic framework using all the plenoptic derivatives.

3.3 Relationship between Plenoptic Derivatives

Comparing the plenoptic motion constraint (Eq. 9) for a single-viewpoint camera moving through the plenoptic space to the single-viewpoint motion constraint on the sphere (Eq. 2), we see that these constraints are nearly identical. Choosing the sensor viewpoint to be the origin of the coordinate system (x = 0), the only difference is that in one case the translational component is given by ∇_x E · t and in the other case by ∇_r E · t/|R|. We can interpret the classical single-viewpoint motion constraint as being obtained from the plenoptic motion constraint by substituting for the positional derivatives the directional derivatives and additional depth information. That is,

-E_t = ∇_x E · t + ∇_r E · (ω × r)   (11)
     = (1/|R|) ∇_r E · t + ∇_r E · (ω × r).   (12)

This relationship between the derivatives can easily be shown by the law of similar triangles. Since f(x + |R|r) = 0 and the surface is assumed to have slowly varying reflectance, we can apply a simple triangulation argument (see Fig. 4) to get the following identity (because ∇_r E · r = 0 we choose dx ‖ dr such that r · dx = r · dr = 0):

E(r + dr; x; t) = E(r; x + |R| dr; t)   or   (13)
E(r; x + dx; t) = E(r + dx/|R|; x; t).   (14)

If we now compute the directional derivative of the plenoptic function along direction dx, it follows from ‖dx‖ = |R| ‖dr‖ that

∇_r E|_dr = |R| ∇_x E|_dx,   (15)

and we see that depth is encoded as the ratio between the positional and directional derivatives.

Figure 4. Relationship between directional and positional derivatives.

Figure 5. Depth dependence of derivative scale (matching scale in red).

In differential stereo we have two (or more) cameras separated by a small baseline b (translating camera with known motion), and we want to recover depth by relating the image difference between the two cameras E_2 − E_1 to the spatial derivative ∇_r E in one of the images (again, using derivatives on the sphere we have ∇_r E · r = 0). The familiar formulation of the differential stereo equation on the left-hand side can now be restated using the fundamental relationship between the positional and directional plenoptic derivatives on the right-hand side, since for a small baseline E_2 − E_1 ≈ ∇_x E · b:

E_2 − E_1 = (∇_r E · b)/|R|   →   (∇_r E · b)/(E_2 − E_1) = |R| ≈ (∇_r E · b)/(∇_x E · b).   (16)

Thus, we can interpret plenoptic motion estimation as the integration of differential motion estimation with differential stereo.

There are three ways to compute depth with a polydioptric sensor based on the measured plenoptic derivatives. Using the plenoptic motion constraint Eq. 9, we can express the depth using each pairwise relation between the temporal, positional, and directional derivatives as follows:

|R| = ∇_r E / ∇_x E = − (∇_r E · t) / (E_t + ∇_r E · (ω × r)) = − (E_t + ∇_x E · t) / (∇_x E · (ω × r)).   (17)

For the latter two, we need to have an accurate estimate of the parameters of the rigid motion; otherwise the depth will be distorted.
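As an illustrative sketch (hypothetical function and variable names), the three estimates of Eq. 17 can be evaluated per ray and compared; a large spread signals either a poor motion estimate or mismatched derivative scales (see Section 3.4):

import numpy as np

def depth_estimates(r, grad_x, grad_r, E_t, t, omega, eps=1e-12):
    """Three per-ray depth estimates |R| from Eq. 17.

    r, grad_x, grad_r: (N, 3) ray directions and plenoptic derivatives;
    E_t: (N,) temporal derivatives; t, omega: (3,) rigid motion parameters.
    """
    wxr = np.cross(omega, r)
    # 1) ratio of directional to positional derivatives (Eq. 15); Eq. 17 writes
    #    this as grad_r E / grad_x E along matched directions, so using norms
    #    here is a simplification.
    d1 = np.linalg.norm(grad_r, axis=1) / (np.linalg.norm(grad_x, axis=1) + eps)
    # 2) from the directional and temporal derivatives
    d2 = -np.sum(grad_r * t, axis=1) / (E_t + np.sum(grad_r * wxr, axis=1) + eps)
    # 3) from the positional and temporal derivatives
    d3 = -(E_t + np.sum(grad_x * t, axis=1)) / (np.sum(grad_x * wxr, axis=1) + eps)
    return d1, d2, d3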

3.4 Scale Dependence of the Plenoptic Derivatives

The accuracy of the linearization of the time-varying plenoptic function (Eq. 6) depends on the compatibility of the plenoptic derivatives. This means that the computation of all the plenoptic derivatives needs to be based upon similar subsets of the scene radiance. Notice that if we combine information from neighboring measurements in directional space at a fixed position, we integrate radiance information over a region of the scene surface whose area scales with the distance from the sensor to the scene. In contrast, if we combine information over neighboring measurements in positional space for a fixed direction, we integrate information over a region of the scene surface whose area is independent of the depth (illustrated in Fig. 5).

Unless the brightness structure of the scene has enough similarity across scales (e.g., if the local scene radiance changes linearly on the scene surface), so that the derivative computation is invariant to our choice of filter size, we have to make sure when we compute the plenoptic derivatives with respect to time, direction, and position that the domains of integration of our derivative filters relative to the scene are as similar as possible.

One way to adjust the filter sizes would be to compute the temporal, directional and positional derivatives at many scales and use Eq. 17 as a constraint to find the best relative shift in scale space between the three derivatives.
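One possible sketch of such a scale search (an illustration under stated assumptions, not the authors' procedure; it uses scipy.ndimage.gaussian_filter and, for simplicity, varies only the directional smoothing) keeps the scale at which the depth estimates of Eq. 17 agree best:

import numpy as np
from scipy.ndimage import gaussian_filter

def best_directional_scale(E, derivative_fn, consistency_fn,
                           sigmas=(0.5, 1.0, 2.0, 4.0)):
    """Pick the directional smoothing scale that makes Eq. 17 most consistent.

    E:              (T, Ny, Nx, H, W) plenoptic image stack.
    derivative_fn:  maps a smoothed stack to (E_t, grad_x, grad_r).
    consistency_fn: maps those derivatives to a scalar disagreement between
                    the three depth estimates of Eq. 17 (smaller is better).
    """
    best_sigma, best_score = None, np.inf
    for sigma in sigmas:
        # Smooth only the within-camera (directional) axes; the positional
        # footprint on the scene is depth-independent and is left fixed here.
        E_s = gaussian_filter(E, sigma=(0, 0, 0, sigma, sigma))
        score = consistency_fn(*derivative_fn(E_s))
        if score < best_score:
            best_sigma, best_score = sigma, score
    return best_sigma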

4 Experiments

The analysis in Section 3.4 suggests the following plenoptic structure from motion algorithm. Using the proposed plenoptic motion framework, one can envision a feedback loop algorithm, where we use all the plenoptic derivatives to compute an estimate of the camera motion using Eq. 9. Since we are solving a linear system, the computation of the motion parameters is fast and we do not have any convergence issues as in the nonlinear methods necessary for single-viewpoint cameras. Then we can use the recovered motion together with the plenoptic derivatives to compute a scene depth estimate. If the three estimates in Eq. 17 do not match, we adapt the integration domains of the temporal, directional and positional derivative filters until we compute consistent depth and motion estimates. This is repeated for each frame of the input video, while simultaneously we use the computed motion trajectory to integrate and refine the instantaneous depth maps in a large-baseline stereo optimization.
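A minimal skeleton of this feedback loop for a single frame might look as follows (illustrative only; the helper functions are placeholders in the spirit of the sketches above, and the depth-map fusion across frames is not shown):

import numpy as np

def plenoptic_sfm_frame(E, derivative_fn, motion_fn, depth_fn, rescale_fn,
                        max_iters=5, tol=0.05):
    """One frame of the feedback loop: derivatives -> motion -> depth -> scales.

    derivative_fn(E)           -> (E_t, grad_x, grad_r)   plenoptic derivatives
    motion_fn(derivs)          -> (t, omega)              linear solve of Eq. 9/10
    depth_fn(derivs, t, omega) -> (d1, d2, d3)            per-ray depths, Eq. 17
    rescale_fn(E, spread)      -> E with adapted derivative filter sizes
    """
    for _ in range(max_iters):
        derivs = derivative_fn(E)
        t, omega = motion_fn(derivs)
        d1, d2, d3 = depth_fn(derivs, t, omega)
        # Disagreement between the three depth estimates (Section 3.4)
        spread = np.nanmedian(np.abs(d1 - d2) + np.abs(d2 - d3))
        if spread < tol:                      # consistent depth and motion: done
            break
        E = rescale_fn(E, spread)             # adapt the integration domains
    depth = np.nanmedian(np.stack([d1, d2, d3]), axis=0)
    return t, omega, depth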

Figure 6. Accuracy of Plenoptic Motion Estimation. The plot shows the ratio of the true and estimated motion parameters (vertical axis) as a function of the distance between the sensor surface and the scene (horizontal axis). Unit depth corresponds to affine and perspective images having a matching scale (see Fig. 5). The solid lines denote the ratios of the rotation parameters, and the dashed lines the ratios of the translation parameters.

To examine the performance of an algorithm using the plenoptic motion constraint, we did experiments with synthetic data. We distributed spheres, textured with a smoothly varying pattern, randomly in the scene so that they filled the horizon of the camera (see Fig. 7). We then computed the plenoptic derivatives through raytracing, stacked the linear equations (Eq. 9) to form a linear system, and solved for the motion parameters. Even using derivatives only at one scale, we find that the motion is recovered very accurately, as seen in Fig. 6. As long as the relative scales of the derivatives are similar enough (scene not too far away), the error in the motion parameters varies between 1% and 3%.

Figure 7. Subset of an Example Scene.

5 The Physical Implementation of Argus and Polydioptric Eyes

Perhaps the biggest challenge in making a polydioptric camera is to make sure that neighboring cameras are at a distance that allows estimation of the "orthographic" derivatives of rays, i.e., the change in ray intensity when the ray is moved parallel to itself. By using special pixel readout from an array of tightly spaced cameras we can obtain a polydioptric camera. For scenes that are not too close to the cameras it is not necessary to have the individual cameras very tightly packed; therefore, miniature cameras may be sufficient. The idea of lenticular sheets has appeared in the literature [2] for 3D imaging. These sheets employ cylindrical lenses and are not very appealing because of the blurring they create. There are, however, similar micro-image formation ideas that would fully support the mathematical advantages of polydioptric eyes suggested in the previous section. One such idea is shown in Fig. 8. A micro-lens array is mounted on the surface of an image sensor directly, emulating an insect compound eye. Fig. 9 shows the imaging geometry. As an alternative to micro-lens arrays one could also consider coherent optical fiber bundles as image guides.

Figure 8. Forming a kind of polydioptric eye.

Figure 9. Plenoptic projection geometry for micro-lenses.

These are commonly used as image tapers to resize the field of view in scientific cameras such as digital X-ray cameras. Also, image guides are manufactured for use in medical instruments for laparoscopy. A plurality of image guides would be used. The acceptance faces of the image guides would be tightly arranged in an imaging surface; the exit faces could then be spaced apart and coupled to image sensors. A bundle of thousands of such fibers, appropriately calibrated, may constitute the ultimate design. We are currently experimenting with small versions and investigating the possibility of different optical materials. Using image guides is a more expensive proposition, but it is an attractive alternative, as this approach builds a superb compound eye with each "ommatidium" having a complete perspective image of the scene.

6 Conclusion

According to ancient Greek mythology, Argus, the hundred-eyed guardian of Hera, the goddess of Olympus, alone defeated a whole army of Cyclopes, one-eyed giants. The mythological power of many eyes became real in this paper, which proposed a mathematical analysis of new cameras. This paper also introduced, for the first time, the relation between the local differential structure of the time-varying plenoptic function and the rigid motion of an imaging sensor. This relationship was used to formulate design principles for new cameras. Using the two principles relating camera design to the performance of structure from motion algorithms – the field of view and the linearity of the estimation – we defined a hierarchy of camera designs. Although the mathematics of visual computing have been considerably advanced, the cameras we use in computer vision applications have basically been using the same principles for the past century: they capture a pencil of light rays with a limited field of view. In this paper, based upon the two design principles that we have formulated, we have introduced a new family of cameras, that is, polydioptric cameras. Polydioptric cameras are constructed by placing a large number of individual cameras very close to each other. Polydioptric cameras capture all the rays falling on a surface and allow estimation of the plenoptic ray flow of any light ray under any rigid movement. This provides polydioptric cameras with the capability of solving for scene models in a linear manner, as described in the preceding sections, opening new avenues for a variety of applications. For example, polydioptric domes open new avenues for 3D video development. We have analyzed the properties of the plenoptic derivatives and proposed a feedback algorithm to optimize the recovered motion of the imaging sensor and subsequently the structure of the scene based on this analysis.

Currently, we are developing different physical implementations of polydioptric eyes as described in Section 5, and we will evaluate the proposed plenoptic structure from motion algorithm on a benchmark set of image sequences captured by these new cameras.

Acknowledgment

The support of the National Science Foundation is gratefully acknowledged. Figures 8–9 are courtesy of Vladimir Brajovic, CMU.

References

[1] E. H. Adelson and J. R. Bergen. The plenoptic function and the elements of early vision. In M. Landy and J. A. Movshon, editors, Computational Models of Visual Processing, pages 3-20. MIT Press, Cambridge, MA, 1991.
[2] E. H. Adelson and J. Y. A. Wang. Single lens stereo with a plenoptic camera. IEEE Trans. PAMI, 14:99-106, 1992.
[3] G. Adiv. Inherent ambiguities in recovering 3D motion and structure from a noisy flow field. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 70-77, 1985.
[4] P. Baker, C. Fermüller, and Y. Aloimonos. A spherical eye from multiple cameras (or how to make better models). In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 576-583, 2001.
[5] R. C. Bolles, H. H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An approach to determining structure from motion. Int. Journal of Computer Vision, 1:7-55, 1987.
[6] E. Camahort and D. Fussell. A geometric study of light field representations. Technical Report TR99-35, Department of Computer Sciences, The University of Texas at Austin, 1999.
[7] C. Capurro, F. Panerai, and G. Sandini. Vergence and tracking fusing log-polar images. In Proc. International Conference on Pattern Recognition, 1996.
[8] P. Chang and M. Hebert. Omni-directional structure from motion. In Proceedings of the 2000 IEEE Workshop on Omnidirectional Vision, pages 127-133, June 2000.
[9] K. Daniilidis. On the Error Sensitivity in the Recovery of Object Descriptions. PhD thesis, Department of Informatics, University of Karlsruhe, Germany, 1992. In German.
[10] H. Farid and E. Simoncelli. Range estimation by optical differentiation. Journal of the Optical Society of America, 15, 1998.
[11] J. M. Gluckman and S. K. Nayar. Egomotion with omni cameras. In Proc. International Conference on Computer Vision, 1998.
[12] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen. The lumigraph. In Proc. of ACM SIGGRAPH, 1996.
[13] M. D. Grossberg and S. K. Nayar. A general imaging model and a method for finding its parameters. In Proc. International Conference on Computer Vision, July 2001.
[14] A. D. Jepson and D. J. Heeger. Subspace methods for recovering rigid motion II: Theory. Technical Report RBCV-TR-90-36, University of Toronto, 1990.
[15] M. Levoy and P. Hanrahan. Light field rendering. In Proc. of ACM SIGGRAPH, 1996.
[16] S. J. Maybank. Algorithm for analysing optical flow based on the least-squares method. Image and Vision Computing, 4:38-42, 1986.
