Realtime Motion Segmentation based Multibody Visual SLAM
Abhijit Kundu∗
RRC, IIIT Hyderabad 500032, India
[email protected]
K. Madhava Krishna
RRC, IIIT Hyderabad 500032, India
[email protected]
C. V. Jawahar
CVIT, IIIT Hyderabad 500032, India
[email protected]
ABSTRACT
In this paper, we present a practical vision based Simultaneous Localization and Mapping (SLAM) system for a highly dynamic environment. We adopt a multibody Structure from Motion (SfM) approach, which is the generalization of classical SfM to dynamic scenes with multiple rigidly moving objects. The proposed framework of multibody visual SLAM allows choosing between full 3D reconstruction or simply tracking of the moving objects, which adds flexibility to the system for scenes containing non-rigid objects or objects having insufficient features for reconstruction. The solution demands a motion segmentation framework that can segment feature points belonging to different motions and maintain the segmentation over time. We propose a realtime incremental motion segmentation algorithm for this purpose. The motion segmentation is robust and is capable of segmenting difficult degenerate motions, where a moving object is followed by a moving camera in the same direction. This robustness is attributed to the use of efficient geometric constraints and a probability framework which propagates the uncertainty in the system. The motion segmentation module is tightly coupled with feature tracking and visual SLAM, by exploiting various feedbacks between these modules. The integrated system can simultaneously perform realtime visual SLAM and tracking of multiple moving objects using only a single monocular camera.
1. INTRODUCTION
Both SfM in computer vision and SLAM in mobile robotics address the same problem of estimating sensor motion and the structure of an unknown static environment. The motivation behind vision based SLAM is to estimate the 3D scene structure and camera motion from an image sequence in realtime so as to help guide robots. Vision based SLAM [3, 11, 15, 17] and SfM systems [8] have been the
∗Corresponding author
Figure 1: An illustration of our system. Here the static background is being reconstructed, while the moving persons are being detected and tracked.
subject of much investigation and research. But almost all these approaches assume a static environment, containing only rigid, non-moving objects. Moving objects are treated the same way as outliers and filtered out using robust statistics like RANSAC [5]. Though this may be a feasible solution in less dynamic environments, it soon fails as the environment becomes more and more dynamic. Also, accounting for both the static and the moving objects provides richer information about the environment. A robust solution to the SLAM problem in dynamic environments will expand the potential for robotic applications, especially applications in close proximity to human beings and other robots. As put by [28], robots will be able to work not only for people but also with people.
The solution to the moving object detection and segmentation problem will act as a bridge between static SLAM or SfM and its counterpart for dynamic environments. But motion detection from a freely moving monocular camera is an ill-posed problem and a difficult task. The moving camera causes every pixel to appear moving. The apparent pixel motion of points is a combined effect of the camera motion, independent object motion, scene structure and camera perspective effects. Different views resulting from the camera motion are connected by a number of multiview geometric constraints, which can be used for the motion detection task: points inconsistent with the constraints can be labeled as moving or outliers.
The last decade saw a lot of development in the "multibody" extension [20, 21, 23, 27] to multi-view geometry. These methods are the natural generalization of the classical structure from motion theory [4, 8] to the challenging case of dynamic scenes involving multiple rigid-body motions. Thus, given a set of feature trajectories belonging to different independently moving bodies, multibody SfM estimates the number of moving objects in the scene, clusters the trajectories on the basis of motion, and then estimates the relative camera pose and 3D structure with respect to each body/object. It thus refers to the problem of fitting multiple motion models to the scene, given a set of image feature trajectories.
By multibody visual SLAM, we indicate a realtime version of multibody SfM. The purpose of multibody visual SLAM is to extract as much information from the environment as possible, including information belonging to moving objects. We have taken a more practical point of view, where we choose not to reconstruct all the moving objects. This decision is motivated by the observation that foreground objects are generally small and may move rapidly and non-rigidly, which makes them very difficult to reconstruct fully in 3D. Moreover, certain applications may just need to know the presence of moving objects, rather than their full 3D structure. The proposed framework offers the flexibility of choosing the objects that need to be reconstructed. Objects not chosen for reconstruction are simply tracked. Fig. 1 illustrates such a system, where the static background is chosen for reconstruction, and objects moving independently are detected and tracked over views.
The solution needs an incremental motion segmentation framework which can segment feature points belonging to different motions and maintain the segmentation over time. With every new frame, it needs to verify the existing segmentation and associate new features with one of the moving objects. We propose a realtime incremental motion segmentation algorithm for aiding multibody visual SLAM. The motion segmentation is robust and is capable of segmenting difficult degenerate motions, where a moving object is followed by a moving camera in the same direction. Efficient geometric constraints are used in detecting these degenerate motions. We introduce a probability framework that recursively updates the feature probability and takes into consideration the uncertainty in camera pose estimation. The final system integrates feature tracking, motion segmentation and 3D reconstruction by visual SLAM. We introduce several feedback paths among these modules, which enable them to mutually benefit each other. The integrated system allows simultaneous online 3D reconstruction and tracking of multiple moving objects using only a single monocular camera. A full perspective camera model is used, and we do not make any restrictive assumptions on the camera motion or environment. Unlike many of the existing works, the proposed method is online and incremental in nature and scales to arbitrarily long sequences.
In this paper, we explore in detail the motion segmentation module (Sec. 5) and its interplay with the other modules of feature tracking (Sec. 4) and visual SLAM (Sec. 6). Results of the proposed system are shown in Sec. 7 for scenes involving degenerate motions and varying numbers of moving objects on different datasets. Before that, previous works are detailed in Sec. 2, and Sec. 3 gives an overview of the whole system.
2. RELATED WORKS
The task of moving object detection and segmentation is much easier if a stereo sensor is available, which allows additional constraints to be used for detecting independent motion [1, 2]. However, the problem is very much ill-posed for monocular systems. In realtime monocular visual SLAM systems, moving objects have not yet been dealt with properly. In our literature survey, we have found only three works on visual SLAM in dynamic environments: a work by Sola [26] and two other recent works, [30] and [13]. Sola [26] does an observability analysis of detecting and tracking moving objects with monocular vision. He proposes a BiCamSLAM [26] solution with stereo cameras to bypass the observability issues with mono-vision.
In [30], a 3D object tracker runs in parallel with the monocular camera SLAM [3] for tracking a predefined moving object. This prevents the visual SLAM framework from incorporating moving features lying on that moving object. But the proposed approach does not perform moving object detection, so moving features apart from those lying on the tracked moving object can still corrupt the SLAM estimation. Also, they used a model based tracker, which can only track a previously modeled object with manual initialization.
The work by Migliore et al. [13] maintains two separate filters: a monoSLAM filter [3] with the static features and a bearing-only tracker for the moving features. As concluded by Migliore et al. [13], the main disadvantage of their system is the inability to obtain an accurate estimate of the moving objects in the scene. This is due to the fact that they maintain separate filters for tracking each individual moving feature, without any analysis of the structure of the scene, which, for example, can be obtained by clustering points belonging to the same moving object or performing the same motion. This is also the reason that they are not able to use the occlusion information of the tracked moving object for extending the lifetime of features, as in [30].
The problem of motion detection and segmentation from a moving camera has been a very active research area in the computer vision community. The multiview geometric constraints used for motion detection can be loosely divided into four categories. The first category of methods used for the task of motion detection relies on estimating a global parametric motion model of the background. These methods [10, 19, 29] compensate camera motion by a 2D homography or affine motion model; pixels consistent with the estimated model are assumed to be background, and outliers to the model are defined as moving regions. However, these models are approximations which hold only for certain restricted cases of camera motion and scene structure.
The problems with 2D homography methods led to plane-parallax [9, 22, 31] based constraints. The "planar-parallax" constraints represent the scene structure by a residual displacement field, termed parallax, with respect to a 3D reference plane in the scene. The plane-parallax constraint was designed to detect residual motion as an after-step of 2D homography methods. These methods are designed to detect motion regions when dense correspondences between small-baseline camera motions are available. Also, all the planar-parallax methods are ineffective when the scene cannot be approximated by a plane.
Though the planar-parallax decomposition can be used for egomotion estimation and structure, the traditional multi-view geometry constraints, like the epipolar constraint in two views or the trilinear constraints in three views and their extensions to N views, have proved to be much more effective in scene understanding, as in SfM or visual SLAM. These constraints are well understood and are now textbook material [4, 8].
Most of the multibody motion segmentation research [20, 21, 23, 24, 27] has focused on theoretical and mathematical aspects of the problem. These methods have only been experimented on very short sequences, with few or no outliers and noise-free feature trajectories. Also, the high computation cost, frequent non-convergence of the solutions and highly demanding assumptions have all prevented them from being applied to real-world sequences. Only recently have Ozden et al. [18] discussed some of the practical issues that come up in multibody SfM. Though recent methods [6, 21] are more robust to outliers and noise, we are still far from doing multibody structure from motion in realtime.
3. OVERVIEW
The feature tracking module tracks existing feature points, while new features are instantiated. The purpose of the motion segmentation module is to segment these feature tracks belonging to different motion bodies, and to maintain this segmentation as new frames arrive. In the initialization step, an algebraic multibody motion segmentation algorithm is used to segment the scene into multiple rigidly moving objects. A decision is made as to which objects will undergo the full 3D structure and camera motion estimation. The background object is always chosen to undergo the full 3D reconstruction and camera motion estimation process. Other objects may either undergo full SfM estimation or simply be tracked, depending on their suitability for SfM estimation or the application demand. On the objects chosen for reconstruction, the standard monocular visual SLAM pipeline is used to obtain the 3D structure and camera pose relative to that object. For these objects, we compute a probabilistic likelihood that a feature is moving along with, or moving independently of, that object. These probabilities are recursively updated as the features are tracked. The probabilities also take care of the uncertainty in pose estimation by the visual SLAM module. Features with a low likelihood of fitting one model are either mismatched features arising due to tracking error, or features belonging to some other reconstructed object or to one of the unmodeled independently moving objects. For the unmodeled moving objects, we use spatial proximity and motion coherence to cluster the residual feature tracks into independently moving entities.
The individual modules of feature tracking, motion segmentation and visual SLAM are tightly coupled, and various feedback paths between them are explored, which benefit each other. The motion model of a reconstructed object estimated by the visual SLAM module helps in improving the feature tracking. Relative camera pose estimates from SLAM are used by the motion segmentation module to compute probabilistic model-fitness. The uncertainty in the camera pose estimate is also propagated into this computation, so as to yield robust model-fitness scores. The computation of the 3D structure also helps in setting a tighter bound in the geometric constraints, which results in more accurate independent motion detection. Finally, the results from the motion segmentation are fed back to the visual SLAM module. The motion segmentation prevents independent motion from corrupting the structure and motion estimation by the visual SLAM module. This also ensures fewer outliers in the reconstruction process of a particular object, so fewer RANSAC iterations [5] are needed, resulting in improved speed in the visual SLAM module.
4. FEATURE TRACKING
Feature tracking is an important sub-module that needs to be improved for multibody visual SLAM to take place. Contrary to conventional SLAM, where the features belonging to moving objects are not important, we need to pay extra attention to feature tracking for multibody SLAM. For multibody visual SLAM to take place, we should be able to get feature tracks on the moving bodies as well. This is challenging, as different bodies are moving at different speeds. Also, 3D reconstruction is only possible when there are sufficient feature tracks on a particular body.
In each image, a number of salient features (FAST corners) are detected at different image pyramid levels. Contrary to conventional visual SLAM, new features are added almost every frame. However, only a subset of these, detected on certain keyframes, are made into 3D points. The extra set of tracks helps in detecting independent motion. A patch is generated at each of these feature locations and is matched across images on the basis of zero-mean SSD scores to produce feature tracks. A number of constraints are used to improve the feature matching:
a) Adaptive Search Window: Between a pair of images, features are matched within a fixed distance (window) from their location in one image. The size and shape of this window is decided adaptively, based on the past motion of that particular body. For 3D points whose depth has been computed by the vSLAM module, the 1D epipolar search is reduced to just around the projection of the 3D point on the image under the predicted camera pose.
b) Warp Matrix for Patch: An affine warp is performed on the image patches to maintain view invariance between the patch's first and current observations. If the depth of a patch is unknown, only a rotation warp is made. For the image patches of the 3D points, which have been triangulated, a full affine warp is performed. This process is exactly the same as the patch search procedure in Klein et al. [11].
c) Occlusion Constraint: Motion segmentation gives rough occlusion information, i.e. it says whether some foreground moving object is occluding some other body. This information helps in data association, particularly for features belonging to a background body which are predicted to lie inside the convex hull created from the feature points of a foreground moving object. These occluded features are not associated, and are kept until they emerge out of occlusion.
d) Backward Match and Unicity Constraint: When a match is found, we try to match that feature backward in the original image. Only matches in which each point is the other's strongest match are kept. Enforcing the unicity constraint amounts to keeping only the single strongest of several matches for a single feature in the other image (see the sketch below).
5. MOTION SEGMENTATION
The input to the motion segmentation framework is the feature tracks from the feature tracking module, the camera relative motion in reference to each reconstructed body from the visual SLAM module, and the previous segmentation. The task of the motion segmentation module is that of model selection, so as to assign these feature tracks to one of the reconstructed bodies or to some unmodeled independent motion. Efficient geometric constraints are used to form a probabilistic fitness score for each reconstructed object. With each new frame, existing features are tested for model-fitness and unexplained features are assigned to one of the independently moving objects. Before all this, however, the motion segmentation must be initialized, which is described next.
5.1 Initialization of Motion Segmentation
The initialization routine for motion segmentation and visual SLAM is somewhat different from the rest of the algorithm. We make use of the algebraic two-view multibody motion segmentation algorithm of RAS [21] to segment the input set of feature trajectories into multiple moving objects. The reasons behind the choice of [21] among other algorithms are its direct non-iterative nature and faster computation time. This segmentation provides the system with the choice of motion bodies for reconstruction. For each segment chosen for reconstruction, an initial 3D structure and camera motion is computed via epipolar geometry estimation, as part of the static-scene visual SLAM initialization routine.
5.2 Geometric Constraints
Between any two frames, the camera motion with respect to the reconstructed body is obtained from the visual SLAM module. The geometric constraints are then estimated to detect independent motion with respect to the reconstructed body. So, for the static background, all moving objects should be detected as independent motion. The epipolar constraint is the commonly used constraint that connects two views. The reprojection error, or its first order approximation called the Sampson error, both based on the epipolar constraint, is used throughout the structure and motion estimation by the visual SLAM module. Basically, they measure how far a feature lies from the epipolar line induced by the corresponding feature in the other view. Though these are the gold standard cost functions for 3D reconstruction, they are not good enough for independent motion detection. If a 3D point moves along the epipolar plane formed by the two views, its projection in the image moves along the epipolar line. Thus, in spite of moving independently, it still satisfies the epipolar constraint. This is depicted in Fig. 2. This kind of degenerate motion is quite common in real world scenarios where the camera and an object move in the same direction, e.g. a camera mounted on a car moving along a road, or a camera-mounted robot following behind a moving person. To detect degenerate motion, we make use of the knowledge of camera motion and 3D structure to estimate a bound on the position of the feature along the epipolar line. We describe this as the Flow Vector Bound (FVB) constraint.
5.2.1 Flow Vector Bound (FVB) Constraint
For a general camera motion involving both rotation and translation R, t, the effect of rotation can be compensated by applying a projective transformation to the first image. This is achieved by multiplying feature points in view 1 with the infinite homography $H = KRK^{-1}$ [8]. The resulting feature flow vector, connecting the feature position in view 2 to the rotation compensated feature position in view 1, should lie along the epipolar lines. Now assume that our camera translates by t, and let $p_n$, $p_{n+1}$ be the images of a static point X. Here $p_n$ is normalized as $p_n = (u, v, 1)^T$. Attaching the world frame to the camera center of the first view, the camera matrices for the two views are $K[I|0]$ and $K[I|t]$. Also, if z is the depth
Figure 2: Left: The world point P moves non-degenerately to P′, and hence x′, the image of P′, does not lie on the epipolar line corresponding to x. Right: The point P moves degenerately in the epipolar plane to P′. Hence, despite moving, its image point lies on the epipolar line corresponding to the image of P.
of the scene point X, then the inhomogeneous coordinates of X are $zK^{-1}p_n$. The image of X in the second view is $p_{n+1} = K[I|t]X$. Solving, we get [8]

$$p_{n+1} = p_n + \frac{Kt}{z} \qquad (1)$$
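For completeness, the substitution behind Eq. 1 can be spelled out; this is the standard two-view derivation from [8], restated here:

$$X = \begin{pmatrix} zK^{-1}p_n \\ 1 \end{pmatrix}, \qquad p_{n+1} \simeq K[I\,|\,t]\,X = K\left(zK^{-1}p_n + t\right) = z\,p_n + Kt,$$

and dividing the homogeneous vector by the scale z yields $p_{n+1} = p_n + Kt/z$, which is Eq. 1.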
Equation 1 describes the movement of the feature point in the image. Starting at point $p_n$ in $I_n$, it moves along the line defined by $p_n$ and the epipole $e_{n+1} = Kt$. The extent of movement depends on the translation t and the inverse depth z. From equation 1, if we know the depth z of a scene point, we can predict the position of its image along the epipolar line. In the absence of any depth information, we set a possible bound on the depth of a scene point as viewed from the camera. Let $z_{max}$ and $z_{min}$ be the upper and lower bounds on the possible depth of a scene point. We then find the image displacements along the epipolar line, $d_{min}$ and $d_{max}$, corresponding to $z_{max}$ and $z_{min}$ respectively. If the flow vector of a feature does not lie between $d_{min}$ and $d_{max}$, it is more likely to be the image of an independent motion.

The structure estimation from the visual SLAM module helps in reducing the possible bound on depth. Instead of setting $z_{max}$ to infinity, the known depth of the background enables setting a tighter bound, and thus better detection of degenerate motion. The depth bound is adjusted on the basis of the depth distribution along the particular frustum.
The probability of satisfying the flow vector bound constraint, P(FVB), can be computed as

$$P(FVB) = \frac{1}{1 + \left(\dfrac{FV - d_{mean}}{d_{range}}\right)^{2\beta}} \qquad (2)$$

Here $d_{mean} = \frac{d_{min} + d_{max}}{2}$ and $d_{range} = \frac{d_{max} - d_{min}}{2}$, where $d_{min}$ and $d_{max}$ are the bounds on the image displacements. The distribution function is similar to a Butterworth band-pass filter. P(FVB) has a high value if the feature lies inside the bound given by the FVB constraint, and the probability falls rapidly as the feature moves away from the bound. The larger the value of β, the more rapidly it falls. In our implementation, we use β = 10.
5.3 Independent Motion Probability
In this section we describe a recursive formulation, based on a Bayes filter, to derive the probability of a projected image point of a world point being classified as stationary or dynamic. The relative pose estimation noise and image pixel noise are bundled into a Gaussian probability distribution over the epipolar lines, as derived in [8], denoted by $EL_i = \mathcal{N}(\mu_{l_i}, \Sigma_{l_i})$, where $EL_i$ refers to the set of epipolar lines corresponding to image point i and $\mathcal{N}(\mu_{l_i}, \Sigma_{l_i})$ refers to the standard Gaussian probability distribution over this set.

Let $p_i^n$ be the i-th point in image $I_n$. The probability that $p_i^n$ is classified as stationary is denoted as $P(p_i^n | I_n, I_{n-1}) = P^{n,s}(p_i)$, or $P_i^{n,s}$ in short, where the suffix s signifies static. Then, with a Markov approximation, the recursive probability update of a point being stationary given a set of images can be derived as
$$P(p_i^n | I_{n+1}, I_n, I_{n-1}) = \eta_i^s \, P_i^{n+1,s} \, P_i^{n,s} \qquad (3)$$

Here $\eta_i^s$ is a normalization constant that ensures the probabilities sum to one. The term $P_i^{n,s}$ can be modeled to incorporate the distribution of the epipolar lines $EL_i$. Given an image point $p_i^{n-1}$ in $I_{n-1}$ and its corresponding point $p_i^n$ in $I_n$, the epipolar line that passes through $p_i^n$ is determined as $l_i^n = e_n \times p_i^n$. The probability of the feature point being stationary under the epipolar constraint is defined as

$$P_i^{EP,s} = \frac{1}{\sqrt{2\pi |\Sigma_l|}} \exp\left(-\frac{1}{2} \left(l_i^n - \mu_i^n\right)^T \Sigma_l^{-1} \left(l_i^n - \mu_i^n\right)\right) \qquad (4)$$
However, this does not take into account the misclassification arising due to the degenerate motion explained in the previous sections. To overcome this, the eventual probability is fused as a combination of the epipolar and flow vector bound constraints:

$$P_i^{n,s} = \alpha \cdot P_i^{EP,s} + (1 - \alpha) \cdot P_i^{FVB,s} \qquad (5)$$
where α balances the weight of each constraint. A χ² test is performed to detect whether the epipolar line $l_i^n$ due to the image point satisfies the epipolar constraint. When the epipolar constraint is not satisfied, α takes a value close to 1, rendering the FVB probability inconsequential. As the epipolar line $l_i^n$ begins indicating a strong likelihood of satisfying the epipolar constraint, the FVB constraint is given more importance, which can help detect the degenerate cases.
An analogous set of equations characterizes the probability of an image point being dynamic, which is not delineated here for brevity of space. In our implementation, the envelope of epipolar lines [8] is generated by a set of F matrices distributed around the mean R, t transformation between two frames, as estimated by the visual SLAM module. Hence, a set of epipolar lines corresponding to those matrices is generated and characterized by the sample set $EL_i^{ss} = \left(\hat{l}_i^1, \hat{l}_i^2, \ldots, \hat{l}_i^q\right)$ and the associated probability set $P_{EL} = \left(w_{\hat{l}_i^1}, w_{\hat{l}_i^2}, \ldots, w_{\hat{l}_i^q}\right)$, where each $w_{\hat{l}_i^j}$ is the probability of that line belonging to the sample set $EL_i^{ss}$, computed through the usual Gaussian procedures. Then the probability that an image point $p_i^n$ is static is given by:

$$P_i^{n,s} = \sum_{j=1}^{q} \left[ \alpha_j \cdot P_S^{EP,\hat{l}_i^j}(p_i^n) + (1 - \alpha_j) \cdot P_S^{FVB,\hat{l}_i^j}(p_i^n) \right] \cdot w_{\hat{l}_i^j} \qquad (6)$$

where $P_S^{EP,\hat{l}_i^j}$ and $P_S^{FVB,\hat{l}_i^j}$ are the probabilities of the point being stationary due to the respective constraints with respect to the epipolar line $\hat{l}_i^j$.
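A minimal sketch of how the recursion of Eq. 3 and the fusion of Eq. 5 can be realized per feature is given below. Since the paper leaves the dynamic-hypothesis equations undelineated, the dynamic likelihoods are modeled here, for illustration only, as complements of the static ones.

// Per-feature belief of being static, updated once per frame.
struct FeatureBelief { double pStatic = 0.5; };   // uninformative prior

// pEpStatic and pFvbStatic are the per-frame static-hypothesis likelihoods
// from the epipolar (Eq. 4) and FVB (Eq. 2) constraints; alpha comes from
// the chi-squared test on the epipolar residual, as described in the text.
void updateBelief(FeatureBelief& f, double pEpStatic, double pFvbStatic,
                  double alpha)
{
    // Eq. 5: fused static likelihood.
    const double likStatic = alpha * pEpStatic + (1.0 - alpha) * pFvbStatic;
    // Assumed complement model for the dynamic hypothesis (not in the paper).
    const double likDynamic = alpha * (1.0 - pEpStatic)
                            + (1.0 - alpha) * (1.0 - pFvbStatic);
    // Eq. 3: Markov recursion followed by normalization (the eta term).
    const double s = likStatic * f.pStatic;
    const double d = likDynamic * (1.0 - f.pStatic);
    f.pStatic = s / (s + d);
}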
5.4 Clustering Unmodeled Motions
Features with high probabilities of being dynamic are either outliers or belong to potential moving objects. Since these objects are often small and highly dynamic, they are very hard to reconstruct. So instead we adopt a simple move-in-unison model for them. Spatial proximity and motion coherence are used to cluster these feature tracks into independently moving entities. For motion coherence, we use the heuristic that the variance of the distance between features belonging to the same object should change slowly, in comparison to that between features on different objects. A sketch of this grouping is given below.
6. VISUAL SLAM FRAMEWORK
The monocular visual SLAM framework is that of a standard bundle adjustment visual SLAM [11, 14, 17]. On the objects chosen for reconstruction, a 5-point algorithm [16] with RANSAC is used to estimate the initial epipolar geometry, and subsequent pose is determined with 3-point resection [7]. Some of the frames are selected as keyframes, which are used to triangulate 3D points. The set of 3D points and the corresponding keyframes are then used by the bundle adjustment process to iteratively minimize reprojection error. The bundle adjustment is initially performed over the most recent keyframes, before attempting a global optimization. Our implementation closely follows that of [11, 14]. The system is implemented as multi-threaded processes. While one thread performs tasks like camera pose estimation and keyframe decision and addition, another back-end thread optimizes this estimate by bundle adjustment.
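The two-view bootstrap described above can be sketched with OpenCV (which the system already uses): cv::findEssentialMat implements Nister's 5-point algorithm [16] inside a RANSAC loop. This is a sketch of the pipeline stage under modern OpenCV, not the paper's actual code, and the function name is illustrative.

#include <opencv2/calib3d.hpp>
#include <vector>

// Bootstrap the map of one object from two keyframes: 5-point RANSAC for the
// essential matrix, pose recovery, then triangulation of the initial points.
void bootstrapObject(const std::vector<cv::Point2f>& pts1,
                     const std::vector<cv::Point2f>& pts2,
                     const cv::Mat& K,              // 3x3 intrinsics, CV_64F
                     cv::Mat& R, cv::Mat& t, cv::Mat& points4D)
{
    cv::Mat inliers;
    cv::Mat E = cv::findEssentialMat(pts1, pts2, K, cv::RANSAC,
                                     0.999, 1.0, inliers);
    cv::recoverPose(E, pts1, pts2, K, R, t, inliers);
    // Projection matrices of the two keyframes in the object's frame.
    cv::Mat P1 = K * cv::Mat::eye(3, 4, CV_64F);
    cv::Mat Rt;
    cv::hconcat(R, t, Rt);
    cv::Mat P2 = K * Rt;
    cv::triangulatePoints(P1, P2, pts1, pts2, points4D);  // homogeneous 4xN
}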
6.1 Feedback from Motion Segmentation
The main difference from the existing SLAM methods is the interplay with the motion segmentation module. The motion segmentation prevents independent motion from entering the SfM computation, which could otherwise result in an incorrect initial SfM estimate and lead the bundle adjustment to converge to a local minimum. The feedback results in fewer outliers in the SfM process of a particular object. Thus the SfM estimate is better conditioned and fewer RANSAC iterations are needed [5]. Apart from the improvement in the camera motion estimate, the knowledge of the independent foreground objects coming from motion segmentation helps in the data association of the features which are currently being occluded by such an object. For the foreground independent motions, we form a convex hull around the tracked points clustered as an independently moving entity. Existing 3D points lying inside this region are marked as not visible and are not searched for a match. This prevents 3D features from unnecessary deletion and re-initialization just because they were occluded by an independent motion for some time.
7. EXPERIMENTAL RESULTS
The system has been tested on a number of real image datasets, with varying numbers and types of moving entities. Details of the image sequences used for the experiments are listed in Table 1.
7.1 Moving Box Sequence
This is the same sequence as used in [30]. A previously static box is being moved in front of the camera, which is also moving arbitrarily. However, unlike [30], our method does not
Figure 3: Results from the Moving Box Sequence
use any 3D model, and thus can work for any previously unseen object. As shown in Fig. 3, our algorithm reliably detects the moving object just on the basis of motion constraints. The difficulty with this sequence is that the foreground moving box is nearly white and thus provides very few features. This sequence also highlights the detection of previously static moving objects. Upon detection, 3D map points lying on the moving box are deleted. The convex hull formed on the moving box is shown in red shade. This defines the occlusion mask, and the corresponding actions are taken as described in Sec. 6.1. The left image of Fig. 5 shows the epipolar errors for an instance from this sequence.
Figure 4: Results from the New College Sequence.
7.2 New College Sequence
We tested our results on some dynamic parts of the New College dataset [25]. Only the left image of the stereo pairs has been used. In this sequence, the camera moves along a roughly circular campus path, and three moving persons pass through the scene. The left image in Fig. 6 shows an aerial snapshot of the environment and the camera trajectory. Yellow denotes the part of the trajectory where there is no independently moving body other than the static background. Green, red and blue denote the parts of the trajectory where the 1st, 2nd and 3rd "moving" persons were detected. The part of the trajectory colored black denotes the time when both the 2nd and 3rd moving persons are visible. Fig. 6 also shows a snapshot of the online map of the static background, reconstructed by the visual SLAM framework. Fig. 4 depicts the motion segmentation results for this sequence. Fig. 5 shows an example of degenerate motion detection: the flow vectors on the moving person move almost along the epipolar lines, but they are still detected owing to the FVB constraint. This result verifies the system's performance for arbitrary camera trajectories, degenerate motion detection and changing numbers of moving entities.
7.3 Indoor Lab Sequence
This is an indoor sequence taken with an inexpensive hand-held camera. As the camera moves around, moving persons enter and leave the scene. Fig. 7 shows the results for this sequence. The bottom right picture in Fig. 7 shows how two spatially close independent motions are clustered correctly by the algorithm. This sequence also involves a lot of degenerate motion, as the camera and the persons move in the same direction. The 3D structure estimation of the background helps in setting a tighter bound in the FVB constraint. The depth bound is adjusted on the basis of the depth distribution of the reconstructed background along the particular frustum, as explained in Sec. 5.2.1.
7.4 System Details
The system is implemented as threaded processes in C++. The open source libraries TooN, OpenCV and sparse bundle adjustment (SBA) [12] were used throughout the system. The run-time of the algorithm depends on a lot of factors. The most significant of them are the number of bodies being reconstructed, the total number of independent motions in the scene, the image resolution and the bundle adjustment rules. The system runs in realtime at an average of 25 Hz on a standard laptop when a single body is chosen for reconstruction. The motion segmentation module takes around 10 ms per image at 512x384 resolution with 3 independently moving bodies.
7.5 Discussion
The results verify that the integrated system can simultaneously perform 3D reconstruction, camera pose estimation, and tracking of multiple moving objects using only a single monocular camera, while maintaining realtime performance as listed in Table 1. Also, the algorithm is online (causal) in nature, as opposed to the batch operation prevalent in the multibody SfM literature. The proposed approach also scales to long sequences. We have shown results for degenerate motion (Fig. 5), arbitrary camera trajectory and changing numbers of moving entities. In Fig. 6, we demonstrated the 3D reconstruction and camera pose estimation
Figure 5: Epipolar lines in grey; flow vectors after rotation compensation are shown in orange. Cyan lines show the distance to the epipolar line. Features detected as independently moving are shown as red dots. Note the near-degenerate independent motion in the middle and right images. The use of the FVB constraint enables efficient detection of such degenerate motion.
Figure 6: LEFT: Aerial map and camera trajectory. Non-yellow denotes the parts of the trajectory where a moving person is being detected. RIGHT: The online map of the background; 3D structure points are in green, the white line is the camera trajectory, and the blue dots are the keyframe positions.
with reference to the static background. 3D structure points are in green, the white line is the camera trajectory, and the blue dots are the keyframe positions with respect to the background. The camera trajectory is also highlighted on the aerial map of the test environment.
Figure 7: Results from the Indoor Lab Sequence
Table 1: Details of the datasets

Dataset       Resolution   Length        Runtime
Moving Box    320x240      718 images    30 Hz
New College   512x384      1500 images   18 Hz
Indoor Lab    640x480      1720 images   22 Hz
8. CONCLUSIONS
This paper presents a realtime incremental motion segmentation algorithm that enables a practical multibody visual SLAM system. The framework segments feature points belonging to different motions and maintains this segmentation over time. Multiview geometric constraints were explored to successfully detect various independent motions, including degenerate motions. A probabilistic framework in the form of a recursive Bayes filter was developed that assigns a probability to a feature being stationary or moving based on the geometric constraints. Uncertainty in camera pose estimation is also propagated into this probability estimation. The different modules of motion segmentation, feature tracking and visual SLAM were integrated, and we presented how each module helps the others. The integrated system can simultaneously perform realtime visual SLAM and tracking of multiple moving objects using only a single monocular camera. Experiments on various real image sequences show the efficacy of the method. The work presented here can find immediate applications in various robotics applications involving dynamic scenes.
9. REFERENCES
[1] M. Agrawal, K. Konolige, and L. Iocchi. Real-time detection of independent motion using stereo. In IEEE Workshop on Motion and Video Computing, 2005.
[2] Z. Chen and S. Birchfield. Person following with a mobile robot using binocular feature-based tracking. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007.
[3] A. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(6):1052–1067, 2007.
[4] O. Faugeras, Q. Luong, and T. Papadopoulo. The Geometry of Multiple Images. MIT Press, 2001.
[5] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[6] G. Chen, S. Atev, and G. Lerman. Kernel Spectral Curvature Clustering (KSCC). In ICCV'09 Workshop on Dynamical Vision, 2009.
[7] R. Haralick, C. Lee, K. Ottenberg, and M. Nolle. Review and analysis of solutions of the three point perspective pose estimation problem. International Journal of Computer Vision, 13(3):331–356, 1994.
[8] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2004.
[9] M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 20(6):577–589, 1998.
[10] B. Jung and G. Sukhatme. Real-time motion tracking from a mobile robot. International Journal of Social Robotics, pages 1–16.
[11] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR), 2007.
[12] M. A. Lourakis and A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Trans. Math. Software, 36(1):1–30, 2009.
[13] D. Migliore, R. Rigamonti, D. Marzorati, M. Matteucci, and D. G. Sorrenti. Use a single camera for simultaneous localization and mapping with mobile object tracking in dynamic environments. In ICRA'09 Workshop on Safe Navigation in Open and Dynamic Environments, 2009.
[14] E. Mouragnon, F. Dekeyser, P. Sayd, M. Lhuillier, and M. Dhome. Real time localization and 3D reconstruction. In Computer Vision and Pattern Recognition (CVPR), 2006.
[15] J. Neira, A. Davison, and J. Leonard. Guest editorial, special issue on visual SLAM. IEEE Transactions on Robotics, 24(5):929–931, October 2008.
[16] D. Nister. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 26(6):756–770, 2004.
[17] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry. In Computer Vision and Pattern Recognition (CVPR), 2004.
[18] K. E. Ozden, K. Schindler, and L. V. Gool. Multibody structure-from-motion in practice. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32:1134–1141, 2010.
[19] S. Pundlik and S. Birchfield. Motion segmentation at any speed. In Proceedings of the British Machine Vision Conference (BMVC), 2006.
[20] S. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In Computer Vision and Pattern Recognition (CVPR), 2008.
[21] S. Rao, A. Yang, S. Sastry, and Y. Ma. Robust algebraic segmentation of mixed rigid-body and planar motions from two views. International Journal of Computer Vision (IJCV), 2010.
[22] H. Sawhney, Y. Guo, and R. Kumar. Independent motion detection in 3D scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 22(10):1191–1199, 2000.
[23] K. Schindler and D. Suter. Two-view multibody structure-and-motion with outliers through model selection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(6):983–995, 2006.
[24] K. Schindler, J. U, and H. Wang. Perspective n-view multibody structure-and-motion through model selection. In European Conference on Computer Vision (ECCV), 2006.
[25] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman. The New College vision and laser data set. The International Journal of Robotics Research (IJRR), 28(5):595, 2009.
[26] J. Sola. Towards visual localization, mapping and moving objects tracking by a mobile robot: a geometric and probabilistic approach. PhD thesis, LAAS, Toulouse, 2007.
[27] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view multibody structure from motion. International Journal of Computer Vision (IJCV), 68(1):7–25, 2006.
[28] C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte. Simultaneous localization, mapping and moving object tracking. The International Journal of Robotics Research (IJRR), 26(9):889–916, 2007.
[29] J. Wang and E. Adelson. Layered representation for motion analysis. In Computer Vision and Pattern Recognition (CVPR), 1993.
[30] S. Wangsiripitak and D. Murray. Avoiding moving outliers in visual SLAM by tracking moving objects. In International Conference on Robotics and Automation (ICRA), 2009.
[31] C. Yuan, G. Medioni, J. Kang, and I. Cohen. Detecting motion regions in the presence of a strong parallax from a moving camera by multiview geometric constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(9):1627–1641, 2007.