IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 14, NO. 4, APRIL 2004

Video Scene Segmentation Using Spatial Contours and 3-D Robust Motion Estimation

Theophilos Papadimitriou, Member, IEEE, Konstantinos I. Diamantaras, Member, IEEE, Michael G. Strintzis, Senior Member, IEEE, and Manos Roumeliotis, Member, IEEE

Manuscript received April 11, 2002; revised May 15, 2003. This paper was recommended by Associate Editor R. Lancini.

Th. Papadimitriou is with the Department of International Economic Relations and Development, Democritus University of Thrace, GR-69100 Komotini, Greece.

K. I. Diamantaras is with the Department of Informatics, Technological Education Institute of Thessaloniki, GR-54101 Sindos, Greece.

M. G. Strintzis is with the Information Processing Laboratory, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, GR-54006 Thessaloniki, Greece.

M. Roumeliotis is with the Department of Applied Informatics, University of Macedonia, GR-54006 Thessaloniki, Greece.

Digital Object Identifier 10.1109/TCSVT.2004.825562

Abstract—A novel image sequence segmentation method which combines both spatial and temporal information is presented in this paper. The first step is an intensity segmentation scheme based on the edgeflow method. The temporal information is introduced through three-dimensional (3-D) motion estimation parameters. In the second step, regions obtained from the first step are clustered according to their 3-D motion models. In order to reduce the noise sensitivity of the motion estimation process, we introduce a robust method which produces accurate motion parameters and facilitates the correct clustering that follows. This ensures that rigid objects with luminance discontinuities can be segmented correctly. The method has been successfully tested in real imagery and typical examples are presented in this paper.

Index Terms—Point correspondences, region merging, spatio-temporal segmentation, three-dimensional (3-D) motion estimation.

I. INTRODUCTION

SPATIAL segmentation refers to labeling pixels that are associated with intensity homogeneous regions in an image. Temporal segmentation is the procedure yielding a partition of the image scene according to the motion as presented by local intensity displacement, better known as optical flow. Spatio-temporal segmentation approaches attempt to localize the objects of a scene based on both spatial and temporal information [1], [2]. A successfully completed spatio-temporal segmentation process should result in the partition of the image frame into regions, each representing the projection of a rigid body on the image plane.

The spectrum of spatio-temporal segmentation applications is wide and contains three-dimensional (3-D) object identification, automatic industrial inspection, autonavigation, videophony, video surveillance, and video compression. The accuracy of the spatial and temporal information is critical to the resulting spatio-temporal segmentation. If the regions acquired by the still-image segmentation procedure are imperfect, the estimated motion parameters, carrying the temporal information, would be erroneous.


The pixels clustered in the class of a region with inaccurately traced contours may incorporate error in the step of motion estimation. Temporal information is extracted from two frames of a video sequence using a feature matching technique. Typically, such techniques return a dense or sparse displacement vector field. When a compact motion representation is requested, vector fields are further processed in a two-dimensional or three-dimensional motion estimation algorithm. In general, motion estimation techniques suffer from instability due to quantization error, measurement noise, and outliers in the input dataset [3], [4]. The problem of tracing regions representing moving rigid bodies is nontrivial, containing many sources of error and noise.

In this paper, a novel spatio-temporal segmentation method is presented. The proposed method relies only on the information existing in two frames of the image sequence. No a priori knowledge about the moving objects in the scene is needed, except that the scene involves rigid objects.

In Section II, the major approaches to spatio-temporal segmentation are presented with examples. Section III gives an overview of the proposed method. Section IV is devoted to the spatial segmentation method. The adopted feature extraction and feature matching process is described in Section V. The robust 3-D motion estimation method is presented in detail in Section VI. In Section VII, the region merging and region adjustment are presented. Experimental results of the proposed algorithm are presented in Section VIII. Finally, in Section IX, the paper is summarized.

II. PREVIOUS WORK

The spatio-temporal segmentation techniques can be divided into two major categories. The techniques of the first group try to sequentially extract one object at a time from the image scene. The approaches in the second group start from an oversegmented partition of the image and try to build coherent regions representing moving objects using a merging procedure. The techniques in the first group are better known as top-down, while techniques in the second group are known as bottom-up techniques [5].

The top-down techniques are based on robust statistics [6]. The image pixels and their displacements are introduced into an outlier detection procedure [7], [8]. The dominant motion is estimated in every step of the method. Pixels are registered according to the dominant motion as inliers or outliers. The inlier pixels are well described by the estimated motion parameters. The top-down algorithms generally cluster inliers into image regions reflecting moving bodies of the scene.


The outlier pixels are grouped in order to form the input set of the next iteration. The major disadvantage of top-down approaches is that multiple moving objects in the scene yield datasets with high outlier levels. Robust statistics techniques can be used to detect outliers; however, such methods are usually limited to low outlier levels (less than 50%). As a consequence, the estimated motion parameters are often inaccurate due to imperfect outlier detection, introducing errors. Among the proposed solutions to this problem, we can mention pyramidal image decomposition [9], [10] and robust estimators [11], [12].

The top-down techniques are in general low-complexity algorithms. However, once the biggest objects in the scene are classified, the remaining pixels often do not form consistent datasets. In that case, dominant motion cannot be estimated. As a consequence, the segmentation is imperfect, as it contains unlabeled regions.

The bottom-up methods are based on region merging techniques. Generally, these methods use a merging criterion in order to fuse regions from an oversegmented initial image partition. Pure spatial segmentation techniques have the tendency to classify intensity discontinuities as region contours. A single object would be segmented into many parts when its image projection contains high intensity discontinuities. A similarity criterion is used in order to merge the spatial image parts of the rigid object. Typically, similarity is estimated mainly on the basis of temporal information [13].

Approaches may differ according to the motion model or the similarity criterion chosen. The adopted motion model can be the translational displacement model or the more accurate affine and perspective models [5]. The similarity criterion can be the Euclidean distance of the motion parameters in the motion parameter space [14], [15], or a more complex measure such as the variance of the residual distribution [2]. In [16], an initial luminance segmentation is performed in a watershed form and the maximization of an a posteriori probability yields the final segmentation.

Every motion class estimated in a bottom-up technique is independent of all the others. Moreover, the bottom-up techniques yield fully segmented image scenes, which is very important in video compression schemes. On the other hand, the region merging techniques are computationally expensive and depend critically on the chosen merging criterion.

III. OVERVIEW

The proposed method proceeds first in the spatial domain. The second step proceeds in the temporal domain, using results from the spatial domain. A graphical description of the proposed method is shown in Fig. 1.

A spatial segmentation method based on the edgeflow technique constitutes the first step of the presented method. The second one is a spatial merging technique, aiming at the creation of consistent spatial regions. Once the spatial segmentation is complete, temporal information is collected in a two-step approach including: 1) a feature extraction step and 2) a feature matching step. The feature points are extracted from every region according to their spatial variances. A match for every feature point is searched in the second frame.

Fig. 1. Graphical overview of the proposed spatio-temporal segmentation method.

The displacement vectors are calculated using the image coordinates of the feature points in both images. For every region, the displacement vector set is inserted into a 3-D motion estimation algorithm yielding motion parameters. Complex motion is described by a translation and a rotation vector, while the simpler motion of planar surfaces is described by a displacement vector. The combination of spatial and temporal information is performed in a merging procedure, where neighboring spatially coherent regions with similar motion are grouped together.

IV. SPATIAL SEGMENTATION

Typically, spatial segmentation is performed using edge detection or region growing. In edge-detection techniques, intensity discontinuities are searched for within the image frame. Each discontinuity is considered as part of a region contour. In region-growing techniques, pixels forming a region are grouped sequentially. Starting from a pixel seed, every neighboring pixel is examined under an intensity criterion: pixels satisfying the criterion are included in the region set and grouped. The major problem in edge-detection techniques is that intensity discontinuities do not always correspond to region boundaries. Small intensity variances may yield edges in the output without really corresponding to region boundaries. Moreover, when a boundary is correctly detected, intensity along the contours is not stable, and therefore the estimated boundary may have discontinuities.


The region-growing techniques are computationally expensive and depend on the choice of the seed pixels. However, the segments generated by region growing are in general more accurate and stable.

The chosen spatial segmentation approach is a combination of an edge-detection algorithm and a spatial merging technique. This step aims at partitioning the image scene into spatially coherent regions. Boundary detection is performed using the edgeflow technique of Ma and Manjunath [17]. The technique localizes boundaries using the direction of change in the color, texture, and luminance fields at a given resolution scale. Edgeflow is one of the best-known spatial segmentation techniques. The main advantage of the method is its use of various sources of spatial information, such as texture, color, and luminance, in seeking precise results. In order to find consistent regions even from projections of low-texture objects, the image resolution factor in our experiments is set at the finest level. The output of such a scheme is an oversegmented scene. The true contours are traced correctly; however, every luminance discontinuity is traced as well, yielding erroneous segmentation. The problem is solved with a region merging technique based on intensity information. The mean value and the standard deviation of all pixel values within every region are calculated, and neighboring regions with similar statistical luminance properties are merged. The final output of the second step is the segmentation of the image scene into a set of spatially coherent regions with consistent contours.
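As a concrete illustration of this merging rule, the following Python sketch fuses neighboring regions whose luminance mean and standard deviation are close, assuming an integer label map from the oversegmentation step. The function name, the single-pass union-find strategy, and the default thresholds (borrowed from the table-tennis example in Section VIII) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def merge_similar_regions(labels, image, mean_thresh=22.0, std_thresh=22.0):
    """Fuse 4-connected neighboring regions with similar luminance statistics.

    labels : integer label map produced by the oversegmentation step
    image  : grayscale frame as a 2-D float array
    """
    ids = np.unique(labels)
    mean = {int(i): image[labels == i].mean() for i in ids}
    std = {int(i): image[labels == i].std() for i in ids}

    # Pairs of labels that touch horizontally or vertically.
    pairs = set()
    for a, b in ((labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])):
        edge = a != b
        pairs.update(zip(a[edge].tolist(), b[edge].tolist()))

    # Single-pass union-find: statistics are not recomputed after each merge,
    # which is a simplification of the merging described above.
    parent = {int(i): int(i) for i in ids}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in pairs:
        if (abs(mean[i] - mean[j]) < mean_thresh
                and abs(std[i] - std[j]) < std_thresh):
            parent[find(i)] = find(j)

    return np.vectorize(find)(labels)
```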

V. 2-D MOTION VECTOR ESTIMATION

In the first step of the algorithm, the spatial information is described by the intensity similarity. In the second step, the temporal information is estimated. In general, temporal information extracted from an image sequence is well described by a vector field. Each vector represents the projected displacement of a scene point between two consecutive moments. In reality, the pixel displacement does not always correspond to real motion, but can be created by illumination instability. Therefore, the vector field is better known as optical flow. The determination of such a field is an ill-posed problem, and the solution is rarely unique and stable. The problem can be treated as a global minimization problem in order to achieve a dense vector field [18]–[20], or as an image feature matching problem seeking point correspondences [21]–[24] or line correspondences [25], [26], generating a sparse vector field. Recent work by Chen et al. [27] treated the estimation of the optical flow using wavelet theory.

The goal of dense vector field methods is to find correspondences for every image pixel. However, the estimation of such a vector field is not easy, due to the nature of the minimization function, which contains many local minima. As a consequence, the minimization process may not yield the optimum solution. Sparse vector field methods are based on the extraction of selected image features (points or lines) from the scene. Imperfect feature extraction results in wrong correspondences and corrupted datasets, which greatly affect the final motion estimation. Various adjustments of the problem are made using smoothness constraints [28], [29], but the proposed solutions are far from perfect.

Fig. 2. Checking order of the pixels within a square region. First, dark gray pixels labeled 1 are examined. Then pixels with the label 2 are searched, and finally white pixels are examined when the algorithm cannot extract a feature set in the previous steps.

Since motion must be calculated for every region traced in the previous step, a feature matching process is used to calculate the temporal information in this approach. In the first step, a search procedure finds the best pixel candidates of every region for feature matching. This procedure is based on the intensity variance in the neighborhood of the pixels. The second step is the tracking procedure, where each pixel extracted from the first image frame is located in the second one.

The variance is used in order to estimate the homogeneity of the block centered on the examined pixel. A less homogeneous block means that the central pixel is easily distinguished from the others, i.e., it is a better feature. For a candidate pixel $p$, the variance of its block is calculated using the following formula:

$$\sigma^2(p) = \frac{1}{N_B} \sum_{q \in B(p)} \left( I(q) - \bar{I}_{B(p)} \right)^2 \qquad (1)$$

where $I(q)$ is the luminance value at pixel $q$, $\bar{I}_{B(p)}$ is the mean luminance within the block centered on $p$, $N_B$ is the number of elements in the block, and $B(p)$ is the block of pixels centered on $p$. A set of $K$ features is searched for every region. Assuming that the best features are the ones with the higher block variance, all of the candidates are sorted according to their variance. The first $K$ pixels of the sorted list form the feature set. The feature candidates are taken from a coarse regular grid. If a complete set of features is extracted, the algorithm proceeds to the feature matching step. Otherwise, more candidates are searched on a finer regular grid. Whenever enough features are collected, the feature matching technique starts. A graphic presentation of the candidate searching procedure can be found in Fig. 2. In the examples presented in Section VIII, a feature set contains from 50 to 200 features.

This searching procedure aims at finding features from all over the examined region. The features will be searched for within the second frame in order to perform motion extraction. The more dispersed the features are, the better the description of the motion obtained from the motion extraction procedure. In cases where a set of nonzero-variance pixels cannot be extracted from an image region, it is assumed that the region is highly homogeneous. Feature matching in these regions is difficult and nonunique: they do not contain enough spatial information, and the temporal information calculated risks being inaccurate and erroneous. Therefore, these regions are eliminated from the rest of the procedure. This test does not affect the procedure, since these highly homogeneous regions are in general very small.
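A minimal Python sketch of this feature selection follows, assuming grayscale frames stored as NumPy arrays; the coarse-to-fine search of Fig. 2 is reduced here to a simple halving of the grid step, and all names and default values are illustrative rather than the authors' code.

```python
import numpy as np

def select_features(region_mask, image, n_features=70, block=7, grid_step=8):
    """Return up to n_features (y, x) positions inside region_mask whose
    surrounding block has the highest luminance variance, Eq. (1)."""
    h, w = image.shape
    r = block // 2
    step = grid_step
    while True:
        # Candidates on a regular grid, kept away from the image border.
        ys, xs = np.mgrid[r:h - r:step, r:w - r:step]
        ys, xs = ys.ravel(), xs.ravel()
        inside = region_mask[ys, xs]
        ys, xs = ys[inside], xs[inside]
        if len(ys) >= n_features or step == 1:
            break
        step = max(1, step // 2)   # refine the grid, as in Fig. 2

    # Block variance of every candidate.
    var = np.array([image[y - r:y + r + 1, x - r:x + r + 1].var()
                    for y, x in zip(ys, xs)])
    keep = var > 0                 # zero-variance blocks are useless features
    order = np.argsort(var[keep])[::-1][:n_features]
    return list(zip(ys[keep][order].tolist(), xs[keep][order].tolist()))
```

If a region yields no nonzero-variance candidate at all, the returned list is empty, which corresponds to the elimination of highly homogeneous regions described above.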

Various feature matching approaches exist. The proposed one is based on block matching and appears to be simple and efficient. For every feature $p$, the displaced block difference (DBD) is calculated within a search window as follows:

$$\mathrm{DBD}(p, \mathbf{d}) = \sum_{q \in B(p)} \left| I_t(q) - I_{t+1}(q + \mathbf{d}) \right| \qquad (2)$$

where $\mathbf{d}$ is the displacement of the pixel $p$ and $I_t(q)$ is the intensity of pixel $q$ in frame $t$. The displacement yielding the lowest DBD is assumed to be the right 2-D vector for pixel $p$. If subpixel accuracy is sought, then the luminance at a noninteger position $(x, y)$ is calculated according to the mean luminance of the corresponding neighboring pixels:

$$I(x, y) = \frac{1}{4} \big[ I(\lfloor x \rfloor, \lfloor y \rfloor) + I(\lfloor x \rfloor + 1, \lfloor y \rfloor) + I(\lfloor x \rfloor, \lfloor y \rfloor + 1) + I(\lfloor x \rfloor + 1, \lfloor y \rfloor + 1) \big] \qquad (3)$$

where the operator $\lfloor \cdot \rfloor$ denotes the closest smaller integer.

In order to achieve better performance of the proposed feature matching technique, the search window size and the block size may vary according to the image size.
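The integer-pixel core of this block matching can be sketched in a few lines of Python; the subpixel refinement of (3) is omitted, and the feature is assumed to lie at least `block // 2` pixels away from the image border, so the bounds check below merely skips invalid displaced positions.

```python
import numpy as np

def match_feature(frame1, frame2, y, x, block=7, search=8):
    """Return the displacement (dy, dx) minimizing the DBD of Eq. (2)
    over a (2*search+1) x (2*search+1) window."""
    h, w = frame2.shape
    r = block // 2
    ref = frame1[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
    best, best_d = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy - r < 0 or xx - r < 0 or yy + r >= h or xx + r >= w:
                continue               # displaced block falls outside the frame
            cand = frame2[yy - r:yy + r + 1, xx - r:xx + r + 1].astype(np.float64)
            dbd = np.abs(ref - cand).sum()
            if dbd < best:
                best, best_d = dbd, (dy, dx)
    return best_d
```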

VI. RIGID BODY 3-D MOTION ESTIMATION

Once all of the pixels and their displacements are estimated, the algorithm proceeds to the rigid body 3-D motion estimation. Still-image segmentation does not contain the necessary information for moving object extraction: an object with large luminance discontinuities would be segmented into more than one region. Therefore, the temporal information is needed to increase segmentation accuracy. The displacement vectors may be used for this purpose, but in general a more refined motion model is used, such as the six-parameter affine model.

A. Problem Formulation

Let us define a 3-D coordinate system so that the $Z$ axis is parallel to the camera optical axis and the $X$ and $Y$ axes coincide with the $x$ and $y$ axes of the image plane. The motion of a point $\mathbf{P} = (X, Y, Z)^{T}$ in 3-D space is described by the equation $\mathbf{P}' = \mathbf{R}\mathbf{P} + \mathbf{T}$, where the orthogonal matrix $\mathbf{R}$ describes the rotation and the vector $\mathbf{T} = (T_x, T_y, T_z)^{T}$ describes the translation of $\mathbf{P}$.

We assume that the camera geometry is described by perspective projection with focal length $f$. The perspective projection of a point on the image plane at the first camera position is given by the expressions

$$x = f\frac{X}{Z}, \qquad y = f\frac{Y}{Z} \qquad (4)$$

while the projection of the point on the second frame is described by

$$x' = f\frac{X'}{Z'}, \qquad y' = f\frac{Y'}{Z'} \qquad (5)$$

We also assume sufficiently small rotation and translation parameters, so that the 2-D displacements $u = x' - x$ and $v = y' - y$ of the projected points can be approximated as [30]

$$u \approx \frac{f T_x - x T_z}{Z} - \frac{xy}{f}\,\omega_x + \left(f + \frac{x^2}{f}\right)\omega_y - y\,\omega_z \qquad (6)$$

$$v \approx \frac{f T_y - y T_z}{Z} - \left(f + \frac{y^2}{f}\right)\omega_x + \frac{xy}{f}\,\omega_y + x\,\omega_z \qquad (7)$$

Eliminating the depth $Z$ from (6) and (7) leads to the following linear homogeneous system [31]:

$$\mathbf{A}\mathbf{q} = \mathbf{0} \qquad (8)$$

where $\mathbf{q}$ is the nine-dimensional parameter vector defined in (9).


The vector $\mathbf{q}$ involves all six free parameters of the problem, i.e., the three translation and the three rotation parameters. In the noise-free case, the solution hinges on the SVD analysis of the system matrix $\mathbf{A}$ in (10), which is composed of the measurement data $(u_i, v_i)$ and the feature point coordinates $(x_i, y_i)$.

The general case, where the projected object undergoes both translation and rotation, is presented here. The interested reader may refer to [3] for a detailed presentation of the motion estimation method.

It can be shown that $\operatorname{rank}(\mathbf{A}) \leq 8$. Consider one observation (row) of the matrix system (8): $\mathbf{a}_i^{T}\mathbf{q} = 0$. The same equation is true for every observation in matrix $\mathbf{A}$; thus, the columns of $\mathbf{A}$ are linearly dependent and $\operatorname{rank}(\mathbf{A}) \leq 8$. The SVD analysis of $\mathbf{A}$ is the tool for estimating $\mathbf{q}$: the requested $\mathbf{q}$ is associated with the null space of $\mathbf{A}$, i.e., it is proportional to the singular vector of $\mathbf{A}$ corresponding to its least significant singular value [31].
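In practice, this noise-free estimation of $\mathbf{q}$ reduces to a single SVD call. A minimal sketch, assuming the $N \times 9$ system matrix $\mathbf{A}$ has already been assembled from the point correspondences:

```python
import numpy as np

def estimate_q(A):
    """Estimate the nine-parameter vector q of the homogeneous system
    A q = 0 (A is N x 9, one row per point correspondence)."""
    _, s, Vt = np.linalg.svd(A)
    # The singular vector associated with the smallest singular value
    # spans the (approximate) null space of A; q is defined up to scale.
    return Vt[-1], s
```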

B. Motion Parameters Estimation

The calculation of the motion, described by the translation vector $\mathbf{T}$ and the rotation vector $\boldsymbol{\omega}$, has two steps.

Step 1) We find $\mathbf{q}$ from the SVD analysis of matrix $\mathbf{A}$. The translation $\mathbf{T}$ can be directly estimated from $\mathbf{q}$. Recall that $\mathbf{q}$ involves all six motion parameters and that it is recovered only up to an unknown scale factor.

Step 2) The rotation vector $\boldsymbol{\omega}$ is estimated using the estimate of $\mathbf{T}$ obtained in Step 1). Equation (8) is transformed into a linear system in the rotation parameters, given in (11).

The above system may be solved using the total least squares technique, as presented in [32], because noise is incorporated both in the data and in the observations.

As with most 3-D motion estimation methods, the solution depends heavily on the quality of the measurements. When the output of the feature matching process contains noise and outliers, the method above will give very poor results. In this section, a recursive robust scheme is introduced to reject outliers from the dataset based on the residual information.

C. Weighted Estimation

The contribution of robust statistics, as presented in the early 1980s by Huber [6], to the solution of the instability problem is of major importance. Scientists discovered in robust statistics a powerful tool to overcome machine vision problems. The combination of robust statistics and computer vision has been applied to optic flow estimation [33], motion segmentation [34], 3-D motion estimation [4], and 3-D object recognition [35].

The basic idea in robust statistics is the use of the residual information in a weighted least squares system. Each measurement contributes to the estimation process according to its weight, which reflects the confidence of the system in this measurement. Recursively, inconsistent data (outliers) receive decreasing confidence, whereas data consistent with the underlying model retain or increase their confidence. Eventually, the data with the highest confidence contribute most to the estimation of the hidden parameters.

We introduce a different weighting on each of the rows of $\mathbf{A}$: we call $\mathbf{W}$ a diagonal weight matrix, where the diagonal entry $w_i \in [0, 1]$ reflects our confidence in the $i$th measurement—0 means no confidence, and 1 means perfect confidence. Then the linear system to solve is [3]

$$\mathbf{W}\mathbf{A}\mathbf{q} = \mathbf{0} \qquad (12)$$

Our goal is to determine $\mathbf{q}$ and $\mathbf{W}$. This is done recursively. First, we start with an estimate of $\mathbf{W}$, which leads to an estimate of $\mathbf{q}$ from (12). Then the residual error of our estimate leads to a new weight matrix and a new estimate of $\mathbf{q}$, and so on. Since we have no prior knowledge of the quality of our measurements, we initially set all the weights equal to 1, so $\mathbf{W} = \mathbf{I}$. An estimate of $\mathbf{q}$ is made in every iteration using the motion estimation algorithm described above. The residuals of the estimation are used to calculate the weights for the next iteration, using the weight function of (13) from [36].

Let us consider the "hat" matrix of the linear system (12):

$$\mathbf{H} = (\mathbf{W}\mathbf{A})(\mathbf{W}\mathbf{A})^{\dagger} \qquad (14)$$

where the superposed $\dagger$ denotes the pseudoinverse. It can be verified that $\mathbf{H}$ is idempotent and symmetric ($\mathbf{H}^{2} = \mathbf{H}$, $\mathbf{H}^{T} = \mathbf{H}$). Moreover, if $h_{ii}$ is the $i$th diagonal element of matrix $\mathbf{H}$, then we can establish that if $h_{ii} \to 1$ then the residual $r_i \to 0$. On the other hand, $h_{ii} \to 0$ means that $r_i$ may take large values. Taking advantage of this property, we can calculate a form of the studentized residuals as follows:

$$r_i^{*} = \frac{r_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} \qquad (15)$$

where $\hat{\sigma}$ is an estimate of the residual scale.

Using the studentized residuals in (13), we exploit the properties of the "hat" matrix in the estimation process, yielding faster convergence of the iterative loop. The properties of matrix $\mathbf{H}$ are given in detail in [11].
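A compact Python sketch of this weighted recursion, combining (12), (14), and (15), is given below. Since the exact weight function (13) of [36] is not reproduced in the text, a standard Huber-type weight on the studentized residuals stands in for it; the iteration count, the tuning constant, and the robust scale estimate are illustrative assumptions.

```python
import numpy as np

def reweighted_motion_estimate(A, n_iter=10, c=1.345):
    """Iteratively reweighted estimation of q from W A q ~ 0, Eq. (12)."""
    n = A.shape[0]
    w = np.ones(n)                     # no prior knowledge: W = I
    for _ in range(n_iter):
        WA = A * w[:, None]            # apply the diagonal weights to the rows
        _, _, Vt = np.linalg.svd(WA)
        q = Vt[-1]                     # null-space estimate of q
        r = A @ q                      # residuals of the current estimate

        # Hat matrix of Eq. (14) and studentized residuals of Eq. (15).
        H = WA @ np.linalg.pinv(WA)
        h = np.clip(np.diag(H), 0.0, 1.0 - 1e-9)
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12   # robust sigma estimate
        t = r / (scale * np.sqrt(1.0 - h))

        # Huber-type stand-in for the weight function (13): inliers keep
        # weight 1, outliers are progressively down-weighted.
        w = np.where(np.abs(t) <= c, 1.0, c / np.abs(t))
    return q, w
```

The returned weights can then be thresholded (e.g., against the 0.8–0.95 range quoted below) to select the points used for the final SVD estimate.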

Once the rank of $\mathbf{A}$ is less than nine, we can assume that the depth is constant or that the rotation is negligible, as shown in [31].


Fig. 3. Algorithm for the robust motion segmentation.

Since the motion between two consecutive frames of an image sequence is quite small, the mean displacement of all the pixels in the examined region can give the necessary motion information for the region clustering in those cases.

In the opposite case, where the linear system has full rank, the SVD algorithm needs a dataset of at least nine points to estimate a solution. The robust algorithm ends when a number of weights (larger than nine) remain above a threshold; the corresponding points are used for the SVD estimation. In our experiments, we searched for 12–15 points, while the threshold was between 0.8 and 0.95.

D. Robust Estimation

The above improvement of the original algorithm is significantly more robust, but it can be improved further [37]. The idea is to generate a number of smaller submatrices by randomly selecting rows from $\mathbf{A}$. For each submatrix, we use the same approach outlined above. Using more submatrices, we increase the probability that two of them contain a sufficiently low percentage of outliers, so that the algorithm will yield estimates close to the true $\mathbf{q}$.

In order to take advantage of this idea, the algorithm must be able to judge the quality of the estimate $\mathbf{q}_k$ for each submatrix $\mathbf{A}_k$. We achieve this by making use of two properties inherent to the data model. First, let $\sigma_1 \geq \dots \geq \sigma_9$ be the singular values of $\mathbf{A}_k$ sorted in decreasing order. In the noiseless case, it can be shown [31] that $\sigma_9 = 0$. In the noisy case this is no longer true, but we may assume that the ratio between the smallest and the largest singular value should remain very small: $\sigma_9 / \sigma_1 \approx 0$. Second, note that good estimates are those that cluster in a small region around the true $\mathbf{q}$, whereas bad estimates tend to disperse in $\mathbb{R}^9$. Therefore, if two estimates from two different submatrices lie close to each other, it is more likely that they lie close to the true value than that they lie in some random neighborhood of it.

Based on the above discussion, our proposed algorithm aims at finding two submatrices $\mathbf{A}_i$ and $\mathbf{A}_j$ such that: 1) $\sigma_9 / \sigma_1 \approx 0$ for both submatrices and 2) the corresponding estimates lie close to each other. Then our estimate is the mean of $\mathbf{q}_i$ and $\mathbf{q}_j$. The complete algorithm is presented in Fig. 3.
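The submatrix scheme of Fig. 3 might be sketched as follows; the number of submatrices drawn and the two acceptance thresholds are illustrative assumptions (the row count follows the 12–15 points mentioned above), so this is a sketch of the idea rather than the authors' exact algorithm.

```python
import numpy as np

def robust_q(A, n_sub=20, rows=15, ratio_max=1e-2, agree_min=0.99, seed=None):
    """Draw random row subsets of A, keep estimates whose sigma_9/sigma_1
    ratio is tiny, and return the mean of the first agreeing pair."""
    rng = np.random.default_rng(seed)
    kept = []
    for _ in range(n_sub):
        idx = rng.choice(A.shape[0], size=rows, replace=False)
        _, s, Vt = np.linalg.svd(A[idx])
        if s[-1] / s[0] >= ratio_max:       # condition 1 failed: too noisy
            continue
        q = Vt[-1]
        for qk in kept:                     # condition 2: two estimates agree
            if abs(q @ qk) > agree_min:     # sign-invariant closeness test
                q = q if q @ qk > 0 else -q
                m = q + qk
                return m / np.linalg.norm(m)
        kept.append(q)
    return None                             # no agreeing pair was found
```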

VII. REGION MERGING AND REGION ADJUSTMENT

So far, the spatially coherent regions have been traced and their motion information has been calculated. In this section, the motion information is used in a region merging criterion in order to fuse neighboring regions with different spatial intensity but similar motion parameters.

The region merging is based on the motion model and the motion parameters of each region. First, the regions are clustered into two major classes according to their motion model. Regions whose system matrix has full rank are supposed to have 3-D motion, while regions exhibiting the rank deficiency identified in Step 1) are assumed to have 2-D motion.

In the first case, the inner product of the $\mathbf{q}$'s is used to cluster regions: if $|\mathbf{q}_i^{T}\mathbf{q}_j| > \Theta$, where $\Theta$ is a predefined threshold, the regions are merged. In our experiments, $\Theta$ was set between 0.7 and 0.8 (see Section VIII).

In the second case, regions are merged according to their Euclidean distance: if the distance between the two displacement vectors satisfies $\|\mathbf{d}_i - \mathbf{d}_j\| < \Theta_d$, the regions are merged. In our experiments, $\Theta_d = 1.5$.

A closing procedure concludes the proposed approach, removing artifacts due to small undetermined regions.
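The two merging tests can be condensed into a small helper; the region representation (a dict holding either a unit nine-parameter vector 'q' or a 2-D displacement 'd') is hypothetical, and the default thresholds are the ones used in the table-tennis example of Section VIII.

```python
import numpy as np

def same_motion(ra, rb, theta_q=0.7, theta_d=1.5):
    """Decide whether two neighboring regions should be merged."""
    if 'q' in ra and 'q' in rb:        # both 3-D: inner product of unit vectors
        return abs(np.dot(ra['q'], rb['q'])) > theta_q
    if 'd' in ra and 'd' in rb:        # both 2-D: Euclidean distance
        return np.linalg.norm(np.subtract(ra['d'], rb['d'])) < theta_d
    return False                       # different motion models: never merged
```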

VIII. RESULTS

The performance of the algorithm was tested on several test sequences. In this section, results on the "table-tennis" sequence, the "MOVI" sequence, the "flower-garden" sequence, and an outdoor image sequence are presented. In Subsection VIII-E, the usefulness of the proposed method in a region tracking procedure is investigated.

A. Table-Tennis Sequence

The table-tennis image sequence presents a hand holding a table-tennis racket while a white ball rebounds on the surface of the racket. There is a slight translational camera egomotion. The original image size is 720 × 480 pixels (Fig. 4); however, the resolution was reduced to 360 × 240 pixels for computational reasons.

The spatial segmentation was performed using the edgeflow method, as mentioned. The scale factor was set to 2, seeking oversegmentation. The method resulted in the partition of the scene into 203 regions, as shown in Fig. 4.

In order to decrease the number of regions, spatial merging is performed using luminance statistical properties. The mean value and the standard deviation of every region's luminance are estimated, and neighboring regions with similar luminance properties are merged. In the presented example, the absolute mean difference and the absolute standard deviation difference were restricted to 22. The output of this procedure was a partition of the image into 20 regions, as shown in Fig. 4.

The feature matching procedure searches for 70 pixels with nonzero variances for every region. The raw motion information is extracted from the video sequence using the outlined feature matching process, and is then refined into the affine six-parameter model using the proposed robust motion-estimation approach. The regions with more than 65% identical motion vectors are considered to have 2-D motion; in the opposite case, 3-D motion parameters are estimated. After the 3-D motion estimation process, every spatial region is associated with a nine-parameter motion vector. The motion merging step then starts in order to create the final spatio-temporal coherent regions.


Fig. 4. The original fifth frame of the table-tennis sequence is presented in the left image. In the center image, we can see the output of the edgeflow technique: the image scene is segmented into 203 regions. In the right frame, the image partition after spatial merging is presented: the image scene is segmented into 20 luminance coherent regions.

Fig. 5. Final image partition. The image scene is segmented into three regions: the image background, the white ball, and the arm. In the top row, we present the white ball and the background region. In the bottom, we can see the arm region before and after the closing procedure. The small artifacts in the fingers are eliminated.

The regions with 3-D motion parameters are grouped together when the inner product of their motion vectors is at least 0.7, while the 2-D motion regions are merged when the distance of their displacement vectors is less than 1.5. A final closing procedure is performed on the regions of interest, eliminating the small unlabeled regions. Fig. 5 shows the final segmentation.

The ball region can be easily distinguished due to its large luminance difference from the background. On the other hand, the background and the arm regions are created using the motion parameters. The background region contains seven spatially coherent regions, while the arm is created by grouping eight regions. There are four small regions that the closing process eliminates. The presented segmentation is consistent with one identified by a human operator (Fig. 5).

An important criterion for evaluating the performance of a segmentation algorithm is stability over time. A method can have high-accuracy results over a selected couple of frames, while its accuracy may decrease as the sequence evolves. In order to demonstrate the stability of the proposed method over time, the extracted arm region in 15 consecutive frames of the table-tennis sequence is presented (Fig. 6). It is important to point out that the extraction of the arm region is the most challenging task in the segmentation of the selected sequence. The proposed method outputs a scene segmentation of high accuracy over the sequence evolution.

B. MOVI Sequence

The MOVI sequence, downloaded from the IRISA laboratory Internet site [38], contains several nonmoving rigid objects while the camera performs a rotation around the scene (Fig. 7). The original image scene is 512 × 512 pixels; however, the resolution was reduced to 256 × 256 pixels for computational reasons.

The edgeflow method was used in order to perform the initial spatial segmentation. The scale factor was set to 6.



Fig. 6. Spatio-temporal segmentation of 15 consecutive frames of the table-tennis sequence using the proposed method.

Fig. 7. The original fifth frame of the MOVI sequence is presented in the left image. The output of the edgeflow technique is presented in the middle image. Once the spatial merging technique is performed, we derive the right image, where the image scene is cut into 12 intensity coherent regions.

The image scene was cut into 54 spatially coherent regions. The spatial segmentation targets an oversegmentation of the image scene, in order to trace object contours accurately. However, the number of regions can be decreased without losing contour accuracy by performing a spatial merging technique such as the one described in Section IV. The spatial merging parameter was set to 30. The output of both the oversegmentation and the segmentation after spatial merging can be found in Fig. 7.

In the selected image sequence, the extraction of the background is trivial, due to the high contrast between the objects and the image scene. However, most objects are cut into several pieces. Therefore, further processing of the scene is necessary in order to achieve accurate scene segmentation. The feature extraction/feature matching technique extracts the motion information, while the motion-estimation module refines the raw motion information into a single motion vector for every region. The motion vectors are then used to merge pieces describing a single image object.

In the feature matching procedure, 120 feature candidates are searched for every region. The raw motion information is introduced into the motion estimation method, yielding a six-parameter motion vector. As was mentioned, the motion in the sequence is caused by the apparent rotation of the camera. As a consequence, the motion of every object is better described by the nine-parameter 3-D motion vector [see (9)]. The inner product of motion vectors is used in order to judge motion similarity and merge image regions. The inner product threshold selected in this example was set to 0.8. The final segmentation after a closing procedure can be seen in Fig. 8.

Segmenting an image scene when the camera undergoes rotation is a demanding task: the camera egomotion gives a different trajectory to every object of the scene. The correct spatio-temporal segmentation is achieved in the motion merging step, where the spatially coherent regions are merged according to their motion vectors. Thus, in such image sequences, correct segmentation needs accurate motion estimation. In the present example, the correct motion vectors are estimated, yielding accurate spatio-temporal segmentation results, as can be witnessed in Fig. 8.

C. Flower-Garden Sequence

The flower-garden sequence presents a tree trunk in front of a garden, taken by a camera undergoing translation along the horizontal image axis. The original image size is 720 × 480 pixels.


Fig. 8. Spatio-temporal image segmentation. Five important objects can be extracted from the image scene: the house, the mug, the glass, the piece of paper, and the toy car.


Fig. 9. (a) The original 14th frame of the flower-garden sequence is shown on the left side. (b) The spatial segmentation output is presented in the right image.

The selected images were subsampled to 300 × 200 pixels for computational reasons. The segmentation process aims at separating the tree region from the moving background (Fig. 9).

The rich texture of the image scene renders a typical spatial segmentation method inadequate. Instead, a more complex spatial segmentation method that takes texture into account is more suitable.


Fig. 10. Final segmentation of the spatio-temporal method. The tree trunk can be easily separated from the background.

Fig. 11. (a) Twenty-seventh original frame of the outdoor sequence. (b) The edgeflow output: the image is cut into 95 regions. (c) The spatial merging decreases the regions to four.

Edgeflow is such a method, and it is used in order to create the initial spatial segmentation. The scale factor of the edgeflow technique was set to 8, and the image scene was cut into 84 regions, as shown in Fig. 9.

Considering the spatial segmentation results, it was decided not to perform a spatial merging step as in the previous examples. The motion of the image scene is highly translational; therefore, the spatio-temporal segmentation does not need the 3-D framework. In the first step of the motion estimation process, the method judges whether the examined image region motion can be best described by a 2-D motion vector or a nine-parameter 3-D motion vector. The feature matching process extracts 110 displacement vectors, forming a vector field for every region. Homogeneous vector fields can be easily described by 2-D vectors, while complex vector fields need complex motion models. Homogeneity is detected through the rank of the submatrix: a full-rank submatrix arises from nonhomogeneous vector fields, while a rank-deficient one is created by highly homogeneous vector fields (see Section VI). In the presented example, the proposed method decides to use 2-D motion vectors in order to describe the motion.

The region merging process gives the final segmentation. The merging is made according to the displacement vectors of two regions: if the Euclidean distance of the motion vectors is less than a threshold, then the regions are merged. In the presented example, the 2-D threshold was set to 1.5. The results were inserted into a closing technique in order to eliminate small unlabeled holes in the background (Fig. 10).

3-D motion estimation techniques are often inflexible in cases of objects undergoing translation; the flower garden is such a case. Methods tend to estimate parameters of complex motion models for every region, yielding erroneous results in regions undergoing 2-D motion. The proposed approach examines the motion model of each region, directing it to a 3-D or a 2-D motion estimation technique. The regions in the flower-garden sequence are processed as regions undergoing 2-D motion, yielding accurate results.

D. Outdoor Sequence

The outdoor scene presents a car undergoing a translation along the optical axis (the $Z$ axis) in front of an outdoor background (Fig. 11). The image size is 256 × 256 pixels. As in the previous examples, the first step of the algorithm is the spatial segmentation performed by the edgeflow technique. The processing yields the partition of the image into 95 regions when the resolution factor is 6. Further spatial processing was judged necessary before stepping into the motion estimation method: regions were merged according to the mean luminance and the luminance variance. The luminance threshold was set to 7. The initial partition was merged into 34 larger regions, as shown in Fig. 11.


Fig. 12. Final partition of the outdoor scene using the proposed spatio-temporal segmentation approach. The car/shadow is correctly extracted from the background.

Fig. 13. Results of the temporal tracking procedure over eight frames of the table-tennis sequence.

In the feature matching step, 130 displacement vectors are collected for every region. The vectors are inserted into the motion estimation process, which calculates motion parameters for every region. The motion of the background regions was described by 2-D displacement vectors, while the regions corresponding to the car were described by the nine-parameter vector $\mathbf{q}$. The background regions are merged together easily in the motion merging process, and the car regions are clustered using the 3-D motion parameters. The closing procedure eliminates small artifacts, as in the previous examples. The final segmentation is presented in Fig. 12.

While in the flower-garden example all of the motion was described by 2-D displacement vectors, the outdoor sequence has some parts undergoing complex motion and others undergoing zooming translation. In general, whenever an object moves along the optical axis, the optical flow tends to disperse radially from the image center (zooming in) or to converge toward the image center (zooming out). Dealing with these cases is nontrivial. The proposed method correctly classifies the motion model of every region. Accurate 3-D motion parameters are estimated for every region of the car, yielding a correct spatio-temporal segmentation. It is important to point out that the car and its shadow form a single image object, since they both undergo the same motion.

E. Region Tracking

The typical structure of a region tracking procedure has two steps: 1) the initial segmentation step and 2) the tracking step [16], [39]. We investigated the usefulness of the proposed segmentation method in a simple region tracking scheme. First, the proposed spatio-temporal segmentation method is performed on the starting frame. A simple tracking procedure is then performed on the initially traced regions. The tracking procedure is based on the minimum error over a displacement window for the whole region. Using the notation of (2), we calculate the displaced region difference (DRD), defined as

$$\mathrm{DRD}(R, \mathbf{d}) = \sum_{q \in R} \left| I_t(q) - I_{t+1}(q + \mathbf{d}) \right| \qquad (16)$$

where $\mathbf{d}$ is the displacement of the region, $I_t(q)$ is the intensity of pixel $q$ in frame $t$, and $R$ is the labeled region. The displacement yielding the lowest DRD is assumed to be the right 2-D displacement vector for region $R$.
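A sketch of this DRD minimization in Python, assuming grayscale frames and a boolean region mask; pixels displaced outside the frame are excluded, and the difference is averaged over the remaining pixels (a normalization assumption, so that displacements with different valid areas stay comparable).

```python
import numpy as np

def track_region(frame1, frame2, region_mask, search=8):
    """Return the displacement (dy, dx) minimizing the DRD of Eq. (16)
    for the whole region marked by region_mask."""
    ys, xs = np.nonzero(region_mask)
    h, w = frame1.shape
    best, best_d = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = ys + dy, xs + dx
            ok = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
            if not ok.any():
                continue               # region displaced entirely off-frame
            drd = np.abs(frame1[ys[ok], xs[ok]].astype(np.float64)
                         - frame2[yy[ok], xx[ok]].astype(np.float64)).mean()
            if drd < best:
                best, best_d = drd, (dy, dx)
    return best_d
```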


The scheme was tested on the table-tennis sequence (Fig. 13). The fourth frame was used for the initial segmentation. The tracking was performed in both temporal directions: the tracking procedure starts at the second frame and stops at the ninth frame. After the ninth frame, the deformation of the arm region becomes larger, and the tracking procedure must be re-initialized.

IX. CONCLUSION

In this paper, we presented a novel scheme for the segmentation of rigid bodies from a video sequence, combining spatial and temporal information. The spatial information is collected using the edgeflow technique, while the temporal information is calculated using a feature matching process. The displacement vectors of the feature matching process are introduced into the motion estimation method, which calculates motion parameters depending on the motion model of every region. The motion parameter similarity is the criterion for region merging. The final segmentation yields a partition of the image scene into spatio-temporal coherent regions. The performance of the algorithm was successfully tested on several image sequences.

REFERENCES

[1] P. Bouthemy and E. François, "Motion segmentation and qualitative dynamic scene analysis from an image sequence," Int. J. Comput. Vis., vol. 10, no. 2, pp. 157–182, 1993.

[2] G. Adiv, "Determining three-dimensional motion and structure from optical flow generated by several moving objects," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-7, pp. 384–401, Apr. 1985.

[3] Th. Papadimitriou, K. I. Diamantaras, M. G. Strintzis, and M. Roumeliotis, "Robust estimation of rigid body 3-D motion parameters based on point correspondences," IEEE Trans. Circuits Syst. Video Technol., vol. 10, pp. 541–549, June 2000.

[4] Y. Huang, K. Palaniappan, X. Zhuang, and J. E. Cavanaugh, "Optic flow field segmentation and motion estimation using a robust genetic partitioning algorithm," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 1177–1189, Dec. 1995.

[5] F. Moscheni, S. Bhattacharjee, and M. Kunt, "Spatiotemporal segmentation based on region merging," IEEE Trans. Image Processing, no. 9, pp. 897–915, Sept. 1998.

[6] P. J. Huber, Robust Statistics. New York: Wiley, 1981.

[7] M. Irani, B. Rousso, and S. Peleg, "Computing occluding and transparent motions," Int. J. Comput. Vis., vol. 12, no. 1, pp. 5–16, 1994.

[8] H. G. Mussmann, N. Hoetter, and J. Ostermann, "Object-oriented analysis-synthesis coding of moving images," Signal Processing: Image Commun., vol. 1, no. 2, pp. 117–135, Oct. 1989.

[9] J. R. Bergen, P. J. Burt, R. Hingorani, and S. Peleg, "A three-frame algorithm for estimating two-component image motion," IEEE Trans. Pattern Anal. Machine Intell., vol. 14, pp. 886–896, Sept. 1992.

[10] M. Irani, B. Rousso, and S. Peleg, "Detecting and tracking multiple moving objects using temporal integration," in Proc. 2nd Eur. Conf. Computer Vision, G. Sandini, Ed., S. Margherita, Italy: Springer-Verlag, 1992, pp. 282–287.

[11] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection. New York: Wiley, 1987.

[12] S. Ayer and P. Schroeter, "Hierarchical robust motion estimation for segmentation of moving objects," in Proc. IEEE Workshop Image and Multidimensional Signal Processing, Cannes, France, 1993, pp. 122–123.

[13] J. Y. A. Wang and E. H. Adelson, "Representing moving images with layers," IEEE Trans. Image Processing, vol. 3, pp. 625–638, May 1994.

[14] F. Dufaux, F. Moscheni, and A. Lippman, "Spatiotemporal segmentation based on motion and static segmentation," in Proc. ICIP, Washington, DC, Oct. 1995, pp. 306–309.

[15] J. Y. A. Wang and E. H. Adelson, "Spatiotemporal segmentation of video data," in SPIE Proc. Image and Video Processing II, vol. 2, San Jose, CA, Feb. 1994.

[16] I. Patras, E. A. Hendriks, and R. L. Lagendijk, "Video segmentation by MAP labeling of watershed segments," IEEE Trans. Pattern Anal. Machine Intell., vol. 23, pp. 326–332, Mar. 2001.

[17] W.-Y. Ma and B. S. Manjunath, "Edgeflow: A technique for boundary detection and image segmentation," IEEE Trans. Image Processing, vol. 9, pp. 1375–1388, Aug. 2000.

[18] D. Heeger and A. Jepson, "Subspace methods for recovering rigid motion I: Algorithm and implementation," Int. J. Comput. Vis., vol. 7, pp. 95–117, Aug. 1992.

[19] E. Mémin and P. Pérez, "Dense estimation and object-based segmentation of the optical flow with robust techniques," IEEE Trans. Image Processing, vol. 7, pp. 703–719, May 1998.

[20] M. E. Al Mualla, C. N. Canagarajah, and D. R. Bull, "Simplex minimization for single- and multiple-reference motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 1209–1220, Dec. 2001.

[21] H. C. Longuet-Higgins, "A computer algorithm for reconstructing a scene from two projections," Nature, vol. 293, pp. 133–135, Sept. 1981.

[22] R. Y. Tsai and T. S. Huang, "Uniqueness and estimation of 3-D motion parameters of rigid bodies with curved surfaces," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, pp. 13–27, Jan. 1984.

[23] R. M. Haralick, H. Joo, C. N. Lee, X. Zhuang, V. G. Vaidya, and M. B. Kim, "Pose estimation from corresponding point data," IEEE Trans. Syst., Man, Cybern., vol. 19, pp. 1426–1446, Aug. 1989.

[24] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint. Cambridge, MA: MIT Press, 1993.

[25] J. Weng, Y. Liu, T. S. Huang, and N. Ahuja, "Estimating motion/structure from line correspondences: A robust linear algorithm and uniqueness theorems," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Ann Arbor, MI, 1988, pp. 387–392.

[26] Z. Zhang, "Estimating motion and structure from correspondences of line segments between two perspective images," IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 1129–1139, Dec. 1995.

[27] L.-F. Chen, H.-Y. M. Liao, and J.-C. Lin, "Wavelet-based optical flow estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 12, pp. 1–12, Jan. 2002.

[28] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artif. Intell., vol. 17, pp. 185–203, 1981.

[29] H.-H. Nagel and W. Enkelmann, "An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-8, pp. 565–593, May 1986.

[30] K. I. Diamantaras, Th. Papadimitriou, M. G. Strintzis, and M. Roumeliotis, "Total least squares 3-D motion estimation," in Proc. ICIP, Chicago, IL, Oct. 1998, paper MP11_4.

[31] K. I. Diamantaras and M. G. Strintzis, "Camera motion parameter recovery under perspective projection," in Proc. IEEE Conf. Image Processing, vol. 3, Lausanne, Switzerland, Sept. 1996, pp. 807–810.

[32] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: Johns Hopkins Univ. Press, 1998.

[33] D.-G. Sim and R.-H. Park, "Robust reweighted MAP motion estimation," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 353–365, Apr. 1998.

[34] K.-M. Lee, P. Meer, and R.-H. Park, "Robust adaptive segmentation of range images," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 200–205, Feb. 1998.

[35] T. A. Cass, "Robust affine structure matching for 3D object recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 1265–1274, Nov. 1998.

[36] R. G. Staudte and S. J. Sheather, Robust Estimation and Testing, ser. Probability and Mathematical Statistics. New York: Wiley, 1990.

[37] Th. Papadimitriou, K. I. Diamantaras, M. G. Strintzis, and M. Roumeliotis, "A new image sequence segmentation method based on luminance and 3-D motion information," in Proc. Int. Workshop Synthetic–Natural Hybrid Coding and Three Dimensional Imaging (IWSNHC3DI-99), Santorini, Greece, Sept. 15–17, 1999, pp. 53–57.

[38] Rotating Around the MOVI House [Online]. Available: http://www.irisa.fr/texmex/base_images/sequences/g3_vp_ra_s1/index.html

[39] Y. Tsaig and A. Averbuch, "Automatic segmentation of moving objects in video sequences: A region labeling approach," IEEE Trans. Circuits Syst. Video Technol., vol. 12, pp. 597–612, July 2002.


Theophilos Papadimitriou (S'98–M'03) was born in Thessaloniki, Greece, in 1972. He received the Diploma in mathematics from the Aristotle University of Thessaloniki, Thessaloniki, Greece, and the D.E.A. A.R.A.V.I.S. (Automatique, Robotique, Algorithmique, Vision, Image, Signal) from the University of Nice—Sophia Antipolis, France, both in 1996, and the Ph.D. degree from the Aristotle University of Thessaloniki in 2000.

He joined the Department of International Economic Relations and Development, Democritus University of Thrace, Komotini, Greece, in 2000, teaching informatics. In 2002, he became a Lecturer with the same faculty. In 2000, he also began teaching the digital image and video courses in the Department of Applied Informatics, University of Macedonia, Thessaloniki. His interests include signal, image, and video processing, robust statistics, and data analysis.

Konstantinos I. Diamantaras (S'90–M'92) was born in Athens, Greece, in 1965. He received the Diploma from the National Technical University of Athens, Athens, Greece, in 1987 and the Ph.D. degree from Princeton University, Princeton, NJ, in 1992, both in electrical engineering.

Subsequently, he joined Siemens Corporate Research, Princeton, as a Post-Doctoral Researcher, and in 1995 he worked as a Researcher with the Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Thessaloniki, Greece. Since 1998, he has been with the Department of Informatics, Technological Education Institute of Thessaloniki, where he currently holds the position of Associate Professor and Chairman. His research interests include signal processing, neural networks, image processing, and VLSI array processing. He is the author of the book Principal Component Neural Networks: Theory and Applications, coauthored with S. Y. Kung (New York: Wiley, 1996). He was a member of the organizing committee of ICIP-2001 and has been a technical committee member for various international signal processing and neural networks conferences. Since 1997, he has been serving as Editor for the Journal of VLSI Signal Processing.

Dr. Diamantaras is a member of the Technical Chamber of Greece. He served as an Associate Editor for the IEEE TRANSACTIONS ON NEURAL NETWORKS from 1999 to 2000. In 1997, he was co-recipient of the IEEE Best Paper Award in the area of Neural Networks for Signal Processing.

Michael G. Strintzis (M'70–SM'80) received the Diploma degree from the National Technical University of Athens, Athens, Greece, in 1967, and the M.A. and Ph.D. degrees from Princeton University, Princeton, NJ, in 1969 and 1970, respectively, all in electrical engineering.

He then joined the Electrical Engineering Department, University of Pittsburgh, Pittsburgh, PA, where he served as Assistant Professor (1970–1976) and Associate Professor (1976–1980). Since 1980, he has been Professor of Electrical and Computer Engineering with the University of Thessaloniki, Thessaloniki, Greece, and, since 1999, Director of the Informatics and Telematics Research Institute, Thessaloniki. His current research interests include two- and three-dimensional image coding, image processing, biomedical signal and image processing, and DVD and Internet data authentication and copy protection.

Dr. Strintzis has served as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY since 1999. In 1984, he was the recipient of one of the Centennial Medals of the IEEE.

Manos Roumeliotis (S'82–M'84) received the Diploma in electrical engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1981, and the M.S. and Ph.D. degrees in computer engineering from Virginia Polytechnic Institute and State University (VPI&SU), Blacksburg, in 1983 and 1986, respectively.

At VPI&SU, he taught as a Visiting Assistant Professor in 1986. From 1986 through 1989, he was an Assistant Professor with the Department of Electrical and Computer Engineering, West Virginia University. Currently, he is an Assistant Professor with the Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece. His research interests include digital logic simulation and testing, computer architecture and parallel processing, and computer network optimization.

Dr. Roumeliotis is a member of the IEEE Computer Society's Technical Committee on Computer Architecture.