Near-real-time stereo matching with slanted surface modeling and sub-pixel accuracy

Pattern Recognition 44 (2011) 2701–2710



Minglun Gong a,*, Yilei Zhang b, Yee-Hong Yang b

a Memorial University of Newfoundland, St. John’s, Newfoundland, Canada A1B 3X5
b University of Alberta, Edmonton, Alberta, Canada T6G 2E8

Article info

Article history:
Received 24 August 2010
Received in revised form 12 March 2011
Accepted 23 March 2011
Available online 29 March 2011

Keywords:
Real-time stereo
Adaptive-weight cost aggregation
GPU computing

0031-3203/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2011.03.028

* Corresponding author.
E-mail address: [email protected] (M. Gong).

Abstract

This paper presents a new stereo matching algorithm which takes into consideration surface orientation at the per-pixel level. Two disparity calculation passes are used. The first pass assumes that surfaces in the scene are fronto-parallel and generates an initial disparity map, from which the disparity plane orientations of all pixels are estimated and refined. In the second pass, the matching costs for different pixels are aggregated along the estimated disparity plane orientations using adaptive support weights, where the support weights of neighboring pixels are calculated using a combination of four terms: a spatial proximity term, a color similarity term, a disparity similarity term, and an occlusion handling term. The disparity search space is quantized at sub-pixel level to improve the accuracy of the disparity results. The algorithm is designed for parallel execution on Graphics Processing Units (GPUs) for near-real-time processing speed. The evaluation using the Middlebury benchmark shows that the presented approach outperforms existing real-time and near-real-time algorithms in terms of subpixel level accuracy.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

The binocular stereo matching problem has been extensively studied in the past few decades because of its many applications. The taxonomy and evaluation method proposed by Scharstein and Szeliski [17] have also contributed to the increased attention to this problem. According to their taxonomy, stereo matching algorithms perform all or some of the following four steps: matching cost initialization, cost aggregation, optimization, and refinement.

Stereo algorithms can be classified into global and local approaches based on the optimization techniques used. Global optimization methods, such as graph cuts [3] and belief propagation [19], in general give more accurate results. Local (winner-take-all [21]) and scanline (dynamic programming [23]) optimization approaches, on the other hand, have lower computational cost and are the popular choice in real-time applications.

The choice of cost aggregation approach also has an important impact on result accuracy. A previous study [4] shows that, among the cost aggregation approaches evaluated under the real-time stereo setting, the one based on adaptive weights [28] performs the best. Our recent research suggests that adaptive-weight cost aggregation can be further improved by aggregating the costs along the surface orientation [31]. It is therefore worth investigating whether this improvement can be incorporated into real-time stereo applications.

As a follow-up to our preliminary work [31], we propose a novel two-pass real-time stereo approach, which introduces per-pixel non-fronto-parallel disparity plane modeling and performs adaptive-weight cost aggregation in the 3D cost volume along slanted planes. In addition, the following improvements are made over the previous approach [31]: (1) the newly defined adaptive weight considers not only spatial proximity and color similarity, but also includes an occlusion term to better handle areas that are occluded in the second view; (2) a dynamic programming based optimization approach is used, instead of local winner-take-all, for better result accuracy; and (3) the whole algorithm is re-designed for parallel processing and implemented on the GPU for real-time performance.

The remainder of the paper is organized as follows. Related work is discussed in Section 2. Section 3 gives the details of the proposed two-pass algorithm. The experimental results are presented in Section 4. Finally, the paper concludes in Section 5.

2. Related work

2.1. Cost aggregation

Due to image capturing noise and non-Lambertian reflection, selecting the best disparity hypotheses based on matching costs calculated using individual pixel pairs is often prone to error. Cost aggregation is therefore an important step, which replaces an initial matching cost with the (weighted) average cost within a local support window [17]. Ideally, the support window should be large enough to capture sufficient intensity variation for handling weakly textured areas while, at the same time, small enough to exclude pixels of different disparities so that object boundaries are preserved. Furthermore, when there are horizontally slanted surfaces in the scene, different window sizes should be used for the two stereo views to match the unequal projection sizes of slanted surfaces [14].

Unlike the adaptive window approaches [2,10], which focus on varying the size, shape, and position of the support window, the adaptive-weight method gives promising results using a large fixed-size window with a varying support weight for each pixel in the window [28]. The weight is calculated based on the Gestalt Principles, which state that the grouping of pixels should be based on spatial proximity and chromatic similarity.

Several approaches have been proposed to further improve the matching accuracy of the original adaptive-weight approach. Instead of computing the weights based on the Gestalt Principles, Min and Sohn treat cost aggregation as an optimization problem and compute the optimal weights via energy minimization [13]. Hosni et al., on the other hand, set the weight of each neighboring pixel based on the geodesic distance between the neighboring pixel and the center pixel [8]. They show that, since the connectivity between the two pixels is considered, more accurate results can be obtained with the same local optimization.

The above approaches assume that all surfaces in the scene are fronto-parallel and aggregate matching costs within 2D constant-disparity planes. This assumption is often violated because of the large support window used: even when the slant is very small, the large neighborhood can still span multiple disparity levels. To address this problem, our preliminary work [31] models the slanted surfaces by estimating the disparity plane orientation at each pixel location. The cost aggregation is then performed in 3D disparity space along the estimated disparity plane orientation. As a result, smoother and more accurate disparity maps can be generated using local optimization. Aggregating per-pixel matching costs in 3D disparity space also implicitly uses different window sizes for the two views, without the need to explicitly stretch one of the images as in [14].

To compute the aggregated cost for a given pixel under a given disparity hypothesis, the above adaptive-weight methods need to calculate the support weights for all pixels within the support window. Assuming the support window size is $W \times W$, the computation needed for aggregating a single cost is $O(W^2)$. Since the window size $W$ must be big enough for the aggregation to be effective, these adaptive-weight techniques are too computationally intensive for real-time applications, even when used with the simplest local optimization approach.

2.2. Segmentation-based stereo

Instead of performing all computations on individual pixels, segmentation-based stereo algorithms first over-segment the input image into small homogeneously colored regions. The segmentation information is then used as a priori knowledge for further stereo calculation.

The disparity plane-fitting approaches [11,20] model the scene structure using a set of planar surface patches. Assuming that the pixels from the same color segment belong to the same patch, these approaches apply a plane-fitting technique to find candidate disparity planes for each segment. The optimal disparity plane assignment can be determined using either local [20] or global [11] optimization. Since the fitted disparity planes naturally provide sub-pixel disparity values, the scene can be reconstructed at a much finer level. Techniques for handling slanted surfaces using segmentation results have also been proposed [1].

Attempts have been made to combine segmentation information with adaptive-weight cost aggregation. Tombari et al. propose to assign full support weights to pixels in the same segment as the pixel of interest, whereas the weights for pixels outside of the segment are calculated using the original adaptive-weight approach [22]. Yang et al. integrate techniques such as color segmentation, plane fitting, adaptive weights, and belief propagation in an iterative process [25]. Their approach can generate highly accurate disparity maps, but is rather slow since the global optimization process is performed multiple times.

While utilizing segmentation information generally helps to obtain more accurate results, generating the segmentation itself is a challenging problem and consumes additional processing time. As a result, these techniques are not suitable for real-time applications. Our approach does not require a priori image segmentation and is designed for parallel processing on GPUs.

2.3. Real-time stereo matching

Due to the processing speed requirement, most real-time stereo systems employ per-pixel [21] or per-scanline [23] optimization. Recent advances in computing power, however, make it possible to perform more complex operations in real-time. For example, real-time belief propagation based techniques have been developed [24,26].

Several techniques have been proposed for incorporating adaptive-weight cost aggregation in real-time systems. Among them, Wang et al. use the adaptive-weight calculation in a dynamic programming (DP) framework, where the matching costs are aggregated within 1D vertical scanlines and the dynamic programming is performed along horizontal scanlines [23]. As a result, smoothness is enforced along both directions, reducing the "streaking" artifacts associated with traditional DP algorithms.

With two simplifications, it is also possible to perform adaptive-weight aggregation within a 2D square window in real-time [4,23]. The first simplification is to compute the weights using the central image only and omit the weight term obtained using the reference image. The second is to approximate the weighted average using a two-pass approach, with one pass aggregating along the horizontal scanlines and the other along the vertical ones. Hence, the computation needed for aggregating a single cost is cut down to $O(W)$, where $W$ is the width of the support window. The simplified approach, though it yields more matching errors than the original adaptive-weight approach, still outperforms the other real-time aggregation techniques evaluated in [4].
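To make the saving concrete, the following minimal sketch counts the weight evaluations needed to aggregate a single cost under the full 2D window and under the two-pass approximation. The function name is hypothetical, and the window width of 33 is simply the value used later in the experiments; only the $O(W^2)$ and $O(W)$ growth rates come from the text.

```python
def weight_evaluations(window_width):
    """Return (full 2D window, two-pass) weight counts for one aggregated cost."""
    full_2d = window_width ** 2      # one weight per pixel of a W x W window
    two_pass = 2 * window_width      # one horizontal plus one vertical 1D pass
    return full_2d, two_pass


full, separable = weight_evaluations(33)
print(full, separable)
```

For a 33-pixel-wide window this is 1089 evaluations against 66, roughly a 16-fold reduction per aggregated cost.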

The exponential step size adaptive weight (ESAW) algorithm [29] proposed by Yu et al. further simplifies the above real-time adaptive-weight implementation. Instead of directly calculating the weight of a distant pixel on the same scanline, ESAW approximates it with the product of the weights calculated using multiple pixel pairs. This simplification further cuts down the computational cost for a single pixel to $O(\log W)$.

The process of adaptive-weight aggregation can also be formulated as cross-bilateral filtering, where the filtering of the matching cost volume is guided by the color variation in the input image. Inspired by acceleration techniques for bilateral filtering, Richardt et al. recently proposed a real-time implementation using a dual-cross-bilateral (DCB) grid [15]. To generate disparity maps at subpixel accuracy, they also incorporate Yang et al.’s sub-pixel refinement technique [27].

The approach presented in this paper uses the same two simplifications proposed in [4,23] to reduce the computational cost of cost aggregation. However, we perform cost aggregation along the estimated disparity plane orientation with subpixel disparity accuracy. As a result, when the error rate is evaluated at subpixel accuracy, our approach is the top performer among all real-time and near-real-time algorithms.

3. The proposed algorithm

The workflow of the proposed algorithm is shown in Fig. 1. In the first pass, the algorithm computes an initial disparity map using a dynamic programming based stereo algorithm [6]. The same procedure is used to compute the disparity map for the reference view, which is then used to remove inconsistent disparity values from the center view through cross checking. The remaining disparity values, which are considered reliable, are used to estimate the gradient of the disparity plane at each pixel through a least squares fitting approach. The estimated gradient information is encoded into an image, referred to as a disparity plane orientation (DPO) image. With the estimated per-pixel DPO information, the second stereo matching pass initializes matching costs at sub-pixel accuracy, and then performs cost aggregation using a novel 3D adaptive cost aggregation approach.

It is worth noting that the above workflow is quite different from the one used in our preliminary paper. The latter estimates DPO images for both left and right views, performs stereo matching at subpixel level for both views as well, applies cross-checking to remove mismatches, and finally conducts hole-filling using DPO images. In contrast, the new approach performs DPO estimation for the center view only, does not require hole-filling, and relies on the occlusion term to handle occluded areas. These changes greatly reduce the computational cost, making near-real-time processing speed possible.

Fig. 1. The workflow of the proposed algorithm. Panels: (a) central image; (b) reference image; (c) central disparity; (d) reference disparity; (e) validated disparity; (f) disparity plane orientation; (g) final disparity map. Processing steps: coarse stereo matching using RDP, cross checking, disparity plane fitting, and subpixel stereo matching using 3D adaptive weight aggregation.

3.1. Coarse stereo matching

Taking a central image $I_L$ and a reference image $I_R$ as input, this step estimates the initial disparity maps for both images. The process starts with computing a 3D cost matrix using the truncated absolute color difference between corresponding pixels of the input stereo image pair. That is, the cost of assigning disparity hypothesis $d$ to pixel $(u,v)$ is computed using

$$C[u,v,d] = \min\left(\left|I_L(u,v) - I_R(u-d,v)\right|,\ C_{\max}\right) \tag{1}$$

where $|x - y|$ computes the Manhattan color distance between the two pixels in RGB color space, and $C_{\max}$ is the truncation value, which is set to 25 for all the experiments.
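As an illustration of Eq. (1), the sketch below builds the cost volume for a pair of small images stored as nested lists of RGB tuples. The function names, the `cost[v][u][d]` layout, and the handling of matches that fall outside the reference image (they simply receive the truncation value) are assumptions made for this example; the paper does not specify them.

```python
def truncated_cost(left, right, u, v, d, c_max=25):
    """Eq. (1) for one pixel and one disparity hypothesis.

    left, right: images as lists of rows, each row a list of (R, G, B) tuples.
    """
    if u - d < 0:
        # Assumption: a match outside the reference image gets the
        # truncation value, since border handling is not described.
        return c_max
    p_left, p_right = left[v][u], right[v][u - d]
    manhattan = sum(abs(a - b) for a, b in zip(p_left, p_right))
    return min(manhattan, c_max)


def cost_volume(left, right, max_disp, c_max=25):
    """Return the 3D cost matrix indexed as cost[v][u][d]."""
    height, width = len(left), len(left[0])
    return [[[truncated_cost(left, right, u, v, d, c_max)
              for d in range(max_disp)]
             for u in range(width)]
            for v in range(height)]


# Toy pair: the left row is the right row shifted by one pixel.
right = [[(10, 10, 10), (50, 50, 50), (90, 90, 90), (200, 0, 0)]]
left = [[(0, 0, 0), (10, 10, 10), (50, 50, 50), (90, 90, 90)]]
volume = cost_volume(left, right, max_disp=2)
```

In this toy pair the cost at the true disparity of one pixel is zero, while the wrong hypothesis saturates at the truncation value.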

Fig. 2. The 2D color texture that stores the matching costs. A single pixel encodes the costs obtained under four different disparity hypotheses in its four color channels, resulting in a total of 16 tiles used for storing the costs for all 64 hypotheses (for interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

The matching costs calculated based on individual pixels are aggregated using a $3 \times 3$ shiftable window, which replaces the cost of a given pixel with the average cost within a square window anchored at different locations [17]. The small $3 \times 3$ shiftable window is chosen here based on the following two considerations: (1) it can be efficiently computed using a $3 \times 3$ box filter pass followed by a $3 \times 3$ min filter pass; and (2) without knowing the orientation and the boundary of the local disparity surfaces, using a small window avoids introducing errors caused by slanted surfaces and depth discontinuities.
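The box-then-min formulation of the shiftable window can be sketched on a single disparity slice of the cost volume as follows. The min filter picks, among the nine $3 \times 3$ windows that contain a pixel, the one with the lowest average cost. The helper names and the clamp-to-edge border handling are assumptions for this example, not details taken from the paper.

```python
def _filter3x3(img, reduce_fn):
    """Apply reduce_fn to every 3 x 3 neighborhood (borders are clamped)."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            window = [img[min(max(v + j, 0), height - 1)]
                         [min(max(u + i, 0), width - 1)]
                      for j in (-1, 0, 1) for i in (-1, 0, 1)]
            out[v][u] = reduce_fn(window)
    return out


def shiftable_window_3x3(cost_slice):
    """3 x 3 shiftable-window aggregation of one disparity slice."""
    # Pass 1: box filter gives the average cost of the window centered
    # at every pixel.
    box = _filter3x3(cost_slice, lambda vals: sum(vals) / len(vals))
    # Pass 2: min filter selects the best of the nine windows that
    # contain the pixel of interest.
    return _filter3x3(box, min)


# A single noisy cost in an otherwise zero slice.
spike = [[0.0] * 5 for _ in range(5)]
spike[2][2] = 9.0
aggregated = shiftable_window_3x3(spike)
```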

Both cost initialization and cost aggregation are implemented on the GPU. As shown in Fig. 2, the resulting 3D aggregated cost volume is stored in a 2D color texture, which is then used for disparity calculation. Different real-time disparity optimization techniques can be applied using the proposed framework, though GPU-based approaches are preferred as data transfer between GPU and CPU can be avoided. Here we choose the Orthogonal Reliability-based Dynamic Programming (ORDP) algorithm [6]. The ORDP algorithm searches for the optimal matches along both horizontal and vertical scanlines using dynamic programming, and removes potential mismatches using reliability-based thresholding. It is also designed for parallel execution on GPUs and can achieve real-time processing speed.

When calculating a disparity map, the ORDP algorithm performs dynamic programming (DP) passes along horizontal and vertical scanlines in alternating order to search for reliable matches. All reliable matches found in previous passes are used to guide the following searches by updating the matching costs. In our experiments we use two DP passes (horizontal then vertical), which typically find reliable matches for 80–95% of pixels. We also modify the algorithm so that the disparity maps for both central and reference images can be generated using the same aggregated cost volume, removing the need to compute a separate cost volume for the reference view.

Fig. 1(c) and (d) shows the disparity maps estimated for the two input images. The green color in the disparity maps indicates that no reliable disparity values have been found for the corresponding pixels after two DP passes. It is worth noting that the ORDP algorithm assigns disparity values to all pixels in the image, even though some are deemed unreliable. Here we use $D(u,v)$ and $R(u,v)$ to denote the disparity value and the reliability label obtained for pixel $(u,v)$ of a given image, with $R(u,v) = 1$ indicating that the disparity value $D(u,v)$ is reliable and $R(u,v) = 0$ otherwise.
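The packing of the cost volume into a 2D color texture can be illustrated with a simple address computation. Only the facts that each pixel holds four hypotheses in its four color channels and that 64 hypotheses occupy 16 tiles come from the Fig. 2 caption; the row-major arrangement of tiles in a grid four tiles wide, the function name, and the parameter names are assumptions made for this sketch.

```python
def texture_address(u, v, hypothesis, width, height, tiles_per_row=4):
    """Map a cost-volume entry to (x, y, channel) in the packed 2D texture.

    hypothesis: index of the disparity hypothesis (0..63 for 64 hypotheses).
    width, height: size of one tile, i.e. of the input image.
    tiles_per_row: assumed layout parameter (hypothetical, not from the paper).
    """
    tile, channel = divmod(hypothesis, 4)          # four hypotheses per pixel
    tile_row, tile_col = divmod(tile, tiles_per_row)
    return tile_col * width + u, tile_row * height + v, channel


# Example with a 384 x 288 image and 64 hypotheses (16 tiles).
first = texture_address(10, 20, 0, 384, 288)
second_tile = texture_address(10, 20, 5, 384, 288)
last = texture_address(10, 20, 63, 384, 288)
```

Keeping four hypotheses per texel lets one texture fetch return four matching costs at once, which is the property the caption describes.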

3.2. Cross checking

The ORDP algorithm can detect potential mismatches caused by ambiguities and label them as unreliable. However, due to noise or occlusion, the matches that are considered reliable may still be incorrect (see Fig. 1(c) and (d)). Removing these incorrect matches is important, as otherwise errors may propagate through the later steps, i.e., disparity plane orientation estimation and adaptive weight calculation.

Using the disparity map generated for the reference view, we can cross validate the disparity values in the central disparity map. That is, the reliability label for pixel $(u,v)$ of the central view is set to unreliable if its disparity value does not agree with the one for the corresponding pixel in the reference view:

$$R_L(u,v) = 0 \quad \text{if } \left|D_L(u,v) - D_R\big(u - D_L(u,v),\ v\big)\right| > 1 \tag{2}$$

Fig. 1(e) shows the result of cross validation. Close inspection indicates that most mismatches are labeled as unreliable.
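The cross-checking rule of Eq. (2) can be sketched as below for integer-valued disparity maps stored as nested lists. The function name and the decision to mark pixels whose correspondence falls outside the reference image as unreliable are assumptions made for this example.

```python
def cross_check(disp_left, disp_right, reliable_left, tol=1):
    """Return updated reliability labels for the central view (Eq. (2))."""
    height, width = len(disp_left), len(disp_left[0])
    out = [row[:] for row in reliable_left]      # keep the input untouched
    for v in range(height):
        for u in range(width):
            u_ref = u - disp_left[v][u]          # corresponding reference pixel
            if u_ref < 0 or u_ref >= width:
                # Assumption: no correspondence inside the image means
                # the match cannot be validated.
                out[v][u] = 0
            elif abs(disp_left[v][u] - disp_right[v][u_ref]) > tol:
                out[v][u] = 0
    return out


labels = cross_check([[0, 1, 1, 3]], [[1, 1, 0, 0]], [[1, 1, 1, 1]])
```

In this one-row example only the last pixel is rejected: its disparity of 3 maps to a reference pixel whose disparity is 1, a difference larger than the threshold.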

3.3. Disparity plane fitting

With initial disparity values calculated for the central view, the next step is to estimate the slanted surface orientation at each pixel location. To simplify the calculation, we assume that all surfaces in the scene can be locally approximated using a plane in the 3D disparity space. The orientation of a given disparity plane is specified using the horizontal and vertical gradients $(G_x, G_y)$ in the disparity space, where

$$G_x(u,v) = \frac{\partial \bar{D}(u,v)}{\partial u}, \qquad G_y(u,v) = \frac{\partial \bar{D}(u,v)}{\partial v} \tag{3}$$

and $\bar{D}(u,v)$ is the unknown ground truth disparity map.

To estimate $G_x$ and $G_y$ from the inaccurate disparity map $D(u,v)$ obtained, a simple least squares fitting method is applied. For example, to compute $G_x(u,v)$, we want to find a line along the horizontal scanline that passes through $D(u,v)$ and gives the smallest weighted squared error. The weighted squared error between the data and the fitted straight line is defined as

$$E = \sum_{k=-r}^{+r} z_k(u,v)\,\Big(D(u+k,v) - \big(G_x(u,v)\,k + D(u,v)\big)\Big)^2 \tag{4}$$

where the weight function $z_k(u,v)$ is used to suppress outliers. Here a disparity sample $D(u+k,v)$ is considered an outlier if it is unreliable or if it differs too much from the disparity value of the center pixel $D(u,v)$. That is, the weight is set using

$$z_k(u,v) = \begin{cases} 1 & \text{if } R(u+k,v) = 1 \,\wedge\, \left|D(u+k,v) - D(u,v)\right| \le 2 \\ 0.01 & \text{otherwise} \end{cases} \tag{5}$$

When $E$ is at its minimum, we have

$$\frac{\partial E}{\partial G_x(u,v)} = -2 \sum_{k=-r}^{+r} z_k(u,v)\,k\,\Big(D(u+k,v) - \big(G_x(u,v)\,k + D(u,v)\big)\Big) = 0 \tag{6}$$

Therefore $G_x(u,v)$ can be calculated as

$$G_x(u,v) = \frac{\sum_{k=-r}^{+r} \big(z_k(u,v)\,k\,D(u+k,v)\big) - D(u,v) \sum_{k=-r}^{+r} \big(z_k(u,v)\,k\big)}{\sum_{k=-r}^{+r} \big(z_k(u,v)\,k^2\big)} \tag{7}$$

The vertical gradient $G_y(u,v)$ is computed in a similar manner. Fig. 1(f) shows the disparity plane orientation image generated, which encodes the horizontal and vertical gradient values in the red and green channels, respectively.
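A minimal sketch of the horizontal gradient estimate, combining the outlier weights of Eq. (5) with the closed-form solution of Eq. (7), is given below for one scanline. The function name, the skipping of samples outside the image, and the zero fallback when the denominator vanishes are assumptions made for this example.

```python
def horizontal_gradient(disp_row, rel_row, u, r):
    """Weighted least squares estimate of G_x at column u of one scanline."""
    d0 = disp_row[u]
    sum_kd = sum_k = sum_kk = 0.0
    for k in range(-r, r + 1):
        x = u + k
        if x < 0 or x >= len(disp_row):
            continue  # assumption: samples outside the image are skipped
        # Eq. (5): full weight for reliable samples close to the center value.
        inlier = rel_row[x] == 1 and abs(disp_row[x] - d0) <= 2
        z = 1.0 if inlier else 0.01
        sum_kd += z * k * disp_row[x]
        sum_k += z * k
        sum_kk += z * k * k
    if sum_kk == 0.0:
        return 0.0    # assumption: no usable neighbors, treat as fronto-parallel
    # Eq. (7)
    return (sum_kd - d0 * sum_k) / sum_kk


# A scanline lying on a plane with slope 0.5, then the same line with an outlier.
slope = horizontal_gradient([0.0, 0.5, 1.0, 1.5, 2.0], [1, 1, 1, 1, 1], u=2, r=2)
robust = horizontal_gradient([0.0, 0.5, 1.0, 1.5, 50.0], [1, 1, 1, 1, 1], u=2, r=2)
```

On the clean scanline the estimate recovers the slope of 0.5. With the outlier of 50, the 0.01 weight keeps the estimate below 1, whereas an unweighted fit over the same samples would give about 10.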

Fig. 4. The effects of adaptive weight approximation for different pixels shown in the blue square. Higher intensities in (b, c) indicate that the corresponding neighboring pixel has a higher weight. (a) Input image, (b) original adaptive weights, and (c) approximate weights (for interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).


3.4. Subpixel stereo matching using 3D adaptive weight aggregation

The final step estimates disparity values at subpixel accuracy using the source images ($I_L$ and $I_R$), the initial disparity estimates ($D_L$), and the disparity plane orientation information ($G_x$ and $G_y$). In order to achieve subpixel accuracy, we first update the matching cost volume $C[u,v,d]$ computed in Section 3.1 to include matching costs at subpixel disparity levels. To balance computational cost against subpixel disparity accuracy, we choose to sample the disparity space at 0.25 spacing. Hence, for each matching cost $C[u,v,d]$ already stored in the matrix, we need to compute three additional costs using three disparity hypotheses $d+s$, where $s \in \{0.25, 0.5, 0.75\}$:

$$C[u,v,d+s] = \min\left(\left|I_L(u,v) - \big((1-s)\,I_R(u-d,v) + s\,I_R(u-d-1,v)\big)\right|,\ C_{\max}\right) \tag{8}$$
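Eq. (8) amounts to linearly interpolating the reference image between two neighboring pixels before taking the truncated color difference. The sketch below illustrates this for images stored as nested lists of RGB tuples; the function name and the use of the truncation value for out-of-image matches are assumptions made for this example.

```python
def subpixel_cost(left, right, u, v, d, s, c_max=25):
    """Eq. (8): matching cost for the subpixel hypothesis d + s."""
    if u - d - 1 < 0:
        # Assumption: hypotheses that need a pixel outside the reference
        # image receive the truncation value.
        return c_max
    p_left = left[v][u]
    p_near = right[v][u - d]         # reference pixel at integer disparity d
    p_far = right[v][u - d - 1]      # reference pixel at integer disparity d + 1
    manhattan = sum(abs(a - ((1 - s) * b0 + s * b1))
                    for a, b0, b1 in zip(p_left, p_near, p_far))
    return min(manhattan, c_max)


# The left pixel at u = 2 matches the reference image at disparity 0.25.
right = [[(0, 0, 0), (8, 8, 8), (16, 16, 16)]]
left = [[(0, 0, 0), (0, 0, 0), (14, 14, 14)]]
costs = [subpixel_cost(left, right, 2, 0, 0, s) for s in (0.25, 0.5, 0.75)]
```

The three costs grow as the hypothesis moves away from the true disparity of 0.25, which is what allows the later optimization to select a subpixel value.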

Next, the matching costs are aggregated within a $(2r+1) \times (2r+1)$ window. The aggregation is performed along the estimated disparity plane orientation. The formula for the 3D adaptive-weight aggregation is as follows:

$$AC[u,v,d+s] = \frac{\sum_{m,n \in [-r,r]} w^{(m,n)}_{(u,v)}\, C\big[u+m,\ v+n,\ d+s+m\,G_x(u,v)+n\,G_y(u,v)\big]}{\sum_{m,n \in [-r,r]} w^{(m,n)}_{(u,v)}} \tag{9}$$

where $(u,v)$ is the pixel of interest and $w^{(m,n)}_{(u,v)}$ represents the weight of neighboring pixel $(u+m,v+n)$, which is calculated using four terms: a proximity term, a color similarity term, a disparity similarity term, and an occlusion handling term:

$$w^{(m,n)}_{(u,v)} = w_{prox}(u,v,m,n)\, w_{clr}(u,v,m,n)\, w_{disp}(u,v,m,n)\, w_{occ}(u,v,m,n) \tag{10}$$

The first two terms assign weights to neighboring pixels based on how close they are to the center pixel in the spatial domain and in color space, respectively. Similar to [28], these two terms are defined using Gaussian functions:

$$w_{prox}(u,v,m,n) = e^{-\frac{m^2+n^2}{\sigma_{prox}^2}}, \qquad w_{clr}(u,v,m,n) = e^{-\frac{\left|I(u,v) - I(u+m,v+n)\right|^2}{\sigma_{clr}^2}} \tag{11}$$

where parameters $\sigma_{prox}$ and $\sigma_{clr}$ control the shapes of the two Gaussian functions.

With the initial coarse disparity map already computed, we can also define a disparity similarity term, which gives higher weights to nearby pixels with the same or similar disparity values and lower weights to those with different disparities. Pixels with unreliable disparity values are also assigned small weights to limit their effect on the cost aggregation. That is,

$$w_{disp}(u,v,m,n) = \begin{cases} e^{-\frac{\left|D(u,v) - D(u+m,v+n)\right|^2}{\sigma_{disp}^2}} & \text{if } R(u+m,v+n) = 1 \,\vee\, m = n = 0 \\ e^{-\frac{255^2}{\sigma_{disp}^2}} & \text{otherwise} \end{cases} \tag{12}$$

When $m = n = 0$, all three terms above return the value 1, which is the maximum weight. As a result, the matching costs computed for the center pixel always have a strong influence on the aggregation result. While this is a preferred property in most cases, it may become problematic in occluded areas.

Fig. 3. Handling of pixels in occluded areas. The rows show the true disparity, the estimated disparity, the estimated reliability, and the desired aggregation weights (for interpretation of the references to color in this figure, the reader is referred to the web version of this article).

Fig. 3 shows the problem using pixels on one scanline of the image (shown in yellow). The true disparity value for the pixel in question (shown in red) should be the one for the background. However, since the corresponding 3D point is occluded in the reference image, none of the matching costs calculated for the pixel measures the color difference between the true corresponding pixel pair. As a result, the disparity value obtained for the pixel is erratic and is deemed unreliable. The best way to recover the disparity value for this pixel is to extrapolate the disparity values of nearby unoccluded background pixels. However, this would require a separate hole-filling step, which can be hard to implement in parallel since disparity information needs to propagate into the occluded region iteratively.

In this paper, we propose to perform hole-filling implicitly through an additional occlusion handling weight term. For a pixel in an occluded background area, this weight term should give a higher weight to nearby unoccluded background and a lower weight to both occluded background (unreliable pixels) and foreground (pixels on the right, under the assumption that the reference image is the right view). Accordingly, we define the occlusion handling term as

$$w_{occ}(u,v,m,n) = \begin{cases} 1 & \text{if } R(u,v) = 1 \\ 1 & \text{else if } m < 0 \,\wedge\, R(u+m,v+n) = 1 \\ \delta_{occ} & \text{else} \end{cases} \tag{13}$$

When the center pixel is reliable ($R(u,v) = 1$), an indication that the pixel is unoccluded, the above term outputs the same weight "1" for all neighboring pixels. As a result, the cost aggregation is performed as if there were no occlusion handling term. When the center pixel is occluded ($R(u,v) = 0$), the above term gives a higher weight to reliable pixels on the left ($m < 0 \wedge R(u+m,v+n) = 1$), which are very likely the unoccluded background. A lower weight $\delta_{occ}$ ($\delta_{occ} < 1$) is given to the remaining pixels, which are either pixels in the occluded area, including the center pixel, or foreground pixels on the right. As a result, the aggregated costs are mostly affected by unoccluded background pixels, so that the disparity value that extrapolates the unoccluded background has a smaller aggregated cost than other disparity hypotheses.
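The four weight terms of Eqs. (10)-(13) can be combined as in the following sketch. The parameter values are the ones reported in the experimental section; the function name, the nested-list image layout, and the use of the Manhattan RGB distance inside the color term are assumptions made for this example, since the norm used in Eq. (11) is not stated explicitly.

```python
import math

SIGMA_PROX, SIGMA_CLR, SIGMA_DISP, DELTA_OCC = 25.0, 30.0, 4.0, 0.5


def support_weight(img, disp, rel, u, v, m, n):
    """Weight of neighbor (u + m, v + n) for the center pixel (u, v), Eq. (10)."""
    # Eq. (11): spatial proximity and color similarity.
    w_prox = math.exp(-(m * m + n * n) / SIGMA_PROX ** 2)
    color_dist = sum(abs(a - b) for a, b in zip(img[v][u], img[v + n][u + m]))
    w_clr = math.exp(-color_dist ** 2 / SIGMA_CLR ** 2)  # assumed Manhattan norm
    # Eq. (12): disparity similarity; unreliable neighbors get a tiny weight
    # (the exponent is so negative that it underflows to 0.0 in double precision).
    if rel[v + n][u + m] == 1 or (m == 0 and n == 0):
        dd = disp[v][u] - disp[v + n][u + m]
        w_disp = math.exp(-dd * dd / SIGMA_DISP ** 2)
    else:
        w_disp = math.exp(-255.0 ** 2 / SIGMA_DISP ** 2)
    # Eq. (13): occlusion handling.
    if rel[v][u] == 1:
        w_occ = 1.0
    elif m < 0 and rel[v + n][u + m] == 1:
        w_occ = 1.0
    else:
        w_occ = DELTA_OCC
    return w_prox * w_clr * w_disp * w_occ


# One scanline of uniform color whose center pixel is unreliable (occluded).
img = [[(90, 90, 90)] * 3]
disp = [[2, 2, 2]]
rel = [[1, 0, 1]]
w_left = support_weight(img, disp, rel, 1, 0, -1, 0)
w_center = support_weight(img, disp, rel, 1, 0, 0, 0)
w_right = support_weight(img, disp, rel, 1, 0, 1, 0)
```

For the occluded center pixel, the reliable neighbor on the left receives a larger weight than both the center pixel itself and the neighbor on the right, which is the behavior the occlusion term is designed to produce.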

Fig. 5. Results of our algorithm on Middlebury stereo datasets. The input stereo images are shown in the first row. Disparity maps generated using our algorithm and the corresponding error maps are shown in the next two rows. For comparison, the last two rows show the results of our algorithm when the proposed disparity similarity term and the occlusion handling term are not used (for interpretation of the references to color in this figure, the reader is referred to the web version of this article).

While the above weight terms are defined for all pixels within the local square window, actually calculating the weights for all these pixels and performing the weighted sum is too computationally expensive for real-time processing. To cut the computational cost, we approximate the weight for a given pixel $(u+m,v+n)$ using the product of two weights:

$$w^{(m,n)}_{(u,v)} \approx w^{(m,0)}_{(u,v+n)}\, w^{(0,n)}_{(u,v)} \tag{14}$$

Therefore, the above cost aggregation equation can be approximated using

$$\begin{aligned} AC[u,v,d+s] &\approx \frac{\sum_{m,n \in [-r,r]} w^{(m,0)}_{(u,v+n)}\, w^{(0,n)}_{(u,v)}\, C\big[u+m,\ v+n,\ d+s+m\,G_x(u,v)+n\,G_y(u,v)\big]}{\sum_{m,n \in [-r,r]} w^{(m,n)}_{(u,v)}} \\ &= \frac{\sum_{n=-r}^{+r} w^{(0,n)}_{(u,v)} \left( \sum_{m=-r}^{+r} w^{(m,0)}_{(u,v+n)}\, C\big[u+m,\ v+n,\ d+s+m\,G_x(u,v)+n\,G_y(u,v)\big] \right)}{\sum_{m,n \in [-r,r]} w^{(m,n)}_{(u,v)}} \end{aligned} \tag{15}$$

As a result, the 3D adaptive-weight aggregation can be computed using a two-step approach. The first step performs 1D aggregation along horizontal scanlines and caches the results in an intermediate cost volume. The second step takes the intermediate cost volume as input and aggregates along vertical scanlines. Fig. 4 shows the effects of this approximation. It suggests that the approximation is reasonably accurate in smooth regions, but less so in areas with fine details.
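The two-step aggregation can be sketched on a single cost slice as shown below, next to the exact full-window version for comparison. This is a deliberately simplified illustration: it uses a grayscale image, only the proximity and color terms, a single fronto-parallel disparity slice (no slant), and it normalizes by the sum of the approximate weights; all of these choices, together with the function names, are assumptions made for the example rather than details of the GPU implementation.

```python
import math


def weight(img, u0, v0, u1, v1, s_prox=25.0, s_clr=30.0):
    """Proximity times color weight of pixel (u1, v1) for center (u0, v0)."""
    dist2 = (u1 - u0) ** 2 + (v1 - v0) ** 2
    dc = img[v1][u1] - img[v0][u0]                 # grayscale, for brevity
    return math.exp(-dist2 / s_prox ** 2) * math.exp(-dc * dc / s_clr ** 2)


def aggregate_exact(img, cost, r):
    """Reference O(W^2) weighted average over the full window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            num = den = 0.0
            for n in range(-r, r + 1):
                for m in range(-r, r + 1):
                    if 0 <= u + m < w and 0 <= v + n < h:
                        wt = weight(img, u, v, u + m, v + n)
                        num += wt * cost[v + n][u + m]
                        den += wt
            out[v][u] = num / den
    return out


def aggregate_two_pass(img, cost, r):
    """O(W) approximation: horizontal pass, then vertical pass (Eq. (14))."""
    h, w = len(img), len(img[0])
    num = [[0.0] * w for _ in range(h)]            # cached weighted cost sums
    den = [[0.0] * w for _ in range(h)]            # cached weight sums
    for v in range(h):
        for u in range(w):
            for m in range(-r, r + 1):
                if 0 <= u + m < w:
                    wt = weight(img, u, v, u + m, v)
                    num[v][u] += wt * cost[v][u + m]
                    den[v][u] += wt
    out = [[0.0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            n_sum = d_sum = 0.0
            for n in range(-r, r + 1):
                if 0 <= v + n < h:
                    wt = weight(img, u, v, u, v + n)
                    n_sum += wt * num[v + n][u]
                    d_sum += wt * den[v + n][u]
            out[v][u] = n_sum / d_sum
    return out


flat_img = [[7] * 5 for _ in range(5)]
toy_cost = [[(3 * u + 5 * v) % 11 for u in range(5)] for v in range(5)]
exact = aggregate_exact(flat_img, toy_cost, 1)
approx = aggregate_two_pass(flat_img, toy_cost, 1)
```

On a uniformly colored image the two results coincide, because the Gaussian proximity term factorizes exactly into a horizontal and a vertical part. Differences appear only where the color term varies within the window, which is consistent with the observation that the approximation is less accurate in areas with fine details.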

Once the aggregated cost matrix $AC[u,v,d+s]$ is generated, the same ORDP procedure is used to compute the disparity map for the central view. Since our goal now is to output disparity values for all pixels, here we use three DP passes, with no reliability-based filtering being performed for the last horizontal pass.

4. Experimental results

The proposed algorithm is suitable for parallel processing on GPUs, as all computations, such as DPO estimation and 3D cost aggregation, can be performed on different pixels independently. We choose to implement the algorithm using Direct3D and HLSL shaders, so that it integrates nicely with the ORDP algorithm used and runs on different graphics cards. Using a newer platform such as CUDA and exploiting shared memory may further improve the processing speed. Currently, for the Tsukuba dataset, which has an image resolution of $384 \times 288$ pixels and a disparity range of 16 pixels, our implementation achieves 5 fps on a Lenovo ThinkStation S20 with an nVidia GeForce GTX 480 GPU (a midrange graphics card available for less than $500). It is also noteworthy that, since our approach samples the disparity range at subpixel levels, the number of disparity hypotheses evaluated in the last pass is four times the input disparity range.

The algorithm is evaluated using the Middlebury stereo benchmark [18]. The parameters used in the experiments are tuned individually, but are kept the same for all datasets tested. In particular, the support window size parameter $r$ is set to 16, resulting in a $33 \times 33$ window being used. The parameters for the different support weight terms are set as $\sigma_{prox} = 25$, $\sigma_{clr} = 30$, $\sigma_{disp} = 4$, and $\delta_{occ} = 0.5$.

Fig. 6. Results of different variants of adaptive-weight approaches. The disparity maps in the first two rows are reported by the original adaptive-weight paper [28] and our preliminary paper [31], respectively (both are non-real-time). The last three rows show the results of the simplified adaptive-weight approach [4], the DCB grid approach [15], and the ESAW approach [29], respectively.

Fig. 5 shows the results of our algorithm with and without the disparity similarity term and the occlusion handling term. It demonstrates that the two new terms proposed can effectively remove mismatches in occluded areas. The effect is most noticeable in the areas highlighted using red rectangles.

For comparison, results of existing adaptive-weight-based approaches are shown in Fig. 6. The following observations can be made through visual inspection:

• Since our preliminary approach [31] uses the same local WTA optimization as the original adaptive-weight approach [28], the comparison between the two suggests that aggregating costs in 3D along the estimated disparity plane orientation effectively improves the matching accuracy for the slanted surfaces in the Venus, Teddy, and Cones datasets.
• The occluded areas are nicely handled in the adaptive-weight approach [28] and our preliminary approach [31], since both perform hole-filling on pixels invalidated by the left–right consistency check. However, they are too slow for real-time applications. In contrast, the presented algorithm estimates the disparity in occluded areas directly with the help of the proposed occlusion handling weight term.
• Results generated by the proposed approach are much more accurate than those of the simplified adaptive-weight approach [4], which uses the same two simplifications for weight calculation. We attribute this improvement to both the proposed 3D cost aggregation technique and the ORDP optimization approach used.
• Our proposed approach provides visually better results than the DCB grid approach [15], which also outputs disparity values at subpixel accuracy.
• The results of our approach and the ESAW approach are of similar quality. Nevertheless, our results are much smoother, thanks to the subpixel matching.

Table 1
The error rates evaluated from the Middlebury vision website, with error threshold = 1.

| Alg. | Tsukuba Non-occ | Tsukuba All | Tsukuba Disc | Venus Non-occ | Venus All |
|---|---|---|---|---|---|
| PlaneFitBP [24] | 0.97 | 1.83 | 5.26 | 0.17 | 0.51 |
| Adapt weight [28] | 1.38 | 1.85 | 6.90 | 0.71 | 1.19 |
| Real time ABW [7] | 1.26 | 1.67 | 6.83 | 0.33 | 0.65 |
| Real time BP [26] | 1.49 | 3.40 | 7.87 | 0.77 | 1.90 |
| Our approach | 3.04 | 3.59 | 11.3 | 1.56 | 2.00 |
| Real time BFV [30] | 1.71 | 2.22 | 6.74 | 0.55 | 0.87 |
| Fast aggreg [21] | 1.16 | 2.11 | 6.06 | 4.03 | 4.75 |
| ESAW [29] | 1.92 | 2.45 | 9.66 | 1.03 | 1.65 |
| Real time Var [12] | 3.33 | 5.48 | 16.8 | 1.15 | 2.35 |
| Optimized DP [16] | 1.97 | 3.78 | 9.80 | 3.33 | 4.74 |
| RT census [9] | 5.08 | 6.25 | 19.2 | 1.58 | 2.42 |
| Real time GPU [23] | 2.05 | 4.22 | 10.6 | 1.92 | 2.98 |
| ReliabilityDP [5] | 1.36 | 3.39 | 7.25 | 2.35 | 3.48 |
| DCB grid [15] | 5.90 | 7.26 | 21.0 | 1.35 | 1.91 |

Table 2
The error rates evaluated from the Middlebury vision website, with error threshold = 0.5.

| Alg. | Tsukuba Non-occ | Tsukuba All | Tsukuba Disc | Venus Non-occ | Venus All |
|---|---|---|---|---|---|
| Our approach | 9.86 | 10.5 | 17.0 | 3.31 | 3.82 |
| RT census [9] | 12.9 | 14.1 | 28.1 | 3.67 | 4.63 |
| PlaneFitBP [24] | 12.7 | 13.6 | 16.2 | 8.58 | 9.12 |
| Adapt weight [28] | 18.1 | 18.8 | 18.6 | 7.77 | 8.40 |
| DCB grid [15] | 21.7 | 22.8 | 29.6 | 2.85 | 3.97 |
| Real time ABW [7] | 12.1 | 12.6 | 16.5 | 8.39 | 8.90 |
| Real time Var [12] | 9.69 | 11.8 | 29.8 | 11.3 | 12.5 |
| Real time BFV [30] | 24.5 | 25.0 | 21.3 | 8.64 | 9.12 |
| Real time BP [26] | 19.9 | 21.6 | 22.2 | 8.68 | 9.93 |
| Optimized DP [16] | 24.0 | 25.5 | 22.7 | 12.4 | 13.7 |
| ESAW [29] | 19.2 | 19.7 | 22.8 | 11.0 | 11.7 |
| Fast aggreg [21] | 23.1 | 23.9 | 19.6 | 15.8 | 16.6 |
| ReliabilityDP [5] | 19.0 | 20.7 | 17.5 | 12.7 | 14.0 |
| Real time GPU [23] | 24.2 | 26.0 | 24.9 | 10.9 | 12.1 |

The quantitative evaluation is also conducted using theMiddlebury online evaluation tool. Tables 1 and 2 give theperformance comparison between our proposed approach andall real-time and near-real-time stereo algorithms available onMiddlebury site to date (August 2010) [18]. The comparisonsupports the following observations:

�
When subpixel accuracy is not required, our algorithm isoutperformed by only belief propagation based techniques[24,26], the recently proposed adaptive binary windowapproach [7], and the original adaptive weight approach [28](non-real-time).
e top performance in each category is shown in bold.

Teddy Cones

sc Non-occ All Disc Non-occ All Disc

.71 6.65 12.1 14.7 4.17 10.7 10.6

.13 7.88 13.3 18.6 3.97 9.79 8.26

.56 10.7 18.3 23.3 4.81 12.6 10.7

.00 8.72 13.2 17.2 4.61 11.6 12.4

.21 6.90 12.3 16.0 4.87 10.3 10.7

.88 9.90 15.0 19.5 6.66 12.3 13.4

.43 9.04 15.2 20.2 5.37 12.6 11.9

.89 8.48 14.2 18.7 6.56 12.7 14.4

.8 6.18 13.1 17.3 4.66 11.7 13.7

.0 6.53 13.9 16.6 5.17 13.7 13.4

.2 7.96 13.8 20.3 4.10 9.54 12.2

.3 7.23 14.4 17.6 6.41 13.7 16.5

.2 9.82 16.9 19.5 12.9 19.9 19.7

.2 10.5 17.2 22.2 5.34 11.9 14.9

The top performance in each category is shown in bold.

Teddy Cones

Disc Non-occ All Disc Non-occ All Disc

11.0 11.8 18.3 24.2 7.94 13.5 15.117.8 11.4 18.6 27.7 5.54 11.8 15.9

12.7 18.4 24.3 31.4 15.3 21.9 23.9

15.8 17.6 23.9 34.0 14.0 19.7 20.6

16.5 16.1 24.0 33.0 10.8 18.2 22.6

12.9 22.0 29.0 38.4 15.1 22.2 22.4

31.8 15.6 23.6 32.7 9.72 17.2 21.3

13.5 18.1 24.2 32.1 15.0 20.6 22.9

20.1 19.2 24.8 33.8 14.2 21.2 25.9

23.7 17.1 25.0 30.3 14.1 22.2 24.6

18.4 19.4 25.9 33.6 18.5 24.0 28.8

19.6 21.1 27.4 34.0 15.5 22.0 23.1

26.1 26.3 32.5 36.8 23.7 29.9 31.5

27.6 19.6 27.0 33.0 16.5 23.7 29.5


�
When error threshold is lowered to 0.5, our approach performsthe best among all existing real-time and non-real-timetechniques. It also outperforms the original adaptive-weightapproach. � The difference in ranking under thresholds 1.0 and 0.5 indi-
cates that our algorithm makes much fewer errors in the rangeof [0.5, 1.0] than existing ones. Since the algorithm is designedfor handling slanted surface at subpixel accuracy, having fewererrors in this range suggests it achieves the goal.
• Even though no explicit hole-filling is performed, our algorithm gives the lowest error rates in discontinuity areas under subpixel accuracy for the Venus, Teddy, and Cones datasets, and is close to the best on Tsukuba (17.0 versus 16.2 for PlaneFitBP [24]). We attribute this to the occlusion handling term, which properly handles the occluded areas detected by the left–right consistency check used in the first disparity calculation pass.
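The error rates discussed above are percentages of "bad" pixels, i.e. pixels whose disparity differs from the ground truth by more than the error threshold. The following minimal sketch illustrates how lowering the threshold from 1.0 to 0.5 exposes errors in the [0.5, 1.0] range. The function name `bad_pixel_rate`, the flat-list data layout, and the optional region mask are assumptions for illustration; this is not the Middlebury evaluation code.

```python
def bad_pixel_rate(disparity, ground_truth, threshold, mask=None):
    """Percentage of evaluated pixels with |d - d_gt| > threshold.

    disparity, ground_truth: flat sequences of disparity values.
    mask: optional flat sequence of booleans selecting the evaluated region
          (e.g. non-occluded pixels or pixels near discontinuities).
    """
    evaluated = 0
    bad = 0
    for i, (d, gt) in enumerate(zip(disparity, ground_truth)):
        if mask is not None and not mask[i]:
            continue  # pixel lies outside the evaluated region
        evaluated += 1
        if abs(d - gt) > threshold:
            bad += 1
    return 100.0 * bad / evaluated if evaluated else 0.0


# Toy data: absolute errors of 0.0, 0.4, 0.8 and 2.0 pixels.
estimated = [10.0, 10.4, 10.8, 12.0]
truth = [10.0, 10.0, 10.0, 10.0]

rate_at_1 = bad_pixel_rate(estimated, truth, 1.0)     # only the 2.0 error counts
rate_at_half = bad_pixel_rate(estimated, truth, 0.5)  # the 0.8 error counts too
```

The gap between the two rates is the share of pixels whose error falls between the two thresholds, which is the quantity the ranking difference between Tables 1 and 2 reflects.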

While the above experiments show that our algorithm works well for slanted surfaces, the algorithm does assume that the surfaces contain sufficient texture for proper matching. For textureless slanted surfaces, such as the one that Ogale and Aloimonos used in Fig. 5 of their paper [14], our algorithm will fail, since the coarse stereo matching step will not provide a good initial estimate in textureless areas.

6. Conclusion

A novel two-pass stereo matching algorithm is presented in this paper, which combines adaptive-weight cost aggregation with slanted surface modeling. The first pass generates a coarse disparity map, from which the disparity plane orientation at each pixel location is robustly estimated using least squares fitting. The orientation information is later used in the second pass to guide cost aggregation along slanted surfaces in the 3D cost volume. The experimental results show that 3D cost aggregation helps to produce better disparity maps than the original adaptive-weight algorithms. Two new weight terms are also introduced for adaptive weight calculation, which help to reduce mismatches in occluded areas. When implemented for near-real-time stereo matching through simplifications, our approach performs the best under subpixel accuracy among all existing real-time and near-real-time algorithms reported on the Middlebury evaluation site.
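As an illustration of the plane estimation step summarized above, the following minimal sketch fits a disparity plane d = a·x + b·y + c to a set of (x, y, d) samples by ordinary least squares and then predicts the disparity of a neighboring pixel along that plane. It is a plain, unweighted fit solved through the normal equations; the robust estimation and refinement used by the presented algorithm are not reproduced, and the function names `fit_disparity_plane` and `disparity_on_plane` are hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_disparity_plane(samples):
    """Least-squares fit of d = a*x + b*y + c to (x, y, d) samples.

    Returns (a, b, c). Raises ValueError when the samples do not
    determine a plane (e.g. all points are collinear).
    """
    sxx = sxy = syy = sx = sy = n = 0.0
    sxd = syd = sd = 0.0
    for x, y, d in samples:
        sxx += x * x
        sxy += x * y
        syy += y * y
        sx += x
        sy += y
        n += 1.0
        sxd += x * d
        syd += y * d
        sd += d
    # Normal equations: normal * [a, b, c]^T = rhs
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxd, syd, sd]
    det = det3(normal)
    if abs(det) < 1e-12:
        raise ValueError("samples do not determine a unique plane")
    coeffs = []
    for k in range(3):  # Cramer's rule: replace column k with rhs
        replaced = [row[:] for row in normal]
        for i in range(3):
            replaced[i][k] = rhs[i]
        coeffs.append(det3(replaced) / det)
    return tuple(coeffs)


def disparity_on_plane(d_center, a, b, dx, dy):
    """Disparity expected at offset (dx, dy) when following the plane."""
    return d_center + a * dx + b * dy
```

In a sketch of this kind, aggregating along the slanted plane amounts to sampling the cost volume at `disparity_on_plane(...)` for each neighbor instead of at the center pixel's disparity, which is what a fronto-parallel assumption would do.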

Acknowledgments

The authors gratefully acknowledge financial support from NSERC, the Memorial University of Newfoundland, and the University of Alberta.

References

[1] S. Birchfield, C. Tomasi, Multiway cut for stereo and motion with slanted surfaces, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 489–495.

[2] Y. Boykov, O. Veksler, R. Zabih, A variable window approach to early vision, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (12) (1998) 1283–1294.

[3] Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (11) (2001) 1222–1239.

[4] M. Gong, R. Yang, L. Wang, M. Gong, A performance study on different cost aggregation approaches used in real-time stereo matching, International Journal of Computer Vision 75 (2), pp. 1573–1405.

[5] M. Gong, Y.-H. Yang, Near real-time reliable stereo matching using programmable graphics hardware, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2005) 924–931.

[6] M. Gong, Y.-H. Yang, Real-time stereo matching using orthogonal reliability-based dynamic programming, IEEE Transactions on Image Processing 16 (3) (2007) 879–884.

[7] R.K. Gupta, S.-Y. Cho, Real-time stereo matching using adaptive binary window, in: Proceedings of the 3DPVT, 2010.

[8] A. Hosni, M. Bleyer, M. Gelautz, C. Rhemann, Local stereo matching using geodesic support weights, Proceedings of the International Conference on Image Processing (2009).

[9] M. Humenberger, C. Zinner, M. Weber, W. Kubinger, M. Vincze, A fast stereo matching algorithm suitable for embedded real-time systems, Computer Vision and Image Understanding (2010).

[10] T. Kanade, M. Okutomi, Stereo matching algorithm with an adaptive window: theory and experiment, IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (9) (1994) 920–932.

[11] A. Klaus, M. Sormann, K. Karner, Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure, Proceedings of the International Conference on Pattern Recognition (2006) 15–18.

[12] S. Kosov, T. Thormahlen, H.-P. Seidel, Accurate real-time disparity estimation with variational methods, Advances in Visual Computing (2009).

[13] D. Min, K. Sohn, Cost aggregation and occlusion handling with WLS in stereo matching, IEEE Transactions on Image Processing 17 (8) (2008) 1431–1441.

[14] A.S. Ogale, Y. Aloimonos, Stereo correspondence with slanted surfaces: critical implications of horizontal slant, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2004) 568–573.

[15] C. Richardt, D. Orr, I. Davies, A. Criminisi, N.A. Dodgson, Real-time spatio-temporal stereo matching using the dual-cross-bilateral grid, in: Proceedings of the European Conference on Computer Vision, 2010.

[16] J. Salmen, M. Schlipsing, J. Edelbrunner, S. Hegemann, S. Luke, Real-time stereo vision: making more out of dynamic programming, Computer Analysis of Images and Patterns (2009) 1096–1103.

[17] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, International Journal of Computer Vision 47 (1–3) (2002) 7–42.

[18] D. Scharstein, R. Szeliski, Middlebury Stereo Vision Page (2010).

[19] J. Sun, N.-N. Zheng, H.-Y. Shum, Stereo matching using belief propagation, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (7) (2003) 1–14.

[20] H. Tao, H.S. Sawhney, R. Kumar, A global matching framework for stereo computation, Proceedings of the ICCV (2001), pp. 532–539.

[21] F. Tombari, S. Mattoccia, E. Addimanda, Near real-time stereo based on effective cost aggregation, Proceedings of the ICPR (2008).

[22] F. Tombari, S. Mattoccia, L.D. Stefano, Segmentation-based adaptive support for accurate stereo correspondence, Advances in Image and Video Technology (2007) 427–438.

[23] L. Wang, M. Liao, M. Gong, R. Yang, D. Nister, High quality real-time stereo using adaptive cost aggregation and dynamic programming, Proceedings of the International Symposium on 3D Data Processing, Visualization and Transmission (2006), pp. 798–805.

[24] Q. Yang, C. Engels, A. Akbarzadeh, Near real-time stereo for weakly-textured scenes, Proceedings of the British Machine Vision Conference (2008) pp. 80–87.

[25] Q. Yang, L. Wang, R. Yang, H. Stewenius, D. Nister, Stereo matching with color-weighted correlation, hierarchical belief propagation and occlusion handling, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2006).

[26] Q. Yang, L. Wang, R. Yang, S. Wang, M. Liao, D. Nister, Real-time global stereo matching using hierarchical belief propagation, Proceedings of the British Machine Vision Conference (2006).

[27] Q. Yang, R. Yang, J. Davis, D. Nister, Spatial-depth super resolution for range images, Proceedings of the CVPR (2007).

[28] K.J. Yoon, I.S. Kweon, Adaptive support-weight approach for correspondence search, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2005) 650–656.

[29] W. Yu, T. Chen, F. Franchetti, J.C. Hoe, High performance stereo vision designed for massively data parallel platforms, IEEE Transactions on Circuits and Systems for Video Technology (2010).

[30] K. Zhang, J. Lu, G. Lafruit, R. Lauwereins, L.V. Gool, Real-time accurate stereo with bitwise fast voting on CUDA, Proceedings of the IEEE Workshop on Embedded Computer Vision (2009).

[31] Y. Zhang, M. Gong, Y.-H. Yang, Local stereo matching with 3D adaptive cost aggregation for slanted surface modeling and sub-pixel accuracy, Proceedings of the International Conference on Pattern Recognition (2008).


Dr. Minglun Gong is an associate professor at the Memorial University of Newfoundland and an adjunct professor at the University of Alberta. He obtained his Ph.D. from the University of Alberta in 2003, his M.Sc. from Tsinghua University in 1997, and his B.Engr. from Harbin Engineering University in 1994. After graduation, he was a faculty member at Laurentian University for 4 years before joining Memorial University.
Minglun’s research interests include a variety of topics in computer graphics, computer vision, image processing, pattern recognition, and optimization techniques. He has published over 50 technical papers in refereed journals and conference proceedings and has served as a program committee member and reviewer for international journals and conferences.

Mr. Yilei Zhang was an M.Sc. student at the University of Alberta, co-supervised by Dr. Gong and Dr. Yang. His thesis topic is ‘‘Towards real-time adaptive support weight stereo algorithms.’’ He graduated in 2008 and is now with TRLabs in Edmonton.

Dr. Yee-Hong Yang’s research interest covers a wide range of topics in computer graphics and computer vision. In computer graphics, his interests include animation, environment matting, hardware accelerated graphics, motion editing, physics-based modelling, texture analysis and synthesis, and static and dynamic image-based rendering. Topics in computer vision include edge detection, face detection and recognition, light source estimation, motion estimation, segmentation, 2D and 3D shape analysis, and real-time multiview stereo.

Dr. Yang is a senior member of the IEEE and serves on the Editorial Board of the journal Pattern Recognition. He has published over 100 technical papers in international journals and conference proceedings, co-edited one book, and served as guest editor of an international journal. In addition, he has served as a reviewer for numerous international journals and as a committee member for many conferences and review panels. He also co-chaired Vision Interface 98.
