Guided Depth Upsampling for Precise Mapping of Urban Environments

Sascha Wirges¹, Björn Roxin², Eike Rehder², Tilman Kühner¹ and Martin Lauer²

Abstract— We present an improved model for MRF-based depth upsampling, guided by image as well as 3D surface normal features. By exploiting the underlying camera model we define a novel regularization term that implicitly evaluates the planarity of arbitrarily oriented surfaces. Our method improves upsampling quality in scenes composed of predominantly planar surfaces, such as urban areas. We use a synthetic dataset to demonstrate that our approach outperforms recent methods that implement distance-based regularization terms. Finally, we validate our approach for mapping applications on our experimental vehicle.

I. INTRODUCTION

Perception and localization algorithms developed for automated driving tasks rely on accurate environment models. These models are usually generated using information provided by mobile sensors such as cameras or range sensors. Whereas cameras provide 2D projections of surface reflectances with high spatial resolution, range sensors usually provide precise 3D surface positions. However, the spatial resolution of modern range sensors is sparse compared to cameras.

Currently, most systems perform environmental mapping within one sensor domain, which has several drawbacks. Common methods usually perform feature estimation and matching to find corresponding surface landmarks between subsequent measurement frames. For camera-based mapping methods, the scale might be either subject to drift or hard to estimate accurately in the calibration process, which results in globally inconsistent maps. For range sensor-based methods, the resulting map may consist of accurate but spatially sparse 3D points, which inherently induces errors on surface feature estimation and reconstruction. Thus, our aim is to combine the strengths of both sensor types to generate a map that consists of spatially dense surface features.

Here, we propose a guided depth upsampling method that estimates surfaces accurately for each camera pixel within scenes composed of predominantly planar surfaces, such as urban areas. Provided with a calibrated camera-laser setup, the 3D surface point position can be determined by evaluating the viewing ray corresponding to an image coordinate at an estimated depth.

However, as different image areas usually have varying 3D point densities, the quality of depth upsampling might vary drastically. Therefore, we are also interested in finding

1 Sascha Wirges and Tilman Kühner are with the FZI Research Center for Information Technology and Karlsruhe Institute of Technology, Germany. {wirges,kuehner}@fzi.de

2 Björn Roxin, Eike Rehder and Martin Lauer are with the Institute of Measurement and Control, Karlsruhe Institute of Technology, Germany. [email protected], {rehder,lauer}@kit.edu

Fig. 1: Top: Input RGB image with range sensor data input overlay. Bottom: Upsampled high-resolution depth image.

a confidence measure for each depth estimate. We show that our method is capable of performing accurate upsampling within image areas that contain only few 3D point observations. Finally, we provide a filtering method that stems from our optimization model to filter out ill-conditioned depths.

By describing similarities and differences of related work in guided depth upsampling in section II, we show common drawbacks and emphasize our ideas to overcome these problems. Based on these findings, we formalize our objectives in depth upsampling and derive the underlying Markov Random Field model in section III. We then validate our approach on a photorealistic indoor dataset and our experimental vehicle (section IV). Finally, we conclude our findings in section V and outline our next steps in guided depth upsampling.

II. MOTIVATION AND RELATED WORK

Our general objective in depth upsampling is to estimate depths d̂_i for each image coordinate i ∈ I of the image I. Depth observations d_{j,obs}, j ∈ O ⊂ I from range sensors might be available only for a small subset O of image coordinates.

Assuming a calibrated camera-laser setup, each d̂_i can be transformed into a corresponding 3D point

    p̂_i = ray_i(d̂_i) = ori_i + d̂_i · dir_i    (1)

using the viewing ray function ray_i that includes, for each image coordinate i, the direction dir_i and the viewpoint ori_i of the respective line of sight. In the following, we will use p̂_i and d̂_i interchangeably.
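The ray lookup of equation (1) reduces to a precomputed origin and direction per image coordinate. The following C++ sketch illustrates this mapping; the ViewingRay type and its member names are our illustrative assumptions, not part of any published interface.

```cpp
#include <Eigen/Core>

// Minimal sketch of equation (1): a precomputed viewing ray per image
// coordinate maps an estimated depth to a 3D point.
struct ViewingRay {
  Eigen::Vector3d origin;     // ori_i, viewpoint of the line of sight
  Eigen::Vector3d direction;  // dir_i, assumed normalized

  // p_i = ori_i + d_i * dir_i
  Eigen::Vector3d pointAt(double depth) const {
    return origin + depth * direction;
  }
};
```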


Upsampling methods may be divided into local and global methods. Whereas local methods [1], [2] can be used for upsampling images with mostly dense and uniformly distributed depth observations, global methods show better performance on data with sparse and non-uniformly distributed observations. In urban mapping, however, the number of observations might vary drastically, depending on the scene setting. Therefore, we focus on global upsampling methods.

Within global methods, the optimal depths d*_i, i ∈ I, arranged in

    d* = arg min_d Φ(d)    (2)

minimize the cost function Φ that may be composed of various cost terms. [3] models this problem as a Markov Random Field with a cost term

    Φ = Φ_data + Φ_reg,    (3)

that does not only minimize costs towards the given observation data, but also within the direct neighborhood N_i of each image coordinate i, where the regularization term

    Φ_reg = Σ_{i∈I} Σ_{n∈N_i} w_in (d_i − d_n)²    (4)

is used to enforce that estimated depths of direct image coordinate neighbors (e.g. within a 4-connected grid) are similar. However, their model assumption does not hold for arbitrary planes as it regularizes towards similar depths.

The weights w_in in equation (4) might be used to include additional information on the problem. Whereas invariant weights are used in image filtering applications [4], weights depending on image features can guide upsampling and thus improve quality. Moreover, in all guided approaches, image features are used to indicate depth discontinuities (see table I). In particular, [5] shows that image values and range measurements share second-order statistics. Based on this work, either gray-scale [3] or color intensity gradients are used. [6] includes semantic information in the regularization term and determines extended neighborhoods based on geodesic distances. Even higher-order terms such as the anisotropic diffusion tensor [7], [8] might be used. The authors of [8] add a non-local means regularization term, which uses an anisotropic structure-aware filter to allow similar pixels in extended neighborhoods to reinforce each other.

Although guided approaches based on image features have been studied extensively, a major drawback of existing methods is that they do not incorporate 3D features into the upsampling process. Therefore, we show the benefit of including 3D surface normals in our problem.

Even if recent methods achieve accurate results, they do not account for confidences in the estimation problem. We provide a simple method based on estimating the parameter covariance of the underlying optimization problem at the end of the next section.

III. GUIDED DEPTH UPSAMPLING

For each image coordinate i, we aim to determine its depth d_i and a depth confidence measure σ_i. To achieve this, we require a calibrated camera-laser rig that provides viewing ray lookup functions ray_i as described in equation (1) and the transform p_ext ∈ SE(3) between range sensor and camera frame to be known. Given p_ext, observed 3D point features f_j, j ∈ O can be transformed into the camera frame and mapped to the image coordinate j.

TABLE I: Image features used in different contributions

                                   [3]  [8]  [7]  [6]
    RGB / gray-scale values         x    x    x    x
    Spatial distance                     x         x
    Anisotropic diffusion tensor         x    x
    Semantic information                           x

As in equation (3), we model our upsampling problem as a Markov Random Field containing data costs Φ_data and regularization costs Φ_reg. We can include additional image features into the optimization problem, which we explain in section III-C. These cost terms should be minimized starting from depth priors d_{i,0} determined by our initialization strategy explained in section III-D. In the following, we describe the different energy functions included in our model.

A. Data Costs

For each observation d_{j,obs}, j ∈ O, we set up data costs

    Φ_data = Σ_{j∈O} w_data (φ_{j,depth}² + Σ_{n∈N_j} φ_{jn,normal}²)    (5)

weighted by w_data. Here, N_j is the direct neighborhood of image coordinate j, which we choose to be a 4-connected grid.

Since we want to include depth observations from range sensors, depth residuals

    φ_{j,depth} = d_j − d_{j,obs}    (6)

evaluate the difference between estimated and observed depths for each image coordinate j ∈ O.
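For illustration, the depth residual (6) maps directly to a small cost functor for Ceres Solver, on which the implementation is based (section III-F). The functor layout and names below are our assumptions, a sketch rather than the authors' code.

```cpp
#include <ceres/ceres.h>

// Sketch of the depth residual phi_depth = d_j - d_j,obs from equation (6).
struct DepthResidual {
  explicit DepthResidual(double d_obs) : d_obs_(d_obs) {}

  template <typename T>
  bool operator()(const T* const d, T* residual) const {
    residual[0] = d[0] - T(d_obs_);
    return true;
  }

  static ceres::CostFunction* Create(double d_obs) {
    // 1 residual, 1 parameter block of size 1 (the depth d_j)
    return new ceres::AutoDiffCostFunction<DepthResidual, 1, 1>(
        new DepthResidual(d_obs));
  }

  double d_obs_;
};
```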

In addition, we include estimated surface normals from range sensors as pseudo measurements in our problem. Therefore, normal residuals

    φ_{jn,normal} = n_{P_j}ᵀ p_n − d_{P_j}    (7)

evaluate the signed point-to-plane distance between the constructed plane P_j and the point p_n = ray_n(d_n). The plane is constructed from the surface normal n_obs and can be expressed in normal form

    n_{P_j}ᵀ x − d_{P_j} = 0.    (8)
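A hedged sketch of the normal residual (7): the plane offset d_{P_j} is precomputed from the observed normal and the observed point, and the neighbor point follows equation (1). Member names and the exact parameterization are our assumptions.

```cpp
#include <Eigen/Core>
#include <ceres/ceres.h>

// Sketch of the signed point-to-plane residual from equations (7) and (8).
struct NormalResidual {
  NormalResidual(const Eigen::Vector3d& n_obs, const Eigen::Vector3d& p_j_obs,
                 const Eigen::Vector3d& origin_n, const Eigen::Vector3d& dir_n)
      : n_obs_(n_obs), d_plane_(n_obs.dot(p_j_obs)),
        origin_n_(origin_n), dir_n_(dir_n) {}

  template <typename T>
  bool operator()(const T* const d_n, T* residual) const {
    // phi_normal = n^T p_n - d_P with p_n = ori_n + d_n * dir_n
    residual[0] = T(-d_plane_);
    for (int c = 0; c < 3; ++c)
      residual[0] += T(n_obs_[c]) * (T(origin_n_[c]) + d_n[0] * T(dir_n_[c]));
    return true;
  }

  Eigen::Vector3d n_obs_;
  double d_plane_;  // d_P = n_obs . p_j,obs, plane in normal form (8)
  Eigen::Vector3d origin_n_, dir_n_;
};
```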

B. Regularization Costs

To model coupling in the Markov sense, we add regularization cost terms for each image coordinate i within its direct neighborhood N_i. Whereas 8-connected grids provide better coupling, their large number of residual blocks decreases optimization speed; choosing only two neighbors, on the other hand, leads to poor coupling and slows convergence. Thus, we choose 4-connected grids as they provide a good trade-off between coupling and the amount of coupling residuals in the problem.

We aim to minimize the regularization cost

    Φ_reg = Σ_{i∈I} Σ_{D⊂N_i} w_i(D) φ_{i,planar}(D)²    (9)

for each image coordinate i in the image I. Here, w_i(D) is a weighting term depending on i and a subset D of the neighborhood N_i. We explain in section III-C how w_i(D) is composed. In the simplest case, the residual terms φ_{i,planar} are evaluated between pairs of image coordinates (i, n) as presented in equation (4), where n ∈ N_i. Here, we extend the residual computation to depend on sets D ⊂ N_i of multiple image coordinates.

Instead of regularizing towards constant depth (e.g. as in [3]), we enforce the surface points to be coplanar. Thus, we aim to find an appropriate residual term that shows good convergence properties.

One option would be to estimate surface normals explicitly, based on all points in the corresponding neighborhood N_i, and find an appropriate point-to-plane residual, similar to equation (7).

We are, however, not interested in computing normals directly, but instead in finding a residual term that evaluates the planarity of surface points. Here, we assume neighboring viewing rays along one row or column to be coplanar. Although this assumption might not hold for arbitrary camera models, it can be justified for a sufficiently small neighborhood around a reference image coordinate. As the intersection between the plane spanned by these viewing rays and an ideal surface plane forms a line, we can add the residual

    φ_{ijk,collinear} = ∆_ji/‖∆_ji‖ − ∆_ik/‖∆_ik‖    (10)

that evaluates whether triples of points are collinear. As depicted in figure 2,

    ∆_ji = p_i − p_j = ray_i(d_i) − ray_j(d_j)    (11)

and ∆_ik are the pairwise differences between the points corresponding to the coordinates i, j and k, where j, k ∈ D and the viewing rays of i, j and k are coplanar.

Using the collinearity residual in equation (10), we can add one residual term for direct neighbors within the same image row and one term for neighbors within the same column. As only three parameters are coupled within each residual, the problem sparsity is increased, which leads to better convergence properties compared to explicitly estimating surface normals.
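The collinearity residual (10) can likewise be sketched as an auto-differentiated functor over three depth parameters, reusing the ViewingRay sketch from section II; with Ceres it would be wrapped as ceres::AutoDiffCostFunction<CollinearityResidual, 3, 1, 1, 1>. Again, the layout is our assumption.

```cpp
#include <ceres/ceres.h>
#include <cmath>

// Sketch of equation (10): the 3-vector residual vanishes when the points
// p_j, p_i, p_k along one image row or column are collinear.
struct CollinearityResidual {
  CollinearityResidual(const ViewingRay& ray_i, const ViewingRay& ray_j,
                       const ViewingRay& ray_k)
      : ray_i_(ray_i), ray_j_(ray_j), ray_k_(ray_k) {}

  template <typename T>
  bool operator()(const T* const d_i, const T* const d_j,
                  const T* const d_k, T* residual) const {
    using std::sqrt;  // ADL picks ceres::sqrt for Jet types
    T dji[3], dik[3], nji = T(0), nik = T(0);
    for (int c = 0; c < 3; ++c) {
      // Delta_ji = p_i - p_j and Delta_ik = p_k - p_i, see equation (11)
      const T p_i = T(ray_i_.origin[c]) + d_i[0] * T(ray_i_.direction[c]);
      const T p_j = T(ray_j_.origin[c]) + d_j[0] * T(ray_j_.direction[c]);
      const T p_k = T(ray_k_.origin[c]) + d_k[0] * T(ray_k_.direction[c]);
      dji[c] = p_i - p_j;
      dik[c] = p_k - p_i;
      nji += dji[c] * dji[c];
      nik += dik[c] * dik[c];
    }
    nji = sqrt(nji);
    nik = sqrt(nik);
    for (int c = 0; c < 3; ++c)
      residual[c] = dji[c] / nji - dik[c] / nik;
    return true;
  }

  ViewingRay ray_i_, ray_j_, ray_k_;
};
```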

C. Regularization Weights

The collinearity residual in equation (10) should only be applied to areas satisfying the assumption of planar surfaces. To accomplish this, we use additional image features expressed as weights w_i(D).

Here, we employ weights w_ij that are defined between neighboring image coordinates i and j.


Fig. 2: Collinearity residual computation. We assume the viewing rays corresponding to the coordinates i, j and k to be coplanar. The intersection between this plane and an ideal plane forms a line. We then evaluate whether the points p_i, p_j and p_k are collinear. This residual can be evaluated horizontally and vertically.

For the collinearity residual, D consists of three image coordinates {i, j, k} and we determine pairwise weights

    w_ij = g(∆) = g(f_i − f_j)    (12)

as components of the regularization weights

    w_i(D) = w_ij · w_ik    (13)

added for each image coordinate i.

Pairwise weights w_ij are composed of a scalar weighting function g and image features f_i and f_j. The weighting function might be exponential, sigmoid, step or even constant, which means that local image features have no influence on the regularization cost Φ_reg. However, it is important to note that arbitrary features might be used as long as they provide information about scene planarity.
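As an example, an exponential choice for g might look as follows; the decay constant lambda is a hypothetical tuning parameter, not a value from the paper.

```cpp
#include <cmath>

// One possible scalar weighting function g for equation (12): identical
// features give weight 1, strong feature gradients (e.g. at depth
// discontinuities) drive the weight toward 0.
double exponentialWeight(double feature_difference, double lambda = 10.0) {
  return std::exp(-lambda * std::abs(feature_difference));
}
```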

D. Prior Estimation

In our contribution, we do not focus on solving the optimization problem efficiently by analyzing the underlying problem structure; please refer to [4] or [6] for hints on implementation details. Instead, we suggest an initialization method based on linear interpolation that significantly reduces optimization time and the number of iterations.

For our initialization method, projected 3D points need to be found for every query image coordinate. Therefore, we generate a kd-tree as described in [9] that includes the set of point projections within the image coordinate frame. This kd-tree search structure quickly provides references to the nearest laser point projections for any query image coordinate.


Fig. 3: Initialization method based on intersecting the viewing ray of image coordinate i with the plane constructed by a triangle mesh of the input point cloud.

As depicted in Figure 3, we generate a triangle mesh of the point cloud and project it into the image. For each image coordinate i within a triangle, we intersect the corresponding ray_i with the plane constructed by the three points defining the triangle, which we use as depth initialization. Image coordinates that are not covered by any triangle are assigned the depth of their nearest neighbor. This might be the case at image borders or for 3D points not connected by the meshing algorithm.
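The per-pixel initialization thus reduces to a ray-plane intersection, sketched below under the assumption that the viewing ray is not parallel to the triangle plane; the helper name is illustrative.

```cpp
#include <Eigen/Dense>

// Sketch of the initialization from figure 3: intersect the viewing ray of
// image coordinate i with the plane spanned by a mesh triangle (a, b, c).
double initialDepth(const ViewingRay& ray, const Eigen::Vector3d& a,
                    const Eigen::Vector3d& b, const Eigen::Vector3d& c) {
  // Plane normal from the triangle vertices
  const Eigen::Vector3d n = (b - a).cross(c - a);
  // Solve n . (ori + d * dir - a) = 0 for the depth d
  return n.dot(a - ray.origin) / n.dot(ray.direction);
}
```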

E. Covariance Estimation

In some scenarios, depth estimation may not work well. On the one hand, the range sensor's field of view might not cover the camera's field of view. On the other hand, by evaluating equation (13), image areas might be decoupled from their neighborhood although no depth observations exist in this area. This might be the case when the image features of a pixel neighborhood indicate non-planar surfaces in a closed area and thus scale down regularization costs.

To resolve these problems, we aim to assign a confidence measure to each estimated depth after optimization by evaluating the covariance

    C = (Jᵀ(d*) · J(d*))⁻¹,    (14)

where the variances σ*_i can then be obtained by evaluating C(i, i).

Knowing an estimate σ*_i for d*_i, we can then set a threshold and keep only those depths with variances below that threshold.
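Since the implementation is Ceres-based (section III-F), this filter can be sketched with ceres::Covariance; the helper below checks a single 1×1 covariance block and is our assumption, not the authors' code. In practice, all covariance blocks would be computed in one call.

```cpp
#include <ceres/ceres.h>
#include <utility>
#include <vector>

// Sketch of the confidence filter based on equation (14): keep a depth only
// if its variance C(i, i) stays below a threshold.
bool depthIsReliable(ceres::Problem& problem, double* depth,
                     double variance_threshold) {
  ceres::Covariance::Options options;
  ceres::Covariance covariance(options);
  std::vector<std::pair<const double*, const double*>> blocks = {
      {depth, depth}};
  if (!covariance.Compute(blocks, &problem))
    return false;  // rank-deficient Jacobian, i.e. ill-conditioned depth
  double variance;  // 1x1 block, the diagonal entry C(i, i)
  covariance.GetCovarianceBlock(depth, depth, &variance);
  return variance < variance_threshold;
}
```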

F. Implementation

Equations (5) and (9) show that the Markov Random Field formulation can be expressed as a nonlinear least-squares problem for which we aim to find the optimal parameters, i.e. parameters that minimize the overall costs. The problem consists of many residual terms, each of them depending on either one or three parameters. In total, we add one residual term for each depth observation and approximately two residual terms for each pixel if image borders are disregarded. The resulting problem can then be solved by trust-region methods using a linear solver that efficiently exploits the sparse problem structure.

We implemented our Markov Random Field-based upsampling method as a C++ library which will be publicly available on https://github.com/fzi-forschungszentrum-informatik/mrf. It is based on Ceres Solver [10], an optimization framework used to solve large-scale, non-linear least squares problems. As residual blocks can be added one by one, Ceres itself exploits the sparse structure and uses state-of-the-art sparse linear solver libraries in its backend. Additionally, parameters can be constrained to minimum or maximum bounds, which we set to the minimum and maximum depth observed by the range sensor.
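A sketch of the solver configuration this paragraph describes, using the Ceres API; it assumes the depth parameter blocks were already registered via AddResidualBlock, and all names besides the Ceres types are ours.

```cpp
#include <ceres/ceres.h>
#include <vector>

// Sketch: bound each depth by the observed range and solve with a
// trust-region method and a sparse linear solver.
void solveDepths(ceres::Problem& problem, std::vector<double>& depths,
                 double d_min, double d_max) {
  for (double& d : depths) {
    problem.SetParameterLowerBound(&d, 0, d_min);
    problem.SetParameterUpperBound(&d, 0, d_max);
  }
  ceres::Solver::Options options;
  options.minimizer_type = ceres::TRUST_REGION;
  options.linear_solver_type = ceres::SPARSE_NORMAL_CHOLESKY;
  ceres::Solver::Summary summary;
  ceres::Solve(options, &problem, &summary);
}
```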

IV. APPLICATIONS AND EXPERIMENTS

In section IV-A, we introduce our performance metrics and show evaluation results on a photorealistic RGB-D dataset.

We then present our experimental platform for the mapping of urban environments as an application of guided upsampling and perform a qualitative evaluation in section IV-B.

For both applications, we compare our approach to the model presented in [3], where a constant distance regularization is used.

A. Photorealistic Indoor Dataset

We evaluate our approach on a subset of 150 images of the SceneNet RGB-D [11] dataset. It provides RGB-D sensor data from photorealistic synthetic indoor scenes which are semantically labelled by instances, together with a camera model.

In the dataset, ground-truth depth information d_i is available for each estimated depth d̂_i for all image coordinates i in the image I. Here, we determine the mean

    (1/|I|) Σ_{i∈I} |e_i|    (15)

Fig. 4: Mean and median absolute error in cm for different image features used (grayscale, RGB, RGB with semantics), comparing [3] and ours. Semantic features drastically improve the upsampling quality.


Fig. 5: Mean and median absolute error in cm depending on the downsampling rate in %, for [3] and ours. Top: equidistant downsampling, bottom: random downsampling.

and the median of the absolute error |e_i| = |d̂_i − d_i|. For each evaluation, we also provide the downsampling ratio

    r = |O| / |I|    (16)

which is defined as the number of 3D observations divided by the image size.
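A short sketch of the metrics (15) and (16); container and function names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Metrics { double mean_abs_error, median_abs_error, downsample_ratio; };

// Mean/median absolute error over all image coordinates, equation (15),
// and the downsampling ratio r = |O| / |I|, equation (16).
Metrics evaluate(const std::vector<double>& estimated,
                 const std::vector<double>& ground_truth,
                 std::size_t num_observations) {
  std::vector<double> abs_errors(estimated.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < estimated.size(); ++i) {
    abs_errors[i] = std::abs(estimated[i] - ground_truth[i]);
    sum += abs_errors[i];
  }
  std::sort(abs_errors.begin(), abs_errors.end());
  return {sum / abs_errors.size(), abs_errors[abs_errors.size() / 2],
          static_cast<double>(num_observations) / estimated.size()};
}
```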

Figure 4 shows the upsampling quality for different image features used. We observe that semantic features drastically improve the upsampling quality. Our method achieves a mean absolute error of about 17 mm, and an even lower median absolute error, if RGB and semantic features are used. Here, the average downsampling rate is 1%.

Figure 5 depicts the absolute errors depending on different downsampling ratios, i.e. the sparsity of 3D observations. For our evaluations, we performed equidistant as well as random downsampling. Whereas the mean absolute errors are comparable for a larger number of observations, our approach outperforms [3] for few observations. The reason might be a more realistic regularization in scenes containing a moderate amount of planar surfaces.

B. Experimental Vehicle

Figure 6 depicts the upsampling pipeline implemented for our experimental vehicle.

Our platform is equipped with a Velodyne HDL64E-S2 lidar and a high definition RGB camera. The lidar is mounted on top of the vehicle to generate range sensor data structured as a 3D point cloud. RGB images are provided by a Teledyne Dalsa Genie TS-C4096 color camera with an approximate resolution of 12 Megapixels, which is mounted externally above the windshield. Camera and laser are triggered at the same rate and the pose between laser scanner and camera can be assumed calibrated to an accuracy of ±0.5 deg. For

[Fig. 6 block diagram: cloud → normal estimation; image → feature estimation; together with the calibration, both feed guided depth upsampling, which outputs the upsampled cloud]

Fig. 6: System overview. Before upsampling, surface normal features for laser data and semantic image features are estimated. The upsampled cloud contains surface normals, image features and the depth confidence measure.

one scenario, the projection of laser points into the camera image is depicted in the top image of Figure 1.

Based on the 3D point cloud information provided, we estimate surface normals similar to [12] for each point observed. The method is based on a Principal Component Analysis of all points within a search radius around a query point. The search radius might be adapted depending on the range sensor model. We assign the eigenvector corresponding to the smallest eigenvalue as the surface normal of that query point. These surface normals may then be included as pseudo measurements in our guided depth upsampling system. An exemplary normal estimation result is depicted in the top right corner of Figure 7.
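A minimal sketch of this PCA step using Eigen; it assumes the neighborhood points were already gathered by the radius search and ignores degenerate neighborhoods.

```cpp
#include <Eigen/Dense>
#include <vector>

// PCA-based normal estimation (cf. [12]): the eigenvector of the
// neighborhood covariance with the smallest eigenvalue is the normal.
Eigen::Vector3d estimateNormal(const std::vector<Eigen::Vector3d>& neighbors) {
  Eigen::Vector3d mean = Eigen::Vector3d::Zero();
  for (const auto& p : neighbors) mean += p;
  mean /= static_cast<double>(neighbors.size());

  Eigen::Matrix3d cov = Eigen::Matrix3d::Zero();
  for (const auto& p : neighbors) cov += (p - mean) * (p - mean).transpose();

  // SelfAdjointEigenSolver returns eigenvalues in increasing order, so
  // column 0 of the eigenvectors corresponds to the smallest eigenvalue.
  Eigen::SelfAdjointEigenSolver<Eigen::Matrix3d> solver(cov);
  return solver.eigenvectors().col(0);
}
```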

Our image features

    f_i = [f_i,rgb, f_i,semantic]ᵀ    (17)

are composed of the RGB value f_i,rgb and a semantic certainty f_i,semantic. Therefore, we predict semantic classes using GoogLeNet [13] adapted as FCN-8s [14]. The network was trained on a 14-class subset of the Cityscapes dataset [15]. Apart from the arg-max class predictions, we utilize the semantic certainty

    f_i,semantic = (N·p_i − 1)/(N − 1) ∈ [0, 1],    (18)

of that class, where N is the number of classes. It is computed from the network's softmax output p_i, i.e. the output's improvement over guessing, normalized to the maximum possible improvement. For certain predictions this value becomes 1, while at class boundaries it drops to 0.

Finally, pairwise weights

    w_ij = g(f_i, f_j) = g(‖f_i,rgb − f_j,rgb‖₂) · f_i,semantic    (19)

are calculated, where a scaling function g is applied to the difference in RGB space between pixels i and j, weighted with the semantic class certainty at pixel i. The regularization weights w_i(D) as applied in equation (13) are depicted in the top left corner of Figure 7.
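Equations (18) and (19) can be sketched as follows, again with an exponential choice for g; the decay constant lambda is a hypothetical tuning parameter.

```cpp
#include <Eigen/Core>
#include <cmath>

// Semantic certainty from equation (18): 1 for confident predictions,
// 0 at class boundaries where the softmax output approaches 1/N.
double semanticCertainty(double softmax_of_predicted_class, int num_classes) {
  return (num_classes * softmax_of_predicted_class - 1.0) /
         (num_classes - 1.0);
}

// Pairwise weight from equation (19): the RGB difference is scaled by g and
// the result is weighted with the semantic certainty at pixel i.
double pairwiseWeight(const Eigen::Vector3d& rgb_i,
                      const Eigen::Vector3d& rgb_j,
                      double certainty_i, double lambda = 10.0) {
  return std::exp(-lambda * (rgb_i - rgb_j).norm()) * certainty_i;
}
```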

Based on regularization weights, camera model and 3D surface point normals, upsampling is performed. The upsampled depth image for this scenario is depicted at the bottom of Figure 1. Using the ray lookup function in equation (1), we can transform this depth image into a 3D point cloud


Fig. 7: Top left: Regularization weights determined from RGB values and semantic output certainty. Top right: Normals estimated from the sparse input cloud. Bottom left: Building front in the upsampled scene. Bottom right: Normals within the upsampled cloud are mostly smooth.

which is depicted from a shifted viewpoint at the bottom of Figure 7. We observe that, due to the soft regularization scaling by the semantic certainty, some objects in the scene are not completely separated from the environment. However, our approach accurately estimates planar surfaces such as house fronts or ground surfaces.

V. CONCLUSION

We presented an approach for guided depth upsampling of range sensor data based on a novel regularization term that preserves planar surfaces. Furthermore, we incorporate not only 2D image features into our model but also 3D surface normals. By using a novel regularization term evaluating surface planarity, we show that our method outperforms state-of-the-art methods that regularize towards constant depths. Finally, we suggest a method to filter ill-conditioned data based on estimating the covariance matrix after optimization. As the upsampling quality is sensitive to calibration and synchronization errors, we would also like to include the transformation between laser and camera in the optimization problem, which might lead to a one-shot extrinsic calibration technique.

REFERENCES

[1] J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele, "Joint bilateral upsampling," ACM Transactions on Graphics, vol. 26, no. 3, p. 96, 2007.

[2] M. Y. Liu, O. Tuzel, and Y. Taguchi, "Joint geodesic upsampling of depth images," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 169–176, 2013.

[3] J. Diebel and S. Thrun, An Application of Markov Random Fields to Range Sensing. MIT Press, 2005.

[4] Q. Chen and V. Koltun, "Fast MRF optimization with application to depth reconstruction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3914–3921.

[5] A. B. Lee, K. S. Pedersen, and D. Mumford, "The complex statistics of high-contrast patches in natural images," 2001.

[6] N. Schneider, L. Schneider, P. Pinggera, U. Franke, M. Pollefeys, and C. Stiller, "Semantically guided depth upsampling," GCPR, 2016.

[7] D. Ferstl, C. Reinbacher, R. Ranftl, and H. Bischof, "Image guided depth upsampling using anisotropic total generalized variation," 2013.

[8] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon, "High quality depth map upsampling for 3D-TOF cameras," Proceedings of the IEEE International Conference on Computer Vision, pp. 1623–1630, 2011.

[9] M. Muja and D. G. Lowe, "Scalable nearest neighbor algorithms for high dimensional data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, 2014.

[10] S. Agarwal, K. Mierle, and others, "Ceres solver," http://ceres-solver.org.

[11] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison, "SceneNet RGB-D: 5M photorealistic images of synthetic indoor trajectories with ground truth," 2016. [Online]. Available: http://arxiv.org/abs/1612.05079

[12] R. B. Rusu, "Semantic 3D object maps for everyday manipulation in human living environments," Ph.D. dissertation, Computer Science Department, Technische Universitaet Muenchen, Germany, October 2009.

[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[14] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[15] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.