
40 years of Computer Graphics in Darmstadt

A discriminative approach to perspective shape from shading in uncalibrated illumination

Stephan R. Richter*, Stefan Roth
Department of Computer Science, TU Darmstadt, Germany

Article info

Article history:
Received 21 August 2015
Received in revised form 23 August 2015
Accepted 23 August 2015

Keywords:
Shape from shading
Discriminative prediction
Silhouette cues
Synthetic training data

Abstract

Estimating surface normals from a single image alone is a challenging problem. Previous work made various simplifications and focused on special cases, such as having directional lighting, known reflectance maps, etc. This is problematic, however, as shape from shading becomes impractical outside the lab. We argue that addressing more realistic settings requires multiple shading cues to be combined as well as generalized to natural illumination. However, this requires coping with an increased complexity of the approach and more parameters to be adjusted. Starting from a novel large-scale dataset for training and analysis, we pursue a discriminative learning approach to shape from shading. Regression forests enable efficient pixel-independent prediction and fast learning. The regression trees are adapted to predicting surface normals by using von Mises–Fisher distributions in the leaves. Spatial regularity of the normals is achieved through a combination of spatial features, including texton as well as novel silhouette features. The proposed silhouette features leverage the occluding contours of the surface and yield scale-invariant context. Their benefits include computational efficiency and good generalization to unseen data. Importantly, they allow estimating the reflectance map robustly, thus addressing the uncalibrated setting. Our method can also be extended to handle perspective projection. Experiments show that our discriminative approach outperforms the state of the art on various synthetic and real-world datasets.

© 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Shape from shading – the problem of estimating surface normals from just a single image – is a heavily ill-posed problem. For this reason many simplifying assumptions have been made, such as assuming smooth surfaces, uniform albedo, a known reflectance map, or even light coming from a single directional light source in a known direction. Such strong assumptions severely limit the applicability in practice, however. Outside of controlled lab settings, less restrictive assumptions are needed. In this paper, we estimate the surface of a diffuse object with uniform albedo together with its reflectance map in uncontrolled illumination, given only a single image (Fig. 1). To recover fine surface detail, our goal is to avoid strong spatial regularization. To that end, we generalize shading cues to more realistic lighting, as well as combine them owing to their complementary strengths. While this affects the model and computational complexity, and leads to an increased number of parameters, we show how to address these challenges with a discriminative learning approach to shape from shading.

A key property of our approach is that it allows combining several shading cues. We consider (1) the color of the pixel itself, which is a strong cue in hued illumination [2], and is often exploited by using a second order approximation of Lambertian shading [3]. Our experiments (Section 9.1) show, however, that this cue becomes less reliable in the presence of correlated color channels (e.g. in near-white light) or noise. We aid disambiguation by adding (2) local context [4], which to date has been limited to the case of directional lighting. We capture the local appearance context using a texton filter bank [5], instead of using the colors in the neighborhood directly. Through cue combination in our learning framework, we achieve automatic adaptation to uncontrolled lighting and reconstruct fine surface detail. Finally, we introduce novel (3) silhouette features. While the use of silhouette information in shape from shading dates back to foundational work by Ikeuchi, Horn, and Koenderink [6,7], previous work has only constrained surface normals at the occluding contour and employed global reasoning to propagate the information to the interior [8]. We show how to generalize the occluding contour constraint to the surface interior, which yields (spatial) contour information at every pixel that is furthermore invariant to the local

Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/cag

Computers & Graphics

http://dx.doi.org/10.1016/j.cag.2015.08.004
0097-8493/© 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

* Corresponding author.
E-mail address: [email protected] (S.R. Richter).

Please cite this article as: Richter SR, Roth S. A discriminative approach to perspective shape from shading in uncalibrated illumination. Comput Graph (2015), http://dx.doi.org/10.1016/j.cag.2015.08.004



scale of the object. These silhouette features are applicable to both orthographic and perspective cameras. Moreover, our novel silhouette features also give a coarse estimate of the surface by themselves, which allows us to estimate the unknown reflectance map.

A number of challenges arise in discriminative learning for uncalibrated shape from shading: First, we require a training database of surfaces captured in the same conditions as the object to be reconstructed. It seems infeasible to capture all possible combinations of surfaces and lighting conditions, and inserting known reference objects in the scene [9] is typically impractical. For this reason, some learning approaches [10,11] create databases on-the-fly by rendering synthetic shapes once the lighting condition is known. Our approach adopts this strategy, but relies on a significantly larger database of 3D models than previous work in order to capture the variation of realistic surfaces. Second, and in contrast to [10,11], we cope with unknown illumination at test time. To that end we estimate the reflectance map from our silhouette features, and train the discriminative approach once the reflectance has been estimated. Third, (re-)training for the specific lighting condition at test time requires efficient learning and inference. Enabled by the diverse cues discussed above, we adopt regression forests for efficient pixel-independent surface normal prediction by storing von Mises–Fisher distributions in the leaves. Finally, an optional refinement step enforces integrability of the predicted surface normals. Fig. 2 depicts the entire pipeline. Note that this work is based on a previous conference publication [12], which we generalize here to the perspective case. Moreover, we provide additional detail and illustrations.

After introducing our approach, we assess the contribution of the different cues using a statistical evaluation and as components of our pipeline. Moreover, we evaluate our method both qualitatively and quantitatively on synthetic data as well as a novel real-world dataset, where it outperforms several state-of-the-art algorithms.

2. Related work

As shape from shading has an extensive literature, we only review the most relevant, recent work here and refer the reader to [13,14]. Lambertian shape from shading has historically assumed a single white point light source, presuming this to simplify the problem. Recently, it became apparent [15,2], however, that chromatic illumination not only resembles real-world environments more closely, but also yields additional constraints on shape estimation, thus substantially increasing accuracy. Nevertheless, these methods focus on the case of favorable illumination and do not address nearly monochromatic lighting. Moreover, assuming the illumination to be known limits their practical applicability.

Recent work also aimed to infer material properties or illumination alongside the shape. Oxholm and Nishino [8] exploit orientation cues present in the lighting environment to estimate the object's bidirectional reflectance distribution function (BRDF) together with its shape. They require a high-quality environment map to be captured, however. Barron and Malik [16,17] integrate shape estimation into the decomposition of a single image into its intrinsic components. Training and inference in their generative model take significant time; moreover, extending the model with additional cues is not necessarily straightforward. Since their formulation requires strong regularity assumptions, the amount of fine surface detail recovered is quite limited.

Fig. 1. Shape and reflectance estimation from a single internet image: input image [1], estimated normals and reflectance map from our method, and rendering from a novelview (from left to right).


Fig. 2. Pipeline for both training and testing. For each test image, we estimate a reflectance map to train the regression forest on synthetically generated data. Pixel-independent surface normal predictions are made using the trained regression forest. Integrability of the normal field can be enforced optionally.


Only a comparably small fraction of work addresses the more general perspective projection case [18–20]. Recent methods dealing with more complex lighting [2,15–17] are often limited to the simpler orthographic projection case.

While learning approaches to shape from shading have been investigated and often outperform their hand-tuned counterparts, they have been limited by simple shape priors and the lack of adequate training data. Relying on range images or synthetic data [11,21,22] can be problematic: while the noise of range images is a limiting factor in predicting fine-grained surface variations, synthetic datasets often fail to capture real-world environments with their variability. Khan et al. [11], for example, used synthetic data and a database of laser scans to train a Gaussian mixture model on the isophotes. Barron and Malik [16] trained their shape model on one half of the MIT intrinsic image dataset [23]. Example-based methods [10,24] have also shown reasonable qualitative results, but their quantitative performance remains unclear.

Hertzmann and Seitz [9] used objects of known geometry imaged under the same illumination to perform a photometric stereo reconstruction. Multiple images need to be captured, each of which contains a known reference object. Our approach, in contrast, only requires a single image of an unknown object and uses “example geometry” only to synthesize our training data.

Our approach relates to geodesic forests [25], as both use a regression tree-based predictor. Both enable pixel-independent predictions by incorporating spatial information directly into the tree-based approach. Kontschieder et al. [25] address discrete labeling tasks, such as semantic segmentation, involve a complex entanglement of prediction stages, and use generalized geodesic distances as spatial features. We instead predict the normal direction, i.e. a two-dimensional continuous variable, employ newly proposed silhouette features, and rely on only a single stage.

3. Overview

Fig. 2 shows an overview of our discriminative approach. Given a test image of a diffuse object with uniform albedo, taken under orthographic or perspective projection, we begin by extracting color, textons, and the proposed silhouette features (Section 6). Our silhouette features additionally enable us to estimate the reflectance map (Section 7), with which we render patches of objects from our database of example geometries in turn (Section 4). The training set is obtained as the surface normal of the central pixel of each synthetic patch; the features are the same as for testing. After training the regression forest (Section 5), it allows predicting surface normals independently for each pixel of the test image from the extracted features. Optionally, we enforce integrability of the normal field. The prediction can be adapted to the perspective projection case by rotating each normal according to its position in the image (Section 8).

4. Data for analysis and training

High-quality data for training models of surface variation has been scarce. The situation has somewhat improved with the advent of low-cost depth sensors, but range images are typically too noisy to exhibit and allow learning fine-grained structures. Synthetic data have been generated as an alternative, often resembling simple geometric shapes like cylinders or blobs [2,10,21,22]. However, the underlying parametric models often do not capture real-world surface variations, like self-occlusions or fine detail, such as in wrinkles of clothing.

We instead leverage a dataset of shapes from artists [26], which yields the advantages of both range maps and synthetic data: being created by modeling experts, the shapes resemble real-world objects with parts of varying size and complex phenomena, e.g. self-occlusions. Moreover, rendering many 3D models in different orientations allows obtaining very large training sets. The 3D models cover a range of categories, mainly with an organic shape, such as humans and animals (Fig. 3).

Consisting of 100 objects, our dataset is much larger and more varied than those considered in other learning approaches to shape from shading. For example, only 6 realistic surfaces of the same object class (faces) were used in [11], and 10 objects obtained by taking half of the MIT intrinsic image dataset were used for training (the other half for testing) by [17]. Although the

Fig. 3. Sample objects from the artist-created dataset. We pre-render the surface normals of all objects from several viewing positions (top row). For training we render the normal maps using our estimated reflectance map. The bottom row shows re-renderings with an exemplary illumination from [2].


dataset on which we train is qualitatively rather different from any of the test datasets, we obtain state-of-the-art performance across a variety of settings (Section 9).

5. Discriminative prediction of normals

Building on the success of decision and regression tree-based methods in various applications, including human pose estimation [27], image restoration [28], semantic labeling [25], and others, we here use regression forests for discriminative shape from shading. For now we only outline the basic learning approach; the features that serve as input will be discussed later. Regression forests are very useful for our purposes as both learning and prediction are computationally efficient. This is crucial, since learning and prediction are carried out at test time once the reflectance map has been estimated (Fig. 2). Training the different trees of the forest can proceed in parallel. As the prediction of surface normals of objects is done independently for each pixel, this step is efficient as well as parallelizable. Since the predicted normal field is not necessarily integrable, we optionally enforce integrability in a post-processing step.

Basic regression forest model: Regression forests average the output of several regression trees to improve robustness. Each tree yields a prediction of an output variable by traversing a path that depends on the input features [29]. A split criterion at each (non-leaf) node guides the traversal into either the left or right branch, until a leaf node is reached. As is most common, thresholds on the input features are used as split criteria. Each leaf node stores a probability distribution over the output variable, which ultimately enables the prediction.
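The traversal just described can be sketched in a few lines. This is a minimal Python illustration (not the authors' code); the `Node` class and the toy single-split tree are hypothetical, but the routing logic follows the threshold-split scheme above, with von Mises–Fisher parameters stored at the leaves.

```python
import numpy as np

class Node:
    """Node of a regression tree; leaf nodes store a vMF distribution."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, leaf=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.leaf = leaf  # distribution parameters, None for interior nodes

def traverse(node, features):
    """Route a feature vector to a leaf via threshold splits and return
    the stored distribution parameters."""
    while node.leaf is None:
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.leaf

# toy tree with a single split on feature 0 at threshold 0.5
tree = Node(feature=0, threshold=0.5,
            left=Node(leaf={"mu": np.array([0.0, 0.0, 1.0]), "kappa": 50.0}),
            right=Node(leaf={"mu": np.array([1.0, 0.0, 0.0]), "kappa": 20.0}))
leaf_left = traverse(tree, np.array([0.3, 0.9]))
leaf_right = traverse(tree, np.array([0.8, 0.1]))
```

A forest would average such per-tree predictions over several independently trained trees.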

Normal vectors as output: Predicting normal vectors using regression trees incurs some additional challenges. First, the output variable is continuous, which is typically achieved by storing the average output of all training samples that fall in that particular leaf. This is equivalent to storing the mean of a multivariate Gaussian in each leaf, which is a reasonable assumption when the posterior is sufficiently close to a Gaussian distribution in $\mathbb{R}^d$. Second, surface normals are actually distributed on a 3-dimensional unit hemisphere, which means that a Gaussian assumption does not appear appropriate. In our approach, we address this by modeling the output distribution in each leaf as a von Mises–Fisher distribution [30]; we store the mean and dispersion parameters.

The von Mises–Fisher distribution models unit vectors on a d-dimensional hypersphere. For our case of d = 3, the probability density function of a normal $\mathbf{n} \in S^2$ is given as
$$p(\mathbf{n}; \boldsymbol{\mu}, \kappa) = \frac{\kappa}{2\pi\left(e^{\kappa} - e^{-\kappa}\right)} \exp\!\left(\kappa \boldsymbol{\mu}^T \mathbf{n}\right), \qquad (1)$$
where $\boldsymbol{\mu}$, $\lVert\boldsymbol{\mu}\rVert = 1$, is the mean vector and $\kappa \in \mathbb{R}$ the dispersion (analogous to the precision of a Gaussian).
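As a sanity check on Eq. (1), the density for d = 3 can be evaluated directly; in the limit $\kappa \to 0$ it approaches the uniform density $1/(4\pi)$ on the sphere. A minimal numpy sketch (the function name `vmf_pdf` is ours, not from the paper):

```python
import numpy as np

def vmf_pdf(n, mu, kappa):
    """von Mises-Fisher density on S^2, cf. Eq. (1):
    p(n; mu, kappa) = kappa / (2*pi*(e^kappa - e^-kappa)) * exp(kappa * mu^T n)."""
    norm = kappa / (2.0 * np.pi * (np.exp(kappa) - np.exp(-kappa)))
    return norm * np.exp(kappa * float(mu.dot(n)))

mu = np.array([0.0, 0.0, 1.0])
p_mean = vmf_pdf(mu, mu, 5.0)                          # density at the mean direction
p_side = vmf_pdf(np.array([1.0, 0.0, 0.0]), mu, 5.0)   # 90 degrees away from the mean
p_uniform = vmf_pdf(mu, mu, 1e-8)                      # near-uniform limit, ~1/(4*pi)
```

Higher $\kappa$ concentrates the mass around $\boldsymbol{\mu}$, which is exactly what a confident leaf should express.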

Learning: For the most part, learning proceeds as usual for regression forests. To train each tree of the forest, a randomly chosen 90% subset of the training data is used. The split criterion for each node is chosen from a random subset of features. In particular, we choose the feature that minimizes the aggregated entropy of the new child nodes compared to the entropy of their parent node.
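For concreteness, the split score can be sketched as the size-weighted differential entropy of the children under fitted vMF models. This is our illustrative sketch, not the authors' implementation: it assumes the standard closed-form entropy of a 3-d vMF distribution, $H = -\log c(\kappa) - \kappa A(\kappa)$ with $A(\kappa) = \coth\kappa - 1/\kappa$, and the dispersion estimate of Eq. (2).

```python
import numpy as np

def vmf_entropy(kappa):
    """Differential entropy of a 3-d von Mises-Fisher distribution:
    H = -log c(kappa) - kappa * A(kappa), A(kappa) = coth(kappa) - 1/kappa."""
    c = kappa / (2.0 * np.pi * (np.exp(kappa) - np.exp(-kappa)))
    A = 1.0 / np.tanh(kappa) - 1.0 / kappa
    return -np.log(c) - kappa * A

def kappa_of(normals):
    """Approximate ML dispersion of a set of unit normals (cf. Eq. (2))."""
    R = np.linalg.norm(normals.sum(axis=0)) / len(normals)
    return (3.0 * R - R**3) / (1.0 - R**2)

def split_cost(left, right):
    """Size-weighted child entropy; training picks the split minimizing this."""
    nl, nr = len(left), len(right)
    hl, hr = vmf_entropy(kappa_of(left)), vmf_entropy(kappa_of(right))
    return (nl * hl + nr * hr) / (nl + nr)

# four unit normals tilted slightly away from the z-axis (a concentrated set)
tight = np.array([[0.1, 0.0, 1.0], [-0.1, 0.0, 1.0],
                  [0.0, 0.1, 1.0], [0.0, -0.1, 1.0]])
tight /= np.linalg.norm(tight, axis=1, keepdims=True)
```

A split that separates training normals into two concentrated (high-$\kappa$) groups thus scores better than one producing diffuse children.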

We estimate the von Mises–Fisher parameters using the approach of Dhillon and Sra [31]. They show that the maximum likelihood estimate can be approximated well by
$$\hat{\boldsymbol{\mu}} = \frac{\mathbf{r}}{\lVert\mathbf{r}\rVert} \quad \text{and} \quad \hat{\kappa} = \frac{3\bar{R} - \bar{R}^3}{1 - \bar{R}^2}. \qquad (2)$$
Here, $\mathbf{r} = \sum_{i=1}^{N} \mathbf{n}_i$ is the resultant vector and $\bar{R} = \lVert\mathbf{r}\rVert / N$ is the average resultant length.

Most previous learning approaches to shape from shading require re-training the model on each unseen reflectance map [10,11]; our approach does so as well. To carry this out efficiently, we randomly sample 5×5 normal patches from our geometry dataset (Section 4), precompute the silhouette features ahead of time, as they are independent from the lighting condition, and store them alongside the normals. When a new illumination condition occurs, we render training images from the normals, compute the remaining features, and train the forests. Rendering and feature extraction are efficient and take below one second for the entire dataset; training the forest takes approximately 90 s, and inference below a second. It is important to note that unlike previous learning approaches [10,11], we do not require the lighting at test time to be known, but instead estimate it as well (Section 7).

For each of the 100 models in our training dataset, we obtain 10 training images by placing an orthographic camera looking at the model center from random positions. We evaluated how many training patches are needed on a validation set (different from the models used at test time) and found that 100–200 samples per image yield the best trade-off between performance and computational effort for training.

Integrability: The regression forests independently predict the surface normals of each pixel, without considering neighboring predictions. In the absence of any spatial regularization, the surface predictions are susceptible to image noise; on the other hand, penalizing discontinuities usually results in oversmoothed surfaces and a loss of detail. For this reason we fuse pixel-independent predictions only by enforcing integrability, as this is a necessity for obtaining a valid surface. Integrability requires the derivatives of the surface normal to fulfill
$$\frac{\partial^2 z}{\partial u \, \partial v} = \frac{\partial^2 z}{\partial v \, \partial u}, \qquad (3)$$
where the depth z depends on the image coordinates u, v. Several approaches have been proposed to penalize violations of Eq. (3), e.g. [32,33]; we consider and evaluate different choices in Section 9.2.

6. (Spatial) features

Shape from shading, like other pixel labeling/prediction problems, benefits from taking the spatial regularity of the output into account, in other words modeling the expected smoothness of the recovered surface. That is, neighboring predictions should account for the fact that their normals are often very similar. Regression forests [29], which we use here, perform pixelwise independent predictions and thus do not necessarily model such regularities well. The recent regression tree fields [28] address this by estimating the parameters of a Gaussian random field instead of the output variables; using maximum a-posteriori (MAP) estimation in the resulting conditional random field yields the final prediction. Unfortunately, the MAP estimation step creates a computational overhead, which also renders training inefficient. Our method is inspired by geodesic forests [25] instead, which include a geodesic distance feature to circumvent explicit modeling of more global dependencies of the output; we introduce spatial features that encourage spatial consistency despite pixel-independent prediction.

Basic color feature: Regression trees excel when the desired output strongly correlates with the input features [29], because this allows for splits that reduce the entropy well. It is in turn a fundamental assumption of shape from shading that strong correlations between surface normals and color exist, as their relation can be described by the rendering equation. The common


Lambertian case is well described by a second order approximation [3]: $I_c = \hat{\mathbf{n}}^T M_c \hat{\mathbf{n}}$ at each point on the surface, where $I_c$ is the intensity of color channel c, $\hat{\mathbf{n}}$ is the surface normal in homogeneous coordinates, and $M_c$ is a symmetric 4×4 matrix representing the reflectance map for that color channel. A single input image thus puts three nonlinear constraints onto the two unknowns of the surface normal at each pixel. Under ideal circumstances, the reflectance maps of the individual color channels are independent from each other. In that case, they produce small isophotes (areas with the same luminance), which turns the shape from shading problem into photometric stereo such that a surface can be recovered very well with just the color [2]. However, if the reflectance maps and corresponding constraints are more correlated, e.g. in nearly white light, large isophotes cause many surface patches to explain the same color. Moreover, image noise weakens the correlation significantly. Hence, to avoid making strong assumptions about the type of lighting present, we not only consider the color, but also look for spatial features that depend on a neighborhood of pixels as well as the object contour, and are able to reduce the remaining ambiguity, even in the absence of an explicit spatial model.
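The quadratic form $I_c = \hat{\mathbf{n}}^T M_c \hat{\mathbf{n}}$ is straightforward to evaluate. In the sketch below, the matrix `M` is a hypothetical reflectance map of our own making (a constant ambient term plus a linear term for a light along +z), purely for illustration; in practice negative shading values would be clamped to zero.

```python
import numpy as np

def shade(normal, M):
    """Second-order approximation of Lambertian shading: I_c = n_hat^T M_c n_hat,
    with n_hat the normal in homogeneous coordinates and M_c a symmetric 4x4
    reflectance-map matrix for one color channel."""
    n_hat = np.append(normal, 1.0)
    return n_hat @ M @ n_hat

# hypothetical reflectance matrix: ambient term plus a linear n_z term
M = np.zeros((4, 4))
M[3, 3] = 0.2            # ambient contribution
M[2, 3] = M[3, 2] = 0.4  # linear term in n_z, split symmetrically

i_front = shade(np.array([0.0, 0.0, 1.0]), M)  # normal facing the light
i_side = shade(np.array([1.0, 0.0, 0.0]), M)   # normal orthogonal to the light
```

With white light all three channel matrices $M_c$ would be nearly identical, which is exactly the correlated case where this cue alone becomes ambiguous.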

Texton features: To capture how the local variation of the input image correlates with the output, we first compute features from a texton filter bank [5]. The filter bank contains Gaussians, their derivatives, as well as Laplacians at multiple scales, and has been used in many areas, such as material classification, segmentation, and recognition. While having been used in shape from texture [34], to our knowledge textons have not been considered in shape from shading. Before filtering, we convert the image to the L*a*b* opponent color space. Gaussian filters are computed on all channels, while the remaining filters are applied only to the luminance channel.

As we will see below, texton features provide local context that strongly boosts accuracy compared to using color alone. Embedding them in a discriminative learning framework allows for adaptation to various types of surface discontinuities instead of simply assuming smoothness, as has been common in shape from shading. Magnifying the local context by enlarging the filters can lead to better adaptation to various surface types and also faster convergence to an integrable surface later. It, however, requires a much larger dataset to capture fine detail and achieve similar generalization. In our experiments, we used filters that match the normal patches in size (5×5).
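A minimal numpy stand-in for such a filter bank (Gaussians, their x/y-derivatives, and Laplacians at two scales) can be built as follows; the kernel size and scales are our assumptions for illustration, not the exact configuration of [5]:

```python
import numpy as np

def gaussian_kernel(sigma, size=7):
    """Normalized 2-d Gaussian kernel of the given standard deviation."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum()

def filter_bank(sigmas=(1.0, 2.0)):
    """Gaussians, their first derivatives, and Laplacians at multiple scales,
    in the spirit of a texton filter bank."""
    kernels = []
    for s in sigmas:
        g = gaussian_kernel(s)
        gy, gx = np.gradient(g)  # first derivatives along y and x
        lap = np.gradient(np.gradient(g, axis=0), axis=0) \
            + np.gradient(np.gradient(g, axis=1), axis=1)  # discrete Laplacian
        kernels += [g, gx, gy, lap]
    return kernels

bank = filter_bank()
```

Each kernel would be convolved with the image (Gaussians on all L*a*b* channels, the rest on luminance), and the filter responses at a pixel form its texton feature vector.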

6.1. Silhouette features

Projected onto the image plane, normals are not distributedequally across the object. Most objects are roughly convex, orcomposed of convex parts. Thus, normals at the center of an objecttend to face the viewer and normals at the occlusion boundary areperpendicular to the viewing direction [6,7]. Consequently, theprobability of a normal facing a certain direction given its positionwithin the projection is non-uniform. Previous work has exploited

this fact only by placing priors on the normals at the occlusionboundary and propagating information to the interior with asmoothness prior [8]. As both priors do not consider scale, bal-ancing them can be challenging. This becomes especially proble-matic if different scales are present within the same object (e.g. thetail vs. head of the dinosaur in Fig. 4). Here, we consider a moreexplicit relation between the silhouette and the normal, whichautomatically adapts to scale.

To that end, let us first examine the correlation between a point's surface orientation and its position within the object's projection onto the image plane. Consider the object in Fig. 4. As expected [6,7], and as can be seen in the visualization of the out-of-plane component (d) (white – toward the viewer, black – away), normals are orthogonal to the viewing direction at the silhouette. Moving inwards, the normals change until they finally face the viewer. If we now look at the distance of an interior point to the silhouette (b), we can see an apparent correlation. Similarly, we can see an apparent correlation between the direction to the nearest point on the silhouette (c) and the image-plane component of the normal (e). We now formalize and analyze this relationship.

We define the absolute distance of an interior point $p$ to the contour as

$$d_{\mathrm{abs}}(p) = \min_{b \in B} \, \lVert p - b \rVert, \tag{4}$$

where $B$ denotes the set of points on the occlusion boundary. The absolute distance, however, depends on the scale of the object. Normalizing it by the length of the shortest line segment that passes through $p$ and connects the boundary and the medial axis of the object makes it scale-invariant. The medial axis is the set of all points that have two closest points on the boundary. If $M$ denotes the medial axis and $\overline{pb}$ the (infinite) line that passes through $p$ and $b$, we define the relative distance to the silhouette as

$$d_{\mathrm{rel}}(p) = \min_{b \in B} \; \min_{m \in M \cap \overline{pb}} \; \frac{\lVert p - b \rVert}{\lVert m - b \rVert}, \tag{5}$$

i.e. the relative distance is normalized by the minimal line that passes through $p$ and connects medial axis and contour. In practice, we approximate Eq. (5) using two distance transforms, $d_B$ for the contour set and $d_M$ for the medial axis. We thus define the scale-invariant boundary distance

$$d'_{\mathrm{rel}}(p) = \frac{d_B(p)}{d_B(p) + d_M(p)}. \tag{6}$$

Finally, we define the direction to the contour as

$$\beta(p) = -\frac{\nabla d'_{\mathrm{rel}}(p)}{\lVert \nabla d'_{\mathrm{rel}}(p) \rVert}. \tag{7}$$
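The two-distance-transform approximation of Eqs. (6) and (7) is straightforward to sketch in Python. The function name, the toy disk, and the hand-placed medial axis below are our own illustrative assumptions; in general the medial axis would come from a skeletonization routine rather than being placed by hand.

```python
import numpy as np
from scipy import ndimage

def silhouette_features(mask, medial_axis):
    """Scale-invariant boundary distance (Eq. 6) and direction-to-contour
    angle (Eq. 7), approximated with two Euclidean distance transforms."""
    d_b = ndimage.distance_transform_edt(mask)          # distance to contour
    d_m = ndimage.distance_transform_edt(~medial_axis)  # distance to medial axis
    d_rel = np.where(mask, d_b / np.maximum(d_b + d_m, 1e-9), 0.0)
    gy, gx = np.gradient(d_rel)                         # beta = -grad / |grad|
    norm = np.maximum(np.hypot(gx, gy), 1e-9)
    beta = np.arctan2(-gy / norm, -gx / norm)           # as an angle in [-pi, pi]
    return d_rel, beta

# toy example: a disk, whose medial axis is (approximately) its centre pixel
h = w = 65
yy, xx = np.mgrid[:h, :w]
mask = (yy - 32) ** 2 + (xx - 32) ** 2 <= 25 ** 2
axis = np.zeros_like(mask)
axis[32, 32] = True          # in general: use a proper skeletonization routine
d_rel, beta = silhouette_features(mask, axis)
```

On the disk, `d_rel` is 1 at the centre and falls toward 0 at the silhouette, while `beta` points from each interior pixel toward the nearest contour point, as intended.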

Statistical analysis: We analyze the correlation of our silhouette features and surface orientation on three different datasets as shown in Fig. 5: synthetic data (“blobby shapes”) [2] in the first column, real-world data from the MIT intrinsic image dataset [23] in the second column, and the set of artist-crafted 3D models used

Fig. 4. Objects that are convex or composed of convex parts (a) exhibit a strong correlation between silhouette-based features like relative distance (b) or direction to the silhouette (c) and out-of-plane (d) and in-plane components (e) of their surface normals.

S.R. Richter, S. Roth / Computers & Graphics ∎ (∎∎∎∎) ∎∎∎–∎∎∎ 5

Please cite this article as: Richter SR, Roth S. A discriminative approach to perspective shape from shading in uncalibrated illumination.Comput Graph (2015), http://dx.doi.org/10.1016/j.cag.2015.08.004i


for training in the third column. For all datasets we calculated the relative distance and direction to the silhouette and graphed them against the out-of-plane and the image-plane components of the surface normals. We consistently observe a strong correlation between the plotted variables. While the relation between the direction to the silhouette and the image-plane component is strongly linear, the relative distance relates roughly quadratically to the out-of-plane component. This strong correlation clearly suggests that the proposed silhouette features are helpful for reconstructing a surface from a single image. In Section 9.1 we investigate the importance of our input features for surface prediction.

7. Reflectance map estimation

All observable reflectance values of an object with uniform albedo can be mapped one-to-one onto a hemisphere, assuming distant light sources and no self-reflections or occlusions. Moreover, to approximate a Lambertian reflectance map well, only 9 spherical harmonics coefficients per color channel suffice [3]. Thus, to calibrate against a reflectance map, [2] provided each object of interest with a calibration sphere of the same BRDF. Barron and Malik [16] obviated the sphere and jointly recovered the reflectance map and the surface with a generative model.

Our discriminative approach reconstructs the reflectance map directly from an initial surface estimate that we derive solely from the object silhouette. In particular, we map the input image to a sphere according to our silhouette features (Fig. 6, left). The features define a mapping from a pixel $p$ to polar coordinates on a unit sphere:

$$\sigma(p): \Omega \to S^2, \qquad \sigma(p) = \left( \cos^{-1} d'_{\mathrm{rel}}(p),\; \beta(p) \right). \tag{8}$$

However, since the mapping from pixels to polar coordinates is many-to-one, we average the colors of points with similar distance and direction to the silhouette. In particular, the color at a polar coordinate (i.e. normal or lighting direction) $s \in S^2$ is obtained by averaging the colors of those input pixels $p$ whose mapping is a k-nearest neighbor of $s$:

$$C(s) = \frac{1}{k} \sum_{i=1}^{k} I(P_i), \qquad P = \{\, p \mid \sigma(p) \in k\text{-NN}(s) \,\}, \tag{9}$$

where $I(P_i)$ is the observed color. The number of neighbors $k$ considered is adjusted for the size of the object. This acts as a low-pass filter, effectively reducing estimation errors from incorrectly mapped points.
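The mapping and averaging of Eqs. (8) and (9) might look as follows in Python; the toy data, the choice k = 8, and the helper name `reflectance_sphere` are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def reflectance_sphere(colors, d_rel, beta, sphere_dirs, k=8):
    """Eq. (9): the colour C(s) at a sphere direction s is the mean colour of
    the k input pixels whose silhouette mapping sigma(p) lies closest to s."""
    theta = np.arccos(np.clip(d_rel, 0.0, 1.0))       # polar angle from Eq. (8)
    mapped = np.stack([np.sin(theta) * np.cos(beta),  # sigma(p) as unit vectors
                       np.sin(theta) * np.sin(beta),
                       np.cos(theta)], axis=-1)
    _, idx = cKDTree(mapped).query(sphere_dirs, k=k)  # k-NN per sphere direction
    return colors[idx].mean(axis=1)                   # low-pass averaging

# toy data: each pixel's grey value equals its out-of-plane component d_rel
rng = np.random.default_rng(0)
d_rel = rng.random(500)
beta = rng.uniform(-np.pi, np.pi, 500)
colors = np.stack([d_rel] * 3, axis=-1)
pole = np.array([[0.0, 0.0, 1.0]])                    # direction facing the viewer
c = reflectance_sphere(colors, d_rel, beta, pole)
```

On this toy data, the colour recovered at the pole is close to 1, since only pixels with large `d_rel` map near it; this is exactly the averaging behaviour the text describes.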

We thus obtain a robust approximation of a calibration sphere without actually having one. While the silhouette features alone yield only a coarse estimate of the surface normals, we only need to recover a small number of spherical harmonics coefficients of the reflectance (Fig. 6, second to right column), which can be done in closed form. We found that adjusting the mean and standard deviation of the reflectance map to match the input image (effectively matching brightness and contrast) improves the final estimate (Fig. 6, right column).
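The closed-form recovery amounts to a linear least-squares fit against the 9 lowest-order spherical harmonics. A minimal sketch, with unnormalized basis functions (the usual constant factors are absorbed into the coefficients during fitting, so this is not the exact basis of [3]):

```python
import numpy as np

def sh_basis(n):
    """The 9 lowest-order real spherical harmonics, evaluated at unit
    normals n (N x 3); normalization constants omitted for fitting."""
    x, y, z = n[:, 0], n[:, 1], n[:, 2]
    return np.stack([np.ones_like(x), y, z, x,
                     x * y, y * z, 3.0 * z ** 2 - 1.0, x * z,
                     x ** 2 - y ** 2], axis=-1)

def fit_reflectance(normals, intensities):
    """Closed-form (linear least-squares) recovery of 9 coefficients per channel."""
    coeffs, *_ = np.linalg.lstsq(sh_basis(normals), intensities, rcond=None)
    return coeffs

# round trip: shading generated from known coefficients is recovered exactly
rng = np.random.default_rng(1)
normals = rng.normal(size=(200, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
true_coeffs = rng.normal(size=(9, 3))
est_coeffs = fit_reflectance(normals, sh_basis(normals) @ true_coeffs)
```

In our setting, `normals` would be the coarse silhouette-based estimate and `intensities` the sphere colours C(s) from Eq. (9).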

Note that certain objects do not fulfill our assumption of being composed of convex parts. A bowl seen from above, for example, will cause problems, though likely also for other algorithms that estimate shape and reflectance. Most objects, however, contain limited concavities, whose effect on the estimated reflectance map is generally compensated by other convexities.

Fig. 5. The correlation between pixel position and surface orientation on multiple datasets. We plot the direction to the contour vs. the image-plane component of the surface normal, and the relative distance vs. the out-of-plane component of the normal, for three datasets each (from left to right): blobby shapes [2], MIT intrinsic images [23], and our training dataset from Section 4.

Fig. 6. We estimate the reflectance map by mapping the input image to a hemisphere and approximating it by 9 spherical harmonics coefficients per color channel. Matching brightness and contrast to the input image yields our final estimate. Above we show re-rendered ground truth surfaces for comparison.


8. Adaption to perspective projection

To adapt both the estimation of surface normals and the reflectance map to the perspective case, we revisit the idea of occluding contours: at an occluding contour, surface normals are perpendicular to the direction $d$ to the camera center. Thus, at an object's center, we expect surface normals to be parallel to $d$. For orthographic projection, the camera center is thought to be at an infinite distance, resulting in the same vector $d = (0, 0, 1)^T$ for each pixel. For perspective projection, however, the direction to the camera depends on the position $(u(p), v(p))$ on the image plane and on the focal length $f$.

To adapt the whole surface estimate to perspective projection, we rotate each normal by a rotation matrix $R_d$ that maps the vector $(0, 0, 1)^T$ to $d$ and keeps the up-direction. This is equivalent to subsequently rotating around the y-axis by $R_\phi(p)$ and the x-axis by $R_\theta(p)$, where

$$R_\phi(p) = \begin{pmatrix} \cos\phi(p) & 0 & \sin\phi(p) \\ 0 & 1 & 0 \\ -\sin\phi(p) & 0 & \cos\phi(p) \end{pmatrix}, \qquad R_\theta(p) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta(p) & \sin\theta(p) \\ 0 & -\sin\theta(p) & \cos\theta(p) \end{pmatrix}$$

and

$$\phi(p) = \tan^{-1} \frac{u(p)}{f}, \qquad \theta(p) = \tan^{-1} \frac{v(p)}{f}.$$
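In code, the per-pixel rotation might look as follows. The composition order (rotating about the y-axis first, then the x-axis) and the image-coordinate sign conventions are assumptions that depend on the chosen camera model.

```python
import numpy as np

def perspective_rotate(normals, u, v, f):
    """Rotate per-pixel normals to account for the viewing direction under
    perspective projection (Section 8): n' = R_theta(p) @ R_phi(p) @ n."""
    out = np.empty_like(normals, dtype=float)
    for i, (n, ui, vi) in enumerate(zip(normals, u, v)):
        p = np.arctan2(ui, f)                      # phi(p)   = atan(u(p) / f)
        t = np.arctan2(vi, f)                      # theta(p) = atan(v(p) / f)
        R_phi = np.array([[np.cos(p), 0.0, np.sin(p)],
                          [0.0, 1.0, 0.0],
                          [-np.sin(p), 0.0, np.cos(p)]])
        R_theta = np.array([[1.0, 0.0, 0.0],
                            [0.0, np.cos(t), np.sin(t)],
                            [0.0, -np.sin(t), np.cos(t)]])
        out[i] = R_theta @ R_phi @ n
    return out

# at the principal point nothing changes; at u = f the viewer-facing normal
# is rotated 45 degrees about the y-axis
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
rotated = perspective_rotate(normals, u=np.array([0.0, 1.0]),
                             v=np.array([0.0, 0.0]), f=1.0)
```

Note that for the pixel at u = f, the rotated viewer-facing normal aligns with the (normalized) direction (1, 0, 1), i.e. with d for that pixel.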

Analogously we adapt the reflectance map estimate. If the object is not centered at the principal point, the calibration sphere we construct from it is not centered either. Hence, instead of mapping the input image to a hemisphere as in the orthographic case, we map it to a rotated hemisphere. However, instead of rotating each point on the hemisphere differently, we consider a single rotation matrix for the whole hemisphere. We obtain the rotation matrix for the centroid of the object on the image plane, as this averages the pixel positions. From the rotated hemisphere we again recover the spherical harmonics coefficients.

9. Experiments

9.1. Feature evaluation

To gain insight into the importance of our input features, we analyze their qualitative (Fig. 7) and quantitative (Table 1a) effects on surface normal prediction. We investigate the unary features on the surfaces from the training part of the MIT intrinsic image dataset, rendered under all illuminations from [2]. In contrast to the illuminations used by [16], these are not sampled from a learned prior, but were captured in natural environments and include nearly white illumination. To simulate image noise, we added Gaussian noise (σ = 0.001) to the rendered images and thresholded values below 0 and above 1. The training dataset is the same as described in Section 4. In Table 1a we show the results, evaluated using the median angular error (MAE) and the mean-squared error of the normal (nMSE, see [16]).

The basic color feature (RGB) acts as our baseline. The silhouette-based features (+Silh) increase the overall performance. In particular, they excel at round objects or those that are composed of convex parts with a curved surface. Consequently, they boost performance on objects fulfilling these assumptions, but only marginally on planar objects or in the presence of self-occlusions. In colorful illuminations (Fig. 7, bottom) the silhouette features can be partially deceptive, but overall they clearly improve performance. Adding the texton filters (+Tex) particularly improves estimates under chromatic illumination, indicating that the captured spatial information eliminates many ambiguities; yet even in white illumination they help. The best overall performance stems from the combination of all features (+Silh+Tex) and is robust w.r.t. the illumination conditions.

9.2. Integrability

We investigate four variants of enforcing the integrability constraint upon the estimated surface in Table 1b. We start with

Fig. 7. Importance of unary features. For the images in the first column (white illumination, top; colored, bottom), we estimate surfaces using only subsets of features; see text for details. The remaining columns depict the angular error per pixel and its median below. (Panel medians: RGB 25.4°/13.8°; RGB+Silh 16.9°/15.8°; RGB+Tex 18.1°/5.8°; RGB+Silh+Tex 15.4°/6.1°; ortho-l2 14.9°/6.1°.)

Table 1. Influence of unary features and integrability constraints. The run-times include training, inference, and post-processing. (a) Results for unary features. (b) Results for enforcing integrability.

(a)
Features          MAE      nMSE
RGB               13.77°   0.179
RGB+Silh          10.90°   0.130
RGB+Tex            7.92°   0.097
RGB+Silh+Tex       7.09°   0.069

(b)
Method                    MAE     nMSE    Run-time
No integrability          7.09°   0.069     89.0 s
l2, orthographic          7.33°   0.057     98.5 s
l2, orthographic, conv.   6.46°   0.056   1172.8 s
l1, orthographic          7.34°   0.058     98.0 s
l2, perspective           7.42°   0.059     97.2 s


an l2-penalty on violations of Eq. (3). Next, we restrict the method to form surface normals from a convex combination of samples drawn from the leaf distributions at each pixel. We further enforce integrability with an l1-penalty [33], and finally an l2-penalty under perspective projection following [32]. The unary predictions are a reasonable baseline, as the surface normals can be reconstructed with good accuracy even without any post-processing. The performance may even decrease after post-processing under synthetic illumination. For real images, which potentially violate the Lambertian assumptions, however, we observed significant improvements. For the objects in the MIT dataset, which were presumably imaged with a long focal length, the benefits of a perspective approach are negligible. Although the convex combination of samples (conv.) clearly outperformed all other approaches, we rely on the simple l2-penalty in further experiments due to its much lower run-time.
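As an illustration of an l2 integrability step, the classic Frankot–Chellappa projection finds the closest integrable surface to a gradient field in closed form via the Fourier transform. This is a generic sketch of that well-known technique under periodic-boundary assumptions, not the paper's exact formulation of Eq. (3).

```python
import numpy as np

def integrate_l2(p, q):
    """Least-squares (l2) integrable surface z from gradient fields
    p = dz/dx, q = dz/dy (Frankot-Chellappa, Fourier domain)."""
    h, w = p.shape
    u, v = np.meshgrid(2 * np.pi * np.fft.fftfreq(w),
                       2 * np.pi * np.fft.fftfreq(h))
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0                       # avoid division by zero at DC
    Z = (-1j * u * np.fft.fft2(p) - 1j * v * np.fft.fft2(q)) / denom
    Z[0, 0] = 0.0                           # absolute height is unconstrained
    return np.real(np.fft.ifft2(Z))

# round-trip check on a smooth periodic surface
h = w = 64
yy, xx = np.mgrid[:h, :w].astype(float)
z = np.sin(2 * np.pi * xx / w) + np.cos(2 * np.pi * yy / h)
p = (2 * np.pi / w) * np.cos(2 * np.pi * xx / w)      # dz/dx
q = -(2 * np.pi / h) * np.sin(2 * np.pi * yy / h)     # dz/dy
z_rec = integrate_l2(p, q)
```

The gradients themselves would come from the predicted normals, e.g. p = −n₁/n₃ and q = −n₂/n₃ for a normal (n₁, n₂, n₃).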

9.3. Comparison with other methods

We quantitatively compare to two state-of-the-art methods on three different datasets, two of which were contributed by the respective methods, and one of which we recorded ourselves.

First, we compare to the shape-from-shading component of the SIRFS method of Barron and Malik [16] (termed “Cross scale”). We evaluate using the source code provided by the authors, both under unknown and given illumination. In unknown illumination, we additionally record the accuracy of the estimated reflectance map (lMSE, see [16]). In Table 2a we present results on the dataset of [16], a variant of the MIT intrinsic image dataset [23] re-rendered under chromatic illumination.

The method of Xiong et al. [4] (termed “Local context”) is our second baseline. It exploits local shading context to predict shape from shading under known illumination. Xiong et al. provide a dataset of 10 objects captured under white directional illumination and also evaluated the shape-from-shading component from [16]. Thus, we simply restate their results (Table 2b, first column) and run our algorithm on their dataset.

As with all experiments, we use our own separate set of artist-created models (Section 4) as the sole training data. To set the hyperparameters (number of trees, maximum tree depth, etc.) of our method, we once used Bayesian optimization [35] on the training split of the MIT intrinsic images and fixed the parameters for all of our experiments.

Real-world experiment: The performance on synthetic data can be misleading and may not necessarily translate to realistic settings [13]. To demonstrate the accuracy and robustness of our method quantitatively in a real-world setting, we recorded a new dataset for shape from shading in natural illumination, since no dataset captured under natural illumination existed so far; methods considering natural illumination were instead evaluated only qualitatively or on synthetic data [2,16].

To record highly accurate ground truth in laboratory illumination, photometric stereo methods are well established [4]. They require a carefully designed lighting environment, however, and produce only a normal map. Hence, capturing data in natural illumination would require either synthesizing illumination in the lab, which violates the real-world assumption, or re-building a controlled setup at each scene, which is next to impossible for many realistic scenes.

Instead, we used multi-view stereo [36,37] to reconstruct surface meshes from ≈200 images we took of each of four objects. That enabled us to take test images under real illumination in different environments and later align the test images to the meshes by mutual information [38]. For the test images, we painted the objects with a white diffuse paint. To recover the ground truth illumination for each scene, we recorded a calibration sphere of the same BRDF as the objects. Images and ground truth are publicly available on our website.

We give quantitative results on the dataset in the three rightmost columns of Table 2b; the reconstructed surfaces are shown in Fig. 8. As before, our method was not specifically adapted to the dataset. Neither is the shape prior used by [16]; the shapes used for training (MIT dataset) are still representative (i.e. of a similar kind). We show additional results in Fig. 1.

Projection experiment: To evaluate the performance of our approach under perspective projection, we recorded three new test images depicting objects from our orthographic experiments. In contrast to the orthographic setting, where we used a telephoto lens to approximate the presumably infinite focal length for all images, we here chose different focal lengths ranging from 35 mm to 128 mm (full-frame 35 mm equivalent). Again, the images were taken in different environments featuring natural illumination. We compare our perspective extension to the orthographic version from the experiments above. As a baseline, we include a coarse normal estimate derived from the silhouette features and its perspective adaption following Section 8. By exploiting the silhouette features as polar coordinates, we directly compute a coarse surface normal $n$ as

$$n(p) = \begin{pmatrix} c(p)\,\cos\beta(p) \\ c(p)\,\sin\beta(p) \\ \sqrt{1 - c(p)^2} \end{pmatrix}, \qquad c(p) = 1 - d_{\mathrm{rel}}(p). \tag{10}$$
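Eq. (10) is simple to implement directly. A sketch, assuming the per-pixel silhouette features d_rel and beta are already given:

```python
import numpy as np

def coarse_normal(d_rel, beta):
    """Coarse surface normal from the silhouette features alone (Eq. 10)."""
    c = 1.0 - np.asarray(d_rel, dtype=float)
    return np.stack([c * np.cos(beta),
                     c * np.sin(beta),
                     np.sqrt(np.clip(1.0 - c ** 2, 0.0, None))], axis=-1)

center = coarse_normal(1.0, 0.0)       # object centre: normal faces the viewer
rim = coarse_normal(0.0, np.pi / 2)    # silhouette: normal lies in the image plane
```

By construction the result is a unit vector: the in-plane magnitude c and the out-of-plane component √(1 − c²) satisfy c² + (1 − c²) = 1.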

The focal length was given to the perspective variants.

Results: The quantitative and qualitative results show that our method robustly recovers surfaces and reflectance maps in synthetic, laboratory, and natural illumination. We clearly outperform the cross-scale approach [16] in all metrics; the only exception is the real images with known illumination, where we perform about the same. In the more challenging setting of unknown illumination, however, we perform significantly better. As can be seen in the bottom row of Fig. 8, correctly estimating the reflectance map is crucial to the performance of surface reconstruction. The silhouette features allow for a robust estimate, which is leveraged by our discriminative learning approach. The results in Fig. 8 further highlight that our approach is able to recover fine surface detail on real data, since it does not need to rely on strong spatial regularizers.

We also outperform the local context approach of [4]. One point to note is that our approach can deal with images of different scales (the images in Fig. 8, top, are approximately twice the size of those in Fig. 8, bottom). This is due to the scale-invariant nature of our silhouette features.

Table 2. Comparison to other methods. See text for further explanation. * indicates that the illumination was given. (a) Results on synthetic images (MIT intrinsic [16]). (b) Results on real images.

(a)
Method         nMSE*   nMSE    lMSE
Cross scale    0.058   0.471   0.039
Ours           0.034   0.196   0.013

(b)
Illumination    Lab [4]   Natural (ours)
Method          MAE       MAE*     MAE      lMSE
Local context   17.27°    –        –        –
Cross scale     19.30°    20.29°   29.29°   0.013
Ours            15.96°    20.51°   23.07°   0.002


It may seem surprising that the performance of all methods decreases in realistic settings, since the illumination is more colorful. However, this can be explained by observing that the real data exhibits shadows and inter-reflections, which the synthetic datasets do not. Despite these challenges, our discriminative approach is able to provide high-quality surface estimates in uncalibrated illumination.

Adapting the algorithm to perspective projection slightly, but consistently, improves its performance (Table 3). As can be seen in Fig. 9, the improvement is noticeable in all conditions, as well as for the estimation of the reflectance map. We observed the improvement to increase with bigger fields of view, consistent with the increasing difference between the projection models. In Fig. 9 the effects are more prominent in the baseline examples, as

Fig. 8. Comparison on real images with median angular errors. For laboratory illumination (top row), we show a novel view of our best and worst result of a reconstructed surface. The input image and the view of “local context” are taken from [4]. For natural illumination (bottom row), we show the surface normal estimates for known (left) and unknown illumination (right), and the estimated reflectance map for the latter case. Across all conditions, our method reconstructs fine surface detail better than previous approaches.

Fig. 9. Experiment with wide angle and complex illumination. We reconstruct the surface normals (top row) from the wide-angle image (focal length 35 mm) captured in complex real illumination (left column, top row) and synthetic illumination (left column, bottom row). Below the surface normal maps we show the median angular error and the angular error plot. Each method/image is evaluated assuming orthographic (left) and perspective projection (right). As expected, estimation under synthetic illumination gives the best results. The performance difference stems from the highly complex natural illumination with multiple light sources of the same color casting shadows. Note that factoring in the perspective distortion improves results in all cases.

Table 3. Comparison of projection models.

Illumination     Synthetic         Real
Method           MAE     lMSE      MAE     lMSE
Baseline ortho   25.4°   –         25.4°   –
Baseline persp   25.0°   –         25.0°   –
Ours ortho       15.6°   0.0013    23.8°   0.0021
Ours persp       14.0°   0.0010    23.1°   0.0018


there are no other features that compensate for perspective errors by local or global context.

10. Conclusion

In this paper we demonstrated a discriminative learning approach to the shape-from-shading problem in uncontrolled illumination, assuming only a single image of an unknown diffuse object with uniform albedo is given. To this end, we tailored regression forests to output surface normals. These pixel-independent estimates are processed spatially only by constraining the reconstructed surface to be integrable. We introduced and analyzed suitable input features that capture context on a local and a scale-invariant global level. Besides removing the need for explicit spatial regularization, the proposed silhouette features allow for estimating the unknown reflectance map. Both the reflectance map estimation and the surface reconstruction can be easily generalized to perspective projection. Our model needs to be trained for each illumination condition, similar to other learning approaches. Owing to its computational efficiency, our approach can be trained and tested within the time other recent methods need for just testing. We used a novel, large-scale dataset to train our model and evaluated it on various challenging datasets, where it outperforms recent approaches from the literature. Finally, we demonstrated its ability to reconstruct fine surface detail outside of the laboratory on a new real-world dataset.

Acknowledgments

We thank Simon Fuhrmann for assistance with multi-view stereo capturing. SRR was supported by the German Research Foundation (DFG) within the Research Training Group "Cooperative, Adaptive and Responsive Monitoring in Mixed Mode Environments" (GRK 1362). SR was supported in part by the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013)/ERC Grant agreement no. 307942, as well as by the EU FP7 project "Harvest4D" (No. 323567).

References

[1] Stockman D. Vienna 2010 35. Flickr; 2010. Licensed under ⟨https://creativecommons.org/licenses/by-sa/2.0/⟩.

[2] Johnson MK, Adelson EH. Shape estimation in natural illumination. In: IEEE conference on computer vision and pattern recognition (CVPR); 2011. p. 2553–60.

[3] Ramamoorthi R, Hanrahan P. An efficient representation for irradiance environment maps. In: ACM SIGGRAPH; 2001. p. 497–500.

[4] Xiong Y, Chakrabarti A, Basri R, Gortler SJ, Jacobs DW, Zickler T. From shading to local shape. IEEE Trans Pattern Anal Mach Intell 2014;37(1):67–79.

[5] Shotton J, Winn JM, Rother C, Criminisi A. TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. Int J Comput Vis 2009;81(1):2–23.

[6] Ikeuchi K, Horn BK. Numerical shape from shading and occluding boundaries. Artif Intell 1981;17(1):141–84.

[7] Koenderink JJ. What does the occluding boundary tell us about solid shape? Perception 1984;13(3):321–30.

[8] Oxholm G, Nishino K. Shape and reflectance from natural illumination. In: European conference on computer vision (ECCV); 2012. p. 528–41.

[9] Hertzmann A, Seitz SM. Shape and materials by example: a photometric stereo approach. In: IEEE conference on computer vision and pattern recognition (CVPR); 2003. p. 526–33.

[10] Cole F, Isola P, Freeman WT, Durand F, Adelson EH. ShapeCollage: occlusion-aware, example-based shape interpretation. In: European conference on computer vision (ECCV); 2012. p. 665–78.

[11] Khan N, Tran L, Tappen M. Training many-parameter shape-from-shading models using a surface database. In: ICCV workshops; 2009. p. 1433–40.

[12] Richter S, Roth S. Discriminative shape from shading in uncalibrated illumination. In: IEEE conference on computer vision and pattern recognition (CVPR); 2015.

[13] Durou JD, Falcone M, Sagona M. Numerical methods for shape-from-shading: a new survey with benchmarks. Comput Vis Image Underst 2008;109(1):22–43.

[14] Zhang R, Tsai PS, Cryer JE, Shah M. Shape from shading: a survey. IEEE Trans Pattern Anal Mach Intell 1999;21(8):690–706.

[15] Huang R, Smith WAP. Shape-from-shading under complex natural illumination. In: IEEE international conference on image processing (ICIP); 2011. p. 13–16.

[16] Barron JT, Malik J. Color constancy, intrinsic images, and shape estimation. In: European conference on computer vision (ECCV); 2012. p. 57–70.

[17] Barron JT, Malik J. Shape, albedo, and illumination from a single image of an unknown object. In: IEEE conference on computer vision and pattern recognition (CVPR); 2012. p. 334–41.

[18] Prados E, Faugeras O. "Perspective shape from shading" and viscosity solutions. In: IEEE international conference on computer vision (ICCV). Tokyo, Japan: IEEE; 2003. p. 826–31.

[19] Tankus A, Sochen N, Yeshurun Y. Shape-from-shading under perspective projection. Int J Comput Vis 2005;63(1):21–43.

[20] Vogel O, Breuß M, Weickert J. Perspective shape from shading with non-Lambertian reflectance. In: Pattern recognition. Berlin, Heidelberg: Springer; 2008. p. 517–26.

[21] Ben-Arie J, Nandy D. A neural network approach for reconstructing surface shape from shading. In: IEEE international conference on image processing (ICIP), vol. 2; 1998. p. 972–6.

[22] Wei GQ, Hirzinger G. Learning shape from shading by a multilayer network. IEEE Trans Neural Netw 1996;7(4):985–95.

[23] Grosse R, Johnson MK, Adelson EH, Freeman WT. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In: IEEE international conference on computer vision (ICCV); 2009. p. 2335–42.

[24] Panagopoulos A, Hadap S, Samaras D. Reconstructing shape from dictionaries of shading primitives. In: Asian conference on computer vision (ACCV); 2012. p. 80–94.

[25] Kontschieder P, Kohli P, Shotton J, Criminisi A. GeoF: geodesic forests for learning coupled predictors. In: IEEE conference on computer vision and pattern recognition (CVPR); 2013. p. 65–72.

[26] Dosch design ⟨http://www.doschdesign.com/products/3d/Comic_Characters_V2.html⟩; 2014.

[27] Shotton J, Sharp T, Kipman A, Fitzgibbon A, Finocchio M, Blake A, et al. Real-time human pose recognition in parts from single depth images. Commun ACM 2013;56(1):116–24.

[28] Jancsary J, Nowozin S, Rother C. Loss-specific training of non-parametric image restoration models: a new state of the art. In: European conference on computer vision (ECCV); 2012. p. 112–25.

[29] Breiman L. Random forests. Mach Learn 2001;45(1):5–32.

[30] Fisher R. Dispersion on a sphere. Proc R Soc Lond Ser A 1953;217(1130).

[31] Dhillon IS, Sra S. Modeling data using directional distributions. Technical Report TR-03-06, Department of Computer Sciences, The University of Texas at Austin; 2003.

[32] Papadhimitri T, Favaro P. A new perspective on uncalibrated photometric stereo. In: IEEE conference on computer vision and pattern recognition (CVPR); 2013. p. 1474–81.

[33] Reddy D, Agrawal A, Chellappa R. Enforcing integrability by error correction using l1-minimization. In: IEEE conference on computer vision and pattern recognition (CVPR); 2009. p. 2350–7.

[34] White R, Forsyth D. Combining cues: shape from shading and texture. In: IEEE conference on computer vision and pattern recognition (CVPR); 2006. p. 1809–16.

[35] Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems (NIPS); 2012. p. 2951–9.

[36] Fuhrmann S, Goesele M. Floating scale surface reconstruction. In: ACM transactions on graphics (Proceedings of ACM SIGGRAPH), vol. 33; 2014. p. 46.

[37] Fuhrmann S, Langguth F, Goesele M. MVE – a multi-view reconstruction environment. In: Eurographics workshop on graphics and cultural heritage; 2014. p. 11–8.

[38] Corsini M, Dellepiane M, Ponchio F, Scopigno R. Image-to-geometry registration: a mutual information method exploiting illumination-related geometric properties. Comput Graph Forum 2009;28(7):1755–64.
