Probabilistic visibility for multi-view stereo

Carlos Hernandez¹
Computer Vision Group, Toshiba Research Europe

George Vogiatzis²
Computer Vision Group, Toshiba Research Europe

Roberto Cipolla³
Dept. of Engineering, University of Cambridge

{carlos.hernandez¹|george.vogiatzis²}@crl.toshiba.co.uk, [email protected]³

Abstract

We present a new formulation of multi-view stereo that treats the problem as probabilistic 3D segmentation. Previous work has used the stereo photo-consistency criterion as a detector of the boundary between the 3D scene and the surrounding empty space. Here we show how the same criterion can also provide a foreground/background model that can predict whether a 3D location is inside or outside the scene. This model replaces the commonly used naive foreground model based on ballooning, which is known to perform poorly in concavities. We demonstrate how probabilistic visibility is linked to previous work on depth-map fusion, and we present a multi-resolution graph-cut implementation using the new ballooning term that is very efficient both in terms of computation time and memory requirements.

1. Introduction

Digital modeling of 3D objects is becoming increasingly popular and necessary for a wide range of applications such as cultural heritage preservation, online shopping or computer games. Although laser range scanning remains one of the most popular techniques for acquiring shape, the high cost of the equipment, its complexity, and the difficulty of capturing color are three big disadvantages. As opposed to laser techniques, image-based techniques provide an efficient and easy way to acquire shape and color by simply capturing a sequence of photographs of the object. Among the vast literature available on image-based modeling techniques, recent work based on volumetric approaches shows an excellent compromise between computation time and accuracy. However, these methods exhibit very strong memory requirements and are difficult to tune in order to obtain the best results.

In this paper we present a new formulation of the multi-view stereo problem that closely resembles a 3D segmentation. The new formulation is both computationally and memory efficient. The main contribution of this paper is to show how the photo-consistency criterion can define the outside of an object. This upgrades previous reasoning based on photo-consistency, where only the notion "on the object" is available but inside and outside are indistinguishable. The existence of a background/foreground distinction is crucial for volumetric methods since, whenever it is not available, an ad-hoc model is introduced, usually in the form of an inflationary or deflationary force, i.e., a ballooning term [6]. The negative effect of this term is that it also acts as a regularization term and can make details or thin structures disappear from the final reconstruction. Compared to previous work, our formulation provides a foreground/background model that is data-aware. This model can be used as an intelligent ballooning term, providing better reconstruction results. The new background model can be used with existing techniques for 3D segmentation such as level-sets or graph-cuts, improving their accuracy and simplicity for the task of 3D reconstruction.

This paper is organized as follows: In Section 2 we review the literature and explain the main motivation of the paper. In Section 3 we sketch the basics of the technique and describe how to use the concept of probabilistic visibility as an intelligent ballooning term in a multi-resolution graph-cut implementation. Section 4 describes the theory behind the new probabilistic approach to visibility computation from a set of depth-maps. Finally, in Section 5 we validate the proposed approach by showing high-quality reconstructions and comparing it to previous algorithms.

2. Motivation and related work

We consider a set of input photographs of an object and their corresponding camera projection matrices. The object is considered to be sufficiently textured so that dense correspondence can be obtained between two different images. The goal is to obtain a 3D surface representation of the object, e.g., a triangular mesh.

As recently reviewed by Seitz et al. [18], a vast quantity of literature exists on how to obtain such a 3D representation from images alone. Nearly all of these methods use a photo-consistency measure to evaluate how consistent a reconstruction is with a set of images, e.g., normalized cross-correlation or sum of squared differences.

The inspiration for this paper is recent progress in multi-view stereo reconstruction, and more specifically in volumetric techniques. The overview of state-of-the-art algorithms for multi-view stereo reconstruction reported by Seitz et al. [18] shows a massive domination of volumetric methods. Under this paradigm, a 3D cost volume is computed, and then a 3D surface is extracted using tools previously developed for the 3D segmentation problem such as snakes [12], level-sets [17] or, more recently, graph-cuts [21, 19, 4, 9, 13, 16, 20].

The way volumetric methods usually exploit photo-consistency is by building a 3D map of photo-consistency where each 3D location gives an estimate of how photo-consistent the reconstructed surface would be at that location. The only requirement to compute this photo-consistency 3D map is that camera visibility is available. Some methods use an initial approximation of the true surface to estimate visibility, such as the visual hull [21]. Iterative methods use instead the notion of a "current surface": the visibility computed from the reconstructed surface at iteration i−1 is used to compute photo-consistency at iteration i, improving the reconstruction gradually [8]. Finally, some recent methods are able to compute a "visibility-independent" photo-consistency, where occlusion is treated as an additional source of image noise [12].

Independently of how visibility is computed, all volumetric methods suffer from a limitation that is specific to the multi-view stereo problem, namely that there is no straightforward way of defining a foreground/background model for the 3D segmentation problem. This is because in this problem our primary source of geometric information is the correspondence cue, which is based on the following observation: a 3D point located on the object surface projects to image regions of similar appearance in all images where it is not occluded. Using this cue one can label 3D points as being on or off the object surface, but cannot directly distinguish between points inside or outside it. This lack of distinction has historically been solved by using a data-independent ballooning term that produces a constant inflationary tendency. The motivation for this type of term in the active contour domain is given in [6], but intuitively it can be thought of as a shape prior that favors objects that fill the bounding volume in the absence of any other information. On one hand, if the ballooning term is too large, the solution tends to over-inflate, filling the entire bounding volume. On the other hand, if it is too small, the solution collapses into an empty surface.

In this paper we propose a probabilistic framework to construct a data-aware ballooning term from photo-consistency only. This framework is related to the work of [5, 1, 10] in that we all aim to model geometric occlusion in a probabilistic way. However, we are the first to study the problem in a volumetric framework adapted to the 3D segmentation problem. Roughly speaking, instead of just assigning a photo-consistency value to a 3D location, this value is also propagated towards the camera centers that were used to compute it. The key observation about photo-consistency measures is that, besides providing a photo-consistency score for a particular 3D location, they also give additional information about the space between the location and the cameras used to compute the consistency. In other words, if a set of cameras gives a high photo-consistency score to a 3D location, they give at the same time a "background" score of the same strength to the 3D segments linking the 3D location with the camera centers. This follows from the fact that, if the surface were really at that location, then the cameras that gave it a high photo-consistency score would indeed see the surface without occlusion, i.e., the segments linking the camera centers with the 3D location would all be background.

To our knowledge, this "background" score is not used in any volumetric multi-view stereo algorithm, perhaps with the exception of [11], where photo-consistency is used to generate depth-maps which are then merged together using a volumetric depth-map fusion technique [7]. It turns out that depth-map fusion techniques are closely related to our probabilistic approach. In fact, we demonstrate how our probabilistic approach explains and generalizes the work of Curless and Levoy [7] on using signed distance functions for depth-map fusion. As we show in Section 4, the work of [7] naturally arises as the probabilistic solution whenever the sensor noise is modeled with a logistic distribution.

3. Multi-view stereo using multi-resolution graph-cuts

In [3] and subsequently in [2] it was shown how graph-cuts can optimally partition 2D or 3D space into 'foreground' and 'background' regions under any cost functional consisting of the following two terms:

• Labeling cost: for every point in space there is a cost for it being labeled 'foreground' or 'background'.

• Discontinuity cost: for every point in space, there is a cost for it lying on the boundary between the two partitions.

Mathematically, the cost functional described above can be seen as the sum of a weighted surface area of the boundary surface and a weighted volume of the 'foreground' region as follows:

$$E[S] = \int_S \rho(\mathbf{x})\, dA + \int_{V(S)} \sigma(\mathbf{x})\, dV \qquad (1)$$

where S is the boundary between 'foreground' and 'background', V(S) denotes the 'foreground' volume enclosed by S, and ρ and σ are two scalar density fields. The application described in [3] was the problem of 2D/3D segmentation. In that domain ρ(x) is defined as a function of the image intensity gradient and σ(x) as a function of the image intensity itself or local image statistics.
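To make the functional concrete, here is a minimal sketch of how (1) discretizes on a regular voxel grid: boundary faces between differently labeled voxels accumulate the surface term, and foreground voxels accumulate the volume term. The discretization and all names are ours, not the paper's, and the min-cut solver itself is omitted.

```python
import numpy as np

def _shift(a, axis, lo, hi):
    """Slice helper: a[..., lo:hi, ...] along the given axis."""
    idx = [slice(None)] * a.ndim
    idx[axis] = slice(lo, hi)
    return a[tuple(idx)]

def energy(labels, rho, sigma, h=1.0):
    """Discrete version of Eq. (1) on a voxel grid of spacing h.
    labels: boolean foreground mask; rho, sigma: per-voxel samples."""
    E_area = 0.0
    for ax in range(labels.ndim):
        # faces where the foreground label changes between neighbors
        boundary = _shift(labels, ax, None, -1) != _shift(labels, ax, 1, None)
        # photo-consistency cost on each face (average of the two voxels)
        face_rho = 0.5 * (_shift(rho, ax, None, -1) + _shift(rho, ax, 1, None))
        E_area += face_rho[boundary].sum() * h**2
    E_vol = sigma[labels].sum() * h**3   # weighted 'foreground' volume
    return E_area + E_vol
```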

This model balances two competing terms: the first minimizes a surface integral of photo-consistency, while the second maximizes the volume of regions with high evidence of being foreground. While the photo-consistency term is relatively easy to compute from a set of images, very little work has been done to obtain an appropriate ballooning term. In most previous work on volumetric multi-view stereo, the ballooning term is a very simplistic inflationary force that is constant over the entire volume, i.e., σ(x) = −λ. This simple model tries to recover thin structures by maximizing the volume inside the final surface. However, as a side effect, it also fills in concavities, behaving as a regularization force and smoothing fine details.

When silhouettes of the object are available, an additional silhouette cue can be used [21], which provides the constraint that all points inside the object volume must project inside the silhouettes of the object. Hence the silhouette cue can provide some foreground/background information by giving a very high likelihood of being outside the object to 3D points that project outside the silhouettes. However, this ballooning term is not enough if thin structures or big concavities are present, in which case the method fails (see Fig. 4, middle row). Very recently, a data-driven foreground/background model based on the concept of photo-flux has been introduced [4]. However, the approach requires approximate knowledge of the object surface orientation, which in many cases is not readily available.

Ideally, the ballooning term should be linked to the notion of visibility, where points that are not visible from any camera are considered to be inside the object, or foreground, and points that are visible from at least one camera are considered to be outside the object, or background. An intuition of how to obtain such a ballooning term is found in a classic paper on depth-sensor fusion by Curless and Levoy [7]. In that paper the authors fuse a set of depth sensors using signed distance functions. This fusion relies on the basic principle that the space between the sensor and the depth map should be empty, or background, and the space behind the depth map should be considered foreground. Here we propose to generalize this visibility principle and compute a probabilistic version of it by calculating the "evidence of visibility" from a given set of depth-maps, using it as an intelligent ballooning term.

The outline of the full system is as follows:

• create a set of depth-maps from the set of input calibrated images,

• derive the discontinuity cost ρ(x) from the set of depth-maps, i.e., compute the photo-consistency term,

• derive the labeling cost σ(x) from the set of depth-maps, i.e., use a data-aware ballooning term computed from the evidence of visibility, and

• extract the final surface as the global solution of the min-cut problem given ρ(x) and σ(x).

It is worth noting that the algorithm just described can also be used when the input is no longer a set of images but a set of depth-maps obtained from other types of sensors, e.g., a laser scanner. In this case, the system simply skips the first step, since the depth-maps are already available, and computes ρ and σ directly from the set of depth-maps given as input.
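The following sketch mirrors the data flow of the four steps above. It is an orchestration skeleton only: the stage implementations are passed in as callables, and none of the names correspond to the authors' code.

```python
def reconstruct(images, cameras, compute_depth_map,
                photo_consistency_cost, visibility_evidence, solve_min_cut):
    """Pipeline skeleton; the four stages are supplied as callables."""
    # 1. One depth-map per calibrated input image (this step is skipped
    #    when the input is already a set of depth-maps, e.g. laser scans).
    depth_maps = [compute_depth_map(img, cam, images, cameras)
                  for img, cam in zip(images, cameras)]
    # 2. Discontinuity cost rho(x): photo-consistency accumulated in 3D.
    rho = photo_consistency_cost(depth_maps)
    # 3. Labeling cost sigma(x): evidence of visibility (Section 4),
    #    i.e. the data-aware ballooning term.
    sigma = visibility_evidence(depth_maps)
    # 4. Extract the surface as the global optimum of the min-cut problem.
    return solve_min_cut(rho, sigma)
```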

3.1. Depth-map computation from images

The goal of this section is to generate a set of depth-maps D_1, ..., D_N from a sequence of images I_1, ..., I_N calibrated for camera pose and intrinsic parameters. Each depth-map is similar to a 2D image, but each pixel measures the depth of the scene away from the sensor instead of a color value. In order to create the depth-maps from the set of input images, we propose to use a robust photo-consistency metric similar to the one described in [12] that does not need any visibility computation. This choice is motivated by the excellent results obtained by this type of photo-consistency metric in the recent comparison of 3D modeling techniques carried out by [18]. Basically, occlusion is considered as another type of image noise and is handled robustly in the same way as the lack of texture or the presence of highlights in the image. For a given image I_i, the depth D_i(x) along the optic ray generated by a 3D location x is computed as follows:

• Compute the corresponding optic ray

$$\mathbf{o}_i(d) = \mathbf{x} + (\mathbf{c}_i - \mathbf{x})\, d \qquad (2)$$

that goes through the camera's optic center c_i and the 3D location x,

• As a function of the depth d along the optic ray, project the 3D point o_i(d) into the M closest cameras and compute M correlation scores between each neighbor image and the reference image using normalized cross-correlation,

Page 4: Probabilistic visibility for multi-view stereo

• Combine the M correlation scores into a single score C(d) using a voting scheme as in [12], and find the final depth D_i as the global maximum of C. The confidence in the depth D_i is simply C(D_i). As an optional test, a minimum confidence threshold can be used to reject depth estimates with very low confidence. The 3D locations of the estimated depths are stored along with their corresponding confidences. A sketch of this per-pixel search is given below.
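The following is an illustrative sketch of the depth search along one optic ray, assuming the NCC scores have already been sampled at discrete depths. The voting rule here is a simplification in the spirit of [12], not the authors' exact scheme.

```python
import numpy as np

def estimate_depth(ncc_scores, depths):
    """Per-pixel depth search along one optic ray (illustrative only).

    ncc_scores: list of M arrays; ncc_scores[m][k] is the NCC score of the
    reprojection of o_i(depths[k]) = x + (c_i - x) * depths[k] into the m-th
    neighbor image (the projection itself is assumed precomputed).
    """
    votes = np.zeros(len(depths))
    for scores in ncc_scores:
        k = int(np.argmax(scores))   # each neighbor votes for its best peak,
        votes[k] += scores[k]        # weighted by its correlation value
    C = votes                        # combined score C(d) along the optic ray
    k_best = int(np.argmax(C))
    return depths[k_best], C[k_best] # final depth D_i and confidence C(D_i)
```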

3.2. Discontinuity cost from a set of depth-maps

Once we have computed a depth-map for every input image, we can build the discontinuity map ρ(x) for every 3D location x. We propose a very simple accumulation scheme where, for every 3D point x, its total photo-consistency C(x) is given by the sum of the confidences of all nearby points in the computed depth-maps. Since the graph-cut algorithm minimizes the discontinuity cost, and we want to maximize the photo-consistency, ρ(x) is simply inverted using the exponential:

$$\rho(\mathbf{x}) = e^{-\mu C(\mathbf{x})}, \qquad (3)$$

where μ is a very stable rate-of-decay parameter, which in all our experiments was set to 0.05.

To reduce the large memory requirements of graph-cut methods, we propose to store the values of ρ(x) in an octree partition of 3D space. The size of an octree voxel depends on the photo-consistency value C(x): voxels with a non-zero photo-consistency value have the finest resolution, while the remaining space where C(x) = 0 is partitioned using bigger voxels, the voxel size being directly linked to the distance to the closest "non-empty" voxel (see Fig. 3 for an example of such an octree partition). As an implementation detail, the only modification needed in the graph-cut algorithm to use a multi-resolution grid is that links between neighboring nodes now need to be weighted according to the surface area shared by both nodes.
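A minimal sketch of Eq. (3); the octree storage and the area-based weighting of graph links are omitted here.

```python
import numpy as np

MU = 0.05   # rate-of-decay used in all the paper's experiments

def discontinuity_cost(C):
    """Eq. (3): rho(x) = exp(-mu C(x)). High accumulated photo-consistency
    C(x) makes x cheap to cut, i.e. a likely surface location."""
    return np.exp(-MU * np.asarray(C, dtype=float))
```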

3.3. Labeling cost from a set of depth-maps

In the same way as the computation of the discontinuity cost, the ballooning term σ(x) can be computed exclusively from a set of depth-maps. In this paper we propose to use the probabilistic evidence for visibility introduced in Section 4 as an intelligent ballooning term. To do so, all we need is to choose a noise model for our sensor given a depth-map D and its confidence C(D). We propose to use a simplistic yet powerful model: a Gaussian contaminated with a uniform distribution, i.e., an inlier model plus an outlier model. The inlier model is assumed to be a Gaussian distribution centered around the true depth. The standard deviation is considered to be a constant value that depends only on the image resolution and camera baseline. The outlier ratio varies according to the confidence of the depth estimate C(D), and in our case is simply proportional to it.

[Figure 1. Sensor depth notation. Sensor i measures the depth of the scene along the optic ray from the sensor to the 3D point x. The depth of point x from sensor i is d_i(x), while the correct depth of the scene along that ray is D*_i(x) and the sensor measurement is D_i(x).]

The labeling cost σ(x) at a given location is just the evidence of visibility. The details of this calculation are laid out in the next section.
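As a concrete reading of the inlier/outlier model just described, here is a sketch of the sensor pdf; the parameterization (names, and tying the outlier ratio to a fixed argument rather than to C(D)) is our assumption.

```python
import numpy as np

def sensor_pdf(D, D_star, sigma, rho_out, depth_range):
    """p(D | D*): Gaussian inlier of std 'sigma' centered on the true
    depth D_star, mixed with a uniform outlier component over the
    bounding volume's depth range; rho_out is the outlier ratio
    (proportional to the confidence C(D) in the paper)."""
    gauss = np.exp(-0.5 * ((D - D_star) / sigma) ** 2) \
            / (sigma * np.sqrt(2.0 * np.pi))
    return (1.0 - rho_out) * gauss + rho_out / depth_range
```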

4. Probabilistic fusion of depth sensors

This section considers the problem of probabilistically fusing depth-maps obtained from N depth sensors. We will be using the following notation: the sensor data is a set of N depth-maps D = D_1, ..., D_N. A 3D point x can therefore be projected to a pixel of the depth-map of the i-th sensor, and the corresponding depth measurement at that pixel is written as D_i(x), while D*_i(x) denotes the true depth of the 3D scene. The measurement D_i(x) contains some noise, which is modeled probabilistically by a pdf conditional on the real surface depth

$$p\left(D_i(\mathbf{x}) \mid D_i^*(\mathbf{x})\right).$$

The depth of the point x away from the sensor is d_i(x) (see figure 1). If x is located on the 3D scene surface, then ∀i D*_i(x) = d_i(x). If for a particular sensor i we have D*_i(x) > d_i(x), this means that the sensor can see beyond x, or in other words that x is visible from the sensor. We denote this event by V_i(x). When the opposite event V̄_i(x) is true, as in figure 1, then x is said to be occluded from the sensor. To fuse these measurements we consider a predicate V(x), which reads 'x is visible from at least one sensor'. More formally, the predicate is defined as follows:

$$V(\mathbf{x}) \equiv \exists i\; V_i(\mathbf{x}) \qquad (4)$$

V(x) acts as a proxy for the predicate we should ideally be examining, namely 'x is outside the volume of the 3D scene'. However, our sensors cannot provide any evidence beyond D*_i(x) along the optic ray, the rest of the points on that ray being occluded.


[Figure 2. Visibility from sensors. In the example shown, the point is not visible from sensor 2 while it is visible from sensor 1, i.e., we have V_1 V̄_2. In the absence of a surface prior that favors geometries such as the one shown, one can safely assume that there is no probabilistic dependence between visibility or invisibility from any two sensors.]

If there are locations that are occluded from all sensors, no algorithm could produce any evidence for those locations being inside or outside the volume. In that sense, therefore, V(x) is the strongest predicate one could hope for in an optical system. An intuitive assumption made throughout this paper is that the probability of V(x) depends only on the depth measurements of sensors along optic rays that go through x. This means that most of our inference equations will refer to a single point x, in which case the x argument can be safely removed from the predicates. Our set of assumptions, which we denote by J, consists of the following:

• The probability distributions of the true depths of the scene D*_1(x) ··· D*_N(x), and also of the measurements D_1(x) ··· D_N(x), are independent given J (see figure 2 for justification).

• The probability distribution of a sensor measurement, given the scene depths and all other measurements, depends only on the surface depth it is measuring:

$$p\left(D_i \mid D_1^* \cdots D_N^*\, D_{j \neq i}\, \mathcal{J}\right) = p\left(D_i \mid D_i^*\, \mathcal{J}\right)$$

We are interested in computing the evidence function, under this set of independence assumptions [14], for the visibility of the point given all the sensor measurements:

$$e\left(V \mid D_1 \cdots D_N\, \mathcal{J}\right) = \log \frac{p\left(V \mid D_1 \cdots D_N\, \mathcal{J}\right)}{p\left(\bar{V} \mid D_1 \cdots D_N\, \mathcal{J}\right)}.$$

From J and the rules of probability one can derive:

$$p\left(\bar{V} \mid D_1 \cdots D_N\, \mathcal{J}\right) = \prod_{i=1}^{N} p\left(\bar{V}_i \mid D_i\, \mathcal{J}\right)$$

and

$$p\left(\bar{V}_i \mid D_i\, \mathcal{J}\right) = \frac{\int_0^{d_i} p\left(D_i \mid D_i^*\, \mathcal{J}\right) p\left(D_i^* \mid \mathcal{J}\right) dD_i^*}{\int_0^{\infty} p\left(D_i \mid D_i^*\, \mathcal{J}\right) p\left(D_i^* \mid \mathcal{J}\right) dD_i^*}.$$

As mentioned, the distributions p(D_i | D*_i J) encode our knowledge about the measurement model. Reasonable choices would be the Gaussian distribution or a Gaussian contaminated by an outlier process; both of these approaches are evaluated in Section 5. Another interesting option would be multi-modal distributions. The prior p(D*_i | J) encodes some geometric knowledge about the depths in the scene. In our examples a bounding volume was given, so we assumed a uniform distribution of D*_i inside that volume.

If we write π_i = p(V̄_i | D_i J), then the evidence for visibility is given by:

$$e\left(V \mid D_1 \cdots D_N\, \mathcal{J}\right) = \log \frac{1 - \pi_1 \cdots \pi_N}{\pi_1 \cdots \pi_N} \qquad (5)$$

Before proceeding with the experimental evaluation of this evidence measure, we point out an interesting connection between our approach and one of the classic methods in the Computer Graphics literature for merging range data. A numerical sketch of (5) is given below.
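The sketch below evaluates π_i by numerical integration under the Gaussian-plus-uniform sensor model assumed in Section 3.3, then combines the sensors via Eq. (5). All parameter values are illustrative placeholders, not values from the paper.

```python
import numpy as np

def pi_i(D_i, d_i, sigma=0.01, rho_out=0.1, depth_max=1.0, n=4096):
    """pi_i = p(Vbar_i | D_i, J): probability that x is occluded from
    sensor i, integrating the inlier/outlier pdf against a uniform prior
    on the true depth D*_i over [0, depth_max] (the bounding volume)."""
    D_star = np.linspace(0.0, depth_max, n)
    pdf = (1.0 - rho_out) * np.exp(-0.5 * ((D_i - D_star) / sigma) ** 2) \
          / (sigma * np.sqrt(2.0 * np.pi)) + rho_out / depth_max
    occluded = D_star <= d_i            # true surface lies in front of x
    return np.trapz(pdf[occluded], D_star[occluded]) / np.trapz(pdf, D_star)

def evidence_of_visibility(measurements):
    """Eq. (5): e(V | D_1..D_N, J) = log((1 - prod(pi_i)) / prod(pi_i)).
    'measurements' is a list of (D_i, d_i) pairs for the rays through x."""
    prod = np.prod([pi_i(D, d) for D, d in measurements])
    return np.log((1.0 - prod) / prod)
```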

4.1. Signed distance functions

In [7], Curless and Levoy compute signed distance functions from each depth-map (positive towards the camera and negative inside the scene), whose weighted average is then stored in a 3D scalar field. So if w_i(x) represents the confidence of the depth measurement D_i(x) of the i-th sensor, the 3D scalar field they compute is:

$$F(\mathbf{x}) = \sum_{i=1}^{N} w_i(\mathbf{x})\left(d_i(\mathbf{x}) - D_i(\mathbf{x})\right) \qquad (6)$$

The zero level of F(x) is then computed using marching cubes. While this method provides quite accurate results, it has a drawback: for a set of depth maps around a closed object, distances from opposite sides interfere with each other. To avoid this effect, [7] actually clamps the distance on either side of a depth map. The distance must be left unclamped far enough behind the depth map so that all distance functions contribute to the zero-level crossing, but not too far, so as not to compromise the reconstruction of thin structures. This limitation is due to the fact that the method implicitly assumes that the surface has low relief or that there are no self-occlusions. This can be expressed in several ways, but perhaps the most intuitive is that every optic ray from every sensor intersects the surface only once. This means that if a point x is visible from at least one sensor, then it must be visible from all sensors (see figure 2). Using this assumption, an analysis similar to the one in the previous section leads to a surprising insight into the algorithm. More precisely, if we set the prior probability for visibility to p(V) = 0.5 and assume the logistic distribution


Figure 3. Different terms used in the graph-cut algorithm to reconstruct the Gormley sculpture. Left: multi-resolution grid used in the graph-cut algorithm. Middle: discontinuity cost ρ(x) (or photo-consistency). Right: labeling cost σ(x) (or intelligent ballooning).

for sensor noise, i.e.,

$$p\left(D_i, D_i^* \mid \mathcal{I}\right) \propto \operatorname{sech}^2\!\left(\frac{D_i^* - D_i}{2 w_i}\right),$$

then the probabilistic evidence for V given all the data exactly corresponds to the right-hand side of (6). In other words, the sum of signed distance functions of [7] can be seen as an accumulation of probabilistic evidence for the visibility of points in space, given a set of noisy measurements of the depth of the 3D scene. This further reinforces the usefulness of probabilistic evidence for visibility.
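For completeness, a minimal sketch of the fusion rule of Eq. (6); the array layout is our assumption, and the clamping discussed above is omitted.

```python
import numpy as np

def fuse_signed_distances(d, D, w):
    """Eq. (6) (Curless and Levoy [7]): accumulate weighted signed
    distances into the scalar field F(x), whose zero level set is then
    extracted, e.g. with marching cubes (not shown). d, D, w are arrays
    of shape (N_sensors, *grid_shape) holding d_i(x), D_i(x), w_i(x)."""
    return np.sum(w * (d - D), axis=0)
```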

5. Experiments

We present a sequence of 72 images of a "crouching man" sculpture made of plaster by the modern sculptor Antony Gormley (see Fig. 4, top). The image resolution is 5 Mpix, and the camera motion was recovered by standard structure-from-motion techniques [22]. The object exhibits significant self-occlusions, a large concavity in the chest, and two thin legs, which make it a very challenging test for our new ballooning term. The first step in the reconstruction process is to compute a set of depth-maps from the input images. This process is by far the most expensive of the whole pipeline in terms of computation time: a single depth-map takes between 90 and 120 seconds, the overall computation time being close to 2 hours. Once the depth-maps are computed, a 3D octree grid can be built (see Fig. 3, left) together with the discontinuity cost and the labeling cost (see Fig. 3, middle and right respectively). Because of the octree grid, we are able to use up to 10 levels of resolution to compute the graph-cut, i.e., the equivalent of a regular grid of 1024³ voxels. We show in figure 4 some of the images used in the reconstruction (top), the result using an implementation of [21] (middle), and the reconstruction result of the proposed method (bottom). We can appreciate how the constant ballooning term introduced in [21] is unable to reconstruct the feet and the concavities correctly at the same time. In order to recover thin structures such as the feet, the ballooning term needs to be stronger; but even before the feet are fully recovered, the concavities start to over-inflate.

Finally, we show in figure 5 the effect of having an outlier component in the noise model of the depth sensor when computing the volume of evidence of visibility. The absence of an outlier model able to cope with noisy depth estimates shows up in the visibility volume as tunnels "drilled" by the outliers (see Fig. 5, center). Adding an outlier term clearly reduces this tunneling effect while preserving the concavities (see Fig. 5, right).

6. Conclusions

We have presented a new formulation of multi-view stereo that treats the problem as probabilistic 3D segmentation. The primary result of this investigation is that the photo-consistency criterion provides a foreground/background model that can predict whether a 3D location is inside or outside the scene. This fact, which has not received much notice in the multi-view stereo literature, lets us replace the commonly used naive inflationary model with probabilistic evidence for the visibility of 3D locations. The proposed algorithm significantly outperforms ballooning-based approaches, especially in concave regions and thin protrusions. We also report a surprising connection between the proposed visibility criterion and a classic computer graphics technique for depth-map fusion, which further validates our approach. As future work we are planning a detailed evaluation of our approach against state-of-the-art depth-map fusion techniques such as [15]. We are also considering evaluating our method on the Middlebury datasets [18].


Figure 4. Comparison of our reconstruction results with previous methods. Plaster model of a crouching man by Antony Gormley, 2006. Top: some of the input images. Middle: views of the model reconstructed using the technique of [21] with a constant ballooning term; no constant ballooning factor is able to reconstruct the feet and the concavities correctly at the same time. Bottom: views of the model reconstructed using the intelligent ballooning proposed in this paper, shown in Fig. 5, right.

Figure 5. Comparison of two different inlier/outlier ratios for the depth-sensor noise model. Left: 3D location of one slice of the volume of "evidence of visibility". Middle: the sensor model is a pure Gaussian without any outlier model; outliers "drill" tunnels in the visibility volume. Right: the sensor model takes an outlier model into account; the visibility volume is more robust against outliers while the concavities are still distinguishable.


Appendix. Interpretation of signed distance functions

Using the predicates we have already defined, the assumption of no self-occlusion can be expressed by

$$V \leftrightarrow \forall i\; V_i. \qquad (7)$$

From (4) and (7) we see that if a point x is visible (invisible) from one sensor, it is visible (invisible) from all sensors, i.e., V_1 ↔ ··· ↔ V_N ↔ V. Let I stand for the prior knowledge, which includes the geometric description of the problem and (7). Given (7), the events D_1 ··· D_N are independent under the knowledge of V or V̄, which means that using Bayes' theorem we can write:

$$p\left(V \mid D_1 \cdots D_N\, \mathcal{I}\right) = \frac{p\left(V \mid \mathcal{I}\right) \prod_{i=1}^{N} p\left(D_i \mid V \mathcal{I}\right)}{p\left(D_1 \cdots D_N \mid \mathcal{I}\right)} \qquad (8)$$

Obtaining the equivalent equation for V̄, dividing it by equation (8), and taking logs gives us:

$$e\left(V \mid D_1 \cdots D_N\, \mathcal{I}\right) = e\left(V \mid \mathcal{I}\right) + \sum_{i=1}^{N} \log \frac{p\left(D_i \mid V \mathcal{I}\right)}{p\left(D_i \mid \bar{V} \mathcal{I}\right)}. \qquad (9)$$

By several applications of Bayes' theorem we get:

$$e\left(V \mid D_1 \cdots D_N\, \mathcal{I}\right) = \sum_{i=1}^{N} \log \frac{\alpha_i}{\beta_i} - (N-1)\, e\left(V \mid \mathcal{I}\right), \qquad (10)$$

where $\alpha_i = \int_{d_i}^{\infty} p\left(D_i, D_i^* \mid \mathcal{I}\right) dD_i^*$ and $\beta_i = \int_0^{d_i} p\left(D_i, D_i^* \mid \mathcal{I}\right) dD_i^*$.

We now set e(V | I) = 0 and assume the noise model is given by the logistic function

$$p\left(D_i, D_i^* \mid \mathcal{I}\right) \propto \operatorname{sech}^2\!\left(\frac{D_i^* - D_i}{2 w_i}\right).$$
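For the reader's convenience, the "standard calculus" invoked below reduces, in our reading, to two identities: the antiderivative of the squared hyperbolic secant and the log-ratio identity for tanh, applied to the integrals α_i and β_i when the prior support is large compared with w_i:

$$\int \operatorname{sech}^2\!\left(\frac{u}{2w}\right) du = 2w \tanh\!\left(\frac{u}{2w}\right) + C, \qquad \log\frac{1-\tanh z}{1+\tanh z} = -2z.$$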

Using standard calculus one can obtain the following expression for the evidence:

$$e\left(V \mid D_1 \cdots D_N\, \mathcal{I}\right) = \sum_{i=1}^{N} w_i \left(d_i - D_i\right), \qquad (11)$$

equal to the average of the distance functions used in [7]. □

Notation

N        Number of images/sensors
x        3D location
I        Prior knowledge
e(A)     Evidence of predicate A
p(A)     Probability of predicate A
D_i(x)   Depth measured by sensor i for location x
D*_i(x)  True depth of the scene for sensor i
C(x)     Confidence of the depth estimate at location x
V_i(x)   Predicate 'x is visible from sensor i'
V(x)     Predicate 'x is visible from at least one sensor'

References

[1] M. Agrawal and L. S. Davis. A probabilistic framework for surface reconstruction from multiple images. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 470–477, 2001.

[2] A. Blake, C. Rother, M. Brown, P. Perez, and P. Torr. Interactive image segmentation using an adaptive GMMRF model. In Proc. 8th Europ. Conf. on Computer Vision, pages 428–441, 2004.

[3] Y. Boykov and V. Kolmogorov. Computing geodesics and minimal surfaces via graph cuts. In Proc. 9th Intl. Conf. on Computer Vision, pages 26–33, 2003.

[4] Y. Boykov and V. Lempitsky. From photohulls to photoflux optimization. In Proc. BMVC, pages 1149–1158, 2006.

[5] A. Broadhurst, T. Drummond, and R. Cipolla. A probabilistic framework for space carving. In Proc. 8th Intl. Conf. on Computer Vision, volume 1, pages 338–393, 2001.

[6] L. Cohen and I. Cohen. Finite-element methods for active contour models and balloons for 2-D and 3-D images. IEEE Trans. Pattern Anal. Mach. Intell., 15(11):1131–1147, November 1993.

[7] B. Curless and M. Levoy. A volumetric method for building complex models from range images. In Proc. ACM SIGGRAPH, pages 303–312, 1996.

[8] O. Faugeras and R. Keriven. Variational principles, surface evolution, PDEs, level set methods and the stereo problem. IEEE Transactions on Image Processing, 7(3):335–344, 1998.

[9] Y. Furukawa and J. Ponce. Carved visual hulls for image-based modeling. In Proc. 9th Europ. Conf. on Computer Vision, volume 1, pages 564–577, 2006.

[10] P. Gargallo and P. Sturm. Bayesian 3D modeling from images using multiple depth maps. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 885–891, 2005.

[11] M. Goesele, B. Curless, and S. Seitz. Multi-view stereo revisited. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 2402–2409, 2006.

[12] C. Hernandez and F. Schmitt. Silhouette and stereo fusion for 3D object modeling. Computer Vision and Image Understanding, 96(3):367–392, December 2004.

[13] A. Hornung and L. Kobbelt. Hierarchical volumetric multi-view stereo reconstruction of manifold surfaces based on dual graph embedding. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 503–510, 2006.

[14] E. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[15] M. Kazhdan, M. Bolitho, and H. Hoppe. Poisson surface reconstruction. In Eurographics Symposium on Geometry Processing, pages 61–70, 2006.

[16] V. Lempitsky, Y. Boykov, and D. Ivanov. Oriented visibility for multi-view reconstruction. In Proc. 9th Europ. Conf. on Computer Vision, volume 3, pages 226–238, 2006.

[17] J. Pons, R. Keriven, and O. Faugeras. Modelling dynamic scenes by registering multi-view image sequences. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 2, pages 822–827, 2005.

[18] S. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 519–528, 2006.

[19] S. Sinha and M. Pollefeys. Multi-view reconstruction using photo-consistency and exact silhouette constraints: A maximum-flow formulation. In Proc. 10th Intl. Conf. on Computer Vision, volume 1, pages 349–356, 2005.

[20] S. Tran and L. Davis. 3D surface reconstruction using graph cuts with surface constraints. In Proc. 9th Europ. Conf. on Computer Vision, volume 2, pages 218–231, 2006.

[21] G. Vogiatzis, P. Torr, and R. Cipolla. Multi-view stereo via volumetric graph-cuts. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 391–398, 2005.

[22] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2000.