3D Object Reconstruction with Heterogeneous Sensor Data

Li Guan, UNC-Chapel Hill, USA / ETH-Zurich, [email protected]
Jean-Sebastien Franco, LaBRI - INRIA Sud-Ouest, University of Bordeaux, [email protected]
Marc Pollefeys, UNC-Chapel Hill, USA / ETH-Zurich, [email protected]
Abstract
In this paper, we reconstruct 3D objects with a heterogeneous sensor network of Time-of-Flight (ToF) Range Imaging (RIM) sensors and high-resolution camcorders. With this setup, we first carry out a simple but effective depth calibration for the RIM cameras. We then combine the camcorder silhouette cues and the RIM camera depth information for the reconstruction. Our main contribution is a sensor fusion framework in which the computation is general, simple and scalable. Although we only discuss the fusion of conventional cameras and RIM cameras in this paper, the proposed framework can be applied to any vision sensors. This framework uses a space occupancy grid as a probabilistic 3D representation of scene contents. After defining sensing models for each type of sensor, the reconstruction simply becomes a Bayesian inference problem and can be solved robustly. The experiments show that the quality of the reconstruction is substantially improved over the noisy depth sensor measurements.
1. Introduction
3D object reconstruction is a classic computer vision problem and has many applications such as virtual reality, vision-guided surgery, medical studies and simulations, video games, architectural design, etc. Within the past five years, a promising new technology, Range Imaging (RIM) cameras based on Time-of-Flight (ToF) principles, has come to market. The Swiss Ranger 3000 shown in Fig. 1 is a typical model. 2.5D range images combined with 2D intensity images can be read out directly at up to 50 fps. Although most of these RIM cameras do not have high image resolution (e.g. 176 x 144 for the Swiss Ranger 3000), their measurement throughput is still far beyond that of traditional depth sensors such as LIDAR. This opens enormous potential in a wide range of application areas, including action recognition and tracking, object pose recognition, obstacle detection and so on.
However, little work has explored their potential in 3D object reconstruction.

Figure 1. (a) RIM camera, Swiss Ranger 3000; (b) a typical output from the sensor.

The main challenges are (1) the range images are noisy and not always accurate enough for 3D reconstruction purposes; in fact, RIM camera depth calibration itself remains a new and active research topic [15]; and (2) the relatively low image resolution prohibits detailed reconstruction.
In this paper, we propose to solve the above problems and explore the reconstruction potential of RIM cameras by introducing a heterogeneous sensor network of RIM cameras and high-resolution camcorders. We first describe a simple but effective depth calibration process for the RIM cameras, which is just an additional step in the classic geometric camera calibration procedure [28]. For the reconstruction, we propose a probabilistic framework to fuse depth information from RIM cameras and silhouette cues from camcorders.
Depth information and silhouette cues alone have each been explored intensively for 3D reconstruction purposes. Both have their own advantages and drawbacks. Depth information can provide actual object surface patches, but due to self-occlusion, individual patches only give a partial model of the object surface, so one of the many challenges is to deal with missing patches and fill the holes so as to get a topologically correct object shape [1, 6, 13, 7, 4]. On the other hand, reconstruction from silhouette cues [2, 16, 25, 10, 17] is praised for a closed-form shape estimate of the object. Recently, not even hard-decision binary silhouette images are required for a robust
probabilistic visual hull reconstruction [11]. An inherent drawback of a visual hull is that it cannot recover object concavities no matter how many views of silhouettes are provided. However, this can be directly compensated by the depth information. In fact, object depth and silhouettes are quite complementary in nature: the former encodes light bouncing back from the frontal surfaces, while the latter is tangent to the object. So in theory, these two could be combined to improve the reconstruction quality. Additionally, in our sensor network, the shape details can be recovered from the high-resolution camcorder frames to compensate for the low-resolution RIM camera images.
However, silhouette and depth integration is not straightforward due to the heterogeneity of the information. Li et al. tackle the problem with a pure surface representation [18], which requires a lot of delicate handling of geometric computation errors. As an alternative, volumetric fusion can be favored to avoid topological problems [20, 23, 27], but these methods are all based on deterministic criteria, which have to deal specifically with sensor noise perturbations.
In order to achieve a more robust but also more general solution to the fusion problem, similar to [11], our framework borrows the concept of a space occupancy grid from the robotics literature [8, 21, 22] as the representation of 3D scenes. After defining the probabilistic sensing models for each type of sensor, the reconstruction simply becomes a Bayesian inference. The reconstruction result is a posterior probability volume given the sensor observations. It is inherently robust and requires no special treatment of sensor noise, because the noise and variation are already part of the probabilistic sensing models. One thing to note is that the proposed framework is not limited to the fusion of silhouette cues and depth maps; it accommodates any type of sensor observation, such as point clouds and disparity maps, as long as the sensing model can be properly defined.
The paper is organized as follows: In Section 2 we explain the mechanism of common RIM cameras and introduce our calibration method. In Section 3, we formally describe our reconstruction algorithm via the Bayesian inference framework. We introduce the camcorder and RIM camera sensing models in Section 4. In Section 5 we validate the proposed calibration and fusion scheme by reconstructions from two real-world datasets.
2. RIM Camera and its Calibration
Most common RIM camera designs, including the SR3000 in Fig. 1, are based on the indirect ToF principle. Amplitude-modulated light is emitted from the camera, travels to the object, is reflected, and is finally demodulated by a specialized CMOS/CCD pixel. Demodulation here means the reconstruction of the received signal, as shown in Fig. 2.

Figure 2. Left: indirect ToF principle. Right: phase shift computation. Adapted from [15].
In the case of a sinusoidal signal, three parameters have to be calculated: the intensity B, the amplitude A, and the phase φ. Four sampling points τ0, τ1, τ2 and τ3 (intensity measurements, each shifted by 90°) are triggered relative to the emitted wave. Thus the phase shift φ can be calculated as:

\[
\varphi = \arctan\!\left(\frac{\tau_1 - \tau_3}{\tau_0 - \tau_2}\right) \tag{1}
\]

The phase shift φ is directly proportional to the distance D the light has traveled:

\[
D = \frac{\lambda_{mod}}{2} \cdot \frac{\varphi}{2\pi}, \quad \text{with } D < \lambda_{mod} \tag{2}
\]
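As a minimal numerical sketch of Eqs. 1-2 (not the sensor firmware), the snippet below recovers the distance from four 90°-shifted intensity samples; the 20 MHz modulation frequency is an assumption typical of this class of sensor, and arctan2 is used so the phase covers the full non-ambiguity interval:

```python
import numpy as np

C = 299792458.0          # speed of light (m/s)
F_MOD = 20e6             # assumed modulation frequency (Hz), typical for SR3000-class sensors
LAMBDA_MOD = C / F_MOD   # modulation wavelength lambda_mod (m)

def tof_depth(tau0, tau1, tau2, tau3):
    """Distance, amplitude A and intensity B from the four 90-degree-shifted samples."""
    phi = np.arctan2(tau1 - tau3, tau0 - tau2) % (2.0 * np.pi)  # phase shift, Eq. 1
    amplitude = 0.5 * np.hypot(tau1 - tau3, tau0 - tau2)        # A
    intensity = 0.25 * (tau0 + tau1 + tau2 + tau3)              # B
    depth = (LAMBDA_MOD / 2.0) * phi / (2.0 * np.pi)            # D, Eq. 2
    return depth, amplitude, intensity

# Synthetic check: a surface at 5.0 m yields a phase of 2*pi*5.0/(LAMBDA_MOD/2).
phi_true = 2.0 * np.pi * 5.0 / (LAMBDA_MOD / 2.0)
samples = [100.0 + 40.0 * np.cos(phi_true - k * np.pi / 2.0) for k in range(4)]
print(tof_depth(*samples))  # depth comes back close to 5.0 m
```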
The main issue with the depth measurement is its low accuracy: the displacement between the ground truth and the measured depth is large. For example, the SR 3000 may have a 0.35 m displacement when aiming at an object at 5.5 m. For 3D reconstruction, this displacement is not negligible and needs to be compensated in advance. This can be done by depth calibration, but so far only very delicate and expensive devices [15, 19] are available to do so.
Because (1) it is not feasible for us to have such expensive and sophisticated calibration devices and (2) such calibration accuracy is beyond what our 3D reconstruction requires, we propose an easier calibration procedure that takes advantage of our heterogeneous camera network design. First, since the RIM cameras can also produce intensity images, all the cameras can be geometrically calibrated using Bouguet's toolbox based on [28]. After the bundle adjustment, we know the relative pose of the cameras and the calibration checkerboard patterns in 3D space with absolute scale. The error is normally smaller than 1 cm, which is reasonable for our 3D reconstruction of human-size figures. We then perform the depth calibration of the RIM camera simply by finding a mapping function between these checkerboard pattern poses, taken as the ground truth, and the depth readouts from the RIM camera. A detailed example is given in Section 5.
Figure 3. General system dependency.
3. Problem Formulation
Assuming a set of calibrated sensors, in this section we introduce our probabilistic shape inference framework in detail. With the following notations, we can define our problem formally: given a set of synchronized observations O from n sensors at a specific time instant, we infer, for every 3D location X in an occupancy grid G spanning the 3D space, its probability of being occupied or not by the object that we are modeling. We denote this probability as p(G_X), with G_X the binary variable at X.
Notations:
n : total number of sensors
X : 3D location
i : sensor index
p : pixel index in sensor i corresponding to X
τ : prior
G_X : voxel occupancy at X in the occupancy grid G
O_i^p : observation at pixel p of sensor i
M_i : sensor i's model; in this paper we specifically consider consumer camcorders and 3D depth cameras
B : camcorder background model
D : RIM camera depth observation model
S_i^p : silhouette formation at pixel p of camcorder i
T_i^p : object's front-most surface location with respect to pixel p of RIM sensor i
P : silhouette sampling variable
R : silhouette detection external cause
An intuitive assumption made throughout this paper is that the space occupancy variable G_X ∈ {0, 1} depends only on the information along the optic rays that go through X. However, anti-aliasing effects need to be considered. We simply use the sampling window strategy introduced in [11], where a 3D voxel affects the formation of pixels within the sampling window, similar to a point spread function. Another common occupancy grid assumption, as in [8], is statistical independence between voxel occupancies; we compute each voxel occupancy likelihood independently for tractability. Results show that independent estimation, while not as exhaustive as a global search over all voxel configurations, still provides very robust and usable information at a much lower cost. Therefore, we model the sensor network relationships by computing the joint probability of these variables, p(G_X, O_{1,...,n}, M_{1,...,n}, τ), and propose the following decomposition, based on the statistical dependencies expressed in Fig. 3:
\[
p(G_X, O_{1,\dots,n}, M_{1,\dots,n}, \tau) = p(\tau)\, p(G_X \mid \tau) \prod_{i=1}^{n} p(M_i \mid \tau)\, p(O_i \mid G_X, M_i, \tau) \tag{3}
\]
• p(τ) represents the prior probabilities of our parameter set. Since we have no a priori reason to favor any parameter values, we set it to a uniform distribution. It thus disappears from any subsequent inference.

• p(G_X | τ) is the prior likelihood for occupancy, which is independent of all other variables except τ. We choose not to favor any voxel location and set this term to uniform in this paper.

• p(O_i | G_X, M_i, τ), or more specifically, given our aforementioned viewing-ray independence assumption, p(O_i^p | G_X, M_i^p, τ), represents the sensor observation probability.
Once the joint probability distribution has been fully determined, it is possible to use Bayes' rule to infer the probability distribution of our searched variable G_X, given the sensor models M and their observations O:

\[
p(G_X \mid O_{1,\dots,n}, M_{1,\dots,n}, \tau) = \frac{\prod_{i=1}^{n} p(O_i^p \mid G_X, M_i^p, \tau)}{\sum_{G_X} \prod_{i=1}^{n} p(O_i^p \mid G_X, M_i^p, \tau)} \tag{4}
\]
If we apply Eq. 4 at all locations and obtain this probabilistic volume G, we can simply reconstruct our 3D objects by extracting iso-probability surfaces, or more robustly by using state-of-the-art techniques such as graph-cut or level-set algorithms [24, 26]. The remaining problem is to define proper sensor models M so that the observation formation p(O_i | G_X, M_i, τ) in Eq. 4 is reasonable. So far, however, we have introduced a very general sensor fusion framework, which places no constraints on the sensor type or data type.
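To make the inference concrete, below is a minimal per-voxel sketch of Eq. 4 (not the authors' implementation), assuming uniform priors and that each sensor has already supplied its observation likelihood for the empty and occupied states of G_X:

```python
import numpy as np

def voxel_posterior(likelihoods):
    """Eq. 4 for one voxel: `likelihoods` holds one (l_empty, l_occupied) pair
    per sensor, i.e. p(O_i^p | G_X = 0, ...) and p(O_i^p | G_X = 1, ...).
    With uniform priors the posterior is the normalized product of these terms."""
    l_empty = np.prod([l0 for l0, l1 in likelihoods])
    l_occupied = np.prod([l1 for l0, l1 in likelihoods])
    return l_occupied / (l_occupied + l_empty)

# Hypothetical numbers: two camcorders mildly supporting occupancy,
# one RIM camera strongly supporting it.
print(voxel_posterior([(0.4, 0.6), (0.5, 0.5), (0.1, 0.9)]))  # ~0.93
```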
4. Sensor Models
In this section, we describe the probabilistic camcorder background model B and the RIM camera depth model D, which are used in our sensor network. Namely, we analyze the components of p(O_i^p | G_X, B_i^p, τ) and p(O_i^p | G_X, D_i^p, τ) for the two types of sensors respectively.
4.1. Camcorder Sensor Model
The sensor observation O_i^p of a camcorder is the color or intensity, not the silhouette S_i^p ∈ {0, 1} of the object being reconstructed. However, they are directly related, as shown in Fig. 4: the existence of an object at G_X determines the value of the object silhouette S_i^p, and the state of S_i^p, together with the background color appearance B_i^p, determines the color to be observed.

Figure 4. Camcorder dependency. Adapted from [11].
Since S_i^p is not directly observed by the camera, it is a latent variable and is marginalized as shown in Eq. 5:

\[
p(O_i^p \mid G_X, M_i^p, \tau) = \sum_{S_i^p} p(O_i^p \mid S_i^p, B_i^p, \tau)\, p(S_i^p \mid G_X, \tau) \tag{5}
\]
• p(O_i^p | S_i^p, B_i^p, τ) is the image formation term. If S_i^p = 0, then O_i^p can be explained by the background model B_i^p. In this paper, we model it as a normal distribution N(O_i^p; μ_i^p, σ_i^p) in RGB color space, where the distribution parameters (μ_i^p, σ_i^p) are trained in advance from a number of images of the empty scene without the object to be reconstructed. If S_i^p = 1, the pixel should display the foreground object's color. We set this case to a uniform distribution U_i^p, meaning any color can possibly be observed from the object that we are reconstructing. This sensor model is consistent with [11], which follows basic background subtraction algorithms [12, 9] without forcing a hard decision on a binary silhouette image, and is thus much more robust against sensor noise and environment lighting variations. (A minimal numerical sketch of this term is given after Fig. 5.)
• p(S_i^p | G_X, τ) is the silhouette formation term. It models the silhouette detection response of a single pixel sensor (i, p) to the occupancy state of G_X. In our discretized world, the assumption that a voxel lies on the viewing line of a pixel is uncertain. This may be due to many external causes: potential camera calibration errors, camera mis-synchronization, etc. This is modeled by a latent variable, the sampling variable P. Second, there can be causes for silhouette detection other than the voxel itself: an object occupancy other than the one related to G_X, which is modeled by another hidden variable, the external detection cause R. The complete dependencies are shown in Fig. 5. This relationship model is introduced in [11], where detailed formulations regarding these variables can be found.
Figure 5. Silhouette detection dependency. Adapted from [11].
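The following is a minimal numerical sketch of the camcorder term (Eq. 5) for a single pixel, not the authors' code: the background is a per-channel Gaussian, the foreground a uniform color distribution, and the silhouette formation probabilities p(S_i^p | G_X) are passed in as plain numbers rather than derived from the P and R variables of [11]; all names and values are hypothetical.

```python
import numpy as np

def camcorder_likelihood(obs_rgb, bg_mean, bg_std,
                         p_sil_given_occ, p_sil_given_empty,
                         u_foreground=1.0 / 255.0 ** 3):
    """Eq. 5 for one pixel: p(O | G_X) = sum_S p(O | S, B) p(S | G_X).
    Background (S = 0): per-channel Gaussian N(O; mu, sigma).
    Foreground (S = 1): uniform color distribution U.
    Returns (p(O | G_X = 0), p(O | G_X = 1))."""
    obs, mu, sigma = (np.asarray(v, dtype=float) for v in (obs_rgb, bg_mean, bg_std))
    p_obs_bg = np.prod(np.exp(-0.5 * ((obs - mu) / sigma) ** 2)
                       / (np.sqrt(2.0 * np.pi) * sigma))
    p_obs_fg = u_foreground

    def marginal(p_sil):  # marginalize the latent silhouette S
        return (1.0 - p_sil) * p_obs_bg + p_sil * p_obs_fg

    return marginal(p_sil_given_empty), marginal(p_sil_given_occ)

# Hypothetical pixel whose color is far from the background model,
# so the occupied hypothesis is favored.
print(camcorder_likelihood([200, 40, 40], [30, 30, 30], [8, 8, 8],
                           p_sil_given_occ=0.9, p_sil_given_empty=0.1))
```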
4.2. RIM Camera Sensor Model
For a RIM camera, the observation O_i^p is the depth measurement. Analogous to the silhouette variable S_i^p of the camcorder sensor, here we also introduce a latent variable T_i^p to model the front-most surface of the object with respect to the RIM camera. The relationship between the sensor variables is shown in Fig. 6: the existence of an object at G_X affects the front-most surface location T_i^p for a given RIM camera i, and T_i^p in turn directly affects the depth measurement.
Figure 6. RIM camera dependency.
Because T_i^p is a latent variable, we also need to marginalize over it. However, T_i^p is not a binary variable like its camcorder counterpart, the silhouette S_i^p; its range spans all possible locations along the viewing direction. Namely, T_i^p ∈ [0, d_max], with 0 being the RIM camera optical center and d_max the largest detectable distance of the RIM camera.

\[
p(O_i^p \mid G_X, M_i, \tau) = \int_{0}^{d_{max}} p(O_i^p \mid T_i^p, D_i^p, \tau)\, p(T_i^p \mid G_X, \tau)\, dT_i^p \tag{6}
\]
• p(O_i^p | T_i^p, D_i^p, τ) is the depth measurement term. It depicts how precise the RIM camera depth measurement is. We use a normal distribution N(T_i^p, σ) to model it, where σ is trained from the depth calibration process or obtained from the camera manual.
• p(T_i^p | G_X, τ) is the surface formation term. Assume every voxel is independent along the viewing direction of length d_max, and any place on the viewing ray has an equal chance of 1/d_max of being the front-most point. Now, if G_X = 1, the front-most surface position T_i^p still has a chance of 1/d_max of being at any position in front of X, namely T_i^p < d_X − ε, where ε → 0. But this is not the case for positions behind X, because X is already blocking the viewing ray. Eqs. 7 and 8 show the complete scenario, with d_X being the distance from X to the RIM camera. Both distributions p(T_i^p | [G_X = 1], τ) and p(T_i^p | [G_X = 0], τ) must sum up to 1.

\[
p(T_i^p \mid [G_X = 1], \tau) =
\begin{cases}
1/d_{max} & \text{if } T_i^p < d_X - \epsilon \\
(1 - d_X/d_{max})/\epsilon & \text{if } d_X - \epsilon \le T_i^p \le d_X \\
0 & \text{if } T_i^p > d_X
\end{cases} \tag{7}
\]

\[
p(T_i^p \mid [G_X = 0], \tau) = 1/d_{max} \tag{8}
\]
To get an intuitive idea of the RIM camera model, imagine we have a single-pixel RIM camera with a depth detection standard deviation σ = 0.3 m and a maximum detection range of 8 m. If the current sensor readout is 5.0 m, then according to our RIM sensor model we can plot the space occupancy probability p(G_X | O, D, τ) along the viewing ray as in Fig. 7, given Eqs. 3-4 and 6-8. This means the object most likely exists at 5 m, the observed depth. Regions in front of it should be free of any object and visible up to the camera. Regions behind 5 m remain totally uncertain, 0.5, because we have no idea whether there is matter behind the surface or not. The peak falls off smoothly in both directions because of the limited sensor precision. This plot is consistent with the depth sensor models described in other literature such as [5, 22].
Figure 7. Space occupancy probability p(G_X | O, D, τ) along one viewing ray (0-8 m) given a RIM camera readout of 5.0 m. It is a longitudinal cut of the probabilities along one viewing ray in the grid.
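As a sketch only (not the authors' code), the following reproduces the curve in Fig. 7 by combining Eqs. 3-4 with Eqs. 6-8 in the ε → 0 limit, assuming σ = 0.3 m, d_max = 8 m, and a uniform occupancy prior:

```python
import numpy as np
from scipy.stats import norm

def rim_occupancy_posterior(d_x, readout, sigma=0.3, d_max=8.0):
    """p(G_X | O, D, tau) for a voxel at distance d_x along one viewing ray,
    combining Eqs. 3-4 with Eqs. 6-8 in the epsilon -> 0 limit."""
    # p(O | G_X = 1): surface uniformly anywhere in front of X (Eq. 7, first case),
    # plus the remaining mass (1 - d_x/d_max) concentrated at X itself.
    lik_occupied = ((norm.cdf(d_x, readout, sigma) - norm.cdf(0.0, readout, sigma)) / d_max
                    + (1.0 - d_x / d_max) * norm.pdf(readout, d_x, sigma))
    # p(O | G_X = 0): surface uniformly anywhere along the ray (Eq. 8).
    lik_empty = (norm.cdf(d_max, readout, sigma) - norm.cdf(0.0, readout, sigma)) / d_max
    return lik_occupied / (lik_occupied + lik_empty)

# Shape of Fig. 7: ~0 in front of the measured surface, a peak around 5 m,
# and ~0.5 behind it.
for d in (1.0, 4.5, 5.0, 5.5, 7.0):
    print(d, round(rim_occupancy_posterior(d, 5.0), 2))
```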
5. Experiment and Result
We acquire two sets of data to test our proposed calibration and heterogeneous sensor framework. Without loss of generality, for the camcorders and RIM cameras we use the Canon HG10 and the Swiss Ranger 3100 respectively. The Canon HG10 DV camcorders are set to run at 25 fps with an image resolution of 1920 × 1080 pixels. The Swiss Ranger 3100 cameras are set to run at 5 fps with an image resolution of 176 × 144 pixels. The dataset specifications are listed below. For SENSOR NETWORK II, in order to prevent interference between the multiple RIM cameras, their modulation frequencies are manually set to 19 MHz, 20 MHz and 21 MHz respectively. Although this setting affects the maximum detection depth of each camera, the minimal range of 7.1 m [14] is still beyond our reconstruction volume range of 6 m. Both datasets use an occupancy volume, with sizes listed below.
                     Canon HG10   SR 3100   Volume size
Sensor Network I          3          1      128 × 256 × 128
Sensor Network II         6          3      128 × 128 × 128
5.1. Depth Calibration Evaluation
Figure 8. Distance linear fitting result: measured distance (mm) vs. ground truth distance (mm), showing black pixel positions with their linear fit, white pixel positions with their linear fit, and the ground truth line. Since the data points are close to one another, this figure is a zoomed-in view, and not all the fitting data is shown here. Best viewed in color.
We show the depth calibration procedure with Sensor Network I, the 3-camcorder and 1-RIM-camera setup. 24 checkerboard poses seen by all 4 cameras are used for the geometric calibration. After using Bouguet's MATLAB calibration toolbox [3] to recover the intrinsic and extrinsic parameters of each camera, we perform a bundle adjustment to globally optimize the camera parameters and checkerboard poses. The resulting average pixel re-projection error is within 0.69 pixels. Since we now know the absolute checkerboard poses, we can compute the distances of points on the checkerboard to the SR 3100 as the ground truth, and compare these values with the SR 3100 sensor measurements. Specifically, for every black square and white square of the checkerboard pattern, we compare its center point's measured distance against the ground truth. As shown in Fig. 8, the measurements do have a linear deviation from the ground truth. If we analyze the black and white patterns of the checkerboard separately, the distance measure is also affected by the received infrared intensity: the darker the pixel intensity, the less accurate the measurement. So we fit lines to the white and black pixel measurements separately. Thus we get two mapping functions from the measured distance to the ground truth, given below. The computed standard deviations reflect the uncertainty of the measurement, and are also used as the standard deviation in N(T_i^p, σ) of the depth measurement term of Eq. 6.
\[
d_{correct} = a \cdot d_{measure} + b
\]
with a_black = 0.8823, b_black = 55.27, σ_black = 9.131 for black pixels, and a_white = 0.9666, b_white = 22.70, σ_white = 6.168 for white pixels.
These are our depth calibration functions. They are used to correct the depth measurements in the 3D reconstruction later on. Given a certain pixel intensity, we obtain the specific depth correction line parameters a_pixel and b_pixel by linear interpolation between (a_black, a_white) and (b_black, b_white) respectively. It is a very simple solution, but its effectiveness is shown by the reconstruction results in the next section. However, with our geometrically calibrated camcorders, more delicate analyses could be performed to explicitly model the relationship between the distance measurement and intensity changes, etc., similar to [19].
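A minimal sketch of this calibration step, assuming checkerboard distances in millimetres and a pixel intensity normalized to [0, 1]; the fit constants are the black/white parameters reported above, and the function names are ours, not the authors':

```python
import numpy as np

def fit_depth_mapping(measured_mm, ground_truth_mm):
    """Least-squares line d_correct = a * d_measure + b, plus the residual
    standard deviation that feeds sigma in the depth term of Eq. 6."""
    a, b = np.polyfit(measured_mm, ground_truth_mm, 1)
    residuals = np.asarray(ground_truth_mm) - (a * np.asarray(measured_mm) + b)
    return a, b, np.std(residuals)

def correct_depth(d_measure_mm, intensity,
                  a_black=0.8823, b_black=55.27,
                  a_white=0.9666, b_white=22.70):
    """Per-pixel correction: interpolate the line parameters between the
    black-pixel and white-pixel fits according to the normalized intensity."""
    w = np.clip(intensity, 0.0, 1.0)         # 0 = dark checker, 1 = bright checker
    a = (1.0 - w) * a_black + w * a_white
    b = (1.0 - w) * b_black + w * b_white
    return a * d_measure_mm + b

print(correct_depth(900.0, 0.5))             # corrected distance in mm
```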
5.2. Sensor Network I result
We have two static reconstructions with this 4-camera setup: an office chair with two boxes, and a sitting person. The output of the algorithm is a probabilistic volume; for visualization purposes, the volume surfaces are extracted at an arbitrary iso-probability of 87%, and the results are shown in Fig. 9 and Fig. 10. The reconstructions from our proposed framework preserve detailed concavities and significantly improve the quality of the result over the 3 camcorders alone (the probabilistic visual hull). More delicate surface extraction schemes can be applied to get better object shapes, but this is beyond the scope of this paper.
Figure 9. An office chair with two boxes. Top: the four camera views. Bottom: the 3-camcorder probabilistic visual hull and the 4-camera fusion result with our proposed algorithm. The calibrated camera configuration is also shown, with #2 the SR3100, and #1, #3 and #4 the Canon HG10s.

Figure 10. A sitting person. The same configuration as Fig. 9.

In order to evaluate the depth calibration, we compute another two volumes with only the SR3100 camera turned on, one with the depth correction and the other without it. Then, together with the 3-camcorder volume (the probabilistic visual hull), we extract three horizontal slices
at the same height level (the level of the head), as shown in Fig. 11; the intensity denotes the occupancy probability, with 0 being black and 1 being white. In theory, the actual object surface should be tangent to the visual hull of the object [16]. After overlaying the visual hull slice on the two SR3100 slices, we see that of the two SR3100 volumes, the one without depth calibration is not tangent to the 3-camcorder visual hull surface, which in effect would carve away voxels that are actually on the reconstructed object. The volume after depth calibration, however, is tangent to the visual hull, demonstrating the necessity and effectiveness of our depth calibration procedure. The roughly horizontal white lines in Fig. 11 (b) and (c) are the wall position at the back of the person. Their thickness reflects the sensor measurement uncertainty, similar to the peak in Fig. 7.
5.3. Sensor Network II result
For this 9-camera network, we also have two reconstructions: a person with a rubber ball, and a crowd of 5 people. The number of cameras in use is not chosen by design; it is simply based on the number of sensors available. Admittedly though, more detailed information can be obtained with more sensors, and this really helps in challenging cases such as very cluttered scenes. The results are shown in Fig. 12 and Fig. 13 respectively. The camera calibration procedures are the same as for the previous dataset. The recovered camera poses are shown in Fig. 12, with red cones denoting the three SR3100s. The reconstructed ball in Fig. 12 has a diameter of 60 cm, which is quite close to the actual value of 57.06 cm given the low volume resolution. This again shows the power of our depth calibration. A more challenging example is Fig. 13, where 5 people are densely clustered in the space. Without the depth information to recover the concavities, the visual hull would fail at the reconstruction task. One thing to note is that the missing forearms are of sub-voxel size. They can be recovered if we increase the volume resolution in those places.
Figure 11. Horizontal slices at the head level of the sitting person data, panels (a)-(e); cameras are looking downwards from the top. (a) 3-camcorder volume. (b) SR3100 volume without depth correction. (c) SR3100 volume with depth correction. (d) overlays (a) onto (b): there is a big gap between the two. (e) overlays (a) onto (c): they are tangent. The red line on the first row shows the depth measure difference before and after depth calibration. From (d) and (e), it is shown that our depth correction gives a more accurate front-most surface, which should be tangent to the visual cone.
Figure 12. Top: the camera settings. Bottom: the reconstruction of a person with a rubber ball. Best viewed in color.

Figure 13. The reconstruction of the densely populated scene from all 9 sensors, with concavities and details. The visual hull fails in this case, resulting in an indistinguishable blob.

6. Discussion

In this paper, we propose a new heterogeneous sensor network of camcorders and RIM cameras for multi-view 3D object reconstruction. To achieve more accurate distance measurements, we carry out a new RIM camera depth calibration method as a simple extension of the conventional geometric camera calibration process. We then propose a novel probabilistic sensor fusion framework to robustly relate camcorder silhouette cues and RIM camera depth images, and improve the reconstruction quality significantly compared with the result using either type of sensor alone. RIM cameras are thus shown for the first time to be a very promising new type of sensor for accurate multi-view 3D reconstruction, besides their proposed usage in object detection, tracking, etc. For the camcorders, similar to [11], no explicit silhouette extraction is needed. More importantly, our sensor fusion framework is general and not limited to silhouette cues or depth images; it also applies to disparity maps of stereo camera pairs or 3D point clouds of LIDAR sensors, etc., as long as the proper sensor model is provided. Also, using our camcorder-RIM camera platform, similar to our depth calibration process, more delicate experiments could be carried out with the guidance of the geometrically calibrated camcorders to analyze the RIM camera's impulse-response properties, such as depth measure variation with respect to infrared light incident angle or material reflectance of the object, as described in [15]. Finally, considering the computation time of our volumetric framework, most of the computation can be parallelized on the GPU. Also, given the high frame rate of both the camcorders and the RIM cameras, dynamic scenes can be recovered in real time.
Acknowledgments: We would like to thank Rolf Adelsberger, Prof. Markus Gross, Tobias Kohoutek and Prof. Hilmar Ingensand for resource support and technical discussion. This work was partially supported by a David and Lucille Packard Foundation Fellowship, and NSF Career award IIS-0237533.
References
[1] C. L. Bajaj, F. Bernardini, and G. Xu. Automatic reconstruction of surfaces and scalar fields from 3D scans. Computer Graphics, 1995.
[2] B. Baumgart. Geometric modeling for computer vision. PhD thesis, CS Dept., Stanford U., 1974.
[3] J.-Y. Bouguet. Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calibdoc/.
[4] G. Casciola, D. Lazzaro, L. B. Montefusco, and S. Morigi. Fast surface reconstruction and hole filling using positive definite radial basis functions. Numerical Algorithms, 2005.
[5] C. Coué. Modèle bayésien pour l'analyse multimodale d'environnements dynamiques et encombrés : application à l'assistance à la conduite en milieu urbain. PhD thesis, Institut National Polytechnique de Grenoble, 2003.
[6] B. Curless and M. Levoy. A volumetric method for building complex models from range images. Computer Graphics, 1996.
[7] J. Davis, S. R. Marschner, M. Garr, and M. Levoy. Filling holes in complex surfaces using volumetric diffusion. 3DPVT, 2001.
[8] A. Elfes. Occupancy grids: a probabilistic framework for robot perception and navigation. PhD thesis, CMU, 1989.
[9] A. Elgammal, D. Harwood, and L. Davis. Non-parametric model for background subtraction. ECCV, 2000.
[10] J.-S. Franco and E. Boyer. Exact polyhedral visual hulls. BMVC, 2003.
[11] J.-S. Franco and E. Boyer. Fusion of multi-view silhouette cues using a space occupancy grid. ICCV, 2005.
[12] W. Grimson and C. Stauffer. Adaptive background mixture models for real-time tracking. CVPR, 1999.
[13] A. Hilton, A. Stoddart, J. Illingworth, and T. Windeatt. Implicit surface based geometric fusion. CVIU, 1998.
[14] MESA Imaging. Swiss Ranger 3000 help document, 1.0.8.x, miniature 3D time-of-flight camera. [email protected].
[15] T. Kahlmann. Range imaging metrology: investigation, calibration and development. PhD thesis, ETH Zurich, 2007.
[16] A. Laurentini. The visual hull concept for silhouette-based image understanding. PAMI, 1994.
[17] S. Lazebnik, Y. Furukawa, and J. Ponce. Projective visual hulls. IJCV, 2007.
[18] M. Li, H. Schirmacher, M. Magnor, and H.-P. Seidel. Combining stereo and visual hull information for on-line reconstruction and rendering of dynamic scenes. Computer Graphics, 1996.
[19] M. Lindner and A. Kolb. Calibration of the intensity-related distance error of the PMD ToF-camera. SPIE, 2007.
[20] R. Sablatnig, M. Kampel, and S. Tosovic. Fusion of surface and volume data. OAGM, 2002.
[21] D. Margaritis and S. Thrun. Learning to locate an object in 3D space from a sequence of camera images. ICML, 1998.
[22] K. Pathak, A. Birk, J. Poppinga, and S. Schwertfeger. 3D forward sensor modeling and application to occupancy grid based sensor fusion. IROS, 2007.
[23] R. Sablatnig, S. Tosovic, and M. Kampel. Combining shape from silhouette and shape from structured light for volume estimation of archaeological vessels. ICPR, 2002.
[24] D. Snow, P. Viola, and R. Zabih. Exact voxel occupancy with graph cuts. CVPR, 2000.
[25] R. Szeliski. Rapid octree construction from image sequences. Computer Vision, Graphics and Image Processing, 1993.
[26] R. Whitaker. A level-set approach to 3D reconstruction from range data. IJCV, 2004.
[27] Y. Yemez and C. J. Wetherilt. A volumetric fusion technique for surface reconstruction from silhouettes and range data. CVIU, 2007.
[28] Z. Zhang. A flexible new technique for camera calibration. PAMI, 2000.