Real-time scene reconstruction and triangle mesh generation using multiple RGB-D cameras

Siim Meerits, Vincent Nozick, Hideo Saito

Journal of Real-Time Image Processing, Springer Verlag, 2019. doi:10.1007/s11554-017-0736-x. HAL: hal-01638241
where $p(x, y)$ is the range image pixel's 3D point location in a global coordinate system, $g_x$ is the horizontal gradient, and $g_y$ is the vertical gradient.

In practice, some gradient calculations may include points with invalid data from an RGB-D sensor (i.e., a hole in the depth map). In that case, the gradients $g_x$ and $g_y$ are marked as invalid. Another issue is that the points used in gradient calculation might be part of different surfaces. We detect such situations by checking whether the distance between points is more than a constant value $d$. In these cases, both the $g_x$ and $g_y$ gradients are marked as invalid for the particular coordinate.
Next, the unnormalized normal in global coordinates is calculated as a cross product,

$$u(x, y) = \sum_{i,j} g_x(i, j) \times \sum_{i,j} g_y(i, j), \qquad (3)$$

where the sums are taken over a local neighborhood of points around $(x, y)$. Any gradients marked invalid are excluded from the sums. Finally, we normalize $u(x, y)$ so that

$$n(x, y) = \frac{u(x, y)}{\|u(x, y)\|}, \qquad (4)$$

which is the surface normal result.
The neighborhood area in the sum of Eq. 3 is typically very small, e.g., $3 \times 3$. This area is insufficient for computing high-quality normals. However, in a later surface reconstruction phase of our pipeline, we use weighted averaging of the normals. This is done over a much larger support area, e.g., $9 \times 9$, which results in good normals.
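For illustration, the following Python sketch implements the gradient computation and Eqs. 3–4, including the validity rules above. The function name, array layout, and parameters are our own assumptions, not code from the paper.

```python
import numpy as np

def estimate_normals(points, valid, d_max, radius=1):
    """Sketch of the gradient-based normal estimation (Eqs. 3-4).

    points: (H, W, 3) range-image points in global coordinates.
    valid:  (H, W) boolean mask of pixels with depth data.
    d_max:  maximum point-to-point distance d for a gradient to count.
    radius: half-width of the local neighborhood (radius=1 gives 3x3).
    """
    H, W, _ = points.shape
    # Forward-difference gradients g_x and g_y.
    gx = points[:, 1:] - points[:, :-1]          # (H, W-1, 3)
    gy = points[1:, :] - points[:-1, :]          # (H-1, W, 3)
    # A gradient is invalid if either point is missing or the pair
    # straddles two different surfaces (distance above d).
    gx_ok = valid[:, 1:] & valid[:, :-1] & (np.linalg.norm(gx, axis=2) <= d_max)
    gy_ok = valid[1:, :] & valid[:-1, :] & (np.linalg.norm(gy, axis=2) <= d_max)
    gx[~gx_ok] = 0.0
    gy[~gy_ok] = 0.0

    normals = np.zeros_like(points)
    for y in range(radius, H - radius - 1):
        for x in range(radius, W - radius - 1):
            sl = np.s_[y - radius:y + radius + 1, x - radius:x + radius + 1]
            # Eq. 3: cross product of the summed gradients over the window.
            u = np.cross(gx[sl].sum(axis=(0, 1)), gy[sl].sum(axis=(0, 1)))
            norm = np.linalg.norm(u)
            if norm > 0.0:
                normals[y, x] = u / norm         # Eq. 4
    return normals
```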
Fig. 2 Example of a system setup with two RGB-D cameras. The shaded pyramids with red edges show the camera's field of view and range.

Fig. 3 Comparison of normal calculation methods on real data. Red and blue lines denote normals of points from different RGB-D cameras. Note that the gradient method gives results similar to PCA, but with much faster computation.

Figure 3 shows a comparison of principal component analysis (PCA)-based normal estimation [26] and our selected gradient method. A similar estimation radius was selected for both methods. While there are differences in the initial normals, the final result after MLS smoothing is practically identical. It thus makes sense to use the faster gradient-based normal estimation.
4 Moving least squares surface reconstruction
MLS methods are designed to smooth point clouds to
reduce noise that may have been introduced by RGB-D
sensors. They require a point cloud with normals as input
and produce a new point cloud with refined points and
normals.
We choose to follow a group of MLS methods that approximates the surface around a point $x$ in space as an implicit function $f: \mathbb{R}^3 \to \mathbb{R}$ representing the algebraic distance from a 3D point to the surface. This method is an iterative process consisting of two main components: an estimation of the implicit function $f$ (and its gradient if necessary) and an optimization method to project points onto the implicit surface defined by $f$. We first use a well-established method to estimate an implicit surface from the point cloud and then project the same points to this surface with our own projection approach.
4.1 Surface estimation
Following Alexa et al. [3], we compute the average point location $a$ and normal $n$ at point $x$ as

$$a(x) = \frac{\sum_i w(\|x - x_i\|)\, x_i}{\sum_i w(\|x - x_i\|)} \qquad (5)$$

and

$$n(x) = \frac{\sum_i w(\|x - x_i\|)\, n_i}{\sum_i w(\|x - x_i\|)}, \qquad (6)$$

where $w(r)$ is a spatial weighting function and $n_i$ are the normals calculated in Sect. 3. As in Guennebaud and Gross [22], we use a fast Gaussian function approximation for the weight function, defined as

$$w(r) = \left(1 - \left(\frac{r}{h}\right)^2\right)^4, \qquad (7)$$

where $h$ is a constant smoothing factor. Finally, the implicit surface function is obtained as

$$f(x) = n(x)^T (x - a(x)). \qquad (8)$$
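For illustration, a compact Python sketch of Eqs. 5–8 follows. The array shapes and names are our assumptions; how the neighbors are gathered is covered next.

```python
import numpy as np

def implicit_distance(x, pts, nrms, h):
    """Sketch of Eqs. 5-8. pts and nrms are (N, 3) arrays holding the
    neighbors x_i and their normals n_i within radius h of x."""
    r = np.linalg.norm(pts - x, axis=1)
    w = np.where(r < h, (1.0 - (r / h) ** 2) ** 4, 0.0)    # Eq. 7
    if w.sum() == 0.0:
        return None                                         # no support
    a = (w[:, None] * pts).sum(axis=0) / w.sum()            # Eq. 5
    n = (w[:, None] * nrms).sum(axis=0) / w.sum()           # Eq. 6
    n /= np.linalg.norm(n)   # normalizing keeps f an approximate distance
    return float(n @ (x - a)), a, n                         # Eq. 8
```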
The sums in Eqs. 5–6 are taken over all points in the vicinity of $x$. Due to the cutoff range of the weighting function $w(r)$, considering points within radius $h$ around $x$ is sufficient. Traditionally, the points $x_i$ and normals $n_i$ are stored in spatial data structures such as a k-d tree or octree. While fast, these spatial lookups still have the biggest impact on MLS performance. Kuster et al. [33] propose storing point and normal data as two-dimensional arrays similar to range images. In that case, a lookup operation consists of projecting the search point to every camera and retrieving an $s \times s$ block of points around the projected coordinates, where $s$ is known as the window size. This allows for very fast lookups and is cache friendly.
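A sketch of this lookup pattern is given below, under assumed per-camera fields (a 3×4 projection matrix `P` and point/normal maps); these names are illustrative, not from the paper.

```python
import numpy as np

def gather_neighbors(x, cameras, s):
    """Project x into every camera and take the s x s block of stored
    points and normals around the projected pixel (after [33])."""
    half = s // 2
    pts, nrms = [], []
    for cam in cameras:
        q = cam.P @ np.append(x, 1.0)            # project to image plane
        if q[2] <= 0.0:
            continue                              # behind the camera
        u, v = int(q[0] / q[2]), int(q[1] / q[2])
        H, W, _ = cam.points.shape
        if half <= u < W - half and half <= v < H - half:
            block = np.s_[v - half:v + half + 1, u - half:u + half + 1]
            pts.append(cam.points[block].reshape(-1, 3))
            nrms.append(cam.normals[block].reshape(-1, 3))
    if not pts:
        return np.empty((0, 3)), np.empty((0, 3))
    return np.concatenate(pts), np.concatenate(nrms)
```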
To achieve temporal stability, we follow Kuster et al. [33], who propose extending $x$ to a four-dimensional vector that contains not only spatial coordinates but also a time value. Every frame received from a camera has a timestamp that is assigned to the fourth coordinate of all points in the frame. Hence, it is now possible to measure the spatial and temporal distance of any two points. This allows us to use multiple consecutive depth frames in a single MLS calculation. The weighting function $w(r)$ guarantees that points from newer frames have more impact on the reconstruction, while older frames have less. In our system, the number of frames used is a fixed parameter $f_{num}$ selected experimentally. Also note that the time value should be scaled to achieve the desired temporal smoothing.
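As a minimal illustration, a spatiotemporal distance of this kind can be computed as below; the `time_scale` parameter is our stand-in for the time scaling mentioned above.

```python
import numpy as np

def spatiotemporal_distance(p, q, time_scale):
    """Distance between two 4D points (x, y, z, t). The time axis is
    scaled so that w(r) discounts older frames more strongly."""
    d = (p - q).astype(float)
    d[3] *= time_scale
    return float(np.linalg.norm(d))
```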
4.2 Projecting to surface
Alexa and Adamson [1] present multiple ways of projecting points to the implicit surface. One core concept of this work is that the implicit function $f(x)$ can be understood as a distance from an approximate surface tangent frame defined by the point $a(x)$ and normal $n(x)$. This means that we can project a point $x$ to this tangent frame along the normal vector $n$ using

$$x' = x - f n. \qquad (9)$$

We call this the simple projection. Since the tangent frame is only approximate, the procedure needs to be repeated. On each iteration, the surface tangent frame estimation becomes more accurate as a consequence of the spatial weighting function $w(r)$. Another option is to propagate points along the gradient of $f(x)$. This is called orthogonal projection.
Instead of following normal vectors or the $f(x)$ gradient to the surface, we constrain the iterative optimization to the line between the initial point location and the camera's viewpoint. Given a point $x$ to be projected to a surface and a camera viewpoint $v$, the projection follows the vector $d$ defined as

$$d = \frac{v - x}{\|v - x\|}. \qquad (10)$$

Our novel viewpoint projection operator projects a point to the tangent frame in direction $d$ instead of $n$ as in Eq. 9.
With the use of some trigonometry, the projection operator becomes

$$x' = x - \frac{f d}{n^T d}. \qquad (11)$$
This operator works similarly to simple projection when $d$ is close to $n$ in value. However, the optimization cannot easily converge when $n$ and $d$ are close to a right angle. Conceptually, the closest surface is in a direction in which we do not allow the point to move. Dividing by $n^T d$ may propel the point extremely far, well beyond the local area captured by the implicit function $f$. We thus limit each projection step to the distance $h$ (also used in the spatial weighting function of Eq. 7). This results in a search through space to find the closest acceptable surface.

If a point does not converge to a surface after a fixed number of iterations $i_{max}$, we consider the projection to have failed and the point is discarded. This is desired behavior and indicates that the particular point is not required. The pseudocode for the projection method is listed in Algorithm 2.
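The following Python sketch reconstructs the projection loop in the spirit of the description above (Eqs. 10–11); it is our illustrative reading, not the paper's Algorithm 2. `eval_surface` stands in for the MLS tangent frame estimation of Eqs. 5–8.

```python
import numpy as np

def viewpoint_project(x, v, eval_surface, h, i_max, eps=1e-4):
    """Viewpoint projection sketch. eval_surface(x) is assumed to return
    (f, n): algebraic distance and unit normal of the local tangent
    frame at x, or None when x has no support."""
    d = (v - x) / np.linalg.norm(v - x)            # Eq. 10
    for _ in range(i_max):
        frame = eval_surface(x)
        if frame is None:
            return None                            # point left the data
        f, n = frame
        if abs(f) < eps:
            return x                               # converged onto surface
        denom = float(n @ d)
        if abs(denom) < 1e-6:
            return None                            # d nearly tangent: reject
        # Eq. 11, with the step clamped to h so near-right angles between
        # n and d cannot propel the point outside f's local support.
        x = x - np.clip(f / denom, -h, h) * d
    return None                                    # failed after i_max steps
```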
Figure 4 shows a visualization of the different projection methods, and Figure 5 shows them in action (except for orthogonal projection, which is computationally more expensive). Our method results in a more regular grid of points on a surface than the normals-based simple projection. This process is crucial in making the final mesh temporally stable. Moreover, the stability of the distances between points is a key condition for computing the mesh connectivity in the next section.
5 Mesh generation
The purpose of mesh generation is to take refined points
produced by MLS and turn them into a single consistent
mesh of triangles. The approach is to first generate initial
triangle meshes for every RGB-D camera separately and
then join those meshes to get a final result.
Our proposed method is inspired by a mesh zippering
method pioneered by Turk and Levoy [48]. This method
was further developed by Marras et al. [38] to enhance
output quality and remove some edge-case meshing errors.
Both zippering methods accept initial triangle meshes as
input and produce a single consistent mesh as output.
Conceptually, they work in three phases:

1. Erosion: remove triangles from the meshes so that overlapping mesh areas are minimized.
2. Clipping: in areas where two meshes meet and slightly overlap, clip the triangles of one mesh against the triangles of the other so that overlap is completely eliminated.
3. Cleaning: retriangulate areas where different meshes connect to increase mesh quality.
Prior zippering work did not consider the parallelization of
these processes. As such, we need to modify the approach
to be suitable for GPU execution.
The mesh erosion process of zippering utilizes a global
list of triangles. The main operation in this phase is
deleting triangles. If parallelized, the triangle list would
need to be locked during deletions to avoid data corruption.
Fig. 4 Visualization of different surface projection methods: a orthogonal projection, b simple projection, c viewpoint projection (ours).

Fig. 5 Comparison of simple projection (left) and our viewpoint projection (right) for the same depth map patch. The latter shows excellent temporal stability.
For this reason, we introduce a new data structure in which triangles are not deleted, only updated to reflect a new state. This allows for completely lockless processing on GPUs. A similar issue arises with mesh clipping, as it would require locking access to multiple triangles to carry out the clipping. To counter this, we replace mesh clipping with a process we call mesh merging, which updates only one triangle at a time and thus does not require locking. The last step of our mesh generation process is to turn our custom data structures back into a traditional triangle list for rendering or other processing. We call this final mesh generation; this step also takes on the role of the mesh cleaning seen in previous works. Our mesh generation consists of the following steps:
1. Initial mesh generation: create a separate triangle mesh for every RGB-D camera depth map.
2. Erosion: detect areas where two or more meshes overlap (but do not delete triangles as in zippering).
3. Merging: locate points where meshes are joined.
4. Final mesh generation: extract a single merged mesh.
The next sections discuss each of these points in detail.
5.1 Initial mesh generation
The first step of our meshing process is to generate a triangle mesh for each depth map separately. In practice, we join neighboring pixels in the depth map to form the triangle mesh. The idea was proposed in Hilton et al. [24] and is in widespread use. We follow Holz and Behnke [25] to generate triangles adaptively.
Triangles can be formed inside a cell, which is made out of four neighboring depth map points (henceforth called vertices). A cell at depth map coordinates $(x, y)$ consists of vertices $v_{00}$ at $(x, y)$, $v_{10}$ at $(x+1, y)$, $v_{01}$ at $(x, y+1)$ and $v_{11}$ at $(x+1, y+1)$. Edges are formed between vertices as follows: $e_u$ between $v_{00}$ and $v_{10}$, $e_r$ between $v_{10}$ and $v_{11}$, $e_b$ between $v_{01}$ and $v_{11}$, $e_l$ between $v_{00}$ and $v_{01}$, $e_z$ between $v_{10}$ and $v_{01}$, and $e_x$ between $v_{00}$ and $v_{11}$. An edge is valid only if both its vertices are valid and their distance is below a constant value $d$. The maximum edge length restriction acts as a simple mesh segmentation method, e.g., it ensures that two objects at different depths are not connected by a mesh.

Connected loops made out of edges form triangle faces. A cell can have six different triangle formations, as illustrated in Fig. 6. For example, the type 1 form is made out of edges $e_u, e_z, e_l$. However, ambiguity can arise when all possible cell edges are valid. In this situation, we select either type 3 or type 6 depending on whether edge $e_x$ or $e_z$ is shorter. Since the triangles for a single cell can be stored in just one byte, this representation is highly compact.
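A Python sketch of this per-cell triangulation follows; the edge and validity tests match the description above, but the mapping of edge sets to Fig. 6's type numbers is our own illustrative guess.

```python
import numpy as np

def triangulate_cell(verts, ok, d_max):
    """verts maps '00', '10', '01', '11' to 3D positions; ok holds the
    corresponding validity flags; d_max is the edge length limit d.
    Returns a list of vertex-key triples for one depth-map cell."""
    def edge(a, b):
        return ok[a] and ok[b] and np.linalg.norm(verts[a] - verts[b]) <= d_max

    eu, er = edge('00', '10'), edge('10', '11')
    eb, el = edge('01', '11'), edge('00', '01')
    ez, ex = edge('10', '01'), edge('00', '11')    # the two diagonals

    # When both diagonals are valid the cell is ambiguous; keep only the
    # shorter diagonal (the type 3 vs type 6 choice of Fig. 6).
    if ez and ex:
        if np.linalg.norm(verts['00'] - verts['11']) < \
           np.linalg.norm(verts['10'] - verts['01']):
            ez = False
        else:
            ex = False

    tris = []
    if ez:
        if eu and el:
            tris.append(('00', '10', '01'))        # type 1: e_u, e_z, e_l
        if er and eb:
            tris.append(('10', '11', '01'))
    elif ex:
        if eu and er:
            tris.append(('00', '10', '11'))
        if eb and el:
            tris.append(('00', '11', '01'))
    return tris                                    # empty list: type 0 cell
```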
5.2 Erosion
The initially generated meshes often cover the same surface area twice or more due to the overlap of RGB-D camera views. Mesh erosion detects redundant triangles in those areas; more specifically, erosion labels all initial meshes into visible and shadow mesh parts. This labeling is based on the principle that each overlapping area should contain only one mesh that is marked visible; the remaining meshes are categorized as shadow meshes. In previous mesh zippering methods, redundant triangles were simply deleted or clipped. In our method, we keep those triangles in shadow meshes for later use in the mesh merging step.
To segment a mesh into visible and shadow parts, we start from the basic building block of a mesh: the vertex. All vertices are categorized as visible or shadow by projecting them onto other meshes. Next, if an initial mesh edge has at least one shadow vertex, the edge is considered a shadow edge. Finally, if a triangle face has a shadow edge, it is a shadow triangle.

Note that if we were to project each vertex onto every other mesh, we would end up with only shadow meshes and no visible meshes in overlap areas. Therefore, one mesh should remain visible. For this purpose, we project vertices only onto meshes with lower indices: a vertex in mesh $i$ will only be projected to mesh $j$ if $i > j$.
Fig. 6 Forming triangles adaptively between vertices. Each number
indicates the triangle formation type. Type 0, which represents an
empty cell, is not shown
Algorithm 3 sums up the erosion algorithm. The ProjectVertexToMeshSurface$(v, j)$ function projects a vertex $v$ onto the surface of mesh $j$. This is possible because initial meshes are stored as 2D arrays in camera image coordinates. As such, the function simply projects the vertex to the corresponding camera image plane. IsPointInsideTriangle$(p, j)$ checks whether coordinate $p$ falls inside any triangle of mesh $j$. Figure 7 shows an example of labeling a mesh into visible and shadow parts. There are visible gaps between the two meshes, but this issue will be rectified in the next meshing stages.
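The labeling loop can be sketched as below. This is an illustrative reconstruction, not the paper's Algorithm 3; the two injected callables stand in for the ProjectVertexToMeshSurface and IsPointInsideTriangle functions named in the text, and the mesh/vertex attributes are assumed.

```python
def label_vertices(meshes, project_to_mesh, point_in_triangle):
    """Mark each vertex of mesh i as shadow if it projects inside a
    triangle of any mesh j with j < i, so exactly one mesh stays
    visible in every overlap area."""
    for i, mesh in enumerate(meshes):
        for v in mesh.vertices:
            v.shadow = False
            for j in range(i):                     # only lower indices
                # Meshes are stored as 2D arrays in camera image
                # coordinates, so projection is a single matrix product.
                p = project_to_mesh(v.position, j)
                if p is not None and point_in_triangle(p, j):
                    v.shadow = True
                    break
```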
5.3 Mesh merging
The task of mesh merging is to find transitions from one mesh to another. For simplicity, consider that meshes A and B are to be merged. If mesh A has a shadow mesh that extends over mesh B, we have a transition from A to B. Such a situation can be seen on the left side of Fig. 8, where A would be the red mesh and B the blue mesh. In terms of notation, visible vertices are marked with $V$ and shadow vertices with $S$; e.g., $v_S^A$ denotes a shadow vertex of mesh A.

We begin merging by going through all shadow vertices $v_S^A$. If we find a vertex that is joined by an edge to a visible vertex $v_V^A$, then this edge covers a transition area between the two meshes. Such edges are depicted as dashed lines on the left side of Fig. 8.
Having located the correct shadow vertices $v_S^A$, our next task is to merge them with the $v_V^B$ vertices so that the two meshes are connected. The end result is illustrated on the right side of Fig. 8. A more primitive approach of locating the $v_V^B$ nearest to each $v_S^A$ would not work well, since the closest vertices $v_V^B$ are not necessarily on the mesh boundary. Instead, we trace an edge from $v_V^A$ to $v_S^A$ until we hit the first mesh-B triangle. The triangle vertex $v_V^B$ closest to $v_V^A$ is then selected as the match. Since meshes are stored as two-dimensional arrays, we can use a simple line drawing algorithm, such as a digital differential analyzer, to trace from $v_V^A$ to $v_S^A$.
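A minimal sketch of this DDA trace follows, assuming the segment endpoints have already been projected into mesh B's image coordinates and that an occupancy map of B's triangulated cells is available (both assumptions are ours).

```python
def trace_to_first_triangle(uv_start, uv_end, occupied):
    """Walk the image-space segment from the projection of v_V^A
    (uv_start) to that of v_S^A (uv_end) with a digital differential
    analyzer; stop at the first pixel whose cell in mesh B holds a
    triangle. occupied is an (H, W) boolean map of mesh B's cells."""
    x0, y0 = uv_start
    x1, y1 = uv_end
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    dx, dy = (x1 - x0) / steps, (y1 - y0) / steps
    x, y = float(x0), float(y0)
    for _ in range(int(steps) + 1):
        if occupied[int(round(y)), int(round(x))]:
            return int(round(x)), int(round(y))    # first mesh-B cell hit
        x += dx
        y += dy
    return None                                    # no transition found
```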
After the mesh merging procedure, we have edges connecting the two meshes. However, triangles have yet to be generated between them. This is addressed in the next and final mesh generation section.
5.4 Final mesh generation
The last part of our meshing method collects all data from the previous stages and outputs a single, properly connected mesh. Handling triangles made out of visible vertices is trivial, since they can simply be copied to the output. However, transitions from one mesh to another require an extra processing step.
For simplicity's sake, we will once again examine two meshes, A and B, using the notation introduced in Sect. 5.3. The triangles in transition areas consist of one or two shadow vertices $v_S^A$, with the rest being visible vertices $v_V^A$. Triangles with just one shadow vertex can be copied to the final mesh without modification. Triangles with two shadow vertices, however, are a special case. The problem lies in connecting the two consecutive $v_S^A$ vertices with an edge. This situation is illustrated on the left side of Fig. 9: the top shadow edge of the red mesh A does not coincide with mesh B's edges. Therefore, we create a polygon that traces through B's mesh vertices $v_V^B$. To reiterate, the vertices of the polygon are the starting point $v_V^A$, the first shadow vertex $v_S^A$, mesh B's vertices $v_V^B$, the second shadow vertex $v_S^A$, and the starting point $v_V^A$ again. This polygon is broken up into triangles, as illustrated on the right side of Fig. 9. Note that the polygon is not necessarily convex, but in practice, nonconvex polygons tend to be rare and may be ignored for performance gains if the application permits small meshing errors.

Fig. 7 Illustration of mesh erosion. Initially, two meshes (one red and one blue) overlap. After erosion, the red mesh in the overlap area becomes a shadow mesh, denoted by dashed lines.
Fig. 8 Illustration of mesh merging. Shadow mesh points are projected onto the other mesh (left) and then traced to the closest triangles for merging (right).
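For a convex transition polygon, breaking it into triangles reduces to a simple fan, as in the sketch below; this is one straightforward way to realize the split described above, under the stated convexity assumption.

```python
def fan_triangulate(polygon):
    """Split the closed transition polygon
    [v_V^A, v_S^A, v_B vertices..., v_S^A] into triangles by fanning
    from the first vertex. Valid for convex polygons; as noted above,
    nonconvex transition polygons are rare and may be tolerated."""
    return [(polygon[0], polygon[i], polygon[i + 1])
            for i in range(1, len(polygon) - 1)]
```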
This concludes the stages of mesh generation. The
results can be used in rendering or for any other
application.
6 Implementation and results
Our system has been implemented on a platform with GPUs running OpenGL 4.5. All experiments were carried out on a consumer PC with an Intel Core i7-5930K 3.5 GHz processor, 64 GB of RAM and an Nvidia GeForce GTX 780 graphics card. We used OpenGL compute shaders for executing code. As all of our point cloud and mesh data are organized as two-dimensional arrays, we utilized OpenGL textures for storage.

Table 1 gives an overview of the time spent on the different system processes. The measurements were taken with OpenGL query timers to obtain precise GPU timing information. The experiment used two RGB-D cameras, both producing 640 × 480 resolution depth maps, resulting in up to 614k points per frame. We ran the test with the parameters given in Table 2.
An overwhelming majority of the processing time is spent on surface reconstruction. This is due to fetching a large number of points and normals from GPU memory. Nevertheless, as the data are retrieved in $s \times s$ square blocks from textures, the GPU cache is well utilized. We also implemented surface reconstruction on the CPU for comparison. The execution was parallelized across 6 processor cores using OpenMP. The average runtime was 1.6 s per frame on the test dataset. This means that using a GPU gives us roughly a 10× performance benefit over a CPU implementation.