Computers & Graphics 66 (2017) 103–112
Special Issue on SMI 2017
A multi-view recurrent neural network for 3D mesh segmentation
Truc Le, Giang Bui, Ye Duan∗
Department of Computer Science, University of Missouri, Columbia, MO 65211
Article info
Article history:
Received 30 March 2017
Revised 17 May 2017
Accepted 25 May 2017
Available online 13 June 2017
Keywords:
Mesh segmentation
Multi-view
3D deep learning
CNN
RNN
LSTM
Abstract
This paper introduces a multi-view recurrent neural network (MV-RNN) approach for 3D mesh segmentation. Our architecture combines convolutional neural networks (CNNs) and a two-layer long short-term memory (LSTM) to yield coherent segmentation of 3D shapes. The image-based CNN effectively generates per-view edge probability maps, while the LSTM correlates these edge maps across different views and outputs a well-defined per-view edge image. Evaluations on the Princeton Segmentation Benchmark dataset show that our framework significantly outperforms other state-of-the-art methods.
Fig. 2. Overview of our MV-RNN approach. Given an input 3D mesh model, we render it from a sequence of ordered viewpoints. Each view is passed through an identical (shared weights) CNN to obtain a boundary probability map, which is correlated by a two-layer LSTM followed by a fully connected layer. The consistent edge images from multiple views are unprojected back to 3D, followed by region growing and a CRF for boundary smoothing.
Xie et al. [12] used multi-view depth images via an extreme learning machine to generate per-view segmentations and combine them via Graph-Cut. This method is fast, thanks to the easy training of the extreme learning machine, but it does not give high accuracy. Later, Kalogerakis et al. [47] proposed a more complete multi-view framework. They first render the 3D model from different views, each of which is processed through a shared CNN before being unprojected to 3D. Label consistency is solved by a conditional random field (CRF), which is part of the network and is optimized in an end-to-end manner. Although this approach uses the CRF to enforce consistency after unprojection to 3D, the semantic label images from multiple views are obtained in a max-pooling manner and are still not correlated.

Recently, Su et al. proposed PointNet [48] and SyncSpecCNN [49], which directly consume non-regular 3D data (point clouds and meshes, respectively). These networks demonstrate the flexibility of neural networks in many visual problems. However, in terms of performance, these structures still fall behind MV-CNN approaches (if equipped with a large enough number of views) [48].
3. Background on recurrent neural networks
In contrast to a normal feed-forward neural network, which is a one-shot function, a recurrent neural network (RNN) runs repeatedly through time, which simulates human brain processing capability. An RNN is a composition of identical feed-forward neural networks, one for each moment, or step in time, which we will refer to as RNN cells. These cells operate on their own output, allowing them to be composed. They can also operate on external input and produce external output. Note that this is a much broader definition of an RNN depending on the choice of RNN cells (e.g., Vanilla RNN, LSTM, etc.). Here is the algebraic description of a Vanilla RNN cell:
s_t = φ(W x_t + U s_{t−1} + b)    (1)

where φ is the activation function (e.g., sigmoid, tanh, ReLU [31,50]). Assuming d and h are the state input and output sizes, respectively, s_t ∈ R^h is the current state (and current output); s_{t−1} ∈ R^h is the prior state; x_t ∈ R^d is the current input; and W ∈ R^{h×d}, U ∈ R^{h×h} and b ∈ R^h are weights and biases.
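As a concrete illustration, Eq. (1) can be sketched in a few lines of NumPy. The function name `vanilla_rnn_step` and the toy dimensions below are our own choices for illustration, not values from the paper.

```python
import numpy as np

def vanilla_rnn_step(x_t, s_prev, W, U, b, phi=np.tanh):
    """One step of Eq. (1): s_t = phi(W x_t + U s_{t-1} + b)."""
    return phi(W @ x_t + U @ s_prev + b)

# Toy dimensions: input size d = 3, state size h = 2.
rng = np.random.default_rng(0)
d, h = 3, 2
W, U, b = rng.normal(size=(h, d)), rng.normal(size=(h, h)), np.zeros(h)

# The same cell is applied repeatedly over a short input sequence.
s = np.zeros(h)
for x in rng.normal(size=(5, d)):
    s = vanilla_rnn_step(x, s, W, U, b)
print(s.shape)  # (2,)
```

Because the cell consumes its own previous state, information from earlier inputs can influence later outputs, which is exactly the property the paper exploits for view sequences.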
Although simple and quite powerful, the Vanilla RNN has certain disadvantages. First, it is very difficult to exploit past information if information constantly morphs, which leads to the degeneration problem [34]. Second, gradient vanishing and exploding are common when training a Vanilla RNN, because we train it by the back-propagation-through-time algorithm. If the gradients explode, we cannot train our model. If they vanish, it is difficult for us to learn
Fig. 3. Multiple views ordered in a helix-like sequence.
Fig. 4. LSTM learning process (only four views are shown due to space limits). First row: input shaded images to a CNN. Second row: outputs from the CNN. Third to tenth rows: edges returned from the LSTM during training. Last row: ground-truth edges.
long-term dependencies, since back-propagation will be too sensi-
tive to recent distractions.
To tackle the drawbacks of the Vanilla RNN, the long short-term memory (LSTM) unit [51] was introduced to ensure the integrity of information thanks to its written memories. Furthermore, LSTMs use gates as a mechanism to selectively control and coordinate writing (i.e., the cell memory is written, read and forgotten selectively).
Unlike the Vanilla RNN, an LSTM network is well-suited to learning from experience to classify, process and predict time series when there are time lags of unknown size and duration between important events. Consequently, LSTMs have achieved the best known results in natural language text compression and unsegmented connected handwriting recognition. Recently, researchers have been integrating LSTMs into computer vision tasks such as image segmentation [52], activity recognition, image captioning, video description [46] and 3D object reconstruction [14].
4. Multi-view recurrent neural network (MV-RNN)
Given an input 3D shape, our goal is to segment it into parts based on the prior knowledge learned from a pre-segmented training dataset. We design an MV-RNN network to this end. Our network architecture is visualized in Fig. 2. It takes as input a set of images from multiple views which are equally distributed over the 3D model; segments these images by generating per-view boundary probability maps; correlates them by a two-layer LSTM followed by a fully connected layer; and returns the consistent edges, which are back projected to the 3D surface and finally integrated by a CRF. In the following sections, we elaborate on the input to our network, its layers and the training procedure.
4.1. Input
The input to our whole algorithm is a 3D shape represented
as a polygonal mesh. As a pre-processing step, we normalize and
scale it to fit into the unit sphere. Then our algorithm renders the
object in K different views (we set K = 60 based on our experiments). We first equally partition the unit sphere into K regions
using [53]. These regions serve as camera locations. More importantly, to make these views learnable for the LSTM, we arrange these locations in a sequence so that adjacent locations are next to each other, as in Fig. 3. To make all views oriented consistently, we choose the camera up vector pointing to a very far away fixed point (e.g., [0, 0, 100]). The camera always looks at the origin since the model is normalized.
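The ordered, helix-like arrangement of viewpoints (Fig. 3) can be sketched as follows. Note that the paper partitions the sphere with the equal-area scheme of [53]; the spherical spiral below is only an illustrative approximation, and the function name `helix_viewpoints` and its parameters are our own.

```python
import numpy as np

def helix_viewpoints(K=60, radius=2.5, turns=8):
    """K ordered camera positions on a spherical helix: latitude sweeps
    pole to pole while longitude advances slowly, so consecutive
    viewpoints are spatially adjacent (helix-like ordering).
    Approximates the equal-area partition used in the paper."""
    i = np.arange(K)
    z = 1.0 - 2.0 * (i + 0.5) / K            # latitude, north to south
    theta = 2.0 * np.pi * turns * i / K      # slowly advancing longitude
    r = np.sqrt(1.0 - z * z)
    pts = np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=1)
    return radius * pts                      # each camera looks at the origin

cams = helix_viewpoints()
up_point = np.array([0.0, 0.0, 100.0])       # fixed far-away "up" point
print(cams.shape)  # (60, 3)
```

Ordering the cameras along one continuous path is what lets the LSTM later treat the views as a smoothly varying sequence rather than an unordered set.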
In general, a CNN is quite robust to lighting illumination, so we render shaded, grayscale images using the Phong reflection model [54], with the light source always behind the camera for every view. We also experimented with depth images (with HHA encoding [55]), normal images and/or combinations of them, but the result is not better than using the shaded images alone. To make the training faster, we opt for an image resolution of 128 × 128, without sacrificing the overall segmentation accuracy of the framework.
In addition, for each camera setting, we store the 3D vertex corresponding to each pixel. The correspondence is determined by the proximity of the 3D point unprojected from the 2D pixel to the closest 3D vertex (the distance between them must be less than 10^{-3}; otherwise there is no corresponding 3D vertex for that pixel). The stored information is used for the back projection later on.
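The stored pixel-to-vertex correspondence can be sketched as a nearest-vertex lookup with the 10^{-3} tolerance mentioned above. The function name `pixel_to_vertex` is our own, and the brute-force distance computation is only for illustration (a spatial index would be used at scale).

```python
import numpy as np

def pixel_to_vertex(unprojected, vertices, tol=1e-3):
    """For each unprojected pixel point, return the index of the closest
    mesh vertex, or -1 when the distance exceeds `tol` (no corresponding
    vertex). `unprojected`: (P, 3) array; `vertices`: (V, 3) array."""
    d = np.linalg.norm(unprojected[:, None, :] - vertices[None, :, :], axis=2)
    nearest = d.argmin(axis=1)               # closest vertex per pixel
    nearest[d.min(axis=1) > tol] = -1        # too far: no correspondence
    return nearest

verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pix = np.array([[0.0, 0.0, 1e-4],            # within tolerance of vertex 0
                [0.5, 0.5, 0.5]])            # far from every vertex
print(pixel_to_vertex(pix, verts).tolist())  # [0, -1]
```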
4.2. CNN module
The shaded images produced in the previous step are processed through identical image-based CNNs. There are many choices of CNN architecture, such as FCN [36], DeconvNet [37] and HED [40]. We opt for HED because of its edge detection nature. Each HED module outputs a grayscale image of the same size as the input shaded image (i.e., 128 × 128), which is the boundary probability map. Specifically, in our implementation, we employ the HED architecture suggested in [40], which adapted the VGG-16 network [32] for dense prediction by truncating it after the pool5 layer and fusing multiple side outputs. Since HED is trained on RGB color images, we need to replicate our shaded grayscale images into three channels.
Fig. 4 shows the boundary probability maps in multiple views (only four views are shown here). As we can see, the probability maps are neither well-localized nor consistent. The inconsistency problem cannot be solved by optimizing each individual view alone, but by aggregating them together in a more intelligent way. Recurrent neural networks (RNNs) are a type of neural network with loop connections [56], which allow them to capture long-range dependency via gates and memory structures (such as LSTM [51]). Consequently, multiple views can be cast as a time series which can be learned by such an LSTM.
4.3. LSTM module
As mentioned in Section 1, the goal of this layer is to correlate multiple views and generate consistent boundary maps. An LSTM network is well-suited here, as it treats the view sequence as a time series. First, we unroll the 2D boundary probability maps and ground-truth boundary maps into vectors of size 128 × 128 = 16384. A two-layer LSTM (with one LSTM stacked over the other) is deployed so that the first LSTM takes the sequence of ordered (unrolled) boundary probability maps and produces a sequence of hidden states for the second LSTM, which eventually outputs the sequence of coherent boundary maps. We use the same number of hidden units (1024) for both peephole LSTMs [57], with the following updates:
i_t = sigmoid(W_i x_t + U_i c_{t−1} + b_i)    (2)

f_t = sigmoid(W_f x_t + U_f c_{t−1} + b_f)    (3)

o_t = sigmoid(W_o x_t + U_o c_{t−1} + b_o)    (4)

c_t = f_t ◦ c_{t−1} + i_t ◦ tanh(W_c x_t + b_c)    (5)

h_t = o_t ◦ c_t    (6)

where x_t ∈ R^d is the current input; h_t ∈ R^h is the current output; c_t (and c_{t−1}) ∈ R^h are the current (and prior) memory state; W_i, W_f, W_o, W_c ∈ R^{h×d}, U_i, U_f, U_o, U_c ∈ R^{h×h} and b_i, b_f, b_o, b_c ∈ R^h are weights and biases; and '◦' denotes element-wise multiplication. In our case, d = 128 × 128 = 16384 and h = 1024. The output of the second LSTM is passed through a fully connected layer to map back to a d-dimensional edge image.
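Eqs. (2)–(6) can be sketched at toy sizes as follows. Note that, as written in the paper, the gates read the prior cell state c_{t−1} (peephole-style); the function name `lstm_step`, the parameter dictionary and the toy dimensions are our own illustration, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, c_prev, P):
    """One update of Eqs. (2)-(6); P holds the weight matrices and biases."""
    i = sigmoid(P["Wi"] @ x_t + P["Ui"] @ c_prev + P["bi"])  # input gate  (2)
    f = sigmoid(P["Wf"] @ x_t + P["Uf"] @ c_prev + P["bf"])  # forget gate (3)
    o = sigmoid(P["Wo"] @ x_t + P["Uo"] @ c_prev + P["bo"])  # output gate (4)
    c = f * c_prev + i * np.tanh(P["Wc"] @ x_t + P["bc"])    # memory      (5)
    h = o * c                                                # output      (6)
    return h, c

# Toy sizes; the paper uses d = 16384 and h = 1024.
rng = np.random.default_rng(1)
d, h = 4, 3
P = {k: rng.normal(scale=0.1, size=(h, d if k[0] == "W" else h))
     for k in ("Wi", "Ui", "Wf", "Uf", "Wo", "Uo")}
P["Wc"] = rng.normal(scale=0.1, size=(h, d))
for k in ("bi", "bf", "bo", "bc"):
    P[k] = np.zeros(h)

c = np.zeros(h)
for x in rng.normal(size=(6, d)):           # a short "view sequence"
    out, c = lstm_step(x, c, P)
print(out.shape)  # (3,)
```

Because the cell state c_t persists across steps, information from early views can correct boundary evidence in later views, which is the behaviour illustrated in Fig. 4.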
Fig. 4 illustrates how the LSTM can help correct and correlate the edge probability maps produced by the MV-CNN. For example, the boundaries between the torso and the two legs are quite different among the four views, which may result in inconsistent edge information when unprojecting them to the 3D mesh. However, as the LSTM consumes the whole view sequence, the edges at convergence are all consistent.
4.4. Training
We train our network in a two-stage approach. In the first stage, we train the HED module. We randomly rotate each 3D model in 16 different ways. The network takes as input a pair of images: a shaded image and a ground-truth boundary map. We use the sigmoid cross-entropy loss for all five side outputs and the fused output. The network is initialized from VGG-16 weights [32]. We use the Adam optimizer [58] with a fixed learning rate of 10^{-7} and a batch size of 16, and train for 100,000 iterations. The first-stage training takes three days on an NVIDIA Titan X.
After the HED module is trained, it is fixed for training the LSTM module in the second stage. The two-layer LSTM takes as input a pair of sequences: boundary probability maps from the HED and ground-truth boundary maps. We also use the Adam optimizer [58] with an initial learning rate of 0.01 (as this optimization algorithm is able to compute adaptive learning rates for each parameter) and a batch size of 1 (due to memory limits), and train for 7000 epochs. Each view sequence is processed bidirectionally, which yields two sequences per shape. The second-stage training takes three days on an NVIDIA Titan X.
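The sigmoid cross-entropy loss used for the edge maps can be sketched as follows. The numerically stable formulation below is a standard implementation detail that we assume; the paper does not spell out its implementation.

```python
import numpy as np

def sigmoid_cross_entropy(logits, targets):
    """Mean per-pixel sigmoid cross-entropy between raw edge logits and
    binary ground-truth edge pixels, in the numerically stable form
    max(z, 0) - z*y + log(1 + exp(-|z|))."""
    z, y = logits, targets
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

logits = np.array([[2.0, -1.5], [0.0, 3.0]])   # raw network outputs
targets = np.array([[1.0, 0.0], [0.0, 1.0]])   # ground-truth edge pixels
val = float(sigmoid_cross_entropy(logits, targets))
print(round(val, 4))  # 0.2675
```

Confident predictions matching the target (e.g., logit 3.0 on an edge pixel) contribute almost nothing, while uncertain ones (logit 0.0) contribute about log 2 each.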
4.5. Back projection to 3D and post-processing
The consistent boundary maps produced by the LSTM network are back projected to the 3D surface using the stored pixel-to-vertex information (see Section 4.1). It is possible that many pixels (typically from different views) map to the same vertex, so we take the maximum response as the final value. For each edge of the mesh model, we assign the boundary probability defined as the average of the boundary probabilities of the two vertices that it connects. Finally, a binary boundary edge map is created by thresholding (we set the threshold to 0.5). These boundary edges function as the borders of the regions to be segmented. Thus, we use a simple region growing to find the initial segmentation, with the boundary edges as blockers. A region with a big enough area is considered a segment. The polygons near the boundaries may be unlabeled due to projection error. Denote h_v as the initial label
Fig. 5. Representative segmentation results produced by our MV-RNN on PSB dataset.
Fig. 6. Performance plots of different segmentation algorithms with respect to four evaluation metrics. Lower value is better.
Fig. 7. Comparison of segmentation algorithms.
for polygon v, where h_v = 0 if v has no label. We expect that correct labels will be propagated to them via a CRF. Let V be the set of all polygons in a 3D shape; a CRF f with unary and pairwise terms operating on the surface representation is defined as follows:
E(f) = Σ_{v ∈ V} E_unary(f_v) + Σ_{(u,v) ∈ V²} E_pairwise(f_u, f_v)    (7a)

E_unary(f_v = l) = { 0 for all l, if h_v = 0;  0, if h_v = l;  ∞, otherwise }    (7b)

E_pairwise(f_u = l_u, f_v = l_v) = { e^{−d²(u,v)}, if l_u ≠ l_v;  e^{−(1−d(u,v))²}, if l_u = l_v }    (7c)
where d(u, v) is the geodesic distance [59,60] between polygon u and polygon v. All distances are normalized to [0, 1].

The unary term states that we only want to correct unlabeled polygons, while the pairwise terms favor the same label for adjacent polygons. We use mean-field approximation [61] to solve (7a).
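The behaviour of the energy terms in Eqs. (7b) and (7c) can be sketched directly. The function names `unary` and `pairwise` are our own; labels are integers with 0 meaning "unlabeled", as in the text.

```python
import numpy as np

def unary(label, h_v):
    """Eq. (7b): any label is free for unlabeled polygons (h_v = 0);
    otherwise the initial label is enforced with an infinite penalty."""
    if h_v == 0:
        return 0.0
    return 0.0 if label == h_v else np.inf

def pairwise(l_u, l_v, dist):
    """Eq. (7c): energy of a polygon pair; `dist` is the normalized
    geodesic distance d(u, v) in [0, 1]."""
    if l_u != l_v:
        return np.exp(-dist ** 2)
    return np.exp(-(1.0 - dist) ** 2)

# For nearby polygons (dist -> 0), differing labels cost much more
# energy than agreeing labels, so adjacent polygons prefer one label.
print(pairwise(1, 2, 0.05) > pairwise(1, 1, 0.05))  # True
```

Since the CRF minimizes total energy, this is exactly the smoothing behaviour described above: unlabeled boundary polygons inherit the label of their nearby, already-labeled neighbors.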
5. Evaluation
In this section, we present experimental validations and analyses of our approach. We test the segmentation algorithm on the well-known Princeton Segmentation Benchmark dataset [1]. This dataset has been used intensively to evaluate 3D shape segmentation and 3D shape retrieval algorithms. The dataset has 19 different object categories with 20 objects in each category, which results in 380 models in total. For each category, we randomly select 16 models for training and 4 models for testing. Since there are multiple human-generated segmentations for each model, we manually select the one segmentation that is the most consistent within the object category. The ground-truth edge images can easily be obtained by rendering the edges between different segments, overlaid by the 3D shape with the same color as the background. To further enhance the quality of the ground-truth images, we use polygon offset in OpenGL. The ground-truth edge images are used in training both the MV-CNN and the LSTM. Fig. 5 shows some representative segmentations of our MV-RNN approach on this dataset.
To evaluate our segmentation method, we adopt the four metrics defined by Chen et al. [1]: Rand Index, Cut Discrepancy, Hamming Distance and Consistency Error. Rand Index, named after William M. Rand, measures the similarity between two segmentations of the same shape. From a mathematical point of view, the Rand Index is related to accuracy, but is applicable even when class labels are not used. In this paper, we use the Rand Index Error, which equals one minus the Rand Index. Cut Discrepancy is a boundary-based method evaluating the distance between different cuts. It sums the distances from points along the cuts in the computed segmentation to the closest cuts in the ground-truth segmentation, and vice versa. Hamming Distance, named after Richard Hamming, is a region-based method and measures the number of substitutions required to change one region into the other. Hamming Distance is directional, hence it includes missing rate (Rm) and false alarm (Rf) distances. Consistency Errors, whether the global version (GCE) or local version (LCE), are used to
Fig. 8. More comparisons of segmentation algorithms.
compute the hierarchical differences and similarities between segmentations, which are based on the theory that human perceptual organization imposes a hierarchical tree structure on objects. For all four metrics, a smaller value indicates a better result.
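As a toy illustration of the first metric, the Rand Index Error ("one minus the Rand Index") can be computed by brute force over all face pairs; the function name `rand_index_error` is our own, and real benchmark implementations use closed-form counts rather than explicit pair enumeration.

```python
from itertools import combinations

def rand_index_error(seg_a, seg_b):
    """1 - Rand Index: the fraction of face pairs on which two
    segmentations of the same shape disagree about 'same segment
    vs. different segment'."""
    pairs = list(combinations(range(len(seg_a)), 2))
    agree = sum((seg_a[i] == seg_a[j]) == (seg_b[i] == seg_b[j])
                for i, j in pairs)
    return 1.0 - agree / len(pairs)

a = [0, 0, 1, 1]
b = [0, 0, 1, 2]                      # one face split off into its own segment
print(rand_index_error(a, a))         # 0.0
print(round(rand_index_error(a, b), 4))  # 0.1667 (1 of 6 pairs disagrees)
```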
Comparison : We compare our method with the following seg-
mentation algorithms:
• MV-CNN: we apply non-maximum suppression [15] on the boundary probability maps returned from the multi-view CNN (HED in this case) and unproject them back to 3D (without the LSTM), followed by the CRF. This serves as a baseline for the multi-view paradigm.
• [Shu2016] [41]: unsupervised 3D shape segmentation via stacked auto-encoders.
• WcSeg [62]: approximate convexity analysis.
• RandCuts [22]: randomized cuts.
• ShapeDiam [27]: shape diameter function.
• NormCuts [22]: normalized cuts.
• CoreExtra [63]: core extraction.
• RandWalks [23]: random walks.
• FitPrim [18]: fitting primitives.
• KMeans [64]: k-means.
Figs. 7 and 8 provide a side-by-side comparison of the segmentations obtained from the various algorithms. Although there are large shape variations, the vast majority of our segmentation results are desirable and consistent with our perception. The baseline MV-CNN indeed yields better segmentations than some of the methods based on hand-crafted features, such as k-means, fitting primitives and random walks. Due to the inconsistency of the boundary probability maps across multiple views, the MV-CNN is still not as good as the shape diameter function. However, the added LSTM makes a significant contribution to the overall robustness, which vastly improves the multi-view paradigm.
Numerical comparison: The Rand Index score statistics of our segmentation on the dataset, as well as those of the other methods, are detailed in Table 1, from which we can see that our algorithm obtains an average Rand Index of 0.084, outperforming the related algorithms. In addition to the Rand Index, our MV-RNN also stands out from the other methods with respect to the other evaluation metrics (see Fig. 6 and Table 2). Compared with the baseline MV-CNN, the LSTM in our framework indeed brings a significant improvement because it correlates the outputs from the CNN across multiple views.
Different number of views: We also experiment with various values of K. According to Fig. 9, using too few views is not good due to occlusion. As more views equally distributed around the object are used, the object's surface area is more fully covered, hence we get higher accuracy (i.e., a lower Rand Index score). We choose K = 60 as a reasonable trade-off between accuracy and time/memory consumption.
Table 2
Average cut discrepancy, hamming distance, consistency error scores of segmentation for each category with different methods. Smaller is better.
Fig. 9. The Rand Index with respect to the number of views. We choose K = 60 as
a reasonable trade-off between accuracy and time/memory usage.
Fig. 10. Limitation of our approach. The area under the torso is occluded and hence
the left and right thighs are not separated although our MV-RNN can detect 2D
edges correctly in all views.
5.1. Limitation
Because our approach belongs to the multi-view paradigm, it has a common occlusion issue. For example, the left and right thighs of the man in Fig. 10 are not separated due to occlusion (i.e., the area under the torso is not revealed from any of the K = 60 views). Increasing the number of views could reduce the occlusions at the cost of more computation. Since we can easily compute occluded areas given the current set of views, we plan to use adaptive best-view prediction to focus the camera on these areas, similar to the next-best-view prediction in the 3D attention model proposed by Xu et al. [65].
6. Conclusion
We have presented our novel MV-RNN for 3D shape segmentation, which combines the MV-CNN and LSTM to enhance the multi-view paradigm. To the best of our knowledge, we are the first group to treat multiple views as a temporal sequence and apply an RNN to predict the edge images by aggregating the corresponding edge probability maps obtained by feed-forwarding an MV-CNN. Our MV-RNN detects 3D edges in an end-to-end manner, and the segmentation is obtained by post-processing. The 3D edges can be either semantic-based (e.g., semantic segmentation) or geometric-based (e.g., CAD model segmentation, suggestive contours, ridges and valleys). According to our experimental results on the Princeton Segmentation Benchmark dataset, our MV-RNN compares favorably with other state-of-the-art methods on mesh segmentation.
In the future, we would like to conduct more experiments on different datasets, such as those in [49,66]. Additionally, our framework currently works on meshes only; we would like to extend it to handle point clouds as well. The proposed framework is purely data-driven, so in the future we would also like to extend our method to other interesting problems in shape modeling, such as suggestive contours [8,9] and ridge-valley detection [10].
Acknowledgment

We would like to acknowledge the authors of the Princeton Segmentation Benchmark [1], who made the dataset public and provided an evaluation toolbox. We also appreciate the authors of HED [40] for their edge detection network. Last but not least, we would like to thank all the authors of the other segmentation algorithms [18,22,23,27,41,62–64] for contributing their segmentation results on the Princeton Segmentation Benchmark dataset.
References

[1] Chen X, Golovinskiy A, Funkhouser T. A benchmark for 3D mesh segmentation. ACM Trans Graph 2009;28(3):73:1–73:12.
[2] Biasotti S, Marini S, Mortara M, Patane G. An overview on properties and efficacy of topological skeletons in shape modeling. In: Shape Modeling International; 2003. p. 245–54.
[3] Katz S, Tal A. Hierarchical mesh decomposition using fuzzy clustering and cuts. ACM Trans Graph 2003;22(3):954–61.
[4] Funkhouser T, Kazhdan M, Shilane P, Min P, Kiefer W, Tal A, et al. Modeling by example. ACM Trans Graph 2004;23(3):652–63.
[5] Zockler M, Stalling D, Hege H-C. Fast and intuitive generation of geometric shape transitions. Vis Comput 2000;16(5):241–53.
[6] Zuckerberger E, Tal A, Shlafman S. Polyhedral surface decomposition with applications. Comput Graph 2002;26(5):733–43.
[7] Levy B, Petitjean S, Ray N, Maillot J. Least squares conformal maps for automatic texture atlas generation. ACM Trans Graph 2002;21(3):362–71.
[8] DeCarlo D, Finkelstein A, Rusinkiewicz S, Santella A. Suggestive contours for conveying shape. ACM Trans Graph 2003;22(3):848–55.
[9] Burns M, Klawe J, Rusinkiewicz S, Finkelstein A, DeCarlo D. Line drawings from volume data. ACM Trans Graph 2005;24(3):512–18.
[10] Ohtake Y, Belyaev A, Seidel H-P. Ridge-valley lines on meshes via implicit surface fitting. ACM Trans Graph 2004;23(3):609–12.
[11] Su H, Maji S, Kalogerakis E, Learned-Miller EG. Multi-view convolutional neural networks for 3D shape recognition. In: IEEE International Conference on Computer Vision; 2015.
[12] Xie Z, Xu K, Shan W, Liu L, Xiong Y, Huang H. Projective feature learning for 3D shapes with multi-view depth images. Comput Graph Forum 2015;34(7):1–11.
[13] Qi CR, Su H, Nießner M, Dai A, Yan M, Guibas L. Volumetric and multi-view CNNs for object classification on 3D data. In: IEEE International Conference on Computer Vision and Pattern Recognition.
[14] Choy CB, Xu D, Gwak J, Chen K, Savarese S. 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: European Conference on Computer Vision; 2016. p. 628–44.
[15] Dollar P, Zitnick CL. Structured forests for fast edge detection. In: IEEE International Conference on Computer Vision; 2013.
[16] Vieira M, Shimada K. Surface mesh segmentation and smooth surface extraction through region growing. Comput Aided Geometr Des 2005;22:771–92.
[17] Jagannathan A, Miller E. Three-dimensional surface mesh segmentation using curvedness-based region growing approach. IEEE Trans Pattern Anal Mach Intell 2007;29(12):2195–204.
[18] Attene M, Falcidieno B, Spagnuolo M. Hierarchical mesh segmentation based on fitting primitives. Vis Comput 2006;22:181–93.
[19] Garland M, Willmott A, Heckbert PS. Hierarchical face clustering on polygonal surfaces. In: Proceedings of the Symposium on Interactive 3D Graphics (I3D '01). New York, NY, USA: ACM; 2001. p. 49–58.
[20] Shi J, Malik J. Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 2000;22(8):888–905.
[21] Yamauchi H, Lee S, Lee Y, Ohtake Y, Belyaev AG, Seidel H-P. Feature sensitive mesh segmentation with mean shift. In: Proceedings of the International Conference on Shape Modeling and Applications. IEEE; 2005. p. 238–45.
[22] Golovinskiy A, Funkhouser T. Randomized cuts for 3D mesh analysis. ACM Trans Graph 2008;27(5):145:1–145:12.
[23] Lai Y-K, Hu S-M, Martin RR, Rosin PL. Fast mesh segmentation using random walks. In: Proceedings of the ACM Symposium on Solid and Physical Modeling (SPM '08). New York, NY, USA: ACM; 2008. p. 183–91.
[24] Benjamin W, Polk AW, Vishwanathan S, Ramani K. Heat walk: robust salient segmentation of non-rigid shapes. Comput Graph Forum 2011;30(7):2097–106.
[25] Xiao D, Lin H, Xian C, Gao S. CAD mesh model segmentation by clustering. Comput Graph 2011;35(3):685–91. Shape Modeling International (SMI) Conference 2011.
[26] Lavoue G, Dupont F, Baskurt A. A new CAD mesh segmentation method based on curvature tensor analysis. Comput Aided Des 2005;37(10):975–87.
[27] Shapira L, Shamir A, Cohen-Or D. Consistent mesh partitioning and skeletonisation using the shape diameter function. Vis Comput 2008;24(4):249–59.
[28] Shamir A. A survey on mesh segmentation techniques. Comput Graph Forum 2008;27(6):1539–56.
[29] Agathos A, Pratikakis I, Perantonis S, Sapidis N, Azariadis P. 3D mesh segmentation methodologies for CAD applications. Comput Aided Des Applic 2007;4(6):827–41.
[30] Theologou P, Pratikakis I, Theoharis T. A comprehensive overview of methodologies and performance evaluation frameworks in 3D mesh segmentation. Comput Vis Image Underst 2015;135:49–82.
[31] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems; 2012. p. 1097–105.
[32] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations.
[33] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 1–9.
[34] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–8.
[35] Farabet C, Couprie C, Najman L, LeCun Y. Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell 2013;35(8):1915–29.
[36] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: IEEE International Conference on Pattern Recognition; 2015.
[37] Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: IEEE International Conference on Computer Vision; 2015.
[38] Sharma A, Tuzel O, Jacobs DW. Deep hierarchical parsing for semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015. p. 530–8.
[39] Hong S, Oh J, Lee H, Han B. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. Computing Research Repository, arXiv.
[40] Xie S, Tu Z. Holistically-nested edge detection. In: IEEE International Conference on Computer Vision; 2015. p. 1395–403.
[41] Shu Z, Qi C, Xin S, Hu C, Wang L, Zhang Y, et al. Unsupervised 3D shape segmentation and co-segmentation via deep learning. Comput Aided Geometr Des 2016;43:39–52. Geometric Modeling and Processing 2016.
[42] Guo K, Zou D, Chen X. 3D mesh labeling via deep convolutional neural networks. ACM Trans Graph 2015;35(1):3:1–3:12.
[43] Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, et al. 3D ShapeNets: a deep representation for volumetric shapes. In: IEEE International Conference on Computer Vision and Pattern Recognition; 2015. p. 1912–20.
[44] Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L. Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition; 2014. p. 1725–32.
[45] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems; 2014. p. 568–76.
[46] Donahue J, Hendricks LA, Rohrbach M, Venugopalan S, Guadarrama S, Saenko K, et al. Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans Pattern Anal Mach Intell 2017;39(4):677–91.
[47] Kalogerakis E, Averkiou M, Maji S, Chaudhuri S. 3D shape segmentation with projective convolutional networks. Computing Research Repository, arXiv 2016. URL: http://arxiv.org/abs/1612.02808.
[48] Qi CR, Su H, Mo K, Guibas LJ. PointNet: deep learning on point sets for 3D classification and segmentation. Computing Research Repository, arXiv 2016. URL: https://arxiv.org/abs/1612.00593.
[49] Yi L, Kim VG, Ceylan D, Shen I-C, Yan M, Su H, et al. A scalable active framework for region annotation in 3D shape collections. ACM Trans Graph 2016;35(6):210:1–210:12.
[50] Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: International Conference on Machine Learning; 2010. p. 807–14.
[51] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
[52] Li Z, Gan Y, Liang X, Yu Y, Cheng H, Lin L. LSTM-CF: unifying context modeling and fusion with LSTMs for RGB-D scene labeling. In: European Conference on Computer Vision; 2016. p. 541–57.
[53] Leopardi P. A partition of the unit sphere into regions of equal area and small diameter. Electron Trans Numer Anal 2006;25.
[54] Phong BT. Illumination for computer generated pictures. Commun ACM 1975;18(6):311–17.
[55] Gupta S, Girshick R, Arbeláez P, Malik J. Learning rich features from RGB-D images for object detection and segmentation. In: European Conference on Computer Vision; 2014. p. 345–60.
[56] Schmidhuber J. A local learning algorithm for dynamic feedforward and recurrent networks. Connect Sci 1989;1:403–12.
[57] Gers FA, Schmidhuber E. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans Neural Netw 2001;12(6):1333–40.
[58] Kingma DP, Ba J. Adam: a method for stochastic optimization. In: International Conference on Learning Representations; 2015.
[59] Hilaga M, Shinagawa Y, Kohmura T, Kunii TL. Topology matching for fully automatic similarity estimation of 3D shapes. ACM Trans Graph 2001:203–12.
[60] Zhang E, Mischaikow K, Turk G. Feature-based surface parameterization and texture mapping. ACM Trans Graph 2005;24(1):1–27.
[61] Krahenbuhl P, Koltun V. Efficient inference in fully connected CRFs with Gaussian edge potentials. Neural Inf Process Syst 2011:109–17.
[62] Kaick OV, Fish N, Kleiman Y, Asafi S, Cohen-Or D. Shape segmentation by approximate convexity analysis. ACM Trans Graph 2014;34(1):4:1–4:11.
[63] Katz S, Leifman G, Tal A. Mesh segmentation using feature point and core extraction. Visual Comput 2005;21(8):649–58.
[64] Shlafman S, Tal A, Katz S. Metamorphosis of polyhedral surfaces using decomposition. Comput Graph Forum 2002;21(3):219–28.
[65] Xu K, Shi Y, Zheng L, Zhang J, Liu M, Huang H, et al. 3D attention-driven depth acquisition for object identification. ACM Trans Graph 2016;35(6):238:1–238:14.
[66] Wang Y, Asafi S, van Kaick O, Zhang H, Cohen-Or D, Chen B. Active co-analysis of a set of shapes. ACM Trans Graph 2012;31(6):165:1–165:10.