DeepSketch: Fast Sketch-Based 3D Shape Retrieval
Cecilia Zhang Weilun Sun
Abstract
Freehand sketches are a simple but powerful tool for communication. They contain rich information to specify shapes, and thus retrieving 3D shapes from 2D sketches has received considerable attention in the fields of computer graphics and computer vision. In this project, we present a system for cross-domain similarity search that enables sketch-based 3D shape retrieval. Instead of using hand-crafted features for searching, we propose our DeepSketch neural network, built on Siamese networks, to learn features that serve as the basis for a subsequent similarity search using K-nearest neighbors (KNN). We further analyze how individual strokes of a sketch image affect retrieval results, and visualize the features learned by our DeepSketch network.
1 Introduction
Retrieving 3D shapes is an important problem with many applications in fields such as interior design and animation, where specific 3D models must be found for a task. Using sketches for 3D shape retrieval is an attractive idea since sketches are easy and fast to input while containing rich semantic and geometric information.
Standard sketch-based shape retrieval involves solving two main challenges. One is to find an optimal view for a single 3D model, and the other is to find a feature space that is representative of sketches and views of the 3D models. Multiple views are rendered for each 3D model and an automatic procedure is used to return the most representative view for the model. However, human-input sketches often vary widely even when they depict the same object. It is therefore hard to define a best view for a 3D model, and sometimes a best view may not even exist.
To bypass the 'best view' challenge, we take inspiration from [8] and use a convolutional neural network (CNN) to learn cross-domain similarities between the sketch space and the 3D model view space. More specifically, our training network is built upon two separate Siamese networks [1]: one for sketches and one for rendered 3D model views. The key idea is to define a cross-domain loss connecting the two Siamese CNNs such that features extracted from the rendered views of 3D shapes lie close to those extracted from the sketches.
We test our algorithm on SHREC'13 [4], a large-scale hand-drawn sketch query dataset for querying a 3D model dataset, and demonstrate its effectiveness for 3D shape model retrieval from 2D sketches. We further analyze how individual strokes affect the retrieval results by plotting an importance map for each sketch. We also wrote a graphical user interface for real-time inference and 3D model retrieval. Figure 1 illustrates our end-to-end system design for sketch-based 3D shape retrieval.
2 Related Work
3D shape retrieval has received interest from the computer vision and graphics communities over the past decade. Initial works [7] focused on retrieving 3D shape models using other 3D shapes or text keywords as input.
Figure 1: System overview. Given an input 2D sketch query from the user, our trained Siamese network outputs a feature vector and finds matching rendered views of 3D shape models.
Recently, more attention has been paid to 3D shape retrieval from 2D sketches. The method of [5] uses initial keyword input and then refines the retrieval using an input sketch of the desired view, and the method of [3] uses image-based retrieval. [2] performs 3D shape model retrieval from 2D sketches using rendered views of the 3D models and a feature transform based on bags of visual words computed from a bank of Gabor filters; we extend this work by exploring faster retrieval algorithms as well as better methods of clustering examples for computing the bag-of-visual-words features. Most recently, convolutional neural networks have been used [8] as feature extractors to perform 3D shape retrieval from 2D sketches.
3 Learning feature representations
3.1 Network Architecture
Siamese CNNs have demonstrated great success in weakly supervised metric learning [1]. The network takes in pairs of data that usually carry binary labels. The loss function is defined over the pairs as

L(d, S) = \frac{1}{2N} \sum_{n=1}^{N} S_n d_n^2 + (1 - S_n) \max(\mathrm{margin} - d_n, 0)^2,

where d_n = \|y_1 - y_2\|_2 is the L2 distance between the two features of the n-th input pair.
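As a concrete illustration, below is a minimal numpy sketch of this contrastive loss; the function and variable names are ours, not from the paper's code.

```python
import numpy as np

def contrastive_loss(y1, y2, S, margin=1.0):
    """Contrastive loss over a batch of feature pairs.

    y1, y2 : (N, D) arrays of feature vectors from the two branches.
    S      : (N,) array of binary labels (1 = similar pair, 0 = dissimilar).
    """
    d = np.linalg.norm(y1 - y2, axis=1)                  # L2 distance per pair
    pos = S * d ** 2                                      # pull similar pairs together
    neg = (1 - S) * np.maximum(margin - d, 0.0) ** 2      # push dissimilar pairs apart, up to the margin
    return np.sum(pos + neg) / (2 * len(S))
```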
We have two domains in our task, the sketch domain and the rendered view domain. We trained two separate Siamese CNNs, one for each domain, and use a contrastive loss that connects the two networks. We use two separate networks to learn features from sketches and views because we want to enforce optimal learned filters for each domain. Replacing the two separate Siamese CNNs with two standard classification networks (e.g., AlexNet) is also possible, but training both sketch and view images with a single AlexNet performs poorly, since the filters for the two domains are not fully shared.
An illustration of our network architecture is shown in Figure 2. On the left, we show an illustration of the sketch and view domains before and after learning. Different geometric shapes correspond to different view information, and different colors correspond to different categories. Note that during training we did not specify any view similarity within a class (e.g., the object could be facing backward or forward), and thus after training, images of different views will still be clustered together as long as they belong to the same class. We will also show results of clustering learned features within a class; it appears that even though we did not include constraints on views, the learned features carry some view information. On the right, we show our network architecture. All three losses, the sketch loss, the view loss, and the cross-domain loss, are contrastive losses. The input image size is 128 by 128. We set the length of each feature vector to 64; weights are shared within the sketch Siamese network and within the view Siamese network, but not between the two.
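The paper trains this setup in Caffe; purely as an illustration, the sketch below restates the three-loss structure in PyTorch. The convolutional layer sizes are assumptions (the paper only specifies 128-by-128 inputs and 64-dimensional features); only the overall structure, two domain-specific branches tied together by sketch, view, and cross-domain contrastive losses, follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BranchCNN(nn.Module):
    """One branch CNN. The sketch and view branches are two separate instances:
    weights are shared within a domain's Siamese pair but not across domains."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(64 * 4 * 4, feat_dim)

    def forward(self, x):                       # x: (N, 1, 128, 128)
        return self.fc(self.features(x).flatten(1))

def contrastive(y1, y2, S, margin=1.0):
    d = F.pairwise_distance(y1, y2)
    return torch.mean(S * d ** 2 + (1 - S) * torch.clamp(margin - d, min=0) ** 2) / 2

def deepsketch_loss(sketch_net, view_net, s1, s2, v1, v2, S):
    """Total loss over a 4-image pair (two sketches, two views) with label S."""
    fs1, fs2 = sketch_net(s1), sketch_net(s2)
    fv1, fv2 = view_net(v1), view_net(v2)
    return (contrastive(fs1, fs2, S)        # sketch loss
            + contrastive(fv1, fv2, S)      # view loss
            + contrastive(fs1, fv1, S))     # cross-domain loss
```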
3.2 Retrieval
The learned feature vectors have length 64 for each sketch and view image. Retrieval is done using K-nearest neighbors (KNN) with Euclidean distance as the metric.
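A minimal sketch of this retrieval step, using scikit-learn's NearestNeighbors as one possible KNN implementation (the paper does not name the library it uses); variable names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def retrieve(view_feats, sketch_feat, k=3):
    """view_feats: (M, 64) features of all rendered views; sketch_feat: (64,) query feature.
    Returns the indices of the k closest rendered views under Euclidean distance."""
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(view_feats)
    _, idx = nn.kneighbors(sketch_feat.reshape(1, -1))
    return idx[0]
```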
3.3 Rendered views
We randomly chose three views for each 3D model. Rather than choosing the most representative view for a 3D model, we offer these three views, reasoning that the chance of all three views being degenerate is low.
Figure 2: Illustration and Architecture of our DeepSketch
Siamese CNN.
Figure 3: (LEFT) Examples of rendered views from 3D models. (RIGHT) Example model and sketch from the SHREC'13 dataset.
All 3D models are rendered from the same three randomly chosen views. Some example renderings are shown in Figure 3.
4 Experiments
4.1 Dataset
The SHREC'13 dataset is a subset of the Princeton Shape Benchmark. It has 1258 3D models and 90 classes. Each class has 80 sketch instances, split into two sets: 50 for training and 30 for testing. No validation set is used. Note that the number of 3D models in each class varies: the largest class has 184 instances, while 23 classes contain no more than 5 3D models. Thus the dataset we are using is not ideal and is certainly biased.
4.2 Generate data Pairs for DeepSketch
To make sure there are enough similar and dissimilar pairs of data, for each sketch we generate 2 positive pairs and 10 negative pairs. Each data pair contains 4 images: 2 sketches and 2 views. A positive data pair contains images all from the same class, while a negative data pair contains images from different classes. Positive data pairs are labeled 1 and negative pairs are labeled 0. We did not do data augmentation in our experiments. In total, we trained on 5445 data pairs and tested on 2700 data pairs. We show some of our positive and negative data pairs in Figure 4.
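Below is a hedged sketch of this pair-sampling scheme. The paper does not specify how classes are mixed within a negative pair beyond "from different classes", so the random draws below are assumptions; the data structures and names are ours.

```python
import random

def make_pairs(sketches, views, n_pos=2, n_neg=10):
    """sketches, views: dicts mapping class name -> list of images.
    Returns (sketch_a, sketch_b, view_a, view_b, label) tuples."""
    classes = list(sketches)
    pairs = []
    for c in classes:
        others = [x for x in classes if x != c]
        for s in sketches[c]:
            # positive pairs: all four images drawn from the same class, label 1
            for _ in range(n_pos):
                pairs.append((s, random.choice(sketches[c]),
                              random.choice(views[c]), random.choice(views[c]), 1))
            # negative pairs: the remaining images drawn from other classes, label 0
            for _ in range(n_neg):
                c2, c3 = random.sample(others, 2)
                pairs.append((s, random.choice(sketches[c2]),
                              random.choice(views[c2]), random.choice(views[c3]), 0))
    return pairs
```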
Figure 4: Examples of input training pairs. (LEFT) Positive example input pair; (RIGHT) negative example input pair.
4.3 Network Settings
4.4 Evaluation
We performed sketch-sketch retrieval, sketch-view retrieval, and view-sketch retrieval. All experiments are evaluated on top-1 and top-3 retrieval accuracy.
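For concreteness, here is a sketch of how such top-k accuracies can be computed, again using scikit-learn's KNN as an assumed implementation; all names are ours, not the paper's.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def topk_accuracy(query_feats, query_labels, gallery_feats, gallery_labels, k=1):
    """Fraction of queries whose k nearest gallery items include at least one
    item of the same class."""
    gallery_labels = np.asarray(gallery_labels)
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(gallery_feats)
    _, idx = nn.kneighbors(query_feats)                     # (num_queries, k) indices
    hits = [query_labels[i] in gallery_labels[idx[i]] for i in range(len(query_labels))]
    return float(np.mean(hits))
```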
5 Results
5.1 Retrieval Accuracy
We evaluated the retrieval accuracy across the sketch and view domains. More specifically, given a sketch or a rendered view, we retrieve the nearest (top 1) or the three nearest (top 3) images from the search domain. Chance accuracy is 1.1% since we have 90 classes in total. Overall retrieval accuracies are reasonably good given the bias and small size of the dataset we use. We observe that view-sketch retrieval outperforms sketch-view retrieval. We believe this is caused by the small variation among the 3D models of a class and the large variation among the sketch images of a class.
        sketch-sketch   sketch-view   view-sketch
Top 1   0.383           0.326         0.564
Top 3   0.523           0.457         0.792
Table 1: Top-1 and top-3 retrieval accuracy within and across the sketch and view domains.
For sketch-view retrieval, we show the top 5 classes with the highest retrieval accuracies in Table 2. For view-sketch retrieval, we list the top 5 classes with the highest retrieval accuracies in Table 3.
        wineglass   wine-bottle   chair   hot_air_balloon   pig
Top 1   0.80        0.77          0.73    0.73              0.73
Top 3   0.83        0.83          0.77    0.77              0.73
Table 2: Top 5 classes by sketch-view retrieval accuracy.
We observe some overlap with the classes obtained from sketch-view retrieval; the learned features in the sketch and view domains are consistent in this respect.
beer-mug   duck   hammer   hand   hot_air_balloon
Table 3: Top 5 classes by view-sketch retrieval accuracy.
Below we show some positive and negative retrieval results for sketch-view and view-sketch retrieval.
Figure 5: Examples of sketch-view retrieval results. (LEFT) Positive retrieval result; (RIGHT) negative retrieval result.
Figure 6: Examples of view-sketch retrieval results. (LEFT) Positive retrieval result; (RIGHT) negative retrieval result.
5.2 Feature Visualization
We visualized the learned feature vectors using T-SNE, in the sketch space and view space separately, as well as in the combined space.
There are fewer clusters in the T-SNE plot for the view space than for the sketch space. One reason could be the data itself, in that the number of models per class is unbalanced for the 3D models.
We think it is interesting to look at a combined space for both sketch and view features. In Figure 9, we run T-SNE on all the sketch and view features together and plot the two domains separately.
Notice that in the combined T-SNE plot, certain classes map to the same position in the sketch and view domains, shown in the red circled regions. However, there is also a large cluster in the sketch domain that fails to map to the view domain, circled in green.
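As a minimal illustration of how such a combined embedding can be produced (using scikit-learn's TSNE; the paper does not specify its t-SNE implementation, and the variable names are ours):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_combined_tsne(sketch_feats, view_feats):
    """sketch_feats: (Ns, 64), view_feats: (Nv, 64) learned features."""
    combined = np.vstack([sketch_feats, view_feats])
    emb = TSNE(n_components=2).fit_transform(combined)   # one joint 2-D embedding
    n = len(sketch_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=5, label="sketch")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=5, label="view")
    plt.legend()
    plt.show()
```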
5.3 Stroke Analysis
Inspired by [6], we analyzed individual strokes in the sketch domain and found that different strokes carry different weight in the retrieval results.
We performed the stroke analysis in two scenarios: sketches that originally retrieve well, and sketches that originally retrieve poorly. For both cases, we removed one or several strokes from the original sketch and ran inference again on the modified sketch to examine its retrieval results.
Figure 7: T-SNE visualization of sketch domain
The stroke analysis results are shown in Figure 10. We plot an importance map for the strokes: green strokes play a positive role in retrieval, blue strokes are neutral, and red strokes are negative, meaning that removing them yields better retrieval results.
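A schematic sketch of this per-stroke analysis is given below. The helpers render, infer, and retrieve are placeholders for the sketch rasterizer, the trained sketch branch, and the KNN search respectively, and the rank-difference score is our simplification of the importance map described above; the paper does not specify its exact scoring rule.

```python
def stroke_importance(sketch_strokes, render, infer, retrieve, target_class):
    """Score each stroke by how retrieval changes when it is removed.
    render(strokes) -> image; infer(image) -> feature; retrieve(feature) -> ranked class list.
    score > 0: retrieval gets worse without the stroke (the stroke helps);
    score < 0: retrieval improves without it (the stroke hurts)."""
    base_rank = retrieve(infer(render(sketch_strokes))).index(target_class)
    scores = []
    for i in range(len(sketch_strokes)):
        reduced = sketch_strokes[:i] + sketch_strokes[i + 1:]
        rank = retrieve(infer(render(reduced))).index(target_class)
        scores.append(rank - base_rank)
    return scores
```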
5.4 Feature clustering
We also ran K-means clustering within the learned sketch feature space. Although we did not specify view information when training our network, we found that some view information is learned, as shown in Figure 11.
6 Discussion
A common question is how well the system performs if we train on sketch and view images together as a classification task. We tried using AlexNet trained on all images for a 90-class classification. However, the accuracy stays at 10% for a long time and training does not converge. Although this is still higher than the chance rate, the result is not satisfying. We think the difference between the sketch feature space and the view feature space (as shown in the combined T-SNE visualization in Figure 9) can partly explain the low accuracy of training all images in a single network.
Figure 8: T-SNE visualization of view domain
Another concern is the bias in the SHREC'13 dataset. It is still small compared to current large-scale image datasets such as ImageNet. The unbalanced number of 3D models per class is also undesirable, especially since there are even some duplicated models. The rendering algorithm we used to generate view images from 3D models also affects our learned feature space.
Nevertheless, the overall retrieval results are reasonably good given the dataset we use. Our DeepSketch network also demonstrates its effectiveness in learning feature representations of both the sketch and view spaces for the shape retrieval task.
7 Conclusion
In this work, we presented a system to retrieve 3D shape models from a dataset given a single 2D sketch query. We demonstrated that with the help of Siamese CNNs, we are able to learn a feature space for both the sketch and view domains and retrieve accurate 3D shape models from hand-drawn 2D sketches.
It would also be interesting to explore methods that combine multiple input query modalities, such as a combination of sketches and keywords.
We have put our code as well as Caffe protobuf files and trained models online at ceciliavision/sketchRetrieval.
Figure 9: T-SNE visualization of combined sketch and view
domains
Figure 10: Individual stroke analysis
Figure 11: K-means clustering in sketch feature space
References
[1] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.
[2] M. Eitz, R. Richter, T. Boubekeur, K. Hildebrand, and M. Alexa. Sketch-based shape retrieval. ACM Trans. Graph., 31(4):31–1, 2012.
[3] T. Funkhouser, P. Min, M. Kazhdan, J. Chen, A. Halderman, D. Dobkin, and D. Jacobs. A search engine for 3D models. ACM Transactions on Graphics (TOG), 22(1):83–105, 2003.
[4] B. Li, Y. Lu, A. Godil, T. Schreck, M. Aono, H. Johan, J. M. Saavedra, and S. Tashiro. SHREC'13 track: Large scale sketch-based 3D shape retrieval. In Proceedings of the Sixth Eurographics Workshop on 3D Object Retrieval, pages 89–96. Eurographics Association, 2013.
[5] J. Loffler. Content-based retrieval of 3D models in distributed web databases by visual shape information. In Information Visualization, 2000. Proceedings. IEEE International Conference on, pages 82–87. IEEE, 2000.
[6] R. G. Schneider and T. Tuytelaars. Sketch classification and classification-driven analysis using Fisher vectors.
[7] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The Princeton Shape Benchmark. In Shape Modeling Applications, 2004. Proceedings, pages 167–178. IEEE, 2004.
[8] F. Wang, L. Kang, and Y. Li. Sketch-based 3D shape retrieval using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1875–1883, 2015.