Medical Image Analysis 69 (2021) 101949

Contents lists available at ScienceDirect

Medical Image Analysis

journal homepage: www.elsevier.com/locate/media

TSegNet: An efficient and accurate tooth segmentation network on 3D dental model

Zhiming Cui a,d, Changjian Li b,a, Nenglun Chen a, Guodong Wei a, Runnan Chen a, Yuanfeng Zhou c, Dinggang Shen d,e,f,∗, Wenping Wang a,∗

a Department of Computer Science, The University of Hong Kong, Hong Kong, China
b Department of Computer Science, University College London, London, UK
c Department of Software Engineering, Shandong University, Jinan, China
d School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
e Shanghai United Imaging Intelligence Co., Ltd., Shanghai, China
f Department of Artificial Intelligence, Korea University, Seoul 02841, Republic of Korea

Article info

Article history:
Received 29 June 2020
Revised 6 November 2020
Accepted 12 December 2020
Available online 19 December 2020

Keywords:
Dental model segmentation
Tooth centroid prediction
Confidence-aware cascade segmentation
3D point cloud

Abstract

Automatic and accurate segmentation of dental models is a fundamental task in computer-aided dentistry. Previous methods can achieve satisfactory segmentation results on normal dental models; however, they fail to robustly handle challenging clinical cases such as dental models with missing, crowding, or misaligned teeth before orthodontic treatments. In this paper, we propose a novel end-to-end learning-based method, called TSegNet, for robust and efficient tooth segmentation on 3D scanned point cloud data of dental models. Our algorithm detects all the teeth using a distance-aware tooth centroid voting scheme in the first stage, which ensures the accurate localization of tooth objects even with irregular positions on abnormal dental models. Then, a confidence-aware cascade segmentation module in the second stage is designed to segment each individual tooth and resolve ambiguities caused by the aforementioned challenging cases. We evaluated our method on a large-scale real-world dataset consisting of dental models scanned before or after orthodontic treatments. Extensive evaluations, ablation studies and comparisons demonstrate that our method can generate accurate tooth labels robustly in various challenging cases and significantly outperforms state-of-the-art approaches by 6.5% Dice coefficient and 3.0% F1 score in terms of accuracy, while achieving a 20 times speedup in computational time.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Computer-aided design (CAD) has been widely used in orthodontics for diagnosis, dental restoration and treatment planning. CAD systems in dentistry require dental models as input to assist dentists to delete, extract or rearrange the teeth for treatment procedures. In this regard, segmenting 3D tooth models in different image modalities, such as CBCT images (Cui et al., 2019; Lechuga and Weidlich, 2016) and dental models (Hajeer et al., 2004a; 2004b; Lian et al., 2019; Zanjani et al., 2019), is of great importance. As dental model scanners are free of X-ray radiation, they are widely used to acquire high-precision dental models of crown shapes in surface representation, compared with CBCT using a 3D volumetric representation. Because it is laborious to manually label teeth on a dental model, the development of automatic and accurate 3D tooth segmentation methods for dental models has attracted tremendous research attention.

Although considerable efforts have been put into improving 3D tooth segmentation performance, developing an automatic method for robustly extracting each individual tooth from dental models is still a challenging task, due to the following factors. First, some patients suffer from complex abnormalities such as tooth crowding, missing and misalignment problems (Fig. 1(a) and (d)); thus adjacent teeth are often irregular and hard to separate. Second, the lack of pronounced shape variation at the boundaries between teeth and gum brings difficulties to segmentation methods based on geometric features (Fig. 1(c)). Lastly, dental models may have artifacts from the model-making process or dental braces worn by patients (Fig. 1(b)). All of this may greatly affect the tooth shape appearance, thus making segmentation error-prone.

∗ Corresponding authors.
E-mail addresses: [email protected] (D. Shen), [email protected] (W. Wang).

Fig. 1. Four typical examples with extreme appearance, including (a) missing teeth, (b) additional braces, (c) blurred boundary signals between incisors and the gum, and (d) crowding teeth.

To address these challenges, many previous works exploited handcrafted geometric features for dental model segmentation, such as surface curvature (Yuan et al., 2010), geodesic information (Sinthanayothin and Tharanont, 2008) and harmonic fields (Zou et al., 2015). However, these methods typically rely on domain-specific knowledge and lack the robustness required to represent intricate tooth shape appearances. Recently, with the advance in deep learning, more learning-based methods employing convolutional neural networks (CNNs) (Xu et al., 2018; Tian et al., 2019) or mesh-based graph neural networks (Lian et al., 2019) have been proposed with compelling performance. Unfortunately, most of these methods make the strongly restrictive assumption that the dental models consist of a complete set of natural teeth, which is difficult to satisfy; for example, nearly 70% of the patients in orthodontic clinics are at the tooth exfoliation stage, so they often do not have a fixed number of teeth (Cobourne and DiBiase, 2015). Mask-MCNet (Zanjani et al., 2019) transforms the dental model into point cloud data and uses a volumetric anchor-based region proposal network for tooth detection and segmentation. However, the proposal generation module results in resolution reduction and requires huge memory resources.

Another line of deep learning methods, including PointNet (Qi et al., 2017a), PointNet++ (Qi et al., 2017b) and PointCNN (Li et al., 2018), directly takes 3D point cloud data (e.g., mesh vertices) as input and learns deep geometric features for classification and segmentation in general geometric processing tasks. A major limitation of these methods, when applied to our tooth segmentation task, is that it is difficult to accurately separate neighboring teeth with similar shape appearances such as incisors, premolars and molars, especially on dental models with missing teeth (Fig. 9, PointNet++).

To tackle these issues, we present a novel end-to-end learning-based method for automatic tooth segmentation on 3D dental models. The core of our method is a two-stage neural network which first detects all the teeth and then segments each detected tooth accurately. In the tooth detection stage, instead of the traditional approach that utilizes bounding boxes to crop the detected objects (He et al., 2017; Hou et al., 2019; Zhou and Tuzel, 2018), we exploit the centroid (i.e., the center of mass) of a tooth to identify each tooth object, based on our observation that regardless of the tooth shape, position and orientation, the centroid is a stable feature point inside the tooth shape. It is therefore a more reliable signal than the bounding box, especially when the teeth are relatively small and packed tightly. In this way, the tooth detection problem is naturally converted to a tooth centroid prediction problem. To predict all the tooth centroids reliably, we design a distance-aware voting scheme that generates the tooth centroids from subsampled points with reliable learned local context. In the second stage of individual tooth segmentation, we first crop the points and corresponding features with the guidance of the predicted tooth centroid, and combine them as one tooth proposal. Subsequently, all the tooth proposals are sent to the segmentation module to generate individual tooth labels. Moreover, to improve segmentation accuracy, especially for tooth boundaries with blurred signals, we introduce a point-wise confidence map based on a cascade network to enhance the label learning with an attention mechanism. The newly proposed components and loss functions efficiently produce an accurate tooth segmentation and boost the usability of our algorithm in real-world clinical scenarios.

Our main contributions are summarized as follows:

• We propose a novel pipeline that formulates dental model segmentation as two sub-problems: robust tooth centroid prediction and accurate individual tooth segmentation on point cloud data.
• We design a distance-aware voting scheme to efficiently predict all tooth centroids. Besides, a confidence-aware attention mechanism is introduced to improve segmentation in noisy areas.
• Extensive evaluations and ablation studies are conducted on a dataset collected from dental clinics. Compared with the state-of-the-art methods, the proposed framework achieves superior results both qualitatively and quantitatively by a significant margin.

The rest of the paper is organized as follows. In Section 2, we briefly review the existing methods for dental model segmentation and point cloud learning. Section 3 describes the proposed methodology in detail. In Section 4, we present the quantitative and qualitative results of our method and compare with the state-of-the-art methods. We also discuss the effectiveness of different components of the network and the limitations of our approach in this section. Section 5 provides the conclusion of our study.

2. Related works

2.1. Dental model segmentation

Non-learning based methods Many traditional methods based on handcrafted geometric features have been proposed to segment 3D dental models. These methods can be roughly divided into three categories: surface curvature based methods, surface contour line based methods and harmonic field based methods.

The surface curvature based methods aim to identify the tooth boundaries. For example, Yuan et al. (2010) calculated the minimum curvatures of the tooth surface and extracted the boundary between the tooth and soft tissues. Zhao et al. (2006) proposed an interactive method based on the curvature values of the triangle mesh. Kumar et al. (2011) developed a system in which users can set a certain curvature threshold via an intuitive slider. In addition, Li et al. (2007) integrated fast marching watersheds and manual threshold regulating to improve segmentation accuracy and reduce processing time. Kronfeld et al. (2010) minimized user annotation by positioning a snake around cusp points of each tooth. Wu et al. (2014) proposed to take advantage of morphological techniques to facilitate effective tooth separation. However, these methods based on surface curvature are very sensitive to the variation of tooth surfaces and appearances.

The methods based on surface contour lines are more reliable in generating tooth boundaries because the contour lines are manually annotated. Specifically, these methods (Sinthanayothin and Tharanont, 2008; Yaqi and Zhongke, 2010) allowed users to manually select tooth boundary landmarks on a dental model. Then, the contour lines computed from the geodesic information of neighboring landmarks are formed as the desired tooth boundaries.

Although achieving good performance, these methods require users to translate or rotate the 3D model multiple times to select the particular landmarks carefully, which is tedious and time-consuming.

As for the third category (Zou et al., 2015), the framework allowed users to annotate a limited number of surface points as priors and employed a harmonic field to segment the teeth successfully. Compared to other interactive methods, this method presented a more efficient and smarter user interface with minimal interaction. However, the input model is assumed to be manifold, which requires a sophisticated preprocessing step.

Another group of methods that aim to effectively segment 3D dental models are based on 2D images. For example, Yamany and El-Bialy (1999) built a 2D image representation using the curvature and surface normal information, and extracted the structures of high/low curvatures as the segmentation results. Kondo et al. (2004) proposed to detect the tooth features both on the plane-view and panoramic-view images. Similarly, some works (Wongwaen and Sinthanayothin, 2010; Grzegorzek et al., 2010) developed systems to find the contour or cutting points on 2D sectional images and then converted them back to the 3D space for separating individual teeth. Unfortunately, these methods often fail when dental models have severe malocclusion.

Learning based methods Recently, with the development of deep learning techniques, many studies leverage neural networks on 2D images, meshes and point clouds to extract teeth from a dental model. Specifically, Xu et al. (2018) used a 2D CNN to classify the image produced from the pre-defined handcrafted features of each mesh face. Tian et al. (2019) employed a 3D CNN and a sparse voxel octree for tooth segmentation. In addition, several works (Lian et al., 2020; Lian, Wang, Wu, Liu, Durán, Ko, Shen; Sun et al., 2020) integrated a series of graph-constrained learning modules to hierarchically extract multi-scale contextual features for automatic labeling on raw dental surfaces. However, since these methods typically group points or faces into pre-defined clusters, they usually fail to process data with missing teeth, which is common in real-world clinical scenarios. In addition, Zanjani et al. (2019) extended Mask R-CNN (He et al., 2017) to 3D point clouds extracted from dental models, but it suffered from low efficiency and segmentation artifacts.

2.2. 3D point cloud learning

3D understanding is an essential task in computer vision. State-of-the-art methods take as input all kinds of 3D data to perform tasks such as 3D shape segmentation, detection and classification. Among the input data, the 3D point cloud representation is becoming more popular since it is flexible and memory efficient. Qi et al. (2017a) designed a novel network to take as input an unstructured point cloud and learn translation-invariant geometric features. Some state-of-the-art methods (Qi et al., 2017b; Li et al., 2018; Wu et al., 2019) improved this framework by recursively applying neural networks on a nested partitioning of the input point cloud, which gives the ability to learn local features with increasing contextual scales, and achieved state-of-the-art performance on many segmentation and classification tasks. However, their performance is limited in our specific task, because each tooth is very small compared with the whole dental model.
3. Methods

In this section, we present a novel framework for tooth segmentation on 3D dental models. As shown in Fig. 2, our approach takes as input the 3D point cloud extracted from the input dental model, and aims to assign every point a unique label. Specifically, we first introduce the distance-aware tooth centroid prediction module that generates a set of candidate points for the tooth centroids (Section 3.1). Then, we propose a confidence-aware attention mechanism to segment each tooth guided by the predicted tooth centroid (Section 3.2). At the testing stage, we utilize a tooth centroid clustering algorithm to speed up the segmentation, and directly transfer the point cloud labels back to the dental model (Section 3.3).

3.1. Distance-aware tooth centroid prediction

To identify a tooth object properly, we formulate the task as a tooth centroid prediction problem. Formally, given an input dental model, we first extract the mesh vertices and uniformly downsample them to obtain the input point cloud P with dimension N × 6, where N = 16,000 is the number of sampled input points and each point is described by a 6-D vector. Specifically, besides the 3D coordinates (3-dims), we also acquire the normal vector (3-dims) at each point from the dental mesh as an additional feature to provide auxiliary information. Having the input point cloud P, we first normalize it within a unit ball, and extract geometric features utilizing PointNet++ as the backbone encoder, which includes three blocks of multi-layer perceptrons (MLPs), each followed by a batch normalization layer and a ReLU nonlinearity. The output of the backbone encoder is a set of subsampled points F with dimension M × (3 + 256), where M = 256 is the number of subsampled points. For each point, in addition to the 3D coordinates, there are another 256-D features encoding the local contextual information around it.
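To make this input preparation step concrete, the sketch below is a minimal NumPy illustration; the function name build_input_cloud and the use of simple random vertex sampling are our own assumptions and are not taken from the authors' released implementation. It assembles the N × 6 input by downsampling the mesh vertices, normalizing the coordinates within a unit ball, and appending the per-vertex normals.

import numpy as np

def build_input_cloud(vertices, normals, num_points=16000, seed=0):
    # Illustrative sketch of the N x 6 input described above:
    # downsample mesh vertices, normalize to a unit ball, append normals.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(vertices), size=num_points,
                     replace=len(vertices) < num_points)
    xyz = vertices[idx].astype(np.float32)
    nrm = normals[idx].astype(np.float32)

    # Normalize the coordinates within a unit ball (center at the mean,
    # scale by the maximum radius), as done before the PointNet++ encoder.
    xyz = xyz - xyz.mean(axis=0, keepdims=True)
    xyz = xyz / np.linalg.norm(xyz, axis=1).max()

    return np.concatenate([xyz, nrm], axis=1)  # shape (num_points, 6)

# Example usage with random stand-in data (a real dental mesh would be loaded instead):
verts = np.random.rand(80000, 3)
norms = np.random.rand(80000, 3)
P = build_input_cloud(verts, norms)
print(P.shape)  # (16000, 6)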

For the dental model of an upper or lower jaw, we have the ground truth tooth centroid set C = {c_1, c_2, ..., c_k}, and the goal is to predict all tooth centroids from the subsampled points F using the learned local features. Therefore, we design a displacement function to learn the offset of each subsampled point to its corresponding tooth centroid c_i. This is feasible because if a subsampled point appears around a tooth, the encoded features capturing the tooth shape have the ability to predict the centroid of the nearby tooth. Specifically, as shown in Fig. 2, the MLPs take as input the subsampled points F with the learned local features and output a set of M displacement vectors ΔC = {(Δx_i, Δy_i, Δz_i)}. Finally, the set of regressed centroid points, Ĉ = {(x_i + Δx_i, y_i + Δy_i, z_i + Δz_i)}, i ∈ [1, M], is generated to approximate the ground truth set C, where (x_i, y_i, z_i) denotes the 3D coordinates of the i-th subsampled point F_i.

However, since the subsampled points F are uniformly sampled from the input point cloud by the farthest point sampling operation, we observe that some subsampled points may be far from any tooth, e.g., on the dental palate (Fig. 4(a)); such points encode little information about any tooth object and cannot predict reliable tooth centroids. To filter such points automatically, we exploit another distance estimation branch to regress a distance value for each subsampled point, measuring the closeness of the point to its nearest ground truth tooth centroid, as shown in Fig. 2.

To train the network, we propose three novel loss terms to supervise the prediction of all tooth centroids.

Distance estimation To remove the subsampled points that are far away from any tooth, we first measure the distance between each subsampled point and its nearest tooth centroid, and set it as the ground truth of the distance estimation. Then we utilize the smooth L1 loss to calculate the regression error. Let F^(3) denote the 3D coordinates of the subsampled points F. The loss function of the distance estimation is then defined as follows:

L_D = \sum_{f_i^{(3)} \in F^{(3)}} L_{smooth1}\Big( \hat{d}_i - \min_{c_k \in C} \big\| f_i^{(3)} - c_k \big\|_2 \Big),   (1)

where

L_{smooth1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases}   (2)

where d̂_i refers to the predicted distance value from the subsampled point F_i to its nearest tooth centroid.
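As a worked illustration of the centroid voting and the distance-estimation supervision of Eqs. (1)–(2), the following PyTorch sketch adds the predicted offsets to the subsampled coordinates and regresses each point's predicted distance against the distance to its nearest ground-truth centroid with a smooth L1 loss. The tensor names and the use of torch.cdist are our own; this is a sketch, not the authors' code.

import torch
import torch.nn.functional as F

def centroid_votes(sub_xyz, pred_offsets):
    # Regressed centroids C_hat = subsampled coordinates + predicted offsets.
    # sub_xyz: (M, 3), pred_offsets: (M, 3) from the displacement MLP head.
    return sub_xyz + pred_offsets

def distance_estimation_loss(sub_xyz, pred_dist, gt_centroids):
    # Eq. (1)-(2): smooth L1 regression of the per-point distance to the
    # nearest ground-truth centroid.
    # pred_dist: (M,), gt_centroids: (K, 3) tooth centroids of the jaw.
    nearest = torch.cdist(sub_xyz, gt_centroids).min(dim=1).values  # (M,)
    # F.smooth_l1_loss uses 0.5*x^2 for |x| < 1 and |x| - 0.5 otherwise,
    # matching Eq. (2); the sum over points matches Eq. (1).
    return F.smooth_l1_loss(pred_dist, nearest, reduction='sum')

# Toy example with random stand-ins for network outputs and annotations:
M, K = 256, 14
sub_xyz = torch.rand(M, 3)
offsets = torch.zeros(M, 3, requires_grad=True)
pred_d = torch.zeros(M, requires_grad=True)
gt_c = torch.rand(K, 3)
votes = centroid_votes(sub_xyz, offsets)          # (M, 3) regressed centroids
loss = distance_estimation_loss(sub_xyz, pred_d, gt_c)
loss.backward()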

Fig. 2. The two-stage network architecture and the algorithm pipeline. The dental mesh is first fed into the centroid prediction network in stage one, then the cropped features based on the regressed points go through the tooth segmentation network in stage two. Finally, we derive the accurately segmented tooth objects. The numbers 1, 2, 3 in the proposal box represent the input signals for the segmentation network, i.e., cropped coordinate feature, propagated point feature and dense distance field feature, respectively. See algorithm details in Section 3.

Fig. 3. The ISO standard tooth numbering system and the corresponding color coding.

With this distance estimation module in the framework, we filter out the subsampled points that have a relatively large predicted distance at both the training and testing stages. The threshold α is set to 0.2 on the normalized point sets, which is consistent with the receptive field of the last set abstraction layer in the encoder.

Chamfer distance In the tooth centroid prediction branch, we train the network by minimizing the distance between the regressed centroid set Ĉ and the ground truth centroid set C, which is formulated to consider the following two factors: (1) every tooth centroid in C should correspond to at least one regressed centroid in Ĉ (surjection); (2) every regressed centroid in Ĉ should correspond to exactly one tooth centroid in C (injection). It is a bidirectional distance minimization and we use the Chamfer distance to supervise the tooth centroid prediction. The loss function L_CD of the two sets of centroids is formulated as:

L_{CD} = \sum_{\hat{c}_i \in \hat{C}, \hat{d}_i < \alpha} \min_{c_k \in C} \| \hat{c}_i - c_k \|_2^2 + \sum_{c_k \in C} \min_{\hat{c}_i \in \hat{C}, \hat{d}_i < \alpha} \| c_k - \hat{c}_i \|_2^2,   (3)

where α = 0.2 is the threshold introduced in the distance estimation term.

Separation loss The tooth centroid prediction with distance estimation and Chamfer distance supervision already achieves excellent performance. But we still observe that a few predicted centroids are located near the boundary between two adjacent teeth, especially for the incisors of the lower jaw as shown in Fig. 5, which are relatively small and closely packed. This happens because these ambiguous centroids receive little penalization from the Chamfer distance loss. To tackle this issue, we add a separation loss, defined as:

L_S = \sum_{\hat{c}_i \in \hat{C}, \hat{d}_i < \alpha} \frac{\Delta d_1}{\Delta d_2},   (4)

where Δd_1 and Δd_2 are the distances of the predicted tooth centroid ĉ_i to its first and second closest centroids in C, respectively. This term encourages each predicted centroid to be as close as possible to a correctly corresponding tooth centroid in the ground truth set C.

Finally, our training loss function L_cp for the robust centroid point prediction is obtained by combining the three loss terms as follows:

L_{cp} = L_D + L_{CD} + \beta L_S,   (5)

where β is the balancing weight and is empirically set to 0.1 for all experiments.
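The remaining centroid losses can be sketched in the same style. The snippet below is a hedged PyTorch illustration of Eqs. (3)–(5): a bidirectional Chamfer term over the votes kept by the distance filter, the first-to-second nearest-centroid ratio of the separation loss, and their weighted combination with β = 0.1. The small epsilon for numerical stability is our addition, not part of the paper.

import torch

def chamfer_loss(pred_c, gt_c):
    # Eq. (3): bidirectional Chamfer distance between kept regressed
    # centroids pred_c (M', 3) and ground-truth centroids gt_c (K, 3).
    d = torch.cdist(pred_c, gt_c) ** 2            # (M', K) squared distances
    return d.min(dim=1).values.sum() + d.min(dim=0).values.sum()

def separation_loss(pred_c, gt_c):
    # Eq. (4): ratio of first- to second-closest GT centroid distance,
    # pushing votes away from the boundary between two adjacent teeth
    # (assumes at least two ground-truth centroids per jaw).
    d = torch.cdist(pred_c, gt_c)                 # (M', K)
    two_nearest = d.topk(k=2, dim=1, largest=False).values
    return (two_nearest[:, 0] / (two_nearest[:, 1] + 1e-8)).sum()

def centroid_prediction_loss(pred_c, pred_dist, gt_c, l_d, alpha=0.2, beta=0.1):
    # Eq. (5): L_cp = L_D + L_CD + beta * L_S, where l_d is the
    # distance-estimation loss of Eq. (1) computed separately.
    keep = pred_dist < alpha                      # discard votes far from any tooth
    kept = pred_c[keep]
    return l_d + chamfer_loss(kept, gt_c) + beta * separation_loss(kept, gt_c)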

3.2. Confidence-aware tooth segmentation

Now we discuss how to use the accurately predicted tooth centroids as guidance information to perform individual tooth segmentation.

Tooth proposal generation Since each tooth is found by at least one predicted centroid, we first generate tooth proposals according to the predicted centroids. Instead of utilizing a bounding box to crop a tooth object, we crop the nearest n = 4096 points in the input point cloud data based on the Euclidean distance to the predicted tooth centroid, which is roughly a quarter of the points of the input dental model (16,000) and ensures a complete tooth is included in the proposal. As highlighted in the top row of Fig. 2 with a red dotted box, a tooth proposal is represented by three components. The first is the cropped point coordinates (3-dims), and the second is the cropped point propagation features (32-dims). The last component is a dense distance field df^(i) (1-dim) for the i-th proposal, defined as:

df_j^{(i)} = \exp\big( -4 \, \| \hat{c}_i - \hat{p}_j^{(i)} \|^2 \big),   (6)

where ĉ_i is the predicted centroid of proposal i, while p̂_j^(i) is the 3D coordinate of point j in the cropped points. With this distance field, the foreground tooth corresponding to the predicted centroid has a higher value than the other teeth in the cropped points, so it serves as a guidance map for the segmentation sub-network.

At last, we directly concatenate the three individual features and feed them into the segmentation network to segment the foreground tooth shape.
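A minimal sketch of this proposal construction is given below (PyTorch, with illustrative names; the propagated 32-D point features are assumed to be available from the first stage). It crops the n = 4096 nearest points around one predicted centroid and appends the distance field of Eq. (6) as the last input channel.

import torch

def make_tooth_proposal(points_xyz, point_feats, centroid, n_crop=4096):
    # Build one tooth proposal around a predicted centroid, as described above.
    # points_xyz:  (N, 3)  input point coordinates (N = 16,000 in the paper)
    # point_feats: (N, 32) per-point features propagated from stage one
    # centroid:    (3,)    one predicted (clustered) tooth centroid
    dist = torch.norm(points_xyz - centroid, dim=1)        # (N,) Euclidean distances
    idx = dist.topk(k=n_crop, largest=False).indices       # n nearest points

    xyz = points_xyz[idx]                                  # (n, 3)  cropped coordinates
    feats = point_feats[idx]                               # (n, 32) cropped propagated features
    # Eq. (6): df_j = exp(-4 * ||c_hat - p_j||^2), highest on the foreground tooth.
    df = torch.exp(-4.0 * dist[idx] ** 2).unsqueeze(1)     # (n, 1)  guidance map

    return torch.cat([xyz, feats, df], dim=1)              # (n, 3 + 32 + 1) proposal input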

Confidence-aware cascaded segmentation The segmentation network, built upon PointNet++, takes as input the concatenated feature of dimension n × (3 + 32 + 1) and outputs the binary label of each point, belonging to the tooth shape or the background. Although PointNet++ demonstrates excellent performance in point cloud segmentation, it is hard to separate the tooth shape clearly from the surrounding gum due to the blurred geometric signals near the tooth boundary and large variations of tooth shapes (Fig. 6). Thus we first design our network using a cascaded segmentation scheme with two segmentation sub-networks S1 and S2, where S2 takes as input both the proposal features and the 1-dimensional segmentation result from S1. In addition, to further improve segmentation accuracy near the boundary of complicated tooth shapes (Fig. 7(a)), we propose a novel confidence-aware attention mechanism for tooth segmentation; the details are given below.

In the first segmentation sub-network S1, in addition to predicting the segmentation results of the proposals, we introduce another branch to estimate the point-wise confidence value λ, measuring the accuracy of the segmentation, defined as:

L_{S1} = \frac{1}{n} \sum_{j}^{n} \big( L^{bce}_{S1,j} \times \lambda_j \big)^2 + (1 - \lambda_j)^2,   (7)

where L^{bce}_{S1,j} indicates the point-wise binary cross-entropy (BCE) loss between the predicted point label and the ground truth label, and λ is trained in an unsupervised manner to measure the ambiguity of the predicted label. That is, the higher the value, the more accurate the prediction. A visual illustration of the point-wise confidence map is presented in Fig. 2 (the weight map). Clearly, boundary areas with blurred geometric signals tend to have lower confidence values.

In the second segmentation sub-network S2, we convert the confidence map into a normalized weight map that emphasizes the segmentation of areas with lower λ in S2, e.g., the boundary area. The training loss is:

L_{S2} = \frac{1}{n} \sum_{j}^{n} (1.0 + W_j) \times L^{bce}_{S2,j},   (8)

where W_j = 1.0 − λ_j is a point-wise value on the weight map and L^{bce}_{S2,j} refers to the point-wise BCE loss in S2.

In addition, to identify the foreground tooth ID in each proposal, we utilize the global feature extracted in S2 to make a classification and calculate the cross-entropy loss L_ID to supervise this task. Finally, we train the cascaded segmentation network using the loss function:

L_{seg} = L_{S1} + L_{S2} + L_{ID}.   (9)
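The confidence-aware training objective of Eqs. (7)–(9) can be sketched as follows. This is an illustrative PyTorch implementation under our own assumptions (e.g., the confidence map is detached when re-weighting S2, and the binary labels are float tensors); it is not the authors' released code.

import torch
import torch.nn.functional as F

def stage1_loss(logits_s1, labels, confidence):
    # Eq. (7): per-point BCE scaled by the predicted confidence lambda, plus a
    # (1 - lambda)^2 term that keeps the unsupervised confidence from collapsing.
    # logits_s1: (n,) raw scores from S1; labels: (n,) float binary labels;
    # confidence: (n,) predicted per-point lambda in [0, 1].
    bce = F.binary_cross_entropy_with_logits(logits_s1, labels, reduction='none')
    return torch.mean((bce * confidence) ** 2 + (1.0 - confidence) ** 2)

def stage2_loss(logits_s2, labels, confidence):
    # Eq. (8): re-weight the BCE of S2 with W = 1 - lambda so that low-confidence
    # regions (typically tooth boundaries) receive more attention.
    bce = F.binary_cross_entropy_with_logits(logits_s2, labels, reduction='none')
    weight = 1.0 - confidence.detach()   # treat the confidence as a fixed attention map
    return torch.mean((1.0 + weight) * bce)

def segmentation_loss(logits_s1, logits_s2, labels, confidence, id_logits, id_label):
    # Eq. (9): L_seg = L_S1 + L_S2 + L_ID, with L_ID the cross-entropy of the
    # foreground tooth-ID classification from the global feature of S2.
    l_id = F.cross_entropy(id_logits.unsqueeze(0), id_label.view(1))
    return stage1_loss(logits_s1, labels, confidence) + \
           stage2_loss(logits_s2, labels, confidence) + l_id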

.3. Centroid clustering and label prediction

In the previous step, the predicted tooth centroids exhibit the

lustering tendency as shown in Figs. 4, 5, 8 . To remove redun-

ant tooth centroids and speed up processing, in both training and

esting phases, we first apply the DBSCAN ( Ester et al., 1996 ) clus-

ering algorithm to all the predicted centroids controlled by the

istance threshold l . Here, l is empirically set to 0.015, which is

elatively small compared to the tooth size in the normalized point

loud data. For every cluster, we calculate the representative aver-

ge centroid point and derive the corresponding proposal for seg-

entation.

During the testing phase, after the individual tooth extraction

n the generated proposals, the next step is to produce the labels

or the input point cloud data. To this end, we first calculate the

oreground point overlap of each two proposals. If the Intersection

ver Union (IoU) is higher than the threshold 0.35, the two propos-

ls are regarded to contain the same tooth. In this case, we aver-

ge the point-wise label probability to fuse the overlapped points.

t last, the point cloud labels are directly transferred back to the

ental surface based on the trilinear interpolation.

In implementation, we first train the centroid prediction net-

ork for 500 epochs, then connect the single tooth segmentation

etwork and jointly train the framework for 100 epochs. We uti-

ize Adam’s solver with a fixed learning rate of 1 × 10 −3 . Generally,

sing one Nvidia GeForce 1080Ti GPU, it takes about 4 h for the

entroid prediction network training and 18 h for the joint train-

ng.
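The testing-time post-processing described above can be sketched with off-the-shelf tools: DBSCAN from scikit-learn for centroid clustering, and a simple pairwise IoU check for fusing proposals that cover the same tooth. This is a simplified illustration under our own assumptions (e.g., min_samples=1 and the greedy, non-transitive merging are our choices, not specified in the paper).

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_centroids(pred_centroids, eps=0.015):
    # Collapse redundant centroid votes with DBSCAN (distance threshold l = 0.015
    # on the normalized cloud) and return one average centroid per cluster.
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(pred_centroids)
    return np.stack([pred_centroids[labels == k].mean(axis=0)
                     for k in np.unique(labels) if k != -1])

def fuse_overlapping_proposals(fg_masks, fg_probs, iou_thresh=0.35):
    # Merge proposals whose foreground point sets overlap with IoU > 0.35 by
    # averaging their point-wise label probabilities (testing-phase fusion above).
    # fg_masks: list of boolean arrays of length N (foreground points per proposal)
    # fg_probs: list of float arrays of length N (per-point tooth probabilities)
    merged = list(range(len(fg_masks)))           # proposal -> group id
    for i in range(len(fg_masks)):
        for j in range(i + 1, len(fg_masks)):
            inter = np.logical_and(fg_masks[i], fg_masks[j]).sum()
            union = np.logical_or(fg_masks[i], fg_masks[j]).sum()
            if union > 0 and inter / union > iou_thresh:
                merged[j] = merged[i]             # regarded as the same tooth
    groups = {}
    for p, g in enumerate(merged):
        groups.setdefault(g, []).append(fg_probs[p])
    # One averaged probability map per detected tooth.
    return [np.mean(np.stack(ps), axis=0) for ps in groups.values()]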

4. Experiments and results

In this section, we evaluate our algorithm on a dataset collected from real-world clinics, including upper and lower jaws. The tooth identification is based on the dental notation system (ISO-3950) (Grace, 2000) (as shown in Fig. 3), which is consistent with the color coding of our segmentation results. The teeth subgroups for evaluation purposes in this section, i.e., incisor, canine, premolar, molar (in Tables 2 and 3), are set according to the types marked in Fig. 3 as well. All experiments are performed on a computer with an Intel(R) Xeon(R) V4 1.9 GHz CPU, a 1080Ti GPU, and 32 GB RAM.

4.1. Dataset and evaluation metrics

Fig. 4. The qualitative comparison of the centroid prediction results with (b) or without (a) the robust filter. The first row shows centroid point prediction results with paired purple and red points indicating the start and end positions, while the second row shows the corresponding segmentation results using bNet_seg. The wrongly predicted points lead to incorrect tooth segmentation (red color). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. The qualitative comparison of the centroid point prediction. The first row shows the predicted centroid points, while the second row shows the segmentation results using bNet_seg. Without the separation loss, bNet_cp-D-CD outputs ambiguous points that cheat the clustering algorithm in the testing stage into missing some teeth, as highlighted in the dotted boxes.

Fig. 6. The qualitative comparison of tooth segmentation with (b) or without (a) the cascaded refinement. With the refinement module, bNet_seg-C generates results without artifacts in the tooth body part.

Fig. 7. The qualitative comparison of tooth segmentation with (b) or without (a) the confidence-aware refinement. The tooth boundaries are highlighted in red boxes and TSegNet generates a more accurate boundary. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

To train the network, we collected a set of dental models from some patients before or after orthodontics, which include many cases with abnormal tooth shapes, such as crowded teeth, missing teeth and additional braces. The dataset includes a total of 2000 dental models (1000 upper jaws and 1000 lower jaws), where each dental surface contains about 150,000 faces and 80,000 vertices. We randomly split it into three subsets: 1500 models for training, 100 models for validation and 400 models for testing. To obtain the ground truth, we manually annotated the tooth-level labels, and the centroid of each tooth is calculated based on the labeled mask. To quantitatively evaluate the performance of our method, we use the mean distance (MeanD) and max distance (MaxD) metrics to validate the performance of the tooth centroid prediction, defined as:

MeanD(R_1, R_2) = \frac{1}{|R_1|} \sum_{r_i^1 \in R_1} \min_{r_j^2 \in R_2} \| r_i^1 - r_j^2 \|_2^2,   (10)

MaxD(R_1, R_2) = \max_{r_i^1 \in R_1} \min_{r_j^2 \in R_2} \| r_i^1 - r_j^2 \|_2^2,   (11)

where R_1 and R_2 represent two point sets. The two metrics are computed between the predicted tooth centroid set and the ground truth tooth centroid set in a bidirectional manner (Table 1). For the segmentation task, we utilize the dice similarity coefficient (DSC) metric on the point cloud and the dental surface respectively, calculated as:

DSC_{point} = \frac{2 \times | L_{GT} \cap L_P |}{| L_{GT} | + | L_P |},   (12)

DSC_{surface} = \frac{2 \times Area( L_{GT} \cap L_P )}{Area( L_{GT} ) + Area( L_P )},   (13)

where L_GT and L_P denote the ground truth tooth labels and the corresponding predicted labels, respectively. Note that the DSC of the dental surface is calculated in a manner weighted by the face area. Besides, the macro F1-score (F1) is used to measure the tooth identification accuracy (Opitz and Burst, 2019). In the following quantitative results, other than Fig. 10, we report the average values computed on the testing subset.

es computed on the testing subset.

.2. Ablation analysis of key components

We conduct extensive experiments to validate the effectiveness

f our network components and loss functions. First, we build

aseline networks for both tooth centroid prediction and individ-

al tooth segmentation tasks, which are denoted as bNet cp and

Net seg , respectively. For the tooth centroid prediction baseline

etwork, we directly supervise all the subsampled points to trans-

ate to their nearest tooth centroids, while the segmentation base-

ine network is the single PointNet++ segmentation module with-

ut the confidence-aware cascade mechanism. All the alternative

Fig. 8. Representative segmentation results. From left to right: input, predicted centroid points, tooth segmentation on the point cloud, tooth segmentation on dental models with two different views. The accurate segmentation boundary is highlighted in the boxes.

Fig. 9. The visual comparison of dental model segmentation results produced by different methods, with each row corresponding to a typical example of the upper or lower jaw. From left to right are the scanned dental surface, the ground truth result, results of other methods (3rd-5th columns) and result of our method (last column). Red dotted circles and arrows represent some segmentation details. '†' denotes the method is a semi-automatic method. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

All the alternative networks are derived by augmenting the baseline network with different network components or loss terms, and are trained on the same training dataset. We describe the details and present quantitative and qualitative results in the following section.

Benefits of distance estimation The distance estimator in the tooth centroid prediction module serves as a robust filter to remove the subsampled points that are far away from their nearest tooth centroids. To validate its benefits, we augment the baseline network bNet_cp with the distance-aware filter (bNet_cp-D) and compare the tooth centroid prediction results of both networks, as shown in Table 1 and Fig. 4. Quantitatively, bNet_cp-D consistently improves the tooth centroid prediction results on all metrics. Specifically, the max distance MaxD(Ĉ, C) from the predicted tooth centroid set Ĉ to the ground truth tooth centroid set C is reduced by a large margin (86.245 × 10^-3 vs. 60.065 × 10^-3), which demonstrates that the filtered subsampled points encode little tooth shape information and usually produce unreliable tooth centroid predictions. Correspondingly, the segmentation performance is improved accordingly (1.2% and 1.3% improvements of DSC_point and DSC_surface, respectively).

In addition, to analyze the effectiveness of the proposed distance-aware filter more comprehensively, we visualize the displacement vectors in the point cloud (the first row of Fig. 4) and their corresponding dental model segmentation results (the second row of Fig. 4).

Table 1
Statistical performance of the tooth centroid prediction and segmentation with different tooth centroid prediction loss terms. Here, the distance metric is calculated on the point cloud data normalized within a unit ball, i.e., the ratio referencing the unit length.

Network        | Tooth centroid prediction [x10^-3]                                          | Segmentation [%]
               | MeanD(C, Ĉ)    MaxD(C, Ĉ)       MeanD(Ĉ, C)      MaxD(Ĉ, C)       | DSC_point    DSC_surface
bNet_cp        | 3.679 ± 1.850  16.553 ± 18.135  10.650 ± 10.662  86.245 ± 67.863  | 94.3 ± 2.4   95.2 ± 2.1
bNet_cp-D      | 2.998 ± 1.123  10.091 ± 12.584   9.350 ± 3.576   60.065 ± 18.760  | 95.5 ± 1.7   96.5 ± 1.4
bNet_cp-D-CD   | 2.673 ± 1.135   9.893 ± 12.159   6.857 ± 3.908   42.991 ± 19.836  | 95.9 ± 1.4   96.8 ± 1.1
bNet_cp-D-CD-S | 2.565 ± 0.880   8.785 ± 11.899   6.961 ± 3.490   14.883 ± 10.652  | 96.1 ± 1.2   96.9 ± 0.9

Fig. 10. The segmentation performance of different methods on normal and abnormal cases. (a) DSC accuracy on the point cloud; (b) DSC accuracy on the dental surface.

On the one hand, we have efficiently filtered sample points that are far away from any centroid point and less likely to find an optimal position. Usually, these points would result in wrongly regressed centroid points and segmentation results, as highlighted using the red color in Fig. 4. On the other hand, with the learned filter, the predicted points tend to lie close to the target points, which demonstrates the clustering effect benefiting the proposal generation.

Chamfer distance loss To supervise the tooth centroid prediction, instead of using the intuitive way that directly forces each subsampled point to move to its nearest tooth centroid, we utilize the Chamfer distance to calculate bidirectional distances (bNet_cp-D-CD). Compared to bNet_cp-D, the mean distances MeanD(C, Ĉ) and MeanD(Ĉ, C) are reduced from 2.998 × 10^-3 to 2.673 × 10^-3 and from 9.350 × 10^-3 to 6.857 × 10^-3, demonstrating that the Chamfer distance is key to the success of our tooth detection component. Not surprisingly, it also improves the segmentation accuracy, DSC_point from 95.5% to 95.9% and DSC_surface from 96.5% to 96.8%, respectively.

Separation loss To validate the effectiveness of the separation loss in the accurate tooth centroid prediction, especially for incisors that are crowded and packed together, we explore the alternative loss combination by augmenting bNet_cp-D-CD with the separation loss, denoted as bNet_cp-D-CD-S. Statistically, with the separation loss, MaxD(Ĉ, C) gains a remarkable reduction of about 28 × 10^-3, as shown in Table 1. Although only little improvement is achieved in the other three metrics, the importance of the separation loss is presented more clearly in the visual comparison in Fig. 5. For the lower jaw dental model, bNet_cp-D-CD-S successfully predicts all correct centroids, while bNet_cp-D-CD misses one incisor, since the predicted centroids around the incisors are clustered into one group so as to miss one proposal in the proposal generation stage. Considering the small crown area of incisors, this contributes only slightly to the segmentation metrics: a 0.2% improvement of DSC_point and a 0.1% improvement of DSC_surface.

It is interesting to find that the distance metrics along the two directions differ a lot. The reason is that under the supervision of the centroid detection losses, every ground truth centroid receives at least one regressed centroid, and most of them are close enough to the target, but a few regressed centroids are a little bit far away from the nearest tooth centroid. Thus, from Ĉ to C, the errors are bigger. The bidirectional constraints reveal that the two metrics share equal importance. A higher error of (Ĉ, C) indicates that some regressed centroids are far away from any tooth, which would lead to over-detection; while a higher error of (C, Ĉ) implies that some ground truth centroids are not detected by any regressed centroid point, which usually leads to miss-detection.

Cascaded segmentation refinement The cascading mechanism usually has a beneficial influence on the image segmentation problem. To validate the efficacy of the cascading scheme in our specific task, we first use bNet_cp-D-CD-S as the tooth centroid prediction network and augment the base segmentation network (bNet_seg) with another sub-module to refine the preliminary results (denoted as bNet_seg-C). The quantitative results are listed in Table 2 for comparison. It can be seen that the cascading network bNet_seg-C consistently improves the segmentation performance of all four tooth types with higher DSC_point and DSC_surface values, especially for teeth with irregular shapes. One typical example in Fig. 6 presents the visual comparison. As highlighted in the brown boxes, the special case with extreme appearance can be robustly handled by the network bNet_seg-C. In addition, the cascade scheme also brings a 0.9% F1 score improvement for the tooth identification, coming from the correctly predicted labels.

Fig. 11. Segmentation results of dental models with complex appearances, including teeth missing, crowding and irregular shapes highlighted by red boxes. Four attention maps of abnormal cases in the last row are also presented, and the red color indicates higher segmentation confidence while the blue color indicates lower segmentation confidence. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2
Numerical performance of segmentation accuracy for different segmentation network variants. The F1 scores are also included.

Methods    | DSC_point [%]: Incisor Canine Premolar Molar Mean | DSC_surface [%]: Incisor Canine Premolar Molar Mean | F1 [%]
bNet_seg   | 95.1  96.5  96.3  96.6  96.1 | 96.4  96.9  97.0  97.2  96.9 | 92.5
bNet_seg-C | 97.1  98.1  97.4  97.2  97.4 | 97.6  98.1  97.6  97.7  97.8 | 93.4
TSegNet    | 97.9  98.2  98.1  97.9  98.0 | 98.3  98.5  98.6  98.8  98.6 | 94.2

Table 3
The quantitative comparisons with state-of-the-art methods on tooth detection, segmentation and running time metrics. '†' denotes a semi-automatic method and '–' means the metric is non-applicable.

Methods          | DSC_point [%]: Incisor Canine Premolar Molar Mean | DSC_surface [%]: Incisor Canine Premolar Molar Mean | F1 [%] | Time [s]
PointNet++       | 90.5  91.9  73.0  88.8  86.1 | 91.4  92.6  76.5  90.2  87.7 | 87.4 | 0.3
Harmonic Field † | 92.2  90.0  95.0  95.4  93.2 | 92.5  90.1  95.1  96.0  93.4 | –    | 30.0
Mask-MCNet       | 91.3  92.6  90.1  91.9  91.5 | 92.4  93.0  91.6  93.0  92.5 | 91.2 | 18.1
Ours (TSegNet)   | 97.9  98.2  98.1  97.9  98.0 | 98.3  98.5  98.6  98.8  98.6 | 94.2 | 0.8

Confidence-aware segmentation refinement To validate the effectiveness of the confidence-aware cascading mechanism, we further augment bNet_seg-C with the confidence map as our final network (TSegNet), which encourages the segmentation network to pay more attention to areas with relatively low confidence, especially near the tooth boundary and ambiguous regions with blurred geometric signals. As illustrated in Table 2, compared to the common cascade mechanism (bNet_seg-C), the confidence-aware segmentation network improves the average DSC_point and DSC_surface by 0.6% and 0.8%, respectively. The qualitative results in Fig. 7 also show that TSegNet can produce more reliable segmentation results without artifacts. More representative and challenging segmentation results of TSegNet are presented in Figs. 8 and 11.

4.3. Comparison with state-of-the-art methods

We compare our framework with several state-of-the-art point or mesh segmentation approaches, including PointNet++ (Qi et al., 2017b), harmonic field (Zou et al., 2015) and Mask-MCNet (Zanjani et al., 2019). The first directly takes the 3D point cloud as input and achieves state-of-the-art performance on many public segmentation datasets. The last two are specialized methods for dental model segmentation. Specifically, Zou et al. (2015) presents a semi-automatic method based on geometric surface features and outperforms other traditional methods. Zanjani et al. (2019) extends Mask R-CNN to 3D point clouds and achieves leading performance in automatic dental model segmentation. For a fair comparison, we train PointNet++, Mask-MCNet and our method with the same point cloud input (i.e., coordinates and normals). The statistical and visual comparisons are shown in Table 3 and Fig. 9, respectively.

Quantitative comparison The overall tooth segmentation, identification and running time results are summarised in Table 3, where our framework significantly outperforms the other state-of-the-art methods by a large margin. Concretely, compared with the backbone network PointNet++, our method leads to 11.9%, 10.9% and 6.8% improvements of DSC_point, DSC_surface and F1 score, which demonstrates the effectiveness of the network architecture and loss design. Moreover, our framework inherits the efficiency of PointNet++ as it shows comparable running time (0.8 s vs. 0.3 s). Although Zou et al. (2015) proposed a semi-automatic method that employs a harmonic field of the crown surface and high-level semantic information manually provided by users, our fully automatic framework still outperforms it in terms of segmentation accuracy and running time. Note that the harmonic field based method is unable to predict tooth identification automatically. At last, it is observed that our approach achieves better results than Mask-MCNet, which is a state-of-the-art learning based method for this specific task. In particular, our method boosts the segmentation accuracy from 91.5% to 98.0% (DSC_point), 92.5% to 98.6% (DSC_surface), and the F1 score of tooth identification from 91.2% to 94.2%.

o

f

I

7

p

t

h

n

a

o

2

2

t

w

d

o

o

D

H

r

a

p

a

c

t

i

m

t

b

P

t

t

m

b

s

t

p

F

t

t

m

4

e

M

m

d

t

m

t

a

t

e

m

s

s

m

t

r

e

t

i

i

c

p

p

4

s

t

N

t

g

w

d

w

s

p

fi

w

c

t

f

s

o

c

t

s

o

d

w

o

r

m

b

r

m

e

t

c

a

t

T

a

b

m

b

g

d

m

F

i

F

i

l

n

s

t

w

t

p

s

o

c

ur anchor-free method is more efficient and nearly 25 times

aster.

We also quantify segmentation results per tooth type in Table 3 .

t can be seen that PointNet++ only obtains 73 . 0% DSC point and

6 . 5% DSC sur face for premolar teeth, which is much lower com-

ared to other types. The reason is that most patients seeking or-

hodontic treatment are in the tooth exfoliation period, and usually

ave unfixed number of premolars. In addition, young children do

ot have premolars because these teeth do not grow until they are

round 10 years old. Thus, these clustering-based learning meth-

ds, such as PointNet++ ( Qi et al., 2017b ), MeshSegNet ( Lian et al.,

020; Lian, Wang, Wu, Liu, Durán, Ko, Shen ) and TGCNN ( Xu et al.,

018 ), cannot robustly handle the cases with missing teeth even

hough it is a common situation in real-world clinics.

To further demonstrate the robustness of our proposed method,

e construct two testing subsets containing the abnormal (206

ental models) and normal cases (194 dental models) based on

ur testing dataset (400 dental models). As shown in Fig. 10 ,

ur method is robust to handle the abnormal cases and the two

SC metrics change mildly on the normal and abnormal subsets.

owever, the performance of PointNet++ and Mask-MCNet drops

apidly on abnormal cases, due to their lack of ability to handle

bnormal cases with teeth crowding, missing and misalignment

roblems. It is also worth noting that Harmonic Field is a semi-

utomatic method, where additional human input would help pro-

ess the abnormal cases to some extent, but it is still hard to find

he accurate tooth boundaries.

Qualitative comparison The visual comparison results are shown in Fig. 9 for upper and lower jaws. It can be observed that the segmentation results produced by our method match the ground truth better, especially for extreme cases such as additional braces (the third row) or crowded teeth (the fourth row). Notably, PointNet++ and Mask-MCNet usually produce many artifacts in the tooth body and boundary areas. This shows that the high-level features extracted by such methods are not reliable when the dental model has blurred geometric signals. Besides, the harmonic field based method heavily depends on human interaction and is sensitive to variations of tooth shape appearance. For example, it fails to extract a complete tooth body when the tooth surface is complicated (as highlighted by red arrows in the fourth column of Fig. 9). The qualitative results shown in Fig. 9 are consistent with the quantitative comparison, which further demonstrates the effectiveness and efficiency of our framework for automatic tooth segmentation and identification on dental models.

4.4. Discussions

In clinical practice, automatic dental model segmentation is an essential yet challenging problem in computer-aided orthodontics. Many algorithms, including traditional and deep learning based methods, have been explored to extract individual teeth from the dental model. However, these methods cannot robustly handle typical cases with extreme appearance before orthodontic treatment. In this paper, we propose a two-stage framework with a distance-aware centroid prediction module and a confidence-aware cascade segmentation module to successfully extract all teeth from dental models with large variations.

Parameter analysis In total, there are five core hyper-parameters used in our method. To analyze the robustness of our method to these parameters, we conduct five experiments with different parameter settings, as shown in Fig. 12, and report the statistics in terms of segmentation accuracy. Specifically, our method achieves comparable segmentation results when changing the number (N = 16,000) of input sampled points (Fig. 12(a)). The reason is that after the first set abstraction layer of the point cloud encoder, 1024 points are sampled via farthest point sampling to encode the local context; this number is far smaller than N, and the resulting distribution is similar for different numbers of input points. Overall, our network is insensitive to the choice of N. As for the number (n = 4096) of cropped points within a tooth proposal, it should ensure that a complete foreground tooth is included without too many background points. When varying n to a smaller or larger value than 4096, a slight performance degradation appears (Fig. 12(b)). A similar curve tendency can be observed in Fig. 12(a), since the two parameters n and N are tightly coupled. In our configuration, N = 16,000 and n = 4096 achieved the best performance. In the tooth centroid detection stage, M = 256 candidate points are regressed to indicate tooth objects. If M is small, some teeth with a small crown area, e.g., the lower central incisor, may be miss-detected. As illustrated in Fig. 12(c), the performance drops rapidly when M is set to 64. Instead, when M is set to 256 or larger, it is sufficient to capture all tooth objects in the dental model and no performance fluctuation occurs. Meanwhile, in the distance-aware filter, the distance threshold α is set to 0.2, which is consistent with the receptive field of the last set abstraction layer in the encoder. A smaller α would filter out more candidate points and lead to miss-detection, while a larger α tends to take points far from any tooth into consideration and leads to over-detection, as shown in Fig. 12(d). The last core parameter is the distance threshold l = 0.015 in the DBSCAN clustering algorithm. Due to the strong clustering effect achieved by our method in the tooth centroid detection stage, the results are insensitive to the choice of l (Fig. 12(e)).
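To make the role of α and l concrete, the filtering and clustering step described above can be sketched as follows. This is a simplified illustration under assumptions (e.g., min_samples=1 and the placeholder names candidates and pred_dist), not the exact implementation; candidates holds the M regressed candidate points and pred_dist their predicted distances to the nearest tooth centroid.

import numpy as np
from sklearn.cluster import DBSCAN

def detect_tooth_centroids(candidates, pred_dist, alpha=0.2, eps=0.015):
    # Distance-aware filter: discard candidates predicted to lie far
    # from any tooth centroid (threshold alpha).
    kept = candidates[pred_dist < alpha]
    # DBSCAN with distance threshold l = eps; each dense cluster of the
    # remaining candidates indicates one tooth object.
    labels = DBSCAN(eps=eps, min_samples=1).fit(kept).labels_
    return np.stack([kept[labels == k].mean(axis=0) for k in np.unique(labels)])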

Centroid vs. bounding box In the first stage of our network, instead of utilizing a bounding box, an intuitive way to indicate an object in 2D or 3D images, we design a centroid voting method to detect and represent each tooth. To investigate its effectiveness, we conduct an experiment comparing the two representations by replacing only the centroid prediction module in TSegNet with a bounding box regression module. As illustrated in Fig. 13, a premolar tooth fails to be detected by any bounding box. This is because the predicted bounding box of the premolar tooth has a relatively large overlap with the bounding box of the neighboring molar tooth and is filtered out by the non-maximum suppression (NMS) operator designed to remove redundant boxes. In addition, there is under-segmentation of the canine tooth because the corresponding bounding box does not cover the tooth appropriately. Generally, the centroid representation has two main advantages over the bounding box representation in this task. First, centroid voting and detection is more efficient and accurate than the NMS operator for filtering redundant detections. Second, the tooth centroid is a more stable quantity that is less sensitive to the tooth boundary, while the bounding box is mainly determined by the tooth shape, whose geometric signals may be blurred.
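The suppression behaviour discussed above can be illustrated with a small greedy NMS sketch over axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax). This is only an illustration of the failure mode, not the compared baseline's implementation, and the IoU threshold is an assumed value.

import numpy as np

def iou_3d(a, b):
    lo, hi = np.maximum(a[:3], b[:3]), np.minimum(a[3:], b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol = lambda box: np.prod(box[3:] - box[:3])
    return inter / (vol(a) + vol(b) - inter + 1e-8)

def nms_3d(boxes, scores, iou_thresh=0.25):
    keep = []
    for i in np.argsort(scores)[::-1]:
        # A small premolar box largely covered by its neighbouring molar box
        # exceeds the overlap threshold here and is discarded, which is how
        # the missed detection in Fig. 13(a) arises.
        if all(iou_3d(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep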

Teeth missing problem Missing teeth is a common problem in dental clinics. As shown in Fig. 14, there are two typical missing-teeth cases, referenced against the normal case in Fig. 14(a). In Fig. 14(b), a cuspid tooth is missing in the left half and a premolar in the right half, but visually there is no vacant position. In Fig. 14(c), a cuspid tooth is missing in the left half with a corresponding vacant position. Generally, it is challenging for semantic-based learning methods, e.g., PointNet++, to successfully process the abnormal case in Fig. 14(b). However, instead of relying on the global semantic context encoded in the input point cloud, we regress a tooth centroid from the local points around the target tooth. Thus, with the proposed distance filter, the Chamfer distance loss and the separation loss, our method can accurately regress the centroid points to indicate tooth objects confidently.
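As a reference for the Chamfer distance loss mentioned above, the snippet below sketches the standard symmetric Chamfer term between the regressed centroid candidates and the ground-truth tooth centroids. It follows the common formulation rather than the authors' exact loss, and the separation loss is not reproduced here.

import torch

def chamfer_distance(pred, gt):
    # pred: (M, 3) regressed centroid candidates; gt: (T, 3) ground-truth centroids.
    d = torch.cdist(pred, gt)                        # (M, T) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()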

Limitation Although our proposed framework has achieved outstanding tooth segmentation results and outperforms many state-of-the-art methods, it presents some limitations that are worth considering. One typical example is that it tends to yield incomplete tooth segmentation in some cases, such as the wisdom tooth and the rudimentary tooth. One possible reason is that these cases are quite rare and seldom seen by the network during the training phase. Specifically, the wisdom tooth is a special case because it has large shape variations and usually only a small part of the crown appears on the dental model. As shown in Fig. 15(a), we fail to detect the wisdom tooth marked by the brown box. The rudimentary tooth shares a similar situation: the visible crown part has a small area and is quite different from other teeth. Thus, some background area is likely to be treated as part of a rudimentary tooth, as illustrated in Fig. 15(b). In the future, we would like to explore more effective methods for datasets with an imbalanced tooth type distribution.


Fig. 12. The tooth segmentation performance of our TSegNet when changing the values of different parameters.

Fig. 13. Visual comparison between the bounding box and tooth centroid representations. (a) The predicted bounding boxes and corresponding segmentation results. (b) The predicted tooth centroids and corresponding segmentation results.

Fig. 14. Normal and two typical examples with missing teeth. (a) A normal case; (b) missing a cuspid tooth in the left half and a premolar in the right half; (c) missing a cuspid tooth in the left half.


Fig. 15. Failure cases. (a) The wisdom tooth with abnormal shape fails to be detected. (b) The rudimentary tooth is hard to segment accurately due to its small crown area.



5. Conclusion

In this work, we develop a novel fully automatic algorithm to segment teeth on 3D dental models guided by tooth centroid information. The algorithm builds upon a two-stage neural network containing a robust tooth centroid prediction subnetwork and a single tooth segmentation subnetwork with our novel components and loss functions. We have evaluated our algorithm both qualitatively and quantitatively, and compared it with state-of-the-art learning and non-learning based methods, where our method produces superior results and significantly outperforms others.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.



CRediT authorship contribution statement

Zhiming Cui: Methodology, Software, Writing - original draft. Changjian Li: Methodology, Writing - original draft, Software. Nenglun Chen: Methodology, Writing - original draft. Guodong Wei: Methodology, Data curation. Runnan Chen: Methodology. Yuanfeng Zhou: Data curation, Writing - original draft. Wenping Wang: Supervision, Writing - original draft.

References

Cobourne, M.T., DiBiase, A.T., 2015. Handbook of Orthodontics. Elsevier Health Sciences.
Cui, Z., Li, C., Wang, W., 2019. ToothNet: automatic tooth instance segmentation and identification from cone beam CT images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6368-6377.
Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226-231.
Grace, M., 2000. Dental notation. Br. Dent. J. 188, 229.
Grzegorzek, M., Trierscheid, M., Papoutsis, D., Paulus, D., 2010. A multi-stage approach for 3D teeth segmentation from dentition surfaces. In: International Conference on Image and Signal Processing. Springer, pp. 521-530.
Hajeer, M., Millett, D., Ayoub, A., Siebert, J., 2004a. Applications of 3D imaging in orthodontics: part I. J. Orthod. 31, 62-70.
Hajeer, M., Millett, D., Ayoub, A., Siebert, J., 2004b. Applications of 3D imaging in orthodontics: part II. J. Orthod. 31, 154-162.
He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961-2969.
Hou, J., Dai, A., Nießner, M., 2019. 3D-SIS: 3D semantic instance segmentation of RGB-D scans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4421-4430.
Kondo, T., Ong, S.H., Foong, K.W., 2004. Tooth segmentation of dental study models using range images. IEEE Trans. Med. Imaging 23, 350-362.
Kronfeld, T., Brunner, D., Brunnett, G., 2010. Snake-based segmentation of teeth from virtual dental casts. Comput. Aided Des. Appl. 7, 221-233.
Kumar, Y., Janardan, R., Larson, B., Moon, J., 2011. Improved segmentation of teeth in dental models. Comput. Aided Des. Appl. 8, 211-224.
Lechuga, L., Weidlich, G.A., 2016. Cone beam CT vs. fan beam CT: a comparison of image quality and dose delivered between two differing CT imaging modalities. Cureus 8 (9), e778.
Li, Y., Bu, R., Sun, M., Wu, W., Di, X., Chen, B., 2018. PointCNN: convolution on X-transformed points. In: Advances in Neural Information Processing Systems, pp. 820-830.
Li, Z., Ning, X., Wang, Z., 2007. A fast segmentation method for STL teeth model. In: 2007 IEEE/ICME International Conference on Complex Medical Engineering. IEEE, pp. 163-166.
Lian, C., Wang, L., Wu, T.H., Liu, M., Durán, F., Ko, C.C., Shen, D., 2019. MeshSNet: deep multi-scale mesh feature learning for end-to-end tooth labeling on 3D dental surfaces. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, pp. 837-845.
Lian, C., Wang, L., Wu, T.H., Wang, F., Yap, P.T., Ko, C.C., Shen, D., 2020. Deep multi-scale mesh feature learning for automated labeling of raw dental surfaces from 3D intraoral scanners. IEEE Trans. Med. Imaging 39 (7), 2440-2450.
Opitz, J., Burst, S., 2019. Macro F1 and macro F1. arXiv preprint arXiv:1911.03347.
Qi, C.R., Su, H., Mo, K., Guibas, L.J., 2017a. PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652-660.
Qi, C.R., Yi, L., Su, H., Guibas, L.J., 2017b. PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, pp. 5099-5108.
Sinthanayothin, C., Tharanont, W., 2008. Orthodontics treatment simulation by teeth segmentation and setup. In: 2008 5th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology. IEEE, pp. 81-84.
Sun, D., Pei, Y., Song, G., Guo, Y., Ma, G., Xu, T., Zha, H., 2020. Tooth segmentation and labeling from digital dental casts. In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, pp. 669-673.
Tian, S., Dai, N., Zhang, B., Yuan, F., Yu, Q., Cheng, X., 2019. Automatic classification and segmentation of teeth on 3D dental model using hierarchical deep learning networks. IEEE Access 7, 84817-84828.
Wongwaen, N., Sinthanayothin, C., 2010. Computerized algorithm for 3D teeth segmentation. In: 2010 International Conference on Electronics and Information Engineering. IEEE, pp. V1-277.
Wu, K., Chen, L., Li, J., Zhou, Y., 2014. Tooth segmentation on dental meshes using morphologic skeleton. Comput. Graph. 38, 199-211.
Wu, W., Qi, Z., Fuxin, L., 2019. PointConv: deep convolutional networks on 3D point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9621-9630.
Xu, X., Liu, C., Zheng, Y., 2018. 3D tooth segmentation and labeling using deep convolutional neural networks. IEEE Trans. Vis. Comput. Graph. 25, 2336-2348.
Yamany, S.M., El-Bialy, A.M., 1999. Efficient free-form surface representation with application in orthodontics. In: Three-Dimensional Image Capture and Applications II. International Society for Optics and Photonics, pp. 115-124.
Yaqi, M., Zhongke, L., 2010. Computer aided orthodontics treatment by virtual segmentation and adjustment. In: 2010 International Conference on Image Analysis and Signal Processing. IEEE, pp. 336-339.
Yuan, T., Liao, W., Dai, N., Cheng, X., Yu, Q., 2010. Single-tooth modeling for 3D dental model. J. Biomed. Imaging 2010, 9.
Zanjani, F.G., Moin, D.A., Claessen, F., Cherici, T., Parinussa, S., Pourtaherian, A., Zinger, S., et al., 2019. Mask-MCNet: instance segmentation in 3D point cloud of intra-oral scans. In: 22nd International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2019). Springer.
Zhao, M., Ma, L., Tan, W., Nie, D., 2006. Interactive tooth segmentation of dental models. In: 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference. IEEE, pp. 654-657.
Zhou, Y., Tuzel, O., 2018. VoxelNet: end-to-end learning for point cloud based 3D object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4490-4499.
Zou, B.J., Liu, S.J., Liao, S.H., Ding, X., Liang, Y., 2015. Interactive tooth partition of dental mesh based on tooth-target harmonic field. Comput. Biol. Med. 56, 132-144.