
EXPLOITATION OF DEEP LEARNING IN THE AUTOMATIC DETECTION OF CRACKS ON PAVED ROADS

Journal: Geomatica

Manuscript ID geomat-2019-0008.R1

Manuscript Type: Research Article

Date Submitted by the Author: 25-Sep-2019

Complete List of Authors:
Jung, Won Mo; York University, Earth and Space Science and Engineering
Naveed, Faizaan; York University, Earth and Space Science and Engineering
Hu, Baoxin; York University
Wang, Jianguo; York University
Li, Ningyuan; Government of Ontario Ministry of Transportation

Is the invited manuscript for consideration in a Special Issue?: Not applicable (regular submission)

Keywords: Crack Detection, Convolutional Neural Networks (CNN), Fully Convolutional Networks (FCN)

https://mc06.manuscriptcentral.com/geomatica-pubs

EXPLOITATION OF DEEP LEARNING IN THE AUTOMATIC DETECTION OF CRACKS ON PAVED ROADS

Won Mo Jung a, Faizaan Naveed a, Baoxin Hu *a, Jianguo Wang a and Ningyuan Li b

a Earth and Space Science and Engineering Department, Lassonde School of Engineering, York University, 4700 Keele Street, Toronto, Canada - ([email protected], [email protected], [email protected], [email protected])

b Ministry of Transportation of Ontario, 159 Sir William Hearst Ave, Toronto, Canada - ([email protected])

* Corresponding Author


ABSTRACT

With the advance of deep learning networks, their applications in the assessment of pavement conditions are gaining more attention. The convolutional neural network (CNN) is the most commonly used architecture in image classification. In terms of pavement assessment, most existing CNNs are designed only to distinguish between cracks and non-cracks. Few networks classify cracks into different levels of severity, although information on the severity of pavement cracks is critical for pavement repair services. In this study, the state-of-the-art CNN used in the detection of pavement cracks was improved to localize the cracks and identify their distress levels based on three categories (low, medium, and high). In addition, a fully convolutional network (FCN) was, for the first time, utilized in the detection of pavement cracks. These designed architectures were validated using data acquired on four highways in Ontario, Canada, and compared with the ground truth provided by the Ministry of Transportation of Ontario (MTO). The results showed that with the improved CNN, the prediction precisions on a series of test image patches were 72.9%, 73.9%, and 73.1% for cracks with severity levels of low, medium, and high, respectively. The precision of the FCN was tested on whole pavement images, resulting in 62.8%, 63.3%, and 66.4%, respectively, for cracks with severity levels of low, medium, and high. It is worth mentioning that the ground truth contained some uncertainties, which partially contributed to the relatively low precision.

KEY WORDS: Crack detection, convolutional neural network (CNN), fully convolutional network (FCN), pavement distress, instance segmentation


I. INTRODUCTION

Due to traffic, environmental factors, and aging, road pavements usually experience different types of distress and deterioration, including cracking, surface defects, and profile deformation (McGhee 2004; Miller et al. 2003; Bennett et al. 2007). The quality of a pavement is usually characterized by distress type, severity, and extent. Effectively and accurately monitoring the condition of a pavement surface is paramount in determining whether it provides a comfortable, safe, and efficient service to the public. It also assists the road management authority in making decisions on appropriate maintenance and rehabilitation. Traditionally, pavement surveys involve transportation personnel observing and recording surface defects and degradation while walking or slowly driving over pavements (asphalt or concrete). Overall, manual techniques are considered labor intensive, slow, expensive, and sometimes unsafe. The early efforts to develop automatic or semi-automatic systems for pavement assessment were mainly camera-based. The quality of the cameras and of the software for data processing and analysis limited their operational employment (Cafiso et al. 2006; Teomete et al. 2005; Yu et al. 2007).

With the development of Light Detection and Ranging (LiDAR) instruments, multi-sensor integrated systems using both LiDAR and cameras are gaining attention (Zhang et al. 2014; Yu et al. 2014; Murakami et al. 2018). However, the costs of these systems and their operation limit their usage. As an example, the Ministry of Transportation of Ontario (MTO) employs such systems only for major highways and uses camera-based systems for all other roads. In addition, even with the integrated systems, LiDAR data and camera imagery are hardly ever utilized together for the detection of pavement cracks. The camera systems are mainly used for asset mapping and thus point forward and/or sideways (Fugro 2014). Advances in imaging technologies and artificial intelligence provide a good opportunity to improve the performance of camera-based systems for the assessment of pavement conditions. In this study, the focus was on the development of automatic image-based methods to accurately detect cracks on pavements. As described in the next section, several methods have been developed in the past decade for crack detection (Davis 2011; Oliveira et al. 2013; Shi et al. 2016). However, the accuracies in the localization of cracks are relatively low (about 60%), and the widths of cracks tend to be over-estimated (Oliveira et al. 2013). In addition, most methods are not able to distinguish cracks with different distress levels, which prevents the detection results from being used in an automated system for road repair services (Davis 2011). The objective of the current study was to develop deep learning solutions to improve the localization of pavement cracks and the classification of their severity.

II. RELATED WORK

The methods that have been developed to characterize the cracks and distress of paved roads using optical imagery primarily rely on the difference in intensity between the road surface and cracks (Sun et al. 2009; Oliveira et al. 2013; Danilescu et al. 2015; Shi et al. 2016; Hoang et al. 2018). The most commonly used techniques include thresholding, edge detection, and morphological operations. Specifically, Sun et al. (2009) developed a method that employs thresholding techniques to identify crack pixels and morphological operations to connect them based on pixel connectivity, but their method performed poorly on small cracks, which led to disconnections within continuous cracks. Similarly, Danilescu et al. (2015) proposed a crack detection method using thresholding and a series of morphological operations for post-processing. Their results showed misclassification of road cracks, and the method was sensitive to noise (Danilescu et al. 2015). Hoang et al. (2018) investigated different edge detection methods, such as Roberts, Sobel, Prewitt, and Canny, for crack detection. Like thresholding methods, edge detection methods tend to over-detect cracks and are sensitive to noise as well. Oliveira et al. (2013) proposed an unsupervised classification method. With their method, the image was divided into a series of non-overlapping image blocks, and each image block was classified as either a block with crack pixels or a block without. The features used in the classification included the mean and standard deviation of the intensity within a given block, and it was assumed that the standard deviation was higher in a block with crack pixels than in a block without. The severity of cracks was determined based on the average width of each crack. The results were pixelated, and the cracks were not accurately localized. The severity of the detected cracks was also not accurately characterized. In the study by Shi et al. (2016), new descriptors based on random structured forests were proposed to characterize cracks. Even though the accuracies for the test data were high (up to 90%), the performance of this method depended highly on the extracted features, and thus the method might not be effective for all pavements. In addition, the method proposed by Shi et al. (2016) could not characterize the severity of cracks.

Recently, with the advance of deep learning, convolutional neural networks (CNNs) have been exploited to detect cracks in pavements (Fan et al. 2018). Fan et al. (2018) implemented a CNN with structured prediction, which is considered state-of-the-art in the detection of pavement cracks. The CNN proposed by Fan et al. (2018) consisted of four convolutional layers with two max-pooling layers, followed by three fully connected layers. Every convolutional layer used a 3 by 3 kernel and a stride of 1 pixel. Additionally, zero padding was applied on the boundary of each input before the convolutional filters were applied, to preserve the spatial resolution of the feature map. After each pair of convolutional layers, max pooling was applied with a 2 by 2 kernel and a stride of 2. In the fully connected layers, 64 neurons were used to output 25 neurons, which were then reshaped to form a 5 by 5 image patch with binary predictions, 1 being a crack pixel and 0 being a non-crack pixel. Furthermore, Fan et al. (2018) explored different output sizes, ranging from a single pixel (traditional CNN output) to a 7 by 7 image patch (structured prediction), and the output with a 5 by 5 image patch was shown to achieve the best result. In addition, Fan et al. (2018) investigated the effect of the ratio between the numbers of crack and non-crack image patches in the training stage and concluded that a 1:3 ratio of crack to non-crack patches resulted in the best outcome. Even though a good accuracy (91%) was achieved in the prediction of crack and non-crack for a given pixel, the designed CNN was not able to determine the severity of cracks due to its binary output. Additionally, the output from this network was restricted to tiny image patches that needed to be stitched together to form an image. A variant of the network proposed in Fan et al. (2018) was implemented in this study to detect cracks with different severities.

In addition to a CNN, a fully convolutional network (FCN) was exploited in this study for the detection of cracks and the identification of their severity levels. To our knowledge, this was the first time an FCN was used in crack detection with severity level classification. The network was trained to detect three levels of pavement distress: low, medium, and high. The results were cross-validated against the CNN with structured prediction. Data from four major highways in Ontario, Canada, with varying crack sizes and levels of distress, provided by the MTO, were used to train and test the algorithms, and the results were compared with those obtained using the current integrated LiDAR and camera-based system.

III. TRAINING AND TEST DATA SET

The dataset used in this study was provided by the MTO and was acquired using an Automatic Road Analyzer (ARAN) 9000 system developed by Fugro Roadware (Fugro 2014). The pavement images were collected from four highways in Ontario, Canada: 89, 34, 21, and 138. The ARAN 9000 system consists of two LiDAR instruments at the back of the vehicle and one forward-facing camera at the front of the vehicle. Since the purpose of this study was to detect cracks using optical imagery, ideally the images obtained by the camera should have been used. However, due to the orientation of the camera, the acquired images needed to be rectified first. This process reduced the image quality significantly and caused the images to be blurry. Since few nadir-view images were available, nadir-view optical imagery (with the camera facing downward to the road surface) was simulated using the LiDAR intensity for the development of the algorithms. In the simulated dataset, there were 3,332 road images in total, and each had a size of 2500x1037 pixels, where 2.5 pixels represented 1 cm in the real world. Examples of image patches with non-crack and cracks with different levels of severity are shown in Figure 1. As described in the following paragraph, the severity of a crack was determined by its depth and width.

The ground truth associated with these road images was provided by the MTO. As mentioned earlier, the MTO processed the data collected by an ARAN 9000 system over highways 89, 34, 21, and 138 in Ontario using software called the Laser Crack Measurement System (LCMS) (Pavemetrics 2018). The software uses the laser intensity to compute the depth of the cracks, and the cracks are then classified into different severity levels. Even though the results from this software were satisfactory, there were two problems: (1) LCMS used thresholding techniques to separate cracks from non-cracks, which means that some road markings might be recognized as cracks. (2) The software used a single universal threshold. As the pavements varied in their conditions, one threshold did not work well for all four highways. Care was taken in choosing quality training sites, whereas the test samples were randomly selected. The issues mentioned above with the LCMS software affected the quality of the extracted image samples, and the resulting inaccuracy in the ground truth was the main factor contributing to the lower classification accuracy. This issue is discussed further in the experimental results and discussion section.

To train and validate the CNN with structured prediction, 330,401 image patches of 27 by 27 pixels were used, together with the labels of the 5 by 5 neighbourhood around the central pixel of each patch. Of these patches, 70% were used for training and the rest, roughly 99,000 patches, for testing.
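
As a rough illustration of this sampling scheme, the sketch below cuts a 27 by 27 input patch and the 5 by 5 label window around its central pixel out of an image and a label map, and then splits a list of samples 70/30. The function names, the nested-list representation of images, and the fixed random seed are assumptions made for the example; they are not details reported for the actual implementation.

```python
import random

PATCH = 27   # input patch side, as stated in the text
LABEL = 5    # label window side, as stated in the text


def extract_sample(image, labels, row, col):
    """Return (patch, label_window) centred on (row, col).

    `image` and `labels` are equally sized 2-D nested lists (an assumption).
    The centre must lie at least PATCH // 2 pixels away from every border.
    """
    hp, hl = PATCH // 2, LABEL // 2
    patch = [r[col - hp:col + hp + 1] for r in image[row - hp:row + hp + 1]]
    window = [r[col - hl:col + hl + 1] for r in labels[row - hl:row + hl + 1]]
    return patch, window


def split_samples(samples, train_fraction=0.7, seed=0):
    """Shuffle a collection of samples and split it into train and test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

With 330,401 samples and a training fraction of 0.7, this split yields 231,280 training samples and 99,121 test samples, consistent with the "roughly 99,000" test patches quoted above.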

To train the FCN, images of 160x576 pixels were extracted from the original images using an overlap of 75%. Since not every image contained crack pixels, only images that contained more than 0.5% crack pixels were used. In total, 220,000 images were generated for training and 1,400 images for testing.
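
The tiling and filtering described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, label tiles are nested lists in which any non-zero value marks a crack pixel, and windows that would run past the image border are simply not generated. How the 160 and 576 pixel sides map onto the image axes is not stated in the text, so the helper works on one axis at a time.

```python
def tile_starts(length, window, overlap=0.75):
    """Start offsets of windows of size `window` along an axis of `length`.

    Consecutive windows share a fraction `overlap` of their extent; a final
    partial window is not generated (an assumption of this sketch).
    """
    step = max(1, round(window * (1.0 - overlap)))
    return list(range(0, length - window + 1, step))


def crack_fraction(label_tile):
    """Fraction of pixels in a 2-D label tile with a non-zero (crack) label."""
    total = sum(len(row) for row in label_tile)
    cracks = sum(1 for row in label_tile for value in row if value != 0)
    return cracks / total if total else 0.0


def keep_tile(label_tile, min_fraction=0.005):
    """Keep only tiles with more than 0.5% crack pixels, as in the text."""
    return crack_fraction(label_tile) > min_fraction
```

With a 75% overlap, a 160-pixel window advances by 40 pixels and a 576-pixel window by 144 pixels between consecutive tiles.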

IV. APPROACHES

In this study, two different deep learning architectures were designed for the detection of cracks on paved roads: the first was an improved version of the CNN with structured prediction originally proposed by Fan et al. (2018), and the second was an FCN with convolutional layers and initial weights taken from VGG-16 (Simonyan et al. 2014). These two networks are described in detail in the following subsections.

4.1 An Improved CNN with Structured Prediction

As mentioned in Section II, the state-of-the-art CNN proposed by Fan et al. (2018) had a couple of issues in the detection of cracks in pavements. In this study, the architecture of Fan et al. (2018) was improved to detect cracks with different distress levels. The proposed CNN is shown in Figure 2. In the original network (Fan et al. 2018), only two classes (crack and non-crack) were identified, and thus there were 25 output neurons, which were reshaped into a 5x5 output patch and were preceded by two consecutive fully connected layers of 64 neurons. In this study, the detection of three levels of cracks required 75 neurons in the output layer and thus more neurons in the fully connected layers. In addition, the classification of different levels of crack severity required the network to learn more complex features, and the network therefore had to be deeper. As shown in Figure 2, the number of convolutional layers was increased from four to six; the number of max-pooling layers was increased to three; and the number of fully connected layers remained the same, but the number of neurons within each layer was increased. With 231,280 training patches, it took about 30 hours to generate and recall the generated images using an i7-8700K CPU with 16 GB of RAM and a GTX 1070 GPU.
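
To make the layer arithmetic concrete, the sketch below tracks the spatial size of a 27 by 27 input through the pooling stages and counts the output neurons. It assumes that the settings described for the network of Fan et al. (2018) carry over to the deeper network: zero-padded 3 by 3 convolutions that preserve the feature map size, and unpadded 2 by 2 max pooling with a stride of 2 that rounds down. It also assumes that the 75 outputs correspond to one value per pixel of the 5 by 5 window and per severity level. Neither assumption is confirmed in the text, and the function names are illustrative.

```python
def pooled_size(size, kernel=2, stride=2):
    """Spatial size after an unpadded max-pooling layer (floor rounding)."""
    return (size - kernel) // stride + 1


def feature_map_sizes(input_size, n_pools):
    """Sizes after each pooling stage; size-preserving convolutions are skipped."""
    sizes = [input_size]
    for _ in range(n_pools):
        sizes.append(pooled_size(sizes[-1]))
    return sizes


def output_neurons(window=5, n_levels=3):
    """One output per pixel of the label window and per severity level."""
    return window * window * n_levels
```

Under these assumptions, two pooling stages reduce a 27-pixel side to 13 and then 6 pixels, a third stage reduces it to 3 pixels, and `output_neurons()` returns the 75 outputs mentioned above (25 for a single binary level).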

To use the trained network to classify every pixel of a pavement image, image stitching needed to be carried out after the prediction of the image patches. Specifically, to produce a classification map of the whole pavement image, the image had to be sub-divided into 27 by 27 image patches, each of which was used to predict a 5 by 5 output patch, and the outputs from all input patches had to be assembled to form the prediction for the whole image. Since the output for each patch was a 5 by 5 window around the central pixel of the patch, the next patch needed to be centered at most 5 pixels away to avoid gaps in the predicted image. A prediction with n pixels between the centers of two adjacent image patches is referred to as having a stride of n. Considering the processing speed, a stride of 5 was selected in this study. However, the effect of the stride size was also investigated, and the results are presented in the next section.
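
The stitching step can be sketched as below. The callable `predict_window` is a hypothetical stand-in for the trained network: given the centre of a patch, it returns the 5 by 5 block of predicted labels. Border handling for the 27 by 27 input patches is ignored, and where windows overlap (stride smaller than 5) later predictions simply overwrite earlier ones, which is only one possible choice.

```python
def stitch_predictions(height, width, predict_window, stride=5, window=5, fill=0):
    """Assemble a full-size label map from window-sized patch predictions.

    `predict_window(row, col)` returns a `window` x `window` nested list of
    labels for the patch centred on (row, col). Pixels that no window covers
    keep the value `fill`.
    """
    half = window // 2
    result = [[fill] * width for _ in range(height)]
    for row in range(half, height - half, stride):
        for col in range(half, width - half, stride):
            block = predict_window(row, col)
            for dr in range(window):
                for dc in range(window):
                    result[row - half + dr][col - half + dc] = block[dr][dc]
    return result
```

With a stride of 5 the windows tile the image without gaps, whereas a stride of 6 leaves one uncovered row and column between neighbouring windows, which is why the stride cannot exceed the window size.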

The resulting prediction image contained some isolated crack pixels, as shown in Figure 3(a). A morphological opening operation with an elliptical structuring element of 4 by 4 pixels was used to eliminate isolated regions smaller than the structuring element (Figure 3(b)).
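
For readers unfamiliar with the operation, the following is a minimal pure-Python sketch of binary morphological opening (erosion followed by dilation). It works on a binary crack mask stored as nested lists and takes the structuring element as a list of (row, column) offsets. The 3 by 3 square defined at the end is an illustrative stand-in, not the 4 by 4 ellipse used in the study, and pixels outside the image are treated as background.

```python
def erode(mask, offsets):
    """Binary erosion: keep a pixel only if every offset neighbour is set."""
    h, w = len(mask), len(mask[0])

    def at(r, c):
        return 0 <= r < h and 0 <= c < w and mask[r][c] == 1

    return [[1 if all(at(r + dr, c + dc) for dr, dc in offsets) else 0
             for c in range(w)] for r in range(h)]


def dilate(mask, offsets):
    """Binary dilation: set a pixel if any reflected offset neighbour is set."""
    h, w = len(mask), len(mask[0])

    def at(r, c):
        return 0 <= r < h and 0 <= c < w and mask[r][c] == 1

    return [[1 if any(at(r - dr, c - dc) for dr, dc in offsets) else 0
             for c in range(w)] for r in range(h)]


def opening(mask, offsets):
    """Morphological opening: erosion followed by dilation."""
    return dilate(erode(mask, offsets), offsets)


# Illustrative structuring element (assumption): a 3 x 3 square.
SQUARE_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
```

Applied to a mask containing a 3 by 3 block and a single isolated pixel, the opening removes the isolated pixel and leaves the block unchanged, because only regions that can contain the structuring element survive the erosion.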


4.2 FCN

FCNs, proposed by Long et al. (2015), can make dense pixel-wise predictions for semantic segmentation tasks. The FCN used in this study is built on VGG-16, designed by the Visual Geometry Group (VGG) at the University of Oxford, which won the localization task and placed second in the classification task of the ImageNet competition in 2014 (Simonyan et al. 2014). The structure of the FCN, as shown in Figure 4, is based on an encoder-decoder architecture, where the initial seven layers of the network are the layers of a typical CNN, and the subsequent layers generate the segmentation map by upsampling the results. The input image is downsampled as it goes through the convolutional layers and then upsampled through deconvolutional layers, which are simply transposed convolutional layers. In the FCN structure, the fully connected layer (7th layer) is replaced by a 1x1 convolutional layer, which generates a map of the features detected by the network. This network architecture is sometimes dubbed a pixel-to-pixel network, as the labels are pixel-wise predictions with the same 2-D dimensions as the input image. The output from the network is then upsampled using deconvolutional operations. Since the upsampled images are coarse, the FCN architecture makes use of the feature maps from earlier layers to refine the coarse segmentation. These additions are known as skip connections.

Long et al. (2015) presented three different variants of the FCN architecture: FCN-32, FCN-16, and FCN-8. FCN-32 directly produced segmentation maps from the 7th convolution layer by using a deconvolution operation with a stride of 32 pixels. This resulted in a 32x upsampling of the output of the final convolution layer, yielding the same 2-D dimensions as the input image. Since no skip connection was applied from previous layers, the segmentation results were coarse. In FCN-16 and FCN-8, the final upsampling was 16x and 8x, respectively. In FCN-16, the output of the final 1 by 1 convolutional layer was first upsampled by 2x and combined with the activations from pooling layer 4 before the 16x upsampling. This resulted in a relatively refined segmentation result. In FCN-8, the outputs from pooling layer 3 were further added to the results in the same way, which helped retrieve more fine-grained spatial information. During the training phase, the deconvolutional layers were trained in the same fashion as the convolutional layers. In other words, the weights for the deconvolution were learned through the training process.

For this study, the FCN-8 structure was implemented for the best outcome. The network contained 7 convolutions, 5 pooling layers, and 3 deconvolution layers, with skip connections from pooling layers 3 and 4 (Long et al. 2015).
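
The bookkeeping behind the three deconvolution layers and two skip connections can be checked with a few lines of arithmetic. The sketch below assumes a VGG-16 style encoder with five 2 by 2, stride-2 pooling layers, input sides divisible by 32, and upsampling factors of 2x, 2x, and 8x as in the FCN-8 design of Long et al. (2015). The padding and cropping used in practical implementations are ignored, and the function name is illustrative.

```python
def fcn8_shapes(height, width):
    """Spatial sizes at the stages that FCN-8 combines.

    Assumes five 2x2, stride-2 pooling layers and sides divisible by 32.
    """
    if height % 32 or width % 32:
        raise ValueError("this sketch assumes sides divisible by 32")
    pool3 = (height // 8, width // 8)     # after three pooling layers
    pool4 = (height // 16, width // 16)   # after four pooling layers
    score = (height // 32, width // 32)   # after five poolings and the 1x1 conv
    up2 = (score[0] * 2, score[1] * 2)    # first 2x transposed convolution
    assert up2 == pool4                   # sizes must match to add pool4
    up4 = (up2[0] * 2, up2[1] * 2)        # second 2x transposed convolution
    assert up4 == pool3                   # sizes must match to add pool3
    out = (up4[0] * 8, up4[1] * 8)        # final 8x transposed convolution
    return {"pool3": pool3, "pool4": pool4, "score": score, "output": out}
```

For the 160 by 576 training images, the coarsest score map is 5 by 18, the pool4 and pool3 maps are 10 by 36 and 20 by 72, and the final output returns to 160 by 576.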

V. EXPERIMENTAL RESULTS AND DISCUSSION

The trained improved CNN with structured prediction and the FCN-8 were applied to three stretches of pavement with cracks of various levels of severity, and the results are shown in Figures 5-7. It is clear from these figures that the predictions by both the CNN and the FCN were consistent with the cracks observed in the images and with the ground truth. However, the ground truth labels were not always correct when compared with the original input images. The possible cause of this issue is discussed later in this section. In addition, the results from the CNN were not very smooth. As mentioned before, image stitching was required for the CNN to generate a prediction for each pixel, which created pixelated rather than smooth crack shapes and made the overall crack shapes look unnatural. The CNN with structured prediction considered a 5 by 5 neighborhood together. The trained network might have performed best over a 5 by 5 neighborhood as a whole, but not necessarily for individual pixels. This issue is also discussed further below.

To quantify the performance of the proposed CNN and FCN, the commonly used measures of precision, recall, and F1 score (Estrada and Jepson 2009) were used. Precision measured the percentage of correctly classified crack pixels over the total number of predicted crack pixels. Recall measured the percentage of correctly classified crack pixels over the total number of crack pixels in the test dataset. Precision and recall are also referred to as the user’s accuracy and producer’s accuracy, respectively, in the remote sensing community. The F1 score was computed as the harmonic mean of the precision and recall. The precision, recall, and F1 score were calculated based on the test datasets described in Section III.
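
As a compact reference, the sketch below computes the three measures from pixel counts for one class using their standard definitions: precision is the share of predicted crack pixels that are correct, recall is the share of labelled crack pixels that are found, and F1 is their harmonic mean. The function and argument names are illustrative and are not taken from the study's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def precision_recall_f1(true_positive, false_positive, false_negative):
    """Per-class precision, recall, and F1 from pixel counts."""
    predicted = true_positive + false_positive   # pixels predicted as the class
    actual = true_positive + false_negative      # pixels labelled as the class
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall, f1_score(precision, recall)
```

For example, the low-severity precision and recall reported for the CNN (72.9% and 55.8%) give `f1_score(72.9, 55.8)` of about 63.2, in line with the 63.2% F1 score reported for that class.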

As mentioned earlier, the test data set for the FCN included 1,400 images of 160 by 576 pixels. The calculated precisions, recalls, and F1 scores are shown in Table 1. Considering both localization and classification, the FCN resulted in precisions of 62.8% for low severity, 63.3% for medium severity, and 66.4% for high severity; recalls of 39.5% for low severity, 46.7% for medium severity, and 41.4% for high severity; and F1 scores of 46.1% for low severity, 51.6% for medium severity, and 47.8% for high severity. Since the FCN was designed to perform both crack localization and severity level classification at the same time, the evaluation of the FCN assessed whether the predicted cracks were both correctly localized and correctly classified into the corresponding severity level. The results in Table 1 indicate that the precisions, recalls, and F1 scores were satisfactory but relatively low. These low values were partially caused by the issues with the ground truth. Generally, most of the generated ground truth labels were reasonable in terms of localizing the cracks and classifying them accordingly. However, when the cracks on the pavement were too weak to pass the threshold, they were simply omitted from the ground truth labels. This can be clearly observed in the examples shown in Figure 8. In the ground truth, either a discontinuity existed between the cracks or a crack segment was completely missed, whereas the FCN predictions correctly detected the cracks. Another reason for the low accuracies in the detection of the cracks was the presence of lane markings on the roads. As shown in Figure 9, these markings were sometimes detected as cracks either by the FCN or in the ground truth. This was because of the difference in intensity between road markings and the road surface. In future work, either a pre-processing or a post-processing step will be added to detect and remove road markings.

In terms of the validation of the CNN, as mentioned in Section III, roughly 99,000 image patches were in the test set for the CNN. Based on this test set, the calculated precisions, recalls, and F1 scores are shown in Table 2. The results shown in Table 2 indicate only the accuracy of the classification of crack severities, not of the localization. The precisions were 72.9% for low severity, 73.9% for medium severity, and 73.1% for high severity; the recalls were 55.8% for low severity, 58.5% for medium severity, and 55.8% for high severity; and the F1 scores were 63.2% for low severity, 65.3% for medium severity, and 63.3% for high severity. Similarly, the issue associated with the accuracy of the ground truth was one of the contributing factors to the low precisions, recalls, and F1 scores. Even so, the accuracies in the classification of cracks of different levels of severity were satisfactory.

To compare with traditional image processing methods in the localization of crack pixels in a pavement image, both the CNN with structured prediction and the FCN were trained for binary classification (crack vs. non-crack). The training and test sets used for the FCN were employed for this investigation. Examples of the localization of crack pixels in an image generated by the two deep learning networks are shown in Figure 10, together with the original image and the ground truth generated from LCMS. In addition, the result generated by a morphological crack detection method (Wang et al. 2018) is shown in Figure 10 as a representative of traditional methods. To generate the results in Figure 10, each pixel in an image was identified as either crack or non-crack. As shown in Figure 10, the results from both networks were consistent with the ground truth, with the result from the FCN closer to the ground truth than those from the CNN and the morphological crack detection. Table 3 shows the crack localization accuracy for the CNN with structured prediction, the FCN, and the morphological method. As expected, the morphological method showed the lowest accuracy, and the FCN performed slightly better than the CNN. This was expected because morphological methods rely on simple kernel-based thresholding operations to detect local maxima/minima, which makes them sensitive to noise and to the presence of road markings. Furthermore, the morphological method was not capable of detecting cracks with low severity, as there was not enough difference in intensity between the paved road and the low-severity cracks. The detected cracks were also observed to be discontinuous due to the limitations in the size of the structuring element. Different structuring element sizes were therefore tested, but a larger structuring element resulted in a larger omission error for low- and medium-severity cracks, and a smaller structuring element was very sensitive to noise.

As discussed earlier, a disadvantage of the CNN was the need to perform image stitching after the prediction of the image patches. For the CNN with structured prediction, the output for each patch was a 5 by 5 window around the central pixel of the patch. In this study, a stride of 5 pixels (the next patch was centered 5 pixels away) was used because it avoided gaps in the predicted image while giving the fastest processing time. The effect of the stride size was investigated by comparing the results generated with three stride sizes (1, 3, and 5 pixels), shown in Figure 11. With a stride of 1 pixel, only the predicted result for the central pixel of each patch was recorded. As shown in Figure 11, the predictions generated by the CNN with different stride sizes were similar. However, the result from the CNN with a stride of 1 appeared more detailed but rather noisy. The CNN with a stride of 1 pixel would be desirable if a way could be found to reduce the isolated crack pixels and the processing time. One way to improve the CNN with structured prediction would be to change the loss function. In the current version, all 5 by 5 pixels considered in the loss function were weighted equally. A weighted loss function is being investigated. In addition, with the current implementation, the overlap in the predictions between adjacent patches was not considered. An ongoing study is seeking a good way to combine the predictions from overlapping patches.

332 VI. CONCLUSIONS

In this study, two deep learning networks, an improved CNN with structured prediction and an FCN, were exploited to detect cracks on pavements together with their level of severity. Prior to this study, the CNN with structured prediction proposed by Fan et al. (2018) was restricted to crack localization. By expanding the number of convolutional and fully connected layers, the improved network designed in this study was successfully trained on pavement images with categorized crack severity. In addition, an FCN for crack detection with different distress levels was developed for the first time. As outlined in Table 2, the FCN resulted in a precision of 62.8% for low severity, 63.3% for medium severity, and 66.4% for high severity; a recall of 39.5% for low severity, 46.7% for medium severity, and 41.4% for high severity; and an F1 score of 46.1% for low severity, 51.6% for medium severity, and 47.8% for high severity. As outlined in Table 1, when the CNN with structured prediction was tested on individual image patches, it resulted in a precision of 72.9% for low severity, 73.9% for medium severity, and 73.1% for high severity; a recall of 55.8% for low severity, 58.5% for medium severity, and 55.8% for high severity; and an F1 score of 63.2% for low severity, 65.3% for medium severity, and 63.3% for high severity. Note that the CNN with structured prediction was tested on a series of sub-sampled image patches, whereas the FCN was tested on whole images. In comparison with the current state-of-the-art CNN for road crack localization, the FCN architecture was not only more robust for this task but also free of the fixed-size input restriction that the fully connected layers impose on the CNN. In terms of the localization of the cracks, both neural networks achieved higher precision and F1 scores than the morphological method, although with lower recall (Table 3). Although both the CNN with structured prediction and the FCN produced satisfactory results, further validation and improvement of the two networks are being pursued.
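For readers who wish to reproduce scores of the kind summarized above, the following sketch computes precision, recall, and F1 score per severity class from pixel-wise label maps. It uses one common pixel-wise definition and is not necessarily the evaluation protocol used in this study. The class encoding (0 = non-crack, 1 = low, 2 = medium, 3 = high) and the function name `per_class_scores` are assumptions made for the example.

```python
import numpy as np


def per_class_scores(truth, pred, classes=(1, 2, 3)):
    """Pixel-wise precision, recall, and F1 score for each class.

    truth, pred: integer label maps of identical shape
    classes:     class ids to score (assumed: 1 = low, 2 = medium, 3 = high)
    Returns a dict mapping class id to (precision, recall, f1).
    """
    scores = {}
    for c in classes:
        tp = np.sum((pred == c) & (truth == c))  # correctly labelled as c
        fp = np.sum((pred == c) & (truth != c))  # labelled c, but not c
        fn = np.sum((pred != c) & (truth == c))  # is c, but labelled otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[c] = (float(precision), float(recall), float(f1))
    return scores
```

With this definition, a class that is never predicted and never present receives a score of zero rather than raising a division error. Other conventions, such as skipping such classes, are equally defensible.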

VII. ACKNOWLEDGEMENTS

We would like to express our great appreciation to the MTO and the Natural Sciences and Engineering Research Council (NSERC) for funding this research project. Special thanks go to Gideon Gumisiriza at the MTO for providing the data needed to train and test the software.


Figure 1: Examples of the pavement images with non-crack pixels (a) and crack pixels with different levels of severity (b), (c), and (d), where the left panels are the original images and the right panels are the corresponding crack levels, with low, medium, and high severity shown in cyan, green, and orange, respectively.

Figure 2: The illustration of the CNN architecture with a single grayscale image as the input.

Figure 3: Illustration of the effect of the post-processing: (a) the prediction from the CNN and (b) the prediction after morphological post-processing. Crack pixels of low, medium, and high severity are shown in cyan, green, and orange, respectively.


Figure 4: The structure of a Fully Convolutional Neural Network. The image was adapted from the one in Long et al. (2015).

Figure 5: The results for the first stretch of pavement: (a) the pavement image; (b) the ground truth generated from LCMS Pavemetrics, where crack pixels of low, medium, and high severity are shown in cyan, green, and orange, respectively; (c) the result produced by the FCN-8; and (d) the result generated by the improved CNN with structured prediction.


Figure 6: The results for the second stretch of pavement: (a) the pavement image; (b) the ground truth generated from LCMS Pavemetrics, where crack pixels of low, medium, and high severity are shown in cyan, green, and orange, respectively; (c) the result produced by the FCN-8; and (d) the result generated by the improved CNN with structured prediction.


Figure 7: The results for the third stretch of pavement: (a) the pavement image; (b) the ground truth generated from LCMS Pavemetrics, where crack pixels of low, medium, and high severity are shown in cyan, green, and orange, respectively; (c) the result produced by the FCN-8; and (d) the result generated by the improved CNN with structured prediction.


Figure 8: An illustration of weak cracks missing from the ground truth: (a) the pavement images, (b) the ground truth, and (c) the result predicted by the FCN. For (b) and (c), crack pixels of low, medium, and high severity are shown in cyan, green, and orange, respectively.


Figure 9: An illustration of the effect of road markings on crack detection: (a) the pavement images, (b) the ground truth, and (c) the results generated by the FCN. For (b) and (c), crack pixels of low, medium, and high severity are shown in cyan, green, and orange, respectively.


Figure 10: The localization of the crack pixels by different methods: (a) the original pavement images, (b) the ground truth for binary classification generated from LCMS Pavemetrics, (c) the results from the FCN-8, (d) the results from the CNN with structured prediction, and (e) the results from the morphological crack detection method.


Figure 11: The effect of the stride size used in the CNN with structured prediction: (a) a pavement image, (b) the ground truth, (c) the result generated by the CNN with stride 1, (d) the result generated by the CNN with stride 3, and (e) the result generated by the CNN with stride 5. Crack pixels of low, medium, and high severity are shown in cyan, green, and orange, respectively.


Table 1: The prediction results on image patches in the test data set for the CNN with structured prediction.

Architecture                    Severity  Precision  Recall  F1-Score
CNN with structured prediction  High      73.1%      55.8%   63.3%
CNN with structured prediction  Medium    73.9%      58.5%   65.3%
CNN with structured prediction  Low       72.9%      55.8%   63.2%

Table 2: Prediction results on the pavement images in the test data set for the FCN.

Architecture  Severity  Precision  Recall  F1-Score
FCN-8         High      66.4%      41.4%   47.8%
FCN-8         Medium    63.3%      46.7%   51.6%
FCN-8         Low       62.8%      39.5%   46.1%

Table 3: The results on the localization of cracks based on the test data set for the CNN with structured prediction, the FCN, and the morphological method.

Architecture                    Precision  Recall  F1-Score
CNN with structured prediction  68.05%     44.82%  54.05%
FCN-8                           77.0%      43.6%   53.4%
Morphological                   53.1%      53.7%   53.1%
