Research Article

Underwater Image Processing and Object Detection Based on Deep CNN Method

Fenglei Han, Jingzheng Yao, Haitao Zhu, and Chunhui Wang

College of Shipbuilding Engineering, Harbin Engineering University, No. 145 Nantong Street, NanGang District, Harbin, Heilongjiang Province 150001, China

Correspondence should be addressed to Jingzheng Yao; [email protected]

Received 18 January 2020; Revised 17 February 2020; Accepted 6 May 2020; Published 22 May 2020

Academic Editor: Xavier Vilanova

Copyright © 2020 Fenglei Han et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Journal of Sensors, vol. 2020, Article ID 6707328, 20 pages. https://doi.org/10.1155/2020/6707328

Due to the importance of underwater exploration in the development and utilization of deep-sea resources, underwater autonomous operation is increasingly important for avoiding the dangerous high-pressure deep-sea environment. For underwater autonomous operation, intelligent computer vision is the most important technology. In an underwater environment, illumination is weak and image quality is low, so image enhancement is a necessary preprocessing step for underwater vision. In this paper, a combination of the max-RGB method and the shades of gray method is applied to enhance underwater vision, and then a CNN (Convolutional Neural Network) method for solving the weak-illumination problem of underwater images is proposed, which trains a mapping relationship to obtain the illumination map. After the image processing, a deep CNN method is proposed to perform underwater detection and classification; according to the characteristics of underwater vision, two improved schemes are applied to modify the deep CNN structure. In the first scheme, a 1 ∗ 1 convolution kernel is used on the 28 ∗ 28 feature map, and then a downsampling layer is added to resize the output to 14 ∗ 14. In the second scheme, the downsampling layer is added first, and then the convolution layer is inserted in the network; the result is combined with the last output to achieve the detection. Through comparison with Fast RCNN, Faster RCNN, and the original YOLO V3, scheme 2 is verified to be better at detecting underwater objects. The detection speed is about 50 FPS (Frames per Second), and the mAP (mean Average Precision) is about 90%. The program is applied in an underwater robot; the real-time detection results show that the detection and classification are accurate and fast enough to assist the robot in underwater operations.

1. Introduction

With the development of computer vision and image processing technology, the application of image processing methods to improve underwater image quality, so as to satisfy the requirements of the human vision system and machine recognition, has gradually become a hot issue. At present, the methods of underwater image enhancement and restoration can be divided into nonphysical-model image enhancement and physical-model-based image restoration.

For underwater image enhancement, traditional image processing methods include color correction algorithms and contrast enhancement algorithms. The white balance method [1], gray world hypothesis [2], and gray edge hypothesis [3] are typical color correction methods, and the contrast enhancement algorithms include histogram equalization [4] and restricted contrast histogram equalization [5], which are commonly used to enhance underwater images. Although these methods obtain good results on common images, their results are unsatisfactory for underwater vision. The main reason is that the ocean environment is complex, and many unfavorable factors, such as the scattering and absorption of light by water and by underwater suspended particles, seriously interfere with image quality.

More complex and comprehensive underwater image enhancement methods have been proposed to address the degradation problems of color fading, contrast reduction, and detail blurring. For example, Ghani et al. [6] proposed a method to solve the low contrast problem of underwater images; the Rayleigh stretch limited contrast adaptive histogram was used to normalize the global contrast-enhanced image and the local contrast-enhanced image, so as to enhance low-quality underwater images. Li et al. [7] considered the multiple degradation factors of the underwater image and adopted an image dehazing algorithm, color compensation, histogram equalization, saturation and illumination intensity stretching, and a bilateral filtering algorithm to solve the problems of blurring, color fading, low contrast, and noise. Braik et al. [8] used particle swarm optimization (PSO) to enhance underwater images by reducing the influence of light absorption and scattering. In addition, the Retinex theory is often applied in the underwater image enhancement process [9]; Fu et al. [10] proposed an underwater image enhancement method based on the Retinex model, which applied different strategies to enhance the reflection and illumination components of the underwater image on the basis of color correction and then synthesized the final enhancement results. Perez et al. [11] proposed an underwater image enhancement method based on deep learning, which constructed a training data set consisting of pairs of degraded and restored underwater images. A model between degraded and restored underwater images was learned from a large number of training samples and used to enhance underwater image quality.

Underwater detection mainly depends on digital cameras; image processing is commonly used to enhance the quality and reduce the noise, and contour segmentation methods are commonly used to locate the objects. Many such methods have been proposed to realize target detection. For instance, Chang et al. [12] proposed a new image-denoising filter based on the standard median filter, which is used to detect noise and change the original pixel value to a new median. Prabhakar et al. [13] proposed a novel denoising method to remove the additive noise present in underwater images, in which homomorphic filtering corrects nonuniform illumination and anisotropic filtering is applied for smoothing. A denoising approach combining wavelet decomposition with a high-pass filter has also been applied to enhance underwater images (Sun et al., 2011); both the low-frequency components of the back-scattering noise and the uncorrelated high-frequency noise can be effectively suppressed simultaneously, although the unsharpness in the processed image is serious with the wavelet method. Kocak et al. [14] used a median filter to remove the noise; the quality of the images is enhanced by RGB color level stretching, and the atmospheric light is obtained through the dark channel prior; this method is helpful for images with minor noise. For noisy images, a bilateral filtering method is utilized by Zhang et al. [15]; the results are good, but the processing time is very high. An exact unbiased inverse of the generalized Anscombe transformation is introduced by Mäkitalo and Foi [16]; the comparison shows that the method plays an integral part in ensuring accurate denoising results.

A Laser Underwater Camera Image Enhancer system was designed and built by Forand et al. [17] to enhance laser underwater image quality; the system is shown to have a range 3 to 5 times that of a conventional camera with floodlights. Yang et al. [18] proposed a method for detecting weak underwater laser targets based on the Gabor transform: the complicated nonstationary underwater laser signal is processed to become an approximately stationary signal, and then the triple correlation is computed with the Gabor transform coefficients, which can eliminate random interference and extract the target signal's correlation. Ouyang et al. [19] investigated the application of light field rendering (LFR) to images taken from a distributed bistatic nonsynchronous laser line scan imager, using both line-of-sight and non-line-of-sight imaging geometries to create a multiperspective rendering of an unknown underwater scene.

Chang et al. [20] showed that a significant amount of polarization is introduced into light at scattering angles near 90 degrees; this light can then be distinguished from light scattered by an object, which remains almost completely unpolarized. Results were obtained from a Monte Carlo simulation and from a small-scale experiment, in which an object was immersed in a cell filled with polystyrene latex spheres suspended in water. Gruev et al. [21] described two approaches for creating focal-plane polarization imaging sensors: the first combines polymer polarization filters with a CMOS active pixel sensor and computes polarization information at the focal plane, while the second outlines initial work on polarization filters using aluminum nanowires. Measurements from the first polarization image sensor prototype are discussed in detail, and applications for material detection using polarization techniques are described. Underwater polarization imaging technology is introduced in detail by Li et al. [22].

The above methods are based on wavelet decomposition, statistical methods, laser technology, or color polarization theories. The results show that these methods are reasonable and effective, but their common weakness is that the processing is very time-consuming, so it is difficult to achieve real-time detection.

The Convolutional Neural Network (CNN) is widely recognized as the fastest detection method in different research fields. Krizhevsky et al. [23] applied a CNN to the classification problem and won the ILSVRC (ImageNet Large Scale Visual Recognition Challenge), reducing the top-5 error rate to 15.3%; since then, deep CNNs have been widely applied. Girshick [24] proposed the Region Convolutional Neural Network (RCNN) by combining the RPN (Region Proposal Network) and CNN methods, which was tested on Pascal VOC 2007 and reached an mAP of 66%. Based on RCNN, SPP-Net (Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition) was presented by He et al. [25] to improve the detection efficiency. ResNet was proposed by [26]; its success lies in solving the network degradation problem through the introduction of the residual module, so as to increase the depth of the network and obtain features with stronger expressive ability and higher accuracy. A Multilayer Perceptron (MLP) was applied to replace the SVM (Support Vector Machine), so the training and classification are optimized significantly; this is named Fast RCNN [6]. Building on Fast RCNN, Ren, He, and Girshick [27] added an RPN to select and modify the region proposals instead of selective search, aiming to solve the end-to-end detection problem; this is the Faster RCNN method. Liu et al. proposed the SSD (Single Shot MultiBox Detector) method at ECCV 2016 (European Conference on Computer Vision); compared with Faster RCNN, it has a distinct speed advantage, as it directly predicts the coordinates and categories of bounding boxes without generating proposals.

At CVPR 2016 (IEEE Conference on Computer Vision and Pattern Recognition), Redmon proposed the YOLO (You Only Look Once) [28] regression-based object detection algorithm, which improves the detection speed significantly and makes real-time detection feasible. When the YOLO algorithm was first put forward, its accuracy and computation speed were not as good as those of the SSD algorithm. Redmon then proposed the YOLO V2 [29] version to optimize the original YOLO multitarget detection framework through a series of methods, and the accuracy was greatly improved while maintaining the original speed. In early 2018, Redmon put forward YOLO v3 [30], which is generally recognized as the fastest detection method; its accuracy and detection speed are greatly improved compared with the other methods.

In this paper, we apply a combination of the max-RGB method and the shades of gray method to enhance underwater images, and a CNN method is used for weakly illuminated images. For underwater object detection, a new CNN method is proposed; considering the particularity of underwater vision, two improved schemes are proposed to improve the detection accuracy, and the results are compared with Fast RCNN [6], Faster RCNN [27], and the original YOLO V3 [30]. The comparison verifies that the modification is effective, and the program is installed on an underwater robot to test the real-time detection.

2. Image Preprocessing

For underwater computer vision, image preprocessing is the most important procedure for object detection. Because of the effects of light scattering and absorption in the water, the images obtained by the underwater vision system show the characteristics of uneven illumination, low contrast, and serious noise. By analyzing the current image processing algorithms, enhancement algorithms for underwater images are proposed in this paper.

2.1. The Underwater Vision Detection Architecture. The typical underwater visual system is composed of light illumination, a camera or sensor, an image acquisition card, and application software. The software process of the underwater visual recognition system generally includes several parts, such as image acquisition, image preprocessing, the convolutional neural network, and target recognition, as shown in Figure 1.

Image preprocessing is at the low level; its fundamental purpose is to improve image contrast and to weaken or suppress the influence of various kinds of noise as far as possible, and it is important to retain useful details during image enhancement and image filtering. The Convolutional Neural Network is used to divide images into multiple nonoverlapping regions; object detection and classification are based on feature extraction, which aims at extracting the most effective essential features that reflect the target. Every stage is closely related, so every effort should be made at each stage to achieve satisfactory results. The research of this paper mainly focuses on image preprocessing and recognition of typical targets in underwater vision.

2.2. Combination of Max-RGB Method and Shades of Gray Method. The absorption of light by water leads to the decline of the color of underwater images. As red and orange light are completely absorbed at a depth of 10 meters, underwater images generally take on a blue-green color. In order to eliminate the color deviation of underwater images, color correction must be carried out.

Color correction of normal images is very mature. Many white balance methods, such as the Gray World method, max-RGB method, Shades of Gray method, and Gray Edge method, are used to correct the color deviation of the image according to the color temperature. Generally, these methods are intended for mild color-cast conditions, and their treatment of severe underwater color casts is not satisfactory. In this paper, the original max-RGB method and the shades of gray method are combined to identify the illuminant color:

I(x) = \int_w e(\lambda) s(\lambda, x) c(\lambda) \, d\lambda, \quad (1)

where I(x) is the input underwater image, e(λ) is the radiance given by the light source, λ is the wavelength, s(λ, x) represents the surface reflectance, c(λ) denotes the sensitivity of the sensors, and w is the visible spectrum.

Figure 1: The image processing and object detection structure (light illumination and camera/sensor, image acquisition card, application software; image preprocessing → convolutional network → object detection and classification).


The illuminant e is defined as

e = \int_w e(\lambda) c(\lambda) \, d\lambda. \quad (2)

The average reflectance of the scene is gray according to the Gray-World assumption [31]:

k = \frac{\int s(\lambda, x) \, dx}{\int dx}. \quad (3)

Assuming k is a constant value, the physical meaning of equation (1) can be simply described as follows: the observed image I(x) can be decomposed into the product of the reflectance s(λ, x) and the illumination map e(λ). Thus, weak-illumination image enhancement means removing the weak illumination from the input image. Substituting equation (3) into equation (1) gives

\frac{\int s(\lambda, x) \, dx}{\int dx} = \frac{1}{\int dx} \iint_w e(\lambda) s(\lambda, x) c(\lambda) \, d\lambda \, dx. \quad (4)

The illuminant is estimated by raising the average color of the entire image to a power n:

k e = \left( \frac{\int I^n \, dx}{\int dx} \right)^{1/n}. \quad (5)

According to the max-RGB method, the above equation can be modified as

k e = \max I(x) \ast \left( \frac{\int I^n \, dx}{\int dx} \right)^{1/n}, \quad (6)

where n can take any value between 1 and ∞; the default value is n = 6, as defined in the shades of gray method proposed by Finlayson [31].
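To make equations (5) and (6) concrete, the following NumPy sketch estimates the illuminant with the combined shades of gray / max-RGB rule and divides it out; the function name, the unit-norm scaling, and the von Kries-style correction step are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def shades_of_gray_maxrgb(img, n=6):
    """Estimate the illuminant per channel by combining the Minkowski
    norm of equation (5) with the max-RGB term of equation (6), then
    normalize the image by that estimate.
    img: float array in [0, 1] with shape (H, W, 3)."""
    # Shades of gray: Minkowski p-norm of each channel, p = n (default 6)
    minkowski = np.power(np.mean(np.power(img, n), axis=(0, 1)), 1.0 / n)
    # max-RGB: the brightest response per channel
    max_rgb = img.max(axis=(0, 1))
    # Combined estimate as in equation (6): max response scaled by the norm
    illuminant = max_rgb * minkowski
    illuminant /= np.linalg.norm(illuminant) + 1e-8   # unit-length gains
    # Divide out the illuminant (von Kries-style correction); the sqrt(3)
    # factor makes a perfectly gray illuminant map to unit gain
    corrected = img / (illuminant * np.sqrt(3) + 1e-8)
    return np.clip(corrected, 0.0, 1.0)
```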

2.3. CNN Method for Weakly Illuminated Image Enhancement. The Retinex model can be used to enhance an image based on an estimated illumination map; since underwater images are always weakly illuminated, a trainable CNN is applied to predict the mapping relation between a weakly illuminated image and the corresponding illumination map. A four-layer convolutional network is used, and it directly learns an end-to-end mapping between dark and bright images, so low-light image enhancement is regarded here as a machine learning problem. A weakly illuminated image is input, and a 32 ∗ 6 ∗ 6 convolution layer changes the image into 32 channels (the 3-D view in the figure denotes a multilayer feature map); then 16 ∗ 6 ∗ 6 and 8 ∗ 1 ∗ 1 convolution layers are added to the network, and the output is a one-channel feature map. In this model, most of the parameters are optimized by back-propagation, while the parameters of traditional models depend on empirical settings. The four-layer convolutional network structure is shown in Figure 2.

The input image is the weakly illuminated image, and the output is the corresponding illumination map. Similar to Li et al. [32] and Dong et al. [33], the network contains four convolutional layers with specific tasks. Observing the feature maps in Figure 2, different convolutional layers have different effects on the final illumination map: the first two layers focus on the high-light regions, the third layer focuses on low-light regions, and the last layer reconstructs the illumination map. The specific operation of the four convolutional layers is shown in Figure 2.

The enhancement effects are shown in Figure 3: the underwater background color is improved significantly, and the weakly illuminated images are enhanced using the trainable CNN method.
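A minimal PyTorch sketch of the four-layer network in Figure 2 follows; the ReLU activations, the 1 ∗ 1 kernel of the final reconstruction layer, and the sigmoid used to keep the illumination map in (0, 1) are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class IlluminationNet(nn.Module):
    """Four-layer CNN mapping a weakly illuminated RGB image to a
    one-channel illumination map, following the layer sizes in Figure 2
    (32@6x6 -> 16@6x6 -> 8@1x1 -> 1-channel reconstruction)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=6, padding='same'),   # high-light regions
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, kernel_size=6, padding='same'),  # high-light regions
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 8, kernel_size=1),                   # low-light regions
            nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, kernel_size=1),                    # reconstruct the map
        )

    def forward(self, x):
        return self.net(x)

# Retinex-style enhancement: reflectance = input / illumination
model = IlluminationNet()
dark = torch.rand(1, 3, 448, 448)
illumination = torch.sigmoid(model(dark))   # keep the map in (0, 1); assumption
enhanced = dark / (illumination + 1e-3)
```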

3. The Object Detection Theories

The input images are resized to 448 ∗ 448; because the images stretch, the labels are recalculated as well. In practice, scale factors for width and height are recorded, and xmin, xmax, ymin, and ymax are recalculated accordingly, while the output images are resized back to the same size as the originals. A CNN method is used to predict the bounding boxes and classification probabilities, as sketched below.
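A small Python sketch of this label recalculation, under the assumption that boxes are stored as (xmin, ymin, xmax, ymax) pixel coordinates:

```python
def resize_with_labels(w0, h0, boxes, target=448):
    """Rescale bounding-box labels when stretching an image of size
    (w0, h0) to (target, target). Each box is (xmin, ymin, xmax, ymax)
    in original pixel coordinates."""
    sx, sy = target / w0, target / h0   # separate width/height scale factors
    return [(xmin * sx, ymin * sy, xmax * sx, ymax * sy)
            for (xmin, ymin, xmax, ymax) in boxes]

# Example: a 640x480 frame stretched to 448x448
print(resize_with_labels(640, 480, [(100, 50, 200, 150)]))
```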

Figure 2: The mapping relationship prediction between the input image and the illumination map CNN structure (input weakly illuminated image → Conv 32⁎6⁎6, 32 channels → Conv 16⁎6⁎6, 16 channels → Conv 8⁎1⁎1, 8 channels → 1 channel).


For underwater detection, the targets are difficult to identify against the background. In order to improve the detection accuracy, the whole image information is used to predict the bounding boxes of the targets and classify the objects at the same time; through this proposal, end-to-end real-time target detection can be realized.

3.1. Convolutional Neural Network. The image is divided into 4 ∗ 4 grid cells, which are used to locate the center of the detection object. For each grid cell, bounding boxes (bbox) are predicted, each with 5 parameters: (x, y) is the center location of the bounding box, (w, h) is the width and height of the box, and the confidence is the Intersection over Union (IoU), which equals the intersection divided by the union between the bbox and the ground truth; the process is shown in Figure 4.
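For reference, the IoU used as the confidence target can be computed as in this short sketch (corner-format boxes assumed):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)
```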

If the bounding box were predicted through a fully connected layer, with the width and height related only to the scales and ratios of the input images, the localization of objects of different shapes could not be very accurate. Therefore, a Region Proposal Network is applied to predict the bounding box and confidence [27], in which predicted boxes with different scales and ratios are used, and the offsets of the boxes are calculated in the RPN, as shown in Figure 5. The fully connected layer is removed, and a convolution layer with anchor boxes is added to predict the bounding box. In order to keep the high quality of the original image, a pooling layer is removed; with a 448 ∗ 448 input image, the scale of the final feature map is 14 ∗ 14 with only one center.

Through a series of convolutions, a common feature map is obtained, and then the RPN is applied. First, a convolution produces a new feature map, which can also be seen as a set of high-dimensional feature vectors; then, through two 1 ∗ 1 convolutions, an 18 ∗ 16 ∗ 16 feature map and a 36 ∗ 16 ∗ 16 feature map are obtained. That is 16 ∗ 16 ∗ 9 results, each containing 2 scores and 4 coordinates, which are then combined with the predefined anchors; after postprocessing, the bounding boxes are calculated.
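The RPN head described above can be sketched as follows; the 512-channel backbone feature map and the ReLU are assumptions, while the 18- and 36-channel 1 ∗ 1 heads over a 16 ∗ 16 map follow the shapes given in the text.

```python
import torch
import torch.nn as nn

# k = 9 anchors per cell: 2 objectness scores and 4 box offsets per anchor
k = 9
intermediate = nn.Conv2d(512, 512, kernel_size=3, padding=1)  # assumed 512 channels
score_head = nn.Conv2d(512, 2 * k, kernel_size=1)             # -> 18 x 16 x 16
regress_head = nn.Conv2d(512, 4 * k, kernel_size=1)           # -> 36 x 16 x 16

features = torch.rand(1, 512, 16, 16)        # shared backbone feature map
mid = torch.relu(intermediate(features))
scores, offsets = score_head(mid), regress_head(mid)
print(scores.shape, offsets.shape)           # (1, 18, 16, 16), (1, 36, 16, 16)
```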

In the deep learning process, grid cell data is fed into the network; the centers of some pixels fall within the range of a specific grid cell, and all pixels that satisfy the features of the object are clustered within a certain range. After many training iterations with penalties, the network can find the exact range through a sliding window; however, the center position cannot exceed the range of the grid cell. This greatly limits the model's computation as the window slides around the picture.

Figure 3: The enhancement effect by different methods: (a) original image; (b) the method proposed by Jobson et al. [34]; (c) the method proposed by [35]; (d) max-RGB and shades of gray method; (e) weakly illuminated image enhancement.


In this way, position detection and category recognition are combined into one CNN network for prediction: the picture only needs to be scanned once to infer the position information and category of all objects in it.

3.2. Cluster Analysis. The k-means cluster method is used to train the bounding boxes; the target is to obtain a better IoU between the bbox and the ground truth, so the distance from the bbox to the cluster centroid is calculated as

d(\text{box}, \text{centroid}) = 1 - \text{IoU}(\text{box}, \text{centroid}). \quad (7)

The Euclidean distance is applied in the traditional k-means cluster method, which means that bigger boxes produce larger errors than smaller boxes, so the result may deviate from the true value. The IoU score is therefore proposed to substitute for the traditional distance, as in the sketch below.

The convolutional kernel is 3 ∗ 3, the max-pooling size is 2 ∗ 2, and the dimension of the feature map is halved at each pooling. Global average pooling is applied to complete the prediction; the 1 ∗ 1 convolution is used to compress the channels of the feature maps, so as to reduce the parameters and the amount of calculation. A batch normalization layer is added to accelerate the convergence speed and avoid overfitting.

Figure 4: Detection process (input image → bounding box and confidence / class probability map → final detection).

Figure 5: The convolutional feature map (sliding windows with K anchors over the intermediate layer, followed by a score calculation layer and a regression layer).

Data preprocessing (unified format, equalization, noise reduction, etc.) can greatly improve the speed of training and enhance the training effect. Batch Normalization (BN) was proposed by Google and is commonly used in CNN networks. After the convolution or pooling and before the activation function, all of the input data is normalized as follows:

\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}, \qquad y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}, \quad (8)

where E is the batch mean value and Var is the variance; γ and β are the scale and shift coefficients, which are obtained from training.
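Equation (8) in NumPy form, assuming a simple (N, C) activation layout:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Equation (8): normalize each channel over the batch, then apply
    the learned scale (gamma) and shift (beta). x has shape (N, C)."""
    mean = x.mean(axis=0)                 # E[x^(k)] per channel
    var = x.var(axis=0)                   # Var[x^(k)] per channel
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```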

3.3. Location Prediction. In order to solve the instability of using anchor boxes, especially in the early iterations, the following procedure is applied to predict the location of the boxes:

x = (t_x \ast w_a) - x_a, \qquad y = (t_y \ast h_a) - y_a, \quad (9)

where (x, y) is the predicted value, (x_a, y_a) are the coordinates of the anchor, (x^*, y^*) is the real coordinate value, (t_x, t_y) is the offset value, and (w_a, h_a) are the width and height of the box.

When t_x = 1, the box is offset to the right by a distance equal to the width of the box; if t_x = −1, the offset is to the left, so every predicted box can be located at any position on the image, which is the reason why the model is unstable and the prediction is very time-consuming. The prediction box is therefore limited to the grid cell, and the sigmoid function is used to compute the offset value, constraining it to between 0 and 1; b_x, b_y, b_w, and b_h can be computed from the following equations:

b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad b_w = p_w e^{t_w}, \qquad b_h = p_h e^{t_h}. \quad (10)

In the above equations, (c_x, c_y) are the upper-left corner coordinates of the grid cell, as shown in Figure 6; when the scale of the grid cell is 1, the center is limited to the interior of the cell by the sigmoid function. p_w and p_h are the prior width and height.
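A plain-Python sketch of the box decoding in equation (10):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Equation (10): the sigmoid keeps the predicted center inside the
    grid cell at (cx, cy); priors (pw, ph) are scaled exponentially."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```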

3.4. Loss Function. In the process of training, the form of the loss function is a key technique; for the method proposed in this paper, a sum-squared error loss is used to balance the errors. For boxes of different sizes, the width and height of the bounding box are replaced by their square roots; thus, a smaller box has a relatively larger value offset, which makes the prediction more effective. The loss function can be divided into 2 parts:

L_1 = \sum_{i=0}^{s^2} \sum_{j=0}^{B} l_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]. \quad (11)

L_1 is the coordinate prediction loss, where l_{ij}^{obj} determines whether the j-th box in the i-th grid cell is responsible for the object or not.

L_2 = \sum_{i=0}^{s^2} \sum_{j=0}^{B} l_{ij}^{obj} (c_i - \hat{c}_i)^2 + \sum_{i=0}^{s^2} l_i^{obj} \sum_{c \in \text{classes}} [p_i(c) - \hat{p}_i(c)]^2. \quad (12)

L_2 is the confidence prediction loss of the box containing the object. The total loss is the sum of L_1 and L_2, which gives a good balance between the coordinates, confidence, and classification.
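A hedged PyTorch sketch of equations (11) and (12); the tensor layouts, the mask handling, and folding the square-root substitution into the width/height term are assumptions about details the paper leaves open.

```python
import torch

def detection_loss(box_pred, box_true, conf_pred, conf_true, obj_mask,
                   cls_pred, cls_true, cell_mask):
    """Sum-squared loss of equations (11) and (12).
    box_*: (S*S, B, 4) boxes as (x, y, w, h); conf_*: (S*S, B);
    obj_mask: (S*S, B) indicator l_ij^obj; cls_*: (S*S, C);
    cell_mask: (S*S,) indicator l_i^obj."""
    # L1: coordinate loss; square roots temper large-box errors
    xy = ((box_pred[..., :2] - box_true[..., :2]) ** 2).sum(-1)
    wh = ((box_pred[..., 2:].clamp(min=0).sqrt()
           - box_true[..., 2:].clamp(min=0).sqrt()) ** 2).sum(-1)
    l1 = (obj_mask * (xy + wh)).sum()
    # L2: confidence loss plus per-cell class-probability loss
    l2 = (obj_mask * (conf_pred - conf_true) ** 2).sum() \
         + (cell_mask.unsqueeze(-1) * (cls_pred - cls_true) ** 2).sum()
    return l1 + l2
```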

Figure 6: Bounding box prediction (the prior p_w × p_h is shifted by σ(t_x), σ(t_y) from the cell corner (c_x, c_y) to give the predicted b_w × b_h box).


4. Underwater Detection CNN Network

For underwater detection, the commonly used methods are not applicable because of the low-quality vision and the small objects to be detected. Our original neural network is shown in Figure 7: the input image is resized to 448 ∗ 448 ∗ 3, the resized images are batch normalized (BN), the convolution kernels are 3 ∗ 3 and 1 ∗ 1, the stride is 1, and the output feature map is 14 ∗ 14 ∗ 75. In order to avoid gradient dispersion or explosion in the network, a better proposal is to change the layer-by-layer training of the deep neural network to step-by-step training: the deep neural network is divided into several subsegments, each containing shallow network layers; then, shortcuts are used to make each subsegment train on the residual, and each subsegment has a total learning error. At the same time, the proposed method controls the propagation of gradients well and avoids vanishing or exploding gradients, which are not conducive to training.

Firstly, a 3 ∗ 3 convolution is used to reduce the number of channels and training parameters; then, convolution kernels of different sizes are used to perform the convolution operation; finally, the feature maps are combined along the channel axis.

Figure 7: Original object detection network structure (input 448 ∗ 448 ∗ 3 → Conv2D 32 ⁎3 ⁎3 → residual blocks 1⁎64, 2⁎128, 8⁎256, 8⁎512, 4⁎1024 → 14 ∗ 14 ∗ 1024 Conv2D block → 14 ∗ 14 ∗ 75 detection).

Figure 8: The network structure modification scheme 1.


In order to get more advanced features, the previous way was to increase the depth of the network; we propose this network to achieve the goal by increasing the width of the network instead. The inception-style module comprehensively considers the results of multiple convolution kernels, so different information from the input image and a better image representation are obtained. In order to prevent vanishing gradients in the middle part of the network structure, we introduced two auxiliary classifiers: softmax operations are applied to the outputs of two of the inception modules, and the auxiliary loss is calculated. The auxiliary loss is only used for training, not for the prediction process. A sketch of such a width-increasing block follows.
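A PyTorch sketch of a width-increasing block with parallel kernels, combined along the channel axis as described above; the branch channel counts and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WidthBlock(nn.Module):
    """Parallel convolution kernels of different sizes, each preceded by
    a 1x1 convolution for dimensionality reduction, concatenated along
    the channel axis."""
    def __init__(self, c_in, c_branch=64):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c_branch, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c_branch, kernel_size=1),
                                nn.Conv2d(c_branch, c_branch, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c_branch, kernel_size=1),
                                nn.Conv2d(c_branch, c_branch, kernel_size=5, padding=2))

    def forward(self, x):
        # Combine the results of the multiple kernels channel-wise
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

out = WidthBlock(256)(torch.rand(1, 256, 28, 28))
print(out.shape)  # (1, 192, 28, 28)
```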

4.1. Network Structure Improvement. For underwater object detection, the vision sensors are installed on an underwater robot. In real operation, the common methods do not perform well on small objects, because the regular datasets used in experiments consist of normal images, which are high-quality and well-lighted. In underwater detection, the objects are always overlapped by other things, such as rocks and corals, and the underwater vision is always vague, with low clarity. Under these conditions, the network structure should retain more of the original features. In a deep CNN, more layers extract features that are more abstract, and the deep semantic information can be extracted more clearly; on the other hand, fewer layers retain more representation information. The deep semantic information and the representation information can be combined to give a more accurate detection. In this paper, two improved structures are proposed. In the first scheme, a 1 ∗ 1 convolution kernel is used on the 28 ∗ 28 feature map, and then a downsampling layer is added to resize the output to 14 ∗ 14, which is combined with the last output to complete the detection; the improvement is shown in Figure 8.

Because of the loss of original information in the convolution operation, in the second scheme the downsampling layer is added first, and then the convolution layer is inserted in the network; the result is combined with the last output to achieve the detection. The modification is shown in Figure 9.

There are three full convolution feature extractors, respectively corresponding to the convolutional sets, which are the internal convolution kernel structures of the feature extractor: a 1 ∗ 1 convolution kernel is used for dimensionality reduction, a 3 ∗ 3 convolution kernel is used for feature extraction, and multiple convolution kernels are interleaved to achieve this purpose. The full convolution feature layers are connected: the input of the current feature layer contains part of the output of the previous layer, and each feature layer has its own output prediction results. Finally, the results are regressed according to the confidence level to get the final prediction.

Figure 9: The network structure modification scheme 2.


4.2. Dataset Augmentation. An underwater dataset is difficult to prepare: underwater images and video are not easy to obtain on the internet, and for underwater images the background is almost the same within one area, so the images in the dataset are similar. Because of these factors, the trained model is often not effective when used in other sea areas. Therefore, the dataset should be modified and augmented, so as to make the deep learning model more generally applicable. The dataset augmentation is mainly based on rotation, flipping, zooming, shifting, etc.

The dataset used in this paper is obtained from video recorded by an underwater robot. The total number of images is about 18000, and the images are similar to each other, so rotation and color transformations are applied to transform the original patterns.

The three channels of the images are dimensionality-reduced, and the R (Red), G (Green), and B (Blue) direction vectors are obtained, respectively:

I_{xy} = [R_{xy}, G_{xy}, B_{xy}]. \quad (13)

The eigenvalues and eigenvectors of R, G, and B are defined as

R_{xy} = p_r \lambda_r, \qquad G_{xy} = p_g \lambda_g, \qquad B_{xy} = p_b \lambda_b. \quad (14)

α is a random variable with a mean value of 0 and a variance of 0.1, and it is added to the transformation function as follows:

I_{xy} = [p_r, p_g, p_b] [\alpha_r \lambda_r, \alpha_g \lambda_g, \alpha_b \lambda_b]^T. \quad (15)

The rotation transformation is presented as

x' = x_i \cos\theta_1 - y_i \sin\theta_1, \qquad y' = x_i \sin\theta_1 + y_i \cos\theta_1, \quad (16)

where (x', y') are the transformed location coordinates and θ_1 is the rotation angle.

Figure 10: Underwater ROV for marine organisms fishing (fishing containers, nonpressure hull, open frame with polypropylene plate, floating glass material, aluminum alloy control cabin shell).

Figure 11: The parameters definition (TP/FP/TN/FN over positive (retrieved) / negative (not retrieved) and true (relevant) / false (not relevant)).


Figure 12: mAP results obtained by different methods (mAP (%) vs. iterations, 0–50000, for sea cucumber, sea urchin, and scallop): (a) Fast RCNN; (b) Faster RCNN; (c) YOLO V3; (d) original network; (e) modified scheme 1; (f) modified scheme 2.

Table 1: mAP and precision at different iteration times for Fast RCNN, Faster RCNN, and YOLO V3 (IoU = 0.7). Precision columns give sea cucumber / sea urchin / scallop.

| Iteration | Fast RCNN mAP (%) | Fast RCNN precision (%) | Faster RCNN mAP (%) | Faster RCNN precision (%) | YOLO V3 mAP (%) | YOLO V3 precision (%) |
|---|---|---|---|---|---|---|
| 2000 | 27.26 | 30.13 / 26.79 / 24.87 | 27.53 | 30.18 / 27.29 / 25.13 | 35.43 | 37.14 / 35.42 / 33.74 |
| 4000 | 37.56 | 40.51 / 38.23 / 33.93 | 38.74 | 40.80 / 39.35 / 36.06 | 45.90 | 48.12 / 45.50 / 44.08 |
| 6000 | 41.83 | 44.45 / 41.36 / 39.67 | 43.15 | 45.30 / 42.80 / 41.35 | 49.61 | 51.81 / 49.87 / 47.16 |
| 8000 | 45.37 | 48.67 / 45.85 / 41.59 | 46.59 | 48.35 / 47.35 / 44.08 | 52.40 | 54.56 / 53.14 / 49.51 |
| 10000 | 48.22 | 51.33 / 47.84 / 45.50 | 50.28 | 52.09 / 50.76 / 48.00 | 55.89 | 58.17 / 56.99 / 52.50 |
| 12000 | 50.90 | 53.75 / 51.31 / 47.65 | 52.53 | 53.96 / 53.44 / 50.20 | 58.34 | 59.77 / 59.48 / 55.77 |
| 14000 | 53.09 | 55.69 / 54.20 / 49.38 | 54.43 | 56.18 / 54.78 / 52.31 | 60.58 | 63.39 / 60.91 / 57.44 |
| 16000 | 55.04 | 58.85 / 54.92 / 51.35 | 57.32 | 59.66 / 56.98 / 55.34 | 62.02 | 64.09 / 62.50 / 59.47 |
| 18000 | 56.66 | 60.49 / 56.81 / 52.67 | 58.62 | 60.35 / 59.55 / 55.95 | 64.18 | 66.79 / 65.15 / 60.58 |
| 20000 | 58.63 | 62.12 / 58.49 / 55.27 | 60.93 | 62.30 / 61.86 / 58.63 | 66.00 | 68.87 / 66.20 / 62.93 |
| 22000 | 60.42 | 63.95 / 60.63 / 56.67 | 63.07 | 64.33 / 63.93 / 60.95 | 67.22 | 70.37 / 68.02 / 63.26 |
| 24000 | 61.35 | 64.57 / 62.19 / 57.29 | 64.37 | 65.94 / 64.67 / 62.51 | 68.88 | 71.50 / 70.54 / 64.60 |
| 26000 | 63.40 | 66.60 / 63.94 / 59.65 | 66.38 | 68.07 / 67.17 / 63.90 | 70.44 | 72.85 / 71.83 / 66.64 |
| 28000 | 65.03 | 68.81 / 65.36 / 60.92 | 68.15 | 69.98 / 68.67 / 65.82 | 72.00 | 74.79 / 73.17 / 68.03 |
| 30000 | 66.84 | 70.09 / 68.19 / 62.24 | 69.42 | 70.49 / 70.53 / 67.24 | 71.99 | 74.84 / 73.44 / 67.70 |
| 32000 | 67.68 | 70.73 / 68.53 / 63.78 | 70.68 | 72.78 / 71.47 / 67.79 | 72.24 | 74.47 / 73.78 / 68.47 |
| 34000 | 69.26 | 72.65 / 71.03 / 64.11 | 71.72 | 73.96 / 72.17 / 69.01 | 72.15 | 74.62 / 74.01 / 67.81 |
| 36000 | 70.96 | 74.75 / 72.23 / 65.90 | 73.75 | 74.44 / 75.02 / 71.79 | 71.88 | 74.87 / 72.82 / 67.95 |
| 38000 | 71.25 | 74.83 / 71.98 / 66.95 | 74.80 | 76.83 / 75.53 / 72.03 | 71.69 | 74.05 / 72.37 / 68.63 |
| 40000 | 72.88 | 76.46 / 74.09 / 68.08 | 75.75 | 77.53 / 76.13 / 73.58 | 72.70 | 75.55 / 73.70 / 68.83 |
| 42000 | 73.23 | 76.30 / 74.68 / 68.71 | 75.99 | 77.41 / 76.96 / 73.59 | 71.96 | 75.34 / 73.21 / 67.33 |
| 44000 | 73.13 | 76.19 / 74.49 / 68.70 | 74.86 | 77.65 / 73.26 / 73.67 | 71.57 | 74.83 / 72.59 / 67.28 |
| 46000 | 72.82 | 76.00 / 74.93 / 67.53 | 74.46 | 76.19 / 74.33 / 72.85 | 71.87 | 74.58 / 73.41 / 67.61 |
| 48000 | 73.01 | 76.46 / 74.07 / 68.49 | 74.62 | 76.16 / 74.41 / 73.30 | 71.59 | 74.19 / 73.39 / 67.20 |
| 50000 | 72.84 | 76.35 / 74.69 / 67.48 | 74.64 | 76.40 / 73.26 / 74.25 | 71.44 | 74.32 / 72.32 / 67.69 |

The detection results obtained by the methods proposed in this paper are shown in Table 2.

Table 2: mAP and precision at different iteration times for YOLO v3 and the modified methods (IoU = 0.7). Precision columns give sea cucumber / sea urchin / scallop.

| Iteration | Original network mAP (%) | Original network precision (%) | Scheme 1 mAP (%) | Scheme 1 precision (%) | Scheme 2 mAP (%) | Scheme 2 precision (%) |
|---|---|---|---|---|---|---|
| 2000 | 24.90 | 28.88 / 26.48 / 19.35 | 40.29 | 42.25 / 40.25 / 38.37 | 40.72 | 42.25 / 41.25 / 38.65 |
| 4000 | 33.95 | 38.90 / 36.90 / 26.07 | 52.47 | 54.10 / 53.55 / 49.76 | 52.71 | 54.28 / 52.54 / 51.31 |
| 6000 | 40.85 | 40.57 / 37.45 / 44.54 | 57.94 | 60.76 / 58.25 / 54.81 | 57.73 | 59.24 / 58.51 / 55.43 |
| 8000 | 42.51 | 45.51 / 42.34 / 39.67 | 61.63 | 63.49 / 61.93 / 59.46 | 62.22 | 64.60 / 62.26 / 59.79 |
| 10000 | 44.72 | 50.58 / 44.29 / 39.29 | 65.27 | 68.39 / 66.11 / 61.31 | 64.91 | 67.60 / 65.33 / 61.79 |
| 12000 | 49.53 | 48.75 / 48.11 / 51.74 | 68.04 | 71.64 / 67.95 / 64.53 | 67.49 | 69.04 / 68.98 / 64.45 |
| 14000 | 48.37 | 50.35 / 50.59 / 44.16 | 70.50 | 72.86 / 70.74 / 67.89 | 70.47 | 72.64 / 70.35 / 68.41 |
| 16000 | 54.12 | 53.40 / 50.48 / 58.49 | 73.61 | 76.17 / 74.23 / 70.44 | 72.72 | 75.16 / 73.45 / 69.54 |
| 18000 | 52.95 | 59.29 / 55.08 / 44.48 | 75.38 | 78.84 / 74.99 / 72.31 | 74.32 | 76.51 / 75.40 / 71.07 |
| 20000 | 57.33 | 58.04 / 54.54 / 59.42 | 77.20 | 80.87 / 77.05 / 73.69 | 76.82 | 78.91 / 78.04 / 73.51 |
| 22000 | 58.04 | 57.21 / 59.24 / 57.67 | 79.54 | 82.80 / 79.70 / 76.13 | 78.48 | 80.72 / 79.70 / 75.02 |
| 24000 | 55.44 | 62.44 / 59.14 / 44.75 | 81.18 | 85.61 / 80.80 / 77.14 | 80.37 | 82.28 / 82.03 / 76.80 |
| 26000 | 57.85 | 61.98 / 62.52 / 49.05 | 83.02 | 86.31 / 82.70 / 80.06 | 83.34 | 85.46 / 84.48 / 80.07 |
| 28000 | 60.42 | 62.81 / 62.56 / 55.89 | 85.01 | 88.52 / 85.25 / 81.25 | 84.69 | 87.32 / 86.04 / 80.70 |
| 30000 | 59.74 | 67.91 / 61.47 / 49.85 | 85.60 | 89.44 / 85.78 / 81.58 | 86.54 | 89.33 / 87.92 / 82.39 |
| 32000 | 62.32 | 65.73 / 63.52 / 57.72 | 84.90 | 88.06 / 85.56 / 81.06 | 87.65 | 90.15 / 88.55 / 84.25 |
| 34000 | 63.60 | 68.72 / 65.44 / 56.64 | 84.96 | 88.28 / 85.79 / 80.82 | 87.92 | 90.96 / 89.55 / 83.26 |
| 36000 | 64.70 | 72.76 / 65.55 / 55.80 | 85.13 | 88.11 / 85.42 / 81.87 | 87.34 | 89.36 / 88.08 / 84.57 |
| 38000 | 63.73 | 69.71 / 65.50 / 55.98 | 85.65 | 89.40 / 85.33 / 82.21 | 87.90 | 89.62 / 89.28 / 84.79 |
| 40000 | 71.77 | 70.26 / 67.52 / 77.54 | 85.02 | 88.46 / 85.83 / 80.76 | 87.69 | 89.88 / 88.17 / 85.04 |
| 42000 | 67.62 | 71.46 / 69.78 / 61.63 | 85.12 | 89.47 / 84.89 / 80.99 | 87.58 | 90.27 / 87.99 / 84.48 |
| 44000 | 69.25 | 71.82 / 66.45 / 69.47 | 84.79 | 87.91 / 84.47 / 81.99 | 87.57 | 90.56 / 88.08 / 84.08 |
| 46000 | 66.42 | 71.26 / 71.02 / 56.99 | 85.04 | 89.01 / 84.45 / 81.66 | 87.15 | 89.48 / 88.64 / 83.33 |
| 48000 | 66.19 | 70.46 / 66.96 / 61.15 | 85.05 | 88.37 / 85.10 / 81.66 | 87.21 | 89.27 / 88.42 / 83.95 |
| 50000 | 70.33 | 68.86 / 70.35 / 71.78 | 84.59 | 88.73 / 84.78 / 80.27 | 87.42 | 90.69 / 87.68 / 83.91 |

The shift transformation is given as

x' = x_i + y_i \tan\theta_2, \qquad y' = x_i \tan\theta_2 + y_i, \quad (17)

where θ_2 is the shift angle. The above three methods are selected randomly to transform the original images, and the total number is augmented to 30000; a sketch of the color and rotation augmentations follows.

5. Experimental Results

The method proposed in this paper is intended for use on an underwater remotely operated vehicle (ROV) for fishing marine products. The robot is about 1 m long and 0.8 m wide and weighs 90 kg; it collects marine products by adsorption. The design and the real robot are shown in Figure 10. The robot is remotely operated; our team is going to reconstruct the ROV to be semiautonomous, so the key technology is how to detect and locate the objects.

5.1. Detection Comparison. The GPU used in these computations is an NVIDIA GTX 1080ti, and the total number of images is 30000, which are manually labeled one by one. In the deep learning experiments, 8520 images are used for training, 8530 for validation, and 12950 for testing. In object detection, Precision, Recall, and mean Average Precision are commonly used to assess the accuracy; the definitions are illustrated in Figure 11:

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}. \quad (18)

Mean Average Precision (mAP) is the mean of the average precision over all detection classes and is widely used to evaluate detection systems; a minimal sketch follows. In this paper, the dataset is prepared in Pascal VOC form; the results obtained from Fast RCNN [6] and Faster RCNN [27] are shown in Figure 12, and the concrete data is shown in Tables 1 and 2.
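A minimal Python sketch of equation (18) and the mAP definition; the per-class AP values in the usage line are placeholders, not results from the paper.

```python
def precision_recall(tp, fp, fn):
    """Equation (18) from counts of true positives, false positives,
    and false negatives at a fixed IoU threshold (0.7 here)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class average precisions."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Placeholder AP values for the three classes used in this paper
print(mean_average_precision({"sea cucumber": 0.90, "sea urchin": 0.88,
                              "scallop": 0.84}))
```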

In order to clarify the convergence of the different methods, the mAP values vs. iteration times are shown in Figure 13.

From the above results and comparison, it can be seen that the detection accuracy of Faster RCNN is better than that of the other baseline methods, but the difference is not very large. Compared with the original YOLO V3 method [30], the proposed method gives more accurate detection, and scheme 2 is more effective. The convergence of the methods differs: the YOLO V3 methods converge after about 28000 iterations, which is earlier than Fast RCNN and Faster RCNN. After 40000 iterations, none of the methods can improve the detection accuracy further; the reason is the lack of underwater samples in the dataset and the similarity of its images, especially their identical backgrounds. This is the main difficulty for underwater object detection: underwater data from the deep sea is too difficult to obtain.

The original network proposed in this paper is not stable; the results fluctuate as the iteration times increase. The modified schemes are proposed to improve the stability and accuracy, as shown in Figure 13. Compared with the other typical methods, our proposed methods give more accurate results.

The loss function curves are shown in Figure 14. The loss values of all of the methods converge, and the loss amplitudes of the YOLO V3 methods are smaller compared with Fast RCNN [6] and Faster RCNN [27]; the convergence speed of the proposed methods is slower than that of the original YOLO V3 method [30].

Figure 13: mAP results and comparison with other methods (%): Fast RCNN, Faster RCNN, YOLO V3, original network, scheme 1, and scheme 2 over 0–50000 iterations.


For object detection, the accuracy of all of the above methods is sufficient for application, so real-time detection is more important; the detection speed is shown in Table 3.

It is clear that the YOLO V3 [30] methods have a very fast detection speed, almost four times faster than Faster RCNN [27]. Based on the accuracy and detection speed analysis, scheme 2 is better than the other methods: it has the same accuracy as Faster RCNN, and its detection speed is around 50 FPS; even on an NVIDIA TX2 card, the detection speed reaches 17 FPS, which is enough for real applications.

5.2. Detection Results. The following typical images are used to verify the method (scheme 2) proposed in this paper; the images are provided by the "Underwater Robot Picking Contest", and some images were filmed by the underwater ROV.

The scheme 2 method is better in underwater detection because it retains more representation information; the comparison is shown in Figure 15. Panels (a) and (b) show the same image: the scheme 2 method detects the sea cucumber and the sea urchin in the lower-left corner, but the original method misses them. In (c) and (d), the left sea cucumber is missed by the original YOLO V3 [30] method too, so the proposed method is obviously more effective.

Figure 14: The loss curves of different methods over 30000 iterations: (a) Fast RCNN; (b) Faster RCNN; (c) YOLO v3; (d) scheme 1; (e) scheme 2.

Table 3: Detection speed of different methods (IoU = 0.7, learning rate = 0.001).

| Approach | Fast RCNN | Faster RCNN | YOLO V3 | Scheme 1 | Scheme 2 |
|---|---|---|---|---|---|
| Time cost (ms) | 96 | 85 | 20 | 22 | 19 |


Figure 15: The detection comparison between YOLO V3 and the scheme 2 method. (a) Scheme 2. (b) YOLO V3. (c) Scheme 2. (d) YOLO V3.


From the detection in image (a), we can also see that the sea cucumber covered by the sand in the lower-left corner is detected, which is difficult even for human vision.

In order to verify the method, 8 images are chosen in the experiment; the detection results are shown in Figure 16.

Figure 16: The detection results of the scheme 2 method.

The trained model is applied in the ROV to test the detection effect; the weather was cloudy, and the sea water was very turbid. The real-time detection results are presented in Figure 17.

Figure 17: The detection applied in the ROV.

As seen in these results, some of the objects are missed; the reason is that the dataset is not large enough, and especially that its images are very similar, with simple lighting and backgrounds. When the trained model is used to detect in other sea areas or under different environmental conditions, the detection accuracy drops to some extent, so our team is planning to film more underwater images in different sea areas and under different conditions to make the dataset more plentiful and thus achieve better underwater detection.

6. Conclusion

Considering the characteristics of underwater vision, new image processing procedures are proposed to deal with the low-contrast and weak-illumination problems. A deep CNN method is proposed to achieve the detection and classification of marine organisms, building on what is commonly recognized as the fastest object detection method. Underwater vision is of low quality, and the objects are always overlapped and shaded, so the original YOLO V3 [30] method is not very effective for underwater detection; two modified schemes are proposed to deal with these problems. Comparison of the detection results with the other methods shows that scheme 2 gives better detection. The trained model is used to assist the ROV in detecting underwater objects; although some of the objects are missed, the effectiveness and capability of the proposed method are clearly verified by the qualitative and quantitative evaluation results. The proposed method is suitable for our underwater robot to detect the objects, although it is not necessarily better than the typical methods on other datasets. Dropout layers and other technologies are not significant in this model; reconstructing the network with a more sophisticated algorithm would be more effective.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

No potential conflict of interest was reported by the authors.



Acknowledgments

We would like to express our gratitude for the support from the National Key R&D Program of China (Grant No. 2018YFC0309402) and the Fundamental Research Funds for the Central Universities (Grant No. HEUCF180105).

References

[1] E. Y. Lam, “Combining gray world and retinex theory for automatic white balance in digital photography,” in Proceedings of the Ninth International Symposium on Consumer Electronics (ISCE 2005), pp. 134–139, Macau, June 2005.
[2] G. Buchsbaum, “A spatial processor model for object colour perception,” Journal of the Franklin Institute, vol. 310, no. 1, pp. 1–26, 1980.
[3] J. Van De Weijer, T. Gevers, and A. Gijsenij, “Edge-based color constancy,” IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2207–2214, 2010.
[4] R. Hummel, “Image enhancement by histogram transformation,” Computer Graphics and Image Processing, vol. 6, no. 2, pp. 184–195, 1977.
[5] K. Zuiderveld, “Contrast limited adaptive histogram equalization,” in Graphics Gems IV, Academic Press Professional, Inc., 1994.
[6] A. S. A. Ghani and N. A. M. Isa, “Enhancement of low quality underwater image through integrated global and local contrast correction,” Applied Soft Computing, vol. 37, pp. 332–344, 2015.
[7] C. Li and J. Guo, “Underwater image enhancement by dehazing and color correction,” Journal of Electronic Imaging, vol. 24, no. 3, article 033023, 2015.
[8] M. Braik, A. Sheta, and A. Ayesh, “Image enhancement using particle swarm optimization,” Journal of Intelligent Systems, vol. 2165, no. 1, pp. 99–115, 2007.
[9] E. H. Land, “The Retinex theory of color vision,” Scientific American, vol. 237, no. 6, pp. 108–128, 1977.
[10] X. Fu, P. Zhuang, Y. Huang, Y. Liao, X.-P. Zhang, and X. Ding, “A retinex-based enhancing approach for single underwater image,” in 2014 IEEE International Conference on Image Processing (ICIP), pp. 4572–4576, Paris, France, October 2014.
[11] J. Perez, A. C. Attanasio, N. Nechyporenko, and P. J. Sanz, “A deep learning approach for underwater image enhancement,” in International Work-Conference on the Interplay Between Natural and Artificial Computation, pp. 183–192, Springer, Cham, 2017.
[12] C. C. Chang, J. Y. Hsiao, and C. P. Hsieh, “An adaptive median filter for image denoising,” in 2008 Second International Symposium on Intelligent Information Technology Application, pp. 346–350, Shanghai, China, December 2008.
[13] C. J. Prabhakar and P. U. P. Kumar, “Underwater image denoising using adaptive wavelet subband thresholding,” in 2010 International Conference on Signal and Image Processing, pp. 322–327, Chennai, India, December 2010.
[14] D. M. Kocak and F. M. Caimi, “The current art of underwater imaging – with a glimpse of the past and vision of the future,” Marine Technology Society Journal, vol. 39, no. 3, pp. 5–26, 2005.
[15] M. Zhang and B. K. Gunturk, “Multiresolution bilateral filtering for image denoising,” IEEE Transactions on Image Processing, vol. 17, no. 12, pp. 2324–2333, 2008.
[16] M. Mäkitalo and A. Foi, “Optimal inversion of the generalized Anscombe transformation for Poisson-Gaussian noise,” IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 91–103, 2013.
[17] J. L. Forand, G. R. Fournier, D. Bonnier, and P. Pace, “LUCIE: a Laser Underwater Camera Image Enhancer,” in Proceedings of OCEANS '93, Victoria, BC, Canada, October 1993.
[18] S. Yang and F. Peng, “Laser underwater target detection based on Gabor transform,” in 2009 4th International Conference on Computer Science & Education, pp. 95–97, Nanning, China, July 2009.
[19] B. Ouyang, F. Dalgleish, A. Vuorenkoski, W. Britton, B. Ramos, and B. Metzger, “Visualization and image enhancement for multistatic underwater laser line scan system using image-based rendering,” IEEE Journal of Oceanic Engineering, vol. 38, no. 3, pp. 566–580, 2013.
[20] P. C. Y. Chang, J. C. Flitton, K. I. Hopcraft, E. Jakeman, D. L. Jordan, and J. G. Walker, “Improving visibility depth in passive underwater imaging by use of polarization,” Applied Optics, vol. 42, no. 15, pp. 2794–2803, 2003.
[21] V. Gruev, J. V. D. Spiegel, and N. Engheta, “Advances in integrated polarization image sensors,” in 2009 IEEE/NIH Life Science Systems and Applications Workshop, pp. 62–65, Bethesda, MD, USA, April 2009.
[22] Y. Li and S. Wang, “Underwater polarization imaging technology,” in 2009 Conference on Lasers & Electro Optics & The Pacific Rim Conference on Lasers and Electro-Optics, pp. 1-2, Shanghai, China, August 2009.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
[24] R. Girshick, “Fast R-CNN,” in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, Santiago, Chile, December 2015.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016.
[27] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Columbus, OH, USA, June 2014.
[28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
[29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: unified, real-time object detection,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, Las Vegas, NV, USA, June 2016.
[30] J. Redmon and A. Farhadi, YOLOv3: An Incremental Improvement, 2018.
[31] G. D. Finlayson and E. Trezzi, “Shades of gray and colour constancy,” in The Twelfth Color Imaging Conference, IS&T, The Society for Imaging Science and Technology, pp. 37–41, Scottsdale, AZ, USA, 2004.
[32] C. Li, J. Guo, F. Porikli, and Y. Pang, “LightenNet: a convolutional neural network for weakly illuminated image enhancement,” Pattern Recognition Letters, vol. 104, pp. 15–22, 2018.
[33] C. Dong, Y. Deng, C. C. Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, December 2015.
[34] D. J. Jobson, Z. Rahman, and G. A. Woodell, “A multi-scale Retinex for bridging the gap between color images and the human observation of scenes,” IEEE Transactions on Image Processing, vol. 6, no. 7, pp. 965–976, 1997.
[35] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, USA, 2017.
