Abstract

Because underwater exploration is essential to the development and utilization of deep-sea resources, underwater autonomous operation is increasingly important as a way to avoid the dangers of the high-pressure deep-sea environment. Intelligent computer vision is the key technology for underwater autonomous operation. In an underwater environment, weak illumination and low image quality make enhancement a necessary preprocessing step for underwater vision. In this paper, a combination of the max-RGB method and the shades of gray method is applied to enhance underwater vision, and a CNN (Convolutional Neural Network) is then trained to learn the mapping relationship between weakly illuminated underwater images and their illumination maps. After this preprocessing, a deep CNN method is proposed to perform underwater detection and classification, and two improved schemes are applied to modify the deep CNN structure according to the characteristics of underwater vision. In the first scheme, a convolution kernel is applied to an intermediate feature map, and a downsampling layer is then added to resize the output to match the scale of the final feature map. In the second scheme, the downsampling layer is added first, the convolution layer is then inserted in the network, and the result is combined with the last output to achieve the detection. Through comparison with Fast RCNN, Faster RCNN, and the original YOLO V3, scheme 2 is verified to be better at detecting underwater objects. The detection speed is about 50 FPS (frames per second), and the mAP (mean average precision) is about 90%. The program is deployed on an underwater robot; the real-time detection results show that the detection and classification are accurate and fast enough to assist the robot in underwater operation.

1. Introduction

With the development of computer vision and image processing technology, applying image processing methods to improve underwater image quality, so as to satisfy the requirements of both the human visual system and machine recognition, has gradually become a hot research topic. At present, the methods of underwater image enhancement and restoration can be divided into nonphysical-model image enhancement and physical-model-based image restoration.

For underwater image enhancement, traditional image processing methods include color correction algorithms and contrast enhancement algorithms. The white balance method [1], the gray world hypothesis [2], and the gray edge hypothesis [3] are typical color correction methods, while histogram equalization [4] and contrast-limited histogram equalization [5] are the contrast enhancement algorithms commonly used to enhance underwater images. Compared with the good results these methods obtain on ordinary images, their results on underwater vision are unsatisfactory. The main reason is that the ocean environment is complex: unfavorable factors such as the scattering and absorption of light by water and the presence of underwater suspended particles seriously degrade image quality.

More complex and comprehensive underwater image enhancement methods have been proposed to address the degradation problems of color fading, contrast reduction, and detail blurring. For example, Ghani et al. [6] proposed a method to solve the low-contrast problem of underwater images: a Rayleigh-stretched, contrast-limited adaptive histogram was used to normalize the globally and locally contrast-enhanced images, so as to enhance low-quality underwater images. Li et al. [7] considered the multiple degradation factors of underwater images and adopted an image dehazing algorithm, color compensation, histogram equalization, saturation and illumination intensity stretching, and a bilateral filtering algorithm to address blurring, color fading, low contrast, and noise. Braik et al. [8] used particle swarm optimization (PSO) to enhance underwater images by reducing the influence of light absorption and scattering. In addition, the Retinex theory is often applied in the underwater image enhancement process [9]; Fu et al. [10] proposed an underwater image enhancement method based on the Retinex model, which applies different strategies to enhance the reflection and illumination components of the underwater image on the basis of color correction and then synthesizes the final enhancement result. Perez et al. [11] proposed an underwater image enhancement method based on deep learning, which constructs a training dataset consisting of pairs of degraded and restored underwater images. A model mapping degraded images to restored images is learned from a large training set and is then used to enhance underwater image quality.

Underwater detection mainly depends on digital cameras; image processing is commonly used to enhance quality and reduce noise, and contour segmentation methods are commonly used to locate objects. Many such methods have been proposed for target detection. For instance, Chen Chang et al. [12] proposed a new image-denoising filter based on the standard median filter, which detects noise and replaces the original pixel value with an updated median. Prabhakar et al. [13] proposed a novel denoising method to remove the additive noise present in underwater images: homomorphic filtering is used to correct nonuniform illumination, and anisotropic filtering is applied for smoothing. A new denoising approach combining wavelet decomposition with a high-pass filter has been applied to enhance underwater images (Sun et al., 2011); both the low-frequency components of the back-scattering noise and the uncorrelated high-frequency noise can be effectively suppressed simultaneously, although the wavelet method leaves serious unsharpness in the processed image. Kocak et al. [14] used a median filter to remove noise; the image quality is enhanced by RGB color-level stretching, and the atmospheric light is obtained through the dark channel prior, which is helpful for images with minor noise. For noisy images, a bilateral filtering method is utilized by Zhang et al. [15]; the results are good, but the processing time is very high. An exact unbiased inverse of the generalized Anscombe transformation is introduced by Markku et al. [16]; the comparison shows that this inverse plays an integral part in ensuring accurate denoising results.

A Laser Underwater Camera Image Enhancer system was designed and built by Forand et al. [17] to enhance laser underwater image quality; tests show that the system has a range 3 to 5 times that of a conventional camera with floodlights. Yang et al. [18] proposed a method for detecting weak underwater laser targets based on the Gabor transform: the complicated nonstationary underwater laser signal is processed so that it becomes approximately stationary, and the triple correlation is then computed with the Gabor transform coefficients, which eliminates random interference and highlights the target signal's correlation. Ouyang et al. [19] investigated the application of light field rendering (LFR) to images taken from a distributed bistatic nonsynchronous laser line scan imager, using both line-of-sight and non-line-of-sight imaging geometries to create a multiperspective rendering of an unknown underwater scene.

Chang et al. [20] showed that scattering introduces a significant amount of polarization into light at scattering angles near 90 degrees; this light can then be distinguished from light scattered by an object, which remains almost completely unpolarized. Results were obtained from a Monte Carlo simulation and from a small-scale experiment in which an object was immersed in a cell filled with polystyrene latex spheres suspended in water. Gruev et al. [21] described two approaches for creating focal-plane polarization imaging sensors. The first approach combines polymer polarization filters with a CMOS active-pixel sensor and computes polarization information at the focal plane; the second outlines initial work on polarization filters using aluminum nanowires. Measurements from the first polarization image sensor prototype are discussed in detail, and applications for material detection using polarization techniques are described. Underwater polarization imaging technology is introduced in detail by Li et al. [22].

The above methods are based on wavelet decomposition, statistical methods, laser technology, or color polarization theories. The results show that these methods are reasonable and effective, but their common weakness is that the processing is very time-consuming, which makes real-time detection difficult to achieve.

The Convolutional Neural Network (CNN) is recognized as the fastest detection approach in many different research fields. Krizhevsky et al. [23] applied a CNN to the classification problem and won the ILSVRC (ImageNet Large Scale Visual Recognition Challenge), reducing the top-5 error rate to 15.3%; since then, deep CNNs have been widely applied. Girshick [24] proposed the Region Convolutional Neural Network (RCNN) by combining region proposals with CNN features; tested on Pascal VOC 2007, its mAP reaches 66%. Based on RCNN, SPP-Net (Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition) was presented by He K. et al. [25] to improve detection efficiency. ResNet was proposed in [26]; its success lies in solving the network degradation problem through the introduction of the residual module, which allows deeper networks that obtain features with stronger expressive ability and higher accuracy. In Fast RCNN [6], a Multilayer Perceptron (MLP) is applied to replace the SVM (Support Vector Machine), and training and classification are significantly optimized. Building on Fast RCNN, Ren S, He K, and Girshick [27] added an RPN (Region Proposal Network) to select and refine the region proposals instead of selective search, aiming to solve the end-to-end detection problem; this is the Faster RCNN method. Liu Wei proposed the SSD (Single Shot MultiBox Detector) method at ECCV 2016 (European Conference on Computer Vision); compared with Faster RCNN, it has a distinct speed advantage, directly predicting the coordinates and categories of bounding boxes without a proposal-generation stage.

At CVPR 2016 (IEEE Conference on Computer Vision and Pattern Recognition), Redmon proposed the YOLO (You Only Look Once) [28] regression-based object detection algorithm; with this method, the detection speed is improved significantly, making real-time detection feasible. When the YOLO algorithm was first put forward, its accuracy and computation speed were not as good as those of the SSD algorithm. Redmon then proposed YOLO V2 [29], optimizing the original YOLO multitarget detection framework through a series of methods and greatly improving accuracy while maintaining the original speed. In early 2018, Redmon put forward YOLO V3 [30], which is generally recognized as the fastest detection method; both accuracy and detection speed are greatly improved compared with the other methods.

In this paper, we apply a combination of the max-RGB method and the shades of gray method to enhance underwater images, and a CNN method is used for weakly illuminated images. For underwater object detection, a new CNN method is proposed; considering the particularity of underwater vision, two improved schemes are proposed to raise the detection accuracy, and the results are compared with Fast RCNN [6], Faster RCNN [27], and the original YOLO V3 [30]. The comparison verifies that the modification is effective, and the program is installed on an underwater robot to test real-time detection.

2. Image Preprocessing

For underwater computer vision, image preprocessing is the most important procedure before object detection. Because of light scattering and absorption in the water, images obtained by an underwater vision system show uneven illumination, low contrast, and serious noise. Based on an analysis of current image processing algorithms, enhancement algorithms for underwater images are proposed in this paper.

2.1. The Underwater Vision Detection Architecture

The typical underwater visual system is composed of illumination, a camera or sensor, an image acquisition card, and application software. The software pipeline of the underwater visual recognition system generally includes several parts, such as image acquisition, image preprocessing, a convolutional neural network, and target recognition, as shown in Figure 1.

Image preprocessing operates at the low level; its fundamental purpose is to improve image contrast and to weaken or suppress the influence of various kinds of noise as far as possible, while retaining the useful details during enhancement and filtering. The Convolutional Neural Network divides images into multiple nonoverlapping regions; object detection and classification are based on feature extraction, which aims to extract the most effective essential features that reflect the target. Every stage is closely related to the others, so every effort should be made at each stage to achieve satisfactory results. The research in this paper mainly focuses on image preprocessing and on recognition of typical targets in underwater vision.

2.2. Combination of Max-RGB Method and Shades of Gray Method

The absorption of light by water leads to the decline of color in underwater images. Since red and orange light are almost completely absorbed within the first 10 meters of water depth, underwater images generally take on a blue-green cast. In order to eliminate this color deviation, color correction of underwater images must be carried out.

Color correction for ordinary images is a mature technique. Many white balance methods, such as the Gray World method, the max-RGB method, the shades of gray method, and the Gray Edge method, are used to correct the color deviation of an image according to the color temperature. Generally, these methods are applied to ordinary color-cast conditions, and their treatment of severely degraded underwater vision is not satisfactory. In this paper, the original max-RGB method and the shades of gray method are combined to identify the illuminant color, based on the image formation model

$$f_c(x) = \int_{\omega} e(\lambda)\, s(x, \lambda)\, \rho_c(\lambda)\, d\lambda, \qquad c \in \{R, G, B\} \tag{1}$$

where $f(x)$ is the input underwater image, $e(\lambda)$ is the radiance given by the light source, $\lambda$ is the wavelength, $s(x, \lambda)$ represents the surface reflectance, $\rho_c(\lambda)$ denotes the sensitivity of the sensors, and $\omega$ is the visible spectrum.

The illuminant is defined as

$$e_c = \int_{\omega} e(\lambda)\, \rho_c(\lambda)\, d\lambda \tag{2}$$

The average reflectance of the scene is gray (achromatic) according to the Grey-World assumption [31]:

$$\frac{\int s(x, \lambda)\, dx}{\int dx} = k \tag{3}$$

Assume $k$ is a constant value. The physical meaning of equation (1) can then be simply described as follows: the observed image can be decomposed into the product of the reflectance image $R$ and the illumination map $T$, that is, $I = R \circ T$. Thus, weak-illumination image enhancement means removing the weak illumination from the input image. Substituting equation (3) into equation (1) gives

$$\frac{\int f_c(x)\, dx}{\int dx} = \int_{\omega} e(\lambda)\, \rho_c(\lambda) \left( \frac{\int s(x, \lambda)\, dx}{\int dx} \right) d\lambda = k\, e_c \tag{4}$$

Equation (4) explains the illumination as the (scaled) average color of the entire image. According to the max-RGB method, the illuminant can instead be estimated from the maximum response in each channel, $\max_x f_c(x) = k\, e_c$. Both are special cases of the Minkowski norm used in the shades of gray method, in which the average color of the entire image is raised to a power $p$:

$$\left( \frac{\int \left( f_c(x) \right)^p\, dx}{\int dx} \right)^{1/p} = k\, e_c \tag{5}$$

Here $p$ can take any value between 1 (which recovers the Grey-World method) and $\infty$ (which recovers max-RGB); the default value $p = 6$ is the one defined in the shades of gray method proposed by Finlayson [31].
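To make the combination concrete, the following NumPy sketch estimates the illuminant with the Minkowski norm of equation (5) and applies a simple per-channel correction. This is an illustration under the formulation above, not the authors' code; the function names and the brightness-preserving normalization are our assumptions.

```python
import numpy as np

def estimate_illuminant(img, p=6):
    """Estimate the illuminant color with the Minkowski p-norm (shades of gray).

    p = 1 recovers the Grey-World method; p = np.inf recovers max-RGB.
    img: float array of shape (H, W, 3) with values in [0, 1].
    """
    flat = img.reshape(-1, 3)
    if np.isinf(p):
        e = flat.max(axis=0)                         # max-RGB estimate
    else:
        e = (flat ** p).mean(axis=0) ** (1.0 / p)    # shades of gray estimate
    return e / np.linalg.norm(e)                     # unit-norm illuminant

def correct_colors(img, p=6):
    """Von Kries-style correction: divide each channel by its illuminant."""
    e = estimate_illuminant(img, p)
    corrected = img / (np.sqrt(3.0) * e)             # roughly preserve brightness
    return np.clip(corrected, 0.0, 1.0)
```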

2.3. CNN Method for Weakly Illuminated Image Enhancement

The Retinex model can be used to enhance an image based on an estimated illumination map. Underwater images are usually weakly illuminated, so a trainable CNN is applied to predict the mapping relation between a weakly illuminated image and the corresponding illumination map. A four-layer convolutional network is used: the first two layers focus on the high-light regions, the third layer focuses on the low-light regions, and the last layer reconstructs the illumination map. The network directly learns an end-to-end mapping between dark and bright images; low-light image enhancement is thus treated in this paper as a machine learning problem. A weakly illuminated image is input, and a convolution layer changes the image into 32 channels (the 3-D view in Figure 2 represents the multichannel feature maps); further convolution layers with different kernel sizes are then added, and the output is a one-channel feature map. In this model, most of the parameters are optimized by back-propagation, whereas the parameters of traditional models depend on manual settings. The four-layer convolutional network structure is shown in Figure 2.

The input image is the weakly illuminated image, and the output is the corresponding illumination map. Similar to Chongyi Li et al. [32] and Dong et al. [33], the network contains four convolutional layers with specific tasks. Observing the feature maps in Figure 2, different convolutional layers have different effects on the final illumination map: the first two layers focus on the high-light regions, the third layer focuses on the low-light regions, and the last layer reconstructs the illumination map. The specific operation of the four convolutional layers is shown in Figure 2.
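As an illustration, a minimal PyTorch sketch of such a four-layer network might look as follows. The 32-channel first layer follows the text; the kernel sizes and remaining channel counts are assumptions, since the exact values were not recoverable.

```python
import torch
import torch.nn as nn

class IlluminationNet(nn.Module):
    """Four-layer CNN mapping a weakly illuminated RGB image to a
    one-channel illumination map (kernel sizes are assumptions)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=9, padding=4),   # expand to 32 channels
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=5, padding=2),  # high-light regions
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=5, padding=2),  # low-light regions
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),   # reconstruct illumination map
        )

    def forward(self, x):
        return torch.sigmoid(self.features(x))  # illumination values in (0, 1)

# Usage sketch, following the Retinex decomposition I = R * T:
# reflectance = image / illumination_net(image)
```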

The enhancement effects are shown in Figure 3: the underwater background color is improved significantly, and the weakly illuminated images are enhanced by the trainable CNN method.

3. The Object Detection Theories

The input images are resized to the fixed network resolution; the image will stretch, and the labels must be recalculated accordingly. In practice, scale factors are calculated to record the scaling of the width and the height, respectively, and the label values $x$, $y$, $w$, and $h$ are rescaled with them; the output images are then resized back to the same size as the originals. A CNN method is used to predict the bounding boxes and classification probabilities. For underwater detection, the targets are difficult to distinguish from the background. In order to improve the detection accuracy, the whole image information is used to predict the bounding boxes of the targets and to classify the objects at the same time; with this approach, end-to-end real-time target detection can be realized.
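A minimal sketch of this label rescaling, assuming pixel-unit $(x, y, w, h)$ labels and a square network input (416 × 416 is the standard YOLO V3 setting; the paper's exact size was not recoverable):

```python
def rescale_labels(boxes, orig_w, orig_h, net_size=416):
    """Rescale (x, y, w, h) pixel-unit boxes after stretching the image
    to net_size x net_size. Returns boxes in resized coordinates."""
    sx = net_size / orig_w   # width scale factor
    sy = net_size / orig_h   # height scale factor
    return [(x * sx, y * sy, w * sx, h * sy) for (x, y, w, h) in boxes]
```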

3.1. Convolutional Neural Network

The image is divided into $S \times S$ grid cells, which are used to locate the centers of the detected objects. For each grid cell, $B$ bounding boxes (bbox) are predicted, each of which includes 5 parameters: $(x, y)$ is the center location of the bounding box, $(w, h)$ is the width and height of the box, and the confidence is the Intersection over Union (IoU), which equals the intersection divided by the union between the bbox and the ground truth; the process is shown in Figure 4.
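The IoU confidence can be computed as in the following generic sketch, with boxes given as center-size tuples:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```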

The bounding box is predicted through a fully connected layer; if the width and height are only related to the scales and ratios of the input images, the locations of objects with different shapes cannot be predicted very accurately. Therefore, a Region Proposal Network is applied to predict the bounding box and confidence [27]: predicted boxes with different scales and ratios are used, and the offsets of the boxes are calculated in the RPN, as shown in Figure 5. The fully connected layer is removed, and a convolution layer with anchor boxes is added to predict the bounding box. In order to keep the high resolution of the original image, a pooling layer is removed; with an input image of 416 × 416, the scale of the final feature map is 13 × 13, which has only one center cell.

Through a series of convolutions, a common feature map is obtained, and then the RPN is applied. Firstly, through a 3 × 3 convolution, a new feature map is produced, which can also be seen as a set of high-dimensional feature vectors; then, through two 1 × 1 convolutions, a classification feature map and a regression feature map are obtained. For each of the $k$ predefined anchors this yields 2 scores and 4 coordinates; combined with the predefined anchors and after postprocessing, the bounding boxes are calculated.
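A sketch of such an RPN head, following the Faster RCNN layout assumed above (the channel counts are assumptions):

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """RPN head: a 3x3 conv followed by two sibling 1x1 convs giving
    2 scores and 4 box offsets per anchor."""

    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)  # object / not object
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # (tx, ty, tw, th)

    def forward(self, feature_map):
        x = self.conv(feature_map).relu()
        return self.cls(x), self.reg(x)
```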

In the deep learning process, grid cell data is fed into the network; the centers of some pixels fall within the range of a specific grid cell, and all the pixels that satisfy the features of the object are clustered within a certain range. After many rounds of training with a penalty term, the exact range can be found through a sliding window, but the center position cannot exceed the range of the grid cell. This greatly limits the model's computation as the window slides around the picture. In this way, position detection and category recognition are combined into a single CNN network for prediction: the picture only needs to be scanned once to infer the positions and categories of all objects in it.

3.2. Cluster Analysis

The $k$-means cluster method is used to train the bounding boxes. The target is to obtain a better IoU between the bbox and the ground truth, so the distance from a bbox to the cluster center is calculated as

$$d(\text{box}, \text{centroid}) = 1 - \text{IoU}(\text{box}, \text{centroid})$$

The Euclidean distance is applied in the traditional $k$-means cluster method, which means that bigger boxes generate more error than smaller boxes, so the result may deviate from the true value. The IoU score is therefore used in place of the Euclidean distance.
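The anchor clustering can be sketched as follows; the width-height-only IoU (boxes aligned at a common corner) is the usual simplification for this step, and the iteration count is an illustrative assumption:

```python
import random

def kmeans_anchors(boxes, k=9, iters=100):
    """Cluster (w, h) box sizes with d = 1 - IoU, as described above.

    boxes: list of (w, h) tuples; returns k (w, h) anchor centroids.
    """
    def wh_iou(a, b):
        inter = min(a[0], b[0]) * min(a[1], b[1])
        return inter / (a[0] * a[1] + b[0] * b[1] - inter)

    centroids = random.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # max IoU == min distance (1 - IoU)
            i = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[i].append(box)
        centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```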

The convolutional kernel is 3 × 3, the max-pooling size is 2 × 2, and the size of the feature map is thus reduced by a factor of 2. Global average pooling is applied to complete the prediction, and 1 × 1 convolutions are used to compress the channels of the feature maps, so as to reduce the number of parameters and the amount of calculation. A batch normalization layer is added to accelerate convergence and avoid overfitting.

Data preprocessing (unified format, equalization, noise reduction, etc.) can greatly improve the speed of training and enhance the training effect. Batch Normalization (BN), proposed by Google, is commonly used in CNN networks. After the convolution or pooling and before the activation function, all of the input data is normalized as follows:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\mathrm{Var} + \varepsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta$$

where $\mu$ is the batch mean value and $\mathrm{Var}$ is the variance; $\gamma$ and $\beta$ are the scale and shift coefficients, which are obtained from training.
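A NumPy sketch of the training-time BN forward pass defined above:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization over a batch of feature maps.

    x: array of shape (N, C, H, W); gamma, beta: per-channel (C,) params.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)      # batch mean
    var = x.var(axis=(0, 2, 3), keepdims=True)      # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)           # normalize
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```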

3.3. Location Prediction

In order to solve the instability of using anchor boxes, especially during early iterations, the following relations are used to predict the location of the boxes:

$$x = t_x \cdot w_a + x_a, \qquad y = t_y \cdot h_a + y_a$$

where $(x, y)$ is the predicted value, $(x_a, y_a)$ are the coordinates of the anchor, $(x^*, y^*)$ are the real coordinate values (from which the regression targets are computed analogously), $(t_x, t_y)$ are the offset values, and $(w_a, h_a)$ are the width and height of the anchor box.

When $t_x = 1$, the box is offset to the right by a distance equal to the width of the anchor box; when $t_x = -1$, the offset is to the left. Thus every predicted box can end up at any position on the image, which is the reason why the model is unstable and the prediction is very time-consuming. To fix this, the prediction box is limited to its grid cell, and the sigmoid function is used to constrain the offset value to the range 0-1; $b_x$, $b_y$, $b_w$, and $b_h$ can then be computed from the following equations:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w\, e^{t_w}$$
$$b_h = p_h\, e^{t_h}$$

In the above equations, $(c_x, c_y)$ are the upper-left corner coordinates of the grid cell, as shown in Figure 6; since the scale of each grid cell is 1, the center is limited to the interior of the cell by the sigmoid function. $p_w$ and $p_h$ are the prior width and height of the anchor.
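Decoding a raw prediction into grid coordinates, following the four equations above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs into a box on the feature-map grid,
    following b_x = sigmoid(t_x) + c_x, ..., b_w = p_w * exp(t_w)."""
    bx = sigmoid(tx) + cx      # center stays inside cell (cx, cy)
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # scale the anchor prior width
    bh = ph * math.exp(th)     # scale the anchor prior height
    return bx, by, bw, bh
```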

3.4. Loss Function

In the training process, the form of the loss function is a key technique. For the method proposed in this paper, a sum-squared error loss is used to balance the errors; for boxes of different sizes, the width and height of the bounding box are replaced by their square roots, so that a given offset in a smaller box yields a relatively larger loss value and the prediction becomes more effective. The loss function can be divided into 2 parts:

$$\text{loss}_{\text{coord}} = \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 + \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]$$

Here $\mathbb{1}_{ij}^{\text{obj}}$ determines whether the $j$-th box in the $i$-th grid cell is responsible for the object or not, so $\text{loss}_{\text{coord}}$ is the coordinate prediction loss.

$$\text{loss}_{\text{conf}} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2$$

$\text{loss}_{\text{conf}}$ is the confidence prediction loss of the boxes with respect to the object. The total loss is the sum of $\text{loss}_{\text{coord}}$ and $\text{loss}_{\text{conf}}$, which gives a good balance between the coordinates, the confidence, and the classification.
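A sketch of the two loss terms, assuming predictions that are already matched to ground-truth boxes; the λ weights follow the standard sum-squared-error YOLO formulation and are assumptions here:

```python
import torch

def yolo_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLO-style loss.

    pred, target: (N, 5) rows of (x, y, w, h, conf);
    obj_mask: (N,) bool, True where a box is responsible for an object.
    """
    noobj_mask = ~obj_mask
    p, t = pred[obj_mask], target[obj_mask]
    coord = lambda_coord * (
        (p[:, 0] - t[:, 0]).pow(2).sum() + (p[:, 1] - t[:, 1]).pow(2).sum()
        + (p[:, 2].clamp(min=0).sqrt() - t[:, 2].sqrt()).pow(2).sum()   # sqrt(w)
        + (p[:, 3].clamp(min=0).sqrt() - t[:, 3].sqrt()).pow(2).sum()   # sqrt(h)
    )
    conf = (p[:, 4] - t[:, 4]).pow(2).sum() + lambda_noobj * (
        pred[noobj_mask][:, 4] - target[noobj_mask][:, 4]
    ).pow(2).sum()
    return coord + conf
```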

4. Underwater Detection CNN Network

For underwater detection, the commonly used methods are not directly applicable because of the low-quality vision and the small objects to be detected. Our original neural network is shown in Figure 7: the input image is resized and batch normalized (BN), the convolution kernels are 3 × 3 and 1 × 1 with a stride of 1, and the network produces the final output feature map. To counter gradient vanishing or explosion in the network, a better proposal is to change the layer-by-layer training of a deep neural network into step-by-step training: the deep network is divided into several subsegments, each containing shallow layers, and shortcut connections are then used so that each subsegment trains on a residual, with each subsegment having its own total learning error. At the same time, this method controls the propagation of gradients well and avoids the vanishing-gradient or exploding-gradient situations that hinder training.
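One such residual subsegment can be sketched as follows, in the Darknet style used by YOLO V3; the channel halving in the bottleneck is an assumption:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual subsegment: 1x1 reduce, 3x3 expand, shortcut connection."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.body(x)   # the shortcut carries the gradient directly
```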

Firstly, 1 × 1 convolution is used to reduce the number of channels and training parameters; then, convolution kernels of different sizes perform the convolution operation; finally, the feature maps are concatenated along the channel dimension. In order to obtain higher-level features, the previous approach was to increase the depth of the network; the proposed network achieves this goal by increasing the width of the network instead. The inception-style module comprehensively considers the results of multiple convolution kernels, so different information from the input image and a better image representation are obtained. In order to prevent vanishing gradients in the middle part of the network structure, we introduced two auxiliary classifiers: softmax operations are applied to the outputs of two of the inception modules, and the auxiliary loss is then calculated. The auxiliary loss is only used for training, not for the prediction process.

4.1. Network Structure Improvement

For underwater object detection, the vision sensors are installed on an underwater robot. In real operation, common methods do not perform well on small objects, because the regular datasets used in experiments consist of normal images, which are high-quality and well lighted. In underwater detection, the objects are often overlapped by other things, such as rocks and corals, the underwater vision is often vague, and the clarity is low. Under these conditions, the network structure should retain more original features. In a deep CNN, deeper layers extract features that are more abstract, so the deep semantic information can be extracted more clearly; on the other hand, shallower layers retain more representation information. The deep semantic information and the representation information can be combined to give a more accurate detection. In this paper, two improved structures are proposed. In the first scheme, a convolution kernel is applied to an intermediate feature map, and a downsampling layer is then added to resize the output to match the scale of the final feature map, with which it is combined to complete the detection; the improvement is shown in Figure 8.

Because of the original information lost in the convolution operation, in the second scheme the downsampling layer is added first, and the convolution layer is then inserted in the network; the result is combined with the last output to achieve the detection. The modification is shown in Figure 9.
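A sketch of the scheme 2 branch described above (downsample first, then convolve, then fuse with the deep output); the channel sizes here are assumptions, since the paper's exact values were not recoverable:

```python
import torch
import torch.nn as nn

class Scheme2Branch(nn.Module):
    """Scheme 2: downsample the shallow feature map first, then convolve,
    and concatenate with the deep output for detection."""

    def __init__(self, shallow_ch=256):
        super().__init__()
        self.down = nn.MaxPool2d(2)                           # downsample first
        self.conv = nn.Conv2d(shallow_ch, 128, 3, padding=1)  # then convolve

    def forward(self, shallow, deep):
        # shallow: (N, 256, 2H, 2W); deep: (N, C, H, W)
        x = self.conv(self.down(shallow))
        return torch.cat([deep, x], dim=1)   # fused map fed to the detection head
```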

There are three fully convolutional feature extractors, each corresponding to a convolutional set, which is the internal convolution-kernel structure of the feature extractor: 1 × 1 convolution kernels are used for dimensionality reduction, 3 × 3 convolution kernels are used for feature extraction, and multiple convolution kernels are interleaved to achieve this purpose. The fully convolutional feature layers are connected, so the input of the current feature layer contains part of the output of the previous layer, and each feature layer outputs its own prediction results. Finally, the results are regressed according to the confidence level to obtain the final prediction.

4.2. Dataset Augmentation

An underwater dataset is difficult to prepare: underwater images and video are not easy to obtain on the internet, and the background of underwater images from the same area is almost identical, so the images in a dataset are similar. Because of these factors, the trained model is often not effective when used in other sea areas. Therefore, the dataset should be modified and augmented to make the deep learning model more generally applicable. The dataset augmentation is mainly based on rotation, flipping, zoom, shift, etc.

The dataset used in this paper is obtained from video recorded by an underwater robot. The total number of images is about 18000, and the images are similar to each other, so rotation and color transformations are applied to transform the original patterns.

Principal component analysis is performed on the RGB pixel values of the three image channels, and the $R$ (Red), $G$ (Green), and $B$ (Blue) direction vectors are obtained, respectively. The eigenvalues and eigenvectors of the 3 × 3 RGB covariance matrix are denoted $\lambda_i$ and $p_i$ ($i = 1, 2, 3$). $\alpha_i$ is a random variable with a mean value of 0 and a variance of 0.1, and it is added in the transformation function as follows:

$$I' = I + \left[ p_1, p_2, p_3 \right] \left[ \alpha_1 \lambda_1,\; \alpha_2 \lambda_2,\; \alpha_3 \lambda_3 \right]^{T}$$
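A sketch of this PCA-based color transformation; the covariance-based eigendecomposition is our reading of the original, and the variable names are illustrative:

```python
import numpy as np

def pca_color_jitter(img, sigma=0.1):
    """PCA color augmentation as described above.

    img: float array (H, W, 3) in [0, 1]; sigma controls the random alphas.
    """
    flat = img.reshape(-1, 3)
    cov = np.cov(flat, rowvar=False)                # 3x3 RGB covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # lambda_i, p_i
    alphas = np.random.normal(0.0, sigma, size=3)   # zero-mean random weights
    shift = eigvecs @ (alphas * eigvals)            # [p1 p2 p3][a1*l1, ...]^T
    return np.clip(img + shift, 0.0, 1.0)
```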

The rotation transformation is presented as

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

where $(x', y')$ are the transformed location coordinates and $\theta$ is the rotation angle.

The shift transformation is given as

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1 & \tan\varphi \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

where $\varphi$ is the shift (shear) angle.

The above three methods are selected randomly to transform the original image, and the total number is augmented to 30000.
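The random selection among the three transformations can be sketched as follows, reusing the pca_color_jitter sketch above; the angle ranges and the nearest-neighbor inverse mapping are illustrative assumptions:

```python
import random
import numpy as np

def random_augment(img):
    """Randomly apply one of: PCA color jitter, rotation, or shear shift."""
    choice = random.choice(["color", "rotate", "shift"])
    if choice == "color":
        return pca_color_jitter(img)
    h, w = img.shape[:2]
    if choice == "rotate":
        theta = np.deg2rad(random.uniform(-30, 30))
        m = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
    else:
        phi = np.deg2rad(random.uniform(-15, 15))
        m = np.array([[1.0, np.tan(phi)],
                      [0.0, 1.0]])
    # inverse-map each output pixel to the source image (nearest neighbor)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs - w / 2, ys - h / 2], axis=-1) @ np.linalg.inv(m).T
    sx = np.clip(coords[..., 0] + w / 2, 0, w - 1).astype(int)
    sy = np.clip(coords[..., 1] + h / 2, 0, h - 1).astype(int)
    return img[sy, sx]
```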

5. Experimental Results

The method proposed in this paper is intended for use on an underwater remotely operated vehicle (ROV) for harvesting marine products. The robot is about 1 m long and 0.8 m wide and weighs 90 kg; marine products are collected by adsorption. The design and the real robot are shown in Figure 10. The robot is remotely operated, and our team plans to upgrade the ROV to semiautonomous operation, so the key technology is how to detect and locate the objects.

5.1. Detection Comparison

The GPU used in these computations is an NVIDIA GTX 1080 Ti, and the total number of images is 30000, all labeled manually one by one. In the deep learning experiments, 8520 images are used for training, 8530 for validation, and 12950 for testing. In object detection, Precision, Recall, and mean Average Precision are commonly used to assess accuracy; the definitions are shown in Figure 11.
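For reference, the standard definitions summarized in Figure 11 are

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives; the Average Precision of one class is the area under its precision-recall curve.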

Mean Average Precision is the mean of the average precision over all detection classes, and it is widely used to evaluate detection systems. In this paper, the dataset is prepared in the Pascal VOC format; the results obtained from Fast RCNN [6] and Faster RCNN [27] are shown in Figure 12, and the concrete data are shown in Tables 1 and 2.

In order to clarify the convergence of the different methods, the mAP values versus iteration count are shown in Figure 13.

From the above results and comparison, it can be seen that the detection accuracy of Faster RCNN is better than that of the other methods, but the difference is not very large. Compared with the original YOLO V3 method [30], the proposed method gives more accurate detection, and scheme 2 is the more effective of the two. The convergence of the methods differs: the YOLO V3 variants converge after 28000 iterations, earlier than Fast RCNN and Faster RCNN. After 40000 iterations, none of the methods improves its detection accuracy further. The reason is the lack of underwater samples in the dataset and the similarity of its images, especially their identical backgrounds; this is the main difficulty of underwater object detection, since deep-sea underwater data is very hard to obtain.

The original network proposed in this paper is not stable; the results fluctuate as the iteration count increases. The modified schemes are proposed to improve the stability and accuracy, as shown in Figure 13. Compared with the other typical methods, the proposed methods give a more accurate result.

The loss function curves are shown in Figure 14. The loss values of all the methods converge, and the loss amplitudes of the YOLO V3 variants are smaller than those of Fast RCNN [6] and Faster RCNN [27]; the convergence speed of the proposed methods is slower than that of the original YOLO V3 method [30].

For object detection, the accuracy of all the above methods is sufficient for application, so real-time detection speed is the more important criterion; the detection speeds are shown in Table 3.

It is clear that the YOLO V3 [30] variants have a very fast detection speed, almost four times faster than Faster RCNN [27]. Based on the accuracy and detection speed analysis, scheme 2 is better than the other methods: it has the same accuracy as Faster RCNN, and its detection speed is around 50 FPS; even on an NVIDIA TX2 card, the detection speed reaches 17 FPS, which is enough for real application.

5.2. Detection Results

The following typical images are used to test the method (scheme 2) proposed in this paper; the images are provided by the "Underwater Robot Picking Contest", and some images were filmed by the underwater ROV.

The scheme 2 method is better at underwater detection because it retains more representation information. The comparison is shown in Figure 15: (a) and (b) are the same image, and the scheme 2 method detects the sea cucumber and the sea urchin in the lower-left corner, whereas the original method misses these objects. In (c) and (d), the left sea cucumber is also missed by the original YOLO V3 [30] method, so the proposed method is clearly more effective. From the detection in image (a), we can also see that the sea cucumber covered by sand in the lower-left corner is detected, which is difficult even for human vision.

In order to verify the method, 8 images are chosen for the experiment; the detection results are shown in Figure 16.

The trained model is applied on the ROV to test the detection effect; the weather was cloudy, and the sea water was very turbid. The real-time detection results are presented in Figure 17.

As seen in Figure 17, some of the objects are missed. The reason is that the dataset is not large enough, and its images are very similar: the light and the background are simple, so when the trained model is used in another sea area or under different environmental conditions, the detection accuracy drops to some extent. Our team is therefore planning to film more underwater images in different sea areas and under different conditions to make the dataset more plentiful and thereby improve underwater detection.

6. Conclusion

Considering the characteristics of underwater vision, new image processing procedures are proposed to deal with the low-contrast and weak-illumination problems. A deep CNN method, based on the approach commonly recognized as the fastest for object detection, is proposed to achieve the detection and classification of marine organisms. Underwater vision is of low quality, and the objects are often overlapped and shaded, so the original YOLO V3 [30] method is not very effective for underwater detection; two schemes are proposed to deal with these problems. Comparison of the detection results with the other methods shows that scheme 2 gives better detection. The trained model is used to assist the ROV in detecting underwater objects; although some objects are missed, the effectiveness and capability of the proposed method are verified by the qualitative and quantitative evaluation results. The proposed method is suitable for our underwater robot, although it is not necessarily better than the typical methods on other datasets. Dropout layers and similar techniques do not bring significant gains in this model; reconstructing the network with a more sophisticated algorithm would be more effective.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

No potential conflict of interest was reported by the authors.

Acknowledgments

We would like to express our gratitude for support from the National Key R&D Program of China (Grant No. 2018YFC0309402) and the Fundamental Research Funds for the Central Universities (Grant No. HEUCF180105).