Abstract

We present a survey of marine object detection based on deep neural networks, which are the state-of-the-art approach for autonomous ship navigation, maritime surveillance, shipping management, and other future intelligent transportation system applications. The fundamental task of maritime transportation surveillance and autonomous ship navigation is to construct a practical visual perception system, which requires marine object detection that is both efficient and accurate. Therefore, high-performance deep learning-based algorithms and high-quality marine-related datasets need to be summarized. This survey summarizes the methods and application scenarios of maritime object detection, analyzes the characteristics of different marine-related datasets, highlights marine detection applications of the YOLO series of models, and discusses the current limitations of deep learning-based object detection and possible breakthrough directions. Large-scale, multiscenario, industrialized neural network training is an indispensable step toward the practical application of marine object detection, and a widely accepted, standardized, large-scale marine object verification dataset should be proposed.

1. Introduction

Information technology and intelligent development have changed the operation mode and direction of many industries. The traditional maritime shipping industry has also gradually advanced from digitization and informatization to intelligence [1]. As a major advance in machine learning over the last decades, deep learning is becoming the most powerful technique for intelligent transportation systems [2]. Deep learning methodologies are applied in various fields of the maritime industry, such as ship classification, object detection, collision avoidance, risk perception, and anomaly detection. The main application directions can be summarized as maritime surveillance and autonomous ship navigation.

Currently, most research focuses on aspects of deep learning in which the technique performs much better than humans; however, the technique is still unable to complete complex tasks. Although seafarers have limitations and there are many examples of human failure in shipping transportation, humans are still the most reliable executors. Therefore, it is necessary to survey the applications of deep learning-based technologies in the maritime field to explore how computer vision can replace or even surpass humans in real-world applications, especially for the object detection task, in which research has grown rapidly in recent years and most evaluation metrics have improved considerably.

Humans perceive the size, brightness, color, and movement state of external objects through their eyes; 80% of human perception information (situation awareness) is obtained through vision. A seafarer’s lookout from the ship’s bridge is limited: the horizon cannot be observed reliably in the direction of the sun [3], and bad weather often affects the seafarer’s judgment. Most collisions and groundings are due to misinterpretation or disregard of the rules or to improper lookout (COLREGS-1972) [4].

Computer vision is an interdisciplinary scientific field concerned with acquiring, processing, and analyzing image information from digital images or videos [5]. From an engineering perspective, it seeks to perceive, understand, and automate tasks that the human visual system performs.

Visual perception is an information-based approach to understanding biological and artificial vision [6]. It refers to the process of organizing, identifying, and interpreting visual information in order to represent and understand the environment. According to this definition, the goal of computer vision is to represent and understand the environment. The core issue of visual perception is how to organize the input image information, identify objects and scenes, and explain the content of the image.

A number of surveys of general object detection have been published in recent years. Zou et al. [7] reviewed more than 400 papers on the development of object detection technology from 1998 to 2018. Their survey covers historical milestone detectors, detection datasets, measurement methods, and the latest detection methods. It also reviews some important detection applications, such as pedestrian detection, face detection, and text detection, and gives an in-depth analysis of the challenges and technological improvements of recent years. Jiao et al. [8] analyzed the existing typical object detection models and methods and discussed how to construct an effective and efficient system architecture based on current detection models. Wu et al. [9] systematically analyzed the existing deep learning-based object detection frameworks and organized their survey into three major parts: (i) detection components, (ii) learning strategies, and (iii) applications and benchmarks. That survey covers a variety of elements affecting detection performance, such as detector architectures, feature learning, proposal generation, and sampling strategies. Chen et al. [10] analyzed the characteristics of the imbalance problem in different kinds of deep detectors and experimentally compared the performance of some state-of-the-art solutions on the COCO benchmark. Qiao et al. [11] combined the visual perception tasks required for maritime surveillance with those required for intelligent ship navigation to form a marine computer vision-based situational awareness complex and investigated the key technologies they have in common. The review by Qiao et al. focuses on ship detection using the ship’s own equipment and covers neither the influence of other possible objects and backgrounds at sea nor the problems that may arise in industrial applications.

Computer vision-based marine object detection, one of the most fundamental and challenging problems in maritime intelligent transportation, has received great attention over the last decades. Figure 1 shows the increasing number of publications whose titles are associated with “marine object detection” and “deep learning” from 2012 to July 2021. Advances in three areas, namely, digital data collection, computing power, and algorithms, have promoted this boom in deep learning research and application in maritime fields [12–15].

In recent years, deep learning-based visual perception has been widely applied to autonomous ship navigation and maritime transportation surveillance for intelligent transportation systems (ITS). Survey articles on maritime applications of computer vision include the following. Qiao et al. [11] summarized the progress made in four aspects: full scene parsing of an image, ship reidentification, ship tracking, and multimodal data fusion with different visual sensors. Prasad et al. [16] provided a comprehensive overview of video processing approaches for object detection in the maritime environment, organized into three modules: horizon detection, static background subtraction, and foreground segmentation. Moniruzzaman et al. [17] described the use of deep learning for underwater imagery analysis and highlighted the relevant deep learning architectures. Hashmani et al. [18] presented a survey of edge detection-based and machine learning-based marine horizon line detection; each study is presented with a recommendation on its suitability for a specific application in the marine environment.

Projection-based, region-based, hybrid, and artificial neural network (ANN) based methods for sea horizon detection have also been discussed [19]. Research on ANN methods in maritime surveillance has made horizon line detection easier, more accurate, and more robust. For optical remote sensing images applied in the maritime field, Li et al. [20] summarized the detection and classification of ships in optical remote sensing images, analyzing both traditional hand-designed feature methods and deep convolutional neural networks (CNN).

The main difference between this paper and the above surveys is summarized as follows: (i) this paper only focuses on the task of deep learning-based object detection in computer vision. (ii) It analyzes the state-of-the-art of marine object detection in maritime surveillance, autonomous ship navigation, and other related applications. (iii) It analyzes and discusses the factors that affect the state-of-the-art solutions, especially the mainstream datasets and the milestone detectors.

The aim of this paper is to provide a survey of the most important approaches in the field of deep learning-based object detection for the maritime transportation system. This survey focuses on describing and analyzing deep learning-based marine object detection tasks. Our contributions are as follows:

(1) Literature evaluation: we summarize the existing application scenarios of visual object detection in the maritime field. (2) Comparison of existing datasets: in practical engineering problems, big data plays an important role in realizing industrial applications. (3) Special emphasis on the role and development direction of visual detection in the autonomous ship navigation scenario. (4) Discussion of the current limitations of object detection based on deep learning and possible breakthrough directions.

The rest of this paper is organized as follows. Section 2 highlights the state-of-the-art methods for general object detection. Section 3 introduces the application of object detection based on deep learning in various subdivisions of maritime affairs. The state-of-the-art backbone-based models and important datasets are described in Section 4.

2. State-of-the-Art of Object Detection

Object detection is the task of detecting instances of targets of a certain class within an image or video.

Generally speaking, the detection task consists of two subtasks. The first is to determine the category of the target and the associated probability, which is a classification task. The second is to determine the specific location of the target, which is a localization task.

As one of the most popular research fields of computer vision, object detection has prospered as its basic idea changed from traditional hand-crafted feature design and shallow classifiers to autonomous feature learning based on deep neural networks.

Before the deep learning era, many tasks were not solved at once but required multiple steps, as in [21]. In the deep learning era, many tasks use an end-to-end framework, that is, a picture is input and the final result is output. The algorithm details and the learning process are all handled by neural networks. This is particularly obvious in the field of object detection.

Under the deep learning architecture, whether it is a clear step-by-step process or the end-to-end method, the object detection algorithm must have three modules. The first is the selection of the detection window, the second is the extraction of image features, and the third is the design of the classifier.

Figure 2 lists the milestone neural network backbones and SOTA object detection methods on a timeline. The year 2012 was a critical point. Although CNNs had been proposed many years earlier, they were still overshadowed by other machine learning algorithms. After 2012, various neural networks and modules were combined, and deep learning-based methods quickly left other methods behind, which allowed deep learning-based application research to be carried out widely in various fields.

2.1. Review of Traditional Object Detection Methods

In 2001, Viola and Jones [22, 23] proposed the Viola-Jones object detection framework. Based on the AdaBoost algorithm [24], the Viola-Jones framework uses Haar-like wavelet features and the integral image technique to perform face detection. This is the first detection method based on Haar + AdaBoost. It is also the first real-time framework for detection. Before the advent of deep learning technology, the Viola-Jones detector was the mainstream framework for face detection algorithms [25, 26].
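The integral image idea can be illustrated with a minimal sketch. The function names (`integral_image`, `rect_sum`, `two_rect_haar`) and the toy image are assumptions made for illustration, not code from [22, 23]; the sketch only shows why a rectangular sum, and hence a Haar-like feature, costs a constant number of lookups once the table has been built.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_haar(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: right half minus left half."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return right_sum - left_sum


# Toy image (hypothetical data): dark left half, bright right half.
image = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
table = integral_image(image)
feature = two_rect_haar(table, 0, 0, 3, 4)
```

A large feature value indicates a vertical edge between the two halves; in a boosted detector, AdaBoost would select and weight many such features.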

The histogram of oriented gradients (HOG) [27] computes a histogram based on gradients rather than color values. It constructs the feature by calculating the gradient direction histogram of local areas of the image. HOG features combined with SVM classifiers have been widely used in image recognition, especially in pedestrian detection [28, 29]. Many related studies have followed, such as rotation-invariant histograms of oriented gradients (Ri-HOG) [30], which adopt annular spatial bin-type cells and apply the radial gradient transform (RGT) to attain gradient binning invariance for feature descriptors.
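As a hedged illustration of the gradient direction histogram, the sketch below computes an unsigned orientation histogram for a single cell. It is a simplification under stated assumptions: central differences on interior pixels, nine bins over 0 to 180 degrees, hard bin assignment, and no block normalization or vote interpolation. The function name `cell_histogram` and the toy cells are hypothetical.

```python
import math


def cell_histogram(cell, n_bins=9):
    """Unsigned gradient-orientation histogram (0 to 180 degrees) of one cell."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences approximate the image gradient.
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist


# Toy cells (hypothetical data): intensity ramps along x and along y.
ramp_x = [[float(x) for x in range(4)] for _ in range(4)]
ramp_y = [[float(y) for _ in range(4)] for y in range(4)]
hist_x = cell_histogram(ramp_x)
hist_y = cell_histogram(ramp_y)
```

The ramp along x puts all of its gradient energy in the first bin (orientation near 0 degrees), while the ramp along y puts it in the bin that contains 90 degrees.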

The DPM [31] algorithm adopts the detection ideas of improved HOG, SVM classifier, and sliding window. For the multiview problem of the target, it adopts the strategy of multicomponent. For the deformation problem of the target itself, it adopts the component model strategy of pictorial structure. DPM is a component-based detection method, which has strong robustness to the deformation of the target. At present, DPM has become the core of many classification, segmentation, pose estimation, and other deep learning-based algorithms [32–35].

In some specific application scenarios, object detection algorithms based on traditional machine learning can still maintain good advantages. In [36], the image data were divided into smaller blocks, each represented by a feature vector. These feature vectors were created by concatenating the subfeatures extracted from the color and texture properties of the images. A classification success rate of 99.62% was achieved by using the Random Forest method, and an average speedup of 3.4 times was achieved by running each method on a clustering architecture of 1 master + 4 workers on Apache Spark.

2.2. Deep Learning-Based Object Detection

CNN is one of the representative algorithms of deep learning [37]. It is the cornerstone of the current great success of deep learning, and it is a type of feedforward neural network (FNN) that includes convolution calculations and has a deep structure. CNN has the ability of representation learning, and it can perform shift-invariant classification of input information according to its hierarchical structure.

LeNet is one of the earliest CNNs. Developed since 1988 [38] through many successful iterations, this pioneering work by Yann LeCun was named LeNet5. The architecture of LeNet5 is based on the view that the features of an image are distributed across the entire image and that convolution with learnable parameters is an effective way to extract similar features at multiple locations with a small number of parameters.

In the nearly 20 years after LeNet was proposed, neural networks were surpassed for a time by other machine learning methods, such as support vector machines. Although LeNet achieved good results on early small datasets, its performance on larger real datasets was not satisfactory. High computational complexity and insufficient computing power were the two main factors limiting its performance.

In 2012, Krizhevsky et al. proposed AlexNet [39]. Specifically, it contains the following four innovations: (a) the GPU is used to accelerate network training for the first time. (b) The ReLU activation function is used instead of the traditional sigmoid and tanh activation functions. (c) Local response normalization (LRN) is used. (d) In the first two fully connected layers, the dropout method is used to randomly deactivate a certain proportion of neurons to reduce overfitting.

AlexNet adds 3 convolutional layers on the basis of LeNet. VGG [40] proposed the idea of building a deep model by reusing a simple basic block, the VGG block. The convolutional layers of a VGG block have the same structure, which means that the input size is equal to the output size. All VGG block configurations are designed using the same principles: the filters (kernels) have a very small receptive field (3 × 3), the convolution stride is fixed to 1 pixel, padding is used to maintain the image resolution after convolution, and max-pooling is performed over a 2 × 2 pixel window with stride 2. This design increased network depth to improve classification accuracy.
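The size arithmetic behind this design can be checked with a short sketch. The formula is the standard output-size rule for convolution and pooling; the 224-pixel input and the (2, 2, 3, 3, 3) block layout are assumptions chosen to resemble a VGG-16-style configuration, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def vgg_block_output_size(size, num_convs):
    """Apply num_convs 3x3/stride-1/pad-1 convolutions, then 2x2/stride-2 max-pooling."""
    for _ in range(num_convs):
        # A 3x3 convolution with stride 1 and padding 1 keeps the resolution.
        size = conv_output_size(size, kernel=3, stride=1, padding=1)
    # The 2x2 max-pooling with stride 2 halves the resolution.
    return conv_output_size(size, kernel=2, stride=2, padding=0)


size = 224  # assumed input resolution
sizes = []
for num_convs in (2, 2, 3, 3, 3):  # assumed VGG-16-like block layout
    size = vgg_block_output_size(size, num_convs)
    sizes.append(size)
```

Under these assumptions, each block halves the resolution, so five blocks reduce a 224 × 224 input to 7 × 7 while depth grows inside each block at no cost in resolution.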

In 2014, GoogLeNet [41] proposed the Inception network structure, which constructs a “basic neuron” structure in order to build a sparse network with high computational performance. The following two innovations should be highlighted: (a) factorization into small convolutions reduces the number of parameters, reduces overfitting, and increases the nonlinear expressive ability; (b) in the Inception module, multiple branches extract high-level features at different levels of abstraction, which enriches the expressive ability.

In practice, the training error tends to increase rather than decrease after too many layers are added. Even though the numerical stability brought by batch normalization makes it easier to train deep models, the problem still exists. He et al. [42, 43] presented the residual block (ResNet) to solve this problem. ResNet can train an effective deep neural network through a cross-layer data channel, and it deeply influenced the design of later deep neural networks [44–47].
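The cross-layer data channel can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the architecture of [42, 43]: fully connected layers stand in for convolutions, batch normalization and biases are omitted, and the names `residual_block`, `linear`, and `relu` are hypothetical.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """y = relu(F(x) + x), with F(x) = W2 * relu(W1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    # The shortcut adds the unchanged input to the learned residual F(x).
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
y_zero = residual_block(x, zeros, zeros)
y_identity = residual_block(x, identity, identity)
```

With all-zero weights the block returns its input unchanged, which illustrates why extra residual layers can fall back to an identity mapping instead of degrading the training error.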

The cross-layer connection design in ResNet has led to several follow-up works, and DenseNet [48] is one of the representative innovations. The main building blocks of DenseNet are dense block and transition layer. The former defines how the input and output are connected, and the latter is used to control the number of channels so that it is not too large.

In the field of computer vision, CNN has always occupied the mainstream position. However, researchers continue to try to introduce the transformer model from natural language processing (NLP) into computer vision; they have proposed the Vision Transformer model and achieved performance close to the current SOTA methods on multiple image processing benchmarks. DETR [49] demonstrated that the transformer model from NLP can also be used for image pretraining and object detection tasks. Han et al. [50] surveyed the research on transformer-based computer vision.

Deep learning-based object detection models still have to solve the three problems of region selection, feature extraction, and classification regression. Generally speaking, they can be divided into two categories: single-stage methods and multistage methods.

The multistage methods have high localization and object recognition accuracy, and example models include R-CNN [51], SPPNet [52], fast R-CNN [53], faster R-CNN [54], mask R-CNN [55], and cascade R-CNN [56]. The R-CNN framework is a typical representative of the multistage method. It uses selective search to generate candidate regions before the detection process, and the number of candidate windows is controlled at about 2000. After these image regions are selected, they are resized and then sent to a CNN for training. Owing to the very powerful nonlinear representation ability of CNN, it can produce good feature representations for each region. The final output of the CNN is passed to multiple classifiers for classification. This method increases the mean average precision on PASCAL VOC [57] from 35.1% to 53.7%, which is comparable to AlexNet’s breakthrough in classification tasks in 2012 and has had a profound impact on the field of object detection. Subsequently, Fast R-CNN proposed RoI pooling to select regional features from the convolutional feature map of the entire image, which solved the problem of repeated feature extraction. Faster R-CNN introduced region proposals based on anchors: the anchors divide the image into n × n regions, and each region gives 9 proposals with different ratios and scales, which solves the problem of repeatedly extracting candidate proposals. Other representative multistage object detectors include pyramid networks [58], context R-CNN [59], and MnasFPN [60].
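The 9 proposals per location can be illustrated with a small anchor generator. The three scales (128, 256, 512 pixels) and three aspect ratios (1:2, 1:1, 2:1) are the values commonly associated with Faster R-CNN and are used here as assumptions; the function name `make_anchors` and the sample center are hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors (x1, y1, x2, y2) centered at (cx, cy)."""
    anchors = []
    for scale in scales:
        area = float(scale * scale)
        for ratio in ratios:  # ratio = height / width
            # Keep the area fixed per scale while changing the aspect ratio.
            w = math.sqrt(area / ratio)
            h = w * ratio
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(300.0, 200.0)
```

A region proposal stage would score each of these boxes as object or background at every location and regress offsets for the positive ones.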

Single-stage methods prioritize inference speed, and example models include YOLO [61], SSD [32], RetinaNet [62], and MobileNetV3 [63]. YOLO is the representative single-stage model; it has no explicit bounding box extraction process. It first resizes the image to a fixed size, divides the input image into a 7 × 7 grid, predicts 2 bounding boxes per grid cell, and performs classification and localization for each bounding box. The YOLO model has undergone many versions of development and has currently reached YOLOv5. YOLO’s approach is fast, but it misses many objects, especially tiny ones. Therefore, the single shot multibox detector (SSD) adds the anchor concept from Faster R-CNN to the YOLO approach and combines the features of different convolutional layers to make predictions. The main contribution of SSD is its multireference and multiresolution detection techniques, which significantly improve the detection accuracy of a one-stage detector, especially for tiny objects [64]. Because the YOLO and SSD series do not extract region proposals, they are faster, but they inevitably lose some information and accuracy. Other representative single-stage object detectors include the RetinaNet [62] and MobileNet [63] series of models.
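The grid formulation can be made concrete with a short sketch. It assumes the layout of the original YOLO, in which each box carries five values (x, y, w, h, confidence) and each cell carries one set of class scores; the default of 20 classes, the 448-pixel image size in the usage lines, and the helper names are assumptions for illustration.

```python
def yolo_output_shape(grid=7, boxes_per_cell=2, num_classes=20):
    """Shape of the prediction tensor: grid x grid x (B * 5 + C)."""
    return (grid, grid, boxes_per_cell * 5 + num_classes)


def responsible_cell(cx, cy, image_w, image_h, grid=7):
    """Grid cell (row, col) that contains the object center (cx, cy)."""
    col = min(int(cx / image_w * grid), grid - 1)
    row = min(int(cy / image_h * grid), grid - 1)
    return row, col


shape = yolo_output_shape()
num_outputs = shape[0] * shape[1] * shape[2]
center_cell = responsible_cell(224, 224, 448, 448)
corner_cell = responsible_cell(448, 0, 448, 448)
```

Because each cell predicts only a few boxes and one class distribution, several small objects whose centers fall in the same cell compete for the same predictions, which is consistent with the missed tiny objects noted above.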

In the field of computer vision, commonly used datasets include Microsoft Common Objects in Context (MSCOCO) [65], Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) [66], Visual Genome [67], Dataset for Object deTection in Aerial Images (DOTA) [68], and PASCAL Visual Object Classes Challenge (PASCAL VOC) [57]. The most popular object detection benchmark is the MSCOCO dataset. Models are typically evaluated according to a mean average precision metric.
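To make the metric concrete, the following sketch computes intersection over union (IoU) and the average precision (AP) of one class on one image; mean average precision is the mean of AP over classes (and, in COCO-style evaluation, over several IoU thresholds). This is a simplified illustration, not the official evaluation code of any benchmark: the greedy matching rule, the all-point interpolation, and the names `iou` and `average_precision` are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_threshold=0.5):
    """AP for one class; detections is a list of (score, box)."""
    if not ground_truths:
        return 0.0
    matched, tp_flags = set(), []
    # Visit detections from the most to the least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truths):
            overlap = iou(box, gt)
            if idx not in matched and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        is_tp = best_iou >= iou_threshold
        if is_tp:
            matched.add(best_idx)  # each ground truth is matched at most once
        tp_flags.append(is_tp)
    precisions, recalls, tp = [], [], 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / rank)
        recalls.append(tp / len(ground_truths))
    # Make precision non-increasing from right to left, then integrate over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical example: two ships, three detections (one false alarm).
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
ap = average_precision(dets, gts)
```

In this example the false alarm ranks above the second true detection, so the AP drops below 1 even though both ships are eventually found.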

In the field of marine object detection, there are few specialized datasets related to maritime supervision and autonomous ship navigation. Zhang et al. [69] proposed the use of generative adversarial networks (GANs) to compensate for insufficient marine data when training object detection neural networks. In [70], the novel idea is to extract the mask of the foreground object and combine it with a new background to automatically generate the location and object information. Datasets related to marine object detection are introduced separately in a subsequent section.

2.3. Optimization Methods

The marine environment is complex and changeable, and the visual data has its own characteristics. Therefore, most of the researchers optimize the model and enhance the data based on the characteristics of the marine environment, to improve the accuracy and speed of marine object detection.

Chen et al. [71] presented a novel hybrid deep learning algorithm that combines an improved generative adversarial network (GAN) with CNN-based detection methods for small ship detection. It uses a Gaussian Mixture Wasserstein GAN with gradient penalty to generate sufficient informative artificial samples of small ships and uses both raw and generated data to achieve highly accurate tiny object detection. Ren et al. [72] proposed an effective ship image recognition method that combines Hu invariant moment features and CNN features to achieve superior ship image recognition. Joining the Hu moment invariant feature to the last pooling layer achieves the highest recognition accuracy on self-built and VAIS datasets. Cao et al. [73] proposed a ship recognition method based on morphological watershed image segmentation and the Zernike moment; although both the Hu moment and the Zernike moment are geometrically invariant, the Hu moment is unstable when the scale changes, whereas the Zernike moment is more stable. Because using rotation to augment the dataset causes errors in object detection tasks, Dong et al. [74] proposed a multiangle box-based rotation-insensitive object detection structure (MRI-CNN) that improves the robustness of the model and reduces the impact of an insufficient dataset on detection performance.

3. Marine Target Detection Application

3.1. Maritime Surveillance

Interest in maritime surveillance has increased in the last decades, and it is a significant issue for assuring the safety and security of international transportation and defense missions. Despite being an important activity, how to conduct maritime surveillance efficiently is still a difficult problem for all countries. Computer vision-based digital maritime surveillance can solve most of these situation awareness issues, which can be divided into three categories: (1) detection and location (e.g., manmade pollution, oil spills, maritime hazardous events, noxious substances, and crashed plane debris), (2) tracking (e.g., ships, shipwrecks, lifeboats, illegal fisheries, illegal ballast water discharge, and smuggling), and (3) behavioral recognition (e.g., abnormal path confirmation, ship rendezvous, and high-speed objects on the sea surface). Most researchers have focused on shore-based maritime surveillance, high-resolution satellite image surveillance, synthetic aperture radar (SAR) remote sensing, and so on.

3.1.1. Shore-Based Surveillance

In the current environment, traditional marine video surveillance, which simply relies on a large number of maritime managers, is no longer able to meet the needs of safe navigation. Computer vision combined with image processing technology has become the mainstream of maritime surveillance. In [75], experiments show that ship detection based on YOLOv3 achieves high accuracy in different scenes such as small traffic flow, foggy ship navigation, large traffic flow, and small imaging scale. The YOLOv3 algorithm uses k-means clustering to obtain the prior boxes for bounding box prediction and combines multiscale features for ship identification; with its multiscale detection mechanism, YOLOv3 can adapt to port scenarios with different traffic flows and has strong generalization ability.
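The clustering of box dimensions can be sketched as follows. The sketch assumes the distance d = 1 − IoU between (width, height) pairs that is commonly used for YOLO prior boxes, with a deterministic initialization instead of random or k-means++ seeding; the function names and the sample box sizes are hypothetical and do not come from [75].

```python
def wh_iou(a, b):
    """IoU of two boxes given as (w, h) and aligned at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iterations=20):
    """Cluster (w, h) pairs with distance 1 - IoU and return k sorted centroids."""
    # Deterministic start: pick evenly spaced boxes as the initial centroids.
    centroids = [boxes[i * len(boxes) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # The nearest centroid is the one with the highest IoU.
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(m[0] for m in members) / len(members)
                mean_h = sum(m[1] for m in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Hypothetical label sizes: small buoys and large, elongated ships.
box_sizes = [(10, 12), (12, 10), (11, 11), (100, 40), (110, 44), (90, 36)]
priors = kmeans_anchors(box_sizes, k=2)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, so the resulting priors reflect the shapes that actually occur in the training labels.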

Shao et al. [76] proposed a ship detection model based on a saliency-aware CNN framework that realizes real-time detection on surveillance video taken by a camera. It predicts the category and position of the ship and uses global contrast-based salient region detection to correct the location. Built on the YOLOv2 pipeline, the saliency-aware CNN framework improves the accuracy and robustness of ship detection under complex coastal conditions. Liu et al. [77] improved the YOLOv3 anchor method and feature fusion structure: the GIoU loss was added to the loss function, and a cross PANet was proposed to replace the FPN structure in YOLOv3. The results show that the proposed method can significantly improve the accuracy of YOLOv3 in detecting sea surface objects. The SeaBuoys dataset was established according to actual sea surface conditions, and comparative experiments were carried out with the existing SeaShips dataset [78].
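For reference, the generalized IoU (GIoU) subtracts from the IoU the fraction of the smallest enclosing box that is covered by neither box, so that non-overlapping boxes still receive a useful training signal. The sketch below follows this general definition and is not the implementation of [77]; the function names are hypothetical, and boxes are assumed to be (x1, y1, x2, y2) with positive area.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box that encloses both inputs.
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclosing = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclosing - union) / enclosing


def giou_loss(a, b):
    """Loss in [0, 2]: 0 for identical boxes, larger for distant boxes."""
    return 1.0 - giou(a, b)


same = giou((0, 0, 10, 10), (0, 0, 10, 10))
apart = giou((0, 0, 10, 10), (20, 0, 30, 10))
```

Two disjoint boxes have an IoU of zero regardless of their distance, whereas their GIoU becomes more negative as they move apart, which gives the regression a gradient to follow.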

Li et al. [79] proposed a new ship detection from visual image (SDVI) algorithm, an enhanced YOLOv3 tiny network for real-time ship detection. The method uses convolution layers instead of max-pooling layers, expands the channels of the prediction network, and introduces the CBAM attention module into the backbone network, which makes the model focus more on the target. The algorithm achieves a 9.6% improvement in mAP and a faster detection speed. Huang et al. [80] used k-means++ clustering on the dimensions of bounding boxes to obtain prior boxes for the model, improved the YOLOv3-Darknet53 network, added a skip connection mechanism, reduced feature redundancy, and improved the ability to detect tiny ships. While real-time performance is maintained, the precision of ship identification is improved by 12.5%, and the recall rate is increased by 11.5%.

In [12], a “reference model” pretrained on the Pascal VOC image dataset was compared with a “proposed model” of the same structure trained on a specific maritime dataset (Singapore Maritime Dataset, SMD). Experiments show that, on the SMD verification dataset, the proposed model is about twice as accurate as the reference model in terms of IoU and recall rate. Cane et al. [81] evaluated semantic segmentation networks in the context of an object detection system for maritime surveillance. The authors indicate that SegNet and ENet achieve higher detection accuracy and precision. Considering the actual conditions of maritime surveillance, the ENet model would be the most suitable.

3.1.2. High-Resolution Satellite Image Surveillance

High-resolution color remote sensing ship images taken from short distances provide advantages in ship detection applications. But the analysis of these high-dimensional images is complicated and requires a long time [36]. Synthetic aperture radar (SAR) is an active side-looking radar that can overcome weather interference and provide high-resolution images. SAR creates two-dimensional images or three-dimensional reconstructions of objects; it is typically mounted on a moving platform, such as an aircraft or spacecraft, and has its origins in an advanced form of side-looking airborne radar (SLAR).

Ghosh [82] proposed an efficient onboard detection system connected with a medium resolution wide amplitude optical camera and solved the problem of limited satellite coverage and limited simulation and equipment. Tian et al. [83] proposed a detection framework based on remote sensing image combining image enhancement module and dense feature reuse module to improve the object detection capability. Chen et al. [84] proposed an improved YOLOv3 based on an attention mechanism for fast and accurate ship detection, which accelerates detection speed to achieve real-time detection effect and improves the level of maritime surveillance.

Wang et al. [85] proposed an improved YOLOv3 algorithm for ship detection in optical remote sensing images. Adding the squeeze-and-excitation (SE) structure to the backbone improves the feature extraction capability, and fusing multiscale feature maps improves the detection accuracy. The method achieves a detection speed of about 27 fps on an NVIDIA RTX 2080 Ti, with recall (R) = 95.32% and precision (P) = 95.62%. Cao et al. [86] conducted a similar study, in which a feature pyramid structure is introduced to combine the deep semantic information with the shallow semantic information and multiscale feature maps are integrated to improve the detection of small objects.
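The squeeze-and-excitation structure can be illustrated in a few lines. The sketch follows the general SE pattern (global average pooling, two fully connected layers, a sigmoid gate, channel-wise rescaling) and is not the network of [85]; biases are omitted, and the function name, weights, and feature maps are hypothetical.

```python
import math


def squeeze_excite(feature_maps, w1, w2):
    """Rescale C channels; w1 has C/r rows of length C, w2 has C rows of length C/r."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one gate per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value of a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feature_maps, gates)]


# Two channels of size 1 x 2 (hypothetical data), reduction ratio r = 2.
features = [[[2.0, 4.0]], [[6.0, 8.0]]]
w1 = [[1.0, 1.0]]
w2 = [[0.0], [0.0]]  # zero weights give a gate of 0.5 for every channel
scaled = squeeze_excite(features, w1, w2)
```

In a trained network, the gates differ per channel, so informative channels are emphasized and less useful ones are suppressed at little extra cost.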

Tang et al. [87] proposed a ship detection method based on noise classification and target extraction. The method consists of three modules: the NLC (noise level classifying) module, the STPAE (SAR target potential area extraction) module, and a recognition module based on YOLOv5. The advantage of this model is that it can reduce the interference of coastal noise with ship detection. Tang et al. [88] introduced a novel high-resolution image network-based approach based on the preselection of a region of interest (RoI). It designs an HSV (hue, saturation, and value) module composed of four cores, namely, background removal, noise removal, box-finding, and noise deletion, which can obtain useful RoIs in a short time.

3.1.3. Airborne Maritime Surveillance

Maritime surveillance ships are slow and are difficult to operate in complex seas or to dispatch in busy ports. At present, although some maritime regulatory agencies operate manned helicopters, these cannot take off under hazardous sea conditions because the safety of personnel must be ensured. At the same time, their operating cost is high, and they are unable to meet high-density, high-intensity maritime surveillance requirements.

Unmanned aerial vehicles (UAVs) can be remotely controlled or can fly in autonomous mode. A UAV is a miniaturized and intelligent flight platform that can complete one or more tasks by carrying different task modules, and it has great potential in maritime applications. Solving the problems of flight stability, data transmission, shipboard electromagnetic compatibility, and convenient take-off and landing in the marine environment would enable drones to play a greater role in maritime applications. At present, classified by flight principle, UAVs suitable for maritime applications mainly include fixed-wing UAVs, unmanned helicopters, multirotor UAVs, and vertical take-off and landing fixed-wing UAVs. Each type of UAV has its own advantages in maritime applications.

Ribeiro et al. [89] presented an airborne maritime surveillance dataset captured by a small size UAV. This dataset presents object examples ranging from cargo ships, small boats, and life rafts to oil spill. Due to the continuous shaking of the UAV’s camera, it is very difficult to label data on the acquired video dataset. The authors proposed a new labeling tool, which is developed in C++, and the OpenCV library is used to create labels manually. Reference [90] presents an approach to detect boats in a maritime surveillance scenario using a small UAV. This work relies on CNN to perform robust detection even in the presence of distractors like wave crests and sun glare. Reference [91] explores maritime search and rescue missions by using experimental UAV data to detect the sea surface object. Reference [92] addresses the development of an integrated system to support maritime situation awareness based on UAVs, emphasizing the role of the automatic detection subsystem.

Xiu et al. [93] contributed a system that includes a maritime unmanned aerial vehicle (Mar-UAV) with a high-resolution camera and an Automatic Identification System (AIS). Multifeature information, including position, scale, heading, and speed, is used to match real-time images with AIS messages. The results demonstrate that the proposed algorithm and the Mar-UAV system are valuable for achieving autonomous maritime surveillance. Reference [94] presents a method to learn spatial and temporal features from video sequences; the temporal features are intended to improve the detection of maritime objects in scenes that contain strong distractors such as glare and wakes. The proposed method is composed of two main parts: a spatial feature extractor based on the VGG network and a recurrent layer, the ConvLSTM.
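The idea of matching visual detections to AIS messages through several features can be sketched as follows. This is an illustrative sketch only and does not reproduce the algorithm of [93]: the dictionary keys, the weights, the cost threshold, and the greedy assignment are all assumptions, and the scale feature is omitted for brevity.

```python
import math

def match_cost(det, ais, weights=(1.0, 0.5, 0.5)):
    """Weighted dissimilarity between a visual detection and an AIS report.

    Both arguments are dicts with hypothetical keys: x and y (metres in a
    shared local frame), heading (degrees), and speed (knots).
    """
    w_pos, w_head, w_speed = weights
    dist = math.hypot(det["x"] - ais["x"], det["y"] - ais["y"])
    # Wrap the heading difference into the range [0, 180] degrees.
    dh = abs(det["heading"] - ais["heading"]) % 360.0
    dh = min(dh, 360.0 - dh)
    ds = abs(det["speed"] - ais["speed"])
    return w_pos * dist + w_head * dh + w_speed * ds

def greedy_match(detections, ais_reports, max_cost=100.0):
    """Pair detections with AIS reports, cheapest pairs first."""
    pairs = sorted(
        (match_cost(d, a), i, j)
        for i, d in enumerate(detections)
        for j, a in enumerate(ais_reports)
    )
    used_det, used_ais, matches = set(), set(), {}
    for cost, i, j in pairs:
        if cost > max_cost:
            break  # every remaining pair is even more expensive
        if i in used_det or j in used_ais:
            continue
        matches[i] = j
        used_det.add(i)
        used_ais.add(j)
    return matches

detections = [
    {"x": 0.0, "y": 0.0, "heading": 350.0, "speed": 10.0},
    {"x": 500.0, "y": 0.0, "heading": 90.0, "speed": 5.0},
]
ais_reports = [
    {"x": 505.0, "y": 3.0, "heading": 92.0, "speed": 5.5},
    {"x": 4.0, "y": -3.0, "heading": 5.0, "speed": 9.5},
]
matches = greedy_match(detections, ais_reports)
```

Detections left without a partner under the cost threshold would correspond to vessels that are not broadcasting AIS, which is the case of most interest for surveillance.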

3.1.4. Satellite Radar Image Surveillance

Ship detection in synthetic aperture radar (SAR) images has been widely studied due to its indispensable role in military intelligence acquisition, maritime management, and many civil fields. However, the limited bandwidth and computing power available in satellite scenarios largely hinder the deployment of SAR image-based ship detection. In addition, searching for targets of interest in massive SAR image collections by eye is time-consuming and often impractical. Therefore, lightweight neural network models are widely used.

Chen et al. [95] proposed a novel learning scheme for training a lightweight ship detector called Tiny YOLO-Lite, which simultaneously (1) reduces the model storage size, (2) decreases the number of floating-point operations (FLOPs), and (3) maintains high accuracy at faster speed. Reference [96] proposes a lightweight CNN, LiraNet, that combines dense connections, residual connections, and group convolution. It uses a two-layer predictor and adds residual modules so that features are transmitted more easily. Experimental results show that the Lira-YOLO network has low complexity, requiring only 2.980 Bflops, and that its parameters occupy only 4.3 MB. The mean average precision (mAP) on the Mini-RD dataset and the SAR Ship Detection Dataset (SSDD) reaches 83.21% and 85.46%, respectively, which is comparable to tiny-YOLOv3.
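Why group convolution reduces both model size and computation can be seen from a short calculation. The sketch below counts the parameters and multiply-accumulate operations (MACs) of a single convolution layer without bias; the layer sizes are hypothetical and are not taken from [95] or [96]. FLOPs are often reported as roughly twice the MAC count.

```python
def conv_cost(c_in, c_out, k, h_out, w_out, groups=1):
    """Return (parameters, MACs) of a k x k convolution without bias."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    params = (c_in // groups) * k * k * c_out
    # Every output position applies all filters once.
    macs = params * h_out * w_out
    return params, macs

# Hypothetical layer: 64 -> 128 channels, 3 x 3 kernel, 52 x 52 output map.
standard = conv_cost(64, 128, 3, 52, 52)
grouped = conv_cost(64, 128, 3, 52, 52, groups=4)
```

With four groups, both the parameter count and the MAC count fall to one quarter of the standard layer, at the price of removing cross-group channel mixing.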

Yang et al. [97] proposed a one-stage object detection framework based on RetinaNet and the rotatable bounding box (RBox) to address problems such as feature scale mismatch and task contradiction. Scale calibration is proposed to make the scale distributions of the main feature map and the object feature map consistent. Experimental results show that the average accuracy improved by 13.26%, 9.29%, 8.92%, 8.55%, and 4.55% compared to the other four advanced RBox-based ship detection methods at the IoU threshold of 0.5.

In Arctic waters, the vast majority of objects are drifting icebergs, which can be mistaken for ships during navigation and ocean surveillance. Hass and Jokar Arsanjani [98] presented a YOLOv3-based deep learning model that uses SAR images to discriminate icebergs from ships and that could be used for mapping ocean objects ahead of a journey.

To solve the problem of small objects and multiobject ship detection in complex scenarios, [99] proposes a detection method based on an optimized feature pyramid network (FPN) model. The results show that the small ship detection accuracy reaches 98.62%, and the proposed model has higher accuracy and better comprehensive performance compared with YOLO.

3.1.5. Other Applications

For military defense and intelligent early warning, an infrared intrusion object detection algorithm based on a neural network is proposed in [100]. The extended CNN designed for this algorithm can fuse and expand the image features, enhance object filtering, and improve background suppression. Xie et al. [101] proposed an inspection system based on tracking technology, which can automatically process ship inspection video and predict suspicious areas where cracks may exist. Intelligent computer vision is also a key technology for the development and utilization of deep-sea resources. Han et al. [102] combined the max-RGB method and the shades-of-gray method to enhance underwater vision. In [103], vision-based object detection for underwater robots is proposed. To overcome the limitations of cameras and to make use of the advantages of image data, a number of approaches have been tested, including color restoration algorithms for degraded underwater images and detection and tracking methods for underwater target objects.
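The two color constancy estimators named above can be sketched in a few lines. Max-RGB takes the per-channel maximum as the illuminant, while shades of gray uses a Minkowski p-norm of each channel. How [102] combines the two is not described here, so the `combined_illuminant` blend, its `alpha` weight, and the choice p = 6 are assumptions for illustration only.

```python
def max_rgb_illuminant(pixels):
    """Estimate the illuminant as the per-channel maximum (max-RGB)."""
    return tuple(max(px[c] for px in pixels) for c in range(3))

def minkowski_illuminant(pixels, p):
    """Estimate the illuminant with a Minkowski p-norm (shades of gray).

    pixels: list of (r, g, b) tuples with values in [0, 1].
    p = 1 reduces to the gray-world mean; large p approaches max-RGB.
    """
    n = len(pixels)
    return tuple(
        (sum(px[c] ** p for px in pixels) / n) ** (1.0 / p) for c in range(3)
    )

def combined_illuminant(pixels, p=6, alpha=0.5):
    """Hypothetical blend of the two estimates (not the rule used in [102])."""
    e_max = max_rgb_illuminant(pixels)
    e_sog = minkowski_illuminant(pixels, p)
    return tuple(alpha * a + (1.0 - alpha) * b for a, b in zip(e_max, e_sog))

def correct(pixels, illuminant):
    """Rescale each channel so that the estimated illuminant becomes gray."""
    mean_e = sum(illuminant) / 3.0
    gains = tuple(mean_e / max(e, 1e-6) for e in illuminant)
    return [tuple(min(1.0, px[c] * gains[c]) for c in range(3)) for px in pixels]

# Toy underwater pixels with a strong blue-green cast.
pixels = [(0.1, 0.4, 0.5), (0.2, 0.6, 0.8), (0.05, 0.3, 0.4)]
corrected = correct(pixels, max_rgb_illuminant(pixels))
```

In this toy example the second pixel equals the max-RGB illuminant, so after correction its three channels become equal, and the attenuated red channel is amplified for all pixels.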

3.2. Vision-Based Autonomous Ship Navigation

Vision-based object detection is an essential task for autonomous ship navigation. However, sunlight reflection, camera motion, and illumination changes may cause false detections in the maritime environment. Farahnakian and Heikkonen [104] proposed three fusion architectures (pixel-level, feature-level, and decision-level) to fuse two imaging modes (visible and infrared); they employed deep learning for performing fusion and detection. Pan et al. [105] proposed a navigation mark classification and identification model based on deep learning (RMA: ResNet-Multiscale-Attention), which can distinguish fine-grained differences between navigation marks. It requires no supervision information other than the label, and it is trained end to end.

3.2.1. Horizon Detection

The marine horizon is the most significant semantic boundary for segmenting the image into sea and sky. References [18, 19] have summarized marine horizon line detection. In past research, many robust marine horizon detection methods have been proposed. For the straight-line model of the marine horizon, the traditional methods include the following: (a) Linear fitting: the selection of candidate points in this method is easily disturbed by complex sea states and sun glint [106]. (b) Image segmentation: the optimal segmentation threshold of this method is difficult to determine adaptively [107], and the algorithm's anti-noise ability is insufficient. (c) Gradient significance: each interference factor has abundant edges, and these edges have gradient values similar to or even higher than the marine horizon, which easily causes false detection [108].
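The sensitivity of linear fitting to badly chosen candidate points can be illustrated with an ordinary least-squares fit of the line y = ax + b. This is a generic sketch rather than the method of [106]; the candidate points are invented for illustration.

```python
def fit_horizon(points):
    """Least-squares fit of y = slope * x + intercept to candidate points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("candidate points must span more than one column")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Candidates lying exactly on the horizon y = 0.1 x + 200 (pixel units).
clean = [(0, 200), (100, 210), (200, 220)]
slope_clean, intercept_clean = fit_horizon(clean)

# One spurious candidate, for example from sun glint on the water.
with_glint = clean + [(100, 400)]
slope_glint, intercept_glint = fit_horizon(with_glint)
```

A single spurious point moves the fitted intercept from 200 to 247.5 pixels in this example, which is why robust selection of candidate points matters for this family of methods.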

Typical marine horizon detection relies on edge information, which requires two important issues to be overcome: unstable edge detection and a complex marine environment with shore background and varying weather conditions. Jeong et al. [109] proposed a novel method for horizon detection that combines a multiscale approach and CNN; it has a median positional error (MPE) of less than 1.7 pixels from the center of the horizon and a median angular error (MAE) of approximately 0.1 degrees. This method offers both high speed and high accuracy, but it may fail in some scenarios, such as when no obvious line feature is present. Prasad et al. [110] presented a novel method called multiscale consistence of weighted edge Radon transform, abbreviated as MuSCoWERT. It has a median error of about 2 pixels (less than 0.2%) from the center of the actual horizon and a median angular error of less than 0.4 deg. Compared with traditional methods (ENIW [111], FGSL [112], MuSMF [113], IntGF, IntG, Hough [114], and GWR [115]), MuSCoWERT performs better. Jeong et al. [116] proposed a fast method for detecting the horizon line in maritime scenarios by combining a multiscale approach and region-of-interest detection. Experimental results show that the proposed method can accurately identify the region of interest on a moving platform and ensure the robustness of sea-sky-line detection; it is also less affected by ships, light changes, waves, and wakes. In [117], a novel algorithm based on probability distribution and physical characteristics is introduced. The authors designed a hybrid method, which consists of sea-sky region extraction and horizon estimation based on color, texture, and context information. The proposed algorithm precisely detects the horizon not only in clear images but also in blurred images, even with a splashed camera.
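The positional and angular error figures quoted above can be computed as in the following sketch. It assumes that each horizon is represented as a line (slope, intercept) in pixel coordinates and that the positional error is measured at the horizontal center of the image; the exact protocols of [109] and [110] may differ, and the sample values are invented.

```python
import math
import statistics

def horizon_errors(predicted, truth, width):
    """Return (median positional error in pixels, median angular error in degrees).

    predicted, truth: per-frame lists of (slope, intercept) pairs.
    width: image width in pixels; position is compared at x = width / 2.
    """
    cx = width / 2.0
    pos_errors, ang_errors = [], []
    for (p_slope, p_icpt), (t_slope, t_icpt) in zip(predicted, truth):
        pos_errors.append(abs((p_slope * cx + p_icpt) - (t_slope * cx + t_icpt)))
        p_angle = math.degrees(math.atan(p_slope))
        t_angle = math.degrees(math.atan(t_slope))
        ang_errors.append(abs(p_angle - t_angle))
    return statistics.median(pos_errors), statistics.median(ang_errors)

# Three frames with a level ground-truth horizon at row 500.
truth = [(0.0, 500.0)] * 3
predicted = [(0.0, 501.0), (0.0, 498.0), (0.0, 505.0)]
mpe, mae = horizon_errors(predicted, truth, width=1920)
```

Using the median rather than the mean keeps a few badly failed frames, such as the third one here, from dominating the reported error.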

As shown in Table 1, mean height deviation (MHD) and angle deviation (AD) have been accepted as evaluation standards by most researchers. Non-deep-learning methods account for the majority of marine horizon detection work. In recent years, a growing number of marine horizon detection studies have been based on the Singapore Maritime Dataset, which facilitates comparison between different algorithms. Although such comparison is still affected by objective factors such as differences in computing power, it will help point the direction of future research.

Accurate identification, tracking, and positioning of the marine horizon, as well as an accurate description of water boundary lines, are basic requirements for the safe navigation of autonomous ships. However, most current research focuses on pure marine horizon detection, and there is no in-depth research on tracking, positioning, and accurately describing the marine horizon.

From the perspective of the marine horizon detection process, this survey summarizes and analyzes the key points of the existing marine horizon detection methods and identifies the topics that still need to be studied. Marine horizon detection faces severe challenges in complex water environments and in engineering applications with high real-time requirements. Future work should improve the real-time performance and environmental adaptability of the algorithms.

3.2.2. Surface Moving Object Detection

Over the last decades, a lot of researchers have worked on the big challenge of detection of moving ships in various complex marine environments. Reference [122] presents a ship object detection algorithm to achieve efficient visual maritime surveillance from nonstationary surface platforms.

Maritime target detection based on YOLO has made great achievements in recent years. Chen et al. [123] proposed a YOLO-based integrated framework to detect ships from maritime surveillance videos and accurately identify ship behavior in continuous frames. The average check rate reaches 92.85%, and the registration rate reaches 93.91%. The results show that the proposed method successfully identifies the historical behavior of the detected object, which helps managers understand the historical navigation, predict the future navigation trajectory, and implement early warning measures to ensure maritime traffic safety. Li et al. [124] proposed a lightweight ship detection model (LSDM) based on YOLOv3 and DenseNet, in which the backbone network is improved by using dense connections inspired by DenseNet, and the feature pyramid networks are improved by using spatial separation convolution to replace the original convolution network. The proposed model reaches an average accuracy of 94% for ship detection with only one-third of the parameters of the YOLOv3 network, and the LSDM tiny network reaches double the detection speed and an average accuracy of 93.5% with just one-eighth of the parameters of the YOLOv3 network.

Qiao et al. [125] proposed a detection framework based on YOLOv3, which integrates multimodel and multicue (C) pipeline. The multimodel component addresses the unstable tracking of maneuvering targets in a traditional single-model Kalman tracker (such as the constant velocity (CV) model), and the multicue component addresses the frequent identity switches (IDS) caused by motion blur and occlusion. Experiments on two public maritime datasets showed that the proposed method achieved state-of-the-art performance, not only in IDS but also in frame rates. Huang et al. [126] solved the problem of low recognition rate on a small dataset and improved the real-time performance of ship detection. Their method provides high-precision, real-time ship detection for smart port management and USV visualization.
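The single-model baseline referred to above, a Kalman tracker with a constant velocity (CV) motion model, can be sketched for one image coordinate as follows. This is a textbook filter written out by hand, not the multimodel tracker of [125]; the class name and the noise settings are assumptions.

```python
class ConstantVelocityKalman:
    """One-dimensional Kalman filter with state (position, velocity)."""

    def __init__(self, position, velocity=0.0, p_var=10.0, v_var=10.0,
                 process_var=0.01, meas_var=1.0):
        self.p, self.v = position, velocity
        self.P = [[p_var, 0.0], [0.0, v_var]]       # state covariance
        self.q, self.r = process_var, meas_var      # noise variances

    def predict(self, dt=1.0):
        """Propagate the state with F = [[1, dt], [0, 1]]."""
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        # P <- F P F^T + Q, written out element by element.
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.p

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                    # innovation variance
        k0, k1 = a / s, c / s             # Kalman gain
        y = z - self.p                    # innovation
        self.p += k0 * y
        self.v += k1 * y
        # P <- (I - K H) P
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.p

# Track a target that moves 2 pixels per frame, starting from rest.
kf = ConstantVelocityKalman(position=0.0)
for t in range(1, 31):
    kf.predict()
    kf.update(2.0 * t)
```

The filter recovers the velocity of a target that really moves at constant speed, but a maneuvering vessel violates the CV assumption, which is the instability that a multimodel design is meant to address.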

We found that current research on marine moving object detection has shortcomings and that datasets captured from the perspective of the ship bridge are difficult to obtain. So far, there is no suitable benchmark. Figure 3 shows an example of an onboard visual navigation dataset from the authors’ lab.

Benchmark datasets containing various marine scenes from the perspective of ship bridges need to be presented, and all relevant studies should have unified standards and recognized evaluation mechanisms. Figure 4 shows an example result of onboard object detection. In the verification dataset, the missed detection rate of small targets should be included, which is essential for the autonomous navigation of large ships.

As shown in the upper left corner of Figure 4, the obstacle of the marine surface and navigation aid signs should be clearly identified, and the trained model needs to understand the different meanings of different objects for navigation. The identification of near-shore constructions and moving objects on the water, the background lights on the shore, and the lights on the ship is still a critical problem for object detection.

3.2.3. Background Subtraction

In a dynamic marine environment, the detector needs to separate moving objects from a background that is itself changing; the scene also contains a large number of linear features and constantly changing lighting conditions. Even advanced sea level detection and video frame registration techniques face challenges, and many background subtraction and object detection methods perform poorly on video streams.

For example, [84] designs a multiclass ship dataset (MSD) to highlight the difference between the ship and the background; it can improve the accuracy of tiny ship detection.

Prasad et al. [127] provided a benchmark of the performance of 23 classical and state-of-the-art background subtraction algorithms on visible range and near-infrared range videos in the Singapore Maritime Dataset. This paper indicates the limitations of the conventional performance evaluation criteria for maritime vision and proposes new performance evaluation criteria that are better suited to this problem.

Although these 23 methods have been successful elsewhere, their recall and accuracy in the maritime setting are extremely low. Even the most advanced background subtraction (BS) techniques cannot cope well with the marine environment, which means that new BS algorithms need to be formulated for maritime vision. The traditional performance evaluation index IoU is modified to a new evaluation index, IOG, and a new index, bottom edge proximity (BEP), is proposed to judge whether the bottoms of the detected object (DO) and the ground truth (GT) are close. This indicator tolerates more extensive detections in the presence of wakes.
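The difference between these indices can be made concrete with a small sketch. It assumes that IOG denotes the intersection area divided by the ground-truth area, and it uses a simplified, hypothetical form of BEP (one minus the bottom-edge offset normalized by the ground-truth height); the exact definitions in [127] may differ. Boxes are (x1, y1, x2, y2) with the y axis pointing down.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2)."""
    return (box[2] - box[0]) * (box[3] - box[1])

def intersection_area(a, b):
    """Overlap area of two axis-aligned boxes (zero if they are disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def iou(do, gt):
    """Intersection over union of detected object and ground truth."""
    inter = intersection_area(do, gt)
    return inter / (area(do) + area(gt) - inter)

def iog(do, gt):
    """Intersection over ground-truth area (assumed meaning of IOG)."""
    return intersection_area(do, gt) / area(gt)

def bottom_edge_proximity(do, gt):
    """Simplified, hypothetical BEP: 1 when the bottom edges coincide."""
    gt_height = gt[3] - gt[1]
    return max(0.0, 1.0 - abs(do[3] - gt[3]) / gt_height)

# The ship occupies the ground-truth box; the detection also covers its wake.
gt_box = (100.0, 100.0, 200.0, 150.0)
do_box = (100.0, 100.0, 400.0, 150.0)
scores = (iou(do_box, gt_box), iog(do_box, gt_box),
          bottom_edge_proximity(do_box, gt_box))
```

Here IoU drops to one third because the wake inflates the detected box, while IOG and the simplified BEP both remain at 1, so the detection is still counted as covering the vessel.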

Zhang et al. [122] proposed a discrete cosine transform (DCT) based ship detection algorithm which can extract the sea regions accurately for complex background modeling. The main contribution is to provide more accurate detection results within the complex sea surface background, which is of vital importance for ship-/buoy-based surveillance applications in the presence of large waves. The independent detectors for sky and sea regions increase the detection sensitivity to small objects around the horizon.

The lighting environment at sea is ever-changing, so a method or model tuned to one weather and lighting condition is ineffective under others. Establishing a framework that can seamlessly select models and methods for different lighting conditions is essential for the practical application of maritime image processing. Prasad et al. [13] discussed the technical challenges in maritime image processing and machine vision problems for video streams generated by cameras. The challenges arise from the dynamic nature of the background, the unavailability of static cues, the presence of small objects at distant backgrounds, and illumination effects.

Chan et al. [128] compared thirty-seven background subtraction methods on data from nonstatic electro-optical sensors (combining visible-light and infrared cameras); the results indicate that background subtraction algorithms of the multiple-features category handle maritime challenges better and thereby achieve higher accuracy on both visible-light and infrared video.

3.2.4. Other Applications

Augmented reality (AR) combines computer-generated graphic information with real camera views and is an effective display technology. Reference [129] used additional location data retrieved from the AIS device to improve retrieval performance based on the characteristics of the sea-sky-line boundary and used the k-means clustering algorithm and pixel contours to distinguish the sea-sky-line. The authors also emphasized that the proposed system is based on CCTV and computer image processing; therefore, its performance is influenced by sea conditions, for example, low-visibility conditions such as fog, dark nights, and heavy rain.

4. Discussion

4.1. Model Comparison

Both accuracy and real-time performance are important for object detection in autonomous ship navigation and maritime surveillance. It is therefore necessary to develop maritime image and video perception based on improved regression-based deep convolutional networks. The YOLO series architecture is usually the first neural network to be considered, for example, in [12, 71, 74, 75, 77, 79, 82, 84, 88, 123, 126, 130–132]; these improvements have contributed to a stronger baseline across YOLO series detectors.

As shown in Table 2, we collect some marine object detection models based on YOLO backbone networks. Owing to the advantages of YOLO in detection efficiency and speed, most of this research focuses on the ship detection task. Experiments on public datasets (such as SMD and SeaShips) show that most of the enhanced YOLO series models improve performance to varying degrees.

4.2. Marine Datasets Comparison

Moosbauer et al. [144] proposed a benchmark that is based on the Singapore Maritime Dataset (SMD). As shown in Table 3, this dataset includes onshore and onboard objects in the marine environment; it provides Visual-Optical and Near-Infrared videos along with annotations for object detection. The authors evaluate the applicability of two state-of-the-art object detection models in the maritime domain: Faster R-CNN and Mask R-CNN. The SMD-based dataset can be used as a benchmark that encourages reproducibility and comparability for object detection in maritime environments. Recent research [12, 70, 81, 110, 127, 144–151] reflects this characteristic.

To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, [68] introduces a large-scale Dataset for Object deTection in Aerial images (DOTA). Many studies in the field of maritime remote surveillance use this dataset, for example, [135, 138, 152].

SeaShips is a large ship dataset. The dataset consists of 11,126 images, covering 6 common ship categories (ore ships, bulk carriers, general cargo ships, container ships, fishing ships, and passenger ships). All images come from about 5400 real video clips, collected by 156 surveillance cameras in the coastline video surveillance system. Some research uses this dataset to train their model or improve the model’s performance [139, 140].

Some other datasets need to be highlighted. Spagnolo et al. [153] presented a boat Re-ID dataset composed of 107 classes, and each class represents a different boat with a total of 5523 images. In order to verify the superiority of the proposed dataset, the authors give the results of training CNN by using this dataset, and the research results can be used as a benchmark for future comparisons.

Bovcon et al. [154] introduced the MaSTr1325 dataset for training deep USV obstacle detection models for small-sized coastal USVs. They also proposed a data augmentation protocol to address slight appearance differences. The dataset is applied to three popular semantic segmentation architectures: U-Net, PSPNet, and Deeplabv2, among which Deeplabv2 performs best in obstacle detection. In [148], the authors used 4K videos for maritime video surveillance and proposed an approach that leverages both temporal and spatial video information to achieve fast and accurate object extraction. A multiscale texture discrimination algorithm is applied at key video locations to achieve the final object extraction.

4.3. Current Challenges and Future Works

Computer vision is the study of image information organization, object and scenario recognition, and event interpretation; it takes images (video) as input and aims at representing and understanding the environment. Judging from the current research status, work mainly focuses on the organization and recognition of image information, while the interpretation of events is rarely addressed and remains at a very preliminary stage.

The relationship between artificial intelligence and computer vision is as follows: artificial intelligence puts more emphasis on reasoning and decision-making, whereas computer vision is still mainly at the stage of image information expression and object recognition. Object recognition, environment perception, and scenario understanding also involve reasoning and decision-making from image features, but they are fundamentally different from the reasoning and decision-making of artificial intelligence.

4.3.1. Current Challenges

Specific maritime engineering applications are systemic problems affected by many objective factors, for example, equipment shaking, model dependence, and light interference on shore.
(i) Shaking of imaging equipment: in actual marine engineering applications (onshore and onboard), the performance of the model is often much lower than the accuracy and speed obtained in the laboratory. In the actual marine environment, the shaking of the equipment is the main reason for missed small objects and false detections. Tracking tasks fail for the same reason.
(ii) Model dependence: at present, all models require training on fixed scenes, and environmental changes have a large impact on their recognition accuracy. Weather changes alter the lighting conditions, which degrades marine object detection, and even equipment updates can cause model detection to fail.
(iii) Background light pollution on the shore: this is an important issue, yet few research papers address the extraction of background light on the shore. Even experienced seafarers are prone to mistake shore lights for lights moving on the sea and to make inappropriate decisions. This is an urgent problem in the field of autonomous ship navigation.

4.3.2. Future Works

(1) Online Training. Current models are trained first and then deployed. Applications such as autonomous ship navigation, however, require reasoning and decision-making based on environmental information in real time. Training models online so that they keep adapting after deployment is one future direction for addressing this gap.

(2) Build a Maritime Data Sharing Center. This involves two tasks. (a) A unified evaluation mechanism for models and algorithms: at present, maritime surveillance and intelligent transportation need a public benchmark against which the models proposed by different researchers can be compared. (b) A shared platform of datasets covering various marine scenarios and sea conditions: practical marine object detection requires training with large amounts of data from the relevant sea conditions, and the current datasets cannot meet this need. We will put more energy into data curation and build a maritime data sharing platform.

5. Conclusions

This survey covers most of the application scenarios of object detection for maritime surveillance and autonomous ship navigation. In recent years, a large number of marine object detection models based on deep learning have been proposed, but the lack of universal evaluation criteria makes it difficult to compare the different improved models. According to the characteristics of the maritime environment, this paper summarized the advantages of the milestone computer vision models and described the application scenarios of single-stage and multistage models along their different development routes. The most popular YOLO series models are compared along several dimensions, and the importance of public dataset benchmarks is emphasized. We also discussed the urgency of building a dedicated maritime dataset platform that covers different scenarios and supports model training in practical engineering applications. This work offers feasible suggestions for future research directions in deep learning-based marine object detection.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the Fundamental Research Funds for the Central Universities, Grant nos. 3132021130 and 3132019400.