Because no deep learning algorithm has yet been developed for shellfish identification in real contexts, this paper proposes an improved Faster R-CNN-based detection algorithm. It achieves multiobject recognition and localization through a two-stage detection network and replaces the original feature extraction module with DenseNet, which fuses multilevel feature information, increases network depth, and mitigates vanishing gradients. Meanwhile, the proposal merging strategy is improved with Soft-NMS, in which an attenuation function replaces the hard suppression of the conventional NMS algorithm, thereby avoiding missed detections of adjacent or overlapping objects and enhancing detection accuracy when multiple objects are present. A shellfish dataset was constructed in real contexts, and experimental tests were conducted on the production line of a vision-based seafood sorting robot. The network detected shellfish in different scenarios, and the detection accuracy improved by nearly 4% over the original detection model. This provides technical support for future quality sorting of seafood using the improved Faster R-CNN-based approach.

1. Introduction

As a major maritime country, China has a sea area of up to 4.73 million km², which comprises 280 million ha of shelf fishing grounds, 2.6 million ha of shallow aquaculture ponds, 17.47 million ha of inland waters, and 67 million ha of untapped resources suitable for fisheries such as saline-alkaline lands [1]. Compared with planting and animal husbandry, aquaculture is a weak and lagging industry in China. Thus, it is imperative to increase investment in fisheries in order to strengthen the development and utilization of waters and to improve the comprehensive production capacity of the aquaculture industry. Accelerating the transformation of the growth mode requires novel technologies and methods by which the means and processes of production can be improved.

Among aquatic products, shellfish output accounts for an increasing proportion of total aquaculture production in China, which holds a world-leading position in terms of both culture area and output. At Zhangzi Island, China's largest scallop bottom aquaculture base, scallop production exceeded 200,000 t in 2019. The sorting process in shellfish production requires substantial manual input, which severely restricts the large-scale development of the shellfish industry. First, the manual working environment is harsh and production efficiency is low. Second, because of labor shortages, the demand in shellfish production areas cannot be met during the peak season. Third, corporate development is greatly restricted by rising wages and product costs. As requirements on the quality and production efficiency of aquatic products become ever more demanding, more efficient and precise means of shellfish sorting are needed to raise the degree of production automation. In recent years, computer and robotics technologies have developed rapidly, and computer vision has been applied extensively in numerous fields of industrial production, including the automobile, electromechanical, food, logistics, and manufacturing industries. Using computer vision to identify and localize scallops for subsequent sorting can improve production efficiency while ensuring the quality of aquatic products, and it also reduces the demand for labor.

Computer vision-based shellfish detection and identification is a typical image analysis, understanding, and classification problem [2–5]. In-depth investigation and solution of this problem involve theoretical knowledge from multiple disciplines such as pattern recognition, image processing and analysis, computer vision, artificial intelligence, deep learning, and computer graphics. Accordingly, this paper proposes a Faster R-CNN-based algorithm for shellfish classification and detection.

In most cases, harvested aquatic products are manually sorted, both for classification among species and for grading within the same species. Generally, postharvest processing relies on the manual identification of fish species and appearance quality. To reduce the labor demand and enhance the processing efficiency, computer vision can be applied for noncontact counting and measurement of aquatic products. In this way, the processing efficiency and counting accuracy can be improved without damaging the aquatic products under inspection. Recent years have witnessed many applications in this respect. Through neural network training, automatic identification of marlin biological features was achieved, which simplified computation in the recognition process. According to Cha et al. [6], computer vision-based techniques were developed to overcome the limitations of visual inspection by trained personnel and to detect structural damage in images remotely. In 2021, Long et al. [7] presented a novel deep learning-based damage evaluation approach using speckled images. A deep convolutional neural network (DCNN) was designed to predict the stress intensity factor (SIF) at the crack tip, so that the SIF can be predicted automatically through computer vision. In 2013, Costa [8] applied computer vision to automatically classify the size, gender, and skeletal abnormalities of groupers. The least squares modeling-based multielement technology, which integrated image analysis and contour morphology (elliptical Fourier analysis), was applicable for sorting and processing live fish. Applying computer vision, Ma et al. [9] designed a set of equipment for sea cucumber grading and counting. After acquiring the projection images of sea cucumbers on the conveyor belt, their specifications were determined according to the size of the image area, thereby accomplishing grading and counting. In the absence of overlapping images, the identification accuracy reached 100%. Wang et al. [10] obtained fish contours close to the standard shapes through deformation correction on curved fish images and then used equidistant polar mapping to obtain the mapping maps, followed by extraction of intervals between local and adjacent extreme points in the maps. Finally, the matching degree was computed algorithmically to accomplish the fish identification.

In the object detection task, the region-based CNN (R-CNN) proposed by Girshick [11] is an important reference method, and many algorithms borrow ideas from it. Combining object region proposal with CNN classification, the R-CNN model extracts 2,000 candidate regions from the input image using the selective search algorithm and then performs feature extraction on each region through the CNN. Afterward, it uses a trained classifier to determine whether the candidate regions contain target objects and finally adjusts the proposals with a regressor. Girshick [12] modified the R-CNN by incorporating the ideas of SPP-net to put forward the Fast R-CNN model. The new model performs feature extraction on the entire image only once and then maps the proposals onto the resulting feature map, thereby avoiding the time wasted on repeated feature extraction. Moreover, it is trained with a combination of Softmax classification and box regression, which eliminates feature storage and improves space and time utilization. In the meantime, convolutional features are shared between the classification and regression tasks. Ren et al. [13] proposed the region proposal network (RPN), which uses deep learning to generate proposals, and combined RPN with Fast R-CNN to construct the Faster R-CNN model, thereby improving the overall detection performance.

This paper introduces a Faster R-CNN network for the identification and localization of four shellfish species in order to address the absence of a deep learning algorithm for shellfish identification in real contexts [14–16]. The Faster R-CNN framework is modified based on the features of various shellfish species: the original feature extraction module is replaced with a densely connected network (DenseNet), and fused multilevel features of the objects are extracted to add expressive power to the features. Meanwhile, Soft-NMS is used instead of the original proposal merging strategy, and an attenuation function is designed to enhance the object box localization accuracy. Furthermore, shellfish datasets are collected and built in real contexts, and after training, the identification and localization of various shellfish species are implemented in the working environment with excellent accuracy.

2. Materials and Methods

2.1. Faster R-CNN Architectures

As a current mainstream two-stage detection network, Faster R-CNN is a combination of RPN and Fast R-CNN, which enables the output of detection categories and box positioning at each stage [17, 18]. Depending on the network architecture, the Faster R-CNN can be divided into three parts: the basic feature extraction network, the RPN, and the detection network. The specific steps of the algorithm are described below. Figure 1 presents the algorithmic framework.

2.1.1. Feature Extraction Network

Among the above three parts, the feature extraction network is a convolutional neural network (CNN), whose fundamental architecture includes convolutional, pooling, fully connected, and softmax classification layers. Different CNNs affect detection accuracy and detection time differently. There are three common feature extraction networks for Faster R-CNN, namely, ZFNet, VGG-16, and ResNet. (1) ZFNet [19], a slight improvement on AlexNet, retains more features by reducing the size and stride of the convolution kernels. It also suggests that feature extraction performance improves as network depth increases. (2) VGG-16 [20] is formed by repeatedly stacking 3 × 3 convolution kernels and 2 × 2 max pooling layers, and it verifies the relationship between the depth and performance of CNNs. Despite a simple architecture and favorable feature extraction effect, it has a large number of parameters, a heavy training burden, and high hardware requirements. (3) ResNet [21], short for residual network, solves the vanishing gradient problem that accompanies increasing network depth by introducing a residual module, which implements identity mapping through shortcut connections. Capable of extracting deeper features of target objects, the network achieves a remarkable identification effect.

2.1.2. Feature Extraction of the Datasets

Many kinds of shellfish features can be extracted, such as colour or grayscale features, texture features, shape or form features, structural relationships, frequency features, and boundary area features. How to extract the features of the object to be recognised is the key to the target recognition problem. It was found that the pixel values of the shellfish in the grayscale map were not significantly different from those of the interfering objects, but the shapes were significantly different. Therefore, in this study, the contour boundary of the target object is extracted, and the classification features are extracted from the contour boundary.

The Canny algorithm is used to extract the edges of the grayscale image. Its principle is to compute the gradient histogram of the image and accumulate the number of pixels in the direction of increasing gradient. The gradient value at which the cumulative count reaches 80% of the total number of pixels is taken as the high threshold th, and 40% of the high threshold is taken as the low threshold tl. Points with a gradient greater than th are marked as edge points, and points with a gradient less than tl are marked as background points. Points with gradient values between tl and th are regarded as edge points if there are already marked edge points in their 8-connected neighbourhood; otherwise, they are regarded as background points. In order to reduce noise while preserving the effective boundary, the high threshold was set to 0.4 and the low threshold was set to 0.1, and the extraction effect was satisfactory.

The RPN is responsible for extracting candidate regions [22, 23], and its architecture is illustrated in Figure 2. It receives the convolutional feature maps from the basic feature extraction network and maps each 3 × 3 sliding window to a 256-dimensional feature vector via convolution kernels. Through 1 × 1 convolution, this vector is fed to two sibling fully connected layers, i.e., the box-classification layer (cls layer) and the box regression layer (reg layer). The cls layer outputs the probabilities of belonging to the foreground and background, while the reg layer outputs four parameters, i.e., the centroid coordinates x, y of the predicted box and its width and height w, h. The presence or absence of the object is determined based on the receptive field corresponding to the sliding window center [24]. Given the differences in size and aspect ratio among objects, nine anchors are generated using three window scales (8, 16, 32) and three aspect ratios (1 : 2, 1 : 1, 2 : 1) at a base window size of 16, in order to perform multiscale multipoint sampling of feature maps.
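The nine RPN anchors described above (three scales, three aspect ratios, base size 16) can be enumerated with a short Python sketch. It assumes that each aspect ratio is height divided by width and that the anchor area is held fixed while the ratio changes; the function name `generate_anchors` is illustrative and not taken from the original implementation.

```python
import math

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return nine (width, height) anchor shapes for one sliding-window position."""
    anchors = []
    base_area = base_size * base_size
    for ratio in ratios:  # assumption: ratio = height / width
        # Keep the area fixed while changing the aspect ratio.
        w = math.sqrt(base_area / ratio)
        h = w * ratio
        for scale in scales:
            anchors.append((w * scale, h * scale))
    return anchors

anchors = generate_anchors()
for w, h in anchors:
    print(f"{w:7.1f} x {h:7.1f}")
```

Under these assumptions, every anchor at a given scale covers the same area, so the three ratios only redistribute that area between width and height.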

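The loss function of the RPN is defined as follows:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*), \tag{1}$$

where subscript $i$ stands for the anchor index, $p_i$ represents the probability distribution of each anchor corresponding to k + 1 categories (k categories + 1 background), $p_i^*$ indicates whether an object is contained (1 if so, and 0 otherwise), $N_{cls}$ denotes the minibatch size (usually 256), and $N_{reg}$ is the number of anchors. $\lambda$ denotes the balance weight, whose value is 1. Besides, $t_i$ represents the proposal box coordinates and $t_i^*$ represents the marker box coordinates. The specific parameter values are described below:

$$\begin{aligned} t_x &= \frac{x - x_a}{w_a}, & t_y &= \frac{y - y_a}{h_a}, & t_w &= \log\frac{w}{w_a}, & t_h &= \log\frac{h}{h_a}, \\ t_x^* &= \frac{x^* - x_a}{w_a}, & t_y^* &= \frac{y^* - y_a}{h_a}, & t_w^* &= \log\frac{w^*}{w_a}, & t_h^* &= \log\frac{h^*}{h_a}, \end{aligned} \tag{2}$$

where $x$, $x_a$, and $x^*$ (same for y, w, h) denote the location parameters of the proposal, anchor, and calibration boxes, respectively. The classification loss $L_{cls}$ is a logarithmic loss function of object and nonobject:

$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right]. \tag{3}$$

The regression loss is

$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*), \tag{4}$$

where $R$ is the smooth $L_1$ function:

$$R(x) = \begin{cases} 0.5x^2, & |x| < 1, \\ |x| - 0.5, & \text{otherwise}. \end{cases} \tag{5}$$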
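As a hedged illustration of the RPN loss terms, the following Python sketch uses the standard Faster R-CNN box parameterization, log loss, and smooth L1 function. All function names are illustrative, and the default `n_reg` (the number of anchors passed in) is an assumption made for the example rather than a setting reported here.

```python
import math

def encode_box(box, anchor):
    """Parameterize a box (x, y, w, h) relative to an anchor (x_a, y_a, w_a, h_a)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def smooth_l1(d):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    return 0.5 * d * d if abs(d) < 1.0 else abs(d) - 0.5

def cls_loss(p, p_star):
    """Log loss over object (p_star = 1) versus nonobject (p_star = 0)."""
    return -math.log(p if p_star == 1 else 1.0 - p)

def rpn_loss(probs, labels, t_pred, t_true, n_cls=256, n_reg=None, lam=1.0):
    """Classification term plus lambda-weighted regression term over positive anchors."""
    n_reg = n_reg if n_reg is not None else len(probs)  # assumption for this sketch
    l_cls = sum(cls_loss(p, s) for p, s in zip(probs, labels)) / n_cls
    l_reg = sum(
        s * sum(smooth_l1(a - b) for a, b in zip(tp, tt))
        for s, tp, tt in zip(labels, t_pred, t_true)
    ) / n_reg
    return l_cls + lam * l_reg

# A box twice as large as its anchor, sharing the same centre.
t = encode_box((10, 10, 32, 32), (10, 10, 16, 16))
print(t)
```

Because the regression term is multiplied by the label, only anchors that contain an object contribute to the box regression loss.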

The RPN performs box regression via the loss function and merges the prediction boxes output by the detector through nonmaximum suppression; the merged proposals are then passed to the Fast R-CNN as input [25–27]. The candidate regions generated by RPN are mapped onto the feature map output by the feature extraction network. For a candidate region of any size, the ROI pooling layer produces a fixed-dimensional output, from which the final result is obtained through the cls and reg layers.
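The fixed-dimensional output of ROI pooling can be illustrated with a minimal pure-Python sketch. It assumes a single-channel feature map stored as a nested list, end-exclusive integer ROI coordinates, and max pooling over integer bins; `roi_max_pool` is a hypothetical helper, not the layer used in the experiments.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (x0, y0, x1, y1), end-exclusive, into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Integer bin edges; every bin covers at least one row.
        r0 = y0 + (i * roi_h) // out_h
        r1 = max(y0 + ((i + 1) * roi_h) // out_h, r0 + 1)
        row = []
        for j in range(out_w):
            c0 = x0 + (j * roi_w) // out_w
            c1 = max(x0 + ((j + 1) * roi_w) // out_w, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# A 4 x 6 feature map with values 0..23.
fm = [[r * 6 + c for c in range(6)] for r in range(4)]
print(roi_max_pool(fm, (0, 0, 6, 4)))  # whole map
print(roi_max_pool(fm, (1, 1, 4, 3)))  # smaller region, same output size
```

Both calls return a 2 × 2 grid even though the regions differ in size, which is what allows the subsequent cls and reg layers to use a fixed input dimension.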

2.2. Improved Faster R-CNN

2.2.1. Dense Block Network

Although a deeper network allows the extraction of deeper semantic information, the number of parameters inevitably increases as the network deepens [28, 29]. This creates a series of problems for network optimization and for the experimental hardware. The datasets built specifically for the shellfish classification and detection algorithm herein have small sample sizes, so network training easily leads to overfitting. The use of DenseNet as the feature extraction network helps solve these problems [30].

As a novel network architecture, DenseNet draws on the ideas of ResNet. The most intuitive difference between the two architectures lies in the transfer functions of their network blocks:

$$x_l = H_l(x_{l-1}) + x_{l-1}, \tag{6}$$

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]).$$

As is clear from (6), which describes the transfer function of ResNet, the $l$th layer output of the network equals a nonlinear transformation of the $(l-1)$th layer output plus the $(l-1)$th layer output itself. In contrast, as the second expression shows, the $l$th layer output of a DenseNet block is a nonlinear transformation of the concatenated outputs of all previous layers. Figure 3 depicts the Dense Blocks of the DenseNet.
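The difference between the two transfer functions can be made concrete with a toy Python sketch in which feature maps are plain lists of numbers. The functions `halve` and `toy_h` are hypothetical stand-ins for the nonlinear transformation H and are chosen only to keep the arithmetic easy to follow.

```python
def residual_layer(x_prev, h):
    """ResNet-style update: H(x_{l-1}) + x_{l-1}, an element-wise sum."""
    return [a + b for a, b in zip(h(x_prev), x_prev)]

def dense_layer(previous_outputs, h):
    """DenseNet-style update: H applied to the concatenation of all earlier outputs."""
    concatenated = [v for x in previous_outputs for v in x]
    return h(concatenated)

def halve(x):
    # Stand-in for H in the residual case; keeps the feature width unchanged.
    return [0.5 * v for v in x]

def toy_h(x):
    # Stand-in for H in the dense case; always emits k = 2 new features.
    return [sum(x), max(x)]

x0 = [1.0, 2.0]
residual_out = residual_layer(x0, halve)

outputs = [x0]
input_widths = []
for _ in range(3):
    input_widths.append(sum(len(x) for x in outputs))
    outputs.append(dense_layer(outputs, toy_h))

print(residual_out)
print(input_widths, outputs[-1])
```

In the dense case, each layer sees a wider input than the one before it while still emitting only k new features, which is the behaviour that lets DenseNet stay narrow.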

The layers in each Dense Block are all interconnected [31]. H denotes a composite operation in which the input is processed by Batch Norm, ReLU, and convolution with k kernels of size 3 × 3, which ensures that each node outputs feature maps of the same dimension. k denotes the thickness (number of channels) of the feature maps output by each convolution layer. Whereas the output feature maps of other networks can reach a thickness of hundreds or even thousands, each DenseNet layer uses a thickness of only 32. The dense connections in DenseNet allow effective utilization of the shallow and deep layer features, so the network remains efficient and narrow while its complexity and computational burden are greatly reduced. Figure 4 details the parameters for the connecting nodes.

In this paper, a 121-layer DenseNet (DenseNet-121) comprising four Dense Blocks is used as the feature extraction network. After the fully connected and classification layers are removed, it is connected to the RPN and RoI pooling layer to accomplish object identification and localization. Table 1 lists the parameters of the four-DenseBlock architecture.
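To show how a thickness of 32 keeps the network narrow, the sketch below tracks channel counts through four Dense Blocks. It assumes the standard DenseNet-121 configuration (64 initial channels, blocks of 6, 12, 24, and 16 layers, and transition layers that halve the channel count); these values are assumptions taken from the published DenseNet design and may differ from the parameters in Table 1.

```python
def dense_block_channels(init_channels=64, block_layers=(6, 12, 24, 16),
                         growth_rate=32, compression=0.5):
    """Return (input_channels, output_channels) for each Dense Block."""
    channels = init_channels
    summary = []
    for idx, n_layers in enumerate(block_layers):
        # Every layer in a block appends growth_rate new feature maps.
        out_channels = channels + n_layers * growth_rate
        summary.append((channels, out_channels))
        channels = out_channels
        if idx < len(block_layers) - 1:
            # A transition layer compresses the channels between blocks.
            channels = int(channels * compression)
    return summary

summary = dense_block_channels()
for block, (c_in, c_out) in enumerate(summary, start=1):
    print(f"Dense Block {block}: {c_in} -> {c_out} channels")

# Two convolutions per dense layer, plus the stem convolution,
# three transition convolutions, and the final classifier.
total_layers = 2 * sum((6, 12, 24, 16)) + 1 + 3 + 1
print(total_layers)
```

Under these assumptions, the layer count reproduces the 121 in the network's name, and the final block outputs 1,024 channels before the removed classification layers.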

2.2.2. Nonmaximum Suppression

In essence, nonmaximum suppression (NMS) searches for local maxima and suppresses nonmaximum elements, and it is an important step of the detection process [32, 33]. Faster R-CNN generates a series of detection boxes $\{b_1, \ldots, b_N\}$ in an image and the corresponding box score set $\{s_1, \ldots, s_N\}$. The NMS algorithm first selects the detection box M with the maximum score and then computes the intersection over union (IoU) between M and each remaining detection box $b_i$. A remaining box $b_i$ is suppressed if the result reaches the set threshold $N_t$. The NMS algorithm formula is as follows:

$$s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_t, \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t, \end{cases} \tag{7}$$

where the IoU is computed by the following formula:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|},$$

where A and B represent two overlapping detection boxes.

As is clear from (7), the NMS algorithm sets to zero the score of any detection box whose IoU with M reaches the threshold. If another object lies in the overlapping region, its detection box is discarded and the object is missed, thereby reducing the accuracy of the detection model.
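A minimal Python sketch of conventional NMS makes this failure mode visible. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the threshold of 0.5 is an illustrative value rather than the setting used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Return the indices of the boxes kept by hard nonmaximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        m = order.pop(0)  # box M with the current maximum score
        keep.append(m)
        # Discard every remaining box whose IoU with M reaches the threshold.
        order = [i for i in order if iou(boxes[m], boxes[i]) < threshold]
    return keep

boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example, the second box overlaps the first with an IoU of about 0.67 and is removed outright, even though it could belong to a neighbouring shellfish.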

To address this problem, the conventional NMS algorithm is replaced with the Soft-NMS, where an attenuation function is designed based on the IoU between adjacent detection boxes instead of setting their scores to zero, thereby ensuring accurate identification of adjacent objects. The Soft-NMS algorithm is expressed as follows:
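As a hedged sketch of Soft-NMS rescoring, the Python code below implements the two decay functions from the original Soft-NMS formulation: a linear decay that multiplies the score by (1 − IoU) once the IoU reaches the threshold, and a Gaussian decay that multiplies it by exp(−IoU²/σ). Which attenuation function and parameter values were used in this work is not stated here, so the `method`, `iou_threshold`, `sigma`, and `score_threshold` defaults are assumptions.

```python
import math

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, method="linear", iou_threshold=0.3,
             sigma=0.5, score_threshold=0.001):
    """Return (kept indices in selection order, decayed scores)."""
    if method not in ("linear", "gaussian"):
        raise ValueError("method must be 'linear' or 'gaussian'")
    scores = list(scores)
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        m = max(remaining, key=lambda i: scores[i])  # current top-scoring box M
        remaining.remove(m)
        keep.append(m)
        for i in remaining:
            overlap = iou(boxes[m], boxes[i])
            if method == "linear":
                if overlap >= iou_threshold:
                    scores[i] *= 1.0 - overlap
            else:
                scores[i] *= math.exp(-(overlap ** 2) / sigma)
        # Drop boxes only when their decayed score becomes negligible.
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return keep, scores

boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
keep_linear, scores_linear = soft_nms(boxes, scores, method="linear")
keep_gauss, scores_gauss = soft_nms(boxes, scores, method="gaussian")
print(keep_linear, scores_linear)
print(keep_gauss, scores_gauss)
```

With either decay, the overlapping second box survives with a reduced score instead of being zeroed, so a later confidence threshold decides whether it is reported.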

To accomplish the shellfish classification and detection in real contexts, modifications are made on the front-end feature extractor and the tail-end regressor of the Faster R-CNN detection algorithm. The algorithmic flow is provided in Algorithm 1.

(1) Input image A, adjust the image size, and output image B with a specified size M × N.
(2) Using B as the input of the feature extraction module, obtain a multilevel fused feature map C via DenseNet.
(3) Using C as the input of RPN, obtain 300 proposals D by the sliding window method. RPN adjusts the generated anchors through box regression so that they are closer to the marker boxes.
(4) Using C and D as the inputs of RoI pooling, obtain a mapping map E between the proposals and the feature map.
(5) Output E separately to the classifier and the regressor. The classifier classifies and identifies E using Softmax, while the regressor further corrects the boxes, which are then merged by Soft-NMS. Finally, classify and localize the objects.

2.3. Experimental Conditions

In order for the experiments to match actual production, the shellfish identification experiments in this paper were carried out on a deep learning based seafood sorting robot production line, as shown in Figure 5.

3. Results and Discussion

3.1. Dataset Construction and Processing

To verify the effectiveness of the proposed Faster R-CNN algorithm, shellfish datasets were collected independently in this study. They cover 4 species (scallops, mussels, conch, and clams) and contain 4,218 images in total (see Table 2).

The data features different light intensities, occlusions, complex backgrounds, and multiple objects in order to ensure that the detection model covers common real-life shellfish. Furthermore, 50% of the dataset is mirror augmented, and the other 50% is translationally augmented, followed by data annotation on the LabelImg software, as shown in Figure 6. The augmented datasets contain 8,436 images, 90% of which are used as the training sets and 10% as testing sets.
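When an image is mirrored or translated, its annotated boxes must be transformed in the same way. The following Python sketch shows one way to do this and to perform the 90/10 split; it assumes boxes in `(x1, y1, x2, y2)` pixel coordinates, a horizontal mirror, and a shuffled split with rounding, none of which are details given above, and all function names are illustrative.

```python
import random

def mirror_box(box, image_width):
    """Flip a box (x1, y1, x2, y2) horizontally within an image of the given width."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)

def _clip(value, upper):
    return max(0, min(value, upper))

def translate_box(box, dx, dy, image_width, image_height):
    """Shift a box by (dx, dy) and clip it to the image bounds."""
    x1, y1, x2, y2 = box
    return (_clip(x1 + dx, image_width), _clip(y1 + dy, image_height),
            _clip(x2 + dx, image_width), _clip(y2 + dy, image_height))

def split_dataset(items, train_fraction=0.9, seed=0):
    """Shuffle reproducibly and split into training and testing subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_fraction)
    return items[:n_train], items[n_train:]

print(mirror_box((10, 20, 50, 80), image_width=100))
print(translate_box((10, 20, 50, 80), 60, 0, 100, 100))
train, test = split_dataset(range(8436))
print(len(train), len(test))
```

Clipping a translated box at the image border keeps the annotation valid when part of the shellfish is shifted out of view.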

For accurate comparison of the improved algorithm in complex backgrounds, difficult samples are selected from the testing datasets based on the following criteria: there are more than four detection objects in an image, and the shellfish is affected by illumination and occlusion. Table 3 details the difficult sample testing data. Some images in the difficult samples have multiple difficult attributes, as shown in Figure 7(c). Apart from containing multiple shellfish species, they are also shot at nighttime. These images are both multiobject categorization samples and illumination-affected samples.

3.2. Experimental Parameters and Evaluation Indexes

DenseNet requires a large amount of video memory because each layer fuses all feature maps produced before it, so a memory-efficient implementation is adopted. Specifically, two preallocated shared memory storage locations are used for storing the shared feature maps to be concatenated. All intermediate outputs are written to these memory locations during the forward pass, while during the backward pass, the transfer function is recomputed and updated as needed. This strategy enables DenseNet to run on a single GPU with little computational overhead.

Experiments are implemented in TensorFlow and run on a computer with an i7-6700 processor, a GeForce RTX 2080 Ti GPU (11 GB video memory), and 32 GB RAM. The experimental data include shellfish images collected in different scenarios such as factories, the Internet, and seafood markets, which are manually annotated using the LabelImg software. The experiments are conducted on a deep learning-based seafood sorting robot line.

The loss curve and test accuracy during training are shown in Figure 8. After about 2000 iterations, the overall loss drops rapidly and stabilizes between 0.1 and 0.2, and the test accuracy stabilizes at about 80%.

The evaluation index is average precision (AP), which refers to the area enclosed by the precision-recall (P-R) curve. In a P-R curve, P represents the precision and R represents the recall rate. They are computed by the following formulas:

$$P = \frac{N_{TP}}{N_{TP} + N_{FP}}, \qquad R = \frac{N_{TP}}{N_{TP} + N_{FN}},$$

where $N_{TP}$ (true positives) denotes the number of positive samples that are identified as positive, $N_{FP}$ (false positives) denotes the number of negative samples that are incorrectly identified as positive, and $N_{FN}$ (false negatives) denotes the number of positive samples that are incorrectly identified as negative. Besides, AP represents the identification accuracy of a single category. A higher AP value indicates better performance of the network model. In addition, mAP (mean average precision) represents the overall identification accuracy of all categories, and its relationship with AP is expressed by

$$\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_c,$$

where $C$ is the number of categories.
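The evaluation indexes can be computed with a short Python sketch. It assumes that each detection has already been matched against the ground truth and flagged as a true or false positive, and that AP is taken as the area under the P-R curve with all-point interpolation; the exact evaluation protocol used in the experiments is not specified, so this is one common choice rather than the authors' procedure.

```python
def precision_recall(n_tp, n_fp, n_fn):
    """Precision and recall from true-positive, false-positive, and false-negative counts."""
    precision = n_tp / (n_tp + n_fp) if (n_tp + n_fp) else 0.0
    recall = n_tp / (n_tp + n_fn) if (n_tp + n_fn) else 0.0
    return precision, recall

def average_precision(scores, is_true_positive, n_positives):
    """Area under the P-R curve, sweeping detections from high to low score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_positives)
    # Make precision non-increasing as recall grows (all-point interpolation).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

def mean_average_precision(ap_values):
    """Mean of the per-category AP values."""
    return sum(ap_values) / len(ap_values)

# Four detections of one category with four ground-truth objects in total.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], n_positives=4)
print(precision_recall(8, 2, 2), ap)
```

Computing one AP per shellfish species and passing the list to `mean_average_precision` gives the mAP over the four categories.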

3.3. Results Comparison and Analysis

Training is performed separately with ResNet and DenseNet as the feature extraction networks, and the network models are evaluated using the testing sets. Table 4 lists the obtained AP values for various shellfish species, whereas Table 5 presents the detection and comparison results of difficult samples.

According to the detection results, the Faster R-CNN with ResNet achieves an mAP of over 77% in various shellfish detections. Figure 7 depicts part of the detection results. In Figure 7(a), the object features are distinct and the illumination is sufficient, so the model achieves remarkable detection performance. In Figure 7(b), the scallop is partially occluded and the image contains multiple shellfish species, yet the detection effect is satisfactory. In contrast, missed detection is present in Figure 7(c). Clearly, despite certain detection capabilities, ResNet still yields some missed and false detections. The reason is that ResNet cannot be fully trained on such a small dataset, which leads to inadequate robustness in complex scenarios.

As is clear from Table 4, the use of DenseNet-121 for feature extraction yields an mAP of up to 83%, which is nearly 4% higher than that of ResNet. Greatly improved detection results are noted across three shellfish species (mussels, scallops, and clams). The reason is that the testing sets of these three species contain samples with multiple objects, occlusions, and complex backgrounds, and DenseNet can extract more object features to attain better results. After the detection boxes are merged with Soft-NMS, the accuracy rises by 2% in the mussel and clam datasets, which contain multiobject samples. This suggests that Soft-NMS avoids zeroing the scores of proposal boxes whose overlap with a higher-scoring box exceeds the threshold when multiple objects overlap, which achieves a better detection effect. According to Table 5, the improved detection network is more robust on difficult samples than the original version, with distinct improvements in two types of samples, i.e., the illumination-affected and occluded samples. Performance comparisons between the improved and original Faster R-CNN algorithms in the presence of multiple objects and complex backgrounds are shown in Figure 9.

The first row in Figure 9 presents the detection results with the original network, whereas the second row presents those with the improved network. Clearly, the original network often yields missed detections under complex conditions. Figure 9(b) exhibits a false detection made by the original network, which identifies a conch and a clam as a mussel. Figure 9(c) shows a case of missed detection, where the detection box fails to contain the shellfish objects precisely. As the comparisons reveal, the Faster R-CNN improved with DenseNet and Soft-NMS outperforms the original network in terms of accuracy: it detects more shellfish objects and accurately separates individual objects when adjacent objects are very close together. The improvement is most prominent under complex backgrounds and multiple objects.

The experimental comparisons show that the improved Faster R-CNN algorithm has high detection accuracy in shellfish detection tasks and good robustness in different environments. Apart from extending the application range of the original algorithm, it also shows practical applicability.

4. Conclusions

A deep learning algorithm for shellfish identification is proposed in order to address the inefficiency of conventional detection algorithms under different ambient light, diverse backgrounds, and varying occlusion conditions. As a modification based on Faster R-CNN, the algorithm uses DenseNet as the feature extraction network, where the dense connections allow effective utilization of the shallow and deep layer features, thereby enhancing the shellfish detection accuracy. Meanwhile, the proposal merging strategy is optimized by using Soft-NMS instead of the original algorithm, thereby adding precision to the proposals. Furthermore, shellfish datasets are built in real contexts and then augmented to improve the robustness of the trained model. The proposed detection algorithm achieves multiobject shellfish detection for seafood processors and maintains good accuracy in complicated scenarios such as illumination influence, partial occlusion, and complex backgrounds, with a nearly 4% higher detection accuracy than the original model. In the next step, we will use the improved Faster R-CNN for seafood quality detection based on the findings of this paper.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported by the National Public Science and Technology Research Funds Projects of Ocean (No. 201505029) and Natural Science Foundation of Liaoning (No. 2020-MS-273).