Abstract

Object detection is a classical research problem in computer vision and is widely used in automatic monitoring for production safety. However, current object detection techniques often suffer from low detection accuracy when an image has a complex background. To solve this problem, this paper proposes a double U-shaped multireinforced unit structure network (DUMRN). The proposed network consists of a detection module (DM), a reinforcement module (RM), and a salient loss function (SLF). Extensive experiments are conducted on five public datasets and a practical application dataset and compared against nine state-of-the-art methods. The experimental results show the superiority of our method over the state of the art.

1. Introduction

Object detection in computer vision is widely used in production safety monitoring, for example, abnormal behavior detection, region intrusion detection, and dress code detection. Practical applications of object detection effectively address many problems and defects in production safety management while reducing preventable accidents in the workplace.

In many practical production safety applications of object detection, we found that the following problems still exist in safety harness detection: (1) the color contrast between the safety harness and the work clothes is low, which makes it difficult to detect the safety harness accurately; (2) the structure of the safety harness is complex, and its detection is easily interfered with by the texture of the work clothes, which makes detection difficult, as shown in Figure 1.

To solve the above problems, salient object detection technology is considered a feasible solution. Salient object detection imitates the mechanism of human visual attention: the eyes locate the object of interest in a visual scene, and the object is then transmitted to the brain for understanding, so that the desired information can be obtained quickly from the scene. Since salient object detection can ignore irrelevant information, the region of interest can be effectively segmented and used in the subsequent detection process. Therefore, salient object detection is widely used as an effective preprocessing technique in many computer vision tasks.

Based on an extensive literature review, we briefly introduce current salient object detection methods from the following four aspects.

1.1. Salient Object Detection Based on Traditional Methods

Early salient object detection methods were based on low-level features and heuristic prior knowledge, such as color contrast [1], background prior [2], and center prior [3]. Early methods detect salient objects by searching for pixels according to a predefined saliency measure computed from handcrafted features [4, 5]. Borji et al. [6] provided a comprehensive survey of this field. Encouraged by the advances of deep CNNs in image classification [7, 8], early deep salient object detection methods searched for salient objects by classifying image pixels or superpixels into salient or nonsalient classes based on local image patches extracted at single or multiple scales [9, 10].

1.2. Salient Object Detection Based on Feature Enhancement

Wang et al. [11] used a weight sharing method to refine features iteratively and promote mutual fusion between features. Li et al. [12] proposed a novel dense attentive feature enhancement (DAFE) module for efficient feature enhancement in saliency detection. Zhang et al. (UCF) [13] developed a reformulated dropout and a hybrid upsampling module to reduce the checkerboard artifacts of deconvolution operators and aggregate multilevel convolutional features for saliency detection. Hu et al. [14] applied a level set [15] function to output accurate boundaries and compact saliency. Luo et al. (NLDF+) [16] designed a network with a 4 × 5 grid structure to combine local and global information and used a fusing loss of cross-entropy and boundary IoU inspired by Mumford and Shah [17]. Hou et al. (DSS+) [18] extended the holistically nested edge detector (HED) [19] by introducing short connections to its skip layers for saliency prediction. Chen et al. (RAS) [20] refined the side outputs of HED iteratively using a reverse attention model. Zhang et al. (LFR) [21] predicted saliency with clear boundaries by proposing a sibling architecture and a structural loss function. Yao and Wang [22] proposed an enhancing region and boundary awareness network (ERBANet) equipped with attentional feature enhancement (AFE) modules to improve the detection performance.

1.3. Salient Object Detection Based on the Attention Mechanism

In [24], a gate unit combines two consecutive feature maps of different resolutions from the encoder to generate rich contextual information. Li et al. [5] proposed an attention steered interweave fusion network (ASIF-Net) to detect salient objects, which progressively integrates cross-modal and cross-level complementarity from the RGB image and the corresponding depth map via an attention mechanism. Xu et al. [25] proposed a dual pyramid network (DPNet) for salient object detection by formulating the self-attention mechanism within subregion-based contexts. Zhou et al. [26] proposed a simple yet effective hierarchical U-shape attention network (HUAN) to learn a robust mapping function for salient object detection and formulated a novel attention mechanism to improve the well-known U-shape network. Li et al. [27] proposed a multiattention guided feature fusion network (MAF), in which a novel channel-wise attention block (CAB) handles message passing layer by layer from a global view and utilizes the semantic cues in the higher convolutional block to guide feature selection in the lower block. Zhang et al. (PAGRN) [28] developed a recurrent saliency detection model that transfers global information from the deep layer to shallower layers through a multipath recurrent connection. Hu et al. (RADF+) [29] recurrently concatenated multilayer deep features for salient object detection. Wang et al. (RFCN) [30] designed a recurrent FCN for saliency detection that iteratively corrects prediction errors. Liu et al. (PiCANetR) [31] predicted pixel-wise attention maps with a contextual attention network and then incorporated them into U-Net.

1.4. Salient Object Detection Based on Edge Optimization

To capture finer structures and more accurate boundaries, numerous refinement strategies have been proposed. Wu et al. [32] proposed a stacked cross refinement network (SCRN) for salient object detection, which simultaneously refines the multilevel features of salient object detection and edge detection by stacking a cross refinement unit (CRU). Wang et al. (SRM) [33] captured global context information with a pyramid pooling module and refined saliency maps with a multistage refinement mechanism. Amirul et al. [34] proposed an encoder-decoder network that utilizes a refinement unit to recurrently refine saliency maps from low resolution to high resolution. Deng et al. (R3Net+) [23] developed a recurrent residual refinement network that refines saliency maps by incorporating shallow and deep layers' features alternately. Fu et al. [35] proposed an end-to-end deep-learning-based refinement model named Refinet, in which edge-aware intermediate saliency maps are computed from segmentation-based pooling and then fed to a two-tier fully convolutional network for effective fusion and refinement.

Researchers have improved salient object detection in the above four aspects, but the following two problems still exist.

1.4.1. The Issue of Blurred Edges

Salient object detection methods based on the fully convolutional neural network (FCN) can extract multilevel features better than previous methods. However, after successive convolution and pooling operations, the lost shallow fine details cannot be recovered by the upsampling operation, resulting in defects in fine structures or boundaries, as shown in Figure 2. Saliency is defined primarily in terms of the global features of an image rather than local or pixel-level features. To obtain more accurate results, salient object detection methods still need to understand both the global saliency of the whole image and the structural details of the object [19].

1.4.2. The Issue of Complex Background

Most salient object detection networks adopt the U-Net structure as the encoder and decoder, and the multistage features provided by U-Net are used to reconstruct high-resolution feature maps. Whether the effective features of the encoder can be transmitted to the decoder determines whether the decoder can output an accurate salient object. However, most U-Net-based methods only consider the information interaction between different levels within the encoder or the decoder and directly use an all-pass skip-layer structure to connect the encoder features to the decoder. In these methods, information interference often occurs between different blocks, especially when an image has a complex background.

Qin et al. [36] proposed a method that divides the task into two parts but optimizes the edges without taking into account the loss of fine structure and the interference of complex backgrounds. Inspired by the BASNet [36] structure, we propose a double U-shaped multireinforced unit structure network (DUMRN) to solve the above two problems simultaneously. The network achieves fine prediction of object boundaries and accurate salient object detection under complex backgrounds. The main contributions of this paper include the following:
(1) We propose a new detection module which includes an information processing unit, a dual-flow branch unit, and a semantic reinforcement unit. The information processing unit controls the amount of information flowing from each encoder block to the decoder while enhancing effective information and suppressing irrelevant information. The dual-flow branch unit fuses the output of the information processing unit, with a supplementary branch optimizing the residual information of the trunk branch. The semantic reinforcement unit makes full use of top-level semantic information and integrates multilevel context information to obtain more accurate spatial information and finer boundary information.
(2) We propose a new reinforcement module. It includes a feature reinforcement unit and a heat map unit. The feature reinforcement unit further fuses the information in the preliminary salient map through a U-shaped encoder-decoder structure. The heat map unit uses an improved activation function to sharpen the feature map.
(3) We design a loss function for salient object detection. It combines a salient loss of binary cross-entropy (BCE), structural similarity (SSIM), and an IoU loss and can learn from ground truth information at the pixel, patch, and map levels.

2. Materials and Methods

Because encoder-decoder structures can make full use of context features in salient object detection, we design two encoder-decoder structures that form a double U-shaped network, which is divided into the detection module (DM) and the reinforcement module (RM), as shown in Figure 3.

In the previous double U-shaped network [36], the first U-shaped structure is a simple encoder-decoder, which often cannot effectively address the loss of semantic information and the interference of redundant information. Therefore, we add optimization units to the first encoder-decoder structure: an information processing unit (IPU), a dual-flow branch unit (DFBU), and a semantic reinforcement unit (SRU). We input an image into the encoder-decoder, and after the information allocation of the IPU and the information supplement of the DFBU, the outputs of D1 and the SRU are added to obtain the preliminary feature map. Experimental results of the method with a double U-shaped structure show that adding a second encoder-decoder structure can further enhance the information [36]. The second U-shaped structure contains a heat map unit (HMU) and a feature reinforcement unit (FRU). The preliminary salient map is input into the reinforcement module, where the two branches operate in parallel. Finally, the output S′1 of the FRU and the output of the HMU are added to obtain the final result.
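To make the overall data flow concrete, the following minimal PyTorch-style sketch summarizes the two-stage pipeline described above; `detection_module` and `reinforcement_module` are hypothetical callables standing in for the DM and RM, not the authors' implementation.

```python
import torch

def dumrn_forward(image: torch.Tensor, detection_module, reinforcement_module):
    # Stage 1 (DM): encoder-decoder with IPU, DFBU, and SRU produces the
    # preliminary salient map (output_1 + output_2).
    prelim = detection_module(image)
    # Stage 2 (RM): the FRU branch and the HMU branch process the preliminary
    # map, and their outputs are added to form the final salient map.
    final = reinforcement_module(prelim)
    return final
```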

2.1. Detection Module

The detection module mainly aims to solve the problems of information interference under complex backgrounds and edge blurring. It is a U-shaped encoder-decoder structure that contains the IPU, DFBU, and SRU. The IPU controls and processes the information exchange between the encoder and the decoder to solve the interference caused by complex backgrounds. The DFBU supplements the trunk information and addresses the loss of detailed information. The SRU makes better use of multilevel semantic information to address the problem of edge structure.

2.1.1. Information Processing Unit (IPU)

Compared with previous methods, the U-Net structure can obtain both deep semantic information and shallow spatial information. However, interference occurs when the encoder and decoder exchange information: the transmitted information contains much invalid information, and this interference degrades the quality of the transmitted information.

To solve this problem, we add an IPU between each pair of corresponding encoder and decoder blocks to distribute the information from the encoder and then transmit it to the decoder, as shown in Figure 4. Here, Ei denotes the i-th layer feature of the encoder, Ti denotes the i-th layer feature parallel to the encoder, and Di+1 denotes the decoder feature of the (i + 1)-th layer. When information is allocated, Ei, Ti, and Di+1 are input into the IPU for a series of convolution, activation, and pooling operations to obtain a weight that is allocated to Xi. The operation combines a channel-wise concatenation, a convolution layer, and the element-wise sign function S(·), as given in formula (1).
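As an illustration, the PyTorch sketch below shows one plausible realization of the IPU as a gating unit: it concatenates Ei, Ti, and the upsampled Di+1, computes a weight map, and uses it to modulate the information passed to the decoder. The shared channel width, the single 3 × 3 convolution, and the sigmoid gating are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IPU(nn.Module):
    """Sketch of an information processing unit (assumed form): gate the
    encoder feature E_i with a weight map computed from E_i, T_i, and the
    upsampled decoder feature D_{i+1}."""

    def __init__(self, channels: int):
        super().__init__()
        # Assumes E_i, T_i, and D_{i+1} all carry `channels` feature maps.
        self.conv = nn.Conv2d(channels * 3, channels, kernel_size=3, padding=1)

    def forward(self, e_i, t_i, d_next):
        # Bring the deeper decoder feature to the spatial size of E_i.
        d_up = F.interpolate(d_next, size=e_i.shape[2:], mode="bilinear",
                             align_corners=False)
        # Concatenate along channels, convolve, and squash to a [0, 1] weight map.
        gate = torch.sigmoid(self.conv(torch.cat([e_i, t_i, d_up], dim=1)))
        # X_i: the gated information transmitted to the decoder.
        return e_i * gate
```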

2.1.2. Dual-Flow Branch Unit (DFBU)

In the DFBU structure, the trunk branch serves as the main fusion path, combining multilevel information to predict the overall information of the object, while the supplementary branch combines more low-level information to supplement the trunk branch for optimization. The information Xi processed by the IPU is divided into two branches that enter the DFBU separately. One part enters the trunk branch to obtain the convolution layer Di; after convolution, activation, and pooling operations, it is added to the output of Xi to obtain the convolution layer Di−1. The other part enters the supplementary branch, where the information at all levels is added in turn. Finally, the supplementary output is added to the D1 output of the trunk branch to obtain the DFBU result, denoted as output_1, as shown in Figure 5.
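The following functional sketch illustrates the two-branch idea under simplifying assumptions (all levels share one channel width, and bilinear resizing aligns spatial sizes); it is an interpretation for illustration rather than the authors' layer configuration.

```python
import torch
import torch.nn.functional as F

def dual_flow_fusion(x_feats, trunk_blocks):
    """x_feats: IPU outputs X_1..X_n ordered from shallow to deep.
    trunk_blocks: assumed per-level conv blocks of the trunk branch,
    applied from deep to shallow (length n - 1)."""
    # Trunk branch: fuse from deep to shallow, adding X_i at each level.
    d = x_feats[-1]
    for x_i, block in zip(reversed(x_feats[:-1]), trunk_blocks):
        d = F.interpolate(d, size=x_i.shape[2:], mode="bilinear",
                          align_corners=False)
        d = block(d + x_i)
    # Supplementary branch: accumulate low-level information step by step.
    supp = torch.zeros_like(x_feats[0])
    for x_i in x_feats:
        supp = supp + F.interpolate(x_i, size=x_feats[0].shape[2:],
                                    mode="bilinear", align_corners=False)
    # output_1: supplementary branch added to the trunk output D_1.
    return d + supp
```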

2.1.3. Semantic Reinforcement Unit (SRU)

Since spatial information and detailed information cannot be fully integrated in the U-Net structure, shallow semantic information is gradually lost during the successive convolution and pooling of the input. Rich semantic information and accurate detailed information play an important role in salient object detection. Owing to the lack of shallow and deep features, the generated salient map cannot obtain fine boundaries even when the approximate salient region is correct. Since the highest layer of the encoder has rich semantic features, we fuse the features of the deeper encoder layers (E2, E3, E4, E5, and E6) with E1, respectively, to obtain convolution layers with the same size as E1. Finally, we add the five fused convolution layers Y2 to Y6 and E1, and the output is output_2, as shown in formula (2) and Figure 6. Finally, output_1 and output_2 are added to obtain the preliminary feature map output by the detection module.
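A minimal sketch of this fusion is given below, assuming all encoder features have already been reduced to a common channel width; the fusion operator (concatenation followed by a 3 × 3 convolution) is an illustrative choice rather than the exact operation in formula (2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRU(nn.Module):
    """Sketch of a semantic reinforcement unit: fuse each deeper encoder
    feature E_2..E_6 with E_1 and sum the fused maps Y_2..Y_6 with E_1."""

    def __init__(self, channels: int, num_deep: int = 5):
        super().__init__()
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1)
            for _ in range(num_deep))

    def forward(self, e1, deep_feats):
        # deep_feats = [E_2, E_3, E_4, E_5, E_6], each with `channels` maps.
        out = e1
        for conv, e_k in zip(self.fuse, deep_feats):
            e_up = F.interpolate(e_k, size=e1.shape[2:], mode="bilinear",
                                 align_corners=False)
            out = out + conv(torch.cat([e1, e_up], dim=1))  # add Y_k
        return out  # output_2
```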

2.2. Reinforcement Module
2.2.1. Heat Map Unit (HMU)

The heat map unit mainly intensifies salient features and weakens nonsalient features in the feature map. We introduce a nonlinear activation function, the sigmoid activation function, to adjust the features in the map. Sigmoid, also known as the logistic activation function, compresses a real value to the range of 0 to 1 and can be applied to the output layer when the goal is to predict a probability: it maps large negative numbers close to zero and large positive numbers close to one. We adjust this function, as shown in Figure 8, to make its graph steeper around 0 in the x-direction. This function strengthens salient features and suppresses nonsalient features in the input image, thus forming a feature map similar to a heat map. Suppressing the information on the left side of the y-axis makes the background of the salient map cleaner.
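A minimal sketch of the adjusted activation: scaling the input before the sigmoid makes the curve steeper around zero, pushing values left of the y-axis toward 0 and values right of it toward 1. The steepness factor k is an illustrative placeholder; the exact adjustment shown in Figure 8 is not reproduced here.

```python
import torch

def heat_map_unit(x: torch.Tensor, k: float = 10.0) -> torch.Tensor:
    """Steepened sigmoid: strengthens salient responses (x > 0) and
    suppresses nonsalient responses (x < 0), yielding a heat-map-like map."""
    return torch.sigmoid(k * x)
```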

2.2.2. Feature Reinforcement Unit (FRU)

The feature reinforcement unit is the second U-shaped structure of the network, that is, an encoder-decoder structure. When the HMU suppresses nonsalient information, some valid information mixed with it may also be lost, so the FRU mainly supplements the information removed by the HMU. By using the characteristics of the U-Net structure, the FRU makes better use of deep and shallow information to reinforce the features of the preliminary salient feature map and outputs the result at the last convolution layer of the decoder. This output is fused with the output of the heat map unit, and the features of the convolution layer are further reinforced to obtain the final result, as shown in Figure 9.
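Putting the two units together, the reinforcement module can be sketched as follows; `fru` stands in for the second U-shaped encoder-decoder (not shown here), and `heat_map_unit` refers to the sketch above. The additive fusion mirrors the description but is otherwise an assumption.

```python
def reinforcement_module(prelim, fru):
    # FRU branch: a U-shaped encoder-decoder refines the preliminary map and
    # supplements information that the HMU branch may suppress.
    refined = fru(prelim)
    # HMU branch: the steepened sigmoid sharpens salient regions.
    heat = heat_map_unit(prelim)
    # Final result: fuse both branches by addition.
    return refined + heat
```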

2.3. Loss Function

To train the salient object detection model, we design the salient loss function following the format of the saliency loss in the previous method [36]; it combines three losses: BCE loss, SSIM loss, and IoU loss:

$$\mathcal{L} = \sum_{k=1}^{K}\left(\lambda_{bce}\,\ell_{bce}^{(k)} + \lambda_{ssim}\,\ell_{ssim}^{(k)} + \lambda_{iou}\,\ell_{iou}^{(k)}\right),$$

where $\ell^{(k)}$ represents the output loss at the $k$-th side; $K$ represents the total number of sides; $\lambda_{bce}$, $\lambda_{ssim}$, and $\lambda_{iou}$ represent the weights of the BCE, SSIM, and IoU losses, respectively; and $\ell_{bce}$, $\ell_{ssim}$, and $\ell_{iou}$ represent the BCE loss, SSIM loss, and IoU loss, respectively. Our model has 8 outputs ($K = 8$), including 7 outputs of the detection module and 1 output of the reinforcement module. The BCE loss $\ell_{bce}$ is defined as

$$\ell_{bce} = -\sum_{(r,c)}\left[G(r,c)\log S(r,c) + \left(1 - G(r,c)\right)\log\left(1 - S(r,c)\right)\right],$$

where $G(r,c)$ represents the ground truth value and $S(r,c)$ represents the predicted value at pixel $(r,c)$.

SSIM was originally proposed for image quality evaluation. It explores the structural information in an image by separating the influence of brightness. The SSIM measurement is composed of three comparison components: luminance, contrast, and structure:

$$SSIM(x, y) = \left[l(x, y)\right]^{\alpha}\left[c(x, y)\right]^{\beta}\left[s(x, y)\right]^{\gamma},$$

$$l(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1},\quad c(x, y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2},\quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3},$$

where $x$ and $y$ represent corresponding image patches; $\mu_x$, $\mu_y$, $\sigma_x$, $\sigma_y$, and $\sigma_{xy}$ are the local means, standard deviations, and covariance computed with a symmetric Gaussian weighting function over the local positions; the constants $C_1$, $C_2$, and $C_3$ avoid the instability of the system caused when the denominators approach 0; and the exponents $\alpha$, $\beta$, and $\gamma$ are greater than zero and are set to 1 in practice. The SSIM loss is then $\ell_{ssim} = 1 - SSIM(x, y)$.
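A compact sketch of an SSIM-based loss is given below; for brevity, a uniform local window replaces the Gaussian weighting described above, and the constants follow common SSIM defaults rather than this paper's settings.

```python
import torch.nn.functional as F

def ssim_loss(pred, target, window: int = 11, c1: float = 0.01 ** 2,
              c2: float = 0.03 ** 2):
    """1 - SSIM between a predicted saliency map and its ground truth,
    both of shape (N, 1, H, W) with values in [0, 1]."""
    pad = window // 2
    mu_p = F.avg_pool2d(pred, window, stride=1, padding=pad)
    mu_t = F.avg_pool2d(target, window, stride=1, padding=pad)
    var_p = F.avg_pool2d(pred * pred, window, stride=1, padding=pad) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, window, stride=1, padding=pad) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, window, stride=1, padding=pad) - mu_p * mu_t
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return 1 - ssim.mean()
```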

IoU is used as a standard evaluation measure and training loss for object detection and segmentation, which reflects the detection quality. The expression is as follows:

$$IoU = \frac{\mathrm{area}(G \cap S)}{\mathrm{area}(G \cup S)},$$

where $G$ is the manually annotated ground truth and $S$ represents the result predicted by the algorithm. The IoU criterion measures the correlation between the true and predicted results: the higher the correlation, the higher the value. The corresponding loss is $\ell_{iou} = 1 - IoU$.
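A soft, differentiable version of this IoU loss on probability maps can be sketched as follows; the small constant eps is an assumed safeguard against empty unions, not a value from the paper.

```python
def iou_loss(pred, target, eps: float = 1e-7):
    """1 - soft IoU between the predicted saliency map S and the ground
    truth G, computed per image and averaged over the batch."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()
```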

We have introduced the principle and calculation of the three loss functions, which play different roles during training. $\ell_{bce}$ is a pixel-level convergence measure in which different weights can be assigned to the foreground and the background. $\ell_{ssim}$ is computed over the local neighborhood of each pixel and assigns a higher weight to the boundary, making the boundary clearer. $\ell_{iou}$ measures the correlation between the real and predicted values at the map level. When combining these three losses, we use $\ell_{bce}$ to maintain a smooth gradient for all pixels, $\ell_{iou}$ to pay more attention to the foreground, and $\ell_{ssim}$ to enhance the object boundary information in the feature map.
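Combining the pieces, the total salient loss over all K = 8 side outputs can be sketched as below, reusing the `ssim_loss` and `iou_loss` sketches above; the unit weights are placeholders, since the paper's weight values are not reproduced in this text.

```python
import torch.nn.functional as F

def salient_loss(outputs, target, w_bce: float = 1.0, w_ssim: float = 1.0,
                 w_iou: float = 1.0):
    """outputs: list of predicted saliency maps (values in [0, 1]), one per
    side output; target: the ground truth mask of the same shape."""
    total = 0.0
    for s in outputs:
        total = total + w_bce * F.binary_cross_entropy(s, target) \
                      + w_ssim * ssim_loss(s, target) \
                      + w_iou * iou_loss(s, target)
    return total
```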

3. Experiment

3.1. Experimental Dataset

In this section, we first test the proposed method on the following five image saliency detection datasets:
(1) ECSSD contains 1000 images with complex structures.
(2) DUT-OMRON contains 5168 images with complex foreground structures, each of which usually has a complex background or multiple foreground objects.
(3) PASCAL-S contains 850 images with complex backgrounds and complex foreground objects.
(4) HKU-IS contains 4447 images with multiple foreground objects that overlap or touch the image boundary.
(5) DUTS is the largest dataset for image saliency detection and consists of two subsets: DUTS-TR and DUTS-TE. DUTS-TR contains 10553 images for training, and DUTS-TE contains 5019 images for testing.

We then apply the proposed method to the safety harness detection task as a practical application. For this application, we have collected 2200 images from power construction sites, 2000 for model training and 200 for testing.

3.1.1. Implementation and Experimental Setup

We use an eight-core PC with an AMD Ryzen 1800X 3.5 GHz CPU (32 GB memory) and a GTX 1080 Ti GPU (11 GB memory) for training and testing. We build our model on the basis of the BASNet framework and implement the proposed network in PyTorch. We train the network on the DUTS-TR dataset. During training, each image is first resized and then randomly cropped. For the optimizer, we use the Adam optimizer to train our network; its hyperparameters are the initial learning rate, betas, epsilon, and weight decay. During testing, each input image is resized and then input into the network to obtain the saliency map, which is resized back to the size of the input image using bilinear interpolation.
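A minimal training-step sketch under stated assumptions: `DUMRN` is a hypothetical wrapper around the DM and RM that returns all 8 side outputs, `loader` is assumed to yield resized-and-cropped DUTS-TR image/mask batches, and the Adam hyperparameters are left at PyTorch defaults here as placeholders because the exact values are not reproduced in this text.

```python
import torch

model = DUMRN()                                   # hypothetical model class (DM + RM)
optimizer = torch.optim.Adam(model.parameters())  # hyperparameters: placeholders only

for images, masks in loader:                      # `loader`: assumed DUTS-TR batches
    outputs = model(images)                       # 8 side outputs
    loss = salient_loss(outputs, masks)           # combined BCE + SSIM + IoU loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```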

3.1.2. Evaluation Metrics

We use three metrics to evaluate the proposed method: the F-measure, the MAE, and the S-measure.

The F-measure is calculated as follows:

$$F_{\beta} = \frac{\left(1 + \beta^{2}\right) \times \mathrm{Precision} \times \mathrm{Recall}}{\beta^{2} \times \mathrm{Precision} + \mathrm{Recall}},$$

where TP means that the classifier correctly recognizes a positive sample; TN means that the classifier correctly recognizes a negative sample; FP means that the classifier incorrectly recognizes a negative sample as positive; and FN means that the classifier incorrectly recognizes a positive sample as negative. Precision is defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

Recall is defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
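For completeness, a small NumPy sketch of the F-measure at a single binarization threshold is shown below; β² = 0.3 is the value commonly used in salient object detection evaluation, and the fixed threshold is an illustrative simplification of the usual threshold sweep.

```python
import numpy as np

def f_measure(pred, gt, beta2: float = 0.3, thresh: float = 0.5):
    """pred: predicted saliency map in [0, 1]; gt: binary ground truth mask."""
    p = (pred >= thresh).astype(np.float64)
    g = (gt >= 0.5).astype(np.float64)
    tp = (p * g).sum()
    precision = tp / (p.sum() + 1e-8)
    recall = tp / (g.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```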

MAE represents the average absolute error between the predicted and observed values, namely, the per-pixel average absolute deviation between the saliency map and its ground truth mask. MAE is a linear score, which means that all individual differences are weighted equally in the average. As a supplement to the PR curve, it is calculated as the average absolute difference between the pixel saliency value and the ground truth:

$$MAE = \frac{1}{W \times H}\sum_{r=1}^{H}\sum_{c=1}^{W}\left|S(r, c) - G(r, c)\right|,$$

where $W \times H$ is the area of the saliency map, $S(r, c)$ is the predicted saliency probability of a pixel, and $G(r, c)$ is the ground truth value of the pixel. The S-measure takes into account both the region-aware ($S_r$) and object-aware ($S_o$) structural similarity, $S = \alpha S_o + (1 - \alpha) S_r$, where α is set to 0.5.
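The MAE sketch below follows directly from the formula above; both maps are assumed to be normalized to [0, 1].

```python
import numpy as np

def mae(pred, gt):
    """Per-pixel mean absolute error between the saliency map and its mask."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()
```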

3.2. Ablation Study

In this section, we test the validity of each component of our model by conducting ablation experiments on the ECSSD dataset. To demonstrate the effectiveness of our detection and reinforcement network, we first use the FPN branch alone, and the proposed IPU, DFBU, SRU, HMU, and FRU are then added in turn. Table 1 gives the results of this ablation study.

3.3. Quantitative Evaluation

We compare our model with nine other models: AFNet, BASNet, EGNet, F3Net, GateNet, ITSD, LDF, MINNet, and PoolNet. To evaluate the quality of the segmented salient objects, Table 2 summarizes the F-measure ($F_\beta$), S-measure ($S_m$), and MAE measure for the largest region of all datasets. As Table 2 shows, the proposed method outperforms the other methods in both area and boundary measures when using ResNet-50 as the backbone. In particular, our method improves by 4.1%, 5.1%, 6.2%, 3.4%, and 5.9% on the ECSSD, HKU-IS, DUT-OMRON, DUTS-TE, and PASCAL-S datasets, respectively.

To further demonstrate the superior performance of our method, we show a qualitative comparison with other methods in Figure 10. The proposed method suppresses interference information under complex backgrounds and strengthens the effective information of salient objects in images.

3.4. Practical Application

To address safety harness detection in power production safety monitoring, we apply the proposed double U-shaped multireinforced unit structure network to the YOLOv5 detection model and test the performance on the aforementioned power construction site dataset. Figure 11 shows that the proposed method can accurately detect target saliency maps against the complex background of power construction sites, and it improves the detection accuracy by 10% compared with the original YOLOv5 network, as shown in Figure 12.

4. Conclusions

In this paper, we have proposed a double U-shaped multireinforced unit structure network (DUMRN) to improve object detection. The proposed network consists of a detection module (DM), a reinforcement module (RM), and a salient loss function (SLF). Quantitative evaluation on five public datasets shows that the proposed method achieves accurate performance and outperforms nine state-of-the-art methods. In addition, the safety harness detection experiment further verifies the effectiveness of the proposed method in a practical application. However, the proposed method still has some shortcomings. First, compared with general object detection methods, it consumes more time due to the salient object detection preprocessing. Second, it may not provide stable performance for small target detection. In the future, we will further expand the datasets for more practical applications and improve the speed of the proposed method by optimizing the network structure.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Qian Zhao and Haifeng Wang contributed equally to this work.

Acknowledgments

The authors thank Yunnan Power Grid Co., Ltd., Yuxi Power Supply Bureau for providing the safety harness detection dataset and Xuebin Qin et al. for providing the open-source code. This work was supported by the Yunnan Province Ten Thousand Talents Program, the Postgraduate Research Innovation Fund Project of Yunnan Normal University (ysdyjs2020148), and the Science and Technology Innovation Program of the Institute of Optics and Electronics, Chinese Academy of Sciences (20204001026).