Semantic Segmentation under a Complex Background for Machine Vision Detection Based on Modified UPerNet with Component Analysis Modules
Semantic segmentation with convolutional neural networks under a complex background using an encoder-decoder network increases the overall performance of online machine vision detection and identification. To maximize segmentation accuracy under a complex background, the semantic response values of objects and components and their mutually exclusive relationship must be considered. In this study, we address the low accuracy of component segmentation: a basic encoder-decoder network is selected for semantic segmentation, and UPerNet is modified based on a component analysis module. The experimental results show that the accuracy of the proposed method improves from 48.89% to 55.62% while the segmentation time decreases from 721 to 496 ms. The method also performs well in the vision-based detection of 2019 Chinese Yuan anticounterfeiting features.
As one of the primary tasks of machine vision, semantic segmentation differs from image classification and object detection. Image classification recognizes the type of an object but provides no position information, whereas object detection locates the bounding box and type of an object but cannot provide its actual boundary. Semantic segmentation, in contrast, recognizes the type of an object and delineates its actual area at the pixel level, and it can implement machine vision detection functions such as positioning and recognition. Moving from image classification to object detection and finally to semantic segmentation, the accuracy of the output range and position information improves, and the recognition precision increases from the image level to the pixel level. Because semantic segmentation achieves the best recognition accuracy, it is useful for (1) distinguishing an entity from the background, (2) obtaining clearly physically defined position information (e.g., the centroid) by indirect calculation, and (3) performing machine vision detection and identification tasks that require high spatial resolution and reliability [5, 6].
Online semantic segmentation with convolutional neural networks (CNNs) under a complex background improves the overall performance of online machine vision detection and identification: the encoder-decoder architecture retains the convolutional and pooling layers and equivalently transforms the fully connected layer, which yields broad generalization. In recent years, ResNet has been used to replace shallow CNNs, significantly improving semantic segmentation results. For machine vision detection and identification under a random-texture complex background, the background must be eliminated without affecting the original features of the object. The difficulty lies in the randomness of the textured background, which prevents the use of typical periodic-texture elimination techniques such as frequency-domain filtering and image matrix methods [10, 11]. In contrast, the encoder-decoder semantic segmentation network retains the classification components in its backbone and thus exhibits larger receptive fields and better pixel recognition ability [12, 13], as depicted in Figure 1. An unreasonably selected, and consequently incorrectly used, component analysis module leads to an excessively small foreground range and the misjudgment of component pixels; if the component analysis module is too sensitive, the foreground range becomes too broad and misjudged pixels are difficult to remove. Therefore, semantic segmentation under a complex background must consider the contradiction between object and component semantic response values and their mutually exclusive relationship, while maximizing the segmentation accuracy of the encoder-decoder network.
Figure 2 shows a flowchart of semantic segmentation under a complex background using the encoder-decoder network. The process can be described as follows: the component classifier of the encoder-decoder network recognizes the pixel-level component semantics and response of each pixel in the image; the object classifier recognizes the pixel-level object semantics and response and extracts misjudged foreground-object pixels; finally, the mutually exclusive relationship between component semantics and object semantics is considered, and valid non-background semantics are determined to achieve effective semantic segmentation under a complex background and improve model accuracy.
In this study, we focus on online semantic segmentation under a complex background using the encoder-decoder network to solve the above-described mutual exclusion problem between component semantics and object semantics. The main contributions of this study are threefold: (i) we attempted to improve the low accuracy of component segmentation and selected the superior basic encoder-decoder network according to its performance; (ii) we modified UPerNet based on the component analysis module to maximize segmentation accuracy under a complex background while maintaining an appropriate segmentation time; (iii) we show that the proposed method is superior to previous encoder-decoder networks, with satisfactory accuracy and segmentation time, and demonstrate its application to banknote anticounterfeiting identification.
The rest of this paper is organized as follows. In Section 2, we outline related work. In Section 3, we introduce the method for semantic segmentation under a complex background using the encoder-decoder network and verify it experimentally. Finally, we present the conclusions.
2. Related Work
2.1. Evaluation of the Semantic Segmentation Performance
The semantic segmentation performance of a CNN is generally evaluated in terms of accuracy and running speed. The accuracy indicators usually include the pixel accuracy (PA), the mean intersection over union (mIoU), and the mean average precision (mAP). The pixel accuracy PA is the fraction of correctly segmented pixels among all image pixels; the mIoU measures the degree of overlap between the segmentation results and their ground truth; the mAP is the mean over classes of the average precision of segmentation results whose intersection over union is no less than a given threshold.
If the object detected by machine vision has k categories, the semantic segmentation model requires labels for the categories, including the background. Denoting by n_ij the number of pixels of class i recognized as class j, the pixel accuracy can be calculated as follows:

PA = (Σ_i n_ii) / (Σ_i Σ_j n_ij).
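The pixel-level metrics above can be computed from a class confusion matrix. The following sketch (the function names are our own) assumes two integer label maps of the same shape:

```python
import numpy as np

def confusion_matrix(gt, pred, k):
    # k x k matrix: n[i, j] = number of pixels of class i predicted as class j.
    n = np.zeros((k, k), dtype=np.int64)
    np.add.at(n, (gt.ravel(), pred.ravel()), 1)
    return n

def pixel_accuracy(n):
    # PA = correctly segmented pixels / all pixels (trace over total sum).
    return np.trace(n) / n.sum()

def mean_iou(n):
    # Per-class IoU = intersection / union, averaged over classes.
    inter = np.diag(n).astype(np.float64)
    union = n.sum(axis=0) + n.sum(axis=1) - inter
    return np.mean(inter / np.maximum(union, 1))
```

For a 2x2 image with one mislabeled pixel out of four, `pixel_accuracy` returns 0.75, matching the definition above.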
The running speed of CNN semantic segmentation can be measured by the segmentation time t_s, defined as the time needed for the algorithm to segment an image. The theoretically shortest possible time required to segment an image is labeled the theoretical segmentation time, and the time actually required by the algorithm is the actual segmentation time. Unless otherwise specified, t_s denotes the actual segmentation time.
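A minimal sketch of how the actual segmentation time could be measured in practice (the function and its warmup/runs parameters are illustrative, not from the paper):

```python
import time

def measure_segmentation_time(segment_fn, image, warmup=2, runs=10):
    # Average wall-clock time (in ms) for segment_fn to process one image.
    # A few warmup calls are discarded so one-time setup costs are excluded.
    for _ in range(warmup):
        segment_fn(image)
    t0 = time.perf_counter()
    for _ in range(runs):
        segment_fn(image)
    return (time.perf_counter() - t0) / runs * 1000.0
```

Averaging over several runs reduces the influence of scheduling jitter on the reported time.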
2.2. End-To-End Encoder-Decoder Semantic Segmentation Framework
Although CNN semantic segmentation performs as a single-step end-to-end process that is not further divided into multiple modules, the connection of its internal modules directly affects the CNN. The end-to-end encoder-decoder semantic segmentation framework enables the CNN to process images of any resolution and output prediction-map results of corresponding resolution. Typical networks include fully convolutional networks (FCN), SegNet, and U-Net.
Figure 3 shows a schematic of the FCN model. The FCN is an end-to-end semantic segmentation framework proposed by Jonathan Long et al. (University of California, Berkeley) in 2014. The main idea is as follows: the operation of a fully connected layer is equivalent to the convolution of a feature map with a kernel of identical size. The fully connected layer is therefore converted into a convolutional layer, turning the CNN into a fully convolutional network consisting only of convolutional and pooling layers that can process images of any resolution. In this manner, the limitation of the fully connected layer is overcome, i.e., images with different resolutions can be processed. The pooling layers serve as the encoder, and a cross-layer superimposed architecture serves as the decoder: the final output feature map of the network is upsampled and added to the output feature map of each pooling layer (namely, the encoder) to obtain a feature map with higher resolution, and the original resolution is restored by eight-times bilinear upsampling. Because the CNN can perform end-to-end semantic segmentation through this fully convolutional, cross-layer superimposed architecture, various CNNs are capable of end-to-end semantic segmentation. Using this framework, the mIoU reached 62.2% on the VOC2012 semantic segmentation test set, which is 10.6% higher than that of classic methods and 12.2% higher than that of SDS (mIoU of 50.0%), which combines CNN object detection with classical segmentation.
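The fully-connected-to-convolution equivalence described above can be checked numerically: a fully connected layer applied to a flattened feature map gives the same result as correlating the map with kernels of identical size. A small NumPy sketch (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w, out = 3, 4, 4, 5
feat = rng.standard_normal((c, h, w))
W = rng.standard_normal((out, c * h * w))   # fully connected weights

# Fully connected layer: flatten the feature map, then matrix-multiply.
fc_out = W @ feat.ravel()

# Equivalent convolution: reshape each weight row into a (c, h, w) kernel
# and correlate it with the whole feature map ("valid" mode -> 1x1 output).
kernels = W.reshape(out, c, h, w)
conv_out = np.array([(k * feat).sum() for k in kernels])

assert np.allclose(fc_out, conv_out)
```

Because the reshape preserves the flattening order, each kernel's correlation is exactly the corresponding dot product of the fully connected layer.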
When ResNet, proposed by He et al. at Microsoft Research, serves as the basic network for constructing FCNs for semantic segmentation, the mIoU on VOC2012 improves by 8.6%. However, because the FCN prediction results are obtained by eight-fold bilinear interpolation of the feature map, they suffer from detail loss, over-smoothed complex boundaries, and poor detection sensitivity for small objects. The results also ignore the global scale of the image and may exhibit regional discontinuity for large objects that exceed the receptive field. Moreover, incorporating full connection and upsampling increases the size of the network and introduces a large number of parameters to be learned.
Figure 4 shows a schematic of the SegNet model, an efficient real-time end-to-end semantic segmentation network proposed by Alex Kendall et al. (University of Cambridge) in 2015. Its idea is that the encoder and decoder have a one-to-one correspondence: the network applies the pooling indices recorded during the encoder's max pooling to perform nonlinear upsampling, forming a sparse feature map, and then performs convolution to generate a dense feature map. SegNet defines the basic encoder-decoder network and deletes the fully connected layers used to generate global semantic information. Because the decoder reuses encoder information without training, the required number of training parameters is 21.7% of that of the FCN. During prediction, SegNet and FCN occupy 1052 and 1806 MB of GPU memory, respectively; on a GTX 980 GPU (4096 MB of video memory), their occupancy is 25.68% and 44.09%, respectively, so the occupancy of SegNet is 18.41% lower than that of FCN. A SegNet design built on ResNet was also described, with an mIoU of 80.4% on VOC2012. The mIoU of SegNet tested on VOC2012 was reported to be 59.9%, 2.3% lower than that of FCN; furthermore, false detections occurred at object boundaries.
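SegNet's index-preserving pooling and nonlinear upsampling can be sketched as follows (a plain NumPy illustration of the mechanism, not the SegNet implementation; 2x2 windows are assumed):

```python
import numpy as np

def max_pool_with_indices(x, s=2):
    # 2D max pooling that records the argmax position inside each s x s window.
    h, w = x.shape
    out = np.zeros((h // s, w // s))
    idx = np.zeros((h // s, w // s), dtype=int)
    for i in range(h // s):
        for j in range(w // s):
            win = x[i * s:(i + 1) * s, j * s:(j + 1) * s]
            k = win.argmax()
            out[i, j] = win.flat[k]
            idx[i, j] = k
    return out, idx

def max_unpool(p, idx, s=2):
    # SegNet-style nonlinear upsampling: place each pooled value back at its
    # recorded position; all other positions stay zero (a sparse feature map).
    h, w = p.shape
    x = np.zeros((h * s, w * s))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(idx[i, j], s)
            x[i * s + di, j * s + dj] = p[i, j]
    return x
```

The sparse map produced by `max_unpool` is what SegNet then densifies with trainable convolutions; only the small index array, not the full feature map, must be kept from the encoder.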
Figure 5 shows a schematic of the U-Net model, proposed by Olaf Ronneberger et al. (University of Freiburg, Germany) in 2015. The idea was to design a basic network trainable directly on semantic segmentation images and to modify the FCN cross-layer overlay architecture: the high-resolution feature-map channels are retained in the upsampling section and concatenated with the decoder output feature map along the channel dimension. Furthermore, a tiling strategy not limited by GPU memory was proposed, with which seamless semantic segmentation of arbitrarily high-resolution images was achieved. With U-Net, IoU scores of 92.0% and 77.6% were achieved on the grayscale-image semantic segmentation datasets PhC-U373 and DIC-HeLa, respectively. Skip connections were later used in a ResNet framework to improve U-Net, achieving an mIoU of 82.7% on VOC2012. There are two key problems with the application of U-Net: the basic network needs to be trained from scratch, and it can only be applied to specific tasks, i.e., it has poor universality.
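U-Net's skip connection differs from FCN's cross-layer merge in that it concatenates encoder channels onto the decoder feature map rather than adding them element-wise. A minimal sketch (the helper name is ours):

```python
import numpy as np

def unet_skip_merge(decoder_feat, encoder_feat):
    # U-Net merge: concatenate encoder channels onto the upsampled decoder
    # feature map along the channel axis (axis 0 here, for (C, H, W) arrays),
    # unlike FCN's element-wise addition which keeps the channel count fixed.
    assert decoder_feat.shape[1:] == encoder_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([decoder_feat, encoder_feat], axis=0)
```

The concatenation doubles the channel count at the merge point, so the following convolution can learn how to combine low-level detail with high-level semantics instead of summing them with fixed unit weights.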
Figure 6 shows a schematic of the UPerNet model, proposed by Tete Xiao et al. (Peking University, China) in 2018. In the UPerNet framework, a pyramid pooling module (PPM) is appended to the last layer of the backbone network before it is fed into the top-down branch of a feature pyramid network (FPN). The object and part heads are attached to the feature map fused from all layers output by the FPN.
3. Material and Methods
Semantic segmentation under a complex background based on the encoder-decoder network is formulated as an optimization model that maximizes the accuracy PA subject to a constraint on the segmentation time. Under the encoder-decoder framework, the backbone network, its depth, and the decoder together determine the encoder-decoder configuration. By selecting the relatively better backbone and depth for the basic network, a component analysis module is proposed to improve the architecture, yielding an encoder-decoder network with optimized PA for semantic segmentation under a complex background. In the encoder-decoder network, the encoder transforms color images (three 2D arrays) into 2048 2D arrays. The encoder is composed of convolutional and pooling layers and can be pretrained on large-scale classification datasets, such as ImageNet, to gain stronger feature extraction capability.
Modeling of semantic segmentation under a complex background using the encoder-decoder network and selection of the backbone network and depth.
The encoder-decoder network is determined by the backbone network, its depth, and the decoder; the segmentation time and accuracy PA depend on all three. Denoting the maximum allowable segmentation time as t_0 (the recommended value is 600 ms), the mathematical model of the optimization for semantic segmentation under a complex background based on the encoder-decoder network is as follows: maximize PA(backbone, depth, decoder) subject to the segmentation time not exceeding t_0.
The parameters of the model to be optimized are the backbone network, its depth, and the decoder.
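Under this formulation, the selection reduces to picking the most accurate configuration whose segmentation time satisfies the constraint. An illustrative sketch (the candidate tuples are placeholders; only the 600 ms default follows the recommendation above):

```python
def select_architecture(candidates, t_max=600):
    # candidates: list of (name, pa_percent, t_s_ms) tuples.
    # Keep only configurations meeting the time constraint, then take the
    # one with the highest accuracy PA.
    feasible = [c for c in candidates if c[2] <= t_max]
    return max(feasible, key=lambda c: c[1]) if feasible else None
```

In practice the candidate list would be filled from measurements such as those in Table 1; the function simply encodes the constrained maximization.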
First, candidate backbone networks, depths, and decoders are combined. Then, the object segmentation accuracy, component segmentation accuracy, and segmentation time are compared to select the relatively better backbone and depth for the basic network.
The ADE20K dataset, which has diverse annotations of scenes, objects, parts of objects, and parts of parts, is selected; in this paper, we refer to parts of objects as components. Using a GeForce GTX 1080Ti GPU and the training method described in , we obtained the pixel accuracy and segmentation time for improved FCN, PSPNet, UPerNet, and other major encoder-decoder semantic segmentation networks on the ADE20K object/component segmentation dataset. The segmentation time of each network was evaluated on the ADE20K test set, which consists of 3000 images of various resolutions with an average size of 1.3 million pixels. Table 1 displays the pixel accuracy and segmentation time of the main network architectures on the ADE20K object/component segmentation tasks, where the relatively better indices are indicated by a rectangular contour.
From Table 1, the following observations can be made: ① in all networks, the component segmentation accuracy is lower than the object segmentation accuracy by about 30%; ② among networks 1, 2, and 3, which share the same backbone and depth, the accuracy and segmentation time are better for the UPerNet decoder than for the alternatives; ③ networks 3 and 4 share the same backbone and decoder, and doubling the depth improves the accuracy only slightly while lengthening the segmentation time significantly. After comprehensive consideration, we selected the UPerNet encoder-decoder network with a ResNet-50 backbone and a PPM + FPN decoder.
Figure 7 shows the architecture of semantic segmentation under a complex background implemented by UPerNet. The encoder, ResNet, halves the feature-map resolution at each stage; the output feature maps of its five stages are reduced to 1/2, 1/4, 1/8, 1/16, and 1/32 of the input resolution, respectively. The decoder is PPM + FPN: through pooling layers with different strides, the PPM analyzes the feature maps at multiple scales, and through three transposed convolutional layers, the resolution of the feature maps is successively doubled to 1/16, 1/8, and 1/4. Upsampling then restores the feature map to full (1/1) resolution. The component analysis module classifies the feature map and outputs both the object and component segmentation results.
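The multiscale analysis performed by the PPM can be sketched for a single-channel map as follows (a simplified NumPy illustration with nearest-neighbor upsampling; the scale set (1, 2, 3, 6) follows the common PPM configuration and is an assumption here):

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    # Average-pool an (h, w) map into a bins x bins grid.
    h, w = x.shape
    out = np.zeros((bins, bins))
    for i in range(bins):
        for j in range(bins):
            out[i, j] = x[i * h // bins:(i + 1) * h // bins,
                          j * w // bins:(j + 1) * w // bins].mean()
    return out

def ppm(x, scales=(1, 2, 3, 6)):
    # Pyramid pooling: pool at several grid sizes, upsample each grid back
    # to the input resolution (nearest-neighbor via Kronecker product), and
    # stack the results with the input as extra channels.
    # Assumes every scale divides the spatial size evenly.
    h, w = x.shape
    pooled = [np.kron(adaptive_avg_pool(x, s), np.ones((h // s, w // s)))
              for s in scales]
    return np.stack([x] + pooled)
```

The coarsest (1x1) level carries the global context of the whole map, while the finer grids preserve progressively more spatial layout; a real PPM would follow each level with a 1x1 convolution before concatenation.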
Figure 8 shows the component analysis module of UPerNet, which comprises the object classifier, the component classifier, and the component analysis logic. The input of each classifier is the full-resolution (1/1) feature map. The object classifier performs semantic recognition over the object classes and outputs the object probability vector and the object label; the component classifier performs semantic recognition over the component classes and outputs the component probability vector and the component label. According to the object label and the set of components belonging to each object, the component analysis module segments only the component labels that are valid for the recognized object and outputs the valid component label. UPerNet thus outputs the object segmentation result (the object label) and the component segmentation result (the valid component label).
The component analysis module of UPerNet can be expressed as follows: the valid component label equals the component label if that component belongs to the component set of the recognized object, and the background label otherwise.
A higher object segmentation accuracy leads to a higher component segmentation efficiency.
Equation (4) outputs the component labels that are consistent with the recognized object. By identifying deviations of the object label caused by the relationship between the object and component responses, an optimized component analysis module can improve the efficiency of component segmentation: it both meets the consistency requirement and improves the accuracy.
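The gating rule described above, which keeps a component label only when it belongs to the component set of the recognized object, can be sketched as follows (array shapes and names are illustrative, not from the paper's code):

```python
import numpy as np

def component_analysis(p_obj, p_comp, comp_sets, background=0):
    # p_obj:  (K_o, h, w) per-pixel object class probabilities.
    # p_comp: (K_c, h, w) per-pixel component class probabilities.
    # comp_sets[o] = set of component labels that are valid for object o.
    l_obj = p_obj.argmax(axis=0)      # per-pixel object label
    l_comp = p_comp.argmax(axis=0)    # per-pixel component label
    valid = np.zeros_like(l_comp, dtype=bool)
    for o, comps in comp_sets.items():
        # A component label survives only where its parent object was predicted.
        valid |= (l_obj == o) & np.isin(l_comp, list(comps))
    return np.where(valid, l_comp, background)
```

Pixels whose component label contradicts the predicted object fall back to the background class, which is exactly the mutual-exclusion constraint between object and component semantics.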
Improvements of UPerNet for semantic segmentation under a complex background based on the component analysis module.
In this subsection, we describe the derivation of the component analysis module, the optimization of the function expression of the module, and the construction of the architecture of the component analysis module.
As shown in Figure 8, the component classifier recognizes component semantics and outputs, for the pixel at each image position, the component label and the probability vector over the component labels. The relationship between the label and the probability vector is as follows:
From equations (4) and (5), we obtain equation (6), where the weight is the probability of the corresponding object label. Weighting the component probabilities by the object probability vector, instead of gating on the single object label, reduces the influence of low-probability object labels and increases the accuracy. In addition, if the maximum weighted response falls below a threshold, assigning the pixel to the background increases the detection rate of background pixels. The module can therefore be expressed as equation (7), which is the component analysis module yielded by replacing the hard object label with the object probability vector and considering the background threshold.
The optimized architecture of the UPerNet component analysis module is proposed based on equation (7). Figures 9(a)–9(c) show the optimized architectures obtained by replacing the hard object label with the object probability vector, by considering the background threshold, and by applying both modifications, respectively.
3.1. Experimental Results
3.1.1. ADE20K Component Segmentation Task
For the UPerNet model, the backbone network of the encoder was ResNet, and the decoder was PPM + FPN with the component analysis modules (before/after modification). We trained each network on the ADE20K object/component segmentation dataset and measured the pixel accuracy and segmentation time. The experiments were run on a GeForce GTX 1080Ti GPU.
Table 2 reports the pixel accuracy and segmentation time of UPerNet with different component analysis modules on the ADE20K component segmentation task. From the results, the following observation can be made: (i) the pixel accuracy of ResNet + PPM + FPN with the proposed modified component analysis modules under different settings increased from 48.30% (without component analysis modules) to 54.03%, 55.13%, and 55.62%, while the segmentation time lengthened only marginally from 483 to 492, 486, and 496 ms, respectively.
The UPerNet with modified component analysis modules showed significantly higher segmentation performance. Both the accuracy and segmentation time outperformed those of the UPerNet with a deeper backbone: the accuracy and segmentation time of the modified architecture were 55.62% and 496 ms, while those of the unmodified architectures with ResNet-101 and ResNet-152 backbones were 48.71% and 598 ms and 48.89% and 721 ms, respectively, as shown in Figure 9(c).
3.1.2. CITYSCAPES Instance-Level Semantic Labeling Task
We trained each UPerNet (with/without the component analysis module) on the instance-level semantic labeling task of the CITYSCAPES dataset. To assess instance-level performance, CITYSCAPES uses the mean average precision AP and the average precision at 50% overlap. We also report the segmentation time of each network run on a GeForce GTX 1080Ti GPU and an Intel i7-5960X CPU. Table 3 presents the performance of different methods on the CITYSCAPES instance-level semantic labeling task, and Table 4 presents the class-level AP of UPerNet with/without the component analysis module on the same task. It can be seen that the modified component analysis modules effectively improved the performance of UPerNet: with the component analysis module, both AP and the average precision at 50% overlap improved, the segmentation time increased only slightly from 447 to 451 ms, and most class-level AP values improved. Figure 10 shows some CITYSCAPES instance-level semantic labeling results obtained by UPerNet with/without the component analysis module.
Taking banknote detection as an example, we built semantic segmentation models with the component analysis modules (before/after modification) for the vision-based detection of 2019 Chinese Yuan (CNY) anticounterfeiting features under backlight, to demonstrate the segmentation performance of the proposed method.
The vision-based detection system consisted of an MV-CA013-10 GC industrial camera, an MVL-HF2528M-6MP lens, and an LED strip light. The field of view was 18.33°, and the resolution was 1280 × 1024. Under the backlight, we collected 25 CNY images of the fronts and backs of various denominations at random angles. Then, we marked four types of light-transmitting anticounterfeiting features, namely, security lines, pattern watermarks, denomination watermarks, and Yin-Yang denominations. All four features were detected in the CNY images to generate our dataset (200 images). We trained the model with different component analysis modules on this dataset and measured the pixel accuracy and segmentation time. Table 5 presents the pixel accuracy and segmentation time of UPerNet with different component analysis modules for the vision-based detection of CNY anticounterfeiting features, and Figure 11 shows the segmentation results of the anticounterfeiting features detected by UPerNet with/without the component analysis module.
From Table 5, it can be seen that the proposed method improved the pixel accuracy from 90.38% to 95.29% while the segmentation time increased only from 490 to 496 ms. Moreover, the average precision increased from 96.1% to 100%, detecting all the light-transmitting anticounterfeiting features without false, missed, or repeated detections.
In this study, we performed semantic segmentation under a complex background using the encoder-decoder network to solve the problem of the mutually exclusive relationship between the semantic response values and the semantics of objects/components in online machine vision detection. The following conclusions can be drawn. (i) Considering this mutually exclusive relationship, we established a mathematical model of semantic segmentation under a complex background based on the encoder-decoder network for optimization and selected the best-performing encoder (a ResNet backbone) and decoder (PPM + FPN). (ii) We replaced the hard object label with the object probability vector and introduced a background threshold in the component analysis module of UPerNet to improve the performance of the encoder-decoder network. (iii) The experimental results show that the component analysis module improves the performance of semantic segmentation under a complex background: both the accuracy and segmentation time of the proposed model were better than those of the UPerNet with a deeper backbone, with the accuracy improving from 48.89% to 55.62% and the segmentation time decreasing from 721 to 496 ms. In the vision-based detection of the 2019 CNY features, the proposed method improved the pixel accuracy from 90.38% to 95.29% while the segmentation time increased only slightly from 490 to 496 ms; the average precision also increased from 96.1% to 100%, detecting all the light-transmitting anticounterfeiting features without false, missed, or repeated detections.
The model in which the hard object label was replaced with the object probability vector, together with the corresponding component analysis module, improved the performance of the UPerNet encoder-decoder network. However, the efficiency improvement is limited by the accuracy of object segmentation. In our next study, we will investigate the applicability of machine learning to the component analysis module to achieve higher performance in different applications.
The ADE20K Dataset used to support the findings of this study is available at http://groups.csail.mit.edu/vision/datasets/. The CITYSCAPES Dataset used to support the findings of this study is available at https://www.cityscapes-dataset.com. Its pretrained models and code are released at https://github.com/CSAILVision/semantic-segmentation-pytorch.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This research was funded by the Key-Area Research and Development Program of Guangdong Province (Grant no. 2019B010154003) and the Guangzhou Science and Technology Plan Project (Grant no. 201802030006).
C. Szegedy, W. Liu, and Y. Jia, “Going deeper with convolutions,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 1–9, IEEE, Boston, MA, USA, June 2015.
K. He, X. Zhang, and S. Ren, “Deep residual learning for image recognition,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 770–778, IEEE, Las Vegas, NV, USA, June 2016.
L. Geng, Y. Wen, and F. Zhang, “Machine vision detection method for surface defects of automobile stamping parts,” American Scientific Research Journal for Engineering, Technology, and Sciences, vol. 53, no. 1, pp. 128–144, 2019.
S. Liu, J. Huang, and G. Liu, “Technology of multi-category legal currency identification under multi-light conditions based on AlexNet,” China Measurement & Test, vol. 45, no. 9, pp. 118–122, 2019, in Chinese.
G. Liu, S. Liu, and J. Wu, “Machine vision object detection algorithm based on deep learning and application in banknote detection,” China Measurement & Test, vol. 45, no. 5, pp. 1–9, 2019, in Chinese.
H. Gao, H. Yuan, and Z. Wang, “Pixel transposed convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 5, pp. 1218–1227, 2019.
J. Huang and G. Liu, “The development of CNN-based semantic segmentation method,” Laser Journal, vol. 40, no. 5, pp. 10–16, 2019, in Chinese.
S. Nowozin, “Optimal decisions from probabilistic models: the intersection-over-union case,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 548–555, IEEE, Columbus, OH, USA, June 2014.
K. He and J. Sun, “Convolutional neural networks at constrained time cost,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 5353–5360, IEEE, Boston, MA, USA, June 2015.
J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 3431–3440, IEEE, Boston, MA, USA, June 2015.
X. Li, Z. Liu, and P. Luo, “Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 6459–6468, IEEE, Honolulu, HI, USA, July 2017.
H. Zhao, J. Shi, and X. Qi, “Pyramid scene parsing network,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 6230–6239, IEEE, Honolulu, HI, USA, July 2017.
D. Kim, J. Kwon, and J. Kim, “Low-complexity online model selection with Lyapunov control for reward maximization in stabilized real-time deep learning platforms,” in Proceedings of the Systems, Man and Cybernetics, pp. 4363–4368, Miyazaki, Japan, January 2018.
M. Cordts, O. Mohamed, and S. Ramos, “The Cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223, IEEE, Las Vegas, NV, USA, June 2016.