Abstract

Slagging-off (i.e., slag removal) is an important preprocessing operation in steel-making that improves the purity of iron. Current manually operated slag removal schemes are inefficient and labor-intensive. Automatic slagging-off is desirable but challenging, as reliable recognition of iron and slag is difficult. This work focuses on realizing an efficient and accurate recognition algorithm for iron and slag, which is conducive to realizing the automatic slagging-off operation. Motivated by the recent success of deep learning techniques in smart manufacturing, we introduce deep learning methods to this field for the first time. The monotonous gray values of industrial images, poor image quality, and the nonrigid features of iron and slag challenge existing fully convolutional networks (FCNs). To this end, we propose a novel spatial and feature graph convolutional network (SFGCN) module. The SFGCN module can be easily inserted into FCNs to improve their ability to reason about global contextual information, which helps enhance the segmentation accuracy of small objects and isolated areas. To verify the validity of the SFGCN module, we create an industrial dataset and conduct extensive experiments. The results show that our SFGCN module brings a consistent performance boost to a wide range of FCNs. Moreover, by adopting a lightweight network as the backbone, our method achieves real-time iron and slag segmentation. In future work, we will dedicate our efforts to weakly supervised learning for the quick annotation of big data streams to improve the generalization ability of current models.

1. Introduction

Slagging-off is an essential operation in steel-making, used to remove high-sulfur slag from molten iron to improve the purity of the iron. The process of slagging-off is shown in Figure 1(a), and an actual image obtained by video capture is shown in Figure 1(b). In this process, molten iron is inevitably brought out together with the slag, and the loss of molten iron is directly proportional to the cleaning rate of slagging-off. Meanwhile, the slagging-off operation is accompanied by a decrease in molten iron temperature. Therefore, accuracy and efficiency are two key factors of the slagging-off operation, and they are directly related to production energy consumption. At present, manually operated machinery for slag removal is the commonly employed scheme in industrial applications. However, under long-term strong light and dense smoke conditions, manual operation easily leads to misidentification and misoperation; it is also inefficient. With the introduction of the Industry 4.0 paradigm, the trend is moving towards intelligent production lines, where automatic slagging-off will greatly benefit modern smart manufacturing.

Recognition of iron and slag is the premise of the automatic slagging-off operation. We formulate this problem as a semantic segmentation task, a fundamental problem of computer vision that aims to assign a category to each pixel in an image. Many classical machine vision methods have been proposed for image segmentation. However, the monotonous gray values of industrial images and the poor image quality caused by strong light and dense smoke challenge the performance of traditional computer vision algorithms. As far as the task of iron and slag segmentation is concerned, the results of several traditional algorithms, including K-means, Markov random field, and mean shift, are shown in Figure 2; they obviously cannot meet the requirements of industrial application. Currently, the state-of-the-art segmentation methods are mainly based on fully convolutional networks (FCNs) [1]. However, because convolutional operations model only local correlation, FCNs cannot effectively reason about relations between distant regions of arbitrary shape without stacking multiple convolution layers. To tackle this problem, many algorithms have been proposed to expand the receptive field of FCNs and capture long-range contextual information in the scene. Dilated convolution has been implemented to capture large objects, which introduces another problem: small objects may be ignored. Another research direction is fusing multiscale features [2], which is inefficient. Recently, self-attention mechanism-based methods [3] make use of an affinity matrix to model the relation between each spatial position and its neighborhoods. However, the memory and computational requirements of the large affinity matrix prevent the application of these methods to high-resolution image segmentation, such as the high-resolution images in our iron and slag segmentation task.

The monotonous gray values of industrial images, poor image quality, and the irregular, scattered shape of slag also challenge existing FCNs. Graph convolution is an efficient and effective operation for modeling global contextual information over regions in a single layer and has been widely employed in recent scene understanding works [4, 5]. Motivated by these works, we propose an effective and efficient spatial and feature graph convolutional network (SFGCN) module based on graph convolution. Different from previous works, our SFGCN module makes use of latent interaction spaces to efficiently perform global reasoning. It consists of two parallel branches that project feature maps to a latent spatial space and a latent feature space, respectively. Graph convolutions are then employed to perform relation reasoning. After graph reasoning, the updated information is reprojected back into the original coordinate space for further information extraction. Extensive experiments prove that our SFGCN module consistently improves the performance of current mainstream convolutional neural network backbones for iron and slag segmentation.

Our contributions can be summarized as follows:
(1) We formulate the slagging-off problem as an image-based semantic segmentation task and explore deep learning methods to tackle automatic iron and slag recognition for the first time.
(2) Considering the limitation of convolution operations, which model only local correlation, we propose an SFGCN module to effectively reason about global information interaction via weighted spatial graph convolution and feature graph convolution branches. The proposed network is termed SFGCNet.
(3) We establish an industrial slagging-off dataset and conduct extensive experiments. The results show that our SFGCN module brings consistent performance improvements for a wide range of network backbones for iron and slag segmentation. Moreover, taking a lightweight network as the backbone, our method achieves real-time segmentation of iron and slag.

2. Related Work

Fully convolutional networks (FCNs) have made great progress in semantic segmentation [1, 6]. Many variants have been proposed to improve segmentation performance; we briefly review several main research directions in the scene understanding domain, including network architecture implementation, global context reasoning, and graph-based reasoning.

2.1. Network Architecture Implementation

Atrous Spatial Pyramid Pooling (ASPP) has been proposed and employed in DeepLabv2 and v3 [7] to integrate multiscale contextual information; it contains multiple parallel dilated convolutions with different dilation rates. A variety of encoder-decoder structures have been implemented to make effective use of midlevel and high-level features [8, 9]. PSPNet [2] builds a novel pyramid pooling module to obtain multiscale contextual prior knowledge. DenseASPP [10] embeds multiscale features to expand the receptive field of convolution layers for the segmentation task. All these methods effectively stack multiple convolution layers to collect multiscale information.
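As an illustration, a minimal ASPP-style block might look as follows in PyTorch; the dilation rates (1, 6, 12, 18) and channel widths are illustrative assumptions rather than the exact settings of [7].

import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal ASPP-style block: parallel dilated convolutions whose
    outputs are concatenated and fused by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]  # multiscale context
        return self.fuse(torch.cat(feats, dim=1))        # fuse along channels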

2.2. Global Context Reasoning

Many methods have been proposed to overcome the limitation that convolution layers struggle to capture global context, such as the self-attention mechanism and nonlocal networks. The self-attention mechanism was first proposed in [11] to model long-range dependencies for machine translation and has been widely applied in many tasks in recent years [12]. PSANet [13] captures pixel-wise relations by applying an attention module in the spatial dimension. EncNet [14] and DFN [15] apply attention modules along the channel dimension of the feature map to account for global context. DANet [16] uses attention modules in both spatial and channel dimensions. Nonlocal networks [3, 17] aim to deliver long-range information from one position to another.
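To make the cost argument of Section 1 concrete, a simplified non-local operation in the spirit of [3] is sketched below; the 1x1-convolution embeddings and channel reduction are common design choices, not necessarily the exact configuration of the cited works. The affinity matrix has HW x HW entries, which is why memory grows quickly with resolution.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Simplified non-local block: every position attends to all others
    through a (HW x HW) affinity matrix."""
    def __init__(self, channels):
        super().__init__()
        reduced = channels // 2
        self.theta = nn.Conv2d(channels, reduced, 1)
        self.phi = nn.Conv2d(channels, reduced, 1)
        self.g = nn.Conv2d(channels, reduced, 1)
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        affinity = F.softmax(q @ k, dim=-1)            # (B, HW, HW) affinity matrix
        y = (affinity @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection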

2.3. Graph-Based Reasoning

Graph-based reasoning provides an efficient idea for global context reasoning. Random walk and conditional random field (CRF) networks have been proposed based on graphs for efficient image segmentation and classification. Recently, graph convolutional networks (GCNs) have been proposed for semisupervised image classification. Wang et al. [18] apply GCN to capture global contextual relations in video recognition. Chen et al. [4] explore GCN to reason about global relations in semantic segmentation. Yan et al. introduce GCN to describe skeleton connections for action recognition [19, 20]. Following these methods, we propose a novel dual GCN module consisting of spatial graph convolution and feature graph convolution to model global contextual information for iron and slag segmentation. Our SFGCN module makes use of latent spatial and feature spaces to efficiently realize global relation reasoning, which alleviates the memory and computation burden of global context reasoning while improving segmentation performance.

3. Methods

In this section, we first review graph convolution and then introduce the implementation of our SFGCN module. Finally, we detail the network architecture for slag segmentation.

3.1. Graph Convolution

Graph convolution is an efficient operation for reasoning about global context information, which overcomes the limitation that the convolution operation can only model local context. Graph convolution defined on a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with nodes $\mathcal{V}$ and edges $\mathcal{E}$ can effectively achieve global information interaction in a single operation. The specific operation can be defined as follows:

$$\mathbf{Z}=\sigma(\mathbf{A}\mathbf{X}\mathbf{W}),$$

where $\mathbf{X}$ denotes the states of the nodes, $\mathbf{A}$ is the adjacency matrix encoding the edges, $\mathbf{W}$ is the trainable weight matrix, and $\sigma(\cdot)$ is a nonlinear activation function.
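For concreteness, a single graph convolution layer of this form could be sketched in PyTorch as follows; the learnable adjacency initialization and the node/state dimensions are illustrative assumptions, not settings taken from the paper.

import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution step Z = sigma(A X W): the adjacency matrix A
    lets every node exchange information with all others in one operation."""
    def __init__(self, num_nodes, state_dim):
        super().__init__()
        self.A = nn.Parameter(torch.eye(num_nodes))   # learnable adjacency (assumed init)
        self.W = nn.Parameter(torch.empty(state_dim, state_dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, x):                             # x: (B, N, D) node states
        return torch.relu(self.A @ x @ self.W)        # global interaction in one step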

The specific implementation steps are shown in the following pseudocode (Algorithm 1):
(1) Project the feature map from the coordinate space to the graph space. We employ conventional convolution operations to project the feature map to the graph space after the feature extraction operation; the process is shown in Figure 3.
(2) Build the adjacency matrix to describe the connections between the nodes within the graph.
(3) Update the weight matrix.
(4) Reproject the graph back to the coordinate space.
The feature map extracted by the backbone network contains spatial and channel dimensions. Assuming that the spatial dimension abstracts the objects in the scene and the channel dimension encapsulates the detailed object features, the graph established in the spatial space describes the relevance between objects in the scene, while the graph established in the feature space expresses the relevance between object parts. Therefore, we conduct graph convolution on the spatial graph and the feature graph, respectively. The spatial branch is used to grasp the internal integrity of objects and the relationships between objects. The feature branch is used to characterize the details of objects and the relationships between features (Algorithm 1) [21].

Input: feature map X extracted by the convolutional network
Output: feature map Y after the graph convolution operation
 1: function SFGCN(X)
 2:  Project coordinate input X to graph space
 3:  V_s ← φ_s(X)  (spatial graph nodes, Section 3.2)
 4:  V_f ← φ_f(X)  (feature graph nodes, Section 3.3)
 5:  Build adjacency matrix A for each graph
 6:  Update weight matrix W
 7:  O ← σ(A V W)  (graph reasoning on each branch)
 8:  Reproject graph O to coordinate space
 9:  return Y = X + λ_s Y_s + λ_f Y_f
 10: end function
3.2. Graph Convolution in Spatial Space
3.2.1. Spatial Space Projection

As shown in Figure 4, before conducting the graph convolution operation, we first project the input feature map $\mathbf{X}\in\mathbb{R}^{C\times H\times W}$ to the latent spatial space to get the graph. In practice, a spatial downsampling operation is employed to transform the input feature into the node matrix $\mathbf{V}_s\in\mathbb{R}^{N\times C}$ in the latent spatial space, with $N=\frac{H}{d}\times\frac{W}{d}$, where $d$ represents the downsampling rate. We realize this projection with stacked depth-wise convolutions, each layer with a stride of 2. Then, $\mathbf{V}_s$ is obtained via $\mathbf{V}_s=\phi_s(\mathbf{X})$, where $\phi_s(\cdot)$ denotes the stacked stride-2 depth-wise convolutions.
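A minimal sketch of this projection in PyTorch is given below; the 3x3 kernel and the default downsampling rate of 4 are assumptions for illustration, since the exact values are not fixed here.

import math
import torch.nn as nn

def spatial_projection(channels, down_rate=4, kernel_size=3):
    """Stack stride-2 depth-wise convolutions; each layer halves H and W,
    so log2(down_rate) layers reach the target downsampling rate."""
    layers = [
        nn.Conv2d(channels, channels, kernel_size, stride=2,
                  padding=kernel_size // 2,
                  groups=channels)               # depth-wise: groups == channels
        for _ in range(int(math.log2(down_rate)))
    ]
    return nn.Sequential(*layers)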

3.2.2. Spatial Graph Convolution

After projecting the input feature to the graph, the graph consists of $N$ nodes. Each node of the graph integrates the information of a cluster of pixels in the feature map. To measure the correlation between nodes, we form an adjacency matrix $\mathbf{A}_s\in\mathbb{R}^{N\times N}$. The spatial graph convolution is implemented according to the following formulation to achieve global relation reasoning:

$$\mathbf{O}_s=\operatorname{softmax}(\mathbf{V}_s\otimes\mathbf{V}_s^{\top})\,\mathbf{V}_s\mathbf{W}_s,$$

where $\mathbf{A}_s=\operatorname{softmax}(\mathbf{V}_s\otimes\mathbf{V}_s^{\top})$ gives the adjacency matrix and $\operatorname{softmax}(\cdot)$ represents the softmax activation function, in which $\otimes$ is the dot-product operation. $\mathbf{W}_s$ is the weight matrix for updating information.

3.2.3. Reprojection

After relational reasoning, we reproject $\mathbf{O}_s$ back to the original coordinate space for compatibility with later operations. In contrast to the downsampling operation used in the graph projection, we directly employ nearest neighbour interpolation to upsample $\mathbf{O}_s$ to the original input size. Finally, the output feature map of the spatial graph convolution branch is obtained according to $\mathbf{Y}_s=\rho_s(\mathbf{O}_s)$, where $\rho_s(\cdot)$ denotes the nearest neighbour upsampling operation.
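Putting Sections 3.2.1-3.2.3 together, the whole spatial branch could look roughly as follows; this is a sketch under the notation above, with the downsampling rate and kernel size again assumed rather than prescribed.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGCNBranch(nn.Module):
    """Sketch of the spatial branch: project to N nodes with stride-2
    depth-wise convolutions, reason with A_s = softmax(V_s V_s^T) and W_s,
    then reproject with nearest neighbour interpolation."""
    def __init__(self, channels, down_rate=4):
        super().__init__()
        self.down = down_rate
        self.project = nn.Sequential(*[
            nn.Conv2d(channels, channels, 3, stride=2, padding=1, groups=channels)
            for _ in range(int(math.log2(down_rate)))])   # phi_s
        self.W_s = nn.Linear(channels, channels)          # weight matrix W_s

    def forward(self, x):             # x: (B, C, H, W), H and W divisible by down_rate
        b, c, h, w = x.shape
        v = self.project(x).flatten(2).transpose(1, 2)    # V_s: (B, N, C)
        A = F.softmax(v @ v.transpose(1, 2), dim=-1)      # A_s: (B, N, N)
        o = self.W_s(A @ v)                               # O_s = A_s V_s W_s
        o = o.transpose(1, 2).reshape(b, c, h // self.down, w // self.down)
        return F.interpolate(o, size=(h, w), mode='nearest')  # Y_s = rho_s(O_s)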

3.3. Graph Convolution in Feature Space

Spatial graph convolution models the spatial correlation of pixel clusters in a scene, which enables the network to make correlation predictions based on all objects in the whole scene. Next, we consider projecting the input feature map to the feature space and reasoning about correlations along the channel dimension. Assuming that the latter layers of the FCN respond to object parts and high-level semantic features, conducting GCN in the feature space can model the correlation of abstract features such as object parts. We first adopt a channel downsampling operation to reduce the channels of the input feature from $C$ to $C'$ and employ a linear combination function to aggregate information along the channel dimension. Finally, we obtain the projection of the input feature to the feature-space graph $\mathbf{V}_f\in\mathbb{R}^{N\times C'}$, where $N$ represents the number of nodes and $C'$ denotes the dimension of the state of each node. After the feature space projection, the feature graph convolution and reprojection are conducted according to the following equations:

$$\mathbf{O}_f=\sigma(\mathbf{A}_f\mathbf{V}_f\mathbf{W}_f),\qquad \mathbf{Y}_f=\rho_f(\mathbf{O}_f),$$

where $\mathbf{A}_f$ is the adjacency matrix of the feature graph, $\mathbf{W}_f$ denotes the trainable edge weights, and $\rho_f(\cdot)$ reprojects the graph back to the original coordinate space.

Considering the low dimension of the feature graph, we employ two 1D convolution layers as the adjacency matrix $\mathbf{A}_f$ and the trainable edge weights $\mathbf{W}_f$. To alleviate the optimization difficulty, the adjacency matrix is updated with a residual structure and reconstructed as $\mathbf{A}_f+\mathbf{I}$. Both $\mathbf{A}_f$ and $\mathbf{W}_f$ are randomly initialized and optimized with gradient descent during the training process.
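A rough PyTorch sketch of this branch follows; using two 1x1 convolutions for the projection (the C to C' channel reduction and the linear combination into N nodes) and the defaults N = 32, C' = 64 are assumptions for illustration.

import torch
import torch.nn as nn

class FeatureGCNBranch(nn.Module):
    """Sketch of the feature branch: project channels to N node states,
    apply 1D convolutions as adjacency (with an identity residual, i.e.,
    (A_f + I) V_f) and edge weights, then reproject."""
    def __init__(self, in_ch, num_nodes=32, state_dim=64):
        super().__init__()
        self.phi = nn.Conv2d(in_ch, state_dim, 1)      # channel downsampling C -> C'
        self.theta = nn.Conv2d(in_ch, num_nodes, 1)    # linear combination into N nodes
        self.A_f = nn.Conv1d(num_nodes, num_nodes, 1)  # adjacency as a 1D convolution
        self.W_f = nn.Conv1d(state_dim, state_dim, 1)  # trainable edge weights
        self.out = nn.Conv2d(state_dim, in_ch, 1)      # back to the input channels

    def forward(self, x):
        b, c, h, w = x.shape
        states = self.phi(x).flatten(2)                  # (B, C', HW)
        assign = self.theta(x).flatten(2)                # (B, N, HW)
        v = assign @ states.transpose(1, 2)              # nodes V_f: (B, N, C')
        v = v + self.A_f(v)                              # residual adjacency (A_f + I) V_f
        o = self.W_f(v.transpose(1, 2)).transpose(1, 2)  # edge-weight update
        y = assign.transpose(1, 2) @ o                   # reproject: (B, HW, C')
        return self.out(y.transpose(1, 2).reshape(b, -1, h, w))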

3.4. SFGCNet

Finally, the output of SFGCN is computed as $\mathbf{Y}=\mathbf{X}+\lambda_s\mathbf{Y}_s+\lambda_f\mathbf{Y}_f$, where "+" denotes point-wise summation and $\lambda_s$ and $\lambda_f$ are the learnable weight coefficients. Now we can easily embed our SFGCN module into existing network backbones (e.g., FCN and ResNet).
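Reusing the two branch sketches above, the module can be assembled as follows (again a sketch, not the authors' exact implementation):

import torch
import torch.nn as nn

class SFGCN(nn.Module):
    """Y = X + lambda_s * Y_s + lambda_f * Y_f, with both coefficients
    initialized to 1 and learned end-to-end (see Section 4.3.3)."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = SpatialGCNBranch(channels)   # sketched in Section 3.2
        self.feature = FeatureGCNBranch(channels)   # sketched in Section 3.3
        self.lambda_s = nn.Parameter(torch.ones(1))
        self.lambda_f = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return (x + self.lambda_s * self.spatial(x)
                  + self.lambda_f * self.feature(x))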

3.4.1. Implementation of SFGCNet

As shown in Figure 5, we embed the SFGCN module in the last stage of fully convolutional networks (FCNs) to achieve the segmentation of iron and slag. To verify the effectiveness of the SFGCN module, we construct SFGCNet by adopting FCN [1], BiSeNet [22], ICNet [23], and ResNet-50 [24] as the network backbone, respectively. BiSeNet and ICNet are two lightweight networks designed for real-time semantic segmentation.

4. Experiments

4.1. Dataset and Evaluation Metrics

As there is no public slagging-off dataset, we collect 7 videos from different industrial cameras. Because segmentation labeling is time-consuming and laborious, we randomly select only 24 clips from the 7 videos, each containing 64 frames. All of these clips are segmented manually with Photoshop by three raters following the same annotation protocol, and their annotations are approved by experienced workers. We then split these images into a training set and a test set at a fixed ratio. An annotation sample is presented in Figure 6. The training set is used to train the models, and the test set is used to validate the performance of the trained models.

The efficiency and accuracy of the model are the main considerations in industrial applications. The efficiency of the model can be evaluated by the inference time, the number of model parameters, and the total number of floating-point operations (FLOPs). To evaluate the accuracy of the model, we adopt the metrics commonly used in segmentation tasks, namely, Mean Intersection over Union (MIoU) and pixel accuracy (PA). The two metrics are defined as follows:

$$\mathrm{PA}=\frac{\sum_{i=0}^{k}p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k}p_{ij}},\qquad \mathrm{MIoU}=\frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}},$$

where $k+1$ is the number of classes, $p_{ii}$ represents the pixels predicted correctly (i.e., the true category of the pixel is class $i$, and the prediction is class $i$ too), and $p_{ij}$ means the pixel prediction is wrong (i.e., the true category of the pixel is class $i$, and the prediction is class $j$).
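These metrics can be computed from a class confusion matrix; a minimal NumPy sketch (with hypothetical helper names) is given below.

import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Entry (i, j) counts pixels whose true class is i and predicted class is j."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pa_miou(cm):
    """Pixel accuracy and mean IoU from a confusion matrix cm."""
    pa = np.diag(cm).sum() / cm.sum()
    iou = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm))
    return pa, np.nanmean(iou)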

4.2. Preprocessing

The annotation of semantic segmentation is time-consuming and labor-intensive, and it is difficult to obtain a large number of labeled images in industrial applications. Data augmentation is therefore an effective method to expand the dataset, helping to alleviate overfitting and enhance the robustness of the network. Considering that the images acquired by the video camera contain large background areas that do not benefit accuracy, we first crop the raw images to reduce the proportion of the background area. After that, we randomly apply the following data augmentation methods (a sketch of this pipeline is given after the list):
(1) Random horizontal and vertical flips
(2) Random scaling
(3) Random intensity shift
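A minimal version of such a pipeline is sketched below; the application probability and the scale/intensity ranges are placeholders (the source values are not reported here), and OpenCV is assumed for resizing.

import random
import numpy as np
import cv2  # assumed dependency for resizing

def augment(image, label, p=0.5, scale_range=(0.8, 1.2), shift_range=(-0.1, 0.1)):
    """Apply each augmentation with probability p; image is a float array
    normalized to [0, 1], label is an integer class map of the same size."""
    if random.random() < p:                      # random horizontal flip
        image, label = image[:, ::-1].copy(), label[:, ::-1].copy()
    if random.random() < p:                      # random vertical flip
        image, label = image[::-1, :].copy(), label[::-1, :].copy()
    if random.random() < p:                      # random scaling
        s = random.uniform(*scale_range)
        image = cv2.resize(image, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
        label = cv2.resize(label, None, fx=s, fy=s, interpolation=cv2.INTER_NEAREST)
    if random.random() < p:                      # random intensity shift
        image = np.clip(image + random.uniform(*shift_range), 0.0, 1.0)
    return image, label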

4.3. Experiments and Results
4.3.1. Experiment Setup

We implement our method with PyTorch. A cosine annealing learning rate policy is used with 30 warming-up epochs. The initial learning rate is set to 0.001 and adjusted based on the following formulation:

$$lr_t = lr_0\times\frac{1}{2}\left(1+\cos\left(\pi\,\frac{t-t_w}{T-t_w}\right)\right),\quad t\geq t_w.$$

Specifically, $t$ represents the current training epoch, $T$ is the total number of training epochs, $t_w$ denotes the number of warming-up epochs, and $lr_0$ is the initial learning rate. We train our model with the Adam optimizer and synchronized BN on four parallel Nvidia 2080Ti GPUs for 300 epochs. The batch size is set to 8 to guarantee the performance of batch normalization.
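The schedule can be written as a small helper; the linear warm-up shape below is an assumption, since the warm-up portion of the formula is not spelled out above.

import math

def cosine_lr(epoch, base_lr=0.001, total_epochs=300, warmup_epochs=30):
    """Cosine annealing with warm-up: linear ramp for the first
    warmup_epochs, then the cosine decay given in the text."""
    if epoch < warmup_epochs:                    # assumed linear warm-up
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))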

4.3.2. Experiment Results

We apply our SFGCN module to the last stage of typical backbones such as FCN, ResNet-50, BiSeNet, and ICNet to reason about long-distance dependencies. Considering the distribution difference between the industrial dataset and natural scene datasets, we train all the above backbones from scratch. As shown in Table 1, our SFGCN module widely improves the performance of different backbones. In terms of MIoU, the SFGCN module brings 5.63%, 0.83%, 3.49%, and 3.87% improvements on BiSeNet, ICNet, FCN, and ResNet-50, respectively. Benefiting from the global reasoning function of graph convolution, the SFGCN module makes isolated slag and molten iron regions easier to identify. As shown in Figure 7, while the dispersed areas of slag and iron are prone to being segmented incorrectly, SFGCN alleviates the influence of neighboring regions on the classification of these regions. On the other hand, the introduction of the SFGCN module results in only 2.81 ms, 1.17 ms, 0.79 ms, and 0.55 ms more inference time for BiSeNet, ICNet, FCN, and ResNet-50, respectively, as well as slight increases in parameters and FLOPs, which demonstrates that our SFGCN module is efficient. In particular, taking the lightweight BiSeNet as the backbone, our SFGCNet achieves real-time segmentation of iron and slag.

4.3.3. Ablation Studies and Discussion

Embedded location of the SFGCN module: our SFGCN module can be flexibly embedded in any stage of the network backbone, so it is worth exploring where embedding achieves the best results. The embedding location affects both the accuracy and the efficiency of the network. The feature maps of shallow layers have high resolution, which directly increases the parameters and FLOPs of the SFGCN module. From the perspective of feature extraction, shallow layers cannot capture abundant semantic information due to their limited receptive fields, which also limits the performance of the SFGCN module. Experiments show that the SFGCN module achieves higher efficiency when it is embedded in the last stage of the various backbones.

The effectiveness of each branch: to verify the effectiveness of the SGCN and FGCN branches, we conduct experiments on BiSeNet and FCN with the different settings listed in Table 2.

As shown in Table 2, both SGCN and FGCN boost the performance of BiSeNet and FCN. The introduction of SGCN and FGCN yields 3.82% and 4.46% MIoU improvements, respectively, over the BiSeNet baseline. Meanwhile, SGCN and FGCN outperform the FCN baseline by 2.15% and 2.81%, respectively. After integrating the SGCN and FGCN branches, our method achieves 5.63% and 3.49% performance boosts for BiSeNet and FCN. These results show that the SFGCN module benefits the segmentation of iron and slag.

The effects of the SGCN and FGCN branches are visualized in Figure 8. As shown in the third column, SGCN aggregates the information of pixel clusters and delivers messages between nodes, thus guaranteeing the integrity of objects. However, the spatial branch loses the details of each node while aggregating node information. The FGCN branch focuses more on reasoning about the details of objects, making up for this deficiency of the SGCN branch, which concentrates on the connections between objects. The refinement of the segmentation is significantly improved, as shown in the fourth column.

We compute the coefficients of the SGCN and FGCN branches to objectively evaluate the contributions of the two branches. The shortcut connection weight is set to 1, while $\lambda_s$ and $\lambda_f$ are initialized to 1 and learnable. The final coefficients of each branch of the SFGCN module in different backbones are shown in Table 3; the results show that the SGCN and FGCN branches do provide extra information for the segmentation. Moreover, the coefficients vary across network backbones. Therefore, the learnable coefficients provide the flexibility to adjust the contribution of each branch based on the information learned by the base network.

Effect of projection: as described in Section 3, we aggregate information along the spatial and channel dimensions to project the input feature map to the graph space. The downsampling rate of the SGCN branch directly determines the degree of spatial information aggregation: a large rate loses details, while a small rate retains useless information. The number of nodes in the FGCN branch likewise affects the relation reasoning between object features, and an appropriate number of nodes is important for recovering the details of each object. After conducting extensive experiments on our dataset, we observe that the SFGCN module brings the largest performance improvement when the spatial graph is downsampled to a small fraction of the input image size and the number of nodes of the feature graph is 32. We speculate that this degree of downsampling is better suited to the scale of objects in the slagging-off scene and that 32 nodes can better express the details of the objects.

5. Conclusion

In this work, we explore deep learning methods for iron and slag recognition. We formulate this problem as a semantic segmentation task and propose an SFGCN module that reasons about global contextual information according to the characteristics of the slagging-off task. Extensive experiments have verified that our method not only outperforms traditional segmentation methods but also widely improves the performance of current mainstream deep learning models on the slagging-off task. Taking a lightweight network as the backbone, our SFGCNet can realize real-time and accurate recognition of iron and slag, which provides a significant reference for the downstream automatic slagging-off operation.

Although our algorithm has achieved satisfactory results in terms of accuracy and efficiency, we need to expand the dataset to improve the performance of the model in more scenarios. Because it is difficult to label industrial big data manually, in future work we will dedicate our efforts to weakly supervised learning for the quick annotation of big data streams to improve the generalization ability of current models.

Data Availability

The raw/processed data required to reproduce these findings cannot be shared at this time as the data also form part of an ongoing study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by State Key Laboratory of Advanced Special Steel, Shanghai Key Laboratory of Advanced Ferrometallurgy and the Science and Technology Commission of Shanghai Municipality (No. 19DZ2270200), the Fundamental Research Funds for the China Central Universities of USTB (FR-DF-19-002), and Scientific and Technological Innovation Foundation of Shunde Graduate School, USTB (BK20BE014).