Abstract

Remaining useful life (RUL) prediction of a milling tool is one of the determinants in making scientific maintenance decisions for a CNC machine tool. Predicting the RUL accurately can improve machining efficiency and product quality. Deep learning methods have strong learning capability and are extensively used in RUL prediction. The multiscale CNN (MSCNN), a typical deep learning model for RUL prediction, has a large number of parameters because of its parallel convolutional pathways, resulting in high computing cost. Besides, the MSCNN ignores the differing influence of degradation features at different scales on RUL prediction accuracy. To address these issues, a pyramid CNN (PCNN) is proposed in this paper for RUL prediction of the milling tool. Group convolution replaces the parallel convolutional pathways to extract multiscale features without a large number of additional parameters, and channel attention with soft assignment selects the key degradation features across different sensors and scales. In the milling tool wear experiments, the proposed method achieved a score value of 51.248 ± 1.712 and an RMSE of 19.051 ± 0.804, confirming better performance than the traditional MSCNN and other deep learning methods. Besides, the number of parameters of the proposed method is reduced by 62.6% and 54.8% compared with the MSCNN with self-attention and the MSCNN, respectively, confirming its lower computing cost.

1. Introduction

As a basic tool of industry, the computer numerical control (CNC) machine tool plays an important role in industrial manufacturing. With the increasing demand for product quality, the stability of the machining process becomes more and more important. Tool wear commonly degrades machining quality during high-speed machining [1]. It not only affects the quality of the machined surface and the machining precision but also increases machining cost. Moreover, unnecessary tool replacement intended to prevent a decrease in surface quality increases downtime and machining cost in high-speed milling [2]. The main factors affecting tool degradation are the cutting parameters, the work material, and the cutting tool. However, how these factors govern tool degradation is hard to determine because of their many combinations. Since tool wear cannot be directly detected during the process, it is hard to make scientific maintenance decisions without interrupting the machining process. Therefore, it is important to accurately predict the remaining useful life (RUL) of the milling tool.

With the wide use of the industrial Internet of Things in machinery condition monitoring, a mass of monitoring data from the CNC machine tool is acquired by various sensors. The explosive growth of monitoring data brings new opportunities to RUL prediction of the milling tool. Compared with model-driven RUL prediction methods, data-driven RUL prediction methods can learn the degradation characteristics of a tool from massive monitoring data. They can also build the corresponding RUL prediction models automatically, which means that neither a deep understanding of system-failure physics nor complete knowledge of the dynamics is required. Therefore, data-driven RUL prediction methods have recently been gaining more and more attention in the field of RUL prediction [3].

Traditional data-driven prognostic approaches usually contain three steps: hand-crafted feature extraction, degradation behavior learning, and RUL prediction [4, 5]. Hand-crafted feature extraction uses signal processing methods to extract sensitive degradation features from the monitoring data. Then, these features are fed into machine learning models, such as ridge regression and support vector machine (SVM), to learn the degradation features and predict the RUL. For example, Park et al. [6] extracted time, frequency, and time-frequency domain features and input them into a ridge regression model after dimension reduction using PCA. Zhao et al. [7] extracted high-dimensional features using time-frequency representation (TFR) and fed them into a simple multiple linear regression model to predict the RUL after supervised dimensionality reduction using PCA and LDA. Liu et al. [8] integrated empirical mode decomposition (EMD) and the Wigner–Ville distribution (WVD) to extract degradation features from gearbox vibration signals, and then used a particle filter (PF) with a state space model based on the Wiener process to predict the RUL of the gearbox from these features. Even though these methods perform well in RUL prediction, they still require much effort on hand-crafted feature design [9, 10]. To avoid this, it is desirable to extract degradation features from monitoring data automatically. Therefore, deep learning-based RUL prediction methods have gained more and more attention in the field of data-driven RUL prediction [11–20].

Deep learning, structured as a stack of multiple layers of nonlinear processing units [21], can extract high-level features without human intervention. Thus, deep learning shows a more powerful feature extraction ability and achieves state-of-the-art accuracy in many tasks, such as image classification, natural language processing (NLP), and target detection. The deep belief network (DBN), auto-encoder network (AEN), recurrent neural network (RNN), and convolutional neural network (CNN) are mainstream architectures in deep learning [22]. Wang et al. [23] proposed a deep separable convolution network (DSCN) for RUL prediction of bearings, which extracted the degradation features from monitoring data using deep separable convolution and predicted the RUL using fully connected layers. Hinchi and Tkiouat [24] used CNNs to extract features from vibration signals and then employed LSTM to predict the RUL of rolling element bearings. Zhang et al. [25] proposed a multiobjective DBN ensemble method for RUL prediction of turbofan engines. Wang et al. [26] used DCAE and SOM to obtain the health index of a rolling bearing and then used this health index as a label to train a CNN-based RUL prediction model. Ding et al. [27–29] proposed three meta deep learning methods to predict the RUL of machines under different conditions with limited and variable-length data. Zhang et al. [30] proposed a deep representation regularization-based transfer learning method for RUL prediction under different machinery operating conditions without target-domain run-to-failure training data.

Because of their remarkable ability to extract degradation features from monitoring data, CNN-based RUL prediction methods have become a research hotspot, especially the multiscale CNN (MSCNN) [31–39]. The architecture of the traditional MSCNN with self-attention is shown in Figure 1. Parallel convolutional pathways, each with a different convolution kernel size, are used to extract degradation features at different scales. Self-attention is embedded to avoid the interference caused by redundant and uncorrelated information from some sensors, improving the performance of the network. The parallel learning strategy, however, greatly increases the number of model parameters, leading to a higher computing cost during model training. In addition, the self-attention can only consider the contribution of different sensors to RUL prediction. In other words, the contribution of degradation features at different scales is not taken into account.

To deal with these problems, a pyramid CNN (PCNN) is proposed in this paper. The architecture of the proposed PCNN is shown in Figure 2. The monitoring data acquired from different sensors can be directly fed into the proposed network without any preprocessing, which means that complex signal processing techniques are not required. The network contains two parts: a multiscale feature learning subnetwork and an RUL predicting subnetwork. The multiscale feature learning subnetwork is built by stacking one-dimensional (1D) convolution layers and pyramid convolution layers. Low-level features are extracted by the 1D convolution layers and fed into the pyramid convolution layers. In the pyramid convolution layers, group convolution is used to extract multiscale high-level degradation features. Then, the channel attention model is used to generate an attention weight for each channel. A soft assignment is used to recalibrate the attention weights of different scales so that the key degradation features can be selected not only from different sensors but also from different scales. The RUL predicting subnetwork contains global pooling and fully connected layers (FCLs), which establish the mapping between degradation features and the RUL. The tool wear experiment is used to verify the proposed method. Compared with the traditional MSCNN, the proposed method has higher RUL prediction accuracy and fewer parameters.

The rest of this article is structured as follows. The basic theory of the proposed method is described in detail in Section 2. Experiments and comparison analyses are presented in Section 3. Conclusions are drawn in Section 4.

2. Proposed PCNN for RUL Prediction of Milling Tool

2.1. One-Dimensional (1D) Convolution Layer and Shortcut Connection

One-dimensional (1-D) convolution is used to extract degradation features from raw data in this paper. The 1-D convolutional operation can be described as

$$y = \sigma(w * x + b)$$

where $x$ is the raw data, $y$ is the output of the process, $w$ is the learnable convolutional kernel, $b$ is the bias term, $*$ represents the convolutional operation, and $\sigma$ is the nonlinear activation function. In this paper, the rectified linear unit (ReLU) is used as the nonlinear activation function of the 1-D convolution operation. By repeating this process twice, low-level degradation features can be obtained.
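The two stacked convolution steps described above can be sketched in plain Python. This is only an illustrative sketch: the signal, the kernels, the zero bias, and the function names `conv1d`, `relu`, and `conv_layer` are assumptions made for the example, not values or code from the paper, and a single channel with no padding is assumed.

```python
from typing import List


def relu(values: List[float]) -> List[float]:
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def conv1d(signal: List[float], kernel: List[float], bias: float = 0.0) -> List[float]:
    """Valid-mode 1-D convolution as used in CNNs (no kernel flip, no padding)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def conv_layer(signal: List[float], kernel: List[float], bias: float = 0.0) -> List[float]:
    """One convolution layer: activation(kernel * signal + bias) with ReLU."""
    return relu(conv1d(signal, kernel, bias))


# Hypothetical raw signal and kernels, chosen only for illustration.
raw = [0.0, 1.0, 3.0, 2.0, -1.0, -2.0]
# Applying the layer twice gives the low-level features of this toy signal.
low_level = conv_layer(conv_layer(raw, [1.0, -1.0]), [0.5, 0.5])
```

In a real network the kernels and biases are learned, and each layer has many input and output channels; the sketch keeps one channel so that the arithmetic stays visible.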

Gradient vanishing/exploding and weight matrix degradation are considerable problems in deep learning. To address these issues, a shortcut connection is introduced in this network.

The raw data acquired from the sensors are fed into the shortcut connection pathway, which contains a convolution layer and a max pooling layer. The convolutional kernel in the shortcut connection is sized to increase the dimension of its input, and the max pooling layer is used to downsample the output of the convolution layer. The output of the shortcut connection model is obtained by combining the output of the pyramid convolution layer with the pooled output of this convolution.

2.2. Pyramid Convolution Layer

In this layer, multiscale high-level degradation information from different sensors is extracted and fused. First, a group convolution operation is used to extract high-level degradation features at different scales. Next, the channel attention model is used to generate the attention weights of the multiscale features. Finally, the soft assignment is used to recalibrate the attention weight of the corresponding scale.

2.2.1. Group Convolution

The monitoring data acquired from the sensors are nonlinear signals containing a lot of noise. While the degradation features can be extracted by the convolution operation, the receptive field of the convolution kernels has a great influence on the extracted features. Large-scale degradation features can be extracted with a larger receptive field, while detailed degradation features can be extracted with a smaller receptive field. Therefore, it is necessary to use convolution kernels of different sizes to extract multiscale degradation features. The traditional multiscale convolution uses parallel pathways, each with a different convolution kernel size, to extract multiscale features. Although the performance of such networks has been demonstrated, the large number of parameters increases the computing cost. Therefore, an efficient multiscale feature extraction method is desirable.

In this paper, group convolution is used to replace the parallel convolutional pathways so that multiscale features can be extracted without a large number of additional parameters. The architecture of this model is shown in Figure 3.

The input low-level feature is split into groups along the channel direction, so that each group holds a subset of the channels over the full feature length. Each group is convolved with its own set of learnable kernels, summing over the input channels of the group and adding a bias term, which yields the output feature of that group. The convolution kernels of different groups have different sizes, so each group extracts degradation features at a different scale. Finally, the whole multiscale feature is obtained by concatenating the outputs of all groups.
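The parameter saving of the split design can be illustrated with a simple count. The sketch below compares parallel pathways, where every pathway sees all input channels, with split groups, where each group sees only its share of the channels. The channel counts, kernel sizes, and function names are hypothetical and are not taken from the paper's Table 3 or Table 7.

```python
from typing import List


def conv1d_params(c_in: int, c_out: int, kernel_size: int) -> int:
    """Weights plus biases of a 1-D convolution layer."""
    return c_out * c_in * kernel_size + c_out


def parallel_pathway_params(c_in: int, c_out_per_path: int, kernel_sizes: List[int]) -> int:
    """Parallel pathways: each pathway convolves ALL input channels."""
    return sum(conv1d_params(c_in, c_out_per_path, k) for k in kernel_sizes)


def split_group_params(c_in: int, c_out: int, kernel_sizes: List[int]) -> int:
    """Split groups: channels are divided evenly, one kernel size per group."""
    n = len(kernel_sizes)
    assert c_in % n == 0 and c_out % n == 0, "channels must divide evenly"
    return sum(conv1d_params(c_in // n, c_out // n, k) for k in kernel_sizes)


# Hypothetical layer: 64 input channels, 64 output channels in total, four scales.
kernels = [3, 5, 7, 9]
parallel = parallel_pathway_params(64, 16, kernels)  # 4 pathways x 16 channels
grouped = split_group_params(64, 64, kernels)        # 4 groups x 16 channels
```

With these illustrative sizes both designs output 64 channels, but the parallel design needs 24,640 parameters while the split design needs 6,208. The actual reduction reported for the PCNN depends on the full architecture and is not reproduced by this toy count.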

2.2.2. Channel Attention Model and Soft Assignment

The data from different sensors contain different degrees of degradation information. In other words, some important degradation information exists only in some of the sensors. Furthermore, features at different scales also contain different degrees of degradation information. Therefore, it is important to select key degradation information from the multiscale feature. In this paper, a channel attention model is used to obtain the attention weight from the input feature. Then, the soft assignment is used to recalibrate the attention weight of the corresponding scale. The structure of this model is shown in Figure 4.

Attention weights of the features of different scales can be obtained by using parallel processing pathways. Each processing pathway includes global information encoding and channel-wise relationship information recalibrating. The global information encoding is done by global average pooling and global max pooling, and the channel-wise relationship information recalibrating is done by fully connected networks with one hidden layer.

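The global average pooling (GAP) and the global max pooling (GMP) aggregate the global information of each channel, generating two vectors: $z^{avg}$ and $z^{max}$. Both $z^{avg}$ and $z^{max}$ contain channel-wise statistics. For the feature $F_i$ of the $i$-th scale, with length $L$, the channel-wise statistics of the $c$-th channel, $z_c^{avg}$ and $z_c^{max}$, are obtained by

$$z_c^{avg} = \frac{1}{L} \sum_{l=1}^{L} F_i(c, l), \qquad z_c^{max} = \max_{1 \le l \le L} F_i(c, l)$$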
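The two pooling operations can be sketched as follows. The feature values and the function names are hypothetical; the feature is represented as a list of channels, each a list of samples along the length direction.

```python
from typing import List

Feature = List[List[float]]  # feature[c][l]: channel c, position l


def global_avg_pool(feature: Feature) -> List[float]:
    """Mean of every channel over its full length (GAP)."""
    return [sum(channel) / len(channel) for channel in feature]


def global_max_pool(feature: Feature) -> List[float]:
    """Maximum of every channel over its full length (GMP)."""
    return [max(channel) for channel in feature]


# Hypothetical feature with two channels of length three.
feature = [[1.0, 2.0, 3.0], [0.0, -4.0, 4.0]]
z_avg = global_avg_pool(feature)
z_max = global_max_pool(feature)
```

Each pooling collapses the length direction, so both results have one entry per channel regardless of the feature length.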

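Then, the two pooled vectors, $z^{avg}$ from GAP and $z^{max}$ from GMP, are fed into fully connected networks (FCNs) with one hidden layer. The neuron number of the hidden layer in the FCN is $C/r$, where $C$ is the number of channels and $r$ is the ratio of dimensionality reduction. After that, the attention weight of the feature $F_i$ of the $i$-th scale, denoted as $A_i$, can be calculated by

$$A_i = \sigma\left(W_2 (W_1 z^{avg}) \oplus W_4 (W_3 z^{max})\right)$$

where $W_1$, $W_2$, $W_3$, and $W_4$ are the weight matrices in the FCNs, $\oplus$ denotes the element-wise summation, and $\sigma$ is the sigmoid activation function.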
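A minimal sketch of this channel attention step is given below. Several details are assumptions rather than facts from the paper: the ReLU on the hidden layer, the absence of bias terms, the pairing of two weight matrices with each pooling pathway, and all numeric values and function names.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector v (length cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def fcn(z: Vector, w_hidden: Matrix, w_out: Matrix) -> Vector:
    """One-hidden-layer network; the ReLU on the hidden layer is an assumption."""
    hidden = [max(0.0, h) for h in matvec(w_hidden, z)]
    return matvec(w_out, hidden)


def channel_attention(z_avg: Vector, z_max: Vector,
                      w1: Matrix, w2: Matrix, w3: Matrix, w4: Matrix) -> Vector:
    """Sigmoid of the element-wise sum of the two pathway outputs."""
    avg_out = fcn(z_avg, w1, w2)
    max_out = fcn(z_max, w3, w4)
    return [1.0 / (1.0 + math.exp(-(a + m))) for a, m in zip(avg_out, max_out)]


# Hypothetical sizes: 4 channels and reduction ratio 2, so 2 hidden neurons.
w_hidden = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]   # shape (2, 4)
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # shape (4, 2)
# The same matrices are reused for both pathways only to keep the example short.
weights = channel_attention([2.0, 0.0, 1.0, 1.0], [3.0, 4.0, 1.0, 2.0],
                            w_hidden, w_out, w_hidden, w_out)
```

The result has one weight per channel, each strictly between 0 and 1 because of the sigmoid.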

By doing this, the network can fuse degradation information from different sensors and produce a better attention for the high-level degradation features. Furthermore, in order to enhance the key degradation features of some scales and suppress the irrelevant ones without destroying the original channel attention vector, a soft assignment is used to adaptively recalibrate the attention weight of the corresponding scale. After doing this, the key degradation features are selected not only from different sensors but also from different scales. The soft assignment is given by

$$\hat{A}_i = \frac{\exp(A_i)}{\sum_{j=1}^{S} \exp(A_j)}$$

where $A_i$ is the attention weight of the $i$-th scale, $\hat{A}_i$ is its recalibrated value, and $S$ is the number of scales.

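Then, the multiscale high-level degradation feature with multiscale channel-wise attention weight, denoted as $Y_i$, can be obtained by

$$Y_i = F_i \odot \hat{A}_i$$

where $F_i$ is the feature of the $i$-th scale, $\hat{A}_i$ is its recalibrated attention weight, and $\odot$ is the channel-wise multiplication.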

Finally, the output of the pyramid convolution layer is obtained by concatenating the weighted features of all scales.
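The recalibration, reweighting, and concatenation steps can be sketched together. The sketch assumes that the soft assignment is a softmax taken across scales for each channel position; this reading, the data layout, the numeric values, and the function names are assumptions made for illustration.

```python
import math
from typing import List


def soft_assignment(attention: List[List[float]]) -> List[List[float]]:
    """attention[i][c] is the weight of channel c at scale i; softmax over i."""
    n_scales, n_channels = len(attention), len(attention[0])
    out = [[0.0] * n_channels for _ in range(n_scales)]
    for c in range(n_channels):
        exps = [math.exp(attention[i][c]) for i in range(n_scales)]
        total = sum(exps)
        for i in range(n_scales):
            out[i][c] = exps[i] / total
    return out


def reweight_and_concat(features, weights):
    """Scale every channel by its weight, then concatenate scales channel-wise.

    features[i][c] is the 1-D sequence of channel c at scale i.
    """
    output = []
    for scale_feature, scale_weight in zip(features, weights):
        for channel, w in zip(scale_feature, scale_weight):
            output.append([w * v for v in channel])
    return output


# Hypothetical example: two scales, two channels per scale, length-2 features.
attention = [[0.9, 0.2], [0.1, 0.2]]
features = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
recalibrated = soft_assignment(attention)
pyramid_output = reweight_and_concat(features, recalibrated)
```

After recalibration the weights of one channel position sum to 1 across scales, so a scale with a larger attention weight is emphasized at the expense of the others.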

3. Experimental Verification

3.1. Data Description

As shown in Figure 5, the life testing of the milling tool is conducted in a computer numerical control (CNC) milling machine.

The material of the workpiece is 316L stainless steel, and the milling tool is a cemented carbide insert with a TiAlN coating. During the milling process, the table feeds the workpiece from front to back along the Y-axis. As tabulated in Table 1, a total of 4 milling tools are tested, and all tests are carried out without the application of a cutting fluid. As shown in Figure 6, two types of sensors are installed in the milling machine: an accelerometer (Kistler Z292A600) and a rotary dynamometer (Pro-Micro). The sampling frequency is set to 10 kHz for the accelerometer and 2.5 kHz for the rotary dynamometer.

As shown in Figure 6, a metallographic microscope is used to measure the width of the flank wear. When the width of the flank wear exceeds 0.2 mm, the tool is considered to have reached its wear limit [1]. The monitoring data of C1 acquired during its whole operating life are shown in Figure 7.

As shown in Figure 7, some of these monitoring data show obvious degradation trends as the cutting time increases, while others do not.

3.2. Experimental Study

In this case, all of the monitoring data are used as the input of the network to verify the effectiveness of the proposed method. The size of an input sample is .

One of the main hyperparameters that may affect the prediction performance of the proposed model is the number of groups, which directly affects the dimension of the features extracted in the pyramid convolution layer. To investigate this influence, PCNNs with different numbers of groups are used to predict the RUL. The number of groups is set to 2, 4, and 8. Figure 8 shows the score values and RMSE of C4, and the corresponding training time and model parameters are given in Table 2.

It can be observed that the score value is the lowest and the RMSE is the highest when the number of groups is set to 2, which indicates relatively poor prediction performance. The accuracy of the RUL prediction results is similar for the other two settings. As the number of groups increases, the model becomes more computationally intensive, and Table 2 shows that both the training time and the number of parameters increase with the number of groups. Though a larger number of groups can extract more features at different scales, resulting in better prediction performance, the computational burden grows and the performance improvement becomes limited once the number of groups is large enough. As a trade-off between accuracy and efficiency, the number of groups is finally set to 4.

The final architecture of the network is shown in Figure 9, and the hyperparameters of the pyramid convolution layer of the PCNN are listed in Table 3.

Mean square error is used as the loss function of the network, and the Adam optimizer with a mini-batch size of 128 is used to update its weights and biases. After training for 150 epochs, the trained network is used to predict the RUL values of the testing dataset. If the predicted value is larger than the actual value, the tool may be used past its wear limit, causing low process quality or even scrapped products. Taking this into account, a score function is used in addition to the root mean square error (RMSE) to evaluate the performance of the network. The score value is computed from the actual and predicted values over all samples in the testing dataset. The higher the score value, the more accurate the RUL prediction.
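The mean square error loss and the RMSE metric mentioned above can be sketched as follows. The RUL values and function names are hypothetical. The score function is not sketched here because its formula is not recoverable from the text.

```python
import math
from typing import Sequence


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean square error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error, in the same unit as the RUL itself."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical actual and predicted RUL values for three test samples.
actual_rul = [10.0, 20.0, 30.0]
predicted_rul = [13.0, 16.0, 30.0]
loss = mse(actual_rul, predicted_rul)
error = rmse(actual_rul, predicted_rul)
```

Note that the RMSE treats over- and underestimation symmetrically, which is why the text pairs it with a separate score function.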

Figure 10 shows the RUL prediction result of C4 using the proposed method. The predicted RUL fluctuates slightly around the actual RUL, and the fluctuation becomes smaller and smaller as the cutting time increases. Furthermore, cross validation is used to verify the stability of the proposed method. Each test is repeated ten times, and the mean and standard deviation for the four testing datasets are listed in Table 4.
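Results reported in the form mean ± standard deviation over repeated runs can be summarized as in the sketch below. The run values and the function name are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which was used.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of a metric over repeated runs."""
    if len(values) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical RMSE values from three repeated runs on one testing dataset.
runs = [18.0, 19.0, 20.0]
mean_rmse, std_rmse = summarize_runs(runs)
report = f"{mean_rmse:.3f} ± {std_rmse:.3f}"
```

A small standard deviation relative to the mean indicates that repeated trainings of the same model give similar results.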

As shown in Table 4, on the one hand, both the score and the RMSE of each testing dataset have a small standard deviation, which indicates that the proposed model is stable on the same task. On the other hand, the mean values of the score and the RMSE fluctuate little across the four testing datasets, which indicates that the proposed network is stable across different tasks. In conclusion, the proposed network predicts well and is stable both on the same task and across different tasks, which means that the predicted result of the proposed method is credible.

3.3. Comparison Analysis
3.3.1. Ablation Experiments

In order to illustrate the advantage of the proposed PCNN, ablation experiments are carried out in this part. Three other prognostic networks, denoted as Network-1, Network-2, and Network-3, are employed to predict the RUL. Their architectures are similar to that of the PCNN, with the following differences: (1) Network-1 uses neither group convolution nor channel attention with soft assignment, (2) Network-2 uses only group convolution, and (3) Network-3 uses only channel attention with soft assignment. In addition, the hyperparameter settings of these three networks are the same as those of the PCNN, and the cross validation used in Section 3.2 is also used here. The performance of these four networks is listed in Table 5 and plotted in Figure 11.

It can be observed that, compared with the classic multiscale convolutional network without attention mechanisms (i.e., Network-1 [37]), the use of group convolution or channel attention with soft assignment effectively improves the prediction performance and stability of the network, resulting in a higher score value and a lower RMSE. For Network-2, the performance improvement is attributed to the use of group convolution, which reduces the risk of overfitting by reducing the number of learnable parameters. For Network-3, channel attention with soft assignment lets the network enhance the key degradation features of some sensors and scales. Besides, by systematically integrating group convolution and channel attention with soft assignment, the proposed PCNN obtains the highest score value and the lowest RMSE for each testing dataset among the four prognostic networks, which again verifies the performance of the proposed method.

3.3.2. Comparison with the State-of-the-Art Models

In this part, eight state-of-the-art models are used to estimate the RUL for comparison: two machine learning models, random forests (RF) and support vector regression (SVR) [34], and six deep learning models, the deep convolution neural network (DCNN) [35], residual dense network (RDN) [36], multiscale convolutional neural network (MSCNN) [37], convolutional long short-term memory network (CLSTM) [24], deep belief network (DBN) [38], and multiscale convolutional attention network (MSCAN) [39]. For the RF and SVR, the features listed in [34] are extracted from all the monitoring data and then fed into the corresponding model to predict the RUL. The score value and RMSE of these methods are listed in Table 6. Both the score value and the RMSE are again calculated from the half of the life.

From Table 6, it can be found that the proposed method has the highest score value and the lowest RMSE, which confirms the proposed method can predict the RUL accurately. This performance enhancement demonstrates again the advantage of the PCNN.

Besides, in order to illustrate the efficiency of the PCNN, the number of parameters and the training and testing time of three multiscale learning models are listed in Table 7. All experiments in this paper are performed on a server configured with two Intel(R) Xeon(R) Gold 6242R CPU @ 3.10 GHz processors, eight NVIDIA GeForce RTX 3090 graphics cards, and a total of 512 GB memory (RAM).

As shown in Table 7, the total model parameters of the proposed method are respectively reduced by 62.6% and 54.8% compared to the MSCNN with self-attention and the MSCNN methods. Both training time and testing time of the proposed method are greatly reduced, which means the computing cost is reduced and the efficiency is improved.

4. Conclusion

Because of its strong learning capability, the CNN is widely used in degradation feature extraction, especially the multiscale CNN, which has a stronger representation learning ability. Because of its parallel convolutional pathways, however, the traditional MSCNN has a large number of parameters, which means a higher computing cost. In addition, it does not consider the contribution of degradation features at different scales, which degrades RUL prediction performance. To address these issues, a pyramid CNN (PCNN) is proposed in this paper for RUL prediction of the milling tool. In this network, group convolution is used to replace the parallel convolutional pathways to extract multiscale features without a large number of additional parameters. The channel attention with soft assignment selects the key degradation features not only from different sensors but also from different scales. The proposed method was experimentally validated by the milling tool wear experiment. Some related methods and state-of-the-art models, including machine learning methods and deep learning methods, are analyzed for comparison with the proposed method. The results indicate that the proposed method is able to predict the RUL accurately.

Although the proposed method achieves a good RUL prediction result, there are still a few shortcomings in its application. The proposed method assumes that the working condition of the testing data is the same as that of the training data, which limits its application in practical engineering because the working condition of the machining process is dynamic. Moreover, limited labeled training samples prevent us from training a model for every working condition. To address this issue, a promising direction is to introduce transfer learning or meta learning into the model, which can make the model achieve good performance with small samples. Furthermore, the model can be combined with adaptive optimization algorithms to determine its hyperparameters automatically, which may further improve its performance.

Data Availability

The test data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Key Research and Development Projects in Guangdong Province, grant number 2021B0101220005; Basic Research Program of Guangzhou, grant number 202102080349; industrial internet platform innovation development project of MIIT—New Potential Failure Mode and Effects Analysis System Based on Industrial Internet Platform, grant number TC210804F.