Abstract

Remaining useful life (RUL) prediction is necessary for guaranteeing the safe operation of machinery. Among deep learning architectures, the convolutional neural network (CNN) has achieved good results in RUL prediction because of its strong representation-learning ability. Features from different receptive fields, extracted by convolution kernels of different sizes, can provide complete information for prognosis. However, the single-size convolution kernel of a traditional CNN struggles to learn comprehensive information from complex signals. Besides, a conventional CNN has limited ability to learn local and global features simultaneously. Thus, a multiscale convolutional neural network (MS-CNN) is introduced to overcome these problems. Convolution filters with different dilation rates are integrated to form a dilated convolution block, which can learn features in different receptive fields. Then, several stacked integrated dilated convolution blocks at different depths are concatenated to extract local and global features. The effectiveness of the proposed method is verified on a bearing dataset from the PRONOSTIA platform. The results show that the proposed MS-CNN has higher prediction accuracy than the other deep learning-based RUL methods considered.

1. Introduction

Prognostics and health management (PHM) is crucial for mechanical systems. RUL prediction is an important PHM task in modern industry. Maintenance costs can be reduced if the remaining useful life of machinery is known in advance. Bearings are critical parts of mechanical systems [1, 2], and their failure may lead to severe accidents. Thus, bearing RUL prediction has drawn increasing attention in PHM research.

Bearing RUL prediction methods can be roughly divided into two types: model-based approaches and data-driven approaches [3]. With the development of modern industrial technology, an enormous amount of condition monitoring data is recorded, and data-driven methods such as deep learning (DL) have powerful data-processing ability when faced with such massive data [4]. Since DL-based approaches can extract features from the input data without much prior knowledge, they have become increasingly popular in RUL prediction and fault diagnosis [5]. A prediction framework built from deep autoencoders (AE) is proposed in Ref. [6], where the AE is used to retain sufficient information when compressing features. Shen et al. [7] proposed a contractive autoencoder-based fault diagnosis method for rotating machinery, in which robust features are learned automatically by the contractive autoencoder. A deep long short-term memory (DLSTM) network was proposed in Ref. [8], in which multisensor condition monitoring data are fused to obtain more useful information for accurate RUL prediction. Long short-term memory (LSTM) is also applied to discover potential patterns in Ref. [9]. Xiang et al. [10] proposed a novel LSTM framework in which attention-guided ordered neurons are applied to achieve accurate gear RUL prediction. A double-CNN architecture was implemented in Ref. [11]: the fault occurrence time (FOT) is determined by the first CNN, and RUL prediction is accomplished by the second CNN. Guo et al. [12] proposed a health indicator (HI) construction method to monitor the health state of machinery, in which a convolutional neural network is designed to learn features and construct a mapping between the features and the HI. A recurrent convolutional neural network was designed in Ref. [13], in which the temporal dependencies of different degradation states are captured by the recurrent convolutional layers.

Among the variety of DL techniques, CNN has gained more attention because of two outstanding characteristics, i.e., spatially shared weights and local perception. CNN was first widely used in image recognition, where it achieved tremendous success. Nowadays, it is also popular in fault diagnosis [14] because it can accomplish feature extraction and fault classification automatically. However, traditional CNN still has two shortcomings. (1) The scale of the convolution kernel directly affects the performance of the network. Kernels with a bigger size extract features in a bigger receptive field, whereas kernels with a smaller size extract features in a smaller receptive field. Using kernels of a single size can therefore leave the extracted information incomplete. (2) Traditional CNN is not strong enough at extracting local and global features simultaneously. In a conventional CNN, only the feature maps of the final layer before flattening are treated as the final features, while features extracted in earlier low-level layers are omitted. Although the global features extracted by the high-level layers are more invariant than the features extracted by the low-level layers, the detailed local features extracted by the low-level layers also contribute to prognosis and classification [15, 16].

To overcome the aforementioned problems and learn more representative features, various multiscale CNN models have been proposed and applied in machinery fault diagnosis and RUL prediction [17]. In Ref. [18], the final convolutional layer and the final pooling layer are merged to construct a multiscale layer, so that global features from the final pooling layer and local features from the final convolutional layer are both utilized for classification. The results show that the network with the proposed multiscale layer improves the recognition accuracy. Inspired by the inception model, Chang et al. [19] proposed a concurrent convolution neural network method to enhance wind turbine fault diagnosis accuracy. Some multiscale CNN methods have also been applied in RUL prediction. A multiscale deep convolutional neural network (MS-DCNN) with three MS-blocks was applied in Ref. [20]. To extract features in different receptive fields, three distinct sizes of convolution kernels are applied in parallel in each layer. Kernels with a bigger size extract features in a bigger receptive field, but they also introduce more kernel weights, and a network with too many weights is generally hard to train. In Ref. [21], different strides in the convolution operations are used to obtain features at different scales. A multiscale convolutional neural network that merges the final convolutional layer and the final pooling layer was designed to extract local and global features for bearing RUL prediction [15]. However, the detailed information learned by the low-level layers is still lost in this architecture.

In this paper, a novel MS-CNN method for bearing RUL prediction is introduced. An integrated dilated convolution block is constructed to extract features in different receptive fields from the complex signal. Then, several stacked integrated dilated convolution blocks are concatenated to construct a multiscale feature extractor. The two advantages of the proposed MS-CNN are summarized as follows:
(1) An integrated dilated convolution block is constructed to extract features in different receptive fields from complex signals without increasing the number of convolution kernel weights.
(2) A multiscale feature extractor is constructed to avoid the loss of information at different levels. It makes full use of the global features obtained from the higher layers and the local features obtained from the lower layers.

The rest of this paper is organized as follows. In Section 2, the relevant theoretical background is introduced, including convolutional neural networks and dilated convolution. The proposed MS-CNN bearing RUL prediction method is introduced in Section 3. In Section 4, a public rolling bearing dataset collected from the PRONOSTIA platform is used to verify the superiority of the proposed method. Finally, the conclusions are summarized in Section 5.

2. Theoretical Background

2.1. Convolutional Neural Network

CNN is a feed-forward neural network. The traditional CNN is mainly composed of three kinds of layers: the convolutional layers, the pooling layers, and the fully connected layers [22].

In the convolutional layers, features are learned by several convolution kernels. The convolution operation is linear, so a nonlinear activation function is applied to its output to introduce nonlinearity between adjacent layers. The output of the convolutional layer can be written as

$$x_j^{l} = f\left(\sum_{i} x_i^{l-1} * w_{ij}^{l} + b_j^{l}\right),$$

where $*$ denotes the convolution operation, $x_j^{l}$ is the $j$th output of layer $l$, $x_i^{l-1}$ is the $i$th input from the previous layer, $w_{ij}^{l}$ is the weight vector of the convolution kernel in the $l$th layer, $b_j^{l}$ is the bias of the $j$th output, and $f(\cdot)$ is the nonlinear activation function.
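
As a rough illustration of the convolutional layer described above, the following pure-Python sketch computes one output feature map from several input channels. The function names, the choice of ReLU as the activation, and the unpadded boundary handling are assumptions made for illustration and are not taken from the paper. As in most deep learning libraries, the kernel is applied without flipping (cross-correlation form).

```python
def relu(v):
    """ReLU activation, used here as an assumed example of f."""
    return max(0.0, v)


def conv1d_valid(x, w):
    """Slide kernel w over signal x without padding (no kernel flipping)."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]


def conv_layer_output(inputs, kernels, bias):
    """One output feature map: f(sum over input channels of x_i * w_i, plus b).

    inputs  -- list of input channels (each a list of floats)
    kernels -- one kernel per input channel
    bias    -- scalar bias of this output map
    """
    per_channel = [conv1d_valid(x, w) for x, w in zip(inputs, kernels)]
    summed = [sum(vals) for vals in zip(*per_channel)]
    return [relu(s + bias) for s in summed]


# Hypothetical two-channel input and kernels.
inputs = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
kernels = [[-1.0, 1.0], [0.5, 0.5]]
feature_map = conv_layer_output(inputs, kernels, bias=0.1)
```

Each additional output feature map would use its own set of kernels and its own bias.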

In the pooling layer, the output of the convolutional layers is compressed to improve computational efficiency [14], which is a form of down-sampling. Common pooling functions include L2-norm pooling, max pooling, and average pooling. The pooling operation can be written as

$$x_j^{l+1} = \operatorname{pool}\left(x_j^{l}, p, s\right),$$

where $x_j^{l}$ is the $j$th input feature map in layer $l$, $x_j^{l+1}$ is the $j$th output in layer $l+1$, $p$ is the pooling size, $s$ denotes the stride, and $\operatorname{pool}(\cdot)$ denotes the pooling function.
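
The pooling operation can be sketched in a few lines of pure Python. The helper names `pool1d` and `mean` and the example values are hypothetical; the sketch only shows how the pooling size and stride determine the windows that are reduced.

```python
def pool1d(x, size, stride, func=max):
    """Apply a pooling function over windows of `size`, stepping by `stride`."""
    return [func(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


def mean(window):
    """Average of one pooling window."""
    return sum(window) / len(window)


feature_map = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]
max_pooled = pool1d(feature_map, size=2, stride=2)             # max pooling
avg_pooled = pool1d(feature_map, size=2, stride=2, func=mean)  # average pooling
```

With a pooling size and stride of 2, the feature map is halved in length, which is the down-sampling effect mentioned above.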

After several convolutional and pooling operations, the feature maps of the previous layer are flattened and then sent to the fully connected layer. RUL prediction is a regression task. Thus, the fully connected layers can be regarded as the regression layer, which builds a connection from the feature maps learned by previous layers to the final result.

2.2. Dilated Convolution

Dilated convolution (atrous convolution) can enlarge the receptive field without increasing network parameters [23]. Recently, dilated convolution has been implemented in many areas such as bearing fault diagnosis [24], semantic image segmentation [25], and sound classification [26].

Compared with the ordinary convolution operation, dilated convolution adds a hyperparameter called the dilation rate. As shown in Figure 1, different dilation rates can be seen as inserting holes of different sizes between the convolution kernel parameters. When applied in a one-dimensional CNN, dilated convolution can be calculated as

$$y_i = \sum_{k=1}^{K} x_{i + r \cdot k}\, w_k,$$

where $y_i$ represents the $i$th element of the convolution output, $x_i$ is the $i$th element of the input, $w_k$ are the weights of the filter, $K$ is the length of the filter, and $r$ is the dilation rate. Dilated convolution with $r = 1$ is equal to ordinary convolution. When $r = 2$, one zero is inserted between adjacent convolution weights.
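
A minimal sketch of one-dimensional dilated convolution is given below, assuming zero-based indexing and no padding; the function name and example values are hypothetical. Each output element is a weighted sum of input elements spaced `r` samples apart.

```python
def dilated_conv1d(x, w, r=1):
    """y[i] = sum_k x[i + r*k] * w[k], evaluated only where the kernel fits."""
    span = r * (len(w) - 1)            # distance covered by the dilated kernel
    return [sum(x[i + r * k] * w[k] for k in range(len(w)))
            for i in range(len(x) - span)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
w = [1.0, 0.0, -1.0]

ordinary = dilated_conv1d(x, w, r=1)   # taps at i, i+1, i+2
dilated = dilated_conv1d(x, w, r=2)    # taps at i, i+2, i+4

# r = 2 is equivalent to an ordinary convolution whose kernel has one zero
# inserted between adjacent weights.
zero_inserted = dilated_conv1d(x, [1.0, 0.0, 0.0, 0.0, -1.0], r=1)
```

The dilated kernel still has three weights, yet it covers five input samples, which is how the receptive field grows without adding parameters.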

3. Proposed Bearing RUL Method

In this section, the framework of the proposed multiscale convolutional neural network is introduced in detail. Machinery vibration signals are sent to the network directly as the input data. The procedure of the proposed MS-CNN-based RUL prediction method is then described. Four steps are implemented to obtain the bearing RUL prediction: FOT determination, data preprocessing, RUL prediction, and smoothing. The proposed MS-CNN establishes a relationship between the monitoring signal and the remaining useful life without much prior knowledge, so it can be readily generalized to industrial applications.

3.1. Proposed MS-CNN Architecture

The framework of the multiscale convolutional neural network is shown in Figure 2. The MS-CNN consists of two modules, a multiscale feature extractor and the regression layer. The multiscale feature extractor is constructed to improve the network's learning ability. Features in different receptive fields are taken into consideration by an integrated dilated convolution block. Then, the integrated dilated convolution blocks in different levels are all concatenated to establish the proposed multiscale feature extractor. The regression layer is designed to construct the mapping relationship between features and corresponding real lifetime.

3.1.1. The Integrated Dilated Convolution Block

In a traditional CNN, features are extracted by convolution and pooling operations. The scale of the convolution kernel affects the learning ability of the network, and using a single-sized convolution kernel in each convolutional layer may leave the information learned by that layer incomplete. Inspired by the inception network, an integrated dilated convolution block is introduced, which is shown in Figure 3.

In this paper, convolution kernels with different dilation rates are concatenated to learn multiscale features in different receptive fields. Compared with convolution using kernels of different sizes, the integrated dilated convolution block can learn information at different scales without increasing the number of network parameters. Too large a dilation rate may cause the loss of detailed information, and too many dilation rates integrated into one block may lead to redundant information. Considering the characteristics of the vibration signal, and inspired by inception networks, the integrated dilated convolution block uses three dilation rates.
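
The trade-off between receptive field and parameter count can be checked with a short calculation. A single dilated kernel of length K with dilation rate r spans r(K - 1) + 1 input samples while keeping K weights. The kernel size and dilation rates below are hypothetical values chosen for illustration, not the settings of the paper.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by a single dilated 1-D kernel."""
    return dilation * (kernel_size - 1) + 1


kernel_size = 3          # hypothetical kernel length
for r in (1, 2, 4):      # hypothetical dilation rates
    rf = receptive_field(kernel_size, r)
    print(f"r={r}: weights={kernel_size}, receptive field={rf}")
```

Covering the same span with an ordinary kernel would require as many weights as the receptive field is wide.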

When the input data are sent to the integrated dilated convolution block, three kinds of convolution operations are performed on the input data synchronously. The features extracted by the $n$ filters are then concatenated into a feature vector, which can be written as

$$F^{l} = \left[f_1^{l}, f_2^{l}, \ldots, f_n^{l}\right],$$

where $f_i^{l}$ denotes the feature map learned by the $i$th convolution kernel in the $l$th layer.
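
A simplified sketch of such a block is shown below. It assumes one filter per branch, zero padding so that all branches keep the input length, and hypothetical weights and dilation rates; the actual block uses several filters per branch with the settings of Table 2.

```python
def dilated_conv1d_same(x, w, r):
    """Dilated 1-D convolution with zero padding so the output keeps len(x)."""
    span = r * (len(w) - 1)
    left = span // 2
    padded = [0.0] * left + list(x) + [0.0] * (span - left)
    return [sum(padded[i + r * k] * w[k] for k in range(len(w)))
            for i in range(len(x))]


def integrated_dilated_block(x, kernels, rates):
    """Run one branch per dilation rate on the same input and stack the
    resulting feature maps (concatenation along the channel axis)."""
    return [dilated_conv1d_same(x, w, r) for w, r in zip(kernels, rates)]


signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
kernels = [[1.0, 1.0, 1.0]] * 3          # hypothetical weights
rates = (1, 2, 3)                        # hypothetical dilation rates
features = integrated_dilated_block(signal, kernels, rates)
```

Because every branch sees the same input and returns a feature map of the same length, the branch outputs can be stacked directly.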

3.1.2. The Proposed MS-CNN

In the traditional CNN structure, the feature maps of the last pooling layer are treated as final features for classification or regression. Local features extracted by previous layers are usually discarded. Although the global features extracted by high-level layers are much more representative and robust than those extracted by low-level layers, local features contain some detailed information and are useful for prognosis.

In this paper, the outputs of different integrated dilated convolution blocks are concatenated to extract local and global features synchronously. The proposed MS-CNN is shown in Figure 2. The concatenated feature maps contain not only the invariant and stable global features but also the detailed local information. The concatenated feature maps can be expressed as

$$F_{\mathrm{ms}} = F^{1} \oplus F^{2} \oplus \cdots \oplus F^{m},$$

where $\oplus$ denotes the concatenation operation and $F^{i}$ indicates the feature maps of the $i$th integrated dilated convolution block.
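
One way to picture the concatenation is to flatten the output of each block and join the resulting vectors. Flattening before concatenation is an assumption made here for illustration, and the block outputs below are hypothetical values.

```python
def flatten(feature_maps):
    """Flatten a list of channels (each a list of floats) into one vector."""
    return [v for channel in feature_maps for v in channel]


def concat_multiscale(block_outputs):
    """Concatenate the flattened outputs of every block, shallow to deep."""
    merged = []
    for feature_maps in block_outputs:
        merged.extend(flatten(feature_maps))
    return merged


# Hypothetical outputs of three blocks; deeper blocks are shorter after pooling.
block1 = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
block2 = [[1.0, 1.1], [1.2, 1.3]]
block3 = [[2.0], [2.1]]
fused = concat_multiscale([block1, block2, block3])
```

The fused vector keeps the shallow, detailed features next to the deep, more abstract ones, so the regression layer can draw on both.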

To keep the RUL prediction in the range from 0 to 1, the sigmoid activation function is applied in the last fully connected layer. The sigmoid activation function can be expressed as

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$

where $z$ represents the input of the final fully connected layer.

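The mean squared error (MSE) loss function is utilized to update the parameters of the whole network, which is expressed as

$$\mathrm{Loss} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2,$$

where $N$ denotes the number of training samples, $\hat{y}_i$ denotes the predicted RUL value of the input $x_i$, and $y_i$ is the true label.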
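
The output activation and the loss can be written out directly. The sketch below uses only the standard library; the pre-activation values and labels are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function; maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mse_loss(predicted, target):
    """Mean squared error between predicted and true RUL labels."""
    n = len(predicted)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / n


logits = [2.0, 0.0, -2.0]                 # hypothetical pre-activation outputs
predicted = [sigmoid(z) for z in logits]
target = [0.9, 0.5, 0.1]                  # hypothetical normalized RUL labels
loss = mse_loss(predicted, target)
```

Since the sigmoid output always lies strictly between 0 and 1, the prediction stays on the same scale as the normalized labels.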

3.2. Framework of the Proposed Bearing RUL Method

The flow chart of the proposed method is displayed in Figure 4. First, signals of the machinery are collected by sensors, and the fault occurrence time (FOT) is determined from the collected signals. Then, the data from the FOT to failure are divided into two categories: training samples and testing samples. The training samples are used to train the network, and the performance of the MS-CNN is evaluated on the testing samples. Finally, a smoothing operation is implemented to obtain a continuous RUL result.

3.2.1. Step 1: FOT Determination

As shown in Figure 5, the vibration signals recorded by the sensors usually undergo a stable period in the early stage, which means that the monitored component is in a healthy stage. Data in the healthy stage contain information that is unrelated to RUL prediction. Thus, it is necessary to determine the FOT before RUL prediction. In this paper, kurtosis is applied to find the FOT. Kurtosis is defined as

$$K = \frac{\frac{1}{N} \sum_{i=1}^{N} \left(x_i - \mu\right)^4}{\sigma^4},$$

where $\mu$ is the mean of the signal, $\sigma$ is the standard deviation of the signal, and $N$ is the number of signal data points.

Kurtosis is very sensitive to the amplitude value. The degradation can be well reflected by kurtosis. Thus, it is usually very helpful for detecting incipient faults [27].
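
The sensitivity of kurtosis to impulsive content can be illustrated with a small sketch. It assumes the population form of kurtosis (fourth central moment divided by the squared variance, without subtracting 3), and the test signals are synthetic.

```python
import math


def kurtosis(x):
    """Kurtosis: fourth central moment divided by sigma**4
    (population standard deviation, no excess-kurtosis offset)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    fourth = sum((v - mu) ** 4 for v in x) / n
    return fourth / var ** 2


# Synthetic signals: a clean sinusoid and the same sinusoid with one impact.
smooth = [math.sin(2 * math.pi * i / 50) for i in range(500)]
impulsive = list(smooth)
impulsive[100] += 8.0   # a single impact, as an early defect might produce

k_smooth = kurtosis(smooth)
k_impulsive = kurtosis(impulsive)
```

A single large impact raises the kurtosis well above that of the clean sinusoid, which is why the statistic is useful for spotting incipient faults.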

The Laida criterion (also known as the $3\sigma$ rule) is applied to detect the FOT. The signal in the early period is assumed to be in the healthy stage. The mean $\mu_k$ and the standard deviation $\sigma_k$ of the kurtosis in the healthy stage are calculated, and the interval $\mu_k \pm 3\sigma_k$ is then used as the FOT indicator.

Although kurtosis is an effective health index for detecting early fault points, it is not stable enough because it is affected by noise and outliers. Thus, local linear regression is applied to the kurtosis calculated from the vibration signal, which removes FOT misjudgments caused by outliers. After smoothing by local linear regression, the first time at which the smoothed kurtosis falls outside the $3\sigma$ interval is regarded as the FOT:

$$\mathrm{FOT} = \min\left\{\, t_i : \left|k_i - \mu_k\right| > 3\sigma_k \,\right\},$$

where $k_i$ denotes the kurtosis of the $i$th data sequence, and $\mu_k$ and $\sigma_k$ are the mean and the standard deviation of the kurtosis in the healthy stage. The FOT of bearing1_1 determined by this method is shown in Figure 5. The amplitude of the vibration signal remains stable before the FOT, and degradation starts after the FOT, which indicates that the FOT determination method is effective. The FOT is determined only in the training process. In online testing, the whole-lifetime vibration signal is not available, so the FOT cannot be determined.
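
The FOT detection steps can be sketched as follows. The sketch makes several assumptions that are not specified in the text: the local linear regression uses a uniform sliding window of fixed half-width, the healthy-stage statistics are computed from the unsmoothed kurtosis of the first `n_healthy` samples, and the kurtosis sequence is synthetic. All function names are hypothetical.

```python
def local_linear_smooth(y, half_window=2):
    """Smooth y by fitting a least-squares line to each sliding window and
    evaluating it at the window centre (uniform weights, an assumption)."""
    out = []
    for c in range(len(y)):
        lo, hi = max(0, c - half_window), min(len(y), c + half_window + 1)
        xs, ys = range(lo, hi), y[lo:hi]
        n = hi - lo
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (v - my) for x, v in zip(xs, ys))
        slope = sxy / sxx if sxx else 0.0
        out.append(my + slope * (c - mx))
    return out


def detect_fot(kurt, n_healthy, half_window=2):
    """Return the first index after the healthy stage whose smoothed kurtosis
    leaves mu +/- 3*sigma, or None if it never does."""
    base = kurt[:n_healthy]
    mu = sum(base) / n_healthy
    sigma = (sum((v - mu) ** 2 for v in base) / n_healthy) ** 0.5
    smoothed = local_linear_smooth(kurt, half_window)
    for t in range(n_healthy, len(smoothed)):
        if abs(smoothed[t] - mu) > 3 * sigma:
            return t
    return None


# Synthetic kurtosis: a healthy stage, one isolated outlier, then degradation.
healthy = [3.0, 3.1, 2.9, 3.0, 3.1, 2.9, 3.0, 3.1, 2.9, 3.0]
later = [3.0, 3.5, 3.0, 3.0, 3.0, 3.6, 4.2, 5.0, 6.0]
kurt = healthy + later
fot = detect_fot(kurt, n_healthy=len(healthy))
```

In this synthetic example, the isolated outlier at index 11 would exceed the threshold on the unsmoothed sequence, but after smoothing it no longer triggers a detection, whereas the sustained rise at the end does.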

3.2.2. Step 2: Data Preprocessing

Normalizing the raw data is a common and effective way to speed up training convergence. First, the selected sensor data are divided into time segments of length $L$. Each time segment is a sample that can be represented as $X = [x_1, x_2, \ldots, x_L]$. Max-min normalization is implemented to keep the data within [0, 1]:

$$\tilde{x}_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_{\max}$ is the maximum value of the sample $X$, $x_{\min}$ is the minimum value of the sample $X$, and $\tilde{x}_i$ represents the value of $x_i$ after max-min normalization.
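
The segmentation and normalization steps can be sketched as below. Non-overlapping segments and the handling of a constant sample are assumptions; the signal values and the segment length are hypothetical.

```python
def segment(signal, length):
    """Split a signal into consecutive, non-overlapping samples of `length`."""
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, length)]


def min_max_normalize(sample):
    """Scale one sample into [0, 1] using its own minimum and maximum."""
    lo, hi = min(sample), max(sample)
    if hi == lo:                      # constant sample: avoid division by zero
        return [0.0 for _ in sample]
    return [(v - lo) / (hi - lo) for v in sample]


signal = [0.2, -0.4, 1.0, 0.6, -1.0, 0.0, 0.5, 2.0]
samples = [min_max_normalize(s) for s in segment(signal, length=4)]
```

Each sample is normalized with its own extrema, so every sample spans the full [0, 1] range regardless of its original amplitude.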

The label of the training data is constructed as the reliability in the range of [0, 1]. FOT is regarded as the start of degradation. The label can be described as the linearly degrading process from FOT to complete failure, as shown in Figure 6. Then, the samples and the corresponding labels are treated as the input and the output of the proposed MS-CNN.
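
A linear label of this kind can be generated as in the following sketch. It assumes that the label equals 1 at the FOT sample and 0 at the last sample before failure, and that samples before the FOT are excluded; the function name and indices are hypothetical.

```python
def rul_labels(n_samples, fot_index):
    """Linear label from 1 at the FOT sample down to 0 at the last sample.

    Samples before the FOT are excluded, mirroring the statement that only
    data from FOT to failure are used.
    """
    n = n_samples - fot_index
    if n < 2:
        raise ValueError("need at least two samples between FOT and failure")
    return [1.0 - i / (n - 1) for i in range(n)]


# Hypothetical bearing with 11 samples whose FOT is at sample index 6.
labels = rul_labels(n_samples=11, fot_index=6)
```

The labels decrease in equal steps, so every sample between FOT and failure receives a target proportional to the remaining fraction of the degradation stage.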

3.2.3. Step 3: RUL Prediction Based on the Proposed MS-CNN

The proposed MS-CNN-based method has two processes: offline training and online testing. In the training period, the training dataset and the corresponding labels are used to train the proposed MS-CNN. After data preparation, each training segment passes through the multiscale feature extractor, which learns more representative and comprehensive features. The output of the multiscale feature extractor is sent to the regression layer to construct the relationship between features and RUL. The MSE loss function is applied in the network. A small number of training samples can lead to overfitting, in which case the network performs poorly on the testing dataset. To avoid overfitting, dropout is adopted in the fully connected layers: some hidden neurons are randomly set to zero during training, and dropout is turned off in the testing process. The Adam algorithm is applied to update the parameters of the network. Unlike plain stochastic gradient descent, which uses a single fixed learning rate, Adam adapts the step size of each parameter from running estimates of the first and second moments of the gradient. In online testing, when the data at a given moment are sent to the network, the RUL at that moment is predicted by the trained network.
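
For readers unfamiliar with Adam, a single update step can be sketched in pure Python as follows. The default hyperparameters are those commonly used for Adam; the learning rate, the toy loss, and the function name are assumptions for illustration and do not reflect the training settings of the paper.

```python
import math


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place and return the updated parameters."""
    state["t"] += 1
    t = state["t"]
    for i, g in enumerate(grads):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)   # bias-corrected first moment
        v_hat = state["v"][i] / (1 - beta2 ** t)   # bias-corrected second moment
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Minimise the toy loss (p - 3)^2; its gradient is 2 * (p - 3).
params = [0.0]
state = {"t": 0, "m": [0.0], "v": [0.0]}
for _ in range(5000):
    grads = [2.0 * (params[0] - 3.0)]
    adam_step(params, grads, state, lr=0.01)
```

Because the update divides by the running root-mean-square of the gradient, the size of each step is governed mainly by `lr` rather than by the raw gradient magnitude.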

To assess the predicted result quantitatively, two error indicators are applied in this paper, i.e., the mean absolute error (MAE) and the root mean squared error (RMSE) [28]:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|E_i - L_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(E_i - L_i\right)^2},$$

where $N$ denotes the number of samples, $E_i$ is the RUL of the $i$th sample predicted by the proposed MS-CNN, and $L_i$ is the corresponding actual RUL.
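
Both indicators are straightforward to compute. The sketch below uses hypothetical predicted and actual RUL values.

```python
import math


def mae(estimated, actual):
    """Mean absolute error between estimated and actual RUL."""
    return sum(abs(e - a) for e, a in zip(estimated, actual)) / len(actual)


def rmse(estimated, actual):
    """Root mean squared error between estimated and actual RUL."""
    squared = sum((e - a) ** 2 for e, a in zip(estimated, actual))
    return math.sqrt(squared / len(actual))


estimated = [0.9, 0.7, 0.4, 0.1]   # hypothetical predicted RUL values
actual = [1.0, 0.7, 0.3, 0.0]      # hypothetical actual RUL values
mae_value = mae(estimated, actual)
rmse_value = rmse(estimated, actual)
```

RMSE is never smaller than MAE, and the gap between the two widens when a few samples have much larger errors than the rest.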

3.2.4. Step 4: Smoothing

The RUL results predicted by the network are discrete and fluctuate. However, in real industrial applications, the actual RUL of a bearing is continuous and decreases as time goes by. Thus, a smoothing operation is applied to the predicted RUL.

At time $t_n$, the RUL predicted at this moment is recorded as RUL(n). The RUL values at the five preceding time points, such as RUL(n-1), ..., RUL(n-5), undergo a moving average filter. The smoothing operation can be implemented as equation (12). The smoothed RUL at time $t_n$ is regarded as the final predicted result. The smoothing operation brings the predicted result closer to the actual condition.
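
A moving average of this kind can be sketched as follows. The sketch assumes a causal window that averages the current prediction with up to five earlier ones and uses a shorter window at the start of the sequence; the exact window used in the paper may differ, and the prediction values are hypothetical.

```python
def smooth_rul(predictions, n_previous=5):
    """Causal moving average over each prediction and up to n_previous
    earlier ones (fewer at the start of the sequence)."""
    smoothed = []
    for n in range(len(predictions)):
        window = predictions[max(0, n - n_previous):n + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


raw = [1.0, 0.8, 0.95, 0.7, 0.85, 0.6, 0.75, 0.5]  # hypothetical network outputs
smoothed = smooth_rul(raw)
```

Only past and present predictions enter each average, so the filter can be applied online as new samples arrive.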

4. Experiment

In this section, experiments are carried out to verify the superiority of the proposed MS-CNN method. The experimental dataset was acquired from the PRONOSTIA platform and is introduced in Section 4.1. In Section 4.2, the parameters of the proposed MS-CNN and the results of the proposed method on the experimental data are presented. Different DL-based bearing RUL prediction methods are compared in Section 4.3. A PC with an Intel Core i7-5557U CPU and 4 GB of RAM is used for all experiments. All experiments are repeated ten times to reduce random errors.

4.1. Data Description

The accelerated degradation tests of the rolling bearings were performed on the PRONOSTIA platform (shown in Figure 7). The platform is mainly divided into three parts: the rotating part, the degradation part, and the measurement part. The degradation part accelerates the degradation of the bearings. Two acceleration sensors were installed on the vertical and horizontal axes to measure the vibration signals of the bearings. The amplitude of the vertical vibration signals is lower than that of the horizontal ones, and the degradation trend is better captured by the sensor placed on the horizontal axis. Therefore, only the horizontal vibration signals are used in this paper. The sampling frequency in the experiment is 25.6 kHz [29]. Each sample contains the data collected during 0.1 s, and one sample is recorded every 10 s. Seventeen run-to-failure bearing datasets were acquired. The first two bearings of each working condition were used for training, and the rest of the bearings were used as the testing data. All the datasets are shown in Table 1.

4.2. Proposed Method on Experimental Data

The raw vibration signal of each run-to-failure bearing is divided into time-series segments. Each segment is a sample and contains 2560 data points. After the determination of the FOT, each sample is normalized by the max-min normalization method described in Section 3.2.2. The training samples with their corresponding labels were used to train the MS-CNN.
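
The sample length follows from the acquisition settings stated in Section 4.1, as the short calculation below shows. The variable and function names are hypothetical.

```python
sampling_rate_hz = 25_600      # 25.6 kHz sampling frequency
record_duration_s = 0.1        # each record covers 0.1 s
record_interval_s = 10         # one record every 10 s

points_per_sample = round(sampling_rate_hz * record_duration_s)


def elapsed_seconds(sample_index):
    """Time of a sample relative to the first record."""
    return sample_index * record_interval_s
```

With these settings, consecutive samples are 10 s apart, so the sample index maps directly to elapsed operating time.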

The parameters of the MS-CNN are displayed in Table 2. KS represents the size of the convolution kernel, r represents the dilation rate, n is the number of filters, s represents the stride, and N represents the number of neurons in the fully connected layer. Dropout with a rate of 0.2 is adopted to avoid overfitting. MSE is employed as the loss function. The Adam optimization algorithm is applied to update the parameters of the model in the back-propagation process.

The training loss curve is shown in Figure 8. The training loss declines rapidly from the beginning to the 20th epoch, declines slowly between the 20th and the 80th epoch, and is stable from the 80th to the 100th epoch. Thus, the number of training epochs is set to 80.

The performance of the network is related to its depth, so the effect of the number of integrated dilated convolution blocks is discussed in this section. As shown in Figure 9, the performance of the network at first improves as the number of integrated dilated convolution blocks increases. When the number of blocks exceeds three, the MAE of the result increases. Too many integrated dilated convolution blocks may lead to overfitting: the network performs well on the training data but poorly on the testing data. In addition, the training time grows as the network becomes deeper. Thus, in this study, the number of integrated dilated convolution blocks is set to 3.

The raw vibration signal of bearing1_3 is shown in Figure 10. Bearing1_3 fails with a gradual degradation trend. The RUL prediction result of bearing1_3 is shown in Figure 11. The yellow line in the figure is the raw estimation result without the smoothing operation, which is discontinuous and fluctuates over a large range. The red line is the estimation result with the smoothing operation. The smoothed estimation gives a steady and continuous RUL result for the testing data, which is consistent with the actual RUL. The raw vibration signal of bearing1_7 is shown in Figure 12. Bearing1_7 shows a sudden failure behavior. The RUL prediction result of bearing1_7 is shown in Figure 13. Although the estimation result is not completely in line with the actual RUL in the early stage, the degradation is effectively reflected in the near-failure stage. Bearings with different failure behaviors were thus used to demonstrate the superiority of the proposed MS-CNN method.

4.3. Comparison Results

Several commonly used deep learning models are used for comparison, including a deep neural network (DNN), a convolutional neural network (CNN), long short-term memory (LSTM), and the multiscale convolutional neural network that merges the final convolutional layer with the final pooling layer [15]. The parameters of all the comparative methods were tuned to relatively optimal values.

4.3.1. DNN

Six hidden layers were used in the deep neural network so that it has the same depth as the proposed MS-CNN. The numbers of neurons in the layers were 300, 200, 200, 100, 100, and 1. Dropout with a rate of 0.2 was applied in each layer.

4.3.2. CNN

A traditional CNN was compared with the proposed MS-CNN. The CNN has three convolutional layers and three pooling layers. The kernel sizes of the three convolutional layers were 16 × 1, 8 × 1, and 4 × 1, and the numbers of kernels were 4, 8, and 16. Three fully connected layers were used, the same as in the proposed MS-CNN.

4.3.3. LSTM

LSTM is a variant of the recurrent neural network. The LSTM framework was designed with three LSTM layers and three fully connected layers. The numbers of neurons in the LSTM layers were 128, 64, and 64. The three fully connected layers were the same as in the proposed method.

4.3.4. MS-CNN

The third convolutional layer and the third pooling layer were concatenated to form a fused layer as proposed in Ref. [15]. The kernel sizes and the numbers of kernels are the same as in the CNN structure. After the fused layer, three fully connected layers were constructed, the same as in the proposed MS-CNN.

The prognostic results of the different methods for the testing bearing1_3 are shown in Figure 14. The degradation is reflected by the proposed method, whose predicted RUL for bearing1_3 is closer to the actual RUL than those of the other approaches. The DNN method shows the worst result among the compared methods. The prognostic results of the different methods for the testing bearing1_7 are shown in Figure 15. The RUL of bearing1_7 predicted by the proposed MS-CNN is not consistent with the actual RUL in the early stage because the signal in the early stage is stable. In the late stage, the degradation is reflected by the proposed MS-CNN. Since accurate RUL prediction in the near-failure stage is more important in real industries, the proposed MS-CNN is promising for real industrial implementation.

The numerical comparison results for all the testing data are shown in Table 3. The MAE and RMSE of the proposed MS-CNN are the lowest among the compared methods in most cases. Although the proposed MS-CNN performs poorly on bearing2_4, the method is still robust across many different tasks. The DNN method has the largest errors of all the methods. The MS-CNN in Ref. [15] has smaller errors than the CNN because the combination of the final convolutional layer and the final pooling layer makes use of both the local and the global features learned by the high-level layers. However, the MS-CNN in Ref. [15] has larger errors than the proposed method, which suggests that the detailed information extracted by the low-level layers is useful for prognosis. The LSTM method performs worse than the CNN method, which indicates that it is less suitable for extracting features directly from large amounts of raw data. Moreover, the LSTM method takes much more time to train than the other methods, which makes it less suitable for industrial applications. The results indicate that the proposed method can provide reliable RUL estimations for different failure behaviors.

5. Conclusions

In this paper, an MS-CNN-based method for bearing prognosis is proposed to overcome the shortcomings of traditional CNN. The effectiveness of the proposed method was verified on a public dataset. The contributions are summarized as follows: (1) the integrated dilated convolution block can extract features in different receptive fields from the raw signal without increasing the number of network parameters; (2) the integrated dilated convolution blocks at different depths are concatenated, which avoids the loss of detailed information learned by the lower layers. The proposed architecture achieves higher accuracy than the other deep learning methods considered in this paper. However, the structure of the network is designed subjectively. Future work will concentrate on optimizing the structure of the network automatically.

Data Availability

The data used in this paper are available for download from the GitHub repository wkzs111/phm-ieee-2012-data-challenge-dataset, which hosts the dataset used during the PHM IEEE 2012 Data Challenge, built by the FEMTO-ST Institute.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (nos. 51505277 and 51875375) and the Suzhou Prospective Research Program (no. SYG201802).