#### Abstract

To enhance the performance of the deep auto-encoder (AE) under complex working conditions, a novel deep auto-encoder network method for rolling bearing fault diagnosis is proposed in this paper. First, multiscale analysis is adopted to extract multiscale features from the raw vibration signals of rolling bearings. Second, a sparse penalty term and a contractive penalty term are used simultaneously to regularize the loss function of the auto-encoder and enhance the feature learning ability of the network. Finally, the cuckoo search (CS) algorithm is used to find the optimal hyperparameters automatically. The proposed method is applied to experimental data. The results indicate that the proposed method distinguishes fault categories and severities of rolling bearings under different working conditions more effectively than the compared methods.

#### 1. Introduction

Rolling bearings are widely used in rotary machines and play an important role in determining the running state of the equipment. Under complex working conditions, rolling bearings will inevitably develop various faults, which may degrade the operating performance of the whole machine and even lead to enormous losses and serious casualties. Therefore, condition monitoring and fault diagnosis of rolling bearings are of great practical significance [1].

At present, vibration analysis is widely used in the fault diagnosis of rolling bearings. It generally contains three steps: (1) feature extraction, (2) feature selection, and (3) pattern recognition [2]. For (1), the effectiveness of feature extraction directly affects the accuracy of fault diagnosis. Common feature extraction methods include time-domain, frequency-domain, and time-frequency methods. In the time domain, the most popular methods are statistical analyses [3], such as root mean square, root amplitude, and maximum peak value. Frequency-domain methods mainly analyze characteristic frequencies [4], such as frequency standard deviation and root mean square frequency. In the time-frequency domain, well-known methods include the short-time Fourier transform (STFT) [5], the wavelet transform (WT) [6], and empirical mode decomposition (EMD) [7]. For (2), feature selection reduces the input dimension of the classifier and improves classification efficiency. Existing feature selection methods include decision tree [6], random forest [8], and distance-based feature selection [9]. For (3), the selected features are classified by Bayesian classification [10], support vector machines (SVM) [11], artificial neural networks (ANN) [12], and other machine learning algorithms.

The shallow structure of machine learning lacks powerful representation capabilities and has difficulty in effectively learning the complex nonlinear relationships in mechanical fault diagnosis problems [13]. As a breakthrough in the field of machine learning, deep learning has received widespread attention from all fields [14, 15]. It can automatically mine the representative information hidden in the raw data and directly establish an accurate mapping relationship between data and the operating state of the equipment [16].

AE is a kind of deep learning model, which can automatically learn features from samples and is widely used in the field of mechanical fault diagnosis [17]. Jia et al. [18] built a deep neural network based on AE and applied it to bearing fault diagnosis. Shao et al. [19] proposed a new deep feature fusion method for rotating machinery fault diagnosis, which has achieved satisfactory diagnostic results in bearing fault diagnosis and gear fault experiment analysis. Meng et al. [20] improved the performance of the denoising auto-encoder (DAE) and successfully identified rolling bearing faults. Sun et al. [21] successfully used a deep neural network based on sparse auto-encoder (SAE) for fault classification of induction motors. In addition, Shen et al. [22] proposed a fault diagnosis method for rotating machinery based on contractive auto-encoder (CAE). Zhang et al. [23] conducted mechanical fault diagnosis using the ensemble deep contractive auto-encoder (EDCAE) under noisy environment. The deep structure can automatically extract more representative and discriminating high-level features from the input and obtain satisfactory fault diagnosis results.

However, the effective characteristics of the signal are often submerged in industrial background noise. This leads to the performance decline of deep AE [24]. Moreover, the selection of hyperparameters in deep auto-encoder networks has always been a challenge. It is difficult to find the most suitable parameters. Therefore, it is necessary to design a novel deep AE.

To further enhance the performance of the deep auto-encoder and select hyperparameters adaptively, a novel CS-optimized deep auto-encoder network-based fault diagnosis method for rolling bearings is proposed in this paper. The proposed method was applied to experimental data of rolling bearings under different working conditions. The results show that the proposed method is more effective and robust than the other methods. The main contributions of this paper can be summarized as follows:

(1) The multiscale features of vibration signals are extracted as the input of the network, which improves test accuracy and reduces training time.

(2) The sparse penalty term and the contractive penalty term are used simultaneously to regularize the loss function of the AE, which enhances the feature learning ability and robustness of the deep AE.

(3) The CS algorithm is used to optimize the hyperparameters of the new AE network automatically, which makes the new auto-encoder network better suited to the signal characteristics.

The rest of the paper is organized as follows. The basic AE is described in Section 2. The proposed method is described in Section 3. In Section 4, the experimental diagnosis results for rolling bearing are analyzed and discussed. Finally, conclusions are given in Section 5.

#### 2. Basic AE

The basic AE is a three-layer unsupervised neural network [25] that learns useful features automatically by nonlinear dimension reduction. The structure of the basic AE is shown in Figure 1. It consists of two steps: encoding and decoding. The encoding procedure maps the input data to a hidden feature vector, and the decoding procedure maps the hidden feature vector back to a reconstructed vector.

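Given the unlabeled sample data $\{x^{d}\}_{d=1}^{D}$, the encoding procedure is defined as

$$h^{d} = s\left(W^{(1)} x^{d} + b^{(1)}\right). \tag{1}$$

The decoding procedure is defined as

$$\hat{x}^{d} = s\left(W^{(2)} h^{d} + b^{(2)}\right), \tag{2}$$

where *s* is the sigmoid function, $W^{(1)}$ and $W^{(2)}$ are the weight matrices, $b^{(1)}$ and $b^{(2)}$ are the bias vectors, $h^{d}$ is the hidden layer features vector and $\hat{x}^{d}$ is the output layer features vector, *D* is the number of training samples, and *d* stands for the *d*^{th} sample.

AE minimizes the reconstruction error by optimizing parameters **W** and **b**. The loss function of AE is usually defined by the mean square error (MSE) as

$$L\left(x^{d}, \hat{x}^{d}\right) = \frac{1}{2}\left\| x^{d} - \hat{x}^{d} \right\|^{2}, \tag{3}$$

where $\|\cdot\|$ represents the $\ell_2$ norm.

Thus, the total loss function of *M* samples is defined as

$$J_{AE}(W, b) = \frac{1}{M}\sum_{d=1}^{M} L\left(x^{d}, \hat{x}^{d}\right). \tag{4}$$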
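The encoding, decoding, and MSE loss described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `W1`, `b1`, `W2`, `b2`, the layer sizes, the random data, and the factor 1/2 in the per-sample loss are choices made here, not details confirmed by the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ae_forward(X, W1, b1, W2, b2):
    """Encode then decode. X has one sample per row."""
    H = sigmoid(X @ W1.T + b1)       # hidden feature vectors
    X_hat = sigmoid(H @ W2.T + b2)   # reconstructed vectors
    return H, X_hat

def ae_loss(X, X_hat):
    """Average over samples of 0.5 * squared L2 reconstruction error."""
    return float(np.mean(0.5 * np.sum((X - X_hat) ** 2, axis=1)))

# Toy data and sizes (assumptions for illustration only)
rng = np.random.default_rng(0)
n_in, n_hidden, n_samples = 8, 3, 5
X = rng.random((n_samples, n_in))
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b2 = np.zeros(n_in)

H, X_hat = ae_forward(X, W1, b1, W2, b2)
loss = ae_loss(X, X_hat)
```

Training would then adjust the weights and biases to reduce `loss`; the optimizer is left out of this sketch.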

#### 3. The Proposed Method

##### 3.1. DCSAE Network

If the encoder and decoder are given too much capacity, the AE will simply copy its input without capturing any useful information about the data distribution. Adding regularization terms to the AE loss function encourages the model to learn useful features without restricting its capacity [26].

Sparse penalty is a common regularization term. Added to the loss function of the AE, it can reduce the dimension of the input data effectively and speed up network training. The sparse penalty term is defined as [27]

$$J_{sparse} = \sum_{j=1}^{s_2} KL\left(\rho \,\|\, \hat{\rho}_j\right) = \sum_{j=1}^{s_2}\left[\rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right], \tag{5}$$

where *j* is the index of each unit in the hidden layer, *s*_{2} is the number of units in the second layer, $\rho$ is the sparse parameter, which is a small manually specified value, and $\hat{\rho}_j$ is the average activation value of the *j*^{th} hidden unit.
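As a rough illustration of the sparse penalty, the sketch below computes the usual KL-divergence form between a target sparsity (called `rho` here) and the average activation of each hidden unit. The function name, the clipping constant, and the KL form itself are assumptions based on the standard sparse auto-encoder formulation.

```python
import numpy as np

def kl_sparse_penalty(H, rho):
    """KL sparsity penalty summed over hidden units.

    H   : array (n_samples, n_hidden) of sigmoid activations in (0, 1)
    rho : target average activation (the sparse parameter)
    """
    # Average activation of each hidden unit, clipped to keep the logs finite
    rho_hat = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return float(kl.sum())

# The penalty vanishes when every unit's average activation equals rho
at_target = kl_sparse_penalty(np.full((4, 3), 0.15), 0.15)
# and grows when the hidden units are more active than the target
too_active = kl_sparse_penalty(np.full((4, 3), 0.5), 0.15)
```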

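Contractive penalty is added to the AE so that it can learn representative and robust hidden features in noisy environments. The contractive penalty term is defined as [28]

$$J_{con} = \left\| J_f(x) \right\|_F^{2}. \tag{6}$$

The element of the Jacobian matrix $J_f(x)$ in the *j*^{th} row and *i*^{th} column is defined as

$$\left[J_f(x)\right]_{ji} = \frac{\partial h_j}{\partial x_i}. \tag{7}$$

When the activation function of AE is the sigmoid function, equation (7) can be calculated by

$$\frac{\partial h_j}{\partial x_i} = h_j\left(1 - h_j\right) W_{ji}, \tag{8}$$

where $W_{ji}$ is the weight connecting unit *i* of the input layer and unit *j* of the hidden layer.

Based on equations (8) and (6), it can be further written as

$$\left\| J_f(x) \right\|_F^{2} = \sum_{j}\left[h_j\left(1 - h_j\right)\right]^{2}\sum_{i} W_{ji}^{2}. \tag{9}$$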
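The closed form for a sigmoid encoder can be checked numerically. The sketch below compares it with the squared Frobenius norm of a finite-difference Jacobian for a single sample. The names `W`, `b`, and `x`, the sizes, and the random values are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Closed form: sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2."""
    h = sigmoid(W @ x + b)
    return float(np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1)))

def numerical_jacobian(x, W, b, eps=1e-6):
    """Central-difference estimate of d h_j / d x_i."""
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        J[:, i] = (sigmoid(W @ (x + step) + b)
                   - sigmoid(W @ (x - step) + b)) / (2 * eps)
    return J

rng = np.random.default_rng(1)
x = rng.random(6)
W = rng.normal(size=(4, 6))
b = rng.normal(size=4)

closed_form = contractive_penalty(x, W, b)
numeric = float(np.sum(numerical_jacobian(x, W, b) ** 2))
```

The two values should agree up to finite-difference error, which is why the closed form is preferred in practice: it avoids building the Jacobian explicitly.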

Considering the advantages of sparse penalty and contractive penalty, in this paper, the sparse penalty term and contractive penalty term are added to the loss function of AE at the same time, and a contractive sparse auto-encoder (CSAE) can be obtained. The CSAE loss function can be defined as

$$J_{CSAE} = J_{AE} + \beta J_{sparse} + \lambda J_{con}, \tag{10}$$

where $J_{AE}$ is the reconstruction loss of equation (4), $J_{sparse}$ is the sparse penalty term of equation (5), $J_{con}$ is the contractive penalty term of equation (6), $\beta$ is the sparse penalty factor used to control the relative weight of the sparse penalty and the reconstruction error, and $\lambda$ is the contractive penalty parameter used to control the relative weight of the contractive penalty term and the reconstruction error.
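A minimal sketch of how the three terms of the CSAE loss might be combined is given below. It assumes a sigmoid encoder and decoder, a KL-divergence sparse penalty, and a contractive penalty averaged over samples. The argument names `rho`, `beta`, and `lam` (sparse parameter, sparse penalty factor, contractive penalty parameter) and the averaging are assumptions; the toy data are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def csae_loss(X, W1, b1, W2, b2, rho, beta, lam):
    """Reconstruction error + beta * sparse penalty + lam * contractive penalty."""
    H = sigmoid(X @ W1.T + b1)
    X_hat = sigmoid(H @ W2.T + b2)
    # Reconstruction term: mean of 0.5 * squared error per sample
    recon = np.mean(0.5 * np.sum((X - X_hat) ** 2, axis=1))
    # Sparse term: KL divergence between rho and mean hidden activations
    rho_hat = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)
    sparse = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    # Contractive term: closed-form squared Frobenius norm, averaged over samples
    contractive = np.mean(
        np.sum((H * (1 - H)) ** 2 * np.sum(W1 ** 2, axis=1), axis=1))
    return float(recon + beta * sparse + lam * contractive)

rng = np.random.default_rng(0)
X = rng.random((10, 8))
W1 = rng.normal(scale=0.1, size=(3, 8))
b1 = np.zeros(3)
W2 = rng.normal(scale=0.1, size=(8, 3))
b2 = np.zeros(8)

plain = csae_loss(X, W1, b1, W2, b2, rho=0.15, beta=0.0, lam=0.0)
full = csae_loss(X, W1, b1, W2, b2, rho=0.15, beta=0.27, lam=0.13)
```

Setting `beta` and `lam` to zero recovers the plain reconstruction loss; because both penalties are non-negative, the regularized value can never be smaller.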

Two CSAEs and one softmax classifier can be stacked to form a DCSAE network with several hidden layers, as shown in Figure 2. The detailed process of the DCSAE network is as follows:

Step 1: the input data and the first hidden layer make up the first CSAE for feature learning.

Step 2: the hidden layer features of the first CSAE are used as the input of the second CSAE; this process is repeated until all hidden layers are pretrained.

Step 3: the parameters are updated by minimizing the CSAE loss function.

Step 4: the hidden features of the last CSAE are used as the input of the softmax classifier for supervised training.

Step 5: the backpropagation (BP) algorithm is used to fine-tune the entire network to obtain a well-trained DCSAE.
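To make the stacking concrete, the sketch below runs an untrained forward pass through two sigmoid encoder layers followed by a softmax layer, using the layer sizes [85 50 30 7] reported later for experiment 1. It shows only how the layers compose. Pretraining, fine-tuning, and batch normalization are omitted, and the helper names and random weights are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def init_layers(sizes, rng):
    """One (weights, bias) pair per consecutive pair of layer sizes."""
    return [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def stacked_forward(X, layers):
    """Encoder layers use sigmoid; the final layer is the softmax classifier."""
    A = X
    for W, b in layers[:-1]:
        A = sigmoid(A @ W.T + b)
    W, b = layers[-1]
    return softmax(A @ W.T + b)

rng = np.random.default_rng(0)
sizes = [85, 50, 30, 7]              # input, two hidden layers, 7 classes
layers = init_layers(sizes, rng)
X = rng.random((4, 85))              # 4 random stand-in feature vectors
probs = stacked_forward(X, layers)   # one probability row per sample
```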

Moreover, a batch normalization (BN) layer is added to the DCSAE, which benefits the training of the network weight matrices.

##### 3.2. Parameter Optimization

The sparse parameter $\rho$, the sparse penalty factor $\beta$, and the contractive penalty parameter $\lambda$ in equation (10) are key hyperparameters of the DCSAE. When these parameters are set manually, it is hard to ensure that the chosen set is optimal.

The CS algorithm is a simple and effective optimization approach [29]. It searches paths randomly during the optimization process and obtains the optimal solution easily with few parameters to set [30]. Thus, the CS algorithm is used to select the hyperparameters of DCSAE. The process of the CS algorithm is as follows:

Step 1: set the cuckoo parameters (number of nests, population size, probability of discovery, and number of iterations), set the fitness function, and initialize the nest positions.

Step 2: calculate the value corresponding to each nest according to the fitness function and select the current best nest.

Step 3: retain the optimal value of the previous generation and its corresponding nest position, and then update the other nest positions and states.

Step 4: compare the current value with the retained optimal value of the previous generation; replace it if the current value is better, otherwise keep it unchanged.

Step 5: after updating the nest positions, randomly generate a new probability. If the new probability is less than the set discovery probability, the nest position remains unchanged; otherwise, the nest position is randomly changed. Finally, keep the best nest position.

Step 6: if the number of iterations has not reached the maximum or the error has not reached the minimum, return to step 2; otherwise, the global best nest position is the output.

Step 7: the global optimal nest position is output as the optimized parameters, and the optimization is finished.
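The steps above can be sketched as a compact cuckoo search. This sketch follows the common formulation with Lévy-flight proposals (Mantegna's method) and random abandonment of a fraction `pa` of nests, which is an assumption: the paper does not spell out its update rule, and the comparison direction in Step 5 is worded differently. The fitness here is a toy quadratic with an arbitrary target, standing in for a validation error of the network. All function names and the step scale 0.01 are illustrative.

```python
import math
import numpy as np

def levy_step(rng, size, beta=1.5):
    """Heavy-tailed step lengths via Mantegna's algorithm."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta
                * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, size)
    v = rng.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(fitness, dim, lower, upper,
                  n_nests=20, pa=0.25, n_iter=20, seed=0):
    """Minimize `fitness` over the box [lower, upper]^dim."""
    rng = np.random.default_rng(seed)
    nests = rng.uniform(lower, upper, (n_nests, dim))
    scores = np.array([fitness(n) for n in nests])
    for _ in range(n_iter):
        best = nests[scores.argmin()].copy()
        # Propose new positions with Levy flights scaled by distance to the best
        step = 0.01 * levy_step(rng, nests.shape) * (nests - best)
        candidates = np.clip(nests + step, lower, upper)
        cand_scores = np.array([fitness(c) for c in candidates])
        improved = cand_scores < scores
        nests[improved] = candidates[improved]
        scores[improved] = cand_scores[improved]
        # Abandon a random fraction of nests, never the current best
        abandon = rng.random(n_nests) < pa
        abandon[scores.argmin()] = False
        nests[abandon] = rng.uniform(lower, upper, (int(abandon.sum()), dim))
        scores[abandon] = np.array([fitness(n) for n in nests[abandon]])
    i = scores.argmin()
    return nests[i].copy(), float(scores[i])

def toy_fitness(p):
    # Hypothetical stand-in for a validation error; the target is arbitrary
    return float(np.sum((p - np.array([0.3, 0.6, 0.2])) ** 2))

best_params, best_score = cuckoo_search(toy_fitness, dim=3,
                                        lower=0.001, upper=1.0)
```

In the setting of this paper, the three dimensions would correspond to the three DCSAE hyperparameters, and the fitness call would train and evaluate a network, which makes each evaluation far more expensive than in this toy example.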

##### 3.3. Procedure of the Proposed Method

A new fault diagnosis method for rolling bearing is proposed based on CS and DCSAE in this paper. The flowchart of the proposed method is given in Figure 3.

The procedure of the proposed method is as follows:

Step 1: multiscale features are extracted from the rolling bearing vibration signals to constitute the feature set.

Step 2: the multiscale features are divided into training samples and testing samples according to a certain proportion and used as the input of the network.

Step 3: the CS algorithm is used to optimize the key hyperparameters of DCSAE.

Step 4: the training samples are used to train the DCSAE network.

Step 5: the testing samples are used to verify the trained DCSAE, and the fault diagnosis results are obtained.

#### 4. Experimental Verification

##### 4.1. Rolling Bearing Fault Test Rig

The rolling bearing experiment data are provided by the Anhui University of Technology (AHUT). The test rig is shown in Figure 4. The test rolling bearings are 6206-2RS1 SKF deep groove ball bearings, and the single point faults are seeded by using wire cutting technology with different depths of 0.2 mm, 0.3 mm, and 0.4 mm. The vibration signals from four kinds of rolling bearings, normal bearing (NO), bearings with inner race (IR), outer race (OR), and rolling ball (RB) faults, were collected. The motor speeds are 150 r/min, 300 r/min, 900 r/min, and 1500 r/min with the load of 0 kN, 5 kN, and 7 kN. The sampling frequency is 10.24 kHz, and the data length is 204800.

##### 4.2. Experimental Verification

###### 4.2.1. Dataset Introduction of Experiment 1

In this experiment, the rolling bearing dataset is collected under the motor speed of 900 r/min and the load of 0 kN. The operation conditions are listed in Table 1. Each condition consists of 200 samples, and each sample is a measured vibration signal consisting of 1024 data points. For each condition, 180 randomly selected samples are used as training samples, and the remaining 20 samples are used as testing samples. The time-domain vibration signals corresponding to the seven conditions are shown in Figure 5.
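The sample counts can be reproduced from the record length reported for the test rig: 204800 points cut into 1024-point samples gives 200 samples, and at 10.24 kHz the record lasts 20 s. The sketch below assumes non-overlapping segmentation, which is consistent with these numbers but not stated in the paper, and uses random noise as a stand-in for a measured record.

```python
import numpy as np

def segment(record, length):
    """Cut a 1-D record into non-overlapping samples of `length` points."""
    n = record.size // length
    return record[: n * length].reshape(n, length)

def split_indices(n_samples, n_train, rng):
    """Random split into training and testing indices."""
    order = rng.permutation(n_samples)
    return order[:n_train], order[n_train:]

rng = np.random.default_rng(0)
record = rng.normal(size=204800)       # synthetic stand-in for one condition
samples = segment(record, 1024)        # 200 samples of 1024 points
train_idx, test_idx = split_indices(len(samples), 180, rng)
train, test = samples[train_idx], samples[test_idx]
duration_s = record.size / 10240.0     # record duration at 10.24 kHz
```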

###### 4.2.2. Dataset Introduction of Experiment 2

In this experiment, rolling bearing data of the same fault degree under different loads were selected to verify the recognition ability of the DCSAE. The rolling bearing dataset is collected under the motor speed of 900 r/min and the loads of 5 kN and 7 kN. The operation conditions are listed in Table 2. Each condition also consists of 200 samples, and each sample also consists of 1024 data points. For each condition, 180 randomly selected samples are used as training samples, and the remaining 20 samples are used as testing samples.

##### 4.3. Multiscale Feature Extraction

Multiscale analysis is commonly performed on vibration signals to extract multiscale features and mine fault information more fully. Multiscale analysis based on signal coarse-graining is widely used. It mainly contains two steps. First, the coarse-grained sequences of the raw signal at different scales are calculated; then, the feature values of the coarse-grained sequence at each scale are calculated to obtain the multiscale features [31].

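For a sample of a given length, the coarse-grained sequence at each scale is given by [32]

$$y_j^{(\tau)} = \frac{1}{\tau}\sum_{i=(j-1)\tau+1}^{j\tau} x_i, \tag{11}$$

where $y^{(\tau)}$ is the coarse-grained sequence, $\tau$ is the scale, $x$ is the sample, and *L* is the sample length, $1 \le j \le L/\tau$.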
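Coarse-graining at scale `tau` amounts to averaging consecutive non-overlapping blocks of `tau` points. A minimal sketch follows; dropping a trailing incomplete block is an assumption made here for lengths that are not a multiple of the scale.

```python
import numpy as np

def coarse_grain(x, tau):
    """Average non-overlapping blocks of `tau` consecutive points."""
    x = np.asarray(x, dtype=float)
    n = x.size // tau                  # number of complete blocks
    return x[: n * tau].reshape(n, tau).mean(axis=1)

example = coarse_grain([1, 2, 3, 4, 5, 6], 2)   # block means: 1.5, 3.5, 5.5
scale_1 = coarse_grain([1, 2, 3], 1)            # scale 1 returns the signal itself
n_points_scale_5 = coarse_grain(np.arange(1024), 5).size
```

For a 1024-point sample, scale 5 leaves 204 points, since the last 4 points do not fill a complete block.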

To illustrate the effectiveness of multiscale features as network input, raw bearing vibration signals, single-scale features, and multiscale features of bearing vibration signals are each used as input for DCSAE. In the multiscale feature extraction based on the data of experiment 1, each sample consists of 1024 data points, and 17 types of time-domain and frequency-domain features are extracted at each of the first 5 scales of the raw rolling bearing vibration signals, forming a new feature set with a dimension of $17 \times 5 = 85$ per sample. The extracted features [33] are as follows:

(1) Time-domain features: mean, root mean square, root amplitude, average amplitude, maximum peak value, standard deviation, skewness index, kurtosis index, peak value index, margin index, waveform index, and pulse index.

(2) Frequency-domain features: center of gravity frequency, frequency standard deviation, root mean square frequency, skewness frequency, and kurtosis frequency.
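The construction of a multiscale feature vector can be sketched as follows. Only seven of the time-domain features are computed, using common textbook definitions that may differ from the formulas in Table 3, so the resulting vector has 7 × 5 = 35 entries rather than 85. The test signal is synthetic and the function names are illustrative.

```python
import numpy as np

def coarse_grain(x, tau):
    """Average non-overlapping blocks of `tau` consecutive points."""
    x = np.asarray(x, dtype=float)
    n = x.size // tau
    return x[: n * tau].reshape(n, tau).mean(axis=1)

def time_features(x):
    """A subset of time-domain statistics (textbook definitions)."""
    abs_x = np.abs(x)
    rms = np.sqrt(np.mean(x ** 2))
    peak = abs_x.max()
    return np.array([
        x.mean(),                       # mean
        rms,                            # root mean square
        np.mean(np.sqrt(abs_x)) ** 2,   # root amplitude
        abs_x.mean(),                   # average amplitude
        peak,                           # maximum peak value
        x.std(),                        # standard deviation
        peak / rms,                     # peak value index (crest factor)
    ])

def multiscale_features(x, scales=(1, 2, 3, 4, 5)):
    """Concatenate the features of each coarse-grained sequence."""
    return np.concatenate([time_features(coarse_grain(x, t)) for t in scales])

# Synthetic 1024-point sample: a 50 Hz sine plus noise, sampled at 10.24 kHz
rng = np.random.default_rng(0)
t = np.arange(1024) / 10240.0
x = np.sin(2 * np.pi * 50 * t) + 0.1 * rng.normal(size=1024)
feats = multiscale_features(x)
```

With all 17 features per scale, the same concatenation would give the 85-dimensional input used by the network.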

The formulas of partial time-domain features and frequency-domain features are shown in Table 3.

The basic parameters of the CS algorithm are set as follows: the number of nests is 20, the probability of discovery is 0.25, the number of iterations is 20, and the lower and upper bounds of the parameter search are [0.001, 1]. The optimal hyperparameters [$\rho$ $\beta$ $\lambda$] (sparse parameter, sparse penalty factor, and contractive penalty parameter) found by the CS algorithm are [0.15 0.27 0.13]. The deep structure of DCSAE is selected as [85 50 30 7] by experimentation, the learning rate is 0.01, and the number of iterations is 200. The comparison results of different inputs for DCSAE are shown in Figure 6. The test accuracy and training time are shown in Table 4.

As shown in Figure 6 and Table 4, when raw data are used as input for DCSAE, the training time is the longest and the average test accuracy is the lowest among the groups in each trial. This is because the raw data of rolling bearings are high-dimensional and contain a lot of redundant information, which hinders weight training in the network and leads to long training time. Using single-scale features as the input for DCSAE gives the shortest training time, but the recognition rate is far lower than that obtained with multiscale features.

Because multiscale features as network input mine fault information more fully, they are used as input for all networks trained in the experimental verification in Section 4.4.

##### 4.4. Results and Analysis

###### 4.4.1. Results and Analysis of Experiment 1

To verify the effectiveness of the proposed method, it is compared with SVM, SAE, stacked sparse auto-encoder (SSAE), and stacked contractive auto-encoder (SCAE).

In this experiment, the basic parameters of the CS are set as given in Section 4.3, and the optimal hyperparameters [$\rho$ $\beta$ $\lambda$] (sparse parameter, sparse penalty factor, and contractive penalty parameter) obtained through the CS algorithm are [0.15 0.27 0.13]. The deep structure of DCSAE is selected as [85 50 30 7] by experimentation, the learning rate is 0.01, and the number of iterations is 200. The kernel function of SVM is the RBF function, and the penalty factor and kernel function parameter are obtained by the CS, which are 0.64 and 0.017, respectively. The structure of SAE is [85 30 85], the learning rate is 0.01, and the hidden layer features are input to the softmax for classification. The structures of SSAE and SCAE are the same as that of the DCSAE, the learning rate is 0.01, and the number of iterations is 200. The sparse parameter $\rho$ and sparse penalty factor $\beta$ in SSAE and the contractive penalty parameter $\lambda$ in SCAE are also obtained by the CS and are 0.08, 0.18, and 0.69, respectively.

Ten trials are carried out for each of the five methods to reduce the influence of random errors, and the average result of the ten trials is used as the evaluation index. The comparison results of the five methods are shown in Figure 7 and Table 5. The following can be seen from Figure 7 and Table 5:

(1) Comparison between SSAE, SCAE, and DCSAE: the average test accuracies of SSAE and SCAE are 96.43% and 95%, respectively, which are lower than that of DCSAE (98.57%). The standard deviations of SSAE and SCAE are smaller than that of DCSAE. This indicates that applying the sparse penalty term and the contractive penalty term to the loss function of the stacked AE at the same time yields more satisfactory diagnosis results.

(2) Comparison between SVM, SAE, and the other three deep networks: the average test accuracies of SVM and SAE are 52.14% and 70.71%, respectively, which are much lower than those of the three deep networks (SSAE, SCAE, and DCSAE). This illustrates that deep learning methods perform better than shallow models in dealing with large samples, because they can mine more useful features from fault data.

The confusion matrix of the DCSAE for experiment 1 is shown in Figure 8. The columns stand for the true label, and the rows stand for the predicted label; the color bar on the right shows the correspondence between color and values (from 0 to 1). DCSAE obtains the best result of 100% on IR1-1, IR1-2, OR1-1, OR1-2, and NO1-1 in Figure 8. The only misclassification occurred on RB fault samples: about 10% of RB1-1 samples were misclassified as RB1-2.
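A confusion matrix with this layout (rows for predicted labels, columns for true labels, each column normalized to fractions between 0 and 1) can be computed as sketched below. The labels are hypothetical: two classes with 20 test samples each, where 2 samples of class 0 are predicted as class 1, mirroring a 10% misclassification rate.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Column-normalized confusion matrix: rows = predicted, columns = true."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[p, t] += 1
    col_sums = cm.sum(axis=0, keepdims=True)
    # Avoid division by zero for classes that have no true samples
    return cm / np.where(col_sums == 0, 1, col_sums)

# Hypothetical labels for illustration only
y_true = [0] * 20 + [1] * 20
y_pred = [0] * 18 + [1] * 2 + [1] * 20
cm = confusion_matrix(y_true, y_pred, n_classes=2)
```

Here `cm[1, 0]` is the fraction of true class 0 samples predicted as class 1, and each column sums to 1.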

A manifold learning method is used to visualize the data samples and further analyze the feature extraction capability of DCSAE. t-distributed stochastic neighbor embedding (t-SNE) [34] is an embedding model that maps data from a high-dimensional space to a low-dimensional space while retaining the local structure of the dataset. It is mainly used for dimension reduction and visualization of high-dimensional data. Therefore, t-SNE is adopted to extract the first three components of the raw data and of the last-layer features of DCSAE, which are drawn as scatter plots in Figures 9 and 10, respectively. As shown in Figure 10, fault samples of the same category cluster together, and fault samples of different categories can be easily distinguished. Taking Figures 8 and 10 together, it can be seen that a small number of RB1-1 and RB1-2 samples overlap. This leads to the misclassification of the samples with the RB fault condition.

In general, DCSAE achieves high accuracy in distinguishing the different categories under the same motor speed and load. Its average test accuracy is higher than those of the compared deep learning methods and traditional machine learning methods.

###### 4.4.2. Results and Analysis of Experiment 2

In this section, DCSAE is compared with three other deep AE networks: stacked AE, SSAE, and SCAE. The architecture of the four deep AE networks is [85 50 30 8], selected by experimentation; the learning rate is 0.01, and the number of iterations is 200. Multiscale features of the time domain and frequency domain were extracted as the input of the different networks. The optimal hyperparameters [$\rho$ $\beta$ $\lambda$] (sparse parameter, sparse penalty factor, and contractive penalty parameter) obtained through the CS algorithm are [0.26 0.56 0.5].

The results of the 10 trials are shown in Figure 11, and the average test accuracy and standard deviation over the 10 trials are shown in Table 6. As shown in Figure 11 and Table 6, compared with the other three deep AE networks, DCSAE has the highest average test accuracy and the lowest standard deviation (91.25% ± 0.0092). This indicates that DCSAE can obtain better diagnosis accuracy when dealing with fault data under different conditions.

The confusion matrix of the DCSAE for experiment 2 is shown in Figure 12. According to the result, about 5% of NO2-1 samples are misclassified as NO2-2, about 25% of RB2-1 samples are misclassified as RB2-2, and about 35% of RB2-2 samples are misclassified as RB2-1; all other categories are classified correctly when processing bearing fault data with different loads.

Similar to the above process, t-SNE is used to extract the first three components of the raw data and of the last-layer features of DCSAE in experiment 2, which are drawn as scatter plots in Figures 13 and 14, respectively. As shown in Figure 14, fault samples of different categories under different loads can be distinguished. Taking Figures 12 and 14 together, it can be seen that a large number of RB2-1 and RB2-2 samples overlap, so it is difficult for the softmax classifier to classify them with high precision. This leads to the misclassification of the samples with the RB fault condition. Overall, DCSAE shows robustness and generalization ability in distinguishing the different categories under different loads.

#### 5. Conclusion

In this paper, a novel deep AE network is proposed for rolling bearing fault diagnosis. In the proposed network, the sparse penalty term and the contractive penalty term are adopted to regularize the deep AE loss function and enhance the ability of feature learning. Moreover, the multiscale features of the rolling bearing vibration signal are extracted as the input of DCSAE to reduce the training time and data dimension. Furthermore, the CS algorithm is used to optimize the key hyperparameters of the deep auto-encoder automatically. The effectiveness of the proposed method is verified by two different experiments on rolling bearing fault diagnosis. The results show that the proposed method is more effective and robust for fault diagnosis than the other methods. Applying deep learning methods to other mechanical fault diagnosis fields is a worthy research topic, and the authors will continue to study this field in future work.

#### Data Availability

The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 6/12 months after publication of this article, will be considered by the corresponding author.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest concerning the publication of this manuscript.

#### Acknowledgments

This work was supported by the National Key Technologies Research & Development Program of China (no. 2017YFC0805100), the National Natural Science Foundation of China (nos. 51975004, 11932006, and U1934206), the Natural Science Foundation of Anhui Province, China (no. 2008085QE215), and the Key Program of Natural Science Research of Higher Education in Anhui Province of China (nos. KJ2019A053 and KJ2019A092).