Rolling bearing fault diagnosis is a meaningful and challenging task. Most methods first extract statistical features and then carry out fault diagnosis. At present, intelligent bearing fault identification mostly relies on deep neural networks, which place high demands on computing hardware and require great effort in hyperparameter tuning. To address these issues, a rolling bearing fault diagnosis method based on an improved deep forest algorithm is proposed. First, the fault feature information of the rolling bearing is extracted through multigrained scanning, and then fault diagnosis is carried out by the cascade forest. Considering the fitting quality and diversity of the classifiers, the classifiers and the cascade strategy are updated. To verify the effectiveness of the proposed method, it is compared with traditional machine learning methods. The results suggest that the proposed method can identify different types of faults more accurately and robustly. At the same time, it has very few hyperparameters and very low requirements on computer hardware.

1. Introduction

Rolling bearings are important basic components of mechanical equipment and are widely used in wind turbine generator sets, high-speed electric multiple units (EMUs), computerized numerical control (CNC) machine tools, and other equipment [1]. As core components of rotating machinery, their failure can result in huge economic losses and threaten personal and property safety [2]. Therefore, it is necessary to accurately grasp the running status of rolling bearings, maintain damaged parts in time, and prevent faults from evolving into greater threats. Accurately and effectively identifying the type of bearing fault and ensuring the normal operation of mechanical equipment are essential to improving the reliability of the system.

Vibration signals of rolling bearing faults are usually nonstationary and nonlinear [3, 4]. Early bearing fault identification technologies are mainly based on time domain, frequency domain, and time-frequency signal analysis methods [5–7]. In general, fault features such as kurtosis, coefficient of variation, energy entropy, information entropy, and power spectrum entropy are extracted from the original signal, and fault identification is then carried out with a classification algorithm. Among traditional fault diagnosis methods, the more commonly used time-frequency analysis methods include wavelet transform [8], empirical mode decomposition (EMD) [9], and variational mode decomposition (VMD) [10]. Zhao proposed a rolling bearing fault diagnosis method that combines wavelet packet decomposition (WPD) and multiscale permutation entropy (MPE). The vibration signals of rolling bearings in different states are decomposed into a group of subfrequency signals by WPD, the average MPE value of each subfrequency signal is calculated as the input feature vector, and the fault modes of the rolling bearings are identified by a hidden Markov model (HMM) [11]. Zhang proposed an automatic fault diagnosis method for rolling bearings based on lifting wavelet packet transform (LWPT), sample entropy, and classifier integration. In LWPT, the wavelet function is constructed in the time domain rather than through the Fourier transform. At the same time, considering the unstable accuracy of a single classifier, an ensemble system that integrates a back propagation neural network (BPNN), a radial basis function neural network (RBFNN), and an Elman neural network (ElmanNN) is proposed to reduce the impact of initial parameters on the performance of the classifier [12]. Compared with the traditional wavelet transform, the lifting wavelet packet transform offers flexibility in constructing the wavelet function, less computation, and lower memory use. Ensemble empirical mode decomposition (EEMD) [13] is an improved version of EMD. As a typical adaptive method for dealing with nonlinear and nonstationary data, EEMD has been widely applied in the field of fault diagnosis [14, 15]. VMD, developed on the basis of EMD, has a solid mathematical foundation, has been reported to outperform other adaptive data decomposition methods, and is widely used in fault diagnosis [16]. Chen et al. proposed a rolling bearing fault diagnosis method based on VMD and support vector machine (SVM), which significantly improved fault identification accuracy by using multiscale fractal dimension and multiscale energy as features [17].

In addition, some shallow learning methods are closely combined with intelligent optimization algorithms for fault diagnosis. Dai proposed a fault identification method based on KICA-RBF [18]. In this model, kernel independent component analysis (KICA) is used to fuse multiple signals and eliminate noise, and a genetic algorithm is used to optimize the parameters of the radial basis function (RBF), which improves accuracy. Zhao and Deng et al. improved a variety of optimization algorithms and proposed a data-driven feature extraction method, the fitting curve derivative method of maximum power spectrum density (FDMPD), which, combined with the kernel extreme learning machine (KELM) and weight application to failure times (WAFT), can effectively predict the remaining service life of rolling bearings [19–22]. Lv et al. proposed an improved particle swarm optimization (PSO) algorithm to optimize the parameters of a support vector machine for fault diagnosis of rolling bearings [23]. The PSO is improved by introducing dynamic inertia weight, global neighborhood search, a population shrinkage factor, and particle mutation probability. The method avoids blind selection of the kernel function parameter and penalty factor of the SVM, and experimental results showed that the classification is more stable.

Shallow learning algorithms require manual feature engineering and have limited representation learning ability, while deep learning can effectively model high-level abstractions of data due to its powerful nonlinear representation capability [24]. In recent years, with the development of artificial intelligence, deep learning has made breakthrough progress. The cross-domain application of deep learning in fault diagnosis has aroused great interest and achieved remarkable results [25–28]. Zhong et al. used EEMD to decompose intermittent fault signals into multiple intrinsic mode functions (IMFs), applied the Pearson correlation coefficient for feature optimization, and used a deep belief network (DBN) for fault diagnosis [29]. Guo et al. proposed a bearing fault diagnosis method based on a hierarchical learning rate adaptive deep convolutional neural network and achieved satisfactory results [30]. Xu et al. proposed a bearing fault diagnosis method based on a convolutional neural network (CNN) and random forest, which takes two-dimensional images of the continuous wavelet transform as input [31]. Multilevel features containing local and global information are used to diagnose bearing faults, and the research indicated that this method is superior to the base deep learning method. Although fault diagnosis methods based on deep neural networks (DNNs) are powerful, their models are complex, they need a large amount of training data, and their learning performance depends heavily on parameter optimization, which limits their applicability.

In 2017, Zhou et al. proposed an alternative to deep learning called multigrained cascade forest (gcForest), which generates a deep forest with a cascade structure for representation learning and is regarded as a decision tree ensemble approach [32]. gcForest is easier to analyze theoretically than a DNN. Some scholars have explored its application in the field of fault diagnosis. Hu et al. proposed a collaborative method combining a deep Boltzmann machine with multigrained scanning forest integration, which effectively addressed industrial fault diagnosis under big data [33]. Liu et al. applied deep forest for the first time to the end-to-end intelligent diagnosis of hydraulic turbine faults [34]. Considering the diversity of the cascade forest classifiers and the classification performance of each classifier, this paper proposes an improved deep forest algorithm. The mechanism changes the cascading mode based on the output of the multigrained scanning stage and replaces the classifiers in the cascade stage to increase diversity and improve classifier performance.

The main contributions of this paper are summarized as follows:

(1) The classification is based on the original vibration signal data, unlike most existing literature, which first extracts features and then classifies them. Human interference in the choice of feature window and feature type is thus avoided.

(2) An improved deep forest algorithm is proposed: the idea of heterogeneous integration is introduced, and the cascade mode is changed to reduce the loss of sample information, which further improves the fault diagnosis accuracy of gcForest.

The rest of this paper is organized as follows. Section 2 presents the basic principle of deep forest and its improved algorithm, Section 3 gives the results of the empirical analysis, and Section 4 concludes the paper.

2. gcForest

gcForest, also known as deep forest (DF), is a supervised ensemble learning algorithm based on decision trees, which mainly consists of two parts: multigrained scanning (MGS) and cascade forest [32]. MGS handles high-dimensional input and enhances the diversity of the input features. The cascade forest improves the classification ability on the input features by imitating the layered structure of a DNN for representation learning.

2.1. Multigrained Scanning

Multigrained scanning is mainly inspired by the CNN. The key point of the CNN is that different enhanced features can be obtained through convolutional kernels of different sizes. MGS draws on this idea to enhance the cascade forest [35]. Multigrained scanning locally samples the original data through a sliding window so as to obtain multiple feature instances of different dimensions. The process is described as follows: let the input sample be $n$-dimensional, the sliding window be $m$-dimensional, and the sliding step size be $s$, and let $N$ represent the number of generated feature vectors; then we have the following equation:

$$N = \frac{n - m}{s} + 1.$$

After the multigrained scanning, each sample subset is input to a random forest (RF) and a completely random forest (CRF) for training so that each forest obtains $N$ class vectors of dimension $c$, where $c$ is the number of categories. Finally, $2Nc$-dimensional features are obtained as the output of the multigrained scanning structure. Figure 1 shows the multigrained scanning process. Assuming that the raw data have 400-dimensional features and the sliding window is 100-dimensional, sliding scanning with a step size of 1 produces 301 feature vectors. If this is a three-class problem, each forest produces 301 three-dimensional class vectors. Finally, a 1806-dimensional transformed feature vector is obtained. Similarly, for 400-dimensional input data, sliding windows of 200 and 300 dimensions generate 1206-dimensional and 606-dimensional transformed feature vectors, respectively.
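The window count and output length in the example above can be checked with a few lines of Python. This is only a sketch: the function and parameter names (`sliding_windows`, `mgs_output_dim`, `n_forests`) are illustrative and do not come from the paper, and integer division is assumed when the step does not divide the input evenly.

```python
def sliding_windows(x, m, s=1):
    """Return every length-m slice of x taken with step s."""
    return [x[i:i + m] for i in range(0, len(x) - m + 1, s)]


def mgs_output_dim(n, m, s=1, n_classes=3, n_forests=2):
    """Return (N, D): the window count and the transformed feature length.

    n: input dimension, m: window size, s: step size.
    Each of the n_forests forests emits one n_classes-dimensional
    class vector per window, so D = n_forests * N * n_classes.
    """
    if not 0 < m <= n or s < 1:
        raise ValueError("need 0 < m <= n and s >= 1")
    n_windows = (n - m) // s + 1
    return n_windows, n_forests * n_windows * n_classes


if __name__ == "__main__":
    for m in (100, 200, 300):
        print(m, mgs_output_dim(400, m))
```

For a 400-dimensional input and three classes, the three window sizes give 1806, 1206, and 606 dimensions, matching the figures quoted in the text.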

2.2. Cascade Forest

The cascade forest stage embodies the process of deep learning through hierarchical representation learning of features. Each level in the cascade forest corresponds to a different scanning granularity. Each level receives the feature information from the previous level and, after processing it, passes the result to the next level. Each level takes as its input a feature vector that concatenates the original input with the output of the previous level [32, 35]. As shown in Figure 2, each layer of the cascade contains two random forests and two completely random forests. Each forest is composed of multiple decision trees, and each tree in a random forest randomly selects $\sqrt{d}$ of the $d$ input features as candidate features. Node splitting in these decision trees selects the candidate feature with the best Gini value. In a completely random forest, the feature used to split each node is chosen at random, and each tree grows until every leaf node contains only samples of the same class.
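To make the Gini-based splitting concrete, the following minimal sketch computes the Gini impurity of a label set and finds the best threshold on a single feature. It is an illustration under simplifying assumptions (one numeric feature, binary threshold splits); the function names are hypothetical. A random forest tree would repeat the search over its candidate features, whereas a completely random tree would pick the feature at random instead.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting at value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Return the threshold with the lowest weighted Gini, or None."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    if not candidates:
        return None
    return min(candidates, key=lambda t: split_gini(values, labels, t))


if __name__ == "__main__":
    feature = [0.1, 0.2, 0.8, 0.9]
    condition = ["NOR", "NOR", "IRF", "IRF"]
    print(gini(condition), best_threshold(feature, condition))
```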

The classification result is obtained from the class distribution at the leaf nodes of each decision tree: the distributions of all trees in a forest are averaged to give an estimate of the class distribution. To reduce the risk of overfitting, cross-validation is used during training.
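The averaging step can be sketched as follows. The per-tree distributions are supplied by hand here and the helper names are hypothetical; training the trees and the cross-validation used to generate class vectors are deliberately left out.

```python
def forest_class_vector(tree_distributions):
    """Average the leaf class distributions that each tree gives one sample."""
    n_trees = len(tree_distributions)
    n_classes = len(tree_distributions[0])
    return [
        sum(dist[k] for dist in tree_distributions) / n_trees
        for k in range(n_classes)
    ]


def level_output(input_features, forest_vectors):
    """Concatenate the input features with every forest's class vector."""
    output = list(input_features)
    for vector in forest_vectors:
        output.extend(vector)
    return output


if __name__ == "__main__":
    # Four trees voting on one sample of a three-class problem.
    trees = [
        [1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0],
        [0.5, 0.0, 0.5],
        [0.0, 0.5, 0.5],
    ]
    vector = forest_class_vector(trees)
    print(vector)
    # Four forests in a level, each contributing a 3-dimensional vector.
    print(len(level_output([0.0] * 1806, [vector] * 4)))
```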

2.3. Overall Process of Deep Forest

Combining multigrained scanning and the cascade forest gives the overall flow of the deep forest, as shown in Figure 3. Suppose that the original input has 400 features and three sliding windows of length 100, 200, and 300 are used for multigrained scanning [36], with a sliding step length of 1. As stated in Section 2.1, the feature vectors after MGS are input to an RF and a CRF for training, and feature vectors of 1806, 1206, and 606 dimensions are obtained, respectively, and used for training the first level of the cascade forest. Taking the 100-dimensional sliding window as an example, suppose there are three classes. The 1806-dimensional features are trained by four forest classifiers, each of which outputs a three-dimensional class vector, giving 12 dimensions in total. These are concatenated with the 1806-dimensional feature vector obtained after scanning transformation, so the first level obtains an 1818-dimensional feature vector, which is the input to the second level. Similarly, the second level obtains a 1218-dimensional feature vector after concatenation, which is used as the input of the third level. The third level generates a 618-dimensional vector, which is used as the input of the next level. This process is repeated until there is no significant performance gain, at which point training stops.
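The dimension bookkeeping in this example reduces to adding the augmented class vector (four forests times three classes) to each scanned feature length. A minimal sketch, with a hypothetical helper name:

```python
def cascade_concat_dims(mgs_dims, n_forests=4, n_classes=3):
    """Length of each scanned feature vector after the class vectors
    of one cascade level (n_forests * n_classes values) are appended."""
    augmented = n_forests * n_classes
    return [dim + augmented for dim in mgs_dims]


if __name__ == "__main__":
    print(cascade_concat_dims([1806, 1206, 606]))  # [1818, 1218, 618]
```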

2.4. Improved Deep Forest Model

DF is an ensemble learning algorithm. To build an ensemble with strong generalization ability, the individual learners should be “good but different”: individually accurate and mutually diverse. Layer-by-layer training of the cascade forest enhances the representation ability of feature information, and it is important for DF ensemble learning to adopt different classifiers for each layer.

In this study, a cascade forest based on multiple heterogeneous classifiers is proposed, and the classifiers at each level are set to RF, extremely randomized trees (ET), XGBoost, and LightGBM. The change is illustrated in Figure 4. The combination of several different types of forest classifiers can more fully learn the feature information of the input feature vectors and improve the overall performance of the model.

The most important characteristics of RF are sample randomness and feature randomness. By extracting different training sets and randomly extracting features for training, the difference between classification models will be increased, which can effectively avoid the overfitting problem. At the same time, the parallel computing mechanism of the RF algorithm greatly reduces the training time of data [37].

ET is the abbreviation of the extremely randomized trees model. Each tree is trained on all the samples, features are randomly selected, and node splits are chosen at random. Because splitting is random, ET can, to some extent, obtain better results than RF [38].

XGBoost (eXtreme Gradient Boosting) is a supervised learning algorithm based on decision trees, proposed on the basis of the gradient boosting decision tree (GBDT) [39]. XGBoost generates a new tree at each iteration. Compared with GBDT, XGBoost adds a regularization term to the loss function to control the complexity of the model. XGBoost also supports linear classifiers and, borrowing from RF, supports column sampling, which reduces both overfitting and computation; these features distinguish XGBoost from GBDT. XGBoost has been widely used because of its parallel processing, simple model structure, small computation cost, and high accuracy [40].
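For reference, the regularized objective mentioned above is usually written in the following form. The notation is taken from the standard XGBoost formulation rather than from this paper: $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are regularization coefficients.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}
```

The first sum measures the fit to the training data, and the second penalizes trees with many leaves or large leaf weights, which is how the complexity of the model is controlled.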

LightGBM is an improvement of GBDT, mainly intended to address the decline in training efficiency when the GBDT algorithm deals with large amounts of data. LightGBM improves the GBDT algorithm in two technical respects. (1) To handle large amounts of data, gradient-based one-side sampling (GOSS) is used: samples with large gradients are retained, and samples with small gradients are randomly sampled, which reduces the amount of data used in training. (2) A leaf-wise splitting method with depth limitation replaces the traditional level-wise splitting method. Each time, the leaf with the largest splitting gain is split, generating a more complex decision tree, which can reduce the error and improve the accuracy of the algorithm. These two techniques greatly reduce the time cost and accelerate training, and a large number of experimental studies have shown that LightGBM also performs well in terms of accuracy. Therefore, XGBoost and LightGBM are selected in this paper to replace two of the original classifiers in the cascade forest [41].
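The GOSS idea can be sketched in plain Python. This follows the usual description of GOSS (keep the top fraction `a` of samples by absolute gradient, randomly draw a fraction `b` from the rest, and up-weight the drawn samples by `(1 - a) / b` to compensate for the sampling); the rates, the seed, and the function name are assumptions made for illustration and are not settings reported in this paper.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) selected by gradient-based one-side sampling."""
    if not (0 < a < 1 and 0 < b <= 1 - a):
        raise ValueError("need 0 < a < 1 and 0 < b <= 1 - a")
    n = len(gradients)
    # Rank samples by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_k = int(a * n)
    rand_k = int(b * n)
    top = order[:top_k]          # large-gradient samples are always kept
    rest = order[top_k:]         # small-gradient samples are subsampled
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(rand_k, len(rest)))
    weight = (1.0 - a) / b       # compensates for the discarded samples
    indices = top + sampled
    weights = [1.0] * len(top) + [weight] * len(sampled)
    return indices, weights


if __name__ == "__main__":
    grads = [0.9, -0.8, 0.05, 0.01, -0.02, 0.03, 0.7, -0.04, 0.02, 0.06]
    print(goss_sample(grads))
```

With ten gradients, `a = 0.2`, and `b = 0.1`, the two largest-gradient samples are kept with weight 1 and one of the remaining eight is drawn with weight 8, so only three samples take part in the split search.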

On the other hand, in gcForest, the probability features and the original input features are serially concatenated into the input vector to help prevent overfitting. However, as the model depth increases, this sparse connection structure may discard a large amount of information, which may hinder the diversity of the ensemble. Inspired by the dense cascade forest model [42], we improve gcForest as follows. As shown in Figure 4, a sublevel is added to each level of the cascade. The first level of the original cascade has three sublevels, one for each scanning granularity; with the additional sublevel, the first level has four sublevels called Level1A, Level1B, Level1C, and Level1D. The additional sublevel is created by concatenating all the features together. Taking the example in Section 2.3, this feature vector is 3618-dimensional (1806 + 1206 + 606); it is input into the classifiers to obtain the probability class vector, which is concatenated with the 3618-dimensional features at Level1A. The original 1806-dimensional, 1206-dimensional, and 606-dimensional features are concatenated in Level1B, Level1C, and Level1D, respectively. The process at the second level is similar to that at the first level. Cascading all the features in this way retains more information from the original sample. In the experiment, we find that this structure makes the training process more stable.
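As a quick check of the sizes involved, the scanned features assigned to each sublevel can be listed as below. The helper is hypothetical and only tracks the scanned features; the class vectors appended at each sublevel are not included.

```python
def dense_sublevel_features(mgs_dims, level=1):
    """Map sublevel names to the scanned feature length each one uses.

    Sublevel A takes the concatenation of all granularities; the
    following sublevels take one granularity each.
    """
    dims = [sum(mgs_dims)] + list(mgs_dims)
    return {
        f"Level{level}{chr(ord('A') + i)}": dim
        for i, dim in enumerate(dims)
    }


if __name__ == "__main__":
    print(dense_sublevel_features([1806, 1206, 606]))
```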

3. Empirical Analysis

3.1. Data Sources

The data in this study come from the public bearing data set of Case Western Reserve University [43] in the United States. The data were collected on a test rig consisting of a motor, a torque sensor, and a power tester. The experimental platform is shown in Figure 5. The bearing faults were introduced by electric spark (electrical discharge) machining. The motor load ranges from 0 to 3 horsepower, and the corresponding motor speed ranges from 1797 to 1730 rpm. The vibration signals include the normal data, the drive end acceleration data, the fan end acceleration data, and the base data. This paper uses only the drive end fault data for analysis, with a sampling frequency of 12 kHz. Four different fault diameters were introduced for the inner raceway, outer raceway, and rolling element, ranging from 0.007 inches to 0.028 inches. Because data with a fault diameter of 0.028 inches are missing for some fault types, only the 0.007-inch, 0.014-inch, and 0.021-inch data are retained. The bearing conditions are divided into four kinds: normal state (NOR), inner race fault (IRF), ball fault (BF), and outer race fault (ORF). Figure 6 shows the three types of bearing fault. Partial vibration signals of the four conditions are shown in Figure 7.

In this study, we design two experiments for fault diagnosis of rolling bearings. In Experiment 1, the collected data are divided into four categories: normal state, inner race fault, ball fault, and outer race fault. Each condition contains 1,460,000 data points. In Experiment 2, the fault data are further divided by the three fault diameters. There are three fault types for each fault diameter, plus the normal state. Therefore, the data are divided into 10 categories, with 480,000 sampling points per category. Each data set is randomly divided into a training set and a test set in an 8 : 2 ratio.
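A data preparation step of this kind might look like the sketch below. The paper does not state the sample length or whether segments overlap, so the length of 1000 points and the non-overlapping cut are placeholders chosen for illustration, as are the function names and the random seed.

```python
import random


def segment(signal, length):
    """Cut a 1-D signal into non-overlapping samples of `length` points."""
    return [
        signal[i:i + length]
        for i in range(0, len(signal) - length + 1, length)
    ]


def split_8_2(samples, seed=0):
    """Shuffle the samples and split them into 80% training, 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    signal = list(range(10_000))      # stand-in for one vibration record
    samples = segment(signal, 1000)   # hypothetical sample length
    train, test = split_8_2(samples)
    print(len(samples), len(train), len(test))
```

Splitting each class separately, as the text describes, keeps the 8 : 2 ratio within every category.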

3.2. Performance Evaluation and Parameters Setting

In order to evaluate the generalization ability of the model, the 8-fold cross-validation method is adopted. Accuracy and macroaverage are used to evaluate the performance of the algorithm. Accuracy is the proportion of correctly predicted samples among all predicted samples and reflects the overall classification performance of the model. The macroaverage computes each performance metric independently for every class and then averages over the classes; it includes three indicators: precision, recall, and F1-score. Precision is the proportion of correctly classified positive samples among all samples that the classifier judges to be positive. Recall is the proportion of correctly classified positive samples among all actual positive samples. The larger these values are, the stronger the prediction ability of the model. The experimental platform of this paper is a 64-bit Windows 10 system with an Intel(R) Core(TM) i7-10510U CPU @ 2.30 GHz. The whole experiment is implemented in Python 3.7.
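These indicators can all be computed from a confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes, and that the macroaverage F1-score is the mean of the per-class F1-scores; the function name and the toy matrix are illustrative only.

```python
def macro_metrics(cm):
    """Accuracy and macroaveraged precision, recall, and F1-score.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    precisions, recalls, f1s = [], [], []
    for i in range(k):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(k))  # column sum
        actual = sum(cm[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": accuracy,
        "macro_precision": sum(precisions) / k,
        "macro_recall": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,
    }


if __name__ == "__main__":
    toy = [[8, 2],
           [0, 10]]
    print(macro_metrics(toy))
```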

The parameter setting of each classification model is as follows:

(1) RF: n_estimators = 10, max_depth = 10, criterion = “gini,” min_samples_split = 2, and min_samples_leaf = 10

(2) ET: n_estimators = 10, min_samples_split = 2, max_depth = 10, and min_samples_leaf = 10

(3) XGB: max_depth = 10, learning_rate = 0.1, and n_estimators = 10

(4) LGBM: learning_rate = 0.1, n_estimators = 10, and max_depth = 10

(5) SVM: C = 1, kernel = “rbf,” type = “Classification,” gamma = “scale,” and tol = 1e−3

(6) LSTM: activation = softmax, loss = categorical_crossentropy, optimizer = Adam, epochs = 500, and batch_size = 30

(7) DF: multigrained scanning: windows = 300, 600, 900; 1ET + 1RF, the same as individual classifiers. Cascade layer: 2ET + 2RF; ET and RF: the same as individual classifiers

(8) Proposed method: multigrained scanning: the same as DF. Cascade layer: 1ET + 1RF + 1XGB + 1LGBM; ET, RF, XGB, and LGBM: the same as individual classifiers
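For reproduction purposes, the tree-model and SVM settings above can be collected in a plain configuration dictionary such as the one below. The keyword names coincide with constructor arguments of the scikit-learn, XGBoost, and LightGBM estimators, but the paper does not state which libraries were used, so that mapping is an assumption; the SVM “type” entry has no counterpart there and is omitted, and the dictionary layout itself is illustrative.

```python
# Hyperparameters as listed in the text, keyed by model abbreviation.
PARAMS = {
    "RF": {"n_estimators": 10, "max_depth": 10, "criterion": "gini",
           "min_samples_split": 2, "min_samples_leaf": 10},
    "ET": {"n_estimators": 10, "max_depth": 10,
           "min_samples_split": 2, "min_samples_leaf": 10},
    "XGB": {"max_depth": 10, "learning_rate": 0.1, "n_estimators": 10},
    "LGBM": {"learning_rate": 0.1, "n_estimators": 10, "max_depth": 10},
    "SVM": {"C": 1, "kernel": "rbf", "gamma": "scale", "tol": 1e-3},
}

# Sliding-window sizes shared by DF and the proposed method.
MGS_WINDOWS = (300, 600, 900)

# Learners used in multigrained scanning and in each cascade layer.
MGS_LEARNERS = ["ET", "RF"]
CASCADE_LEARNERS = {
    "DF": ["ET", "ET", "RF", "RF"],
    "Proposed": ["ET", "RF", "XGB", "LGBM"],
}

if __name__ == "__main__":
    for name in CASCADE_LEARNERS["Proposed"]:
        print(name, PARAMS[name])
```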

3.3. Fault Diagnosis and Comparison Analysis

In order to better explain the performance of the improved model, we selected the base models RF, XGBoost, ET, SVM, and LightGBM, the long short-term memory (LSTM) deep learning model, and the standard gcForest for comparison. In Experiment 1, rolling bearing faults are divided into four categories. Table 1 shows the accuracy of these models on the training set and test set. The training accuracy and testing accuracy of the proposed method are 98.72% and 98.54%, respectively. Among the other 7 models, gcForest has the highest accuracy and RF the lowest; the proposed method exceeds gcForest by 1.28% and 1.67% and exceeds RF by 36.68% and 34.03% in training and testing accuracy, respectively. This shows that the recognition accuracy and robustness of the proposed method are greatly improved. The deep forest-based models are much more accurate than all the other learning methods, which shows the effectiveness of ensemble learning based on tree ensembles. In this experiment, the LSTM model is more accurate than the shallow learning algorithms other than SVM, owing to its strong nonlinear representation ability, but its accuracy is close to that of SVM; this may be because the performance of deep learning models depends strongly on parameter settings. The parameter settings for each model are described in Section 3.2.

Figure 8 shows the confusion matrix of the four classification test sets of gcForest and the proposed method. The diagonal elements of the matrix represent the recall rate for each fault mode. It can be seen from Figure 8 that the proposed method in this paper can fully identify the bearing normal condition and outer race fault, and the number of misdiagnoses of inner race fault and ball fault is less than that of gcForest.

F1-score index is the harmonic average of precision and recall, and it is a good comprehensive evaluation index. Figure 9 shows the comparison results of F1 values on the test set between the proposed method and the basic models in four categories. The F1-score of the method in this paper is consistent with the gcForest under normal conditions but higher than other basic models under other three types of faults, and the F1-score of most basic models is less than 80%. The above results further confirm the performance of the proposed method.

Macroaverage is the average value of each label evaluation calculated independently. The results of the three macroaverage indicators of different methods are shown in Figure 10, which shows that our method achieves the highest value in macroaverage precision, macroaverage recall, and macroaverage F1-score. It is shown that the proposed method has the best effect on four types of fault diagnosis.

In Experiment 2, rolling bearing faults are divided into three different diameters. Together with bearings in normal state, the faults are divided into 10 categories for experiments. The experimental results compared with the basic model are shown in Table 2. Table 2 shows that the accuracy of the deep forest-based learning algorithm is higher than that of other algorithms. Compared with the other 7 models, the proposed method still achieves the highest accuracy. Compared with the gcForest with the highest accuracy in the base models, the training accuracy and testing accuracy are improved by 2.74% and 2%, respectively.

Figure 11 shows the confusion matrix comparison between the gcForest model and the proposed method on the 10-class test set. As can be seen from the figure, the proposed method can fully identify six categories: NOR, IRF and ORF with a diameter of 0.007 inches, IRF and ORF with a diameter of 0.021 inches, and IRF with a diameter of 0.014 inches. Although its misjudgment rate for the ball fault with a diameter of 0.007 inches is slightly higher than that of gcForest, its recognition accuracy for the remaining three conditions is higher than that of gcForest, and the improvement is especially large for the ball fault with a diameter of 0.021 inches.

Figure 12 shows the comparison of F1-scores on the test set between the proposed method and the base models in the 10-class case. The F1-score of the proposed method is high for most categories. The best diagnostic results are achieved for all inner race faults, the normal condition, and outer race faults with diameters of 0.007 inches and 0.021 inches. For ball faults with diameters of 0.014 inches and 0.021 inches and outer race faults with a diameter of 0.014 inches, the F1-score is higher than that of the other base models; only for ball faults with a diameter of 0.007 inches is it slightly lower than that of gcForest. The F1-score of most base models is around 40%. These results further demonstrate the superior performance of the proposed method.

Figure 13 shows the three macroaverage index values of the different methods. According to Figure 13, our method achieves the highest value in macroaverage precision, macroaverage recall, and macroaverage F1-score. The macroaverage precision of the proposed method is 57% higher than that of ET, the lowest, and 2% higher than that of gcForest, the highest among the compared models. Similarly, the macroaverage recall and macroaverage F1-score of the proposed method are 56% and 57% higher than those of the lowest method and 1% higher than those of the highest method. This shows that the proposed method has the best diagnostic performance on the 10 types of faults.

In addition, in order to further illustrate the performance of the proposed method, we designed Experiment 3. In Experiment 3, two innovations in the proposed method are analyzed separately for the two classification situations and compared with existing methods in literature [36]. The experimental results are shown in Tables 3 and 4, respectively. Table 3 shows the comparison results of training accuracy and testing accuracy in 4 types of situations, and Table 4 shows the comparison results of training accuracy and testing accuracy in 10 types of situations.

“Only improve the cascade” means that, as mentioned in Section 2.4, a sublevel is added to each cascade layer; it is composed of the full concatenation of class vectors and placed at the first sublevel of each level, while the learners remain the same as in the original deep forest. “Only replace the learner” means that the learners are replaced, while the cascade mode after multigrained scanning remains the same as in the original deep forest. Tables 3 and 4 show that the accuracy of the proposed method is the highest. In Table 4, its training and testing accuracies are 98.05% and 96.99%, respectively, which are 4.86% and 1.25% higher than those of the “only improve the cascade” variant and 0.76% and 0.75% higher than those of the “only replace the learner” variant. The training and testing accuracies are also higher than those of the existing method in the literature, which supports the effectiveness of the proposed method for rolling bearing fault diagnosis.

4. Conclusion

This paper proposed an improved method of rolling bearing fault diagnosis based on deep forest. CWRU bearing vibration signals are used to verify the effectiveness of this method through two different groups of experiments. Connecting multiple scan granularities to add a sublevel at each level of the cascade forest reduces the loss of information flow. In addition, the adoption of more efficient tree model learners not only increases the diversity of classifiers but also helps to improve the recognition rate of fault diagnosis. The analysis results suggest that the detection rates of test faults under the two classifications are 98.54% and 96.99%, respectively, which is higher than those of other base models. The results indicate that the improved deep forest model has high recognition ability and robustness for bearing faults.

This study has a few limitations. Although the proposed method obtains better fault detection performance, it is relatively time-consuming because multiple granularity connections are added for cascading. To address this problem, feature optimization in the cascade layer will be considered in future research so that faults can be detected both efficiently and accurately.

Data Availability

The rolling bearing data are from the website of the Case Western Reserve University bearing data center, and they are all available at http://csegroups.case.edu/bearingdatacenter/pages/download-data-file.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was partially supported by the National Natural Science Foundation of China Mathematics Tianyuan Foundation (no. 12026430) and the Scientific Project of Education Department of Jilin Province (no. JJKH20210716KJ). The authors are very grateful for the support.