Abstract
A convolutional neural network shares information between layers, which enables it to process high-dimensional data. In general, it uses a feedback mechanism to self-regulate its parameters, which removes the need for manual parameter adjustment. However, the number of iterations that yields the best accuracy cannot be determined in advance, so computational efficiency cannot be guaranteed while achieving the best accuracy. In this paper, a multilayer extreme learning convolutional neural network model is proposed for feature recognition and classification. Firstly, the two-dimensional spatial characteristics of planetary bearing status data are enhanced. Then, an extreme learning machine is embedded in the convolution layer to solve the convex optimization problem. Finally, the parameters obtained from the training model are nested into the network to initialize the model parameters and separate each status feature. Experimental cases on planetary bearings show the effectiveness and superiority of the proposed model in the recognition and classification of weak signals.
1. Introduction
With the improvement of the automation level in modern production systems, rotating machinery is developing towards high speed, high efficiency, and maximum economic benefit. However, a continuous production process makes the equipment run under heavy load for a long time, which easily leads to accelerated fatigue of transmission parts. Furthermore, tight connections between devices mean that the health status of an individual component affects the efficiency and quality of the entire system. Once the transmission parts fail, a series of chain reactions follows, and the whole equipment or even the whole production line may stop working. Therefore, reliable monitoring of transmission parts is crucial to maintaining a safe production process.
In recent years, the planetary gearbox with dual rotor bearings has become a main transmission component because of its compact structure, large transmission ratio, light weight, and strong bearing capacity. It is widely used in automotive, wind power, aerospace, and other fields. Because the equipment works in complex environments for a long time, its transmission parts are prone to accelerated fatigue [1]. For example, wind energy, as a green energy source, has made wind power generation one of the fastest growing branches of the power generation field. Typically, a wind turbine consists of a planetary gear train (I level transmission) and two fixed-shaft gear trains (II level and III level transmission), as shown in Figure 1. Planetary gear trains are usually mounted at the low-speed end to withstand greater torque. In addition, wind turbines are usually located in relatively wide-open or offshore areas and are often affected by irregular variable-speed winds and by an ambient temperature that changes with the season. Due to this complex working environment, the key components (gears and bearings) of the planetary gearbox are easily damaged. For example, the G52-850 wind turbine, consisting of Gamesa and Echesa speed-increasing gearboxes and INDER generators, showed abnormalities after 5 years of work. Through endoscopic and unpacking tests, it was found that the fault was caused by planetary bearings [2]. Adopting a reliable method to monitor the equipment condition is therefore key to maintaining a safe production process, and the working efficiency and safety of a wind power generation system can be greatly improved when the planetary bearing operates in a stable status.

From the evolution process of bearing failure (Figure 2), it can be seen that the initial stage of failure accounts for a large proportion of the entire damage process. As the fault continues to deteriorate, the degradation rate increases exponentially. In the early stage of failure, the abnormal symptoms are slight, the impact on the mechanical system is small, and the maintenance cost is relatively low. If a fault goes undiagnosed or unnoticed at this stage, it can develop and accumulate until it leads to a catastrophic accident. Therefore, early detection, early diagnosis, and early maintenance are essential to ensure the safe operation of high-precision equipment. In addition, intelligent devices generate a large amount of data that needs to be analyzed. Traditional fault diagnosis based on point-by-point single-signal analysis struggles to detect the fault-related characteristic components quickly and accurately, which seriously hinders the development of high-end equipment towards high precision, high speed, and high reliability. Data-driven intelligent fault diagnosis can address this problem [3].

The birth of the deep learning algorithm in 2006 [4, 5] marked the development of fault diagnosis towards rapidity, efficiency, and intelligence. A data set with high target feature resolution yields accurate fault diagnosis results, and a complete data volume improves the learning ability of the model. Existing data-driven neural network models have achieved good results in some ideal environments. However, owing to the randomness and uniqueness of faults, several factors still restrict their application in the field of fault diagnosis. Rotating machinery is in a healthy status most of the time, so most of the collected signals represent a healthy status. Because measured fault samples are costly to collect, it is difficult to obtain all types of fault samples, which makes the sample set unbalanced. Besides, in the case of an early fault or large external interference, the fault characteristic information is weak or even submerged, and the model may give interference information a high-confidence output. To address the incompleteness of fault data sets, Gao et al. and Liu et al. used finite element simulation to generate samples with different fault statuses [6, 7]. An et al. [8] proposed a self-learning transferable neural network for intelligent fault diagnosis with unlabeled and imbalanced data. Most weak fault intelligent diagnosis methods [9, 10] use traditional fault feature extraction as preprocessing to extract sensitive information, and research on improving the robustness of the model itself is lacking.
A convolutional neural network (CNN) [11, 12] is one of the representative models for intelligent recognition and classification of weak bearing fault signals. It has attracted the attention of many researchers and has been widely used in many fields, including bearing fault diagnosis. Fu et al. [13] used 1D convolution kernels of different scales to extract multiscale features and performed dimensional assimilation on the feature spaces of different scales, based on fusion theory, to adapt them to the convolution operation. Zhao et al. [14] converted one-dimensional time-domain signals into 2D grayscale images, which were used as the sample data of the CNN model. This solved the problem of insufficient data and avoided manual feature extraction. In [15], cyclic spectral coherence was adopted as preprocessing to extract the information that best characterized the bearing status, and group normalization was then introduced to balance the distribution difference of the data. Ye et al. [16] proposed a deep morphological convolutional network, which consists of two parallel branches: noise filtering and feature selection. The noise filtering branch updates its structure elements through backpropagation, and the feature selection branch is based on kurtosis weight fusion. Besides, the values of various hyperparameters directly affect the training speed and accuracy of the CNN model. Currently, an error backpropagation (EBP) mechanism is often used to modify the model parameters. During parameter adjustment, the initial values of some parameters may also affect the classification results of the model. In addition, the range of adjustable parameters involved in the algorithm directly affects the computational complexity. Therefore, the EBP-based CNN model is not well suited to rapid online monitoring, especially for early diagnosis of weak faults. An effective model is urgently needed to improve the performance of online monitoring.
In order to build a model framework with superior performance, the extreme learning machine (ELM) principle is adopted to deal with the convex optimization problem of a convolution layer. ELM was firstly proposed by Huang et al. [17] for a feedforward single-layer neural network. Subsequently, ELM was gradually introduced into the multilayer model structure [18, 19]. Compared with other models such as Deep Belief Network (DBN) [20, 21] and Stacked Autoencoder (SAE) [22], the ELM model involves fewer parameters and has higher computational efficiency and less complexity. Therefore, ELM has been favored by researchers in many fields, such as image processing [23], objective optimization [24, 25], dimensionality reduction [26], and fault diagnosis [27–29].
ELM has also been combined with other models to improve training efficiency and recognition accuracy. For example, ELM was combined with an autoencoder to mine deep features of training data and proved to be superior to ELM, SAE, and CNN [30]. Online sequential ELM was proposed to classify and recognize the low-dimensional features extracted from the SAE model, and its effectiveness has been proven for tool wear status recognition [31]. ELM was used as an enhanced classifier to improve the recognition accuracy of an integrated CNN model, and its superiority in training speed and accuracy was verified by comparison with six other models [32].
In general, ELM plays the role of an efficient classifier in a hybrid model. The existing model frameworks based on a multilayer perceptron have shortcomings in training speed. The goal of this paper is to find a model training mechanism that improves both training accuracy and speed. To this end, a fast and effective embedded hybrid model structure, called the multilayer extreme learning convolutional feature neural network model (M_ELMConvNet), is proposed. The main contributions are twofold.
(1) The wavelet cyclic spectrum feature extraction method [33] is used to convert the time-domain signal into a two-dimensional image. The obtained image is then partitioned, which makes it more suitable for CNN analysis.
(2) A new model training mechanism that embeds ELM into a convolutional layer is proposed to improve the calculation speed and classification accuracy. The final classification and recognition results are obtained by a multilayer stacking structure. The computational speed and accuracy of the proposed algorithm are verified by comparison with the results of other models.
The remainder of this paper is structured as follows: the relevant theoretical research background contents, such as CNN and ELM models, are shown in Section 2. The proposed model framework and implementation process are introduced in Section 3. The proposed method is applied to the experimental data in Section 4. Finally, conclusions and the next step are described in Section 5.
2. Theoretical Background
2.1. CNN
CNN is a self-learning model that can automatically extract the internal feature information of the input data and implement classification tasks. Different from traditional neural networks, CNN generally contains a convolution layer and a subsampling layer (also called a pooling layer). CNN learns hidden features by repeatedly alternating convolution and pooling operations. The convolution layer convolves the original input data with multiple local filters to generate locally invariant feature information, which is used as the input of the pooling layer to extract representative features. The procedure is shown in Figure 3.

Suppose the size of the input layer is $h \times w$, and the number of input layers, that is, the number of channels, is $c$. Define a convolution kernel as $K$, whose size is $k_h \times k_w$. The convolution layer repeatedly performs convolution operations between the input layer data and the convolution kernel. The features extracted by the convolutional layer serve as the input of the pooling layer, which further reduces the dimension of the feature matrix by calculating the local average or maximum. Subsequently, the fully connected layer flattens the output matrix of the pooling layer. The main task of each layer before the fully connected layer is feature extraction, and the classification task starts at the fully connected layer. Generally, there are multiple fully connected layers in the whole network structure. Because a single-layer structure can only solve linear classification problems and most real-world problems are nonlinear, a softmax layer is connected after the fully connected layers to predict the label. The probability of each category is calculated for every sample, and the label with the largest probability is assigned to that sample.
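The forward pass described above can be illustrated with a minimal NumPy sketch. It is not the configuration used in this paper: the 8 x 8 input, the single 3 x 3 kernel, the 2 x 2 max pooling, the random fully connected weights `W_fc`, and the four output classes are all assumptions chosen only to make the data flow visible.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Stride-1, no-padding 2D cross-correlation (the 'convolution' used in CNNs)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size):
    """Non-overlapping size x size max pooling (trailing rows/cols are dropped)."""
    rows, cols = x.shape[0] // size, x.shape[1] // size
    x = x[:rows * size, :cols * size]
    return x.reshape(rows, size, cols, size).max(axis=(1, 3))

def softmax(z):
    """Numerically stable softmax over a 1D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))            # single-channel toy input (assumed size)
kernel = rng.normal(size=(3, 3))           # one local filter (assumed size)
features = conv2d_valid(image, kernel)     # convolution layer output, shape (6, 6)
pooled = max_pool(features, 2)             # pooling layer output, shape (3, 3)
W_fc = rng.normal(size=(pooled.size, 4))   # hypothetical fully connected weights, 4 classes
probs = softmax(pooled.ravel() @ W_fc)     # class probabilities from the softmax layer
label = int(np.argmax(probs))              # label with the largest probability
```

In a trained CNN the kernel and `W_fc` would be learned by EBP rather than drawn at random; the sketch only shows the order of the operations.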
2.2. ELM
ELM uses randomness and the Moore-Penrose generalized inverse to calculate its parameters, which avoids the use of EBP and greatly improves the training speed. The input layer data is usually nonlinear and not linearly separable. The core idea of the ELM algorithm is to map the original data into a high-dimensional feature space by adding hidden layer nodes, so that the data become linearly separable in that space. The connection weights between the input layer and the hidden layer are randomly generated. The connection matrix between the hidden layer and the output layer is calculated by the Moore-Penrose generalized inverse. The entire training process only needs to adjust the number of hidden layer nodes. The schematic diagram is shown in Figure 4.

(a) ELM network structure diagram and its parameterized model

(b) Spatial model diagram of ELM. The nonlinear characteristics of the input data are fitted by multiple mappings
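As an illustration of the ELM principle, the following sketch trains a single-hidden-layer ELM on the XOR problem, which is not linearly separable in the input space. The sigmoid activation, the standard normal initialization, and the choice of 20 hidden nodes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    """Train an ELM: random input weights, output weights by pseudoinverse."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input-to-hidden weights
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = sigmoid(X @ W + b)                       # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                 # Moore-Penrose solution of H @ beta = T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Map inputs through the fixed random layer and the learned output weights."""
    return sigmoid(X @ W + b) @ beta

# XOR: four points with one-hot targets
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
W, b, beta = elm_fit(X, T, n_hidden=20, rng=np.random.default_rng(0))
pred = elm_predict(X, W, b, beta).argmax(axis=1)  # predicted class per sample
```

The only trained quantity is `beta`, obtained in one linear-algebra step, which is why no iterative EBP loop appears.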
3. Proposed Architecture and Method
3.1. Formulated General Model Framework
Suppose there is a set of bearing status data to be classified. Denote the data set by $X \in \mathbb{R}^{N \times d \times c}$ and the target matrix by $T \in \mathbb{R}^{N \times m}$, where $d$ and $c$ are the length and the number of channels of the input data, respectively, $N$ is the number of samples in the data set, and $m$ is the number of target categories to which the input data belong. The relationship between the sample set and the label matrix is expressed by a classification function:
$$T = f(X), \tag{1}$$
where $f$ is the classification function of $X$.
3.2. Data Graphical Processing and Enhancement
In order to improve the accuracy of model recognition, the two-dimensional CNN model was used for data processing. The original sample data is one-dimensional, which reflects the time-domain waveform information and often overwhelms some fault active components. In this paper, one-dimensional time-domain data are converted into 2D images based on wavelet cyclic spectrum theory [33] and the periodic characteristics of nonstationary bearing information are extracted. The local information in cycle spectrum of each fault type is similar. In order to make up for the shortage of faulty data samples and improve the quality of sample data, the converted image is further processed by block localization. Suppose that is the 2D cyclic spectrum sample matrix, and are the width and height of each single image sample matrix, respectively, and the label matrix is . For the sample , we regard it as being composed of several submodules . , and corresponding labels . where is the tensor product and and are the classification functions for and , respectively.
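The block partition step can be sketched as follows. This is a minimal illustration under two assumptions that the text does not spell out: the blocks are non-overlapping, and every sub-block inherits the label of the image it was cut from. The function name `partition_image` and the block sizes are hypothetical.

```python
import numpy as np

def partition_image(image, block_h, block_w, label):
    """Cut a 2D image into non-overlapping blocks that all share the parent label."""
    h, w = image.shape
    rows, cols = h // block_h, w // block_w
    trimmed = image[:rows * block_h, :cols * block_w]   # drop incomplete edge blocks
    blocks = (trimmed.reshape(rows, block_h, cols, block_w)
                     .swapaxes(1, 2)                     # -> (rows, cols, block_h, block_w)
                     .reshape(rows * cols, block_h, block_w))
    labels = np.full(rows * cols, label)                 # one label per sub-block
    return blocks, labels

# A 4 x 4 toy "cyclic spectrum" with label 3 yields four 2 x 2 samples
image = np.arange(16).reshape(4, 4)
blocks, labels = partition_image(image, 2, 2, label=3)
```

One image thus contributes several training samples, which is how partitioning enlarges a small fault data set.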
3.3. Parameter Transfer
3.3.1. Convolution Layer Detector Based on ELM
The ELM algorithm is integrated into the model to further improve its classification performance. As mentioned above, ELM is superior in complexity and accuracy compared with other algorithms [34]. The weight matrix of the input layer is randomized; that is, the weight matrix of the input layer and the input data are independent of each other, and the weights can be drawn arbitrarily from a chosen probability distribution. The output matrix of the hidden layer and the weight matrix between the hidden layer and the output layer need to be calculated relying on label data.
The specific implementation process is as follows:
Assume there are $N$ test samples $X$ with corresponding labels $T$, and let $L$ be the number of hidden layer nodes. The hidden layer output can be expressed as
$$H = g(XW + b), \tag{4}$$
where $W$ is the connection matrix between the input layer and the hidden layer, $b$ is the offset vector, and $g(\cdot)$ is the activation function. Both $W$ and $b$ are randomly generated. The output layer matrix can be expressed as
$$T = H\beta, \tag{5}$$
where $m$ indicates the number of categories of the output target label and $\beta$ is the layer connection weight ($L \times m$) between the hidden layer and the output layer. $\beta$ is unknown and must be calculated from the label data when the ELM algorithm predicts the classification results. The expression is
$$\beta = H^{\dagger} T, \tag{6}$$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$.
In the training stage, the number of hidden layer nodes is an uncertain factor that affects the prediction performance. In order to evaluate the parameters calculated by the model, the error minimization loss function is used as an evaluation index of prediction ability:
$$E = \lVert H\beta - T \rVert^{2}. \tag{7}$$
The parameters corresponding to the minimum value are the optimal values that best characterize the target features. Finding the minimum number of hidden layer nodes that still ensures the highest accuracy is another way to improve operation efficiency.
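The search over the number of hidden layer nodes can be sketched as a simple sweep that records the squared-error loss for each candidate. The synthetic data, the tanh activation, and the candidate list below are assumptions for illustration only.

```python
import numpy as np

def elm_training_loss(X, T, n_hidden, rng):
    """Fit an ELM with n_hidden nodes and return the squared error ||H beta - T||^2."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights
    b = rng.normal(size=n_hidden)                # random biases
    H = np.tanh(X @ W + b)                       # hidden layer output
    beta = np.linalg.pinv(H) @ T                 # least-squares output weights
    return float(np.sum((H @ beta - T) ** 2))

def sweep_hidden_nodes(X, T, candidates, seed=0):
    """Return {n_hidden: loss} for every candidate number of hidden nodes."""
    return {n: elm_training_loss(X, T, n, np.random.default_rng(seed))
            for n in candidates}

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                     # 30 synthetic samples, 5 features
T = np.eye(3)[rng.integers(0, 3, size=30)]       # one-hot labels, 3 classes
losses = sweep_hidden_nodes(X, T, candidates=[2, 10, 40])
best_n = min(losses, key=losses.get)             # candidate with the smallest loss
```

The training loss alone tends to favor larger hidden layers, so in practice the choice would also weigh held-out accuracy and training time, as is done in Section 4.2.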
3.3.2. Pooling Layer
The pooling layer is a downsampling layer, which is mainly used to extract local features and prevent overfitting. The procedure is as follows: first, the size and step of the local pooling module are defined. Then, the local feature extraction method is determined. The most common method is to calculate the average or maximum value of each module. In this paper, the pooling layer follows the random parametric dimension reduction layer of equation (4). The pooling process computes the average value of each module, where $s$ is the size of each step.
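A minimal sketch of average pooling is given below. It assumes non-overlapping square modules whose side length equals the step, which the text does not state explicitly; the function name `average_pool` is hypothetical.

```python
import numpy as np

def average_pool(x, size):
    """Average over non-overlapping size x size modules (step equal to module size)."""
    h, w = x.shape
    rows, cols = h // size, w // size
    x = x[:rows * size, :cols * size]             # drop incomplete edge modules
    return x.reshape(rows, size, cols, size).mean(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
pooled = average_pool(x, 2)                        # 4 x 4 input -> 2 x 2 output
```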
3.3.3. Normalization
After the pooling layer and before the ELM classification layer, a min-max standardization process was added to prevent the occurrence of gradient disappearance. The normalization result is obtained by
$$\hat{P} = \frac{P - \min(P)}{\max(P) - \min(P)}\,(b - a) + a, \tag{9}$$
where $\max(\cdot)$ and $\min(\cdot)$ are operators for calculating the maximum and minimum elements in the matrix $P$, respectively, and $a$ and $b$ are the lower and upper bounds of the interval to which the matrix is normalized.
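The min-max step can be sketched in a few lines. The default interval [0, 1] and the handling of a constant matrix are assumptions added for this example.

```python
import numpy as np

def min_max_normalize(x, low=0.0, high=1.0):
    """Rescale all elements of x linearly into the interval [low, high]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:                      # constant input: avoid division by zero
        return np.full_like(x, low)
    return (x - x_min) / (x_max - x_min) * (high - low) + low

scaled = min_max_normalize(np.array([2.0, 4.0, 6.0]), low=-1.0, high=1.0)
```

Here the smallest element maps to the lower bound and the largest to the upper bound, so the three values become -1, 0, and 1.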
3.4. Test Model
Intelligent diagnosis methods based on deep learning theory mostly rely on a large amount of training data to achieve classification and recognition. The premise is that the performance of connection weights must be evaluated and modified on the basis of data with accurate label. In this paper, a method based on parameter transfer theory is proposed. The existing labeled data is input into the supervised model for training, and the connection weights of each module are obtained. Afterward, interlayer connection matrices are input into the test model to recognize and classify the unlabeled data.
3.5. Fault Diagnosis Method Based on M_ELMConvNet
This paper presents a fast feature learning method based on two-dimensional CNN and ELM, and the model frame is shown in Figure 5.

Figure 6 shows the flowchart of the present method, and the procedures are as follows:

Firstly, samples were collected and processed with 2D data transformation and enhancement.
(1) The acceleration signals of bearings in four different working conditions were collected. About 85% of the data were labeled and used as training samples. The rest of the data is unlabeled and is regarded as the test sample.
(2) The original data is transformed into a 2D image by wavelet cyclic spectrum analysis.
(3) Subsequently, the obtained image is partitioned to enhance the data according to equation (3).
Secondly, an efficient and accurate neural network classification model is constructed.
(1) The entire model framework consists of two parts. One is the training process for labeled data, and the other is the predictive classification process for unlabeled data.
(2) The labeled data is input into the supervised training model. The first step of this model is to reduce the dimension of the data by randomizing the connection matrix. The specific implementation is based on the ELM training principle, as shown in equation (4). Then, the dimension-reduced data is passed to the pooling layer and standardized. Finally, the data is input to ELM for supervised testing. The error rate between the test output and the label data is used as the loss function to adjust the random parameterized dimensionality reduction and the number of ELM hidden nodes. Subsequently, the optimal node number is assigned to the model to compute the connection matrix between each layer.
(3) In the prediction process of the test sample set, based on the idea of parameter transfer, the connection matrix and layer connection weight obtained by the supervised process are input into the test model of the corresponding prediction sample as the preset model parameter values, and the final prediction results are obtained.
Thirdly, the whole training process is applied to the recognition and classification of bearing status data.
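The training and parameter transfer steps above can be summarized in a compact sketch. It is a simplified, hedged illustration rather than the M_ELMConvNet implementation: the inputs are already-flattened feature vectors, pooling is applied along the feature axis, the tanh activation and all sizes are assumptions, and the scaling bounds learned on the training data are reused at test time.

```python
import numpy as np

def random_layer(X, W, b):
    """ELM-style layer with fixed random weights (no backpropagation)."""
    return np.tanh(X @ W + b)

def average_pool(F, size):
    """Average non-overlapping groups of `size` neighbouring features."""
    n, d = F.shape
    d = (d // size) * size
    return F[:, :d].reshape(n, d // size, size).mean(axis=2)

def features(X, p):
    """Random projection -> average pooling -> min-max scaling with stored bounds."""
    F = average_pool(random_layer(X, p["W1"], p["b1"]), p["pool"])
    return (F - p["f_min"]) / (p["f_max"] - p["f_min"])

def train(X, T, n_proj, pool, n_hidden, rng):
    """Supervised stage: returns every parameter needed by the test model."""
    p = {"W1": rng.normal(size=(X.shape[1], n_proj)),
         "b1": rng.normal(size=n_proj), "pool": pool}
    F = average_pool(random_layer(X, p["W1"], p["b1"]), pool)
    p["f_min"], p["f_max"] = F.min(), F.max()
    p["W2"] = rng.normal(size=(F.shape[1], n_hidden))
    p["b2"] = rng.normal(size=n_hidden)
    H = random_layer(features(X, p), p["W2"], p["b2"])
    p["beta"] = np.linalg.pinv(H) @ T          # output weights from labeled data
    return p

def predict(X, p):
    """Test stage: reuse the transferred parameters on unlabeled data."""
    H = random_layer(features(X, p), p["W2"], p["b2"])
    return (H @ p["beta"]).argmax(axis=1)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)                              # two synthetic classes
X_train = rng.normal(size=(40, 16)) + 3.0 * labels[:, None]
T_train = np.eye(2)[labels]
params = train(X_train, T_train, n_proj=32, pool=2, n_hidden=60, rng=rng)
X_test = rng.normal(size=(10, 16)) + 3.0 * np.repeat([0, 1], 5)[:, None]
y_pred = predict(X_test, params)                            # labels for unlabeled data
```

The dictionary `params` plays the role of the transferred connection matrices: nothing is refitted when `predict` is called on new data.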
4. Experimental Validation
The data for verifying the effectiveness of the proposed algorithm were obtained from the comprehensive test bench for power transmission fault diagnosis. The sample data analyzed were measured under different conditions at different times.
4.1. Experiment Setup and Data Description
The power transmission system of the testbed consists of a planetary gearbox, a parallel shaft gearbox supported by rolling bearings or sleeve bearings, bearing load, and programmable magnetic brake, as shown in Figure 7. The testbed includes all the necessary powertrain configurations for studying gearbox dynamics and noise characteristics, health monitoring techniques based on vibration signal analysis, lubrication conditions, and wear particle analysis. The testbed has stable performance and can withstand strong load impact. There is enough space for the replacement and installation of gears and the installation of a monitoring device. Planetary gear systems, sun gears, planetary gears and gear rings, brackets, and bearings are easy to be disassembled.

The vibration signals under four statuses were collected for analysis: no damage, first-stage planetary bearing outer ring failure, first-stage planetary bearing inner ring failure, and first-stage planetary bearing ball failure. In the experiment, the sampling frequency was 15360 Hz and the motor speed was 2100 r/min. Multiple acceleration signals under different working conditions were collected. The data collected under each health condition were separated into 470 equal parts. For each status, 400 pieces of data were randomly selected as the training data set and the remaining 70 pieces were used as the test data set. A detailed partition of the sample set for status data analysis is shown in Table 1. The time waveforms for 2.5 s and the spectrum with bandwidth of the vibration signal for four statuses are shown in Figure 8. It can be seen from Figure 8 that the time-domain waveforms and spectra of the four status signals are inevitably affected by external interference, which is one of the factors that reduce the recognition ability of the model.

Figure 8: Time-domain waveforms and spectra of the vibration signals for the four statuses, panels (a) to (h).
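The segmentation and random split of one status signal can be sketched as follows. The segment length of 1024 points and the synthetic signal are hypothetical, since the length of each part is not given; only the counts 470, 400, and 70 come from the text.

```python
import numpy as np

def segment_and_split(signal, n_segments, n_train, rng):
    """Cut a 1D signal into equal segments and split them randomly into train/test."""
    seg_len = len(signal) // n_segments
    segments = signal[:seg_len * n_segments].reshape(n_segments, seg_len)
    order = rng.permutation(n_segments)            # random, non-repeating order
    return segments[order[:n_train]], segments[order[n_train:]]

rng = np.random.default_rng(0)
signal = rng.normal(size=470 * 1024)               # stand-in for one status recording
train_set, test_set = segment_and_split(signal, n_segments=470, n_train=400, rng=rng)
```

With these counts the training share is 400/470, or about 85%, which matches the proportion quoted in Section 3.5.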
4.2. Result Analysis
The purpose of this paper is to propose a fast and effective intelligent classification method for weak fault data of planetary bearings. In order to further verify the effectiveness of the proposed M_ELMConvNet neural network model, the experimental data were analyzed by the algorithm described in Section 3.5. For comparison, three other models including ELM, BP-based CNN, and Deep Autoencoder (DAE) were also applied to status identification of experimental data.
The average recognition accuracy of the algorithm under different numbers of hidden layer nodes was calculated by executing the model 20 times. The results are shown in the box plot in Figure 9. As shown in Figure 9, the prediction accuracy of most test results was above 98%, which further verifies the validity of M_ELMConvNet in planetary bearing status recognition. As the number of hidden layer nodes increases, the recognition accuracy fluctuates slightly. When the number of hidden layer nodes is set to 290, the average prediction accuracy is relatively high and the stability is strong. Meanwhile, the model training time shows an exponential growth trend, as shown in Figure 10. In the subsequent analysis, the number of hidden layer nodes was set to 290 to balance training time and prediction accuracy.


Figure 11 shows the confusion matrix of the multistatus classification and recognition accuracy obtained with the proposed method. As can be seen from Figure 11, the highest prediction accuracy is 100%, for status 4. The lowest recognition accuracy occurs for status 3 because there is no obvious distinction between status 2 and status 3. In general, the proposed method achieves high recognition accuracy for each status.

4.3. Comparative Verification
The diagnosis performances obtained with the original data and multiparameters, and those of DAE and CNN with the original data and with the wavelet cyclic spectrum, were also compared with that of the M_ELMConvNet proposed in this paper. Each model was executed multiple times, and the average result was calculated. Figure 12 shows that the M_ELMConvNet achieved the highest average prediction accuracy of 99.24%. This indicates that the proposed algorithm has strong noise suppression capability in the identification and classification of weak fault statuses.

In order to further verify the superiority of the proposed algorithm in computing time, the CNN model based on EBP was used as a comparison model to analyze the same sample data set as M_ELMConvNet. The results are shown in Figures 10 and 13. As can be seen from the figures, the calculation time increases linearly with the number of hidden units and the number of iterations. At the same time, the recognition accuracy gradually improves and tends to be stable. Considering the influence of the number of hidden units and the number of iterations on the recognition accuracy, the calculation times for 410 hidden units and for 28 iterations were compared. The former, for M_ELMConvNet, was 0.43 s, and the latter, for the EBP-based CNN, was about 2000 s. The difference is more than three orders of magnitude, which demonstrates the superiority of the M_ELMConvNet algorithm in computational efficiency.

5. Conclusions and Further Works
In this paper, a new deep feature extraction and diagnosis method was proposed to improve the recognition accuracy and reduce the computational complexity for weak failure signals of planetary bearings with large data volume. In M_ELMConvNet, ELM was embedded in the CNN model in place of the convolution operation to avoid a repeated EBP process. After the two stages of ELM feature dimensionality reduction and extraction, the amount of calculation was reduced and the prediction accuracy was improved. In addition, based on parameter transfer theory, the model parameters extracted from the labeled training samples were introduced into the model for the unlabeled samples to achieve prediction. The effectiveness and superiority of the method were demonstrated on data from the experimental test bench. Moreover, the analysis results show that the proposed model has advantages in recognition accuracy and operation speed compared with other methods.
The present work is mainly carried out in the case of sufficient sample data. However, in the actual operation of planetary bearings, the sample size of analysis data is unbalanced; that is, the trouble-free sample size is large, while the failure sample size is small. For failure data, such as in the early stage or in the case of large external noise, manual marking often leads to missed diagnosis or misdiagnosis. How to realize the self-supervised learning of unlabeled data and make it able to automatically extract data features and perform labeling is the work to be done in the future.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.