#### Abstract

Deep learning-based fault diagnosis of rolling bearings is a hot research topic, and a rapid and accurate diagnosis is important. In this paper, aiming at the vibration image samples of rolling bearing affected by strong noise, the convolutional neural network- (CNN-) and transfer learning- (TL-) based fault diagnosis method is proposed. Firstly, four kinds of vibration image generation method with different characteristics are put forward, and the corresponding pure vibration image samples are obtained according to the original data. Secondly, using CNN as the adaptive feature extraction and recognition model, the influences of main sensitive parameters of CNN on the network recognition effect are studied, such as learning rate, optimizer, and L1 regularization, and the best model is determined. In order to obtain the pretraining parameters, the training and fault classification test for different image samples are carried out, respectively. Thirdly, the Gaussian white noise with different levels is added to the original signals, and four kinds of noised vibration image samples are obtained. The previous pretrained model parameters are shared for the TL. Each kind of sample research compares the impact of thirteen data sharing schemes on the TL accuracy and efficiency, and finally, the test accuracy and time index are introduced to evaluate the model. The results show that, among the four kinds of image generation method, the classification performance of data obtained by empirical mode decomposition-pseudo-Wigner–Ville distribution (EP) is the best; when the signal to noise ratio (SNR) is 10 dB, the model test accuracy obtained by TL is 96.67% and the training time is 170.46 s.

#### 1. Introduction

The safe operation of mechanical equipment is an important guarantee for the modern industrial production. As an indispensable part of intelligent equipment, the fault diagnosis technology of rolling bearings has attracted great attention. Rolling bearing is generally composed of inner ring, outer ring, rolling body, and cage. After a long time of operation, various faults are easy to occur. Therefore, the rapid and accurate fault identification of rolling bearing is a great challenge and of great significance [1, 2].

The traditional fault diagnosis method of rolling bearings is mainly to obtain the characteristic information through the processing of vibration signals. Different analysis algorithms are selected for different types of vibration signal. Common analysis algorithms, such as wavelet transform and empirical mode decomposition (EMD), still play important roles in the fault diagnosis field and are constantly being improved [3, 4]. The multialgorithm fusion analysis can synthesize the integrated advantages and improve the analysis effect. Jiang et al. [5] used multiwavelet packet as the prefilter to refine the vibration signal combining with ensemble EMD (EEMD), which can implement the effective extraction of multifault features. Guo and Deng [6] used particle swarm optimization (PSO) algorithm to screen the optimal intrinsic modal function (IMF), and a multiobjective optimized EMD method was proposed to overcome the modal aliasing of EMD and EEMD. The authors in [7–10] adopted the comprehensive analysis method of EMD and pseudo-Wigner–Ville distribution (PWVD), which not only has the time-frequency focusing but also avoids the problem of cross-interference terms in multicomponent signal processing. The frequency band energy was divided according to time-frequency image, which was used as the characteristic index for the unsupervised clustering calculation and achieved a good classification effect [11].

With the rapid development of intelligent algorithms such as deep learning and cross-integration with various disciplines, the intelligent diagnostic methods are extensively used in mechanical fault diagnosis [2, 12]. Compared with the traditional fault diagnosis method, the intelligent diagnostic method has no strict prior knowledge requirements and can avoid the dependence on the artificial feature extraction [13, 14]. Shao et al. [15] used the dual-tree complex wavelet packet (DTCWPT) as the signal preprocessing, and a new adaptive deep belief network (DBN) with DTCWPT was proposed, which can eliminate the necessity of artificial feature selection and reliably identify different bearing faults. As a common type of network, the convolutional neural network (CNN) generally uses images as input samples. When facing the fault diagnosis problem of rotating machinery, choosing an appropriate image generation method is significant. Wang et al. [16] took the time-frequency image gained by wavelet analysis for the CNN training, which can achieve a good effect of fault identification. Xiao et al. [17] compared the effects of grayscale map samples with different sizes and different optimizers on the accuracy and robustness of training model. The performance of CNN model is often related to its own parameter setting, and Chen et al. [18] changed the extreme learning machine (ELM) as a strong classifier to improve the classification performance. In general, obtaining a group of highly qualified samples in the actual industrial environment is affected by various factors, and too few samples and the imbalance among different data will affect the training effect of the model. A framework based on the auxiliary classifier generative adversarial network (ACGAN) is proposed [19], which has the ability to generate the artificial raw data of mechanical signals. 
However, in the face of data affected by the strong background noise under the complex working conditions, the process of using generative adversarial network (GAN) to generate new samples is complicated. Transfer learning (TL) provides another solution for small sample cases.

The data corresponding to different working conditions follow different distributions. It has been found that the classification accuracy of a CNN is low when the training data set and the test data set are distributed differently [20, 21]. Qiu et al. [22] performed unsupervised and semisupervised dimensionality reduction on the source and target domain data, respectively, so that the two domains had relatively similar distributions, and then performed TL. Wang et al. [23] shortened the conditional distribution distance of vibration data acquired under different working loads for intraclass adaptation. Fan et al. [24] used the designed image samples for pretraining and transferred the CNN model parameters to the tested samples; after fine-tuning the parameters, a good classification effect was achieved.

In order to address sample effectiveness and TL performance under strong noise, a vibration image-driven CNN- and TL-based model is proposed in this paper. Section 2 outlines the research framework. In Section 3, four different types of image generation method are proposed and optimized. In Section 4, CNN is used for adaptive feature extraction, three sensitive parameters in the model are studied, and the best model is determined. In Section 5, thirteen different TL schemes are designed for the noised samples, and finally, the recognition accuracy and efficiency for the four types of image samples are evaluated.

#### 2. Research Framework

##### 2.1. Overall Scheme

The research process of this work is shown in Figure 1 and is divided into three parts. Part 1 is the preprocessing of samples. Four image generation methods are proposed to convert the original vibration signal into images: intrinsic mode functions arrangement (IMFA), empirical mode decomposition-pseudo-Wigner–Ville distribution (EP), symmetrical polar coordinates image (SPCI), and grayscale texture map (GTM). The resulting images are used as the training samples of the CNN model. Part 2 is the model training process, in which the CNN model is built and its parameters are then shared for TL on the noised vibration image samples. Part 3 is the test and evaluation of the recognition model.

##### 2.2. Convolutional Neural Network

As a feedforward neural network, the CNN model is composed of convolution layers, pooling layers, and fully connected layers, in which the convolution layer convolves the image to obtain the feature map. The convolution formula is as follows [24]:

$$z^{l}_{n,m} = \sum_{i=1}^{k}\sum_{j=1}^{k} w^{l}_{i,j}\, x^{l-1}_{n+i-1,\,m+j-1} + b^{l}, \tag{1}$$

where *k* is the size of the convolution kernel; *i* and *j* index the *i*-th row and *j*-th column of the convolution kernel, respectively; $w^{l}$ is the weight matrix of the convolution kernel in layer *l*; $b^{l}$ is the offset of layer *l*; $x^{l-1}$ is the feature map matrix of layer *l*−1; *n* and *m* are its two-dimensional indices; and $z^{l}$ is the output feature map matrix of layer *l*.

After the convolution operation, the result is passed to the activation function, which increases the nonlinearity of the model so that it can be applied to complex classification problems. The activation function formula is as follows:

$$x^{l} = f\left(z^{l}\right), \tag{2}$$

where $f(\cdot)$ is the activation function, $z^{l}$ is the convolution output of layer *l*, and $x^{l}$ is the activated output feature map matrix of layer *l*. The common activation functions are Sigmoid, Tanh, and ReLU.

The pooling layer reduces the dimension of the input parameters, and the commonly used pooling operations are maximum pooling and average pooling. The fully connected layer in the convolution network maps the distributed features to the sample space. In the whole network structure, the fully connected layer generally acts as a classifier, for example, through the Softmax function, and the calculated result is the output of the whole CNN model, as shown in the following equations:

$$y_{i} = \mathbf{w}_{i}^{\top}\mathbf{x} + b_{i}, \tag{3}$$

$$p_{i} = \frac{e^{y_{i}}}{\sum_{u=1}^{L} e^{y_{u}}}, \tag{4}$$

where $y_{i}$ is the feature value of the fully connected layer for category *i*, $\mathbf{x}$ is the feature matrix before the fully connected layer, $\mathbf{w}_{i}$ and $b_{i}$ are the weights and offset of the fully connected layer, $p_{i}$ is the probability prediction of the sample for category *i*, *L* is the total number of sample categories, and *u* indexes the sample categories in the denominator in order to distinguish it from the *i* in the numerator.

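In the process of model training, the error between the predicted model output and the actual target is calculated by the loss function, and the commonly used loss function is the cross-entropy function, as follows [24]:

$$C = -\frac{1}{M}\sum_{s=1}^{M}\sum_{i=1}^{L} y_{s,i}\,\ln \hat{y}_{s,i}, \tag{5}$$

where $y_{s,i}$ and $\hat{y}_{s,i}$ are the true target value and the Softmax output value of sample *s* for category *i*, respectively, *L* is the total number of sample categories, and *M* is the total number of samples.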

The *C* value in equation (5) indicates how close the predicted value is to the target value: the smaller the value, the closer the prediction is to the target; the larger the value, the further the prediction deviates from it. The *C* value obtained in each forward propagation is passed to the optimizer, which trains the model by reducing *C*. Commonly used optimizers include gradient descent and momentum optimization.

##### 2.3. Transfer Learning Method

The TL is an effective method to solve small sample problem, which takes the model developed for task A as the starting point and reuses it in the process of developing the model for task B. In a practical research, TL is mainly divided into three kinds: the case-based, feature-based, and shared parameter-based transfer [25, 26]. With the development of deep learning, the combination of feature learning and TL is more and more applied. In order to solve the problem of low training accuracy and poor classification effect for low quality data affected by background noise, in this paper, a method of TL combined with CNN is used to classify rolling bearing faults. The original data collected are used as the pure training samples for pretraining, and then, the model parameters will be restored and shared to train the model after pretraining. A part of the convolution layer is frozen so that the parameters cannot be updated in the process of backpropagation. Only the designated training layer is retained to participate in the training process of noised samples to complete the parameter updating.

#### 3. Research Basis

##### 3.1. CNN Model Construction

This paper takes the LeNet-5 network as the basic model. The model has a shallow network layer and requires few training parameters, so it has fast training speed and good classification effect. For the low complexity of vibration image samples in this paper, it is appropriate to choose this network. The network structure is shown in Figure 2. The total number of layers is 7, including 3 convolution layers, 2 pooling layers, and 2 fully connected layers. Under the condition that the image sample is not affected by its color, the use of single-channel image can not only reduce the input parameters and the memory usage but also improve the operation efficiency. Therefore, the sample used in this paper is a single-channel image with a size of , and the output is 4 nodes, which represent 4 bearing state categories. In the training process, the hyperparameters in the model are fine-tuned.

The initial CNN model parameters are shown in Table 1.

##### 3.2. Vibration Image Sample Preparation

The original vibration signals do not have the advantages of images, such as great differences between samples and more intuitive observation. Therefore, this paper carries out four image generation methods for vibration signals and finds which of 4 types of image has better classification performance. The samples used in this paper are the vibration data of rolling bearing from the Case Western Reserve University [27], and the test platform is shown in Figure 3.

The selected data are those with 1797 r/min, fault degree of 0.007 inch, sampling frequency of 12 kHz, and 4 kinds of working state including normal state, inner ring fault, outer ring fault, and rolling body fault. According to the rotational speed and sampling frequency, it is calculated that 400 sampling points are obtained for each rotation, so this paper converts 400 sampling points as a group of data for image conversion.
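The segment length can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and cutting the signal into consecutive, non-overlapping segments is an assumption, since the text does not state whether segments overlap.

```python
# Operating condition quoted in the text.
SPEED_RPM = 1797        # shaft speed, r/min
SAMPLING_HZ = 12_000    # sampling frequency, Hz
SEGMENT_LEN = 400       # points per image, as chosen in the paper


def samples_per_revolution(speed_rpm: float, sampling_hz: float) -> float:
    """Number of sampling points recorded during one shaft revolution."""
    revolutions_per_second = speed_rpm / 60.0
    return sampling_hz / revolutions_per_second


def split_into_segments(signal: list, length: int = SEGMENT_LEN) -> list:
    """Cut a signal into consecutive, non-overlapping segments (assumption);
    an incomplete tail is dropped."""
    n_segments = len(signal) // length
    return [signal[k * length:(k + 1) * length] for k in range(n_segments)]


points = samples_per_revolution(SPEED_RPM, SAMPLING_HZ)
print(f"{points:.2f} points per revolution")  # about 400.67

dummy_signal = [0.0] * 1250                   # placeholder, not bearing data
segments = split_into_segments(dummy_signal)
print(len(segments), "segments of", len(segments[0]), "points")
```

The exact value is slightly above 400, so a 400-point segment covers just under one full revolution.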

###### 3.2.1. IMFA Image Generation Method

*(1). EMD*. EMD [28] can decompose the signal into the sum of a series of IMFs with different time scales, and each IMF is a single component signal.

For any signal $x(t)$, all the extreme points are first determined, and then the upper envelope and the lower envelope are obtained by third-order (cubic) spline interpolation. If $m_{1}(t)$ is the average value of the upper and lower envelopes and $h_{1}(t)$ is the difference between $x(t)$ and $m_{1}(t)$, then

$$h_{1}(t) = x(t) - m_{1}(t). \tag{6}$$

Treat $h_{1}(t)$ as a new $x(t)$, and repeat the above operations until $h_{1}(t)$ meets certain criteria; let $c_{1}(t) = h_{1}(t)$, so that $c_{1}(t)$ is an IMF, then

$$r_{1}(t) = x(t) - c_{1}(t). \tag{7}$$

Treat $r_{1}(t)$ as a new $x(t)$, repeat the above process, and get the second IMF, the third IMF, ..., and the *n*-th IMF. When $c_{n}(t)$ or $r_{n}(t)$ satisfies the given stopping criteria, the sifting process is terminated and the decomposition is obtained as follows:

$$x(t) = \sum_{i=1}^{n} c_{i}(t) + r_{n}(t), \tag{8}$$

where $r_{n}(t)$ is a residual function, which represents the average trend of the signal.

*(2). Image generation*. The vibration signal is intercepted according to 400 sampling points in each segment, and the signal is decomposed by EMD in turn, in which the first six IMFs contain the main components of the signal, so the first six groups of IMF are extracted and arranged in the form of six curves in order, as shown in Figure 4.

In the process of image generation, in order to show the fluctuation state of each IMF, the method of adaptively adjusting the vertical coordinate is adopted; i.e., the amplitude *A*_{i} with the largest absolute value in the *i-*th signal segment is obtained, and then, the vertical coordinate range is set as −*A*_{i} ∼ *A*_{i}. The advantage of this process is that it can show the complete shape of the IMF component and make the visual contrast between the IMF_{1} ∼ IMF_{6} and strengthen the sample feature difference.

###### 3.2.2. EP Time-Frequency Analysis Method

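*(1). EP principle*. EP uses EMD to decompose the multicomponent signal into a finite number of IMFs. For each IMF, the pseudo-Wigner–Ville distribution (PWVD) is computed as follows:

$$\mathrm{PWVD}_{x}(t, f) = \int_{-\infty}^{\infty} h(\tau)\, x\!\left(t + \frac{\tau}{2}\right) x^{*}\!\left(t - \frac{\tau}{2}\right) e^{-j 2\pi f \tau}\, d\tau, \tag{9}$$

where $t$ is the time-domain variable, $f$ is the frequency-domain variable, $h(\tau)$ is the window function, and $x^{*}$ denotes the complex conjugate of the signal $x$.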

Then, the Wigner time-frequency distribution of the original signal is obtained by superposing the PWVD results of different IMFs. The combination of EMD and PWVD not only effectively eliminates the cross interference but also retains the excellent time-frequency focusing [10].

*(2). Image generation*. The steps for generating the time-frequency distribution map are as follows:

(a) Vibration signal segmentation: 400 sampling points are processed at a time.

(b) EMD decomposition: the above signals are decomposed by EMD, and the IMF components and residual components from high frequency to low frequency are obtained in turn. Only the first six IMFs are processed in the next step.

(c) EP time-frequency analysis: the first six IMFs are analyzed by PWVD, and then, the EP time-frequency distribution is obtained.

(d) Grayscale processing: in order to reduce the input parameters and improve the training efficiency, the generated samples are converted to grayscale.

According to the above steps and Figure 5, we can clearly see that, for the bearing data with different fault states, the energy distribution is different.

###### 3.2.3. SPCI Method

The above two methods obtain the images from vibration signal processing, but the SPCI does not belong to this scope, and it represents the form of a mirror symmetrical image in polar coordinates and directly converts the sampled signal into an image. The graphic display is intuitive and has a strong ability to express features.

*(1) SPCI principle*. The schematic diagram of SPCI is shown in Figure 6; is the polar radius, and and are the rotation angles along the initial line in the counterclockwise direction and clockwise direction, respectively. The principle is that, in the discrete sampling data series, the vibration parameters at *i* time are , and the vibration parameters at time are ; substituting and into equations (10)–(12), it can transform vibration data into a point in the polar coordinate space. By changing the rotation angle of the initial line, a mirror symmetrical image can be formed [29]:

In the above formula, is the maximum value of vibration data, is the minimum value, is the time interval, is the initial line angle, and is the angle magnification factor. It is often taken as , , and .

*(2) Image generation*. In this paper, the image with high resolution is obtained by adjusting the parameters *a* and *b*. After trying, we take *a* = 2 and *b* = 30°, and the SPCI samples generated are shown in Figure 7. It can be found that the SPCI samples of different data are different.

###### 3.2.4. Improved GTM Method

In addition to the above three vibration images, the GTM is introduced in this paper. Before obtaining a GTM, it is necessary to convert the vibration signal into a grayscale image. The grayscale image is a data matrix, in which the subscript of each element corresponds to its position in the image, i.e., row and column coordinates, and the element value represents the luminance value of the corresponding position. The generation of a grayscale image is in effect a data mapping: the maximum value “max” in the feature matrix is mapped to gray level 255, and the minimum value “min” is mapped to gray level 0, as shown in Figure 8 [30]. The relationship between the feature matrix and the grayscale value in the image is as follows:

$$G(i,j) = 255 \times \frac{x(i,j) - \mathrm{min}}{\mathrm{max} - \mathrm{min}}. \tag{13}$$

In the above equation, $x(i,j)$ is the vibration signal data at row *i* and column *j* of the feature matrix, and $G(i,j)$ is the gray value corresponding to $x(i,j)$.

The data arrangement in the traditional grayscale image is a sequential arrangement of vibration data, which consumes a large number of data points per image. If the image resolution is sacrificed and only a small number of sample points are used for each grayscale image, the characteristics of the original signal cannot be highlighted. Therefore, this paper converts vibration data into grayscale images according to a horizontal-vertical cross arrangement; i.e., the *i*-th vibration signal segment with 400 points is longitudinally copied into 400 rows, giving the 400 × 400 matrix *A*. The matrix *B* is obtained by transposing *A*. The matrix *C* is obtained by averaging *A* and *B*, and the new grayscale image is obtained by inputting the elements of *C* into equation (13). The grayscale image processed in this way has two advantages:

(1) Each 400 × 400 grayscale image needs only 400 signal points, so the sample data are updated frequently. By contrast, the traditional grayscale image needs 160000 signal points to form an image of the same size.

(2) The horizontal-vertical cross arrangement forms a grid structure in the grayscale image, which enhances the image texture feature.
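The horizontal-vertical cross arrangement can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name is hypothetical, the input is random data standing in for a vibration segment, and rounding to 8-bit integers is an assumption about how the linear min-to-0, max-to-255 mapping is quantized.

```python
import numpy as np


def cross_arranged_grayscale(segment: np.ndarray) -> np.ndarray:
    """Map one 1-D segment of N points to an N x N 8-bit grayscale image."""
    x = np.asarray(segment, dtype=float)
    a = np.tile(x, (x.size, 1))   # matrix A: the segment repeated on every row
    b = a.T                       # matrix B: transpose of A
    c = (a + b) / 2.0             # matrix C: element-wise average of A and B
    c_min, c_max = c.min(), c.max()
    if c_max == c_min:            # constant segment: avoid division by zero
        return np.zeros(c.shape, dtype=np.uint8)
    # Linear mapping: min -> gray level 0, max -> gray level 255.
    gray = 255.0 * (c - c_min) / (c_max - c_min)
    return np.round(gray).astype(np.uint8)   # rounding is an assumption


rng = np.random.default_rng(0)
segment = rng.standard_normal(400)   # stand-in for 400 vibration samples
image = cross_arranged_grayscale(segment)
print(image.shape, image.dtype, image.min(), image.max())
```

Because *C* is the average of a matrix and its transpose, the resulting image is symmetric about its main diagonal, which is what produces the grid-like texture.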

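After the above conversion, texture features are extracted from the grayscale image by the local binary pattern (LBP) [31], and the algorithm is shown as

$$\mathrm{LBP}(x_{c}, y_{c}) = \sum_{p=0}^{P-1} 2^{p}\, s\left(i_{p} - i_{c}\right), \tag{14}$$

where $(x_{c}, y_{c})$ represents the neighborhood center element, its pixel value is $i_{c}$, $i_{p}$ represents the other pixel values in the neighborhood, *p* indexes the *P* pixels surrounding the neighborhood center, and $s(x) = 1$ for $x \ge 0$ and $s(x) = 0$ otherwise.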

LBP is used to show the relationship between the pixel value of a certain point in the grayscale image and its surrounding pixel value. The image processed by LBP shows the texture information. Figure 9 is an example from grayscale image to GTM.
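As an illustration of how LBP encodes the relationship between a pixel and its surroundings, the sketch below computes the basic 3 × 3 LBP code for every interior pixel. The function name, the choice of 8 neighbours, and the clockwise ordering starting at the top-left pixel are assumptions made for the example; the text does not specify the neighbourhood used.

```python
import numpy as np

# Offsets of the 8 neighbours, clockwise from the top-left pixel.
# The starting point and direction are a convention assumed here.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_3x3(gray: np.ndarray) -> np.ndarray:
    """Basic 3x3 LBP code for every interior pixel of a 2-D grayscale array."""
    g = np.asarray(gray, dtype=int)
    rows, cols = g.shape
    centre = g[1:-1, 1:-1]
    codes = np.zeros_like(centre)
    for p, (dr, dc) in enumerate(NEIGHBOURS):
        neighbour = g[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
        # s(i_p - i_c) = 1 when the neighbour is at least as bright as the centre.
        codes += (neighbour >= centre).astype(int) << p
    return codes.astype(np.uint8)


patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
print(lbp_3x3(patch))  # [[241]]
```

For this patch, the neighbours at bit positions 0, 4, 5, 6, and 7 are at least as bright as the centre value 6, so the code is 1 + 16 + 32 + 64 + 128 = 241.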

The four types of image generation method used in this paper focus on showing the different features between vibration data. Four kinds of image sample are obtained; each kind covers the four bearing states and contains 1200 samples. The samples are divided into the training set, validation set, and test set in the ratio 6 : 2 : 2, in which the test set does not participate in the training process and is used only to evaluate the final model.

#### 4. Research on the Optimal CNN Model

The task of deep learning is divided into two stages, training and test, in which training is the process of optimizing the model, such as improving the classification accuracy and strengthening the generalization ability. After many experiments, it is found that the model performance is highly sensitive to the initial learning rate, training optimizer, and regularization parameters, so this paper mainly focuses on these items. As the research methods for the four types of image sample are similar, the following shows only the research process for EP; the results for the other three types are given at the end.

##### 4.1. Determination of Pretraining Model Parameters

###### 4.1.1. Selection of Initial Learning Rate

Learning rate is an important parameter of CNN training. With a fixed learning rate, a small value can yield high training accuracy but slows convergence, whereas a large value has the opposite effect. To avoid a fixed learning rate, this paper uses a decaying learning rate, which decreases as the number of training steps increases, in order to find an equilibrium between training speed and accuracy. The formula is as follows:

$$\mathrm{DLR} = \mathrm{lr} \times \mathrm{dr}^{\,\mathrm{gs}/\mathrm{ds}}, \tag{15}$$

where dr is the decay index, lr is the initial learning rate, gs is the current number of iterative rounds, ds is the measure of decay in each iteration, which can be called decay speed, and DLR is the learning rate after decay.
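A common schedule matching these symbols is exponential decay, DLR = lr · dr^(gs/ds). The sketch below assumes that form; the function name is hypothetical, and the values of dr and ds are illustrative only, since they are not reported in this section.

```python
def decayed_learning_rate(lr: float, dr: float, gs: int, ds: int) -> float:
    """Exponential decay (assumed form): DLR = lr * dr ** (gs / ds)."""
    return lr * dr ** (gs / ds)


# Illustrative values only; dr and ds are not given in the text.
lr, dr, ds = 1e-4, 0.96, 100
for gs in (0, 100, 500, 1000):
    print(gs, decayed_learning_rate(lr, dr, gs, ds))
```

With these settings the learning rate starts at lr and is multiplied by dr once every ds steps, so it shrinks smoothly as training proceeds.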

In order to obtain the best training effect, this paper takes the learning rate as 0.001, 0.0001, and 0.00001, respectively. The accuracy and loss curves in the training process are shown in Figure 10. The models trained under the above three parameter settings are tested, and the test set includes 60 bearing samples in each of the 4 states. The test results are shown in Table 2.

In Figure 10(a), when the initial learning rate is 0.0001, the model converges quickly and remains stable; when it is 0.00001, the model cannot converge smoothly and the training accuracy is low; and the early convergence at 0.001 is slower than that at 0.0001. In Figure 10(b), the loss at 0.00001 is unstable, while the losses in the other two cases remain stable and low, and 0.0001 is the best choice. In Table 2, when the learning rate is 0.001, 0.0001, and 0.00001, the test results are 98.75%, 98.33%, and 95.42%, respectively. After comparison, 0.0001 is selected as the initial learning rate. When the learning rate is 0.0001, the confusion matrix of the test results is shown in Table 3.

###### 4.1.2. Optimizer Determination

In deep learning, there are many optimization methods to find the optimal solution of the model. In this paper, three kinds of optimization algorithms are used to select the most suitable optimizer.

*(1). Gradient descent algorithm*. It is often used to approximate the minimum deviation model: the negative gradient direction is used as the search direction, and the minimum value is sought along it. The most frequently used gradient descent algorithm is stochastic gradient descent (SGD) [32], and the SGD formula is as follows:

$$g_{t} = \nabla_{\theta} J\left(\theta_{t};\, x^{(i)}, y^{(i)}\right), \qquad \theta_{t+1} = \theta_{t} - \eta_{t}\, g_{t},$$

where $J$ is the cost function; $\theta_{t}$ is the model parameter at time *t*; $x^{(i)}$ and $y^{(i)}$ are the input and output of each sample, respectively; $\nabla_{\theta} J$ is the gradient of the cost function at time *t*; $g_{t}$ represents a randomly selected gradient direction; and $\eta_{t}$ is the learning rate at time *t*.

*(2). Momentum optimization algorithm*. Because the SGD optimizer updates the variables frequently, it causes severe oscillation of the loss function and is easily trapped in local extrema and saddle points. Therefore, momentum optimization is proposed on the basis of the gradient descent method. The commonly used momentum optimizer is momentum [33], and its formulas are as follows:

$$v_{t} = \alpha\, v_{t-1} + \eta_{t}\, g_{t}, \qquad \theta_{t+1} = \theta_{t} - v_{t},$$

where $v_{t}$ represents the acceleration accumulated at time *t*, $g_{t}$ is the gradient of the cost function at time *t*, $\eta_{t}$ is the learning rate, $\theta_{t}$ is the model parameter, and $\alpha$ indicates the magnitude of the power; generally, $\alpha = 0.9$, which means the maximum speed is 10 times that of SGD.

Momentum adds the inertia in the gradient descent process, which makes the speed with the same gradient direction faster and the renewal speed of the dimension with the change in gradient direction slower so that it can speed up the convergence and reduce the oscillation.

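*(3). Adaptive learning rate optimization algorithm*. The loss in deep learning is usually highly sensitive to some directions of the parameter space. The momentum algorithm can alleviate this problem to some extent, but at the cost of introducing another hyperparameter. In addition, the traditional optimization algorithms need to set the learning rate to a constant or adjust it according to the number of training steps, which ignores other possible changes in the learning rate. Adaptive moment (Adam) estimation is an optimization algorithm with an adaptive learning rate [34], and its formulas are as follows:

$$m_{t} = \beta_{1} m_{t-1} + (1-\beta_{1})\, g_{t}, \qquad v_{t} = \beta_{2} v_{t-1} + (1-\beta_{2})\, g_{t}^{2},$$

$$\hat{m}_{t} = \frac{m_{t}}{1-\beta_{1}^{t}}, \qquad \hat{v}_{t} = \frac{v_{t}}{1-\beta_{2}^{t}},$$

$$\theta_{t+1} = \theta_{t} - \frac{\eta}{\sqrt{\hat{v}_{t}} + \epsilon}\, \hat{m}_{t},$$

where $m_{t}$, $v_{t}$ are the first-order and second-order moment estimation of the gradient $g_{t}$, respectively; $\beta_{1}$, $\beta_{2}$ are the exponential decay rates of the two estimates; $\hat{m}_{t}$, $\hat{v}_{t}$ are the corrections to $m_{t}$, $v_{t}$; $\epsilon$ is a small constant that prevents division by zero; and $\eta / (\sqrt{\hat{v}_{t}} + \epsilon)$ is the learning rate subject to dynamic constraints.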
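To make the three update rules concrete, the sketch below applies standard textbook forms of SGD, momentum, and Adam to a one-dimensional toy cost J(θ) = (θ − 3)². It is not the training code used in the paper: the cost function, the function names, the step count, and the hyperparameter values (including the usual Adam defaults β1 = 0.9, β2 = 0.999, ε = 1e-8) are assumptions chosen for illustration.

```python
import math


def grad(theta: float) -> float:
    """Gradient of the toy cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def run_sgd(theta: float, lr: float, steps: int) -> float:
    for _ in range(steps):
        theta -= lr * grad(theta)              # plain gradient step
    return theta


def run_momentum(theta: float, lr: float, steps: int, alpha: float = 0.9) -> float:
    v = 0.0
    for _ in range(steps):
        v = alpha * v + lr * grad(theta)       # accumulate velocity
        theta -= v
    return theta


def run_adam(theta: float, lr: float, steps: int,
             beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8) -> float:
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


for name, fn in [("SGD", run_sgd), ("momentum", run_momentum), ("Adam", run_adam)]:
    print(name, round(fn(0.0, 0.05, 500), 4))
```

All three reach the minimum at θ = 3 on this convex toy problem; the differences the paper reports concern convergence behaviour on the CNN loss surface, which a one-dimensional example cannot reproduce.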

In this paper, under the condition that the initial learning rate is 0.0001 and the other parameters are unchanged, the model optimizer is set as SGD, momentum, and Adam, respectively. The accuracy and loss comparison curves in the training process are shown in Figure 11. The model test results after training are shown in Table 4.

Comparing the three curves in Figure 11(a), with Adam the training accuracy fluctuates slightly below 100% as the number of iterative steps increases, with momentum the training accuracy curve is stable after 100 steps, and the SGD curve fluctuates constantly within 1000 steps. There is no significant difference among the three loss curves in Figure 11(b). In Table 4, when the optimizer is SGD, Adam, and momentum, the test results of the model are 98.33%, 98.33%, and 99.17%, respectively. After comparison, the momentum optimizer is selected for this kind of sample. The confusion matrix of the test results with momentum is shown in Table 5.

###### 4.1.3. Regularization Parameter Determination

There are two kinds of abnormal fitting in the training process: overfitting and underfitting. Overfitting means that the model fits the training samples too closely, resulting in poor performance on the validation and test data sets, while underfitting generally means that too few features are extracted from the training samples, so that the trained model cannot fit the data well and performs poorly.

In order to reduce overfitting, regularization is introduced into the training process. The main purpose of regularization is to control the complexity of the model. The basic regularization method is to add a penalty term to the original loss function to “punish” models with high complexity. In this paper, several commonly used regularization methods, such as *L*1 regularization, *L*2 regularization, and dropout, are studied in the training process, and it is found that *L*1 regularization performs the best, so *L*1 regularization is adopted. The sum of absolute values of the weight parameters is added to the training loss function, as follows:

$$C_{1} = C_{0} + \lambda \sum_{w \in W} \lvert w \rvert,$$

where *C*_{1} is the final loss value, *C*_{0} is the real loss value, *W* is the network learning parameter, and *λ* is the adjustable regularization parameter. In this paper, the effects are compared for *λ* = 0.1, 0.01, 0.001, and 0.0001 and for no regularization, and the contrast curves of accuracy and the final loss *C*_{1} in the validation process are given, as shown in Figure 12. The model test results after training are shown in Table 6.
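For reference, the standard *L*1 penalty adds λ times the sum of absolute weight values to the unregularized loss. The sketch below assumes that standard definition; the function name, the toy weights, and the base loss of 0.30 are made up for illustration and are not taken from the paper.

```python
def l1_regularized_loss(c0: float, weights: list, lam: float) -> float:
    """C1 = C0 + lambda * sum(|w|), the standard L1 penalty (assumed form)."""
    penalty = sum(abs(w) for w in weights)
    return c0 + lam * penalty


weights = [0.5, -1.2, 0.0, 2.0]   # toy weights, for illustration only
for lam in (0.1, 0.01, 0.001, 0.0001, 0.0):
    print(lam, l1_regularized_loss(0.30, weights, lam))
```

A large λ inflates the final loss *C*_{1} well above the real loss *C*_{0}, which is one reason a regularized validation loss curve can sit far above the unregularized ones even when the accuracy is similar.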

In Figure 12(a), all the validation curves under five parameters slightly fluctuate around 98% after 50 to 200 steps, in which *λ* = 0.1, 0.01, and 0.0001 are more efficient. Figure 12(b) shows the loss value in the validation process, in which the curve of *λ* = 0.1 has a downward trend, but its validation loss values keep higher within 1000 steps. The value of *λ* = 0.1 curve stabilizes at about 9 after 20 iterations. In Table 6, the test results of the models under the five regularization settings are 98.75%, 98.75%, 98.75%, 99.17%, and 99.17%, respectively. After comparison, *λ* = 0.0001 is selected. The confusion matrix of the test results with *λ* = 0.0001 is shown in Table 7. The training, validation accuracy, and loss curves are given, as shown in Figure 13.

##### 4.2. Optimal CNN Model Parameters for Four Image Generation Methods

Under the condition that other parameters remain unchanged, the other three types of image samples were trained and validated for the above three sensitive parameters, and the final parameters are shown in Table 8.

Under the best model parameters obtained above, the four types of sample were tested, respectively, and the number of tested samples in each type is 240. This paper uses the accuracy, precision, recall, and *F*1-score of the test results as the evaluation indexes. The average values of the corresponding indexes for the 4 types of samples are given in Table 9. Among them, the accuracy is the proportion of all predictions that are correct; the precision is the proportion of samples predicted as a given class that truly belong to it; the recall is the proportion of samples of a given class that are correctly predicted; and the *F*1-score is the harmonic mean of precision and recall, which balances the two.
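These four indexes can all be derived from a confusion matrix. The sketch below is a minimal illustration with a hypothetical function name; it assumes rows are true classes and columns are predicted classes, uses macro averaging over the classes, and runs on a made-up 4-class matrix with 60 samples per class rather than on the paper's results.

```python
def classification_report(cm):
    """Accuracy plus macro-averaged precision, recall, and F1 from a square
    confusion matrix indexed as cm[true_class][predicted_class]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))   # column sum
        actual_k = sum(cm[k])                           # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up matrix with 60 test samples per class (not the paper's data).
cm = [[60, 0, 0, 0],
      [0, 59, 1, 0],
      [0, 0, 60, 0],
      [0, 2, 0, 58]]
report = classification_report(cm)
print(report)
```

With balanced classes, as here, the macro-averaged recall equals the accuracy, while precision and *F*1 can differ slightly.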

From the test results, it can be found that the model trained by SPCI samples has a good classification effect, and the accuracy reaches 99.19%, which is the best among four kinds of image samples. The classification effect by EP samples is also good, with the test accuracy of 98.75%. GTM and IMFA samples have the test results of 96.67% and 93.75%, respectively.

#### 5. Parameter-Based TL

In industrial practice, the collected vibration signals are disturbed by background noise, which masks the effective information in the original signals, so identifying fault categories quickly and accurately from noised signals is important. The samples used in pretraining are similar to the noised samples, which increases the probability of successful parameter transfer. By transferring the model parameters obtained from pretraining, the slow training process can be avoided and the model efficiency can be improved. In this paper, Gaussian white noise (GWN) is added to the Case Western Reserve University bearing data to simulate actual field signals, and the noised signals are converted into images. The designated layers of the model trained on noised samples are frozen, and they share the parameters pretrained on the pure samples. The TL flowchart is shown in Figure 14.

To simulate the influence of different degrees of noise on the signal, GWN at slight, moderate, and severe levels is added. Testing shows that at an SNR of 22 dB the noise begins to mask the pure signal slightly, so GWN is added at SNRs of 22 dB, 16 dB, and 10 dB. The time-domain comparison between the pure and noised signals is shown in Figure 15, in which the blue waveform is the pure signal and the red waveform is the noised signal. The four types of image conversion were performed for the 10 dB noised signals, as shown in Figure 16.
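Adding GWN at a prescribed SNR amounts to scaling the noise variance to the measured signal power. The sketch below illustrates this with the standard definition SNR(dB) = 10·log10(P_signal / P_noise); it is not the authors' code, the function names are illustrative, and the sine wave is a hypothetical stand-in for a bearing vibration record.

```python
import math
import random


def add_gwn(signal, snr_db, seed=None):
    """Return the signal plus zero-mean Gaussian white noise at the given SNR (dB)."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10*log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR actually achieved, from the clean and noised signals."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical test signal: a 100 Hz sine sampled at 12 kHz for one second.
clean = [math.sin(2.0 * math.pi * 100.0 * t / 12000.0) for t in range(12000)]

for snr in (22, 16, 10):  # the three noise levels used in this section
    noisy = add_gwn(clean, snr, seed=0)
    print(snr, "dB requested, measured:", round(measured_snr_db(clean, noisy), 2))
```

Because the noise is random, the measured SNR of a finite record only approximates the requested value; the deviation shrinks as the record gets longer.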


First, the 10 dB experiment was carried out on the noised samples obtained by the EP method. The layers that can be frozen are convolution layer 1 (*C*1), convolution layer 2 (*C*2), convolution layer 3 (*C*3), and fully connected layer 1 (*F*1). The trainable parameters in these four layers number 156, 1516, 48120, and 10164, respectively. To study which layers' parameters play an important role in the model training process, 13 model freezing schemes are evaluated in this paper, and the model with no frozen layers (type *A*) is used as the control group, as shown in Table 10. Figure 17 summarizes the final test accuracy and the time required for each training process.
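The parameter counts above make it easy to see how much of the network each freezing scheme removes from training. The following sketch tallies the frozen parameters for the schemes whose composition is stated in the text (types *A*, *C*, *F*, and *H*); the remaining schemes of Table 10 are not reproduced here, and the fraction is taken over the four freezable layers only, since the size of the output layer is not given in this section.

```python
# Trainable parameters of the four freezable layers, as listed in the text.
LAYER_PARAMS = {"C1": 156, "C2": 1516, "C3": 48120, "F1": 10164}

# Only the schemes whose frozen layers are named in the text.
KNOWN_SCHEMES = {
    "A": (),                    # control group: nothing frozen
    "C": ("C2",),
    "F": ("C1", "C2"),
    "H": ("C1", "C2", "C3"),
}


def frozen_parameter_count(frozen_layers):
    """Number of parameters excluded from updating under a freezing scheme."""
    return sum(LAYER_PARAMS[name] for name in frozen_layers)


def frozen_fraction(frozen_layers):
    """Frozen share of the parameters in the four freezable layers."""
    return frozen_parameter_count(frozen_layers) / sum(LAYER_PARAMS.values())


for scheme, layers in KNOWN_SCHEMES.items():
    count = frozen_parameter_count(layers)
    share = 100.0 * frozen_fraction(layers)
    print(f"type {scheme}: {count} frozen parameters ({share:.1f}% of the four layers)")
```

Freezing *C*1 and *C*2 fixes only 1672 parameters, whereas also freezing *C*3 fixes 49792, which is most of the parameters in these four layers.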

Among the test results for types *A* ∼ *N* shown in Figure 17, type *C* (freeze *C*2) has the highest test accuracy, reaching 99.17%, with a training time of 351.61 s. The model with the shortest training time is type *H* (freeze *C*1-*C*2-*C*3), which takes 167.25 s and has a test accuracy of 92.92%. In the industrial field, the ideal TL scheme should reach sufficient accuracy and complete the training process in a short time, so this paper takes the 7 types with the highest test accuracy among the 14 and then selects the one with the shortest training time as the ideal transfer type. Type *F* (freeze *C*1-*C*2) meets this criterion, with a test accuracy of 96.67% and a training time of 170.46 s; its training and validation accuracy and loss curves are shown in Figure 18.
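The selection rule described above (shortlist by accuracy, then pick the fastest) can be sketched as follows. Only the three results quoted in the text are used, so the shortlist size is reduced to 2 for this illustration; with all 14 results from Figure 17 the paper's value of 7 would apply. The function name `select_scheme` is illustrative.

```python
def select_scheme(results, top_k):
    """Pick the fastest scheme among the top_k most accurate ones.

    results maps a scheme name to (test_accuracy_percent, training_time_s).
    """
    ranked = sorted(results, key=lambda name: results[name][0], reverse=True)
    shortlist = ranked[:top_k]
    return min(shortlist, key=lambda name: results[name][1])


# The three (accuracy, time) pairs reported in the text for the EP samples at 10 dB.
reported = {
    "C": (99.17, 351.61),
    "F": (96.67, 170.46),
    "H": (92.92, 167.25),
}

# top_k=2 is an assumption made only because three results are available here.
print(select_scheme(reported, top_k=2))
```

Type *H* is the fastest overall but is excluded by the accuracy shortlist, which is the intent of the rule: speed is only compared among schemes that are already accurate enough.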


The noised samples with SNRs of 16 dB and 22 dB were tested in the same way, and the training and test results for the best scheme were collected. The results for the selected freezing type in each case are shown in Table 11. The relative reduction rate in Table 11 is the reduction of the current test value relative to the previous test value, i.e., the value at the next higher SNR.

As the noise intensity in the original signal increases, more noise information is introduced into the converted image samples, and the difference between the noised samples and the original pure samples grows. Noise alters the characteristics of the original signal and weakens the feature differences between different bearing samples, which makes it difficult for the CNN to extract distinguishable features.

In the experiment, under the slight noise of 22 dB, the four types of image samples are classified accurately, with an average accuracy of 96.42%. When the noise is increased to an SNR of 16 dB, the test accuracy of the EP samples is not significantly reduced and remains at 97.92%, while the accuracies of IMFA, SPCI, and GTM drop significantly, by 7.5%, 13.8%, and 16.36%, respectively. Under the strong noise of 10 dB, the accuracy of the SPCI samples increases by 4.01% relative to the 16 dB case, whereas the accuracies of EP, IMFA, and GTM decrease by 1.28%, 13.37%, and 22.29%, respectively, relative to the 16 dB case. IMFA and GTM are therefore more sensitive to noise intensity, whereas EP and SPCI maintain test accuracies of 96.67% and 86.67% under the strong noise, indicating good noise tolerance.
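The relative reduction rate used in this comparison can be checked with a one-line formula. The sketch below assumes the rate is (previous − current) / previous, expressed in percent, and reproduces the EP figure from the two accuracies quoted in the text; the function name is illustrative.

```python
def relative_reduction_rate(previous, current):
    """Reduction of the current value relative to the previous one, in percent.

    A negative result means the value increased.
    """
    return (previous - current) / previous * 100.0


# EP samples: 97.92% at 16 dB and 96.67% at 10 dB, as quoted in the text.
ep_rate = relative_reduction_rate(97.92, 96.67)
print(round(ep_rate, 2))
```

Under this definition the EP accuracies give a rate of about 1.28%, matching the value stated for the 10 dB case.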

Under the strong noise of 10 dB, the training times of the four kinds of samples are 170.46 s, 165.36 s, 194.19 s, and 175.27 s, respectively, and EP has the highest test accuracy of 96.67%; the accuracies of the other three are 78.33%, 86.67%, and 59.58%. These results show that transferring pretrained model parameters avoids a long feature extraction time and accelerates model training, but if only the last fully connected layer remains trainable, the learning ability of the model is weakened. Therefore, different schemes should be tried during parameter transfer to obtain the best transfer learning efficiency, i.e., to improve the speed of network training while guaranteeing accuracy.

#### 6. Conclusions

(1) In this paper, four vibration image generation methods are discussed. To distinguish the features among different image samples and optimize the resolution, the adaptive IMFA and gridding GTM methods are proposed, which provide new approaches for vibration image sample preparation.

(2) To make full use of the learning efficiency of the CNN model, the best model parameters are obtained by adjusting sensitive parameters including the learning rate, optimizer, and regularization, and the trained model classifies accurately when the samples obtained by EP and SPCI are used.

(3) For samples with different levels of GWN, the effect of 13 model freezing schemes on TL is studied. Under strong noise, the model still classifies well. With suitable TL schemes, the training time of the model is reduced while the test accuracy is kept at a high level.

#### Data Availability

The data used to support the results of this study are available from the corresponding author upon request or can be downloaded from the Case Western Reserve University website “http://csgroups.Case.edu/bearing/data/center/home”.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This research was funded by the National Natural Science Foundation of China (Grant nos. 51605380, 51875451, and 51834006) and Natural Science Basic Research Program of Shaanxi (Program no. 2021JM-391).