#### Abstract

Bearing fault diagnosis plays a vitally important role in practical industrial scenarios. Deep learning-based fault diagnosis methods usually assume that the training set and test set obey the same probability distribution, which is hard to satisfy under actual working conditions. This paper proposes a novel multilayer domain adaptation (MLDA) method, which can diagnose compound faults and single faults of multiple sizes simultaneously. A specially designed residual network for the fault diagnosis task is pretrained to extract domain-invariant features. The multikernel maximum mean discrepancy (MK-MMD) and pseudo-label learning are adopted in multiple layers to take both marginal distributions and conditional distributions into consideration. A total of 12 transfer tasks are conducted to verify the performance of MLDA. Comparisons of different signal processing methods, different parameter settings, and different models show that the proposed MLDA model can effectively extract domain-invariant features and achieve satisfactory results.

#### 1. Introduction

Rolling element bearings are of great importance for mechanical equipment. They usually need to run for a long time under harsh conditions, which will inevitably cause faults. The failure of the bearings will cause a lot of economic loss and even safety problems [1]. Therefore, the study of reliable and accurate bearing fault diagnosis methods is of great importance, which can monitor and diagnose the health conditions of the bearings, so as to guarantee the normal working condition of the mechanical equipment and reduce the risk of failure. With the development of intelligent manufacturing, higher requirements have been put forward for fault diagnosis in the industrial process.

Vibration signals are high-precision indicators that can provide information for detecting the status of mechanical equipment [2]. Most traditional fault diagnosis methods rely on signal processing to extract fault information from raw vibration signals, such as empirical mode decomposition (EMD) [3], wavelet packet transform (WPT) [4], and other time-frequency domain signal processing methods [5]. Yu et al. [6] adopted the EMD method to calculate the original statistical characteristics of the vibration signals through the intrinsic mode functions and combined it with a modified feature dimension reduction method to accomplish bearing fault diagnosis. Liu [7] decomposed a vibration signal into sub-band signals via one-level stationary wavelet packet transform (one-level SWPT), which improved the ability to extract fault features. Signal processing methods require analysts to have expert knowledge to extract fault features accurately. However, because the equipment status can be very complicated in actual operation, such methods cannot achieve sufficient accuracy. Therefore, researchers introduced machine learning (ML) methods to make up for this deficiency. Chen et al. [8] combined rough set theory (RS) with the support vector machine (SVM) to propose a multisensor data fusion fault diagnosis method, which reduced the computing cost of the SVM while improving effectiveness and accuracy. Yu et al. [9] applied WPT to extract fault features of the planetary gearbox; the features were discretized and used as the input of the flexible naive Bayesian classifier (FNBC). Fenineche et al. [10] studied the influence of parameter selection in the artificial neural network (ANN) to obtain the best fault diagnosis performance. However, the performance of ML models is often limited by manual feature extraction. When the fault signals are complex, it is difficult to achieve the expected diagnostic accuracy.

Deep learning (DL) [11] can automatically extract features from nonlinear bearing signals, which greatly reduces tedious signal preprocessing. Zhang et al. [12] applied the sparse autoencoder (SAE) to propose a new label generation method, which can identify samples that do not belong to known categories. Dong et al. [13] introduced the convolutional neural network (CNN) into a deep belief network (DBN) to propose a random convolutional deep belief network for mechanical fault diagnosis. By adding unsupervised components, the generalization ability of the model was improved. The novel hierarchical learning rate adaptive CNN presented by Guo et al. [14] was designed for diagnosing bearing faults and determining their severity. To overcome the shortcomings of traditional vibration signal processing methods, Jiao et al. [15] proposed a method based on multivariate encoder information to diagnose faults intelligently. The method presented in [16] can effectively diagnose the compound fault by combining the features automatically extracted by the model with manually designed time-domain features. However, DL models rely on the assumption that the source domain and target domain (for example, training set and test set) obey the same distribution and share the same feature space. In many actual industrial scenarios, the distributions of source-domain and target-domain samples differ considerably, which degrades the diagnostic performance [17].

To tackle this challenge, the application of transfer learning (TL) to bearing fault diagnosis has emerged as a research direction in recent years. Its purpose is to reuse the knowledge learned from the source domain in a different but related target domain [18, 19]. Peng et al. [20] added the idea of residual learning to the model, which can effectively learn high-level and abstract features. Wen et al. [21] adopted a three-layer SAE as the feature extractor and calculated the maximum mean discrepancy (MMD) to minimize the difference between domains. The representation clustering algorithm proposed by Li et al. [22] can maximize the distance metric of interclass variations and minimize the distance metric of intraclass variations at the same time. Zhang et al. [23] improved the domain adaptation ability of the model by implementing the adaptive batch normalization method. The method presented in [24] matched the marginal distributions of the output of every convolution layer, improving the cross-domain testing performance. Regarding the distribution discrepancy between the source domain and target domain, some studies consider that the transferability of high-level features drops significantly [25], while others believe that low-level features may be more responsible for domain shift [26]. Moreover, most existing TL methods focus on marginal distributions and ignore the conditional distributions of different domains, although the two have different effects on domain adaptation.

In this paper, a novel multilayer domain adaptation (MLDA) method is proposed for TL-based intelligent bearing fault diagnosis. By calculating multikernel MMD (MK-MMD) and considering conditional distributions in multiple layers, the model can extract effective domain-invariant features, which clearly contribute to transfer tasks. The main contributions of this work can be summarized as follows: (1) An MLDA method is proposed to diagnose unlabeled bearing fault signals from the target domain through the shared domain features extracted from the source domain. (2) The method can diagnose compound faults and single faults of multiple sizes at the same time. (3) A specially designed residual network based on the ResNet [27] framework is adopted as the feature extractor in the fault diagnosis task to extract features automatically without complex time-frequency domain analysis. (4) To minimize the distribution discrepancy between the source and the target domain, MK-MMD and pseudo-label learning are adopted in multiple layers, considering both marginal distributions and conditional distributions.

The remaining parts of the paper are organized as follows. In Section 2, the domain adaptation problem is formulated, and MK-MMD is introduced. The proposed MLDA method for bearing fault diagnosis is presented in Section 3. The comparisons of different signal processing methods, different parameter settings, and different methods are discussed in Section 4. Finally, the conclusions are drawn in Section 5.

#### 2. Theoretical Background

##### 2.1. Problem Formulation

Since bearings are affected by many factors during operation, such as load and running time, the distributions of samples in the source domain are different from those of samples in the target domain. The emergence of TL provides a new idea for solving this problem. There are two important concepts in TL named domain and task. The detailed description of TL is given as follows [28–30].

The domain, abbreviated by *D*, contains the data space *X* and its marginal distribution *P* (*X*), which can be described as *D* = {*X*, *P* (*X*)}.

The task, abbreviated by *T*, contains the label space *Y* and its predictive function *f* (·), which can be described as *T* = {*Y*, *f* (·)}. *f* (·) can also be described as a conditional distribution *P* (*Y|X*) from the perspective of probability.

The labeled sample space of *D*_{s} can be written as $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with a relevant task *T*_{s}, and the unlabeled sample space of *D*_{t} can be written as $X_t = \{x_j^t\}_{j=1}^{n_t}$ with a relevant task *T*_{t}, where *n*_{s} and *n*_{t}, respectively, denote the numbers of samples of their specific domain.

TL aims to make full use of the knowledge learned from source domain *D*_{s} and source task *T*_{s} to find a target predictive function *f* (·) in target domain *D*_{t}, where *D*_{s} ≠ *D*_{t} or *T*_{s} ≠ *T*_{t}. The condition *D*_{s} ≠ *D*_{t} indicates that *P*_{s} (*X*) ≠ *P*_{t} (*X*) or (and) *X*_{s}≠*X*_{t}, and the condition *T*_{s} ≠ *T*_{t} indicates that *P*_{s} (*Y|X*) ≠ *P*_{t} (*Y|X*) or (and) *Y*_{s} ≠ *Y*_{t}.

Domain adaptation (DA) can be regarded as a specific setting in TL, as shown in Figure 1, which solves the problem of *X*_{s} ≠ *X*_{t}, but *T*_{s} *=* *T*_{t}.

##### 2.2. Multikernel Maximum Mean Discrepancy

DA is a challenging problem when there are no (or limited) labeled data in the target domain. To address this problem, many existing methods focus on minimizing the difference between two domains by adopting a nonparametric distance measure called MMD, which can measure the discrepancy of marginal distributions. As stated in [31], compared with a single kernel, MK-MMD can greatly improve the efficiency of domain adaptation.

*H*_{k} denotes the reproducing kernel Hilbert space (RKHS) with a characteristic kernel *k*. The MK-MMD between distributions *U* and *V* is defined as the RKHS distance between the mean embeddings of *U* and *V*. The squared formulation of the MK-MMD is given as

$$d_k^2(U, V) = \left\| \mathbb{E}_{U}\left[\phi\left(x^s\right)\right] - \mathbb{E}_{V}\left[\phi\left(x^t\right)\right] \right\|_{H_k}^2, \tag{1}$$

where $\phi$ is the feature map associated with the kernel *k*. The most important property is that $d_k^2(U, V) = 0$ if and only if *U* = *V*. The calculation formula of the multikernel is given by

$$k\left(x^s, x^t\right) = \sum_{g=1}^{G} k_{\sigma_g}\left(x^s, x^t\right),$$

where *G* is the number of kernels and $k_{\sigma_g}$ is the Gaussian kernel with bandwidth $\sigma_g$. Gretton et al. [32] showed theoretically that the kernel used in the mean embeddings of *U* and *V* is essential for reducing the test error. Combining several kernels strengthens the MMD test and thus provides a method for optimal kernel selection.
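As a rough illustration of how the squared MK-MMD can be estimated from two finite sample sets, the sketch below uses the biased empirical estimator with an unweighted sum of Gaussian kernels. The kernel form exp(-||x - y||^2 / (2σ^2)), the equal kernel weights, and all function names are assumptions made for illustration; the bandwidths are the ones reported later in Section 4.2.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel exp(-||x - y||^2 / (2 * bandwidth^2)) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths):
    """Unweighted sum of Gaussian kernels; equal weighting is an assumption."""
    return sum(gaussian_kernel(x, y, s) for s in bandwidths)


def mean_kernel(a_samples, b_samples, bandwidths):
    """Average kernel value over all pairs drawn from the two sample sets."""
    total = sum(multi_kernel(a, b, bandwidths)
                for a in a_samples for b in b_samples)
    return total / (len(a_samples) * len(b_samples))


def mk_mmd2(source, target, bandwidths=(4, 8, 16, 32, 64)):
    """Biased empirical estimate of the squared MK-MMD between two sets."""
    return (mean_kernel(source, source, bandwidths)
            + mean_kernel(target, target, bandwidths)
            - 2.0 * mean_kernel(source, target, bandwidths))


# Toy features: the "target" set is the source set shifted far away.
source = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
shifted = [[x + 10.0, y + 10.0] for x, y in source]

same = mk_mmd2(source, source)    # identical sets give zero discrepancy
diff = mk_mmd2(source, shifted)   # shifted sets give a positive discrepancy
```

The estimate vanishes for identical sample sets and grows as the two sets move apart, which is the behaviour the domain adaptation loss relies on.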

#### 3. The Proposed Method

##### 3.1. A Special Designed Residual Network

ResNet has been shown to have strong feature extraction capability. Considering the size of the bearing fault dataset, ResNet-18 is selected as the feature extractor of MLDA. The detailed information of ResNet-18 is shown in Table 1. Convolutional layer, batch normalization, rectified linear unit, and fully connected layer are abbreviated as Conv, BN, ReLU, and FC, respectively.

ResNet-18 contains 4 blocks, and the internal structure of the block is illustrated in Figure 2.

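The block can be represented as

$$y_l = x_l + F\left(x_l, W_l\right), \tag{3}$$

$$x_{l+1} = h\left(y_l\right), \tag{4}$$

where $x_l$ is the input of the *l*th block, $W_l$ denotes the weights of the block, and *F* is the residual function. The shortcut term $x_l$ represents the identity mapping, and *h* is the ReLU activation function. Based on equations (3) and (4), and treating *h* as an identity mapping along the shortcut path, the deep features from low-level *l* to high-level *L* can be obtained:

$$x_L = x_l + \sum_{i=l}^{L-1} F\left(x_i, W_i\right).$$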

The original ResNet-18 has achieved great success in the field of image recognition. In this paper, ResNet-18 is adopted as a feature extractor. Combined with the characteristics of the bearing signals, some modifications need to be made: (1) in order to match the input dimension of the bearing signals, the kernel size of Conv1 is changed to 3 × 3. (2) To retain as much fault information as possible, the Max pool layer is removed. (3) Since ResNet-18 is only adopted as a feature extractor and its classification function is not required, the FC layer and softmax layer are removed. The modified ResNet-18 can effectively extract domain-invariant features.
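The residual formulation of Section 3.1 can be illustrated with a toy sketch. It stacks four blocks with an identity shortcut and checks that the output equals the input plus the sum of all residual branches, which is the unrolled form of the deep features. The elementwise scaling used as the residual branch and all names are hypothetical stand-ins, not the convolutional layers of ResNet-18.

```python
def relu(values):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, v) for v in values]


def identity(values):
    """Identity activation, used to make the unrolling exact."""
    return list(values)


def residual_fn(x, weight):
    """Toy residual branch F(x, W): elementwise scaling (illustrative only)."""
    return [weight * v for v in x]


def block(x, weight, activation):
    """One residual block: y = x + F(x, W), output = activation(y)."""
    y = [v + f for v, f in zip(x, residual_fn(x, weight))]
    return activation(y)


x0 = [1.0, 2.0, 3.0]
weights = [0.5, 0.25, 0.125, 0.1]   # one toy weight per block (four blocks)

# Forward pass through the stack while accumulating every residual branch.
x = x0
residual_sum = [0.0] * len(x0)
for w in weights:
    branch = residual_fn(x, w)
    residual_sum = [r + b for r, b in zip(residual_sum, branch)]
    x = block(x, w, identity)

# Unrolled form: input of the first block plus all residual contributions.
unrolled = [v + r for v, r in zip(x0, residual_sum)]
```

With non-negative activations, as in this toy input, ReLU and the identity give the same block output, which is why the unrolled sum is a useful approximation of how low-level features reach the deeper blocks.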

##### 3.2. Network Architecture

In order to diagnose bearing faults under variable working conditions, the architecture of MLDA is shown in Figure 3.

The data from *D*_{s} and *D*_{t} are used as the input of ResNet-18 pretrained by labeled data from the source domain. To extract domain-invariant features effectively, the discrepancy between the marginal distributions is minimized by calculating MK-MMD in multiple layers. The MK-MMD loss can be obtained by

$$L_{\mathrm{MMD}} = \sum_{l=1}^{N^{l}} \sum_{k=1}^{K} d_k^2\left(U^{l}, V^{l}\right),$$

where *N*^{l} denotes the number of blocks for calculating MK-MMD, *K* is the number of Gaussian kernels, *U*^{l} and *V*^{l} represent the distributions of *D*_{s} and *D*_{t} extracted from the *l*th block, and $d_k^2$(*U*^{l}*, V*^{l}) is the MK-MMD calculated by equation (1) with kernel *k*.

In two domains with different working conditions, the classification categories are the same. Since the label of *D*_{s} is available, the classification loss can be minimized, and cross-entropy is used as the optimization objective:

$$L_{c} = -\frac{1}{M} \sum_{i=1}^{M} y_i \log \hat{y}_i,$$

where *M* denotes the number of samples, *y* represents the true label, and $\hat{y}$ represents the label output by the classifier.

MK-MMD can bound the marginal distributions of the extracted features from *D*_{s} and *D*_{t}. However, unlabeled data from the target domain cannot be directly used in the training process because supervised information is not available. Pseudo-label learning [33] can be one of the solutions to this problem. The pseudo-label of a specific sample is determined by selecting the label with the maximum predicted probability, which can be summarized into two steps: the predicted probability of labels and the conversion to the pseudo-label [34]. In MLDA, each block is followed by a matching classifier (FC layer). The predicted probability of labels given by the classifier and the softmax layer can be calculated as

$$p\left(c \mid y_i\right) = \frac{\exp\left(W_c^{\top} y_i\right)}{\sum_{j=1}^{C} \exp\left(W_j^{\top} y_i\right)},$$

where *y*_{i} is the *i*th sample, *C* is the number of categories, and *W* is the weight of the corresponding category. The conversion of pseudo-labels can be expressed as

$$\tilde{y}_i = \arg\max_{c \in \{1, \ldots, C\}} p\left(c \mid y_i\right),$$

where $\tilde{y}_i$ denotes the pseudo-label of the *i*th sample. The correctness of pseudo-labels will be improved during the training process so that the conditional distributions become more similar. The pseudo-label loss of each block can be calculated by cross-entropy:

$$L_{p}^{l} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \log p^{l}\left(\tilde{y}_i \mid y_i\right),$$

where $p^{l}$ is the probability predicted by the classifier that follows the *l*th block.

The total pseudo-label loss can be obtained by

$$L_{p} = \sum_{l=1}^{N^{l}} L_{p}^{l}.$$

The loss of the overall model can be expressed as

$$L = L_{c} + \lambda_1 L_{\mathrm{MMD}} + \lambda_2 L_{p}, \tag{14}$$

where *λ*_{1} and *λ*_{2} are tradeoff parameters.
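The two pseudo-label steps and the weighted combination of the losses can be sketched as follows. This is a minimal illustration under stated assumptions: the logits are made-up numbers, the MK-MMD value is passed in as a plain number, and pairing *λ*_{1} with the MK-MMD term and *λ*_{2} with the pseudo-label term is an assumption; the default values are those used in Section 4.2. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vector of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(probs):
    """Step 2: pick the class index with the maximum predicted probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def cross_entropy(prob_rows, labels):
    """Mean negative log-probability assigned to the given labels."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, labels)) / len(labels)


def total_loss(cls_loss, mmd_loss, pseudo_loss, lam1=1.0, lam2=0.01):
    """Weighted sum of the three terms (the lambda pairing is an assumption)."""
    return cls_loss + lam1 * mmd_loss + lam2 * pseudo_loss


# Step 1: predicted probabilities for two unlabeled target samples.
target_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0]]
target_probs = [softmax(z) for z in target_logits]

# Step 2: convert probabilities to pseudo-labels and score them.
pseudo = [pseudo_label(p) for p in target_probs]
pseudo_loss = cross_entropy(target_probs, pseudo)

# Supervised loss on one labeled source sample whose true class is 0.
source_probs = [softmax([3.0, 0.0, 0.0])]
cls_loss = cross_entropy(source_probs, [0])

loss = total_loss(cls_loss, mmd_loss=0.2, pseudo_loss=pseudo_loss)
```

In the full model the pseudo-label term would be accumulated over the classifiers attached to each block before being weighted.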

##### 3.3. Diagnosis Procedure

The flowchart of MLDA is shown in Figure 4.

First, the original vibration signals under different loads are collected from the bearing test platform. The frequency-domain signals are constructed via fast Fourier transform (FFT) and reshaped into two-dimensional form [35, 36]. Then, the data are divided into a labeled source domain and an unlabeled target domain and further separated into a training set and a test set. Furthermore, to accelerate the training process, ResNet-18 is pretrained on source-domain data.
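The preprocessing step can be sketched as below for a single 2048-point sample. Keeping the one-sided magnitude spectrum (1024 values) and reshaping it to 32 × 32 is an assumption, since the exact two-dimensional layout is not stated here. A small pure-Python radix-2 FFT keeps the sketch dependency-free; in practice a library routine such as `numpy.fft.rfft` would be the usual choice. The test tone and all function names are hypothetical.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT (length must be a power of two)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def to_spectrum_image(signal, side=32):
    """One-sided magnitude spectrum reshaped to side x side (assumed layout)."""
    half = [abs(c) for c in fft(signal)[: len(signal) // 2]]
    if len(half) != side * side:
        raise ValueError("signal length must equal 2 * side * side")
    return [half[r * side:(r + 1) * side] for r in range(side)]


sampling_rate_hz = 10_000   # sampling frequency reported for the test rig
n_points = 2048             # data points per sample
tone_bin = 100              # hypothetical test tone placed exactly on bin 100

signal = [math.sin(2 * math.pi * tone_bin * i / n_points)
          for i in range(n_points)]
image = to_spectrum_image(signal)

# Locate the strongest frequency bin in the flattened 32 x 32 image.
peak = max(range(n_points // 2), key=lambda k: image[k // 32][k % 32])
peak_hz = peak * sampling_rate_hz / n_points
```

Each row of the resulting image covers a contiguous band of 32 frequency bins, so neighbouring pixels along a row correspond to neighbouring frequencies.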

Secondly, the diagnostic model is built according to the specific fault diagnosis problem and the input dataset and is then ready for training. The data from the training set are fed into the pretrained ResNet-18, which extracts domain-invariant features. The discrepancies of both marginal distributions and conditional distributions are minimized in multiple layers. On the final layer of the network, the FC layer is adopted to identify the bearing faults with the extracted domain-shared features. The optimization objective of the model (equation (14)) is minimized through the Adam method. By the end of training, the loss function of the overall method has generally converged.

Finally, after training, the test set from the target domain is input into the model to evaluate the model capability and output the fault diagnosis results.

#### 4. Experimental Analysis

##### 4.1. Dataset Description

The bearing fault dataset used to evaluate the effectiveness of the proposed MLDA method was collected from the bearing test platform shown in Figure 5. The drive motor, healthy bearing, and test bearing are fixed on the same motor shaft from left to right. The data were collected by an NI PXIe-1082 data acquisition system. The adjustable loading system is mounted in the radial direction of the motor shaft. An SGSF-20K dynamometer is installed in the bolt-nut system to measure the load. The sampling frequency of the PCB 352C33 accelerometer is 10 kHz, and the motor speed is 896 rpm. During the bearing operation cycle, the accelerometer continuously collects bearing signal data.

The 14 health conditions are gathered under four working conditions with different loads (0 kN, 1 kN, 2 kN, and 3 kN). Ten of them are the normal bearing (NO) and the single faults: inner race fault (IF), outer race fault (OF), and ball fault (BF), each covering three fault diameters (1 + 3 × 3 = 10). Furthermore, four kinds of compound faults are machined with a width of 0.2 mm: inner race and ball fault (IB), inner race and outer race fault (IO), outer race and ball fault (OB), and inner race, outer race, and ball fault (IOB). For the sake of clarity, all 14 fault patterns are summarized in Table 2. Each sample contains 2048 data points.

When the bearing rotates at a certain constant speed, different fault patterns will generate different vibration signals. The vibration signals under 0 kN load are shown in Figure 6. The vibration signal of the healthy bearing is relatively stable. For single faults, the periodicity of IF and OF is clearly visible. However, the vibration signal of the BF shows no obvious periodicity or amplitude change, which makes it difficult to identify. Compared with the single faults, the amplitude of the compound faults increases significantly and fluctuates greatly. The complexity of compound fault signals makes feature extraction difficult and poses challenges for research [37].

**Figure 6.** Vibration signals of different fault patterns under 0 kN load, panels (a) to (h).

##### 4.2. Comparison of Different Signal Processing Methods

In order to extract features from the fault-related bearing signals, traditional methods employ many tricks to process the signal. In this part, we carried out time, frequency, and time-frequency analysis to decide the best signal preprocessing method for the MLDA model. Three experiments are carried out: (1) input time-domain signals to the MLDA model, (2) input time-frequency domain signals by empirical mode decomposition (EMD), and (3) input frequency-domain signals by FFT.

In the experiment, the learning rate is set as 0.0001, and *λ*_{1} and *λ*_{2} are set as 1 and 0.01, respectively. The MK-MMD adopts a mixture of 5 Gaussian kernels with bandwidths of 4, 8, 16, 32, and 64. A total of 12 transfer tasks are conducted. For example, transfer task 0-1 indicates that 0 kN is the source domain and 1 kN is the target domain, hereinafter the same.
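The 12 transfer tasks are simply the ordered source-target pairs of the four loads, which a short sketch makes explicit. The dictionary gathers the settings stated above; its key names are hypothetical.

```python
from itertools import permutations

# Four working conditions (loads in kN); every ordered pair is one task.
loads_kn = [0, 1, 2, 3]
tasks = [f"{source}-{target}" for source, target in permutations(loads_kn, 2)]

# Hyperparameters stated in the text; the key names are illustrative only.
config = {
    "learning_rate": 1e-4,
    "lambda_1": 1.0,
    "lambda_2": 0.01,
    "gaussian_bandwidths": [4, 8, 16, 32, 64],
}
```

Because the pairs are ordered, task 0-3 and task 3-0 are distinct experiments with the source and target roles swapped.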

The results are shown in Table 3. FFT + MLDA achieves the best results, which demonstrates the power of the deep neural network to model the fault-related nonlinear vibration signals. Hence, our method does not need complicated signal preprocessing techniques. Simply transforming the signal from the time domain to the frequency domain by FFT is sufficient for the MLDA model.

##### 4.3. Comparison of Different Parameter Settings

Different parameter settings will bring different effects on the experimental results. For verifying the effect of multilayer domain adaptation, two groups of experiments are designed to apply MK-MMD and pseudo-label learning only on block 1 and block 4, respectively, represented as MLDA-1 and MLDA-4. Furthermore, in order to prove the effect of MK-MMD, only one Gaussian kernel with a bandwidth of 4 is set as the third group of comparisons, represented as MLDA-SK. The experimental results are shown in Table 4.

As shown in Table 4, MLDA achieves the best accuracy in almost all tasks, with an average of 99.14%. It can be seen from MLDA-1 and MLDA-4 that both low-level features and high-level features will cause domain shift to a certain extent. Matching discrepancy of high-level features can obtain better accuracy, which indicates that the transferability of high-level features is better than that of low-level features. Moreover, when the source domain is relatively different from the target domain, such as the transfer between 0 kN and 3 kN, the advantages of applying domain adaptation in multiple layers are obvious. The comparison of MLDA-SK shows that a single kernel of MMD has a limited capability to narrow the marginal distributions. MLDA makes a great improvement in all 12 tasks, which proves the effect of mixed kernels. The radar diagram in Figure 7 shows the diagnostic results of different parameter settings. It can be seen intuitively that MLDA achieves the best results.

Figure 8 shows the classification results of the proposed method for the 0-3 transfer task. The fault characteristics of NO, IF, and OF are relatively obvious. All samples are diagnosed correctly. Misdiagnosis mostly occurs in BF04, IB, and IOB. The diagnostic results of BF04 are mainly the fault diameter recognition error, which is classified as BF03. IB is misclassified as BF04 and IOB. IOB is the category with the most misdiagnoses, and all the misclassified samples are identified as BF04. BF is easily misclassified because its fault characteristics are not obvious. Since the compound fault is a mixture of multiple fault types, the feature extraction is difficult with mixture features, especially the mixture of three fault types.

##### 4.4. Comparison of Different Models

In order to further demonstrate the effect of MLDA, traditional TL methods are investigated, which include transfer component analysis (TCA) [38], joint distribution adaptation (JDA) [39], correlation alignment (CORAL) [40], and the pretraining model (ResNet-18). The comparison results are illustrated in Table 5.

From the comparison results of different methods, three conclusions can be drawn: (1) MLDA clearly achieves the best performance among the 5 methods. Without domain adaptation, the target-domain data cannot be diagnosed effectively by the pretraining model (ResNet-18). (2) The traditional TL methods can achieve good results when the discrepancy is relatively small, such as the transfer between 0 kN and 1 kN. This can be explained by the fact that the transferability of extracted features and the degree of domain shift depend on how much the working conditions differ. (3) The traditional methods transfer poorly when the working conditions change dramatically, which decreases the accuracy of fault diagnosis. Notably, the proposed MLDA method maintains satisfactory accuracy and generalizability.

Figure 9 illustrates the diagnostic results of the different models. The proposed MLDA method achieves the best performance across the transfer tasks, which demonstrates its superiority.

Although MLDA achieves satisfactory results, it remains unclear whether MLDA can extract domain-invariant features. The t-SNE method [41], which reduces the dimension of the features, is introduced to visualize the features extracted by each method. The results for transfer task 3-0 are shown in Figure 10. In each category, 20 feature points are sampled randomly.
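The per-category subsampling that precedes the visualization can be sketched as follows. The synthetic features, the fixed seed, and the function name are assumptions for illustration; the selected points would then be passed to a t-SNE implementation.

```python
import random
from collections import defaultdict


def sample_per_class(features, labels, per_class=20, seed=0):
    """Randomly keep `per_class` feature points from every category."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        # random.sample raises ValueError if a class has too few points.
        chosen.extend(rng.sample(by_class[label], per_class))
    return [features[i] for i in chosen], [labels[i] for i in chosen]


# Synthetic stand-in: 14 categories with 50 two-dimensional features each.
labels = [c for c in range(14) for _ in range(50)]
features = [[float(i), float(c)] for i, c in enumerate(labels)]

sub_features, sub_labels = sample_per_class(features, labels)
```

With 14 categories this yields 280 points per domain, which keeps the scatter plot readable while still covering every fault pattern.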

As a pretraining model, ResNet-18 has strong feature extraction ability. However, the discrepancy between two domains can hardly be narrowed in the ResNet-18 method. The other three traditional TL methods can narrow the distribution shift between different domain features to a certain extent, but the capability is limited. The proposed MLDA method learns the feature mapping from source and target domains to the shared feature space, decreasing domain shift and effectively using the knowledge learned from the shared feature extractor to diagnose the target domain through unsupervised learning. It can clearly extract domain-invariant features with high generalizability.

#### 5. Conclusions

In summary, this paper develops an MLDA method for bearing fault diagnosis, which can diagnose compound faults and single faults of multiple sizes simultaneously. First, a modified ResNet-18 is pretrained as a feature extractor. The MK-MMD is calculated for the extracted features in multiple layers to narrow the gap between the marginal distributions. Second, the features extracted from each block are input into the matching classifier. The predicted probability is calculated through the softmax layer and converted into the pseudo-label to narrow the gap between the conditional distributions. Third, the Adam optimization method is adopted to optimize the overall model parameters and speed up the convergence of the model. Through the comparisons of different signal processing methods, different parameter settings, and different methods, the proposed MLDA method is shown to classify the fault patterns precisely and achieve better transfer performance. The proposed method is meaningful to prognostics and health management (PHM) and can provide reliable fault diagnosis results for practical industrial scenarios.

#### Data Availability

The data can be obtained from the Institute of Industrial Measurement, Control and Equipment Diagnostics, School of Rail Transportation, Soochow University.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was financially supported by the National Natural Science Foundation of China (no. 51875375) and the Suzhou Science Foundation (no. SYG201802).