#### Abstract

In real industrial scenarios, the working conditions of bearings are variable, and it is therefore difficult for data-driven diagnosis methods based on conventional machine-learning techniques to guarantee the desirable performance of diagnosis models, since such models assume that the training and testing data share the same distribution. To enhance the performance of the fault diagnosis of bearings under different working conditions, a novel diagnosis framework inspired by feature extraction, transfer learning (TL), and feature dimensionality reduction is proposed in this work, and dual-tree complex wavelet packet transform (DTCWPT) is used for signal processing. Additionally, transferable sensitive feature selection by ReliefF and the sum of mean deviation (TSFRS) is proposed to reduce the redundant information of the original feature set, to select sensitive features for fault diagnosis, and to reduce the difference between the marginal distributions of the training and testing feature sets. Furthermore, a modified feature reduction method, the local maximum margin criterion (LMMC), is proposed to acquire low-dimensional mappings of high-dimensional feature spaces. Finally, bearing vibration signals collected from two test rigs are analyzed to demonstrate the adaptability, effectiveness, and practicability of the proposed diagnosis framework. The experimental results show that the proposed method can achieve high diagnosis accuracy and has significant potential benefits in industrial applications.

#### 1. Introduction

Rolling element bearings (REBs) are one of the most common machine elements of rotating machinery equipment in modern industry and smart manufacturing [1, 2], and the health state of REBs can seriously affect the safe and stable operation of rotary mechanical equipment [2]. REBs often operate in harsh working environments, and their failure probability is therefore higher than that of other components [3, 4]. Thus, REB fault diagnosis is of great significance for the guarantee of equipment safety and the reduction of maintenance costs [4]. In the past decade, because vibration signals usually carry rich information about machine operating conditions, the vibration signals collected from REBs have been commonly used as the analytical signals in many intelligent machine fault diagnosis systems [5]. In recent years, with the rapid development of signal processing, data mining, and artificial intelligence technology, data-driven fault diagnosis has become a popular research topic [5]. Data-driven fault diagnosis consists of four steps, namely, signal collection and processing, feature extraction, feature reduction, and pattern recognition [5–8], among which feature extraction is the crucial step for extracting useful information from the original vibration signals for fault pattern recognition. However, most existing data-driven intelligent diagnosis methods have two main limitations that hinder their applicability in real industrial scenarios [5, 6, 9]: (1) most existing feature extraction and fault classification models assume that the training and testing data have the same distribution. Due to the harsh working environment and different working requirements in industry, the working conditions are not consistent; this can therefore lead to differences between the distributions of the training and testing data [6].
(2) The variability of working conditions and the diversity of failure types of rotating machines often lead to insufficient labeled target fault data. Therefore, diagnostic models based on conventional machine-learning techniques and learned from training data cannot guarantee the desired diagnosis performance on testing data collected from industrial scenarios. To overcome these two limitations, it is necessary to use an improved fault diagnosis framework in which the training data are the labeled data under one working condition, and the resulting model can be applied to the unlabeled data under other working conditions.

Signal processing is the first step in data-driven REB fault diagnosis methods and has been carried out in many previous investigations by numerous scholars. Because vibration signals collected from REBs generally have nonlinear and nonstationary features, a time-frequency analysis can be effective for feature extraction [10]. Some commonly used and representative conventional time-frequency domain analysis approaches include empirical mode decomposition (EMD), short-time Fourier transform (STFT), the Wigner–Ville distribution (WVD), and wavelet transform (WT) [11]. In addition, parameterized time-frequency transform (PTFT) methods [12, 13] have been proposed to achieve a more accurate extraction of the instantaneous rotation frequency (IRF) from strong nonstationary vibration signals. In the work by Wang and Xiang [14], spline-kernelled chirplet transform (SCT), one of the PTFT methods, was employed to calculate the time-frequency distribution and extract the instantaneous rotation frequency for REB fault diagnosis under varying speed conditions. In the work by Wang et al. [15], polynomial chirplet transform (PCT), another PTFT method, was employed to estimate the IRF of REBs from the vibration signals for fault diagnosis.

EMD is a common and effective time-frequency approach in the fault diagnosis of rotating machinery and can automatically decompose nonstationary and nonlinear signals into multiple modal compositions [16, 17]. In some previous studies [18–21], EMD was employed to process the original signals and extract features for REB fault diagnosis. However, there are some limitations of EMD, such as overenveloping, end effects, and mode mixing [17, 22]. STFT is also an effective time-frequency analysis approach that can be used to divide the entire time domain into numerous segments of the same length, and each time period is an approximately stationary process [23, 24]. Some researchers [25–28] have used STFT for fault diagnosis, but its effectiveness is still hampered by the limitation of its single triangular basis [17, 29]. The WVD is also a widely used nonlinear time-frequency distribution for signal processing due to its excellent resolution and localization in the time-frequency domain [30]; however, the presence of cross-terms when it is applied to multicomponent signals can result in misleading interpretations [31, 32]. WT, including continuous wavelet transform (CWT) and discrete wavelet transform (DWT), is an outstanding and powerful method in rotary machine diagnosis because its multiresolution capability is suitable for the analysis of nonlinear and nonstationary signals [33]. However, CWT can generate redundant data, requires a huge number of operations, and is very time consuming [34, 35]. DWT can overcome these drawbacks of CWT, but its limitations of shift variance and frequency aliasing may lead to the loss of useful information [36]. To address these shortcomings of DWT, dual-tree complex wavelet transform (DTCWT) was proposed by Tang et al. [37, 38] and further investigated by Selesnick et al. [39] in the dyadic case.
DTCWT possesses some advantageous properties [36, 37, 40], including its (1) near shift-invariance and reduced aliasing, (2) good directional selectivity, which can overcome the lack of directional selection of DWT, (3) limited redundancy and computational efficiency, and (4) ability to acquire amplitude information and achieve perfect reconstruction. These properties are all beneficial for feature extraction in the task of mechanical fault diagnosis [31]. Dual-tree complex wavelet packet transform (DTCWPT) is an extension of DTCWT and can overcome the foremost limitation of DTCWT, namely, that it cannot realize multiresolution analysis in the high-frequency band. In previous research [36, 37, 40, 41], DTCWPT has been employed to process signals and extract features for REB fault diagnosis. In this paper, DTCWPT is introduced to process the original vibration signals collected from REBs.

For the construction of a feature set for fault pattern recognition, the statistical properties of the signals in the time, frequency, and time-frequency domains can be extracted to represent feature information [9, 42–44], such as the peak value (PV), root mean square (RMS), variance (*V*), skewness (Sw), kurtosis (*K*), energy, and energy entropy. In [42], after the vibration signals were processed by wavelet analysis, single-branch reconstruction signals and the corresponding HHT envelope spectrum (HES) were used to generate 192 statistical features using 6 statistical parameters for bearing fault diagnosis. In [43], vibration signals collected from REBs were decomposed into several different IMFs by EMD. The first four IMFs were selected to obtain the HHT marginal spectrum and HES, which were then used to calculate the original statistical properties. In [9], 29 statistical parameters were selected to extract 29 statistical features, which formed a high-dimensional original feature dataset for REB fault diagnosis. In [44], more than 30 feature indicators of vibration signals were calculated for axle bearings under different conditions, and the features that could more effectively and representatively reflect the fault features were selected for fault detection. In [45], the RMS and *K* were used to calculate fault features for wind turbine bearing fault diagnosis. It is often difficult to determine which statistical property can best reflect the nature of a fault from the feature space because of the complex mapping relations between some bearing faults and their signals [42, 43]. Thus, choosing unsuitable statistical features for fault pattern recognition may lead to a decline in the accuracy and efficiency of fault diagnosis. According to some previous studies [11, 42, 43], the selection of a feature subset that is formed by fault-sensitive features is a crucial step for the achievement of the expected diagnostic accuracy.
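The statistical indicators named above can be computed directly from a signal segment. The following sketch, in Python with NumPy, uses standard textbook definitions of these seven time-domain indicators; the exact parameter sets used in the cited works may differ.

```python
import numpy as np

def statistical_features(x):
    """Compute seven common statistical features of a 1-D signal segment:
    peak value, RMS, variance, skewness, kurtosis, energy, energy entropy.
    Standard definitions are assumed; the cited papers may use variants."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    pv = np.max(np.abs(x))                    # peak value (PV)
    rms = np.sqrt(np.mean(x ** 2))            # root mean square (RMS)
    var = x.var()                             # variance (V)
    sw = np.mean((x - mu) ** 3) / sigma ** 3  # skewness (Sw)
    k = np.mean((x - mu) ** 4) / sigma ** 4   # kurtosis (K)
    energy = np.sum(x ** 2)                   # energy
    p = x ** 2 / energy                       # normalized energy distribution
    entropy = -np.sum(p * np.log(p + 1e-12))  # energy entropy
    return np.array([pv, rms, var, sw, k, energy, entropy])
```

Applied to each single-branch reconstruction signal of the DTCWPT terminal nodes, such a routine yields one row of the original multidomain feature set.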

As discussed previously, the two foremost limitations of most existing data-driven intelligent diagnosis methods [5, 6] are that (1) they assume that the distributions of the training and testing data are the same and (2) the variability of working conditions and the diversity of failures often lead to insufficient labeled target fault data. Recently, these two problems have garnered considerable attention and have been further investigated by some researchers. An et al. [5] proposed a novel three-layer model inspired by a recurrent neural network (RNN) and TL for REB fault diagnosis under different working conditions. Ma et al. [9], aiming at overcoming the first limitation, proposed a transfer diagnosis framework based on domain adaptation for bearing fault diagnosis across diverse domains. In the work by Gao et al. [46], the finite element method (FEM) was employed to simulate samples with different faults to overcome the lack of sufficient and complete fault samples. In a study by Liu et al. [47], to address the difficulty of obtaining faulty samples from real-world running mechanical systems, a personalized fault diagnosis method for the detection of bearing faults was proposed for the activation of smart-sensor networks using FEM simulations. Some existing research has shown that TL [48] has broad application prospects and wide applicability in various fields [49–51]. Feature-based transfer, a mainstream branch of TL technology, has been used in image classification [52–54] and has inspired a novel idea for overcoming the two limitations of data-driven intelligent fault diagnosis. In this paper, the use of a novel feature extraction procedure, namely, transferable sensitive feature selection by ReliefF and the sum of mean deviation (TSFRS), is proposed.
TSFRS has the following two aspects: (1) it is characterized by the selection of fault-sensitive features and combines the ReliefF algorithm and the sum of within-class mean deviations (SMD) of feature data; (2) a feature-based TL method, namely, transfer component analysis (TCA), is used to reduce the differences between the marginal distributions of the training and testing data.

After the steps of signal processing and feature extraction, a high-dimensional feature set can usually be generated; if this feature set is used directly in fault pattern recognition, it will lead to very high computational complexity and the degradation of fault diagnosis accuracy [42]. Hence, dimensionality reduction is another key step that must be taken before fault pattern recognition. In fact, the dimensionality reduction of features can not only limit storage requirements and increase the algorithm speed, but can also improve the predictive accuracy of the classifier model by removing noisy and redundant features while retaining the most useful information regarding diverse bearing failures [55]. Dimensionality reduction methods can be classified into either linear or nonlinear methods. Principal component analysis (PCA) and linear discriminant analysis (LDA), as two classical linear dimensionality reduction methods, have been extensively used for linear data, but they may be invalid for nonlinear data [56]. Therefore, some nonlinear dimensionality reduction methods, namely, kernel principal components analysis (KPCA), Isomap, Laplacian eigenmaps (LE), and local linear embedding (LLE), among others, have presented valid solutions for the dimensionality reduction of nonlinear data [56]. However, nonlinear dimensionality reduction methods have some limitations in practical applications, such as the problem of “out-of-sample” that has no explicit mapping matrix [57], the problem of the overlearning of locality [58], and high computational complexity. In recent years, some unsupervised manifold learning methods that preserve the local geometric structure on the data manifold using the linear approximation of the nonlinear mappings have been proposed, and some representative methods include locality-preserving projections (LPP) [59], neighborhood-preserving embedding (NPE) [60], and orthogonal neighborhood-preserving projection [61]. 
Among these manifold learning methods, LPP has attracted attention in the fault diagnosis field [62–64], but it does not utilize the label information in dimensionality reduction. LDA is a supervised dimensionality reduction method that considers the label information in feature reduction, but it cannot be directly applied when the within-class and between-class scatter matrices are singular because of the small sample size (SSS) problem [65]. Based on the respective dominant attributes of LPP and LDA, a novel dimensionality reduction method, namely, local Fisher discriminant analysis (LFDA), was proposed by Sugiyama [66]. LFDA takes into account the label information of data while simultaneously preserving the local geometric structures of the feature data. However, LFDA only considers the neighbor relationships between samples of the same class while ignoring those between samples of different classes. Aiming at the alleviation of the SSS problem of LDA, the maximum margin criterion (MMC), a supervised dimensionality reduction method, was proposed [65]. Inspired by the attributes of LFDA and MMC, this paper proposes a novel feature reduction method, namely, the local maximum margin criterion (LMMC), an improved MMC in which both the neighbor relationships between samples of the same class and those between samples of different classes are considered.

Therefore, the contributions of this paper are summarized as follows. To solve the problem of fault diagnosis via vibration data that are variably distributed under different working conditions, a novel intelligent fault diagnosis framework of REBs based on multidomain features that systematically combines statistical feature extraction, feature-based TL, feature reduction, and pattern recognition is proposed. TSFRS, a novel feature extraction procedure, is proposed for the selection of the transferable fault-sensitive statistical features as the basis of the subsequent fault analysis. LMMC, an improved feature reduction method, is proposed for the extraction of abundant and valuable information with low dimensionality, which is beneficial for fault diagnosis. The execution of the proposed fault diagnosis framework of REBs is divided into four steps, namely, signal processing, feature extraction, feature reduction, and fault pattern recognition. First, DTCWPT is performed on raw vibration signals collected from REBs, and different terminal nodes can be obtained. Multidomain statistical features are then extracted from the reconstructed signals of the terminal nodes to construct the original feature set. Second, based on the ReliefF algorithm and mean deviation, a new evaluation index, namely, the ratio of the feature weight value and the SMD, is employed to indicate the sensitivity of statistical features; the most sensitive features can be selected to form a feature subset that represents the fault peculiarity of REBs. Additionally, TCA is used to reduce the differences between the marginal distributions of feature datasets under different working conditions. Third, LMMC is performed on the original high-dimensional feature set to acquire a new lower-dimensional projection of it.
Finally, vibration signals collected from two test rigs under different working conditions are employed to validate the effectiveness, adaptability, and superiority of the proposed method for the identification and classification of REB faults. The first test rig is from Case Western Reserve University, on which two cases with 12 fault types under different motor loads of 2 hp and 3 hp are employed for validation experiments. The second test rig is an SQI-MFS test rig, on which two cases with 10 fault types under different motor speeds of 1200 rpm and 1800 rpm are employed to further verify the adaptability of the proposed method.

The remainder of this paper is organized as follows. In Section 2, the theoretical backgrounds of the DTCWPT technique, TCA technique, and MMC are summarized. In Section 3, a description of the proposed diagnosis technique is provided, and the fault diagnosis framework of REBs is illustrated. In Section 4, REB fault vibration signals collected from two experimental test rigs are investigated to verify the performance of the proposed method. Finally, the conclusion of this work is presented in Section 5. Some acronyms used in this paper are presented in Table 1.

#### 2. Theoretical Background

##### 2.1. Dual-Tree Complex Wavelet Packet Transform (DTCWPT)

DTCWT, an enhancement of DWT, is characterized by some important properties including near shift-invariance and the inhibition of frequency aliasing components [36]. However, DTCWT cannot be used for multiresolution analysis in the high-frequency band where useful fault feature information usually exists [67]. To address this limitation, DTCWPT, which is composed of two parallel discrete wavelet packet transforms with different low- and high-pass filters, can present a more precise frequency band partition over the entire analyzed frequency band [36, 40]. DTCWPT is divided into real- and imaginary-part wavelet packet transforms, which can be, respectively, regarded as the real and imaginary trees. The real tree decomposition and the corresponding coefficients can be expressed as follows [37]:

$$d_{l+1}^{2N,\mathrm{Re}}(n)=\sum_{m}h_{0}(m-2n)\,d_{l}^{N,\mathrm{Re}}(m),\qquad d_{l+1}^{2N+1,\mathrm{Re}}(n)=\sum_{m}h_{1}(m-2n)\,d_{l}^{N,\mathrm{Re}}(m),\tag{1}$$

where $d_{l}^{N,\mathrm{Re}}(n)$ denotes the coefficients in the real tree at a scale *l* and node *N*, and $h_{0}$ and $h_{1}$ are the low-pass filter and the high-pass filter, respectively. The imaginary tree decomposition and the corresponding coefficients can be expressed as follows:

$$d_{l+1}^{2N,\mathrm{Im}}(n)=\sum_{m}g_{0}(m-2n)\,d_{l}^{N,\mathrm{Im}}(m),\qquad d_{l+1}^{2N+1,\mathrm{Im}}(n)=\sum_{m}g_{1}(m-2n)\,d_{l}^{N,\mathrm{Im}}(m),\tag{2}$$

where $d_{l}^{N,\mathrm{Im}}(n)$ denotes the coefficients in the imaginary tree at a scale *l* and node *N*, and $g_{0}$ and $g_{1}$ are the low-pass filter and the high-pass filter, respectively. When the scale *l* is 0, the coefficients are both equal to the original signal $x(n)$, namely, $d_{0}^{0,\mathrm{Re}}(n)=d_{0}^{0,\mathrm{Im}}(n)=x(n)$. The decomposition coefficients of DTCWPT are composed of $d_{l}^{N,\mathrm{Re}}(n)$ and $d_{l}^{N,\mathrm{Im}}(n)$, and they can be expressed as follows:

$$d_{l}^{N}(n)=d_{l}^{N,\mathrm{Re}}(n)+\mathrm{j}\,d_{l}^{N,\mathrm{Im}}(n).\tag{3}$$

The reconstruction procedure of DTCWPT is as follows:

$$\hat{d}_{l}^{N,\mathrm{Re}}(n)=\sum_{m}\tilde{h}_{0}(n-2m)\,d_{l+1}^{2N,\mathrm{Re}}(m)+\sum_{m}\tilde{h}_{1}(n-2m)\,d_{l+1}^{2N+1,\mathrm{Re}}(m),\qquad \hat{d}_{l}^{N,\mathrm{Im}}(n)=\sum_{m}\tilde{g}_{0}(n-2m)\,d_{l+1}^{2N,\mathrm{Im}}(m)+\sum_{m}\tilde{g}_{1}(n-2m)\,d_{l+1}^{2N+1,\mathrm{Im}}(m),\tag{4}$$

where $\tilde{h}_{0},\tilde{h}_{1}$ and $\tilde{g}_{0},\tilde{g}_{1}$ are the wavelet packet reconstruction filters of the real and imaginary trees, respectively. DTCWPT is characterized by two prominent advantages: (1) it is beneficial to the detection of multiple harmonic signals and (2) it can help to extract the periodic impact features of signals. Therefore, in this work, DTCWPT is used to process the original vibration signals, and the corresponding single-branch reconstruction signals of the terminal nodes are used to extract original features.
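As a rough illustration of the dual-tree structure, the following Python/NumPy sketch runs one analysis level of two parallel wavelet packet trees and combines them into complex coefficients. It is a structural sketch only: proper q-shift filter design for the two trees is assumed to be done elsewhere, and the Haar-like pair used in the test is an illustrative placeholder, not a real DTCWPT filter set.

```python
import numpy as np

def analysis_step(x, h0, h1):
    """One wavelet-packet analysis step: filter, then dyadic downsample."""
    lo = np.convolve(x, h0)[::2]
    hi = np.convolve(x, h1)[::2]
    return lo, hi

def dtcwpt_level(x, real_filt, imag_filt):
    """One decomposition level of a dual-tree complex wavelet packet transform.

    Sketch only: each tree runs an ordinary wavelet packet step with its own
    low/high-pass pair (real_filt and imag_filt), and the complex coefficients
    combine the two trees as d = d_Re + j * d_Im."""
    lo_re, hi_re = analysis_step(x, *real_filt)
    lo_im, hi_im = analysis_step(x, *imag_filt)
    return lo_re + 1j * lo_im, hi_re + 1j * hi_im
```

Iterating this step on both outputs of every node yields the full wavelet packet tree, whose terminal-node reconstructions feed the feature extraction stage.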

##### 2.2. Maximum Margin Criterion (MMC) and Local Fisher Discriminant Analysis (LFDA)

Linear discriminant analysis (LDA), one of the most popular methods for dimension reduction in statistics research fields [68, 69], was proposed by Fisher [70] for the dimension reduction of binary classification problems and was further extended to multiclass cases by Rao [71]. However, LDA cannot be directly applied when the within-class and between-class scatter matrices are singular because of the SSS problem [65]. To address this drawback, Li et al. [65] and Song et al. [72] used the difference of the between-class and within-class scatter matrices as a discriminant criterion, called the maximum margin criterion (MMC), with which no matrix inverse needs to be constructed. Thus, the SSS problem of traditional LDA is alleviated.

Let $X=\{x_{1},x_{2},\ldots,x_{N}\}$ be the input data set and $Y=\{y_{1},y_{2},\ldots,y_{N}\}$ be the associated class label set, where $x_{i}$ is an *M*-dimensional sample, *N* is the number of samples, and *c* is the total number of classes. To reduce the dimensionality of a sample $x$, some measure needs to be employed to assess similarity or dissimilarity. We want to find a linear transformation $W$ transforming $x$ from $\mathbb{R}^{M}$ to $\mathbb{R}^{d}$, where $d<M$. After the dimensionality reduction, the similarity or dissimilarity information should be preserved as much as possible. In the work by Li et al. [65], the Euclidean distance was applied to measure the dissimilarity, and the objective of the MMC is for a sample to be close to those in the same class but far from those in different classes. Thus, the MMC can be presented as follows:

$$J=\frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{c}p_{i}p_{j}\,d\left(C_{i},C_{j}\right),\tag{5}$$

where $p_{i}$ and $p_{j}$ are the prior probabilities of the classes $C_{i}$ and $C_{j}$, respectively. The interclass distance $d(C_{i},C_{j})$ is defined as the distance between mean vectors, that is,

$$d\left(C_{i},C_{j}\right)=d\left(m_{i},m_{j}\right),\tag{6}$$

where $m_{i}$ and $m_{j}$ are the mean vectors of the classes $C_{i}$ and $C_{j}$, respectively. However, because (6) neglects the scatter of the classes, it is not suitable: even if $d(m_{i},m_{j})$ is large, it is not easy to separate two classes that have a large spread and overlap with each other. To address this problem, considering the scatter of the classes, the between-class distance can be redefined as follows:

$$d\left(C_{i},C_{j}\right)=d\left(m_{i},m_{j}\right)-s\left(C_{i}\right)-s\left(C_{j}\right),\tag{7}$$

where $s(C_{i})$ is some measure of the scatter of the class $C_{i}$. The overall variance $\operatorname{tr}(S_{i})$ is usually used to measure the scatter of data, where $S_{i}$ is the covariance matrix of the class $C_{i}$. Thus, based on (5) and (7), two new parts can be obtained by decomposing (5):

$$J=\frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{c}p_{i}p_{j}\,d\left(m_{i},m_{j}\right)-\frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{c}p_{i}p_{j}\left(s\left(C_{i}\right)+s\left(C_{j}\right)\right).\tag{8}$$

By employing the Euclidean distance, the first part in (8) can be simplified as

$$\frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{c}p_{i}p_{j}\left(m_{i}-m_{j}\right)^{T}\left(m_{i}-m_{j}\right).\tag{9}$$

Because $S_{b}=\sum_{i=1}^{c}p_{i}\left(m_{i}-m\right)\left(m_{i}-m\right)^{T}$ and $m=\sum_{i=1}^{c}p_{i}m_{i}$, (9) can be further simplified as

$$\frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{c}p_{i}p_{j}\left(m_{i}-m_{j}\right)^{T}\left(m_{i}-m_{j}\right)=\operatorname{tr}\left(S_{b}\right).\tag{10}$$

The second part in (8) can be simplified as

$$\frac{1}{2}\sum_{i=1}^{c}\sum_{j=1}^{c}p_{i}p_{j}\left(\operatorname{tr}\left(S_{i}\right)+\operatorname{tr}\left(S_{j}\right)\right)=\sum_{i=1}^{c}p_{i}\operatorname{tr}\left(S_{i}\right)=\operatorname{tr}\left(S_{w}\right).\tag{11}$$

Equation (5) can thus be transformed to

$$J=\operatorname{tr}\left(S_{b}-S_{w}\right),\tag{12}$$

where $S_{b}$ is the between-class scatter matrix, $S_{w}$ is the within-class scatter matrix, and $\operatorname{tr}(S_{b})$ measures the between-class separation, while $\operatorname{tr}(S_{w})$ measures the within-class cohesion.
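A minimal NumPy sketch of the MMC follows: build the between-class and within-class scatter matrices from class priors, means, and covariances, then take the leading eigenvectors of their difference. No scatter-matrix inversion is needed, which is exactly what sidesteps the SSS problem.

```python
import numpy as np

def mmc_projection(X, y, d):
    """Maximum margin criterion sketch: project onto the d leading
    eigenvectors of S_b - S_w (rows of X are samples)."""
    classes, n = np.unique(y), len(y)
    m = X.mean(axis=0)
    Sb = np.zeros((X.shape[1], X.shape[1]))
    Sw = np.zeros_like(Sb)
    for c in classes:
        Xc = X[y == c]
        p = len(Xc) / n                     # class prior p_i
        mc = Xc.mean(axis=0)
        Sb += p * np.outer(mc - m, mc - m)  # between-class scatter
        Sw += p * np.cov(Xc.T, bias=True)   # within-class scatter
    # maximize tr(W^T (S_b - S_w) W): top eigenvectors of S_b - S_w
    vals, vecs = np.linalg.eigh(Sb - Sw)
    W = vecs[:, np.argsort(vals)[::-1][:d]]
    return X @ W, W
```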

Local Fisher discriminant analysis (LFDA), a linear supervised dimensionality reduction method, was proposed by Sugiyama [66]. LFDA can not only maximize between-class separability and preserve the within-class local manifold structure at the same time in a reduced dimensional space, but also inherits an excellent property from LDA; that is, it has an analytic form of the embedding matrix, and the solution can be easily computed by solving a generalized eigenvalue problem [66]. LFDA and LDA have the same optimization framework; furthermore, LFDA incorporates local information into the definition of the weights. The objective of LDA is to maximize the ratio of the between-class scatter matrix $S_{b}$ to the within-class scatter matrix $S_{w}$:

$$W^{*}=\arg\max_{W}\ \operatorname{tr}\left(\left(W^{T}S_{w}W\right)^{-1}W^{T}S_{b}W\right),\tag{13}$$

where $W$ is a projection matrix, and the definitions of $S_{b}$ and $S_{w}$ are as follows:

$$S_{b}=\sum_{l=1}^{c}n_{l}\left(\mu_{l}-\mu\right)\left(\mu_{l}-\mu\right)^{T},\tag{14}$$

$$S_{w}=\sum_{l=1}^{c}\sum_{i:y_{i}=l}\left(x_{i}-\mu_{l}\right)\left(x_{i}-\mu_{l}\right)^{T},\tag{15}$$

where $n_{l}$ is the number of samples in class *l*, $\mu_{l}$ is the mean of the samples in class *l*, and $\mu=(1/n)\sum_{i=1}^{n}x_{i}$ is the mean of all samples, with *n* the number of samples. According to the literature [66, 73], $S_{b}$ and $S_{w}$ also have the equivalent pairwise forms

$$S_{b}=\frac{1}{2}\sum_{i,j=1}^{n}W_{ij}^{(b)}\left(x_{i}-x_{j}\right)\left(x_{i}-x_{j}\right)^{T}=X\left(D^{(b)}-W^{(b)}\right)X^{T},\tag{16}$$

$$S_{w}=\frac{1}{2}\sum_{i,j=1}^{n}W_{ij}^{(w)}\left(x_{i}-x_{j}\right)\left(x_{i}-x_{j}\right)^{T}=X\left(D^{(w)}-W^{(w)}\right)X^{T},\tag{17}$$

where $W^{(b)}$ and $W^{(w)}$ are the weight matrices, $D^{(b)}$ and $D^{(w)}$ are diagonal matrices, $D_{ii}^{(b)}$ is the *i*th diagonal element of $D^{(b)}$ and equals the sum of the elements of the *i*th row of $W^{(b)}$, and $D_{ii}^{(w)}$ is the *i*th diagonal element of $D^{(w)}$ and equals the sum of the elements of the *i*th row of $W^{(w)}$.

LFDA incorporates local information into the definition of the weights. Thus, $S_{b}$ is replaced by the local $\bar{S}_{b}$ and $S_{w}$ is replaced by the local $\bar{S}_{w}$, which are presented as follows [66]:

$$\bar{S}_{b}=\frac{1}{2}\sum_{i,j=1}^{n}\bar{W}_{ij}^{(b)}\left(x_{i}-x_{j}\right)\left(x_{i}-x_{j}\right)^{T},\qquad \bar{S}_{w}=\frac{1}{2}\sum_{i,j=1}^{n}\bar{W}_{ij}^{(w)}\left(x_{i}-x_{j}\right)\left(x_{i}-x_{j}\right)^{T},\tag{18}$$

where

$$\bar{W}_{ij}^{(b)}=\begin{cases}A_{ij}\left(\dfrac{1}{n}-\dfrac{1}{n_{l}}\right), & y_{i}=y_{j}=l,\\[4pt]\dfrac{1}{n}, & y_{i}\neq y_{j},\end{cases}\tag{19}$$

$$\bar{W}_{ij}^{(w)}=\begin{cases}\dfrac{A_{ij}}{n_{l}}, & y_{i}=y_{j}=l,\\[4pt]0, & y_{i}\neq y_{j},\end{cases}\tag{20}$$

where the affinity $A_{ij}$ can be defined as follows:

$$A_{ij}=\exp\left(-\frac{\left\|x_{i}-x_{j}\right\|^{2}}{\sigma_{i}\sigma_{j}}\right),\tag{21}$$

where $\sigma_{i}$ is the local scaling around $x_{i}$, defined by $\sigma_{i}=\|x_{i}-x_{i}^{(k)}\|$, and $x_{i}^{(k)}$ is the *k*-th nearest neighbor of $x_{i}$. If $x_{i}$ and $x_{j}$ are close to each other in the feature space, $A_{ij}$ is large; otherwise, it is small [66]. According to (21), far-apart sample pairs in the same class are downweighted and have less influence on $\bar{S}_{b}$ and $\bar{S}_{w}$; furthermore, the sample pairs in different classes are not weighted [74].
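The local scaling affinity described above takes only a few lines of NumPy. The default k = 7 below is an assumption following Sugiyama's common heuristic; the source text leaves k unspecified.

```python
import numpy as np

def local_scaling_affinity(X, k=7):
    """Affinity A_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)), where the
    local scaling sigma_i is the distance from x_i to its k-th nearest
    neighbor (rows of X are samples; k=7 is an assumed default)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # column 0 of the sorted distances is the self-distance, so column k
    # is the k-th nearest neighbor distance
    sigma = np.sort(D, axis=1)[:, k]
    return np.exp(-D ** 2 / (sigma[:, None] * sigma[None, :]))
```

The resulting symmetric matrix plugs directly into the weighted scatter-matrix definitions above.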

##### 2.3. Transfer Component Analysis (TCA)

Transfer component analysis (TCA) [75] is a typical feature-based TL method. Given the source domain data that are the training dataset with corresponding labels and the target domain data that are the dataset without corresponding labels, TCA aims to reduce the difference between the marginal distributions of the different datasets by leveraging the transferable features or knowledge from the source domain [75].

A domain $D$ consists of a feature space $X$ and its marginal probability distribution $P(X)$, where $X=\{x_{1},\ldots,x_{n}\}$ is a dataset; a domain can thus be represented as $D=\{X,P(X)\}$. A task $T$ consists of a label space $Y$ and a predictive function $f(\cdot)$, that is, $T=\{Y,f(\cdot)\}$, where $Y=\{y_{1},\ldots,y_{n}\}$ is the label set of the training data and $f(\cdot)$ can be interpreted as the conditional probability distribution $P(Y\mid X)$. There are two learning tasks, namely, the task $T_{S}$ of the source domain $D_{S}$ and the task $T_{T}$ of the target domain $D_{T}$. Feature transfer is employed to facilitate the learning of the target predictive function in $D_{T}$ by using the knowledge and information in $D_{S}$ and $T_{S}$, where $D_{S}\neq D_{T}$ or $T_{S}\neq T_{T}$ [75]. Given two datasets $X_{S}$ and $X_{T}$ with $P(X_{S})\neq P(X_{T})$, a transformation $\phi$ exists such that $P(\phi(X_{S}))\approx P(\phi(X_{T}))$ and $P(Y_{S}\mid\phi(X_{S}))\approx P(Y_{T}\mid\phi(X_{T}))$, where $\phi$ is a nonlinear mapping function in a reproducing kernel Hilbert space $H$. The learning objective of TCA is to find a domain-invariant feature space in which the marginal distribution distance between the source domain and the target domain is minimized. The distribution distance is measured using the maximum mean discrepancy (MMD) criterion, which is defined as follows [76]:

$$\mathrm{Dist}\left(X_{S},X_{T}\right)=\left\|\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\phi\left(x_{S_{i}}\right)-\frac{1}{n_{2}}\sum_{j=1}^{n_{2}}\phi\left(x_{T_{j}}\right)\right\|_{H}^{2}=\operatorname{tr}\left(KL\right),\tag{22}$$

where

$$K=\begin{bmatrix}K_{S,S} & K_{S,T}\\ K_{T,S} & K_{T,T}\end{bmatrix}.\tag{23}$$

In equation (22), $n_{1}$ and $n_{2}$ represent the numbers of source domain samples and target domain samples, respectively, $\operatorname{tr}(\cdot)$ represents the trace of a matrix, $K$ is a kernel matrix, and $K_{S,S}$, $K_{S,T}$ (and $K_{T,S}$), and $K_{T,T}$ are the kernel matrices in the source domain, cross domain, and target domain, respectively. $L$ can be calculated as

$$L_{ij}=\begin{cases}\dfrac{1}{n_{1}^{2}}, & x_{i},x_{j}\in X_{S},\\[6pt]\dfrac{1}{n_{2}^{2}}, & x_{i},x_{j}\in X_{T},\\[6pt]-\dfrac{1}{n_{1}n_{2}}, & \text{otherwise.}\end{cases}\tag{24}$$

TCA maps the features of the two domain datasets into the same kernel space through a unified kernel function. The resultant kernel matrix can be calculated as follows:

$$\widetilde{K}=\left(KK^{-1/2}\widetilde{W}\right)\left(\widetilde{W}^{T}K^{-1/2}K\right)=KWW^{T}K,\tag{25}$$

where $W=K^{-1/2}\widetilde{W}\in\mathbb{R}^{(n_{1}+n_{2})\times m}$, and the distribution distance between the different domain datasets can be defined as

$$\mathrm{Dist}\left(X_{S},X_{T}\right)=\operatorname{tr}\left(\left(KWW^{T}K\right)L\right)=\operatorname{tr}\left(W^{T}KLKW\right).\tag{26}$$

The complexity of $W$ needs to be controlled by a regularization term $\operatorname{tr}(W^{T}W)$, which is employed to avoid the rank deficiency of the denominator. Thus, the objective function of TCA can be rewritten as

$$\min_{W}\ \operatorname{tr}\left(W^{T}KLKW\right)+\mu\operatorname{tr}\left(W^{T}W\right)\quad \text{s.t.}\quad W^{T}KHKW=I_{m},\tag{27}$$

where $\mu$ is a trade-off parameter and the constraint can be used to guarantee that the optimization objective is well defined. $I_{m}$ represents an identity matrix, and $H=I_{n_{1}+n_{2}}-\frac{1}{n_{1}+n_{2}}\mathbf{1}\mathbf{1}^{T}$ is a centering matrix. The constraint $W^{T}KHKW=I_{m}$ avoids the trivial solution $W=0$.

According to the introduction of TCA, the optimization objective of TCA is that the latent space spanned by the learned samples preserves the variance of the data and minimizes the marginal distribution distance between the different domain datasets as much as possible. The optimization problem of equation (27) can be efficiently solved as a trace optimization problem; the solution $W$ consists of the $m$ leading eigenvectors of $\left(KLK+\mu I\right)^{-1}KHK$ [75].
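Put together, TCA reduces to one eigendecomposition. The NumPy sketch below uses an RBF kernel; the kernel choice and its width `gamma`, as well as `mu`, are assumed hyperparameters, and any positive semidefinite kernel would serve.

```python
import numpy as np

def tca(Xs, Xt, dim=2, mu=1.0, gamma=1.0):
    """Transfer component analysis sketch: minimize tr(W^T K L K W) plus a
    regularizer mu*tr(W^T W) subject to W^T K H K W = I, solved via the
    leading eigenvectors of (K L K + mu I)^{-1} K H K."""
    X = np.vstack([Xs, Xt])
    n1, n2 = len(Xs), len(Xt)
    n = n1 + n2
    # RBF kernel matrix over both domains (assumed kernel choice)
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)
    # MMD matrix L: 1/n1^2, 1/n2^2 on the blocks, -1/(n1*n2) across
    e = np.concatenate([np.full(n1, 1 / n1), np.full(n2, -1 / n2)])
    L = np.outer(e, e)
    # centering matrix H
    H = np.eye(n) - np.ones((n, n)) / n
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(A)
    W = vecs[:, np.argsort(-vals.real)[:dim]].real
    Z = K @ W                      # transfer components of all samples
    return Z[:n1], Z[n1:]
```

In the proposed framework, `Xs` and `Xt` would be the sensitive feature sets of the training and testing data, and the returned components feed the downstream feature reduction and classification steps.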

#### 3. Proposed Method and System Framework

##### 3.1. Feature Extraction Procedure TSFRS (Transferable Sensitive Feature Selection by ReliefF and the Sum of Mean Deviation)

TSFRS has two components: (1) the selection of fault-sensitive features, which combines the ReliefF algorithm and the sum of within-class mean deviations (SMD) of feature data, and (2) the feature-based TL method, in which TCA is used to reduce the difference in the marginal distributions between the training and testing data.

In this paper, it is suggested that the sensitive statistical features be selected before the implementation of fault pattern recognition. Thus, the ReliefF algorithm [77] and MD are employed for a dataset that includes different statistical features under various REB conditions. Each type of statistical feature is evaluated by the ReliefF algorithm to determine its weight value (WV). ReliefF, a supervised algorithm for feature ranking, is usually applied in data preprocessing as a feature subset selection method. The basic concept of ReliefF is to select instances at random, find their nearest neighbors, and adjust a feature weighting vector to give more weight to features that discriminate the instances from the neighbors of different classes. For each kind of statistical feature, the MD of the feature data samples in each REB condition can be calculated, and the sum of MD over all REB conditions can be further calculated. Aiming at the evaluation of each statistical feature, the higher the WV, the greater the discriminative degree of the feature class. The lower the value of MD, the greater the class cohesion of the feature. Therefore, the ratio of WV and SMD is selected to indicate the sensitivity of a statistical feature, based on which the sensitive feature subset can be selected from the original feature set.

Furthermore, the variable working conditions of REBs in industry scenarios can lead to distribution differences between the training and testing data [6]. Therefore, after the construction of the sensitive feature subset, TCA is employed to reduce the difference between the distributions of the sensitive feature training and testing subsets. The description of TSFRS is summarized as the following steps.

*Step 1. *In the training samples, there are *M* types of REB faults, *N* vibration signal samples in each type of REB fault pattern, and *K* types of statistical features. Via the processing of the vibration signals, the original feature sets $\{F^{1},F^{2},\ldots,F^{K}\}$ can be obtained, where $F^{k}$ can be expressed by

$$F^{k}=\begin{bmatrix}f_{11}^{k} & f_{12}^{k} & \cdots & f_{1N}^{k}\\ \vdots & \vdots & \ddots & \vdots\\ f_{M1}^{k} & f_{M2}^{k} & \cdots & f_{MN}^{k}\end{bmatrix},\tag{28}$$

where $f_{ij}^{k}$ is the *k*-th statistical feature of the *j*-th sample in the *i*-th type of REB fault. Next, each $F^{k}$ can be evaluated to obtain the corresponding feature weight value $\mathrm{WV}_{k}$ using the ReliefF algorithm, and the WV of each statistical feature can be used to evaluate the distinguishability of the feature. The higher the WV, the greater the discriminative degree of the feature class.

*Step 2. *The MD of the feature samples of each statistical feature in each type of REB condition is calculated, i.e., the MD of the elements of row *i* of $F^{k}$. Therefore, an MD set $\mathrm{MD}^{k}=\{\mathrm{MD}_{1}^{k},\mathrm{MD}_{2}^{k},\ldots,\mathrm{MD}_{M}^{k}\}$ can be obtained, where $\mathrm{MD}_{i}^{k}$ can be expressed by

$$\mathrm{MD}_{i}^{k}=\frac{1}{N}\sum_{j=1}^{N}\left|f_{ij}^{k}-\bar{f}_{i}^{k}\right|,\tag{29}$$

where

$$\bar{f}_{i}^{k}=\frac{1}{N}\sum_{j=1}^{N}f_{ij}^{k}.\tag{30}$$

Next, $\mathrm{SMD}_{k}$, that is, the sum of the MD of the feature samples of the *k*-th statistical feature over all cases of REB conditions, can be obtained; $\mathrm{SMD}_{k}$ can be expressed by

$$\mathrm{SMD}_{k}=\sum_{i=1}^{M}\mathrm{MD}_{i}^{k}.\tag{31}$$

In this paper, it is presumed that the MD can be used to express the cohesion of data. Thus, there is a mean deviation sequence $\{\mathrm{SMD}_{1},\mathrm{SMD}_{2},\ldots,\mathrm{SMD}_{K}\}$, which becomes another evaluation index for sensitive feature selection. The lower the value of $\mathrm{SMD}_{k}$, the greater the class cohesion of the feature.

*Step 3. *A new sequence $\mathrm{WSD}=\{\mathrm{WSD}_{1},\mathrm{WSD}_{2},\ldots,\mathrm{WSD}_{K}\}$ is obtained, where $\mathrm{WSD}_{k}$ is defined as follows:

$$\mathrm{WSD}_{k}=\frac{\mathrm{WV}_{k}}{\mathrm{SMD}_{k}}.\tag{32}$$

In this paper, it is presumed that the greater the value of $\mathrm{WSD}_{k}$, the better the fault sensitivity of the corresponding feature. Therefore, the sorted ratio sequence of WV and SMD can be obtained by sorting the WSD in descending order.

*Step 4. *For the labeled training data under one working condition and unlabeled testing data under another working condition, based on the training data, the sorted sequence WSD in descending order is acquired and is used to select the most sensitive statistical features that can construct a sensitive feature set (SFS). The most sensitive statistical features will be directly applied to the extraction of features for the testing data. Thus, two sensitive feature sets can be obtained; the first is the SFS of the training data, called $\mathrm{SFS}_{S}$, and the other is the SFS of the testing data, called $\mathrm{SFS}_{T}$. Furthermore, $\mathrm{SFS}_{S}$ and $\mathrm{SFS}_{T}$ are used as the input of TCA, and a new feature set, in which the difference in marginal distributions between $\mathrm{SFS}_{S}$ and $\mathrm{SFS}_{T}$ is minimized, can be generated.
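Steps 1–3 can be sketched as follows. The Relief weight here is a simplified single-nearest-neighbor variant rather than the full multi-neighbor ReliefF algorithm, and the per-feature normalization is an assumed preprocessing choice; the sketch illustrates how the ratio of weight value to summed mean deviation ranks the features.

```python
import numpy as np

def relief_weights(F, y):
    """Simplified Relief weights: reward features that differ from the
    nearest sample of another class (nearest miss) and penalize differences
    from the nearest same-class sample (nearest hit).  Rows of F are samples,
    columns are statistical features."""
    n, _ = F.shape
    Fn = (F - F.min(0)) / (np.ptp(F, axis=0) + 1e-12)  # per-feature scaling
    w = np.zeros(F.shape[1])
    for i in range(n):
        d = np.abs(Fn - Fn[i]).sum(axis=1)
        d[i] = np.inf                                  # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        w += np.abs(Fn[miss] - Fn[i]) - np.abs(Fn[hit] - Fn[i])
    return w / n

def tsfrs_rank(F, y):
    """Rank features by WSD_k = WV_k / SMD_k, in descending order."""
    wv = relief_weights(F, y)
    smd = np.zeros(F.shape[1])
    for c in np.unique(y):
        Fc = F[y == c]
        smd += np.abs(Fc - Fc.mean(0)).mean(0)  # mean deviation per condition
    return np.argsort(-(wv / (smd + 1e-12)))
```

The leading indices of the returned ranking would form the sensitive feature subset passed on to TCA.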

##### 3.2. Local Maximum Margin Criterion (LMMC)

Although the MMC can avoid the SSS problem of LDA, it may be invalid for nonlinear datasets because it does not consider the local structure of the dataset. Conversely, LFDA considers the neighbor relationships between samples of the same class while ignoring those between samples of different classes. To address these limitations, and inspired by the attributes of the MMC and LFDA, this paper proposes a novel feature reduction method, the LMMC, an improved MMC. The LMMC naturally inherits the merits of the MMC and LFDA: the optimization objective of LFDA is integrated into the MMC, and the neighbor relationships between samples of different classes are additionally taken into consideration.

Based on the descriptions of the MMC and LFDA provided in Section 2, the optimization objective of the LMMC can be obtained by combining the optimization objectives of the MMC and LFDA; in addition, the LMMC incorporates local information into the definition of the weights. The LMMC and MMC share the same optimization framework, but the scatter matrices $S_b$ and $S_w$ are, respectively, replaced by their local counterparts $\tilde{S}_b$ and $\tilde{S}_w$. The objective function can be presented as follows:
$$J = \operatorname{tr}\left( \tilde{S}_b - \tilde{S}_w \right).$$

According to equations (16) and (17), the local $\tilde{S}_b$ and the local $\tilde{S}_w$ are defined as follows:
$$\tilde{S}_b = X\left( D^{b} - W^{b} \right) X^{\top}, \qquad \tilde{S}_w = X\left( D^{w} - W^{w} \right) X^{\top},$$
where
$$D^{b}_{ii} = \sum_{j} W^{b}_{ij}, \qquad D^{w}_{ii} = \sum_{j} W^{w}_{ij},$$
and where $W^{b}$ and $W^{w}$ are the weight matrices and $D^{b}$ and $D^{w}$ are the corresponding diagonal matrices. In $W^{b}$, a nonzero entry $W^{b}_{ij}$ means that sample *j* is a nearest neighbor of sample *i* and the two samples belong to different classes.

According to equations (30)–(37), the local structure of the dataset, including the neighbor relationships between samples of the same class and those between samples of different classes, can be incorporated into the dimensionality reduction by modifying the weight matrices.

Let $W$ be a linear transformation that maps the high-dimensional dataset from $\mathbb{R}^{D}$ to $\mathbb{R}^{L}$, where $L < D$. Thus, in the lower-dimensional space, the scatter matrices, respectively, become $W^{\top}\tilde{S}_b W$ and $W^{\top}\tilde{S}_w W$, where $W$ can be determined by maximizing
$$J(W) = \operatorname{tr}\!\left( W^{\top}\left( \tilde{S}_b - \tilde{S}_w \right) W \right).$$

It is assumed that $W$ is constituted by unit vectors, that is, $W = \left[ w_1, w_2, \dots, w_L \right]$ and $w_l^{\top} w_l = 1$, $l = 1, 2, \dots, L$. Thus, $W$ can be obtained by solving the following constrained optimization:
$$\max \sum_{l=1}^{L} w_l^{\top}\left( \tilde{S}_b - \tilde{S}_w \right) w_l \quad \text{subject to}\quad w_l^{\top} w_l = 1,\ l = 1, \dots, L.$$

The above constrained optimization can be transformed into the eigenvalue problem
$$\left( \tilde{S}_b - \tilde{S}_w \right) w = \lambda w.$$

Thus,
$$W = \left[ w_1, w_2, \dots, w_L \right],$$
where $w_1, \dots, w_L$ are the eigenvectors of $\tilde{S}_b - \tilde{S}_w$ associated with its eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_L$.

According to equation (41), $W$ is composed of the eigenvectors of $\tilde{S}_b - \tilde{S}_w$ corresponding to the first *L* largest nonnegative eigenvalues. Finally, by applying the LMMC, the low-dimensional feature matrices of the training and testing datasets can be obtained with more sensitive and less redundant information for REB fault diagnosis.
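The LMMC projection can be sketched in a few lines of numpy. This is an illustrative implementation under stated assumptions: the binary 0/1 k-nearest-neighbor weights and the graph-Laplacian form of the local scatters follow the description above, but the paper's exact weighting scheme may differ.

```python
import numpy as np

def lmmc_projection(X, y, n_components=2, k=3):
    """Sketch of LMMC: kNN weight matrices for same-class and different-class
    pairs, graph-Laplacian local scatters, top eigenvectors of S_b - S_w."""
    n, d = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                # exclude self from neighbors
    knn = np.argsort(d2, axis=1)[:, :k]
    Ww, Wb = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        for j in knn[i]:
            if y[i] == y[j]:
                Ww[i, j] = Ww[j, i] = 1.0       # neighbors of the same class
            else:
                Wb[i, j] = Wb[j, i] = 1.0       # neighbors of different classes
    Sw = X.T @ (np.diag(Ww.sum(1)) - Ww) @ X    # local within-class scatter
    Sb = X.T @ (np.diag(Wb.sum(1)) - Wb) @ X    # local between-class scatter
    vals, vecs = np.linalg.eigh(Sb - Sw)        # symmetric matrix -> eigh
    return vecs[:, np.argsort(vals)[::-1][:n_components]]

# fit the projection on training data, then reuse the same W for testing data
rng = np.random.default_rng(0)
Xtr = np.r_[rng.normal(0, 1, (30, 5)), rng.normal(4, 1, (30, 5))]
ytr = np.r_[np.zeros(30), np.ones(30)]
W = lmmc_projection(Xtr, ytr, n_components=2)
Ztr = Xtr @ W
```

As in the paper, the projection matrix `W` is computed only from the training set and then applied unchanged to the testing features (`Zte = Xte @ W`).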

##### 3.3. System Framework

The implementation of the proposed fault diagnosis framework is presented in Figure 1, in which the statistical analysis and artificial intelligence approaches are systematically blended to detect and diagnose REB faults under different working conditions. The entire fault diagnosis procedure is divided into four steps, namely, signal processing, feature extraction, feature reduction, and fault pattern recognition.

In the signal processing step, vibration signals collected from REBs under different working conditions are decomposed into different wavelet packet nodes by DTCWPT, and the single-branch reconstruction signals of the terminal nodes are employed to extract statistical features. In the feature extraction step, the proposed TSFRS selects the most sensitive statistical features based on the training dataset to construct a sensitive feature subset for training the classifier, and these selected features are directly applied to the extraction of features from the testing dataset. The sensitive feature subsets of the training and testing datasets are used as the source domain data and the target domain data, respectively, and TCA is used to reduce the difference between their marginal distributions. In the feature reduction step, the low-dimensional training feature space is acquired by the proposed LMMC, which generates a projection that can be directly used for the dimensionality reduction of the testing feature dataset; thus, the low-dimensional testing feature dataset is obtained. The WSD and the projection matrix *W* are both computed from the training dataset and then directly reused for the testing dataset. In the final step, the low-dimensional training feature dataset, together with the corresponding fault labels, is employed to train the classifier; in this paper, the support vector machine (SVM) is used as the fault pattern recognition classifier. The trained classifier then conducts fault pattern recognition on the low-dimensional testing feature dataset, and the procedure outputs the fault identification and classification accuracy.
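The TCA step of the framework can be sketched as follows. This is a minimal linear-kernel TCA under stated assumptions (the function and parameter names are illustrative, and the paper may use a different kernel and regularization): it finds transfer components that reduce the maximum mean discrepancy (MMD) between the source (training) and target (testing) feature sets.

```python
import numpy as np

def tca(Xs, Xt, dim=2, mu=1.0):
    """Linear-kernel TCA sketch: learn components that minimize the MMD
    between source and target; mu is a ridge regularizer."""
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    X = np.vstack([Xs, Xt])
    e = np.r_[np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)]
    L = np.outer(e, e)                       # MMD matrix
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    K = X @ X.T                              # linear kernel matrix
    # solve (K L K + mu I)^{-1} K H K and keep the leading eigenvectors
    J = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(J)
    W = np.real(vecs[:, np.argsort(-np.real(vals))[:dim]])
    Z = K @ W                                # transformed features
    return Z[:ns], Z[ns:]

rng = np.random.default_rng(2)
Zs, Zt = tca(rng.normal(0, 1, (20, 6)), rng.normal(1, 1, (15, 6)), dim=2)
```

In the framework, `Xs` and `Xt` would be the sensitive feature subsets of the training and testing datasets produced by TSFRS.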

#### 4. Experiments and Analysis Results

##### 4.1. Experiments Based on Experimental Test Rig 1

###### 4.1.1. Experimental Setup and Cases

The REB vibration data from Case Western Reserve University (CWRU) [78], which reproduce several fault scenarios, were used to verify the effectiveness of the proposed methods. The experimental test rig is presented in Figure 2; it was composed of an electric motor (left), a torque transducer/encoder (center), a dynamometer (right), and control circuitry (not shown). An SKF6205-2RS deep-groove REB was used in the test rig, and electro-discharge machining was employed to seed single-point defects with different fault diameters, namely, 0.007, 0.014, 0.021, and 0.028 inches. The collected vibration signals of the REBs consisted of inner race fault signals, ball fault signals, outer race fault signals, and normal signals. The test rig supported motor loads of 0–3 horsepower (hp), with corresponding motor speeds ranging from 1730 rpm (3 hp) to 1797 rpm (0 hp). Accelerometers were placed at the 12 o'clock position of the bearing housings, and the sampling frequency was 12 kHz for the drive-end and fan-end bearings.

In order to verify the effectiveness, adaptability, and practical value of the proposed bearing fault diagnosis framework under different working conditions, vibration signals of different fault types and diameters under different motor loads are employed. The signal samples under 2 hp and 3 hp loads are applied in the experiments, and there are four bearing conditions (normal, ball fault, inner race fault, and outer race fault). The ball fault and inner race fault each have four fault diameters (0.007, 0.014, 0.021, and 0.028 inches), and the outer race fault has three (0.007, 0.014, and 0.021 inches). Therefore, there are 12 bearing conditions, corresponding to 12 patterns for fault diagnosis. The bearing vibration signals are divided into several data segments, each of which is used as a sample with 2000 data points. Each bearing condition contains 60 samples, among which 40 random samples are selected as testing samples and 20 random samples as training samples. Based on these samples, two groups of datasets are used in the experiments. The first group includes cases 1 and 2, in which the samples of 2 hp are used as training samples; in case 1, the samples of 2 hp are used as testing samples, and in case 2, the samples of 3 hp are used as testing samples. The second group includes cases 3 and 4, in which the samples of 3 hp are used as training samples; in case 3, the samples of 3 hp are used as testing samples, and in case 4, the samples of 2 hp are used as testing samples. The detailed information of the two groups of experimental datasets is given in Tables 2 and 3, respectively.
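The segmentation and train/test split described above can be sketched directly (the function name is illustrative; the random record stands in for a real vibration signal):

```python
import numpy as np

def make_samples(signal, n_points=2000, n_samples=60, n_train=20, seed=0):
    """Slice one long vibration record into fixed-length samples and draw a
    random 20-train / 40-test split for a single bearing condition, following
    the sample sizes described in Section 4.1.1."""
    segs = signal[: n_points * n_samples].reshape(n_samples, n_points)
    idx = np.random.default_rng(seed).permutation(n_samples)
    return segs[idx[:n_train]], segs[idx[n_train:]]

# stand-in for one bearing condition's record (e.g., 2 hp, inner race fault)
train, test = make_samples(np.random.default_rng(3).normal(size=150_000))
```

Repeating this per bearing condition, with training and testing drawn from different loads, reproduces the cross-condition cases 1–4.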

###### 4.1.2. Analysis Results

According to the diagnosis framework shown in Figure 1, each sample is decomposed into different wavelet packet nodes by DTCWPT with a decomposition level of 4. Thus, 16 terminal nodes, namely, subband signals, can be obtained, from which 16 single-branch reconstruction signals are derived, and these are used to generate 16 Hilbert envelope spectra (HES). Using the 6 statistical parameters shown in Table 4, each single-branch reconstruction signal and each HES generate 6 statistical features; thus, the 16 single-branch reconstruction signals and 16 HES generate 192 statistical features, which compose the original feature set (OFS). Then, the TSFRS is performed to select the sensitive statistical features and to reduce the difference in distribution between the training and testing sensitive feature subsets. The WV, SMD, and WSD of the 192 statistical features of the training samples (2 hp) are presented in Figures 3–5, respectively. In Figure 4, the horizontal axis represents the index of the statistical features: features 1–6, 7–12, …, 85–90, and 91–96 are the time-domain features of the single-branch reconstruction signals of terminal wavelet packet nodes 1–16, respectively, and features 97–102, 103–108, …, 181–186, and 187–192 are the corresponding HES features. After the TSFRS procedure, the feature reduction method LMMC is performed to obtain a low-dimensional feature set, which is used as the input of the SVM.
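The 192-feature bookkeeping (16 subbands × 6 time-domain statistics + 16 HES × 6 statistics) can be checked with a short sketch. Two assumptions are made here: the six statistical parameters below (mean, standard deviation, RMS, peak, skewness, kurtosis) stand in for the paper's Table 4, and random subband signals stand in for the actual DTCWPT reconstructions, which are not implemented.

```python
import numpy as np

def envelope(x):
    """Hilbert envelope via the FFT analytic-signal construction."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * h))

def stats6(x):
    """Six statistical parameters (an assumed set, not the paper's Table 4)."""
    m, s = x.mean(), x.std()
    return [m, s, np.sqrt((x ** 2).mean()), np.abs(x).max(),
            ((x - m) ** 3).mean() / (s ** 3 + 1e-12),
            ((x - m) ** 4).mean() / (s ** 4 + 1e-12)]

def sample_features(subbands):
    """192 features per sample: 6 time-domain stats plus 6 Hilbert-envelope-
    spectrum (HES) stats for each of the 16 reconstructed subband signals."""
    feats = []
    for sb in subbands:
        feats += stats6(sb)                                 # time-domain features
        feats += stats6(np.abs(np.fft.rfft(envelope(sb))))  # HES features
    return np.array(feats)

subbands = [np.random.default_rng(i).normal(size=2000) for i in range(16)]
feat = sample_features(subbands)
```

Stacking `sample_features` over all samples yields the OFS matrix that TSFRS then scores and prunes.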

In order to verify the effectiveness of the proposed TSFRS and LMMC, two groups of comparative experiments are performed. In addition, WPT is also used in the experiments, and its results are compared with those of DTCWPT to verify the superiority of DTCWPT. In this paper, the training dataset is employed to train the fault diagnosis model, the testing dataset is employed to test it, and the accuracy results presented in the tables and figures are the average diagnostic accuracies over the 12 bearing conditions. The detailed experimental analysis is described as follows.

In the first group of experiments, the TSFRS is not applied; the OFS containing 192 statistical features is directly processed by several dimensionality reduction methods, namely, PCA, LDA, MMC, LFDA, and LMMC. OFS-SVM is a diagnosis model based on SVM, in which the OFS is used as the input of the SVM; OFS-PCA/LDA/LFDA/MMC/LMMC-SVM are also SVM-based diagnosis models that, respectively, use PCA, LDA, LFDA, MMC, and LMMC. According to the results shown in Tables 5–10, for cases 1 and 2, the performance of each model using DTCWPT is better than that of the corresponding model using WPT; for cases 3 and 4, the OFS-SVM, OFS-LFDA-SVM, and OFS-LMMC-SVM models using DTCWPT outperform their WPT counterparts, and for the OFS-PCA-SVM, OFS-LDA-SVM, and OFS-MMC-SVM models, DTCWPT yields better results than WPT in case 4. In general, DTCWPT offers clear advantages over WPT.

The detailed experimental results of all models using DTCWPT are presented below. For the testing set of case 1, all models obtain preferable diagnosis accuracy: the maximum accuracy of each model exceeds 98%, and the highest accuracy, obtained by OFS-LMMC-SVM, reaches 100%. For the testing set of case 2, the working condition differs from that of the training set; the diagnosis accuracy of OFS-SVM only attains 83.33%, and compared with OFS-SVM, all other models show an enhancement in diagnosis accuracy. Moreover, the performance of OFS-LMMC-SVM is better than that of the other models, and its highest accuracy attains 93.75% when the dimension size is 11. For the testing set of case 3, the maximum accuracy of each model exceeds 96%, and the highest accuracy, again obtained by OFS-LMMC-SVM, reaches 100%. For the testing set of case 4, the diagnosis accuracy of OFS-SVM only attains 78.54%, whereas the models using PCA, LDA, MMC, and LMMC each show an obvious enhancement in diagnosis accuracy; the highest accuracy, 92.08%, is obtained by OFS-LMMC-SVM. According to the experimental results of the four testing cases under the various models, it is evident that the fault diagnosis model using LMMC achieves preferable diagnosis performance.

In the second group of experiments, the TSFRS is applied before feature reduction and fault pattern recognition. OFS-TSFRS-SVM is an SVM-based diagnosis model in which the most sensitive features are selected from the OFS according to the WSD, and TCA is employed to reduce the difference in distribution between the training and testing sensitive feature subsets. OFS-TSFRS-PCA/LDA/LFDA/MMC/LMMC-SVM are also SVM-based diagnosis models that, respectively, use PCA, LDA, LFDA, MMC, and LMMC. According to Tables 11–16 and Figures 6–19, the detailed experimental results of all models using DTCWPT are presented in the following.

For the testing set of case 1, all models achieve preferable performance in terms of diagnosis accuracy: the maximum accuracy of each model exceeds 98%, and the highest accuracy, 100%, is obtained by both OFS-TSFRS-LFDA-SVM and OFS-TSFRS-LMMC-SVM. For the testing set of case 2, compared with the experimental results of the first group, the diagnosis accuracies of all models using TSFRS are enhanced. Among the models mentioned above, OFS-TSFRS-LFDA-SVM and OFS-TSFRS-LMMC-SVM perform better than OFS-TSFRS-SVM, OFS-TSFRS-PCA-SVM, OFS-TSFRS-LDA-SVM, and OFS-TSFRS-MMC-SVM; the maximum accuracies of both exceed 98%, but only OFS-TSFRS-LMMC-SVM attains 100%. For the testing set of case 3, the maximum accuracy of each model exceeds 96%, and OFS-TSFRS-LMMC-SVM achieves 100% fault diagnosis accuracy, higher than that of the other models. For the testing set of case 4, compared with the first group, the diagnosis accuracies of all models using TSFRS are again enhanced; the performance of OFS-TSFRS-LMMC-SVM is the best among the models, and its highest diagnosis accuracy attains 99.79% when the sfn (selected feature number) is 140.

According to the experimental results of the second group, selecting a suitable parameter sfn yields a desirable improvement in the performance of the fault diagnosis model, which then attains a preferable diagnosis accuracy. According to Figures 6–19, it is evident that the fault diagnosis model attains better diagnosis performance when a suitable sfn is selected. For example, for the testing sets of cases 1, 2, and 3, the diagnosis accuracy of OFS-TSFRS-LMMC-SVM attains 100% when the sfn is between 35 and 42, and for the testing set of case 4, its diagnosis accuracy attains 99.79% when the sfn is between 137 and 146. In summary, the effectiveness of the proposed TSFRS and LMMC is demonstrated, and a diagnosis model trained by the proposed diagnosis framework using the training data of a single working condition can achieve a desirable accuracy on a testing set collected from bearings under a different working condition.

##### 4.2. Experiments Based on the Experimental Test Rig 2

###### 4.2.1. Experimental Setup and Cases

In order to further validate the effectiveness and adaptability of the proposed methods, the SQI-MFS test rig [42] is used to collect bearing vibration signals for fault diagnosis. The experimental test rig is shown in Figure 20, and the corresponding bearings with different fault types are shown in Figure 21. Laser machining was employed to seed single-point defects with different fault diameters, namely, 0.05, 0.1, and 0.2 mm. The collected vibration signals of the bearings consist of inner race fault signals, ball fault signals, outer race fault signals, and normal signals. The test rig supports motor speeds of 1200 rpm and 1800 rpm, two accelerometers are used to collect the vibration signals, and the sampling frequency is 16 kHz.

For the experimental data of the SQI-MFS test rig, the signal samples at 1200 rpm and 1800 rpm are applied in the experiments, and there are four bearing conditions (normal, ball fault, inner race fault, and outer race fault). The ball fault, inner race fault, and outer race fault each have three fault diameters (0.05, 0.1, and 0.2 mm). Therefore, there are 10 bearing conditions, corresponding to 10 patterns for fault diagnosis. The bearing vibration signals are divided into several data segments, each of which is used as a sample with 5000 data points. Each bearing condition contains 60 samples, among which 40 random samples are selected as testing samples and 20 random samples as training samples. Based on these samples, two groups of datasets are used in the experiments. The first group includes cases 1 and 2, in which the samples of 1800 rpm are used as training samples; in case 1, the samples of 1800 rpm are used as testing samples, and in case 2, the samples of 1200 rpm are used as testing samples. The second group includes cases 3 and 4, in which the samples of 1200 rpm are used as training samples; in case 3, the samples of 1200 rpm are used as testing samples, and in case 4, the samples of 1800 rpm are used as testing samples. The detailed information of the two groups of experimental datasets is given in Tables 17 and 18, respectively.

###### 4.2.2. Analysis Results

The experimental procedure is the same as that for test rig 1: each sample is decomposed into different wavelet packet nodes by DTCWPT with a decomposition level of 4. Thus, 16 terminal nodes, namely, subband signals, can be obtained, and the 16 single-branch reconstruction signals of the terminal nodes are used to generate 16 HES. Using the 6 statistical parameters shown in Table 4, each single-branch reconstruction signal and each HES generate 6 statistical features; thus, the 16 single-branch reconstruction signals and 16 HES generate the 192 statistical features that compose the OFS. Then, the TSFRS is performed to select the sensitive statistical features and to reduce the difference in distribution between the training and testing sensitive feature subsets. The WV, SMD, and WSD of the 192 statistical features of the training samples are presented in Figures 22–24, respectively.

In order to verify the adaptability and effectiveness of the fault diagnosis model using the proposed methods, OFS-LMMC-SVM and OFS-TSFRS-LMMC-SVM are employed in the experiments. The experimental results of OFS-LMMC-SVM are presented in Table 19: for the testing sets of cases 1 and 3, the maximum accuracy attains 99.67% and 99.17%, respectively, whereas for the testing sets of cases 2 and 4, it only attains 67.67% and 59.50%. When the TSFRS is applied, according to the experimental results of OFS-TSFRS-LMMC-SVM shown in Table 20, it is evident that the diagnosis accuracies are enhanced by the use of TSFRS, and OFS-TSFRS-LMMC-SVM achieves desirable performance: for the testing sets of cases 1, 2, 3, and 4, the maximum accuracies attain 100.00%, 89.00%, 100.00%, and 84.33%, respectively. The curve representation of the experimental results of OFS-TSFRS-LMMC-SVM is shown in Figure 25. For comparison, the diagnosis results of OFS-TSFRS-SVM, OFS-TSFRS-PCA-SVM, OFS-TSFRS-LDA-SVM, OFS-TSFRS-LFDA-SVM, and OFS-TSFRS-MMC-SVM are shown in Figure 26. According to the experimental results, especially for the testing sets of cases 2 and 4, it is evident that the performance of OFS-TSFRS-LMMC-SVM is better than that of the other models when a suitable sfn is selected. In summary, the effectiveness of the proposed TSFRS and LMMC is further demonstrated, and the adaptability of the proposed diagnosis framework is also verified using four testing sets collected from different bearings under different working conditions.

#### 5. Conclusions

Due to the harsh environments and variable working conditions in real industrial scenarios, data-driven fault diagnosis using traditional machine-learning methods yields limited model performance across different working conditions, in which the distributions of the training and testing data differ. To address this problem, this paper proposed a novel intelligent fault diagnosis framework for REBs under different working conditions that systematically blends statistical analysis with artificial intelligence. In this framework, DTCWPT is used to process the raw vibration signals and extract statistical features; a new feature extraction method, TSFRS, is used to select the most sensitive features to form a sensitive feature subset and to reduce the difference in distribution between the training and testing sensitive feature subsets; a modified MMC, namely, the LMMC, is used as the feature dimensionality reduction method; and SVM is used as the automated fault pattern recognition classifier. Compared with the other dimensionality reduction methods, the advantages of the proposed LMMC are demonstrated. Finally, experimental datasets collected from two test rigs, containing samples of different bearing fault conditions such as ball faults, inner race faults, and outer race faults at different defect diameters, are used to validate the framework.

According to the experimental results, the proposed methods for REB fault diagnosis have great potential to be beneficial in industrial applications. For experimental test rig 1, a set of comparative cases, namely, cases 1, 2, 3, and 4, is employed. Cases 1 and 2 use the samples under a 2 hp motor load as the training sets, with the samples under 2 hp and 3 hp, respectively, as the testing sets; cases 3 and 4 use the samples under a 3 hp motor load as the training sets, with the samples under 3 hp and 2 hp, respectively, as the testing sets. The experimental results indicate that the fault diagnosis model using the proposed methods achieves desirable performance when a suitable parameter sfn is selected, and the maximum diagnosis accuracies of cases 1, 2, 3, and 4 attain 100%, 100%, 100%, and 99.79%, respectively. Experimental test rig 2 is employed to further verify the adaptability and effectiveness of the diagnosis model using the proposed methods; the experimental results show that the proposed methods help the diagnosis model achieve preferable diagnosis accuracy and, at the same time, demonstrate its desirable adaptability.

#### Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

This work was funded by the Special Funds Project for Transforming Scientific and Technological Achievements in Jiangsu Province (BA2016017) and the National Key R&D Program of China (2017YFC0804400 and 2017YFC0804401).