Intelligent fault diagnosis methods based on deep learning have made considerable progress in recent years. However, two major factors seriously degrade the performance of these algorithms in real industrial applications: limited labeled training data and complex working conditions. To solve these problems, this study proposed a domain generalization-based hybrid matching network (HMN), which uses a matching network to diagnose faults from features encoded by an autoencoder. The main idea was to regularize the feature extractor of the network with an autoencoder in order to reduce the risk of overfitting with limited training samples. In addition, a training strategy using dropout on the inputs with randomly changing rates was implemented to enhance the model’s generalization to unseen domains. The proposed method was validated on two different datasets containing artificial and real faults. The results showed that the proposed method outperformed the compared methods on cross-domain tasks with limited training samples.

1. Introduction

Mechanical fault diagnosis plays a significant role in modern industry. Machine failures are likely to result in the collapse of an entire mechanical system and production line downtime, as well as serious economic losses. Timely and accurate fault diagnosis has become an indispensable technology in modern industries to ensure the safe and reliable operation of mechanical systems [1–3].

Recently, deep learning has achieved considerable progress in computer vision [4, 5], speech and natural language processing [6], product defect detection [7], and road planning [8]. As expected, an increasing number of researchers have applied deep learning techniques to fault diagnosis and proposed intelligent fault diagnosis methods [9–16]. Hasan et al. [17] proposed an explainable AI-based model for bearing fault diagnosis. Sun et al. [18] developed a sparse autoencoder-based deep neural network for the fault diagnosis of induction motors, which realized accurate fault prediction. Li et al. [19] designed a two-layer Boltzmann machine to develop representations of the statistical parameters of wavelet packet transform for gearbox fault diagnosis. Ding et al. [20] applied a deep convolutional neural network (CNN) by using wavelet packet energy as the input to develop a bearing fault diagnosis system, with which they obtained reasonable fault detection performance. Zhang et al. [21] proposed a method based on deep learning that uses raw temporal signals as input, which achieved high accuracy under noisy conditions. Qiao et al. [22] built a dual-input model based on a CNN and a long short-term memory neural network and achieved satisfactory antinoise and load adaptability. Deep learning methods have replaced the traditional time-consuming and unreliable manual analysis, considerably improving the efficiency of fault diagnosis [23–28].

Traditional deep learning methods can only achieve satisfactory results when the training set (source domain) and the test set (target domain) follow the same data distribution. In practical applications, however, due to the complexity of the working conditions of the mechanical system (load, motor speed, etc.), the training set and the test set may have distinct distributions. The predictive performance of deep learning models is greatly affected by this distribution shift. To address this challenge, some transfer learning algorithms have been proposed to enhance the domain adaptability of the model. Zhang et al. [21] presented a novel algorithm based on deep learning to alleviate the degradation of the performance of intelligent fault diagnosis under noisy environments and different working loads. Yao et al. [29] designed a new model based on a Stacked Inverted Residual Convolution Neural Network to ensure the accuracy of the model in noisy environments. Hu et al. [30] proposed a data augmentation algorithm and presented a self-adaptive neural network to boost models’ generalization ability. Lu and Yin [31] developed a transferable common feature space mining algorithm to extract the common features from multidomain data. Wu et al. [32] constructed a few-shot transfer learning method for variable conditions. Wei et al. [33] proposed multiple source domain adaptation methods to extract condition-invariant features for fault diagnosis.

Aside from the obstacle posed by cross-domain tasks, a limited training set is another challenge that restricts the practical application of deep learning fault diagnosis algorithms. Most of the deep learning methods require a large amount of labeled data for model training. However, in actual industrial application scenarios, collecting a huge amount of labeled data for every type of failure under each working condition poses a considerable challenge. To address this problem, some studies on mechanical fault diagnosis using limited labeled training data have been conducted. Wang et al. [34] presented an integrated fault prognosis and diagnosis method for the predictive maintenance of turbine bearings, which achieved reasonable performance under limited labeled data. Zhang et al. [35] applied the few-shot approach for fault diagnosis and designed an artificial neural network based on a Siamese network, achieving interesting results with limited data. Li et al. [36] designed a meta-learning fault diagnosis method (MLFD) framework using model-agnostic meta-learning, which has performed excellently under complex working conditions. Hang et al. [37] applied a two-step clustering algorithm and principal component analysis to improve classification performance in the case of unbalanced high-dimensional data. Li et al. [38] proposed a deep, balanced domain adaptation neural network, which achieved satisfactory results with limited labeled data. Duan et al. [39] proposed a novel data description support vector based on deep learning for unbalanced datasets.

Good progress has been made separately in two important research directions of fault diagnosis: improving the model’s generalization to new domains and improving performance under limited training samples. However, studies combining these two directions are relatively rare. In this study, to achieve domain generalization under limited training samples, we proposed a hybrid matching network (HMN), designed by connecting a prototypical network to the bottleneck of an autoencoder, for fault diagnosis on unseen domains with limited training samples.

Our model mainly consists of two parts: (1) the autoencoder, which regularizes the feature extractor of the model to reduce the risk of overfitting, and (2) the matching network, which measures the similarity between samples. In addition, a novel strategy is implemented in the training process to improve the model’s domain generalization.

The main contributions of this study can be summarized as follows:
(1) A novel fault diagnosis method based on a matching network and an autoencoder, known as HMN, was proposed to address cross-domain scenarios. In these tasks, the model was trained on the source domain with limited data and tested on unseen target domains without access to their distributions.
(2) Dropout on the input layer with randomly changing rates was employed to improve the generalization ability of the model. An autoencoder was built to reduce the risk of model overfitting with limited training samples by regularizing the feature extractor of the network.
(3) The proposed algorithm can effectively cope with domain generalization (DG) fault diagnosis. Comprehensive experiments were designed and executed to demonstrate the effectiveness of the proposed HMN on two bearing fault datasets containing artificial and real faults.

The rest of the paper is organized as follows. Autoencoder and prototypical networks are introduced in Section 2. Section 3 describes the proposed method in detail. Section 4 presents the experiments, results, and discussion. Finally, the conclusions are drawn in Section 5.

2. Autoencoder and Prototypical Network

2.1. Autoencoder

An autoencoder is an unsupervised learning method that uses a neural network to perform representation learning. Specifically, the network architecture imposes a bottleneck layer that forces a compressed representation of the original input.

As shown in Figure 1, the autoencoder is mainly composed of two parts: an encoder and a decoder. The encoder function, which is denoted as $f_\theta$, enables the efficient computation of a feature vector $h = f_\theta(x)$ from an input vector $x$. It is important to note that the dimension of $h$ is usually lower than the dimension of $x$. Another parameterized function $g_{\theta'}$, known as the decoder, maps the feature vector back to the input space, generating a reconstruction vector $\hat{x} = g_{\theta'}(h)$.

A simplified autoencoder structure can be represented as a fully connected neural network with three layers, i.e., an input layer, a bottleneck layer, and an output layer. The parameter sets of the encoder and the decoder are trained simultaneously on the task of reconstructing the input as closely as possible, i.e., minimizing the reconstruction error, which is usually described by the mean squared error (MSE) over the training examples. For a training set $\{x^{(1)}, \ldots, x^{(N)}\}$, the MSE reconstruction error is expressed as follows:

$$L_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left\| x^{(i)} - \hat{x}^{(i)} \right\|^2 \tag{1}$$

If the input is normalized to $[0, 1]$, the cost function can be described as binary cross-entropy, which takes the form below:

$$L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{d} \left[ x_j^{(i)} \log \hat{x}_j^{(i)} + \left(1 - x_j^{(i)}\right) \log\left(1 - \hat{x}_j^{(i)}\right) \right] \tag{2}$$

where $x_j^{(i)}$ and $\hat{x}_j^{(i)}$ represent the $j$th element of $x^{(i)}$ and $\hat{x}^{(i)}$, respectively, and $N$ and $d$ represent the batch size and the dimension of $x$, respectively.

Because the network is penalized according to its reconstruction error, it learns the most important attributes of the input data and how to best reconstruct the input from the feature vector.
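The two reconstruction errors described above can be illustrated with a small sketch. It is written in plain Python to stay dependency-free and is not the authors' implementation; the function names `mse_loss` and `bce_loss`, the clamping constant `eps`, and the choice to average over the batch only are assumptions made for illustration.

```python
import math

def mse_loss(batch, recon):
    """Squared L2 error between each input and its reconstruction,
    averaged over the batch (assumed normalization)."""
    n = len(batch)
    return sum(
        sum((xj - rj) ** 2 for xj, rj in zip(x, r))
        for x, r in zip(batch, recon)
    ) / n

def bce_loss(batch, recon, eps=1e-12):
    """Binary cross-entropy for inputs in [0, 1], summed over the
    elements of each sample and averaged over the batch."""
    n = len(batch)
    total = 0.0
    for x, r in zip(batch, recon):
        for xj, rj in zip(x, r):
            rj = min(max(rj, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += xj * math.log(rj) + (1.0 - xj) * math.log(1.0 - rj)
    return -total / n

# A perfect reconstruction gives zero MSE; a poor one is penalized.
inputs = [[0.0, 1.0], [1.0, 0.0]]
print(mse_loss(inputs, inputs))
print(mse_loss(inputs, [[0.5, 0.5], [0.5, 0.5]]))
print(bce_loss(inputs, [[0.1, 0.9], [0.9, 0.1]]))
```

In a deep learning framework these quantities would normally come from built-in loss functions operating on tensors; the loops here only make the element-wise definition explicit.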

2.2. Prototypical Networks

Prototypical Networks [40] have been proposed for few-shot learning, which requires only a small amount of training data with limited information, as compared to traditional machine learning methods requiring a large amount of data to train a model for good results. As shown in Figure 2, the classification task can be achieved by comparing the distances with mean representations of each class in the metric space produced by Prototypical Networks.

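For a few-shot task, a support set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ with $N$ labeled samples is given, where $x_i \in \mathbb{R}^D$ is a $D$-dimensional vector and $y_i \in \{1, \ldots, K\}$ is the corresponding class label. $S_k$ denotes the set of samples labeled with class $k$. A representation $c_k$, or prototype, of each class is computed by averaging the embedded support points belonging to class $k$:

$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i) \tag{3}$$

where $f_\phi$ is an embedding function with learnable parameters $\phi$. For a distance function $d$, prototypical networks compute the distribution of a query point $x$ over the classes from its distances to all prototypes in the metric space:

$$p_\phi(y = k \mid x) = \frac{\exp\left(-d\left(f_\phi(x), c_k\right)\right)}{\sum_{k'} \exp\left(-d\left(f_\phi(x), c_{k'}\right)\right)} \tag{4}$$

The network is trained by minimizing $J(\phi) = -\log p_\phi(y = k \mid x)$, the negative log-probability of the true class $k$.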
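As a rough illustration of the prototype computation and the distance-based classification described above, the following plain-Python sketch works on vectors that are assumed to be already embedded (the embedding function itself is omitted). The squared Euclidean distance and all function names are assumptions chosen for the example; the text above does not fix a particular distance.

```python
import math

def prototypes(embeddings, labels):
    """Average the embedded support points of each class."""
    sums, counts = {}, {}
    for z, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for j, v in enumerate(z):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {k: [v / counts[k] for v in acc] for k, acc in sums.items()}

def sq_euclidean(a, b):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def class_probabilities(query, protos):
    """Softmax over negative distances from the query to every prototype."""
    logits = {k: -sq_euclidean(query, c) for k, c in protos.items()}
    m = max(logits.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def nll(query, true_class, protos):
    """Negative log-probability of the true class for one query point."""
    return -math.log(class_probabilities(query, protos)[true_class])

# Hypothetical 2D embeddings of a two-class support set.
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = ["normal", "normal", "inner", "inner"]
protos = prototypes(support, labels)
probs = class_probabilities([1.0, 1.0], protos)
print(max(probs, key=probs.get), nll([1.0, 1.0], "normal", protos))
```

The query is assigned to the class whose prototype is nearest, and the loss shrinks as the query embedding moves toward the prototype of its true class.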

3. Methods

The proposed HMN for fault diagnosis is described in detail in this section. As shown in Figure 3, our model has one input and two outputs. One output is the reconstruction of the input, and the other is the prediction of health conditions using a prototypical network. The details of the model are given in Table 1.

3.1. Data Preprocessing

The proposed model used the short-time Fourier transform (STFT) spectrogram as a 2D input. Firstly, as shown in Figure 4, the samples were generated with a sliding window of 2048 points. Secondly, the STFT used a fixed-length nonzero window function that slid along the time axis, truncating the source signal into segments of equal length. Assuming that these segments are stationary, the Fourier transform can be used to obtain the local frequency spectra of the segments. Finally, these local frequency spectra were recombined along the time axis to obtain a 2D time-frequency graph. The formula is presented in equation (5) below:

$$\mathrm{STFT}(\tau, f) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j 2 \pi f t}\, dt \tag{5}$$

where $x(t)$ is the original time signal and $w(t - \tau)$ is the window function centered at time $\tau$. In this study, the Hann window was used. To speed up the convergence of the model, we converted the 2D spectrogram into a grayscale image with values between 0 and 1. This process can be expressed as follows:

$$\tilde{a} = \frac{a - a_{\min}}{a_{\max} - a_{\min}} \tag{6}$$

where $a$ is the element magnitude, and $a_{\min}$ and $a_{\max}$ represent the minimum and maximum magnitude, respectively. Finally, the normalized spectrogram was compressed into a 64×64 time-frequency graph as the input of the model.
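The preprocessing chain above (Hann-windowed STFT followed by min-max normalization) can be sketched as follows. This is a deliberately small, dependency-free illustration that uses a direct discrete Fourier transform, so it is only suitable for short signals. The window length, hop size, and function names are hypothetical, and the final resizing to 64×64 is not shown.

```python
import cmath
import math

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def stft_magnitude(signal, win_len, hop):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = hann(win_len)
    n_bins = win_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(win_len)]
        spectrum = [
            abs(sum(seg[t] * cmath.exp(-2j * math.pi * k * t / win_len)
                    for t in range(win_len)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    # Transpose so that frequency runs along rows and time along columns.
    return [list(col) for col in zip(*frames)]

def min_max_normalize(spec):
    """Scale all magnitudes into [0, 1] using the global minimum and maximum."""
    lo = min(min(row) for row in spec)
    hi = max(max(row) for row in spec)
    span = (hi - lo) or 1.0  # guard against a constant spectrogram
    return [[(v - lo) / span for v in row] for row in spec]

# Toy signal: a pure tone with 8 cycles per 64-sample window.
tone = [math.sin(2.0 * math.pi * 8 * t / 64) for t in range(256)]
image = min_max_normalize(stft_magnitude(tone, win_len=64, hop=32))
print(len(image), len(image[0]))  # frequency bins, time frames
```

For real vibration signals an FFT-based routine from a numerical library would replace the inner sum; the structure (window, transform, stack, normalize) stays the same.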

3.2. Random Dropout on Input

Dropout is a technique proposed in [41] to prevent the deep neural nets from overfitting. The key idea is to randomly deactivate the units along with their connections from the network with probability p during training, preventing units from coadapting too much. Applying dropout amounts to sampling a “thinned” network from the original one during training. During the testing phase, dropout is disabled, which can be seen as an average of the predictions of many “thinned” networks. The networks trained with dropout usually have much better generalization ability on supervised learning tasks.

The deactivated units affect all the ones in the network, including the layers with dropout. Dropout applied in the lower layers can also be seen as providing noisy inputs for the higher layers. It can be interpreted as a method of data augmentation by adding noise to its hidden layers.

Adding noise with a single fixed distribution was not enough. Inspired by [21], we randomly changed the dropout rate during training so that the noise level itself was uncertain. Specifically, for each training batch, the dropout rate was a random value between 0.1 and 0.9. The operation is illustrated in Figure 5.

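$$\tilde{x} = m \odot x \tag{7}$$

Here $\odot$ denotes an elementwise product. $m$ is a binary mask whose elements are independent Bernoulli random variables, each being 0 with probability $p$ (the dropout rate). $x$ and $\tilde{x}$ are the raw input and the perturbed output, respectively.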

The purpose of adding dropout to the input layer was to add masking noise to the input, making the model insensitive to disturbance and improving the domain generalization of the model.
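A minimal sketch of the random-rate input dropout described in this subsection is given below. It assumes that the rate is drawn uniformly from [0.1, 0.9] once per batch and that dropped elements are simply set to zero without rescaling the remaining ones; neither the function name nor the absence of rescaling is confirmed by the text.

```python
import random

def random_rate_input_dropout(x, rng, low=0.1, high=0.9):
    """Zero each element of a 2D input with a rate p drawn afresh per call.

    Returns the masked input and the sampled rate. No rescaling of the
    surviving elements is applied (an assumption of this sketch).
    """
    p = rng.uniform(low, high)
    masked = [[0.0 if rng.random() < p else v for v in row] for row in x]
    return masked, p

rng = random.Random(0)
spectrogram = [[1.0] * 64 for _ in range(64)]  # placeholder 64x64 input
for _ in range(3):
    masked, p = random_rate_input_dropout(spectrogram, rng)
    dropped = sum(v == 0.0 for row in masked for v in row) / (64 * 64)
    print(round(p, 2), round(dropped, 2))
```

Because the rate changes from batch to batch, the model sees lightly and heavily corrupted versions of the same kind of input, which is the source of the varying noise level discussed above. At test time the input would be passed through unchanged.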

3.3. Feature Extraction

To make full use of unlabelled information, an autoencoder was designed for feature extraction. In the encoding stage, the 2D time-frequency images first passed through a set of 2D convolutional layers. The 2D convolutional layers captured the localized features of the image well due to their translation invariance. To obtain more diverse features at the same feature level, the weights in the convolutional layer were designed as a series of 2D filters. Each filter convolves independently across the input feature map in the forward pass, producing the output of one of the convolutional layer’s channels. Generally, the computation of the convolutional layer is expressed as follows:

$$x_j^{l} = \sigma\left( \sum_{i} x_i^{l-1} * k_{ij}^{l} + b_j^{l} \right) \tag{8}$$

where the operator $*$ denotes the convolution of the $i$th channel of the feature matrix $x^{l-1}$ and the kernel $k_{ij}^{l}$, which produces the feature map of the $j$th channel of layer $l$. $b_j^{l}$ is the bias of channel $j$ in layer $l$. $\sigma$ is a nonlinear activation function, for which ReLU is used in this study, applied to the output of the convolution.
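The convolutional layer computation described above can be made concrete with a small sketch. It assumes stride 1 and no padding, and, as in most deep learning libraries, implements the convolution as a cross-correlation (no kernel flip). The function name and array layout are hypothetical.

```python
def conv2d_relu(inputs, kernels, biases):
    """Multi-channel 2D convolution followed by ReLU.

    inputs:  [C_in][H][W]
    kernels: [C_out][C_in][kH][kW]
    biases:  [C_out]
    Stride 1, no padding; implemented as cross-correlation.
    """
    h, w = len(inputs[0]), len(inputs[0][0])
    out = []
    for kernel, bias in zip(kernels, biases):
        kh, kw = len(kernel[0]), len(kernel[0][0])
        fmap = []
        for r in range(h - kh + 1):
            row = []
            for c in range(w - kw + 1):
                acc = float(bias)
                # Sum the contributions of all input channels.
                for ch, k2d in enumerate(kernel):
                    for dr in range(kh):
                        for dc in range(kw):
                            acc += inputs[ch][r + dr][c + dc] * k2d[dr][dc]
                row.append(max(0.0, acc))  # ReLU
            fmap.append(row)
        out.append(fmap)
    return out

image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one input channel
filters = [[[[1, 0], [0, 1]]],                # output channel 0
           [[[-1, 0], [0, -1]]]]              # output channel 1
print(conv2d_relu(image, filters, [0.0, 0.0]))
```

Each output channel has its own filter bank and bias, so two filters applied to the same input yield two different feature maps; the second one here is clipped to zero by the ReLU.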

The encoder and decoder were designed in a symmetrical form. To reconstruct the code of the bottleneck layer to the same size as the input time-frequency image, transposed convolution layers were used in the decoder to upsample the feature map. Following [42], the encoder contained four convolution layers and two fully connected layers, while the decoder contained four transposed convolution layers and two fully connected layers.

3.4. Training of the Proposed Model

The two outputs of the model correspond to two different losses: the reconstruction loss $L_{rec}$ computed by the autoencoder and the classification loss $L_{cls}$ computed by the prototypical network. In the training process, $L_{rec}$ and $L_{cls}$ are minimized jointly. The total loss in the model training can be described as follows:

$$L = \alpha L_{rec} + (1 - \alpha) L_{cls} \tag{9}$$

where the hyperparameter $\alpha$ is the weight coefficient used to adjust the weights of the different losses. In the training process, the network is optimized with an Adam optimizer, which sets the learning rate for each parameter adaptively. The steps of the proposed training algorithm are listed in Algorithm 1.

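Algorithm 1: Training of the proposed HMN.
Initialize: weight coefficient α = 0.5, batch size = 8, learning rate η = 0.0001, number of epochs = 300
for n = 1, …, epochs do
  for i = 1, …, steps do
    input a batch of samples from the source domain
    randomly sample p from Uniform(0.1, 0.9)
    apply dropout to the inputs with rate p
    compute the prototypes
    compute the reconstruction loss and the classification loss
    compute the total loss using equation (9)
    update the network parameters with the Adam optimizer
  end for
end for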
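The control flow of Algorithm 1 can be sketched in a framework-independent way as follows. The model and optimizer are represented by two caller-supplied callables, `forward` and `update`, which are hypothetical placeholders; the weighting `alpha * l_rec + (1 - alpha) * l_cls` is an assumed form of the total loss, chosen so that `alpha = 0` switches the autoencoder branch off.

```python
import random

def total_loss(l_rec, l_cls, alpha=0.5):
    """Assumed weighting: alpha on reconstruction, (1 - alpha) on classification."""
    return alpha * l_rec + (1.0 - alpha) * l_cls

def train(batches, forward, update, epochs=300, alpha=0.5, seed=0):
    """Loop structure of Algorithm 1.

    forward(batch, p) -> (l_rec, l_cls): runs the model with input dropout rate p.
    update(loss): performs one optimizer step (e.g. Adam) on the total loss.
    """
    rng = random.Random(seed)
    history = []
    for _ in range(epochs):
        for batch in batches:
            p = rng.uniform(0.1, 0.9)         # new dropout rate for every batch
            l_rec, l_cls = forward(batch, p)  # prototypes are computed inside forward
            loss = total_loss(l_rec, l_cls, alpha)
            update(loss)
            history.append((p, loss))
    return history

# Dummy stand-ins so that the loop can be run without a real model.
log = train(batches=[0, 1, 2],
            forward=lambda batch, p: (1.0, 3.0),
            update=lambda loss: None,
            epochs=2)
print(len(log), log[0][1])
```

In an actual implementation `forward` would apply the input dropout, encode the batch, decode it for the reconstruction loss, and compare query embeddings with class prototypes for the classification loss.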

4. Experiments, Results, and Discussion

4.1. Experiment Setup
4.1.1. Experiment Description

To verify the validity of our method, experiments are carried out on two bearing datasets: the Case Western Reserve University (CWRU) bearing datasets [43] and the Paderborn bearing dataset [44]. We assume that the source domain contains limited labeled samples and use 6, 10, 15, 50, 100, 200, 300, 500, and 600 training samples per class to test the performance of the proposed method. Fivefold cross-validation is applied to the experiments. The test platform runs Ubuntu 18.04, Python 3.6, and PyTorch on an Intel® Core™ i7-9750H CPU and an NVIDIA GTX 1080Ti GPU.

4.1.2. Comparison Methods and Evaluation Metrics

To verify the advantages of the proposed model, several popular models are compared, as shown in Table 2: three methods with time series input (Siamese-based CNN [35], PSDAN [45], and WDCNN [46]) and three methods with time-frequency input (SCNN, HCAE [42], and DeIN [47]). PSDAN is an adversarial domain adaptation method. WDCNN uses a wide convolution kernel at the front of the network. SCNN is a plain CNN with the same structure as the encoder of HMN, followed by a softmax layer. HMN is the model proposed in this study.

All the models are trained in the source domain and tested in the unseen target domain. For the sake of fair comparison, the hyperparameters of models are carefully selected.

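Several evaluation indicators are used to evaluate the performance of the proposed model: (1) accuracy, (2) precision, (3) F1 score (F1), and (4) average F1 score (αF1). Precision, F1, and αF1 can be obtained using the following equations:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$F1 = \frac{2\,TP}{2\,TP + FP + FN} \tag{11}$$

$$\alpha F1 = \frac{1}{K} \sum_{k=1}^{K} F1_k \tag{12}$$

where $TP$, $FP$, and $FN$ represent true positive, false positive, and false negative, respectively, $F1_k$ is the F1 score of class $k$, and $K$ is the number of classes.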
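The following sketch shows how these indicators can be computed from true and predicted labels. It assumes that the average F1 score is the unweighted mean of the per-class F1 scores and that every predicted label belongs to the given class list; the function names and the toy labels are made up for illustration.

```python
def per_class_counts(y_true, y_pred, classes):
    """Count true positives, false positives, and false negatives per class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def average_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores (assumed definition)."""
    counts = per_class_counts(y_true, y_pred, classes)
    return sum(f1_score(**counts[c]) for c in classes) / len(classes)

y_true = ["ball", "ball", "inner", "inner"]
y_pred = ["ball", "inner", "inner", "inner"]
counts = per_class_counts(y_true, y_pred, ["ball", "inner"])
print(precision(counts["inner"]["tp"], counts["inner"]["fp"]))
print(average_f1(y_true, y_pred, ["ball", "inner"]))
```

Averaging per class gives every health condition the same weight, regardless of how many test samples it has.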

4.2. Case Study 1: CWRU Bearing Datasets
4.2.1. Data Description

In the CWRU bearing datasets [43], the 12k drive end fault data were selected as the original experimental data. Four health conditions, i.e., normal, ball fault, inner race fault, and outer race fault, are covered by these data, as shown in Table 3. Each of the three fault types had three fault diameters, i.e., 0.007 inches, 0.014 inches, and 0.021 inches. Thus, together with the normal condition, there were altogether 10 different classes.

Signals of all fault types are shown in Figure 6. Each type of fault was recorded under three different loads, i.e., 1, 2, and 3 hp (motor speeds of 1772, 1750, and 1730 RPM), as illustrated in Table 4. During data collection, each sample was collected from a vibration signal, as shown in Figure 7. Half of each signal was used to generate training data, and the remaining half was used to generate the test set. As shown in Figure 4, the training samples were generated using a 2048-point sliding window shifted in steps of 80 points, so that consecutive samples overlap. The test samples were generated with a sliding window of the same size but without overlap.

We set the data under different working conditions as experimental data. Datasets A, B, and C correspond to different working conditions with loads of 1, 2, and 3 hp, respectively. Each dataset contained 6000 training samples and 250 test samples.
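The difference between the overlapping training windows and the non-overlapping test windows can be illustrated with a short sketch. It assumes that "80 points" denotes the shift between consecutive training windows; the signal length of 60,000 points is a made-up value for the example and is not taken from the datasets.

```python
def sliding_windows(signal, window=2048, step=80):
    """Cut a 1D signal into fixed-length windows taken every `step` points."""
    return [signal[s:s + window]
            for s in range(0, len(signal) - window + 1, step)]

def count_windows(length, window=2048, step=80):
    """Number of complete windows that fit into a signal of the given length."""
    return 0 if length < window else (length - window) // step + 1

half_signal = 60_000  # hypothetical number of points in one half of a recording
print(count_windows(half_signal, step=80))    # overlapping windows for training
print(count_windows(half_signal, step=2048))  # non-overlapping windows for testing
```

Shifting the window by far less than its length acts as data augmentation: the same recording yields many more training samples than test samples, while the test windows remain disjoint.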

4.2.2. Results and Analysis

Figure 8 illustrates the accuracy of all methods when trained with various numbers of samples. HMN is evidently superior to the other approaches. The cross-domain task C to A is the most difficult: even with sufficient training samples, the accuracy of four compared methods does not reach 90%, whereas the proposed model still achieves satisfactory results.

We then examined the results of training with 6 samples per class. The classification accuracies of the cross-domain tasks are shown in Table 5. HMN achieved the best performance among all the methods in all the scenarios. Specifically, HMN achieved an accuracy of 92.65% in C-A, which was 34.61%, 21.57%, 26.38%, 19.32%, 40.21%, and 27.09% higher than DeIN, Siamese-based CNN, WDCNN, SCNN, HCAE, and PSDAN, respectively.

In Tables 6 and 7, the precision and F1 (αF1) of HMN and the other 6 methods are compared in the cross-domain task C-A with 6 training samples per class (the most difficult task). The results reveal that the proposed HMN outperformed all of the compared approaches. This indicates that HMN can achieve more robust performance in cross-domain diagnostic tasks with limited training samples.

To further evaluate the effectiveness of the proposed method, we observed the effects of the autoencoder and random dropout in improving model’s performance through the loss curve. Figures 9 and 10 show the loss curves in cross-domain task C-A with 6 training samples per class.

In Figure 9, the training loss is the total loss of equation (9), which contains both the reconstruction loss and the classification loss, whereas the testing loss is the classification loss alone. According to equation (9), when α is set to 0, the autoencoder branch has no effect. A greater α indicates a higher weight of the autoencoder during the training process. As α increases from 0 to 0.2, the testing loss converges to a smaller value. The convergence of the testing loss is smoother when α equals 0.5. This demonstrates that the autoencoder branch can help prevent overfitting and improve the model’s performance.

As shown in Figure 10, when the HMN does not employ random dropout on the input, the testing loss converges to a value greater than 3; when random dropout is used, the testing loss converges to a value below 1, and the curve descends more smoothly. This demonstrates the effect of random dropout on the input in improving the model’s cross-domain generalization.

4.3. Case Study 2: Paderborn Dataset
4.3.1. Data Description

As shown in Figure 11, the test rig [44] consists of five modules: (1) electric motor, (2) torque-measurement shaft, (3) rolling bearing test module, (4) flywheel, and (5) load motor. Bearings with different state types were installed in the test module to obtain experimental data. Fault types of bearings come from artificial and real damages.

In the basic operating condition, the test platform ran at n = 1500 rpm with a load torque of M = 0.7 Nm and a radial force on the bearing of F = 1,000 N. Two further settings were obtained by changing one parameter at a time, to M = 0.1 Nm and to F = 400 N. The three settings are named D, E, and F, as shown in Table 8.

The bearings with 32 different states were operated under different working conditions, including 14 states with natural damages from accelerated lifetime tests, 12 states with artificial damage, and 6 states with health data.

For each bearing under each load setting, a vibration signal of about 4 s was measured at a 64 kHz sampling rate. In the experiment, the datasets contained signals obtained from healthy bearings, artificially damaged bearings, and naturally damaged bearings. All bearings of different fault types were run under three different loads at a speed of 1500 rpm. The file names of the selected data are shown in Table 9. The details of the selected datasets are listed in Table 10. Each dataset contains 1800 training samples and 120 test samples.

4.3.2. Results and Analysis

Using the same experimental procedure, Figure 12 compares our method with the other approaches in terms of accuracy on different cross-domain tasks. The results show that our method outperformed the other six state-of-the-art methods in all the scenarios.

Table 11 lists the accuracy of the different methods on the cross-domain tasks with 6 training samples per class. The proposed method outperformed all comparative methods by 6.87%–41.26% on average. Tables 12 and 13 compare the methods in terms of precision, F1, and αF1 in the cross-domain task E-D with 6 training samples per class. The results also show that our method outperforms the alternatives.

5. Conclusions

A novel HMN was proposed for cross-domain fault diagnosis with limited training samples. We improved the model’s diagnostic performance in two ways: (1) a novel deep learning structure combining an autoencoder and a matching network was built, and (2) a random dropout strategy that adds random disturbance to the inputs during the training process was developed to enhance the model’s domain generalization. The experimental results presented in Section 4 show that the proposed method has better domain generalization ability with limited training samples than the state-of-the-art approaches.

However, the method proposed in this study still has some limitations. For example, the method is limited to cross-domain tasks between different working conditions on the same device, whereas cross-domain diagnosis across multiple devices is closer to real applications of intelligent fault diagnosis algorithms. In addition, HMN can only perform classification tasks, which limits the model’s potential for multitask learning. In future work, we will further optimize HMN and employ it in more complex cross-domain fault diagnosis scenarios and multitask learning.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.


Acknowledgments

This work was supported in part by the National Key Research and Development Program (grant number 2020YFB1713300), the National Key Technologies Research and Development Program of China (grant number 2018AAA0101803), the National Natural Science Foundation of China (grant number 91746116), the Science and Technology Project of Guizhou Province Talents (grant numbers (2015)4011 and (2017)5788), the Guizhou Province University Talent Training Base Project (grant number (2020)009), the Guizhou Science and Technology Planning Project (grant number (2017)3001), the Guizhou Optoelectronic Information and Intelligent Application International Joint Research Center (grant number (2019)5802), and the Guizhou Province University Integration Research Platform Project (grant number (2020)005).