Abstract

Conventional methods for fault diagnosis typically require a substantial amount of training data. However, for equipment with high reliability, it is arduous to form a large-scale, well-annotated dataset due to the expense of data acquisition and costly annotation. Besides, the generated data contain a large number of redundant features that degrade the performance of models. To overcome this, we propose a feature transfer scenario that transfers knowledge from similar fields to enhance the accuracy of fault diagnosis with small samples. To reduce the redundant information, data were filtered according to manifold consistency. Then, features were extracted based on a CNN and feature transfer was conducted. For adequate fitness, the joint adaptation of conditional distribution and marginal distribution was used between the two domains. Minimum structural risk and the MMD of adaptation were two weighted indicators for training the model. To test the efficiency of the model, we built an airborne fuel pump testbed and contributed a new dataset that contains 15 categories of fault data, which serves as the small sample dataset in this research. Then the proposed model was applied to our experimental data. As a result, the fault diagnosis rate increases by 28.6% through our proposed model, which is more precise than other classical methods. The results of feature visualization further demonstrate that the features are more distinguishable through the proposed method. All code and data are accessible on GitHub (see Data Availability).

1. Introduction

In recent years, data-driven methods have been widely applied in the field of prognostics and health management and have become a popular approach to building complex diagnosis models [1-3]. Most data-driven methods are based on a large number of samples, from which the corresponding relationship between input and output is extracted to establish a model. However, for some equipment with high reliability and long lifespan, it is arduous to obtain sufficient fault data. Along with that, the generated data contain a large number of redundant features that degrade the performance of models. A model trained on small samples has limited generalization ability, which will lead to low accuracy when applied to other fields [4].

Some popular methods have been proposed to solve the problem of small samples. To the best of our knowledge, these methods fall into three categories. The first is based on resampling, which resamples the small sample set to generate more data, such as Random Under-sampling [5] and Synthetic Minority Over-sampling [6]. The second is based on the Generative Adversarial Net [7, 8], whose principle is to make the generated samples as realistic as possible by pitting a generative network against a discriminant network, so as to enlarge the small sample set. The third is based on few-shot learning [9, 10], which decomposes the small samples into different meta tasks to learn the generalization ability of the model; therefore, few-shot learning has adaptive capacity on an unseen dataset. Although these methods achieve success to some extent, their sources of knowledge are only the small samples themselves. From the perspective of information theory, these methods do not change the nature of a small sample, which carries little knowledge.

Transfer learning is a machine learning method that uses existing knowledge to solve problems in different but related fields [11]. Transfer learning has been applied to tasks such as hand gesture recognition [12], sentiment analysis [13], fraud detection [14], and hyperspectral image analysis [15]. Besides, many advanced transfer learning theories have been proposed. For example, Liu [16] proposed a one-step approach for training classifiers with noisy data. Chen [17] proposed a boundary-based Out-of-Distribution (OOD) classifier which separates the unseen and seen domains by using only seen samples for training. Teshima [18] proposed a meta-distributional scenario in which a data-generating mechanism is invariant among domains.

Due to the advantages of transfer learning in domain generalization, it is also widely applied in fault diagnosis. Wang [19] proposed an LDA-based deep transfer learning framework for fault diagnosis in industrial chemical processes. Singh [20] utilized minimum redundancy maximum relevance (mRMR) for intelligent fault diagnosis of rotating machines. Deng [21] proposed a double-layer attention based adversarial network (DA-GAN) for partial transfer learning in machinery fault diagnosis. Drawing a conclusion from the existing literature, there are two aspects that few researchers have considered. Firstly, most proposed transfer learning methods focus on the adaptation of the diagnosis algorithm across fields but pay less attention to the condition of small samples. Secondly, most studies use the raw signal in the source domain, which is feasible for many fields. However, some high-reliability equipment has a long lifespan, which results in redundant features in the monitoring data. If the raw signal is used directly, negative transfer may occur.

Given that high-reliability equipment is characterized by small sample sizes and redundant features, we propose a feature transfer framework. To reduce the redundant information, data were filtered according to manifold consistency. Then, features were extracted based on a CNN and feature transfer was conducted. For adequate fitness, the joint adaptation of conditional distribution and marginal distribution was used between the two domains. Minimum structural risk and the MMD of adaptation were two weighted indicators for training the model to enhance its generalization ability. To test the efficiency of the model, we built an airborne fuel pump testbed and contributed a new dataset that contains 15 categories of fault data, which serves as the small sample dataset in this research. Then the proposed model was applied to our experimental data. As a result, the fault diagnosis rate increases by 28.6% through our proposed model, which is more precise than other classical methods. The results of feature visualization further demonstrate that the features are more distinguishable through the proposed method.

2. Problem Setup

2.1. Definition of transfer learning

In most cases, a domain $D$ consists of two components: a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, \dots, x_n\} \in \mathcal{X}$. For example, given a concrete domain $D = \{\mathcal{X}, P(X)\}$, a task can be expressed as two components: a label space $\mathcal{Y}$ and an objective predictive function $f(\cdot)$ (denoted by $T = \{\mathcal{Y}, f(\cdot)\}$), which is not observed but can be trained from the data, which consist of pairs $\{x_i, y_i\}$, where $x_i \in X$ and $y_i \in \mathcal{Y}$. The function $f(\cdot)$ can be used to predict the corresponding label $f(x)$ of a new instance $x$. From a probabilistic viewpoint, $f(x)$ can be written as $P(y \mid x)$.

Definition of transfer learning: given a source domain $D_S$ and learning task $T_S$, and a target domain $D_T$ and learning task $T_T$, transfer learning aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $D_S \neq D_T$ or $T_S \neq T_T$ [22].

2.2. Fault diagnosis with small sample

For some highly reliable products, the small sample data set can be represented as $D_T = \{(x_i^t, y_i^t)\}_{i=1}^{n_t}$, which contains $n_t$ samples. In the form of transfer learning, the data set is described as the target domain $D_T = \{\mathcal{X}_T, P(X_T)\}$ and the target task $T_T = \{\mathcal{Y}_T, f_T(\cdot)\}$, where $P(X_T)$ is the marginal distribution of $X_T$, $\mathcal{Y}_T$ is the label space of the target domain, and $f_T(\cdot)$ is a function that maps a sample to the label space in the target domain.

Another dataset with rich samples is represented as $D_S = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, which contains $n_s$ samples ($n_s \gg n_t$). In the form of transfer learning, the data set is described as the source domain $D_S = \{\mathcal{X}_S, P(X_S)\}$ and the source task $T_S = \{\mathcal{Y}_S, f_S(\cdot)\}$, where $P(X_S)$ is the marginal distribution of $X_S$, $\mathcal{Y}_S$ is the label space of the source domain, and $f_S(\cdot)$ is a function that maps a sample to the label space in the source domain.

The goal of transfer learning is to acquire and apply knowledge from the source domain [23]. More specifically, it is to establish the nonlinear mapping relationship from the equipment monitoring data to the health label space in the source domain, then transfer it to the target domain. In the given situation, the labels of both the source domain and the target domain are accessible; such a problem is categorized as multitask learning. For this problem, feature transfer is often used to transfer knowledge from the source domain to the target domain [22].

2.3. Feature transfer

The idea of feature transfer is to learn a pair of mapping functions that extract diagnostic features from the source domain and the target domain respectively. The features extracted by the mapping functions are then adapted, so that the target domain can extract features according to the paradigm of the source domain. Fault classification is carried out in the target domain based on these features, and the diagnosis knowledge of the source domain is transferred to the target domain through feature adaptation [24].

Aiming to learn transferable features between a given target domain and source domain, a common approach to feature adaptation is to minimize the inter-domain differences between source features and target features. During the adaptation, the features extracted from the source domain act as a template from which the target domain can learn. The schematic diagram of domain adaptation is shown in Figure 1.

3. General Framework

The main issues of transfer learning can be summarized as the following three: when to transfer, what to transfer, and how to transfer. Aiming at these three issues, this paper designs a framework called LLE-CNN-JDA. For when to transfer, we designed a data filtering method based on manifold consistency: we mapped the high-dimensional data into a low-dimensional space to analyze the similarity between the source domain and target domain, and, to ensure the availability of transfer data, we filtered the source-domain data based on the Euclidean distance between the source domain and the target domain in the manifold space. For what to transfer, the method of feature transfer is adopted: a convolutional neural network is used to extract the deep features of the data in the source domain and target domain, and by adapting the features of the two domains, knowledge can be transferred in the feature layer. For how to transfer, we designed a term that jointly adapts the conditional distribution and the marginal distribution of the feature layer, and the network is trained based on structural risk minimization. The framework diagram of the proposed model is shown in Figure 2.

3.1. Data filtered based on manifold consistency

The so-called manifold is a general term for geometric objects such as curves and surfaces of various dimensions. Manifold learning maps data from a higher-dimensional space into a lower-dimensional space. Unlike other dimensionality reduction methods, manifold learning assumes that the data are sampled from a latent manifold. If we can find the laws of the data in the manifold space, we may uncover the latent laws of the data in high dimensions and mine the essential characteristics of the data [25, 26].

For the given situation, the sample size of the source domain is large, so the prognostics model in the source domain can be trained well. However, the data in the target domain are not abundant, so we may not be able to extract enough information from the target domain. The source domain has sufficient data, but it is necessary to explore whether the source data can be effectively applied to the target domain. Based on the idea of manifold learning, we consider that if we can explore the relationship between the source domain and the target domain in the low-dimensional manifold space and filter the data transferable to the target domain, the source data may become applicable to the target domain.

Locally linear embedding (LLE) is a manifold learning method that enables the data to maintain their original manifold structure well after dimensionality reduction. The manifold assumed by LLE is an unclosed surface with a relatively uniform and dense distribution, so that every data point can be constructed as a linear weighted combination of its nearest points. LLE transfers manifolds from higher dimensions to lower dimensions and preserves the features of the high-dimensional manifold as much as possible [27, 28]. The steps of the LLE algorithm are as follows:

Step 1: Calculate the K adjacent points of each sample point. Adopting the KNN strategy, the K points with the smallest Euclidean distance to the sample point are taken as its K adjacent points.

Step 2: Calculate the local reconstruction weight matrix $W$ of the sample. The reconstruction error is defined as $\varepsilon(W) = \sum_i \big\| x_i - \sum_{j=1}^{K} w_{ij}\, x_{ij} \big\|^2$, and the local covariance matrix $C$ is defined as $C_{jk} = (x - x_j)^{T}(x - x_k)$, where $x$ represents a specific point and its $K$ adjacent points are denoted by $x_j$ ($j = 1, \dots, K$). Minimizing $\varepsilon(W)$ subject to $\sum_{j} w_{j} = 1$ then gives $w_{j} = \frac{\sum_{k} C_{jk}^{-1}}{\sum_{l,s} C_{ls}^{-1}}$.

Step 3: Map all sample points to a low-dimensional space, where the mapping satisfies $\min_{Y} \Phi(Y) = \sum_{i} \big\| y_i - \sum_{j=1}^{K} w_{ij}\, y_{ij} \big\|^2$. The formula can also be represented as $\Phi(Y) = \operatorname{tr}\big(Y^{T} M Y\big)$, where $M = (I - W)^{T}(I - W)$. Combining this with the restrictive conditions $\sum_i y_i = 0$ and $\frac{1}{N} Y^{T} Y = I$, the problem is transformed into an eigendecomposition of $M$: take the $d$ eigenvectors of $M$ corresponding to its smallest nonzero eigenvalues as column vectors, so that $Y$ is an $N \times d$ matrix, where $N$ is the size of the data.
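As a concrete illustration, the LLE mapping described in Steps 1-3 can be reproduced with scikit-learn. This is a minimal sketch in which the input array and the embedding dimension are illustrative assumptions, not values from our experiments:

```python
# Minimal LLE sketch (Steps 1-3) using scikit-learn; data and dimensions are placeholders.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.randn(500, 246)                  # stand-in for raw vibration vectors (N x length)
lle = LocallyLinearEmbedding(n_neighbors=100,  # K adjacent points (K = 100 is chosen below)
                             n_components=3)   # dimension of the manifold space (assumed)
Y = lle.fit_transform(X)                       # N x 3 embedding preserving local linear structure
```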
In this paper, we adopted the LLE algorithm to evaluate the similarity of the data from the source domain and target domain. We adopted the bearing failure data from Case Western Reserve University as the source domain and the failure data from our airborne fuel pump testbed as the target domain; the former is large in sample size, although it is difficult to ensure the validity of all of its data, while the latter is of small sample size. The two kinds of data were mapped into the manifold space respectively. As the number of neighbouring points K changes, the mapping results in the manifold space are shown in Figure 3. Through analysis, we found that when K = 100 the mapped manifolds of the two types of data are closest; therefore, K = 100 was chosen for filtering data in the low dimension. We proposed to filter the data of the source domain in the manifold space by calculating the Euclidean distance to the target domain. Adopting the tactics of KNN, we computed the distance from each sample point of the source domain to all sample points of the target domain, then chose the minimum distance as the indicator of that sample point. We obtained an indicator set $\{d_i\}$, sorted it, and chose the $n$ sample points with the smallest distances as the filtered source domain.
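The filtering step itself admits a simple sketch: under the assumption that both domains have already been embedded by LLE, each source point is scored by its minimum Euclidean distance to the target points, and the n closest points are kept. The names below are illustrative:

```python
# Hedged sketch of the manifold-consistency filter described above.
import numpy as np
from scipy.spatial.distance import cdist

def filter_source(Y_src, Y_tgt, n):
    """Y_src, Y_tgt: LLE embeddings of source/target data; keep the n closest source points."""
    d = cdist(Y_src, Y_tgt).min(axis=1)  # indicator: nearest target distance per source point
    return np.argsort(d)[:n]             # indices of the n source points closest to the target
```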

3.2. Deep feature extraction

The source domain and target domain data both contain prognostics information related to the equipment. Therefore, we adopt a convolutional neural network to extract fault features. The source domain data and target domain data are processed separately by the neural network, which contains convolution layers, pooling layers, a flatten layer, and a fully connected layer. The network parameters are trained based on structural risk minimization and feature adaptation between the source domain and target domain.

Convolution layer operation: assume that $a^{l-1}$ represents the feature vector in layer $l-1$ of the network. Through a convolution kernel $W^{l}$, the feature vector of layer $l$ is $a^{l} = \sigma^{l}\big(W^{l} * a^{l-1} + b^{l}\big)$, where $\sigma^{l}$ is the activation function of layer $l$ and $\{W^{l}, b^{l}\}$ are the parameters of the convolution layer to be trained.

Pooling layer operation: the processing logic of the pooling layer is to compress the input matrix. The formula of the pooling layer is $p_{s,[M:N]}^{l} = \mathrm{pool}\big(a_{s,[M:N]}^{l}\big)$, where $p_{s,[M:N]}^{l}$ represents the result of pooling the features from sequence $M$ to $N$ in the convolution layer $l$, $a_{s,[M:N]}^{l}$ represents the corresponding vector in the convolution layer $l$, and the subscript $s$ indicates that the operation is for the source domain.

After multiple convolution and pooling operations, the feature extraction layer output, namely the flatten layer input vector $a^{L}$, is obtained. The vector is fed into the flatten function, and the flatten layer output $F = \mathrm{flatten}\big(a^{L}\big)$ is obtained. The output of the flatten layer is then used as the input vector of the fully connected layer, where the vector is mapped to the label space through the neural network of the fully connected layer. The function is $y = \sigma\big(W^{fc} F + b^{fc}\big)$, where $\sigma$ is the activation function and $\{W^{fc}, b^{fc}\}$ is the parameter set to be trained.
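A minimal PyTorch sketch of such a feature extractor follows; the layer types (convolution, pooling, flatten, fully connected) mirror the description above, but the channel counts, kernel sizes, and class number are assumptions:

```python
# Illustrative 1-D CNN feature extractor; hyper-parameters are assumptions.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, in_len=246, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(32 * (in_len // 4), n_classes)

    def forward(self, x):             # x: (batch, 1, in_len)
        f = self.features(x)          # deep features, later used for adaptation
        return f, self.classifier(f)  # features and class predictions
```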

3.3. Structural risk minimization

For the target domain data with a small size, overfitting easily occurs in the training process. To prevent overfitting, it is necessary to avoid excessive complexity of the network structure during training. Therefore, complexity and accuracy are both important indicators impacting the efficiency of the network. In this paper, the convolutional neural network was trained based on structural risk minimization: a penalty term (regularization term) for the complexity of the model is added to the empirical risk to reduce the risk of overfitting [29]. The formula of structural risk minimization is expressed as follows:

$$R_{\mathrm{srm}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + \lambda J(f)$$

In the formula, the more complex the model is, the greater the value of $J(f)$ will be, and $\lambda$ indicates the importance attached to model complexity. As the empirical risk converges to a certain degree, further decreases in empirical risk make the model complexity increase sharply. Excessive complexity may make the model fit the source-domain data too exactly, and the model would then be difficult to generalize to the target domain. Thus, we added the penalty term for model complexity to inhibit its excessive increase.

For a conditional probability distribution, the loss function is logarithmic, and the model complexity is determined by the prior probability of the model; therefore, structural risk minimization is equivalent to maximum a posteriori estimation. Given a sample set $D = \{(x_i, y_i)\}_{i=1}^{N}$, assume the prior distribution of the parameter $\theta$ is $P(\theta)$ and the likelihood of $D$ is $\prod_{i=1}^{N} P(y_i \mid x_i, \theta)$. Maximizing the posterior probability gives $\max_{\theta} \prod_{i=1}^{N} P(y_i \mid x_i, \theta)\, P(\theta)$; taking the logarithm, $\max_{\theta} \sum_{i=1}^{N} \log P(y_i \mid x_i, \theta) + \log P(\theta)$ is obtained. Negating the above equation transforms it into the minimization problem $\min_{\theta} -\sum_{i=1}^{N} \log P(y_i \mid x_i, \theta) - \log P(\theta)$. Defining the loss function as $L(y, f(x)) = -\log P(y \mid x, \theta)$, the coefficient as $\lambda = 1/N$, and the penalty term as $J(f) = -\log P(\theta)$, the equation matches the structural risk minimization formula above [30].
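In practice this amounts to adding a weighted complexity penalty to the training loss. A minimal sketch, assuming an L2 (squared-norm) penalty as J(f) and a cross-entropy empirical risk:

```python
# Structural risk = empirical risk + lambda * complexity penalty (L2 penalty assumed here).
import torch
import torch.nn.functional as F

def structural_risk(logits, labels, model, lam=1e-3):
    empirical = F.cross_entropy(logits, labels)                   # loss term L(y, f(x))
    complexity = sum((p ** 2).sum() for p in model.parameters())  # J(f): model complexity
    return empirical + lam * complexity
```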

3.4. Joint adaption of conditional distribution and marginal distribution

Feature extraction and network training are conducted separately in the source domain and target domain. To transfer knowledge, the adaptation between the source domain and the target domain is to be conducted at the feature layer.

As for the source domain, the size of the data is large, so the extracted features contain much information related to the equipment. Thus, the pattern by which deep features are extracted in the source domain is an empirical model for fault identification, which may be applicable to other related fields. For the target domain, the sample size is small, which may lead to poor generalization ability of the network in feature extraction. Therefore, the feature extraction of the source domain serves as a reference for the target domain, and feature transfer becomes available in this way.

Marginal distribution and conditional distribution jointly reflect the domain distribution [31]. Therefore, feature adaptation is to adapt the marginal distribution and the conditional distribution. According to probability theory, $P(X, Y) = P(X)\, P(Y \mid X)$, so we seek to minimize simultaneously (1) the distance between the marginal distributions $P(F_s)$ and $P(F_t)$, and (2) the distance between the conditional distributions $P(y_s \mid F_s)$ and $P(y_t \mid F_t)$. For the source domain $D_S$ and target domain $D_T$, assume that the features extracted through the CNN are $F_s$ and $F_t$. If the features of the two domains are to be adapted, the marginal distribution and conditional distribution of the feature vectors must both be adapted.

3.4.1. Adaption of the marginal distribution

We try to minimize the distance between the marginal distributions $P(F_s)$ and $P(F_t)$. Since directly estimating probability densities is nontrivial, we resort to nonparametric statistics. We adopt the empirical Maximum Mean Discrepancy (MMD) to measure the distance, which compares different distributions based on the distance between the sample means of the two domains in a reproducing kernel Hilbert space (RKHS) [32].

Specifically, the statistical approach of MMD is conducted in the following manner. Based on the samples of the two distributions, look for a continuous function $f$ in the sample space, evaluate it on samples from each distribution, and calculate the mean of the function values for each distribution. By taking the difference between the two mean values, the mean discrepancy of the two distributions corresponding to $f$ is obtained. Looking for the $f$ that maximizes this mean discrepancy yields the MMD. Thus, MMD is taken as a test statistic to determine whether two distributions are close: if this value is small enough, the two distributions are considered the same; otherwise, they are not. The value is also used to quantify the degree of similarity between two distributions [33]. If $\mathcal{F}$ is used to represent a set of continuous functions in the sample space, then MMD can be expressed as follows:

$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \big( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \big)$$

Assume that $X$ and $Y$ are two data sets obtained by independent and identically distributed sampling from distributions $p$ and $q$ respectively, with sizes $M$ and $N$. The empirical estimate of MMD based on $X$ and $Y$ is as follows:

$$\mathrm{MMD}[\mathcal{F}, X, Y] = \sup_{f \in \mathcal{F}} \Big( \frac{1}{M} \sum_{i=1}^{M} f(x_i) - \frac{1}{N} \sum_{j=1}^{N} f(y_j) \Big)$$

Given two observation sets $X$ and $Y$, this result depends heavily on the given function class $\mathcal{F}$. For MMD to have the property that $\mathrm{MMD}[\mathcal{F}, p, q] = 0$ if and only if $p$ and $q$ are the same distribution, $\mathcal{F}$ is required to be rich enough. On the other hand, for the empirical estimate of MMD to converge rapidly to its expectation as the size of the observation sets increases, $\mathcal{F}$ must be sufficiently restrictive. To satisfy both requirements, we adopt reproducing kernel Hilbert spaces.

A reproducing kernel Hilbert space $\mathcal{F}$ is a complete inner product space in which each point corresponds to a feature map $\phi$. Based on the feature map, we define the mean embedding of a distribution $p$ as the element $\mu_p \in \mathcal{F}$ that satisfies $\mathbb{E}_{x \sim p}[f(x)] = \langle f, \mu_p \rangle_{\mathcal{F}}$ for all $f \in \mathcal{F}$. The mean embedding exists under mild constraints. When the mean embeddings of $p$ and $q$ exist, the squared MMD can be expressed as follows:

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \big\| \mu_p - \mu_q \big\|_{\mathcal{F}}^2$$

If $\mathcal{F}$ is the unit ball in a universal RKHS, such as a Gaussian or Laplace RKHS, the squared MMD can be expressed as:

$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \mathbb{E}_{x, x' \sim p}\big[k(x, x')\big] - 2\, \mathbb{E}_{x \sim p,\, y \sim q}\big[k(x, y)\big] + \mathbb{E}_{y, y' \sim q}\big[k(y, y')\big]$$

Here $x$ and $x'$ are two independent random variables that obey $p$, and $y$ and $y'$ are two independent random variables that obey $q$. An unbiased statistical estimate of the above can be expressed as:

$$\widehat{\mathrm{MMD}}^2[X, Y] = \frac{1}{M(M-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{N(N-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} k(x_i, y_j)$$
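The kernel form of MMD is straightforward to compute. Below is a hedged sketch of the (biased) empirical MMD squared with a Gaussian RBF kernel; the bandwidth sigma is an illustrative assumption:

```python
# Biased empirical MMD^2 with a Gaussian kernel; sigma is an illustrative choice.
import torch

def gaussian_kernel(a, b, sigma=1.0):
    d2 = torch.cdist(a, b) ** 2              # pairwise squared Euclidean distances
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """x: (M, d) samples from p; y: (N, d) samples from q."""
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())
```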

Marginal distribution adaptation minimizes the MMD between the marginal distributions of the features from the source domain and target domain [31], which is:

$$\min \; D_M(F_s, F_t) = \Big\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi\big(f_i^{s}\big) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi\big(f_j^{t}\big) \Big\|_{\mathcal{H}}^2$$

3.4.2. Adaption of the conditional distribution

We try to minimize the distance between the conditional distributions $P(y_s \mid F_s)$ and $P(y_t \mid F_t)$. Since calculating nonparametric statistics of $P(y_s \mid F_s)$ and $P(y_t \mid F_t)$ is difficult, we resort to the quasi-conditional distributions $P(F_s \mid y_s)$ and $P(F_t \mid y_t)$ instead, which approximate $P(y_s \mid F_s)$ and $P(y_t \mid F_t)$ well when the sample sizes are large [34, 35].

Conditional distribution adaptation minimizes the MMD between the class-conditional distributions of the features from the source domain and target domain, namely:

$$\min \; D_C(F_s, F_t) = \sum_{c=1}^{C} \Big\| \frac{1}{n_s^{(c)}} \sum_{f_i^{s} \in F_s^{(c)}} \phi\big(f_i^{s}\big) - \frac{1}{n_t^{(c)}} \sum_{f_j^{t} \in F_t^{(c)}} \phi\big(f_j^{t}\big) \Big\|_{\mathcal{H}}^2$$
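A sketch of the conditional term, reusing mmd2() from the previous sketch: the per-class MMD squared is summed over the classes shared by the two domains. Since both domains are labeled in our setting, true labels are used for the class alignment:

```python
# Class-conditional MMD^2 summed over shared classes (a sketch; reuses mmd2 above).
def conditional_mmd2(f_src, y_src, f_tgt, y_tgt, n_classes, sigma=1.0):
    total = 0.0
    for c in range(n_classes):
        fs, ft = f_src[y_src == c], f_tgt[y_tgt == c]
        if len(fs) > 0 and len(ft) > 0:      # skip classes absent from a mini-batch
            total = total + mmd2(fs, ft, sigma)
    return total
```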

4. Training of the network

Overall, manifold consistency was used to filter data from the source domain; the deep features of the source domain and the target domain were extracted by the CNN; the features were adapted; and the classifier was constructed through the fully connected layer. After network construction, the network parameters need to be trained with respect to the following aspects: (1) structural risk minimization; (2) marginal distribution adaptation; (3) conditional distribution adaptation. The network structure of the proposed model is shown in Figure 4. The joint optimization formula is as follows:

$$\min_{\Theta} \; \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big) + \lambda J(f) + \mu\, D_M(F_s, F_t) + \gamma\, D_C(F_s, F_t)$$

where $\mu$ and $\gamma$ weight the marginal and conditional adaptation terms.
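One training step under this objective might look as follows. This is a sketch reusing the helpers sketched earlier (FeatureExtractor, structural_risk, mmd2, conditional_mmd2); a single shared network stands in for the pair of domain-specific extractors, and the weights mu and gamma are assumptions:

```python
# Hedged sketch of one joint-optimization step; weights and optimizer are assumptions.
def train_step(model, opt, xs, ys, xt, yt, lam=1e-3, mu=1.0, gamma=1.0):
    f_s, logits_s = model(xs)                 # source features and predictions
    f_t, logits_t = model(xt)                 # target features and predictions
    loss = (structural_risk(logits_s, ys, model, lam)
            + structural_risk(logits_t, yt, model, lam)
            + mu * mmd2(f_s, f_t)                             # marginal adaptation
            + gamma * conditional_mmd2(f_s, ys, f_t, yt, 4))  # conditional adaptation
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```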

Due to the addition of the adaptation function in the feature layer, the parameter training and backpropagation of the entire convolutional neural network are affected. For a plain convolutional neural network used for fault diagnosis, the error comes only from the difference between the predicted label and the real label. However, when the joint distribution adaptation function is added to the feature layer, the adaptation error of the feature layer also affects the parameter training of the whole network. Therefore, to show how the network is trained, the error backpropagation formulas of the network are derived in this paper.

4.1. Error backpropagation in the full connection layer

The error of the output layer comes from the difference between the predicted label and the real label, which is usually expressed by the two-norm of the difference between the two labels. The formula is as follows:

$$J(W, b) = \frac{1}{2} \big\| a^{L} - y \big\|_2^2$$

where $a^{L}$ is the output of the network and $y$ is the label of the data set. According to the functional relationship, $a^{L}$ can be expressed as $a^{L} = \sigma\big(z^{L}\big) = \sigma\big(W^{L} a^{L-1} + b^{L}\big)$, so we can get:

$$\delta^{L} = \frac{\partial J}{\partial z^{L}} = \big(a^{L} - y\big) \odot \sigma'\big(z^{L}\big)$$

where $\odot$ is the Hadamard product, $\sigma$ is the activation function, and $W^{L}$ and $b^{L}$ are the weights and bias of the output layer.

In this paper, a single fully connected layer is used as the feature classifier. To enhance the extensibility of the algorithm, the more general case of multiple fully connected layers is considered in the derivation of the backpropagation formula. Suppose the error propagation variable of layer $l$ is:

$$\delta^{l} = \frac{\partial J(W, b)}{\partial z^{l}}$$

If we can figure out $\delta^{l}$, then according to the formula of layer $l$, $z^{l} = W^{l} a^{l-1} + b^{l}$, the gradient formulas of layer $l$ can be obtained as follows:

$$\frac{\partial J}{\partial W^{l}} = \delta^{l} \big(a^{l-1}\big)^{T}, \qquad \frac{\partial J}{\partial b^{l}} = \delta^{l}$$

So, the whole point of the problem is to figure out $\delta^{l}$. In this paper, $\delta^{l}$ is deduced by mathematical induction. For the output layer, $\delta^{L} = \big(a^{L} - y\big) \odot \sigma'\big(z^{L}\big)$. For layer $l$, $\delta^{l}$ can be figured out from $\delta^{l+1}$ of layer $l+1$; the formula is as follows:

$$\delta^{l} = \big(W^{l+1}\big)^{T} \delta^{l+1} \odot \sigma'\big(z^{l}\big)$$

Therefore, $\delta^{l}$ of each layer can be obtained by continuous backward recursion from the output layer, and the update formulas for the weights and biases of each layer can be calculated as:

$$W^{l} \leftarrow W^{l} - \alpha\, \delta^{l} \big(a^{l-1}\big)^{T}, \qquad b^{l} \leftarrow b^{l} - \alpha\, \delta^{l}$$

where $\alpha$ is the learning rate.
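For illustration, the recursion and update rules above can be written out in a few lines of numpy. This is a minimal sketch assuming sigmoid activations; the layer bookkeeping is an assumption of the sketch, not of the paper:

```python
# Fully connected backprop sketch: delta recursion plus gradient steps on W and b.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(Ws, bs, zs, acts, y, lr=0.1):
    """Ws, bs: per-layer weights/biases; zs: pre-activations; acts: activations (acts[0] = input)."""
    L = len(Ws)
    delta = (acts[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))  # output-layer delta
    for l in range(L - 1, -1, -1):
        gW, gb = np.outer(delta, acts[l]), delta                      # gradients of layer l
        if l > 0:                                                     # recurse before updating W
            delta = (Ws[l].T @ delta) * sigmoid(zs[l - 1]) * (1 - sigmoid(zs[l - 1]))
        Ws[l] -= lr * gW
        bs[l] -= lr * gb
```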

4.2. Error backpropagation in pooling layer

There is no need to optimize and update $W$ and $b$ in the pooling process. However, during backpropagation, the error changes in the pooling layer. Similar to the fully connected layer, we still use $\delta^{l}$ as the bridge to calculate the backpropagation formula for the pooling layer.

For the pooling layer, during backpropagation, we first restore all of the submatrices of $\delta^{l}$ to their pre-pooling sizes. If the pooling operation is Max, the value of each pooled locality of every submatrix of $\delta^{l}$ is placed at the position where the forward propagation obtained the maximum value. If the pooling operation is Average, the value of each pooled locality of every submatrix of $\delta^{l}$ is averaged and placed over the restored submatrix positions. This process is usually called Upsample, and the formula is as follows:

$$\delta^{l-1} = \mathrm{upsample}\big(\delta^{l}\big) \odot \sigma'\big(z^{l-1}\big)$$

where the upsample function completes the logic of enlarging the pooled error matrix and redistributing the error.
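A sketch of the Upsample step for max pooling, assuming the flat indices of the maxima were saved during forward propagation (the names are illustrative):

```python
# Max-pooling Upsample sketch: route pooled errors back to the max positions.
import numpy as np

def upsample_max(delta_pooled, argmax_idx, pre_pool_shape):
    """delta_pooled: pooled-layer error; argmax_idx: flat indices saved in the forward pass."""
    delta = np.zeros(pre_pool_shape).ravel()
    delta[argmax_idx.ravel()] = delta_pooled.ravel()  # scatter errors to the max positions
    return delta.reshape(pre_pool_shape)
```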

4.3. Error backpropagation in the convolution layer

Based on the analysis of the fully connected layer, the recursion relation of $\delta^{l}$ in the convolution layer is:

$$\delta^{l} = \delta^{l+1} * \mathrm{rot180}\big(W^{l+1}\big) \odot \sigma'\big(z^{l}\big)$$

where $*$ denotes the convolution operation and $\mathrm{rot180}(\cdot)$ rotates the kernel by 180 degrees.

The key to the problem is to solve for $\delta^{l}$. According to the matrix relation $z^{l+1} = a^{l} * W^{l+1} + b^{l+1}$, the recursion above can be obtained. According to the matrix relation $z^{l} = a^{l-1} * W^{l} + b^{l}$, the gradients of $W$ and $b$ can be obtained as follows:

$$\frac{\partial J}{\partial W^{l}} = a^{l-1} * \delta^{l}, \qquad \frac{\partial J}{\partial b^{l}} = \sum_{u,v} \big(\delta^{l}\big)_{u,v}$$

Thus, the update formulas for the weights and biases of each convolution layer can be calculated as follows:

$$W^{l} \leftarrow W^{l} - \alpha\, a^{l-1} * \delta^{l}, \qquad b^{l} \leftarrow b^{l} - \alpha \sum_{u,v} \big(\delta^{l}\big)_{u,v}$$

4.4. Error backpropagation in feature adaption layer

For the model, the error of the feature adaption layer will affect both the source domain and the target domain. Taking the target domain as an example, we calculated the gradient update error of the feature adaption layer.

In the feature adaption layer, the output is set as $F_t = \sigma\big(z^{A}\big)$, where $A$ denotes the feature adaption layer. In the process of backpropagation, the error of this layer has two sources: one is the error gradient returned by the next layer of the network, and the other is the maximum mean discrepancy between the features $F_s$ and $F_t$. For the feature adaption layer, the error can be expressed as:

$$\delta^{A} = \big(W^{A+1}\big)^{T} \delta^{A+1} \odot \sigma'\big(z^{A}\big) + \beta\, \frac{\partial\, \mathrm{MMD}^2\big(F_s, F_t\big)}{\partial z^{A}}$$

where $\beta$ weights the adaptation error.

The gradients of $W$ and $b$ can be obtained as follows:

$$\frac{\partial J}{\partial W^{A}} = \delta^{A} \big(a^{A-1}\big)^{T}, \qquad \frac{\partial J}{\partial b^{A}} = \delta^{A}$$

Thus, the update formulas for the weights and bias of the feature adaption layer can be calculated as follows:

$$W^{A} \leftarrow W^{A} - \alpha\, \delta^{A} \big(a^{A-1}\big)^{T}, \qquad b^{A} \leftarrow b^{A} - \alpha\, \delta^{A}$$

After updating $W$ and $b$ of the feature adaption layer, it is also necessary to deduce how the errors of this layer are transferred to the previous layer. Referring to the backpropagation method of the convolution layer, the error propagation variable of the feature adaption layer is defined as:

$$\delta^{A} = \frac{\partial J(W, b)}{\partial z^{A}}$$

According to the matrix relation between the two layers, $z^{A} = W^{A} a^{A-1} + b^{A}$, we can figure out:

$$\frac{\partial z^{A}}{\partial a^{A-1}} = \big(W^{A}\big)^{T}$$

Thus, the recursive relation of the feature adaption layer can be obtained as follows:

$$\delta^{A-1} = \big(W^{A}\big)^{T} \delta^{A} \odot \sigma'\big(z^{A-1}\big)$$

5. Construction of testbed and application of the model

To verify the effectiveness of the method, an airborne fuel pump test platform was built to obtain fault test data of the airborne fuel pump. Firstly, based on FMECA analysis of the statistical data of the airborne fuel pump over recent years, the common fault modes of the airborne fuel pump were obtained to guide the test. Secondly, based on the analysis of the physical model of the airborne fuel pump, fault injection tests were carried out on the testbed to collect fault data of the relevant fault modes of the airborne fuel pump. After obtaining the data, the bearing fault data of Case Western Reserve University combined with our experimental data were used to carry out the transfer learning research.

5.1. FMECA of the airborne fuel pump

The airborne fuel pump is a core component of the fuel system, responsible for the fuel supply and fuel transfer of the aircraft. The structure of the airborne fuel pump is shown in Figure 5. Determining the degradation law of the airborne fuel pump is the basis for its life prediction: the time sequence of the degradation data is the key to predicting the trend of breakdown and then estimating the life span of the airborne fuel pump [36, 37].

Through Failure Mode, Effects, and Criticality Analysis (FMECA) of the airborne fuel pump, six typical faults were selected: blade damage, diffusion pipe damage, leakage, diffusion pipe and impeller rub, pump port and impeller rub, and bearing wear, as shown in Figure 6. Further analysis of the working principle and failure mechanism showed that when the fuel pump fails or its performance declines, the vibration signal of the shell becomes abnormal. However, at military airfields, the pump is usually dismantled or returned to the factory for maintenance, without effective data monitoring and recording measures. Thus, it is difficult to quickly locate the fault, resulting in a reduced maintenance support level and a waste of airborne equipment. Therefore, we selected the vibration signal of the airborne fuel pump as the monitoring signal and carried out time-frequency analysis and statistical characteristics analysis to extract the fault features [38]. Our goal is to realize intelligent and effective diagnosis of the airborne fuel pump.

5.2. Construction of airborne fuel pump testbed

A centrifugal AC electric pump provided by the Nanjing Engineering Institute Centre is selected as the experimental object, as shown in Figure 7. This type of fuel pump is mainly used for the thermal subsystem and oil tanks. The fuel pump uses aviation fuel RP-3 as the working medium, whose temperature ranges from minus 60°C to 85°C. The other working parameters are shown in Table 1.

The experimental platform of the airborne fuel pump is shown in Figure 8. The platform mainly includes an oil storage tank, an oil feeding tank, the centrifugal test fuel pump, an electric diaphragm pump, an air-cooled radiator, pressure transducers, flow transducers, temperature transducers, liquid level transducers, data acquisition equipment, etc. In the main loop, the fuel pump pumps oil from the oil feeding tank to the oil storage tank; for cycling, the oil in the storage tank returns to the feeding tank by gravity through the valve [39]. Through this cycling, the working environment of the test pump remains stable. In the second loop, an electric diaphragm pump is used to ensure the uniform distribution of particles in the impurity experiment, and an air-cooled radiator is used to cool the oil and keep its temperature near room temperature. As shown in Figure 9, three vibration sensors are adopted to monitor the vibration of the airborne fuel pump in the oil feeding tank; the vibration sensors are installed via magnetic suction seats at three mutually perpendicular positions on the motor housing.

When testing, open the valve, fill the oil storage tank with fuel, connect the pump power supply, and make sure the pump works continuously. When the pump runs stably, collect the pump vibration signal, the outlet pressure signal, and the outlet flow signal. After signal collection, close the power supply and the valve. Then, in the same manner as for the normal fuel pump, the signals of the six typical faults are obtained by replacing different fault parts; the typical fault parts are shown in Figure 6. As shown in Table 2, vibration and pressure signals under the normal state and 14 kinds of fault states were measured in the experiment. Each group of data contained 4 channels, with a sampling frequency of 6000 Hz and a sampling time of 5 s per channel.

5.3. Model application in the airborne fuel pump

In this paper, bearing fault data from Case Western Reserve University are selected as the source domain of transfer learning. The bearing faults in the Case Western Reserve University dataset are mainly caused by bearing wear, with wear degrees of 0.1778 mm, 0.3556 mm, 0.5334 mm, and 0.7112 mm. For the target domain of the airborne fuel pump, single impeller blade damage, diffusion tube damage, leakage, and bearing wear of 0.02 mm were selected. There are 4 types of 3496 (vector number) × 246 (vector length) fault data available for the airborne fuel pump, and only 2 × 246 data are selected for each type of fault. The remaining large amount of airborne fuel pump fault data is used as validation data for the transfer learning effect. In other words, the network trained on 2 sets of data per fault type was verified on 3494 sets of validation data to judge the diagnostic accuracy of the network. After pre-processing, the small target samples combined with their labels are available for training.

For the proposed model, hyper-parameters such as the learning rate, the regularization parameter, and the cost function need to be determined. In this research, we adopted randomized search to seek the optimal hyper-parameters. Nevertheless, the proposed model is so complex that the hyper-parameter optimization process would cost too much time, so we sampled a smaller dataset to feed into the model. Although the accuracy of the hyper-parameter search decreased, the time cost was greatly reduced. In this way, the hyper-parameters of the model were determined [40].
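A minimal sketch of this randomized search on a sub-sampled dataset follows; the search space, trial count, and the helper evaluate_on_subsample (train briefly on the subsample and return validation accuracy) are hypothetical:

```python
# Randomized hyper-parameter search sketch; space, trials, and helper are hypothetical.
import random

space = {"lr": [1e-4, 1e-3, 1e-2], "lam": [1e-4, 1e-3], "mu": [0.1, 1.0], "gamma": [0.1, 1.0]}

best_cfg, best_acc = None, 0.0
for _ in range(20):                       # number of random trials (assumed)
    cfg = {k: random.choice(v) for k, v in space.items()}
    acc = evaluate_on_subsample(cfg)      # hypothetical helper: brief training on a
    if acc > best_acc:                    # subsample, scored on held-out validation data
        best_cfg, best_acc = cfg, acc
```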

To test the efficiency of the proposed model, fault diagnosis without transfer learning was first carried out on the small sample fault data of the airborne fuel pump, and the confusion matrix of the diagnosis results is shown in Figure 10. Then the feature transfer model was used: the bearing fault data from Case Western Reserve University were taken as the source domain, and the small sample fault data of the airborne fuel pump were taken as the target domain. The confusion matrix of the transfer diagnosis results is also shown in Figure 10. According to the results, the fault diagnosis accuracy for each type of airborne fuel pump fault is improved by 28.25% on average through feature transfer learning.

Some other classical transfer learning algorithms, such as ResNet-50, DANN, ADDA, JAN, MADA, CBST, CAN, CDAN+E, DM-ADA, 3CATN, and ALDA, are chosen for comparison. The diagnosis accuracies are given in Table 3. The results show that the diagnostic accuracy of the proposed model is remarkably higher than that of the other algorithms. To further explore the capacity of the proposed model to filter transferable data, several data filtering methods, such as PCA, K-Means, DBSCAN, GMM, and BIRCH, are selected for comparison. The diagnosis accuracies are given in Table 4; the proposed model scores higher than models with the other data filtering methods. The results show that the proposed model is more capable of filtering available data in the source domain, which further indicates that the proposed algorithm may prevent negative transfer to some extent.

5.4. Validation of feature transfer

To further verify the effectiveness of the model, this paper uses feature visualization to show the learning effect of feature transfer. Since the features extracted from the network are high-dimensional and cannot be directly visualized, the t-Distributed Stochastic Neighbour Embedding (t-SNE) method is adopted to visualize the high-dimensional data.

t-SNE is often used to visualize high-dimensional data; its main advantage is the ability to preserve local structure. The t-SNE algorithm models the distribution of the nearest neighbours of each data point: the high-dimensional space is modeled as a Gaussian distribution, while the two-dimensional output space is modeled as a t-distribution. The goal is to find a transformation that maps the high-dimensional space to a two-dimensional space while minimizing the gap between these two distributions over all points [41].
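A minimal t-SNE sketch with scikit-learn; the feature matrix, labels, and perplexity below are placeholders rather than our experimental values:

```python
# t-SNE visualization sketch; features and labels are random placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(400, 128)    # stand-in for CNN features (N x d)
labels = np.random.randint(0, 4, 400)   # stand-in for the 4 fault categories
emb = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
plt.show()
```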

For the network without transfer learning, the convolutional neural network for feature extraction was trained on 2 sets of data, and 3494 sets of data were used as validation data. Features were extracted from the validation data by the convolutional neural network and visualized using t-SNE; the results are shown in Figure 11(a). Similarly, for the model with transfer learning, the features extracted from the validation data were visualized using t-SNE, as shown in Figure 11(b). As can be seen from the figure, for the network without feature transfer, the boundaries of feature categories 1, 2, and 4 extracted from the validation data are not clear, which is not conducive to the subsequent feature classification and diagnosis. For the network with feature transfer, the four categories of features extracted from the validation data have clear boundaries, which makes classification and diagnosis easy. The effectiveness of the model for feature transfer is thus further demonstrated by the feature visualization.

6. Conclusions

In this research, we proposed a feature transfer scenario that transfers knowledge from similar fields to enhance the accuracy of fault diagnosis with small samples. To reduce the redundant information, data were filtered according to manifold consistency. Then, features were extracted based on a CNN and feature transfer was conducted. For adequate fitness, the joint adaptation of conditional distribution and marginal distribution was used between the two domains. Minimum structural risk and the MMD of adaptation were two weighted indicators for training the model. To test the efficiency of the model, we built an airborne fuel pump testbed and contributed a new dataset that contains 15 categories of fault data, which serves as the small sample dataset in this research. Then the proposed model was applied to our experimental data. As a result, the fault diagnosis rate increases by 28.6% through our proposed model, which is more precise than other classical methods. The results of feature visualization further demonstrate that the features are more distinguishable through the proposed method. All code and data are accessible on GitHub (see Data Availability).

Data Availability

The data and code of this research are accessible at https://github.com/ppqweasd/Diagnosis-for-High-reliability-Equipment-with-Small-Sample-Based-on-Transfer-Learning-A-General-Fra.

Conflicts of Interest

The authors declare that they have no conflicts of interest.