Abstract

Early detection of fault events during the operation of electromechanical systems is one of the most attractive and critical data challenges in modern industry. Although these systems tend to experience typical faults, unexpected and unknown faults can also arise during operation. However, current models for automatic detection can learn new faults only at the cost of forgetting previously learned concepts. This article presents a multiclass incremental learning (MCIL) framework based on a 1D convolutional neural network (CNN) for fault detection in induction motors. The presented framework tackles the forgetting problem by storing a representative exemplar set from past data (known faults) in memory. Then, the 1D CNN is fine-tuned over the selected exemplar set and data from new faults. Test samples are classified using a nearest centroid classifier (NCC) in the feature space of the 1D CNN. The proposed framework was evaluated and validated over two public datasets for fault detection in induction motors (IMs): asynchronous motor common fault (AMCF) and Case Western Reserve University (CWRU). Experimental results show that the proposed framework is an effective solution to incorporate and detect new induction motor faults alongside already known ones, with high accuracy across different incremental phases.

1. Introduction

IMs support most production processes in modern industry due to their straightforward construction, reliability, and relatively low cost. However, IMs operate for long uninterrupted working periods, are exposed to the elements, and receive minimal preventive maintenance. These operating conditions give rise to unexpected faults that can show up at any time, causing lower productivity and economic losses. Thus, early detection and correction of motor failures are challenging problems that have caught many researchers' attention.

From a general overview, motor fault analysis methods split into signal processing and artificial intelligence approaches [1]. The former have focused on analyzing diverse physical magnitudes to find features that help identify abnormal behavior in the motor's performance [2, 3]. Examples include rotor vibrations [4], bearing faults [5], and broken rotor bars [2]. Meanwhile, artificial intelligence-based methods have been integrated to provide automatic fault detection using a data-driven approach. These methods base their performance on features extracted from raw signals, which are used as inputs. In past years, deep learning (DL) architectures, such as autoencoders (AE) [6], convolutional neural networks (CNN) [5, 7, 8], and capsule networks (CapsNet) [1], have been used in fault diagnosis due to their potential for automatic feature extraction, reporting new state-of-the-art results in several cases. In the literature, most works combine DL architectures with different handcrafted features and feature extractors (e.g., Fourier and Wavelet transforms) [8]. Recently, some authors [9–12] have shown promising advances toward eliminating the need for handcrafted features, where CNN architectures have demonstrated high effectiveness. Despite this progress, classification models have focused on detecting a set of known patterns that characterize typical faults on equipment from manufacturers. However, modifications in the operating conditions can generate patterns from new failures that differ from those detected by the current model. This issue forces existing methods to learn a new model that considers the unknown failure conditions.

To overcome the practical challenge mentioned above, multiclass incremental learning (MCIL) arises as a promising solution by updating the current model on new data instead of training once on a whole dataset. Indeed, MCIL aims to learn new classes in addition to previous ones, although none or only a few samples of old classes are retained. Unlike the conventional classification setting, in MCIL, samples from different classes arrive in different time phases, and incremental classifiers aim to achieve a competitive performance over all seen classes [13]. Only a few works based on traditional approaches have addressed multiclass incremental learning. For example, Saucedo-Dorantes et al. [14] trained a self-organizing map (SOM) every time a new detection occurs. However, this model does not retain samples from previous classes, and the complexity of the model increases when new faults are incorporated. Incremental model transfer learning (IMTL) [15] follows a domain adaptation approach to allow a classification model to detect new faults but requires all samples from past faults during the subsequent incremental phases to achieve high performance. Overall, these works are still limited because they depend on an engineered data representation. In this direction, deep learning approaches have certain advantages because they learn task-specific features and classifiers from raw signals. However, deep learning models can suffer from catastrophic forgetting [16] when they are trained incrementally, i.e., the tendency of a neural network to underfit past classes when new ones are learned.

This study presents an MCIL framework based on a 1D CNN for fault detection in IMs. To tackle the catastrophic forgetting problem, the presented framework employs a memory containing representative exemplars from past data and updates a 1D CNN model across incremental states, using a fine-tuning procedure [16]. The representative exemplars from past (known) faults are selected using the Herding method [17]. Next, a nearest centroid classifier (NCC) is used to classify test samples in each incremental phase. By doing this, the proposed framework maintains a constant model complexity as new classes appear. We evaluated and validated the presented model over two different study cases: (1) motor common fault diagnosis and (2) bearing fault diagnosis. Experimental results show that the proposed MCIL framework effectively incorporates new faults into a 1D CNN, achieving high accuracy across different incremental phases.

2. Convolutional Neural Network

A convolutional neural network (CNN) is a biologically inspired artificial neural network that processes data with a known grid-like topology [18]. A CNN alternates convolution and pooling layers, followed by a fully connected layer, to extract features and generate the desired output. Because vibration analysis in IMs yields inherently one-dimensional signals, these signals are preferably handled with one-dimensional models [9, 11]. Thus, we first describe the 1D convolution operators used in the presented work. Then, we describe the complementary layers that make up a convolutional network.

In its standard form, a CNN performs a set of convolutions between an input signal and some finite impulse response (FIR) filters. The convolution operation (\(*\)) is described as a weighted average of an input signal \(x[n]\):

\[ y_k[n] = (x * w_k)[n] = \sum_{l=0}^{L-1} w_k[l]\, x[n-l], \]

where \(y_k\) is called the \(k\)-th feature map and \(w_k\) denotes a weighting factor, called filter or kernel, with length \(L\). The kernels are built to identify spatial features in the input data. The output from a convolutional layer defines the next layer's activation value. Then, the output of a convolutional layer at the \(k\)-th feature map is defined as follows:

\[ x_k^{(j)} = f\left( \sum_{i=1}^{N} w_{ik}^{(j)} * x_i^{(j-1)} + b_k^{(j)} \right), \]

where \(w_{ik}^{(j)}\) denotes each local weighting factor of the kernel, \(x_i^{(j-1)}\) represents the \(i\)-th feature map at the layer \(j-1\), \(N\) is the number of filters applied over \(x^{(j-1)}\), \(b_k^{(j)}\) is the bias, and \(f\) is the activation function.
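
As a minimal sketch of the layer output described above, the following pure-Python code convolves each input feature map with its kernel, sums the results, adds the bias, and applies an activation. The function names and the choice of ReLU as activation are assumptions made for illustration; the text does not specify the activation function.

```python
def conv1d_valid(x, w):
    """Discrete convolution y[n] = sum_l w[l] * x[n - l], keeping only the
    positions where the kernel fully overlaps the signal ('valid' part)."""
    L = len(w)
    return [sum(w[l] * x[n + L - 1 - l] for l in range(L))
            for n in range(len(x) - L + 1)]


def relu(values):
    # Assumed activation; the source only refers to a generic activation f.
    return [max(0.0, v) for v in values]


def conv_layer(feature_maps, kernels, biases):
    """feature_maps: list of N input maps of equal length.
    kernels[k][i]: kernel linking input map i to output map k.
    biases[k]: bias of output map k."""
    outputs = []
    for kernel_row, bias in zip(kernels, biases):
        partial = [conv1d_valid(x_i, w_ik)
                   for x_i, w_ik in zip(feature_maps, kernel_row)]
        summed = [sum(vals) + bias for vals in zip(*partial)]
        outputs.append(relu(summed))
    return outputs


# One input map, one output map, kernel of length 3.
print(conv_layer([[1.0, 2.0, 3.0, 4.0]], [[[1.0, 0.0, -1.0]]], [0.5]))
```

Deep learning libraries such as PyTorch implement this layer as a cross-correlation (the kernel is not flipped); since the kernel is learned, the two conventions are interchangeable in practice.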

Most of the time, raw data contain noise and undesirable spectral shapes that affect the feature extraction process [19]. Motivated by this issue, the SincNet layer [9, 20], an extension of the standard convolution, applies a set of temporal convolutions between a raw signal and digital filters to boost the first convolutional layer output.

2.1. Sinc Convolution

Instead of learning the filters from the data, as the conventional CNN does, SincNet [9, 20] performs the convolution operation with a preset function \(g\) that requires only a reduced set of learnable parameters \(\theta\), as defined in the following equation:

\[ y[n] = x[n] * g[n, \theta], \]

where \(g\) is a filter bank of band-pass filters defined in the frequency domain; it takes advantage of the Sinc function to convert to the time domain through the inverse Fourier transform [19]. The use of rectangular filters represents a practical selection to define \(g\). The magnitude of a generic band-pass filter can be described as the difference between two low-pass filters:

\[ G[f, f_1, f_2] = \mathrm{rect}\left(\frac{f}{2 f_2}\right) - \mathrm{rect}\left(\frac{f}{2 f_1}\right), \]

where \(\theta = \{f_1, f_2\}\) is the set of trainable parameters; \(f_1\) and \(f_2\) represent the low and high cutoff frequencies, respectively, of the band-pass filter learned by the Sinc filters. \(\mathrm{rect}(t)\) describes the rectangular function at the instant \(t\) as follows:

\[ \mathrm{rect}(t) = \begin{cases} 0, & |t| > 0.5, \\ 0.5, & |t| = 0.5, \\ 1, & |t| < 0.5. \end{cases} \]

Using the inverse Fourier transform, the reference function \(g\) becomes

\[ g[n, f_1, f_2] = 2 f_2\, \mathrm{sinc}(2 \pi f_2 n) - 2 f_1\, \mathrm{sinc}(2 \pi f_1 n), \]

where the Sinc function is defined as \(\mathrm{sinc}(x) = \sin(x)/x\). Finally, to achieve an approximation of the ideal band-pass filter, a windowing procedure is applied. This procedure multiplies the truncated function \(g\) by a window function \(w\) [21], intending to smooth out the abrupt discontinuities at the ends of \(g\):

\[ g_w[n, f_1, f_2] = g[n, f_1, f_2] \cdot w[n]. \]
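
The construction above can be illustrated with a short sketch that builds one windowed Sinc band-pass filter from its two cutoff frequencies. The function name, the use of normalized frequencies (cycles per sample), and the Hamming window are assumptions for illustration; the text only states that a window function is applied.

```python
import math


def sinc(x):
    # sinc(x) = sin(x) / x, with the limit value 1 at x = 0.
    return 1.0 if x == 0.0 else math.sin(x) / x


def sinc_bandpass(f1, f2, length):
    """Windowed band-pass impulse response.

    f1, f2: low and high cutoff frequencies in cycles/sample (0 < f1 < f2 < 0.5).
    length: odd number of filter taps.
    """
    half = (length - 1) // 2
    taps = []
    for i in range(length):
        n = i - half  # time index centered on zero
        g = (2 * f2 * sinc(2 * math.pi * f2 * n)
             - 2 * f1 * sinc(2 * math.pi * f1 * n))
        # Hamming window (assumed choice) to smooth the truncation edges.
        w = 0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
        taps.append(g * w)
    return taps


taps = sinc_bandpass(0.1, 0.2, 101)
# Gain at DC (stopband) and at the center of the passband.
dc_gain = abs(sum(taps))
pass_gain = sum(t * math.cos(2 * math.pi * 0.15 * (i - 50))
                for i, t in enumerate(taps))
print(round(dc_gain, 4), round(pass_gain, 4))
```

Only `f1` and `f2` would be learned per filter in such a layer, which is why the number of trainable parameters does not grow with the filter length.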

Therefore, the succeeding layers learn the filter gain of each actual layer.

2.2. Pooling Layers

These layers perform downsampling to reduce the spatial size of the features, encouraging invariance to spatial translations of the input data. In particular, a max-pooling layer reports the maximum element within a rectangular frame for each feature map. Meanwhile, a global average pooling (GAP) layer replaces the fully connected layers in a CNN model [22], averaging the feature maps from previous convolutional layers. GAP aims to force correspondences between learned feature maps and classes in the previous convolutional layers.
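
Both pooling operations can be sketched in a few lines. The function names are hypothetical, and the default window size and stride of 3 mirror the values reported later in the implementation details.

```python
def max_pool1d(x, size=3, stride=3):
    """Keep the maximum of each window of `size` samples, moving by `stride`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, stride)]


def global_average_pool(feature_maps):
    """Collapse every feature map to its mean, giving one value per map."""
    return [sum(m) / len(m) for m in feature_maps]


print(max_pool1d([1, 5, 2, 0, 3, 9, 4]))             # [5, 9]
print(global_average_pool([[1, 2, 3], [4, 4]]))      # [2.0, 4.0]
```

Because GAP returns one value per feature map regardless of the signal length, the size of the feature vector passed to the classifier depends only on the number of filters in the last convolutional layer.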

3. MCIL Methodology

Let \(\mathcal{X}\) and \(\mathcal{Y}\) denote a feature and a label space, respectively. Let \(D = \{(x_j, y_j)\}_{j=1}^{n}\) be a labeled dataset with \(n\) samples, where \(x_j \in \mathcal{X}\) and \(y_j \in \mathcal{Y}\). In a classification problem, a task consists in learning a labeling function \(f: \mathcal{X} \rightarrow \mathcal{Y}\), such that \(f(x_j) = y_j\). Notice that \(f\) represents a deep neural network with parameters \(\Theta\), so that \(f = f_{\Theta}\). Likewise, \(f\) can be expressed as a composition of two functions, \(f = h \circ \varphi\), where \(\varphi: \mathcal{X} \rightarrow \mathcal{Z}\) is a feature extractor and \(h: \mathcal{Z} \rightarrow \mathcal{Y}\) is a feature labeling function with parameters \(\theta_{\varphi}\) and \(\theta_{h}\), respectively; here, \(\mathcal{Z}\) is a latent feature space. The feature extractor takes \(D\) and produces a latent feature dataset \(Z = \{\varphi(x_j)\}_{j=1}^{n}\). Then, \(h\) receives \(Z\) as input and produces label classifications \(\hat{y}_j\), i.e., \(\hat{y}_j = h(\varphi(x_j))\).

We focus on multiclass incremental learning (MCIL) where the model complexity is kept constant during incremental states, while a reduced number of samples is retained from past classes [13, 23]. We assume \(N + 1\) phases, that is, \(N\) incremental phases and one initial phase \(0\). A model \(\Theta_0\) is learned on a dataset \(D_0\) during the phase \(0\). Because we assume a memory limitation, all samples from \(D_0\) cannot be stored, so exemplars \(E_0\) are selected and stored as a replacement of \(D_0\), with \(|E_0| \ll |D_0|\). In the \(i\)-th incremental phase, a dataset \(D_i\) from new classes is streamed, whereas exemplars \(E_{0:i-1}\) from phases 0 to \(i-1\) are stored in memory. The aim of MCIL is to learn a model \(\Theta_i\) using the exemplars \(E_{0:i-1}\) and the dataset \(D_i\).

Figure 1 shows the flowchart of the MCIL methodology for fault detection in IMs. In the initial phase, a model \(\Theta_0\), that is, a 1D CNN, is trained via cross-entropy loss on the dataset \(D_0\), containing signals from different motor conditions. Next, exemplars \(E_0\) are selected using the Herding method [17] over \(D_0\) in feature space. The nearest centroid classifier (NCC) is used to classify test samples in the current phase using \(E_0\) as training set. In the \(i\)-th incremental phase, the output layer of the CNN is extended with randomly initialized weights for each new class. Then, the 1D CNN is fine-tuned over \(E_{0:i-1}\) and \(D_i\) using cross-entropy loss; notice that the training data \(E_{0:i-1} \cup D_i\) are imbalanced because \(E_{0:i-1}\) contains a reduced set of exemplars from past classes. This procedure updates all parameters of the 1D CNN. The resulting trained model \(\Theta_i\) in phase \(i\) is used to extract features from \(E_{0:i-1}\) and \(D_i\). The Herding method is used to select the exemplar set \(E_i\) over \(D_i\) in feature space. Then, NCC uses \(E_{0:i}\) as training set in feature space to classify test samples. This procedure is repeated over the different incremental phases.
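
To make the imbalance between stored exemplars and new data concrete, the sketch below counts the samples available for fine-tuning in each incremental phase. The helper name and the way classes are distributed over phases (two classes in the initial phase, then one new class per phase) are hypothetical choices for illustration and are not taken from the experimental protocol.

```python
def phase_training_counts(classes_per_phase, train_per_class, exemplars_per_class):
    """For each incremental phase, return a tuple
    (number of stored exemplars from past classes,
     number of training samples from the newly streamed classes)."""
    counts = []
    seen_classes = classes_per_phase[0]  # classes learned in the initial phase
    for new_classes in classes_per_phase[1:]:
        counts.append((seen_classes * exemplars_per_class,
                       new_classes * train_per_class))
        seen_classes += new_classes
    return counts


# Hypothetical setting: 8 classes, 800 training samples per class,
# 5 exemplars per past class, 2 initial classes and 6 incremental phases.
for phase, (old, new) in enumerate(
        phase_training_counts([2, 1, 1, 1, 1, 1, 1], 800, 5), start=1):
    print(f"phase {phase}: {old} exemplars vs. {new} new samples")
```

Even in the last phase of this setting, past classes contribute only a few dozen samples against hundreds for the new class, which is the bias that the nearest centroid classifier is meant to counteract.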

3.1. CNN Architecture

The 1D CNN architecture is shown in Figure 2. 1D time-domain signals are used as inputs to the 1D CNN. One Sinc layer [9, 20] and two standard convolution layers were incorporated into the feature extractor. Conv(\(n\), \(k\), \(s\)) denotes a convolution layer of \(n\) filters with a size \(k\) and a stride \(s\). For the lower layers, large filters were employed to deal with the high frequencies present in the data. We added max-pooling layers to reduce the spatial feature dimensions. Likewise, a global average pooling layer is used to reduce the spatial dimensions of the learned features. The output layer is extended for each new class with random initial values. The softmax activation function is used at the output layer to perform motor fault classification.
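
The extension of the output layer can be sketched as appending one randomly initialized row of weights per new class while leaving the existing rows untouched. The function names, the Gaussian initialization with a small standard deviation, and the zero-initialized biases are assumptions for illustration.

```python
import math
import random


def extend_output_layer(weights, biases, n_new, rng=None, scale=0.01):
    """weights: one row per known class, each of length n_features.
    Returns new lists with n_new extra rows (random weights, zero bias)."""
    rng = rng or random.Random(0)
    n_features = len(weights[0])
    new_rows = [[rng.gauss(0.0, scale) for _ in range(n_features)]
                for _ in range(n_new)]
    return weights + new_rows, biases + [0.0] * n_new


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


weights, biases = [[0.1, 0.2], [0.3, 0.4]], [0.0, 0.0]
weights, biases = extend_output_layer(weights, biases, n_new=1)
features = [1.0, 2.0]
logits = [sum(w * f for w, f in zip(row, features)) + b
          for row, b in zip(weights, biases)]
print(softmax(logits))  # three class probabilities summing to 1
```

Keeping the old rows preserves what the classifier already knows about past faults at the start of fine-tuning; only the new rows start from scratch.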

3.2. Exemplar Set Selection

The exemplar set is adjusted in each incremental phase using the Herding method [17], as shown in Algorithm 1. Exemplar selection is performed while the training data of a class are still available. The feature representation of the dataset is obtained using the feature extractor (line 2). Each sample is normalized employing the L2 norm (line 3). Notice that exemplars are selected and stored iteratively for each class (lines 5–7). One sample is added to the exemplar set in each iteration, prioritizing the sample that makes the average feature vector of the selected exemplars best approximate the mean feature vector of the class.

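Algorithm 1: Exemplar selection using the Herding method.
Inputs: \(X^c = \{x_1, \ldots, x_n\}\): dataset of class \(c\); \(m\): number of exemplars to select; \(\varphi\): feature extractor
Output: \(E^c\): exemplar set
(1) Initialize \(E^c\) to the empty set {}
(2) \(v_j \leftarrow \varphi(x_j)\) for \(j = 1, \ldots, n\) \(⊳\) Get feature representations
(3) \(v_j \leftarrow v_j / \lVert v_j \rVert_2\) for \(j = 1, \ldots, n\) \(⊳\) L2 norm
(4) \(\mu \leftarrow \frac{1}{n} \sum_{j=1}^{n} v_j\) \(⊳\) Get mean feature vector
(5) for \(k = 1, \ldots, m\) do
(6) \(p_k \leftarrow \arg\min_{j} \left\lVert \mu - \frac{1}{k} \left[ v_j + \sum_{i=1}^{k-1} v_{p_i} \right] \right\rVert\)
(7) end for
(8) \(E^c \leftarrow \{x_{p_1}, \ldots, x_{p_m}\}\)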
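
The Herding steps listed above can be sketched as follows, operating on feature vectors that have already been extracted. The function names are hypothetical, and selecting each sample at most once (without replacement) is an implementation assumption rather than something stated in the algorithm.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def herding_select(features, m):
    """Return the indices of m exemplars of one class, chosen greedily so
    that the mean of the selected features tracks the class mean."""
    feats = [l2_normalize(f) for f in features]
    dim = len(feats[0])
    mu = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
    chosen, running_sum = [], [0.0] * dim
    for k in range(1, min(m, len(feats)) + 1):
        best, best_dist = None, float("inf")
        for j, f in enumerate(feats):
            if j in chosen:  # assumption: each sample is picked at most once
                continue
            # Squared distance between the class mean and the mean of the
            # exemplars chosen so far plus candidate j.
            dist = sum((mu[d] - (running_sum[d] + f[d]) / k) ** 2
                       for d in range(dim))
            if dist < best_dist:
                best, best_dist = j, dist
        chosen.append(best)
        running_sum = [running_sum[d] + feats[best][d] for d in range(dim)]
    return chosen


features = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [0.9848, 0.1736]]
print(herding_select(features, 2))
```

The first index returned is the sample closest to the class mean, so even a memory of one exemplar per fault stores a representative sample rather than an arbitrary one.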
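
Algorithm 2: Classification using the nearest centroid classifier.
Inputs: \(E_{0:i}\): exemplar set from phase 0 to \(i\); \(\varphi\): feature extractor; \(x\): sample to be classified; \(t\): number of classes
Output: \(y^*\): one hot vector of the class label
(1) \(V^c \leftarrow \{\varphi(p) : p \in E^c\}\) for each class \(c\) \(⊳\) Get feature representations
(2) for \(c = 1, \ldots, t\) do
(3) \(\mu_c \leftarrow \frac{1}{|V^c|} \sum_{v \in V^c} v\)
(4) end for
(5) \(y^* \leftarrow \arg\min_{c = 1, \ldots, t} \lVert \varphi(x) - \mu_c \rVert\) \(⊳\) nearest prototype
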
3.3. Nearest Centroid Classifier

The nearest centroid classifier (NCC) [24] is a nearest-neighbor classifier, which is used to address the bias toward new classes produced by training the CNN over imbalanced data. The procedure followed by NCC is described in Algorithm 2. First, feature representations of the exemplars are obtained using the feature extractor (line 1). The centroid of each class is computed as the mean of the exemplar features of that class, that is, the point that minimizes the sum of squared distances to all exemplars belonging to the class (lines 2–4). NCC assigns the label of the most similar class centroid to the test sample (line 5) as follows:

\[ y^* = \underset{c = 1, \ldots, t}{\arg\min} \; \lVert \varphi(x) - \mu_c \rVert, \]

where \(\mu_c\) is the centroid vector for the class \(c\), obtained from the exemplars of that class; meanwhile, \(\lVert \cdot \rVert\) is the Euclidean distance.
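
A minimal sketch of this classification rule is given below, assuming that exemplar and test features have already been extracted. The function names and class labels are hypothetical.

```python
import math


def class_centroids(exemplar_features):
    """exemplar_features: dict mapping class label -> list of feature vectors.
    Returns a dict mapping class label -> mean feature vector."""
    centroids = {}
    for label, feats in exemplar_features.items():
        dim = len(feats[0])
        centroids[label] = [sum(f[d] for f in feats) / len(feats)
                            for d in range(dim)]
    return centroids


def ncc_predict(feature, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda c: math.dist(feature, centroids[c]))


centroids = class_centroids({
    "healthy": [[0.0, 0.0], [0.0, 2.0]],
    "bearing": [[4.0, 4.0], [6.0, 4.0]],
})
print(ncc_predict([1.0, 1.0], centroids))  # healthy
print(ncc_predict([4.0, 3.0], centroids))  # bearing
```

Because every class is represented by a single centroid computed from the same number of exemplars, the decision does not depend on how many training samples each class contributed during fine-tuning. The `NearestCentroid` estimator in scikit-learn follows the same rule.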

4. Experimental Setup

This section first describes data from the different cases of study used in multiclass incremental learning for motor fault diagnosis. Next, the experimental protocol is described. Finally, we present the implementation details of the MCIL model.

4.1. Cases of Study

Our experiments were conducted on two cases of study based on vibration analysis: (1) motor common fault diagnosis and (2) bearing fault diagnosis. For this, we used two public benchmark datasets: asynchronous motor common fault (AMCF) [1] and Case Western Reserve University (CWRU) [25]. Tables 1 and 2 present the description of the data acquisition and studied faults for the AMCF and CWRU datasets, respectively. The AMCF dataset is composed of 8,000 samples from 8 motor conditions (1,000 per class), where each sample contains 1,024 points. For the CWRU dataset, experiments were performed under a 1 hp workload. This dataset contains three types of bearing fault locations (balls, inner race, and outer race), with fault diameters of 0.007, 0.014, and 0.021 inches. CWRU contains 10,000 samples from 10 different motor conditions (1,000 per class), including healthy bearings.

4.2. Experimental Protocol

For each dataset, we evaluated the proposed MCIL model starting from a CNN pretrained over initial data of motor faults; meanwhile, the rest of the data, arriving in different phases, are used to train the CNN in a class-incremental way. First, we fix the number of stored exemplars to the smallest memory size allowed and vary the number of incremental phases. Next, we fix the number of incremental phases to 6 and 8 for AMCF and CWRU, respectively, while the number of exemplars per fault is varied considering 5 and 10. In each incremental phase, faults are given in a fixed random order; 80% of the data samples in each class are used for training, and the remaining 20% for testing, using stratified sampling. The final model in each incremental phase is used to classify the classes observed so far. Experiments were repeated five times using different random initial weights, different partitions of data, and a different fault order. We calculate the average accuracy and standard deviation only for incremental states, which are of interest for MCIL. Our comparison includes the results of a CNN employing all previous data available (Full) and those using a fine-tuning procedure with a random selection of exemplars (FT + R).
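
The 80/20 stratified partition can be sketched as follows. The function name and the fixed seed are hypothetical; the only property taken from the protocol is that every class keeps the same train/test proportion.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_indices, test_indices) so that each class contributes
    the same fraction of its samples to the training set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


labels = ["normal"] * 10 + ["broken_bar"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 24 6
```

Changing `seed` between repetitions gives the different data partitions mentioned in the protocol.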

4.3. Implementation Details and Model Parameter Selection

Table 3 presents the details of the 1D CNN model. We used filters with a large (101), medium (51), and small (11) size to learn features from raw signals. In addition, max-pooling layers were used with a size and stride of 3 to reduce spatial feature dimensions. The 1D CNN model employs a total of 56,281 trainable parameters. The 1D CNN model was implemented in PyTorch 1.7.0, whereas NCC was obtained from the scikit-learn library (https://scikit-learn.org/stable/). Experiments were performed on a PC with an Intel(R) Core(TM) i7 processor and an Nvidia GTX 1080 graphics card running Ubuntu 20.04 LTS.

In our experiments, the 1D CNN model was trained with the Adam algorithm [26] for 40 and 30 epochs on the AMCF and CWRU datasets, respectively. For both datasets, the initial learning rate was set to 0.0001 at the initial phase, whereas it was set to 0.001 for incremental phases. In addition, a learning rate decay of 0.1 was applied at 30 and 20 epochs, respectively. Likewise, a batch size of 30 was selected from {10, 30, 50} for both datasets. This hyperparameter setting was selected after comparing different configurations across 6 and 8 incremental phases on AMCF and CWRU; 5 exemplars from each past class (known faults) were stored in memory. For model parameter selection, we used coordinate descent [27], which changes only one hyperparameter at a time, aiming at the best configuration. Fine-tuning and Herding selection (FT + H) were used for CNN retraining and exemplar selection in each incremental phase. Experimental results, as shown in Table 4, indicate that the batch size has a lower negative impact than the learning rate. For both datasets, the 1D CNN model achieves its highest average accuracy when the learning rate is 0.001 and the batch size is 10 or 30. The latter batch size was selected because it requires fewer iterations for data processing during training. Finally, as shown in Figure 3, the training of the 1D CNN model stabilizes after 20 and 15 epochs on AMCF and CWRU for the different incremental phases. Using 40 and 30 epochs during training, we ensure that the 1D CNN model has stabilized.
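
The learning rate schedule described above amounts to a single step decay. The sketch below assumes 0-indexed epochs and uses a hypothetical function name; the default values correspond to the incremental phases on AMCF.

```python
def learning_rate(epoch, base_lr=0.001, decay_epoch=30, gamma=0.1):
    """Step decay: use base_lr until decay_epoch, then base_lr * gamma."""
    return base_lr * gamma if epoch >= decay_epoch else base_lr


# AMCF-like setting: 40 epochs, decay at epoch 30.
schedule = [learning_rate(e) for e in range(40)]
print(schedule[0], schedule[29], schedule[30])

# CWRU-like setting: 30 epochs, decay at epoch 20.
print(learning_rate(19, decay_epoch=20), learning_rate(20, decay_epoch=20))
```

In PyTorch, the same behavior can be obtained with `torch.optim.lr_scheduler.MultiStepLR` using a single milestone and `gamma=0.1`.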

5. Results

5.1. Case 1: Motor Common Fault Diagnostics

Table 5 shows the average accuracy and standard deviation (SD) on AMCF using a different number of incremental phases and exemplars. We observed that the most challenging scenario arises when one exemplar per fault is retained across different incremental phases. Conversely, the most straightforward scenario arises when a greater number of exemplars per fault is stored (5 and 10). Notice that the performance of FT + R (fine-tuning with a random selection) dropped when one exemplar per fault was retained and the number of incremental phases decreased. In this scenario, the proposed MCIL framework (FT + NCC + H) obtained average accuracies beyond 94%, outperforming FT + R by at least 20 percentage points (pp). Moreover, FT + NCC + H achieved average accuracies of 98.32% and 98.85% over 6 incremental phases with 5 and 10 exemplars, outperforming FT + R by 5.07 and 3.79 pp, respectively.

5.2. Case 2: Bearing Fault Diagnostics

Table 6 shows the average accuracies and standard deviations (SD) on CWRU using a different number of incremental phases and exemplars. The most challenging scenario arises when one exemplar per fault is retained, where the performance of FT + R dropped as the number of phases increased. In this scenario, FT + NCC + H achieved average accuracies beyond 93%, outperforming FT + R by at least 6.27 pp. On the other hand, we observed that FT + NCC + H outperformed FT + R by only 0.48 and 0.39 pp across 8 incremental phases when the number of exemplars is 5 and 10, respectively.

5.3. Ablation Studies
5.3.1. Effect of Each Component

We analyzed the impact of each component to determine its contribution to the final accuracy on AMCF and CWRU. Figure 4 shows the accuracy on AMCF and CWRU during 6 and 8 incremental phases, respectively, when one exemplar is retained in memory. Notice that the accuracy results of FT without memory were also included. For both datasets, we observed that the accuracy of FT decreases over incremental phases if memory is not available, suggesting the presence of the catastrophic forgetting problem. Fine-tuning results improved significantly when memory is incorporated (FT + H), storing representative samples from past faults. Finally, notice that NCC also had a positive impact on the final accuracy (FT + NCC + H) by reducing the bias generated by incorporating new faults.

5.3.2. Effect of the Number of Exemplars

Figure 5 shows the impact on accuracy of varying the number of exemplars per fault. On AMCF, we observed that FT + R improved its results when the number of stored exemplars increased, while FT + NCC + H obtained results above 96% starting from 1 exemplar per class. FT + NCC + H achieved a performance competitive with training on full data (98.32% vs. 99.38%) when storing at least 5 exemplars per fault, while FT + R became competitive only when using more than 20 exemplars.

Regarding the CWRU results, we can see that the worst performance is obtained when the number of stored exemplars per class is 1. Moreover, FT + R and FT + NCC + H increased their accuracy performance starting from 2 samples per fault. FT + NCC + H obtained an average accuracy above 99%, storing at least 3 exemplars per fault, while FT + R achieved the same performance above 5 exemplars.

5.3.3. Effect of the Herding Method for Exemplar Selection

We studied the impact of exemplar selection via Herding and via random selection on the accuracy on AMCF and CWRU. Figure 6 shows the accuracy on AMCF and CWRU across 6 and 8 incremental phases, respectively, while a different number of exemplars from past faults is retained. For AMCF, we observed that the Herding method (marked as + H) slightly improves the accuracy over random selection (marked as + R) when 1 to 10 exemplars are retained. For CWRU, improvements can be seen only when 2 to 5 exemplars are retained. For the rest of the cases, random selection obtained a similar or even better performance than the Herding method.

5.3.4. Effect of Noise over the Proposed Framework

To test the performance of the proposed framework under different noise conditions, we applied additive white Gaussian noise (AWGN) to the raw signals from the test set; 6 and 8 incremental phases were used on AMCF and CWRU, respectively, retaining 5 exemplars from past classes. Table 7 presents the classification results of FT + NCC + H under three different noise levels; accuracy results of FT + H (CNN trained incrementally) and the full model (CNN using all data) were included as reference. As expected, the average accuracy of the evaluated solutions decreased when a lower signal-to-noise ratio (SNR), that is, a higher noise level, is applied. However, FT + NCC + H obtained average accuracies beyond 92% and 94% when an SNR of 5 is applied to signals from AMCF and CWRU, respectively. For AMCF and CWRU, FT + NCC + H obtained the best average accuracies when the SNR is 5 and 10, while it obtained an accuracy similar to that of the full model when the SNR is 15.
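
The noise injection can be sketched as follows, assuming that the SNR values are expressed in decibels and that the noise power is set relative to the measured power of each signal; neither detail is stated in the text, and the function name is hypothetical.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the signal-to-noise ratio of the
    result is approximately snr_db (in decibels, an assumed unit)."""
    rng = rng or random.Random(0)
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


# A synthetic 1,024-point sinusoid stands in for a vibration sample.
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(1024)]
for snr in (5, 10, 15):
    noisy = add_awgn(clean, snr)
    noise_power = sum((a - b) ** 2 for a, b in zip(noisy, clean)) / len(clean)
    signal_power = sum(s * s for s in clean) / len(clean)
    print(snr, round(10 * math.log10(signal_power / noise_power), 2))
```

A lower SNR value corresponds to a larger noise variance, so the setting with an SNR of 5 is the hardest of the three conditions.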

5.3.5. Comparison of Classification Time

We analyzed the classification times of our proposed framework (FT + NCC + H) across different incremental phases; we included the times of FT + H (CNN trained incrementally) as reference. For this experiment, 6 and 8 incremental phases were used for AMCF and CWRU, respectively, while 5 exemplars from each learned class were stored in memory. Table 8 shows the classification times for the evaluated solutions. The times increased when new classes were added to the 1D CNN classifier. Also, we observed that the times of FT + NCC + H did not significantly increase with respect to FT + H (CNN as classifier). This is because NCC uses a reduced number of exemplars from past and current faults as training set.

6. Discussion

In experiments, we evaluated and validated the proposed MCIL framework on two different cases of study for motor fault detection in IMs. The evaluation was performed under scenarios where data from new faults are streamed in different time phases. From the results, we found that the MCIL framework allows the incorporation and detection of past and new motor faults from vibration signals with high accuracy across different incremental phases. Unlike previous works [14, 15], one or more faults can be added to the 1D CNN model in each incremental phase. Notice that computational requirements and memory should be bounded. In this sense, the proposed MCIL framework maintains a constant complexity while a few samples from past faults are retained. To the best of our knowledge, this is the first work that studies MCIL, based on a deep learning approach, for fault diagnosis in IMs from vibration signals.

From the ablation studies, we observed that a neural network model tends to forget previously learned faults. This problem is known as catastrophic forgetting, which is produced by incorporating new faults into a pretrained model in a sequential way. In this direction, we found that the fine-tuning procedure with a memory of exemplars and the NCC classifier provides an effective solution to tackle the catastrophic forgetting problem [16] for fault diagnosis in IMs. As expected, the average accuracies of the evaluated solutions improved significantly when the number of retained exemplars in memory increased. Notice that results on AMCF showed that at least 5 exemplars per fault are required across 6 incremental phases to achieve an accuracy competitive with training on full data. Also, we found that at least 3 exemplars were required across 8 incremental phases to obtain a performance similar to that obtained using all data on CWRU. Notice that this amount of stored exemplars per fault represents approximately 1% of the size of the training set. Moreover, AMCF results showed that a greater number of incremental phases does not negatively impact the accuracy of the 1D CNN model, whereas CWRU results showed that a greater number of incremental phases negatively impacts the MCIL model's accuracy. Concerning exemplar selection, we found that the Herding method slightly improved the accuracy results over random selection when a few exemplars are retained, but similar or even worse results were obtained in other cases. Regarding noise conditions, we found that FT + NCC + H provides robustness to disturbances in signals, outperforming the full model in accuracy for low SNR values. In particular, we found that NCC helps to cope with such disturbances in signals. Finally, we found that NCC does not increase the classification time because a reduced number of samples is used as training set.

7. Conclusions

This study presents an MCIL framework based on fine-tuning with a memory of exemplars and the nearest centroid classifier (NCC) over a 1D convolutional neural network (CNN), to incorporate new motor faults from vibration signals alongside already known ones. Specifically, the 1D CNN is fine-tuned over samples from new faults and exemplars from known (past) faults, whereas NCC is used during the testing phase to classify samples from past and new faults. The proposed framework was evaluated over two datasets for motor fault diagnosis: AMCF and CWRU. Different experimental scenarios were considered, including different numbers of incremental phases and stored exemplars. Experiments showed that the proposed framework achieved an accuracy beyond 93% and 94% on AMCF and CWRU, retaining one exemplar per fault and varying the number of incremental phases. We found that 5 and 3 exemplars per fault across 6 and 8 phases on AMCF and CWRU, respectively, are required to achieve an accuracy competitive with training on full data (98.32% vs. 99.38% and 99% vs. 100.00%). These results suggest that the catastrophic forgetting problem can be reduced by the proposed framework on AMCF and CWRU. Another interesting finding is that NCC may help to obtain a robust classifier when noise is present in the data. Using the proposed framework, we showed that a classifier based on a deep learning model may be trained incrementally, achieving satisfactory diagnosis results for fault detection in IMs while maintaining a constant model complexity. As future work, we are interested in developing an end-to-end MCIL framework, where the feature extractor and the classifier can be trained jointly. Likewise, we plan to extend our study to the diagnosis of incipient and electrical faults.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflicts of interest with respect to the research, authorship, and/or publication of this article.

Acknowledgments

This research was financially supported by National Institute For Astrophysics, Optics, and Electronics.