#### Abstract

The performance of bearing fault detection systems based on machine learning techniques largely depends on the selected features. Hence, selecting an ideal number of dominant features from a comprehensive list is needed to decrease the number of computations involved in fault detection. In this paper, we investigate, for the first time, the statistical time-domain features Hjorth parameters (activity, mobility, and complexity) and normal negative log likelihood for a Gaussian mixture model (GMM), in addition to 26 other established statistical features, for identification of bearing fault type and severity. Two datasets are derived from the publicly available Case Western Reserve University database to assess the capability of the features in fault identification under various fault sizes and motor loads. The features have been investigated using a two-step approach: filter-based ranking with 3 metrics followed by feature subset selection with 11 search techniques. The results indicate that the feature set comprising root mean square, geometric mean, zero crossing rate, Hjorth parameter mobility, and normal negative log likelihood for GMM outperforms the other features. We also compare the diagnostic performance of normal negative log likelihood for GMM with that of the established feature normal negative log likelihood for a single Gaussian. The selected set of statistical features is validated using ensemble rule-based classifiers and achieves an average accuracy of 96.75% with the proposed statistical feature subset and 99.63% with all 30 features. *F*-measure and *G*-mean scores are also calculated to assess performance on datasets with class imbalance. The diagnostic effectiveness of the features is further validated on a bearing dataset obtained from an operating thermal power plant.
The results show that our newly proposed feature subset plays a major role in achieving good classification results and has future potential for use in high-dimensional datasets with multidomain features.

#### 1. Introduction

Vibration-based diagnosis and prognosis of bearing faults has been an active area of research for many years [1–3]. Machine learning algorithms have been effectively used in bearing fault detection systems. The performance of such algorithms, in terms of accuracy and efficiency, largely depends on the number of features used and their discriminating capability.

Several types of statistical features have been used so far. Zeng et al. [4] used 12 conventional time-domain and 10 frequency-domain features for their classification approach. The conventional statistical parameters, namely, mean value, maximum value, root mean square value, peak-to-peak value, root, standard deviation, variance, kurtosis, and skewness, are the most commonly used dimensional parameters [5–8]. A wide set of dimensionless statistical parameters, such as crest factor, shape factor, impulse factor, clearance factor, skewness factor, and kurtosis factor, has also been used to enhance the performance of fault diagnosis systems [9–12]. These parameters are extracted from either raw or preprocessed vibration signals. For the latter, empirical mode decomposition (EMD) has been a popular preprocessing technique [13]. It is important to note that not all the constituent signals generated by EMD, called intrinsic mode functions (IMFs), carry fault information. Different criteria, such as the correlation coefficient [14] and kurtosis [4], have been used to select significant IMFs, which has been shown to be important for designing an efficient fault diagnosis system [15, 16].

A major boost in bearing fault diagnosis research came when novel statistical features began to be applied in artificial intelligence techniques. Sreejith et al. [17] proposed a pattern recognition method that extracts the normal negative log likelihood and kurtosis values of time-domain signals and uses them as input to an artificial neural network. Abbasion et al. [18] used the Weibull negative log-likelihood function of denoised signals and a support vector machine (SVM) classifier for fault classification of rolling bearings, while zero crossings, mean absolute deviation, and histogram features were applied in [19–21], respectively. The application of a new group of statistical parameters, called "Hjorth parameters," was explored for diagnosing 4 types of faults in rolling element ball bearings [22]. Lastly, different kinds of entropies have also been used for intelligent fault diagnosis in various research studies [14, 23–25]. Some recent papers have developed symmetric cross entropy measures of single-valued neutrosophic sets for fault diagnosis in bearings [26, 27].

Researchers were thus able to utilise the power of statistical features for capturing important fault information by applying popular supervised machine learning techniques. The potential of one such machine learning technique—rule-based classifiers—in the domain of bearing fault diagnosis has also been shown to be promising. An ensemble of rule-based classifiers for fault diagnosis of rotating machinery was proposed by Dou et al. [28] to predict potential faults and subsequent breakdown of rotating machinery by using 6 time-domain and 5 frequency-domain features. Application of rule-based classifiers has also been successfully extended for severity detection. Li et al. [29] proposed an association rule mining method based on equal probability discretization to perform a 10-class prediction on select time-domain and frequency-domain features for fault classification and severity prediction.

Although all of the studies cited above have applied various time-domain parameters, whether conventional or novel and dimensional or dimensionless, for bearing fault diagnosis, either exclusively or in addition to other frequency-domain and time-frequency features, none of them discusses the reasoning behind the choice of a particular set of statistical features. The majority of previous studies have focused on optimising fault diagnosis classifiers, and few have compared established features with novel ones. Since fault diagnostic accuracy depends on the choice of both features and classifier, we believe it is important to give equal emphasis to features by proposing novel statistical features and by developing a strategy for an extensive analysis of new and established statistical features, so as to derive an optimal set of fault-sensitive features.

In view of this, this paper applies the statistical time-domain features Hjorth parameters (activity, mobility, and complexity) and normal negative log likelihood for a Gaussian mixture model (GMM) for the first time to rolling element bearing fault type and severity diagnosis. These features were added to 26 already established statistical features to investigate a comprehensive set of 30 time-domain statistical features. The features were extracted from intrinsic mode functions of bearing vibration signals obtained using EMD. An in-depth two-step model-agnostic approach, based on filter-based ranking with 3 metrics and feature subset selection with 11 search techniques, including both conventional search techniques and the newly introduced swarm search techniques, has been developed to select an optimal feature subset for bearing fault type detection and severity estimation. To validate the performance of the selected features, the rule-based classifiers PART and JRIP and their bagging and boosting ensembles were applied. The paper also examines the selected parameters and further analyses the effect on diagnostic performance of replacing the highly ranked parameter normal negative log likelihood for a single Gaussian with normal negative log likelihood for GMM. The Case Western Reserve University (CWRU) Bearing Dataset [30] has been used for the experiments. The performance of the selected subset of features has also been tested on a dataset obtained from an operating power plant for fault type detection.
The main contribution of this paper is twofold: (1) proposing Hjorth parameters and normal negative log likelihood for Gaussian mixture model (GMM) for rolling element ball bearing fault detection and severity estimation and (2) devising a strategy to probe a comprehensive set of statistical time-domain features with the objective of identifying optimal feature subset for rolling element ball bearing fault detection and severity estimation.

The paper has been organised as follows: Section 2 describes the statistical features and filter-based feature ranking and subset selection methods. Section 3 explains the steps of our proposed methodology. Section 4 discusses in detail data preparation steps, implementation and results of feature selection, fault type detection, and fault severity detection. This section also discusses the replacement of normal negative log likelihood for single Gaussian with its Gaussian mixture model variant as well as the implementation of our methodology on a dataset obtained from an operating power plant. Finally, Section 5 discusses the drawn conclusions and future scope of our research.

#### 2. Theoretical Background

##### 2.1. Extraction of Statistical Features

Table 1 shows the list of statistical features extracted from the vibration signals. In the table, T1, T2, …, T30 are feature notations, $x_i$ ($i = 1, \dots, N$) are the individual data points of the sampled vibration signal, and *N* is the length of the vibration signal. The dimensional statistical parameters T1 to T11, T18, and T19 and the dimensionless statistical parameters T12 to T17 are self-explanatory from their formulae. T20 and T21 are the arithmetic means of the absolute deviations of each $x_i$ from the mean and median, respectively. T22 is the rate of zero crossing (ZCR), defined as the ratio of the number of zero crossings to the total number of points in the signal. In other words, ZCR is the rate at which a signal changes sign from negative to positive or vice versa; it gives a good indication of the frequency content of the signal and is hence very useful for separating signals with different frequencies. T23, or entropy, is a measure of the randomness of the values of a vibration signal. The two histogram features T24 and T25 are the histogram upper and lower bounds, respectively. The next three statistical parameters are the Hjorth parameters. T26, or activity, of the vibration signal is the same as the variance of the signal. T27, or mobility, is defined as the ratio of the standard deviation of the first derivative of the vibration signal to the standard deviation of the vibration signal itself. T28, or complexity, is defined as the ratio of the mobility of the first derivative of the vibration signal to the mobility of the vibration signal. The next statistic, T29, or Weibull negative log likelihood, is computed from the Weibull probability density function with scale $a$ and shape $b$ as follows:

$$T_{29} = -\sum_{i=1}^{N}\ln f\left(x_i \mid a, b\right),\qquad f\left(x \mid a, b\right)=\frac{b}{a}\left(\frac{x}{a}\right)^{b-1}e^{-\left(x/a\right)^{b}}$$

Finally, T30, or normal negative log likelihood for a single Gaussian with mean $\mu$ and standard deviation $\sigma$ fitted to the signal, can be computed as follows:

$$T_{30} = -\sum_{i=1}^{N}\ln\left[\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\left(x_i-\mu\right)^{2}}{2\sigma^{2}}\right)\right]$$
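As an illustration (and only a sketch: the authors used MATLAB scripts, so the estimator conventions below are assumptions, not their code), the Hjorth parameters, ZCR, and single-Gaussian negative log likelihood can be computed with NumPy as follows:

```python
import numpy as np

def hjorth_parameters(x):
    """Hjorth activity (T26), mobility (T27), and complexity (T28) of a 1-D signal."""
    dx = np.diff(x)    # first derivative via finite differences
    ddx = np.diff(dx)  # second derivative
    activity = np.var(x)                                # variance of the signal
    mobility = np.std(dx) / np.std(x)                   # std(x') / std(x)
    complexity = (np.std(ddx) / np.std(dx)) / mobility  # mobility(x') / mobility(x)
    return activity, mobility, complexity

def zero_crossing_rate(x):
    """T22: number of sign changes divided by the signal length."""
    return np.count_nonzero(np.diff(np.sign(x))) / len(x)

def normal_neg_log_likelihood(x):
    """T30: negative log likelihood of the samples under a fitted single Gaussian."""
    mu, sigma = np.mean(x), np.std(x)
    log_pdf = -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
    return -np.sum(log_pdf)
```

The GMM variant of T30 can be obtained analogously by fitting a mixture of Gaussians to the signal (e.g., with scikit-learn's `GaussianMixture`) and summing the negative log likelihoods of the samples.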

##### 2.2. Filter-Based Feature Selection Methods

In this paper, we have used filter-based methods for identifying important features for bearing fault diagnosis. Filter-based techniques work well with any classification model producing accurate results with shorter run times [31]. Moreover, these techniques perform well for the identification of dominant features for both fault type and severity detection [32].

###### 2.2.1. Feature Ranking

Three filter-based ranking metrics have been used to rank the features: chi-squared, information gain, and gain ratio. In chi-squared feature selection, the chi-square statistic ($\chi^2$) is calculated between each feature and the response variable, and the features with high chi-squared values are selected. With $c$ degrees of freedom, observed values $O_i$, and expected values $E_i$, the chi-squared value is calculated as follows:

$$\chi^{2}_{c} = \sum_{i}\frac{\left(O_i - E_i\right)^{2}}{E_i}$$
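As a minimal sketch of this metric (not Weka's implementation), the chi-squared score of a discretized feature against the class labels can be computed from the observed and expected contingency counts:

```python
import numpy as np

def chi_squared_score(feature_bins, labels):
    """Chi-squared statistic between a discretized feature and the class labels."""
    feature_bins, labels = np.asarray(feature_bins), np.asarray(labels)
    bins, classes = np.unique(feature_bins), np.unique(labels)
    # Observed contingency table: rows = feature bins, columns = classes.
    observed = np.array([[np.sum((feature_bins == b) & (labels == c))
                          for c in classes] for b in bins], dtype=float)
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row @ col / observed.sum()   # counts expected under independence
    return np.sum((observed - expected) ** 2 / expected)
```

A feature that perfectly separates the classes yields a high score, while a feature independent of the class yields a score near zero.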

The information gain test computes, for each feature (*L*), an information gain score in the context of the response variable (*R*), ultimately selecting features with higher scores. The higher the information gain score of an attribute, the more information the attribute contributes to the response variable. Information gain can be understood as the reduction in entropy (*H*) after conditioning on the feature and is calculated as follows:

$$IG(R, L) = H(R) - H(R \mid L),\qquad H(R) = -\sum_{k=1}^{K} \frac{m_k}{m}\log_2\frac{m_k}{m}$$

where $m$ is the total number of instances and $m_k$ is the number of instances belonging to class $k$, with $k = 1, \dots, K$.

The third metric, gain ratio, is the ratio of information gain to the intrinsic information of a split:

$$GainRatio(A) = \frac{IG(S, A)}{IntrinsicInfo(A)},\qquad IntrinsicInfo(A) = -\sum_{v \in V(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$$

where $A$ is the candidate attribute, $V(A)$ is the set of possible values of $A$, $S$ is the set of examples, and $S_v$ is the subset of $S$ for which $A = v$. In this method, the attribute with the highest gain ratio is selected as the splitting attribute.
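The two entropy-based metrics can be sketched directly from their definitions; this toy version (an illustration, not Weka's code) assumes the feature has already been discretized into bins:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy H (base 2) of a sequence of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature_bins, labels):
    """Reduction in class entropy after splitting on a discretized feature."""
    feature_bins, labels = np.asarray(feature_bins), np.asarray(labels)
    n = len(labels)
    conditional = sum((np.sum(feature_bins == v) / n)
                      * entropy(labels[feature_bins == v])
                      for v in np.unique(feature_bins))
    return entropy(labels) - conditional

def gain_ratio(feature_bins, labels):
    """Information gain normalised by the split's intrinsic information."""
    intrinsic = entropy(feature_bins)
    return info_gain(feature_bins, labels) / intrinsic if intrinsic > 0 else 0.0
```

For a binary feature that perfectly separates two equally sized classes, both scores reach their maximum of 1 bit.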

###### 2.2.2. Feature Subset Selection

The CfsSubsetEval evaluator returns a subset of features that show high correlation with the ground truth labels but are only weakly correlated with each other. In this research, we have applied both conventional search techniques and the newly introduced swarm search techniques [33]. Best first search is a conventional greedy search technique that finds subsets by backtracking and offers the flexibility of searching forward, backward, or in both directions. Greedy stepwise search is similar to best first search in that it also searches for subsets in a greedy manner, but instead of backtracking, it searches in a stepwise fashion and stops when the evaluation decreases after an addition or deletion of an attribute.
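CfsSubsetEval scores candidate subsets with Hall's CFS merit heuristic, which rewards feature-class correlation and penalises inter-feature redundancy. A minimal sketch (simplified here to take precomputed average correlations as inputs, which is an assumption about the interface, not Weka's actual API):

```python
import numpy as np

def cfs_merit(feature_class_corrs, avg_feature_feature_corr):
    """CFS merit of a feature subset: k * r_cf / sqrt(k + k*(k-1)*r_ff),
    where r_cf is the mean feature-class correlation and r_ff the mean
    correlation between pairs of features in the subset."""
    k = len(feature_class_corrs)
    r_cf = np.mean(np.abs(feature_class_corrs))   # relevance term
    r_ff = avg_feature_feature_corr               # redundancy term
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```

The search techniques described below simply explore the space of subsets looking for a high value of this merit.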

Swarm search methods are nature-inspired, metaheuristic optimization algorithms. Ant search [34] is a stochastic combinatorial optimization technique modelled on how ant colonies work; it tends to find good solutions early in the search process. Bat search [35] is based on how bats determine object locations using sound waves. Bee search [36] is inspired by the foraging behaviour of honeybees; it solves optimization problems by combining exploitative neighbourhood search with random explorative search. The cuckoo search [37] algorithm models the parasitic behaviour of cuckoos and the Lévy flight behaviour of fruit flies, while elephant search [38] models the behavioural characteristics of elephant herds. Firefly search [39] is inspired by the flashing patterns of fireflies and offers the advantages of automatic subdivision and the ability to deal with multimodality. Flower search [40] follows the survival-of-the-fittest rule based on the pollination process of flowers. Wolf search [41] models how wolves hunt for prey and survive by avoiding their enemies; it is known to give good results in large search spaces. Rhinoceros search [42] models the herding behaviour of rhinos and can be used for solving global continuous optimization problems.

##### 2.3. Rule-Based Classifiers

Rule-based classifiers classify using “if... then... else” rules. If the set of feature values in a test sample conforms to a set of rules, then the sample is classified as belonging to the class associated with that set of rules. The classification rules can be built by directly extracting them either from the data (direct method) or from other classification models (indirect method). These classifiers are very expressive in their results as the rules learned by them can help understand the underlying mechanics of the classification. Some of the rules generated by one of our classifiers are discussed in Section 4.3.

We have applied two rule-based classifiers, PART and JRIP, for bearing fault diagnosis. PART [43] is an efficient and accurate rule-based algorithm based on partial decision trees. It builds rules recursively using the "separate-and-conquer" strategy: it forms a rule, removes the instances that the rule covers, and creates a new rule for the remaining instances, repeating these steps until no samples remain. JRIP [44], an implementation of the RIPPER (Repeated Incremental Pruning to Produce Error Reduction) algorithm, builds association rules using reduced error pruning (REP). First, a rule is grown by adding conjuncts that improve information gain. Then, the rules in the set are pruned using incremental REP. This is repeated until the description-length stopping criterion is met, after which the rule set is optimized.
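To make the rule-list form concrete, a hand-written classifier in this style might look as follows (all features and thresholds are hypothetical, purely for illustration; they are not rules learned from the CWRU data):

```python
def classify(sample):
    """A toy 'if ... then ... else' rule list in the PART/JRIP style."""
    t2, t22 = sample["T2"], sample["T22"]       # e.g., RMS and zero crossing rate
    if t2 <= 0.15 and t22 >= 0.40:  # rule 1: low RMS, frequent zero crossings
        return "normal"
    if t22 >= 0.55:                 # rule 2: covers remaining high-ZCR samples
        return "ball"
    if t2 >= 0.60:                  # rule 3: high RMS among what is left
        return "inner race"
    return "outer race"             # default rule for uncovered samples
```

Separate-and-conquer learners produce exactly this shape: each rule covers and removes a block of instances, and a default rule catches the rest.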

#### 3. Methodology

The flowchart illustrating the steps of our proposed methodology is shown in Figure 1. First, bearing vibration signals are segmented into nonoverlapping signals of 1024 points each. Empirical mode decomposition is applied to these signals to obtain their intrinsic mode functions. Second, each parent signal is replaced, for further analysis, by the IMF having the highest Pearson correlation with it (Table 2); we refer to these as representative IMFs. Third, 30 statistical time-domain features (Table 1) are extracted from these IMFs. Fourth, an average ranking of all the extracted features is obtained using filter-based ranking methods. Fifth, filter-based feature subset selection methods are used to find subsets of features having low intracorrelation and high correlation with the truth labels. The results of the fourth and fifth steps are analysed together to obtain a set of significant features. Finally, the selected feature subset is used to train rule-based classifiers and their ensembles to perform classification for both fault type detection (Table 3) and fault severity detection (Table 4). The two classification models are also trained on a modified feature subset, formed by replacing one of the features, normal negative log likelihood for a single Gaussian (T30), with its Gaussian mixture model counterpart (T30′). The T30′ feature values have been computed for each vibration signal. The results show a slight improvement in classifier performance (Section 4.7, Exp. 6). The trained models from our pipeline are also tested on a bearing dataset obtained from a water pump at an operational power plant (Section 4.8, Exp. 7).
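The representative-IMF step can be sketched as follows (assuming the IMFs have already been computed by an EMD routine; this NumPy version is an illustration, not the authors' MATLAB script):

```python
import numpy as np

def representative_imf(signal, imfs):
    """Return the IMF with the highest Pearson correlation to the parent
    signal, together with that correlation coefficient."""
    corrs = [np.corrcoef(signal, imf)[0, 1] for imf in imfs]
    best = int(np.argmax(corrs))
    return imfs[best], corrs[best]
```

The chosen IMF then stands in for its parent signal during feature extraction.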

In this paper, EMD, representative IMF selection, and feature extraction were performed using scripts written in MATLAB R2017a. Feature ranking and subset selection were conducted with the open-source tool Waikato Environment for Knowledge Analysis (Weka) [45], in which all the rule-based classifiers and their ensembles were also trained and tested. The MATLAB scripts and models were executed on a machine with 8 GB RAM and an Intel Core i5-7200U CPU at 2.50 GHz.

#### 4. Experimental Results and Discussion

##### 4.1. Dataset Preparation

In this paper, we have used the Case Western Reserve University Bearing Dataset. The setup (Figure 2) consists of a 2 hp electric motor, transducer/encoder, dynamometer, and accelerometers. Data for normal bearings were recorded using the deep groove ball bearing SKF6205-2RS JEM with dimensions 25 mm × 52 mm × 15 mm at drive end for motor loads of 0, 1, 2, and 3 horsepower (motor speeds of 1797, 1772, 1750, and 1730 rpm, respectively). Electrodischarge machining was used to induce faults of diameters: 0.007 inches, 0.014 inches, and 0.021 inches in the SKF6205 bearing and 0.028 inches in its NTN equivalent. After reinstalling these faulted bearings into the test motor, vibration data were recorded for the same motor loads. Vibration data were collected using a 16-channel DAT recorder for single-point drive-end defects at 12,000 samples/second.

The data consist of signals corresponding to four cases: outer raceway defect, inner raceway defect, rolling element (ball) defect, and normal. The outer raceway fault was located at the 6 o'clock position (orthogonal to the load zone). Data samples of 1024 points were extracted from these vibration signals in a linear nonoverlapping fashion, i.e., for each signal, the first 1024 points were extracted as its first data sample, the next 1024 points as its second data sample, and so on. A total of 17060 samples, consisting of 3314 normal, 4735 inner race, 4272 outer race, and 4739 ball data samples, were extracted. Each of the resulting 17060 signals of 1024 data points was decomposed by EMD, yielding 12–17 IMFs along with a residue. The IMF having the highest Pearson correlation with the parent signal was taken as its representative signal since it retains most of the information of the original signal (Figure 3). The correlation coefficient was computed as follows:

$$\rho_{XY} = \frac{E\left[\left(X-\mu_X\right)\left(Y-\mu_Y\right)\right]}{\sigma_X \sigma_Y}$$

where $E$ is the expectation, $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, and $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$, respectively. The correlation coefficient values obtained for two sample vibration signals and their IMFs are shown in Table 2. The representative signals of all the parent signals were then used to extract the 30 statistical time-domain features listed in Table 1. Min-max normalization was applied to scale the features, transforming the values of every feature into the interval [0, 1] (both ends inclusive) by applying the following equation:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
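The segmentation and scaling steps above can be sketched in a few lines (an illustrative NumPy version; window length and column-wise scaling follow the description in the text):

```python
import numpy as np

def segment(signal, length=1024):
    """Split a long vibration signal into non-overlapping windows of
    `length` points, discarding any leftover tail."""
    n = len(signal) // length
    return signal[: n * length].reshape(n, length)

def min_max_normalise(column):
    """Scale one feature column into [0, 1], both ends inclusive."""
    lo, hi = column.min(), column.max()
    return (column - lo) / (hi - lo)
```

Each row of `segment`'s output is one 1024-point data sample; normalisation is applied per feature column after extraction.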

Two datasets were prepared: dataset A for fault type detection and dataset B for fault severity detection. Dataset A has the data labelled into 4 classes (outer raceway defect, inner raceway defect, ball bearing defect, and normal), whereas dataset B has the data labelled into 12 classes defined by fault type and fault size (in inches). Both datasets were divided into train and test sets using an 80:20 split. The numbers of samples in the train and test sets for datasets A and B are shown in Tables 3 and 4, respectively.

##### 4.2. Exp. 1: Feature Ranking and Feature Subset Selection

Three feature ranking techniques, based on the statistical metrics chi-squared, info gain, and gain ratio, were applied to the data. These techniques ranked the features according to their importance in classifying the samples into 4 classes for dataset A and 12 classes for dataset B. The threshold for the search method used by the feature ranking techniques was set so that none of the features was discarded. The rankings of the features are shown in Tables 5 and 6. The 3rd, 4th, and 5th columns list the ranks obtained with the chi-squared, info gain, and gain ratio rankers, respectively, for the feature in the corresponding row. The last column shows the average rank score, calculated by averaging each feature's ranks over the three ranking metrics.

The evaluator named CfsSubsetEval was used for further validation. This evaluator gives a subset of features which show high correlation with ground truth labels but low correlation to each other. 11 search techniques, including both conventional search techniques (best first search and greedy stepwise search) and the newly introduced swarm search techniques, were performed. Tables 7 and 8 show the results obtained with these techniques for the two datasets.

For implementing best first search, bidirectional search was used, while in greedy stepwise, forward search was used. The number of particles in the swarm, chaotic coefficient, and mutation probability were set as 20, 4.0, and 0.01, respectively, for all searches. In bee search, the discount factor for crossover was set as 0.8, and in cuckoo search, the sigma rate was kept as 0.69. For firefly search, zero distance attractiveness for the firefly population was set as 0.33. The same value was taken for wolf search. In flower search, a pollination rate of 0.33 was used.

From Table 7, it is observed that the subset of root mean square (T2), geometric mean (T18), rate of zero crossing (T22), Hjorth parameter mobility (T27), and normal negative log likelihood for a single Gaussian (T30) is selected by the majority of search techniques. These 5 features are also among the top 10 average rank scorers, as highlighted in Table 5, validating their importance for fault type diagnosis. It can be noted that the features standard deviation (T6), RSSQ (T19), and Hjorth parameter activity (T26) are ranked well by the ranking metrics but are selected by only a few subset search techniques. This follows from the way subset selection works: since one criterion for a good subset is that its elements should be only weakly correlated with one another, features such as T6, T19, and T26 were rejected by most of the search techniques, likely because they correlate with an already chosen member of the subset. Since our aim is to select the best performing subset for fault diagnosis, we selected the set of only the 5 features stated above.

For fault severity detection, from Table 8, we observe that the subset of features—root mean square (T2), RSSQ (T19), kurtosis factor (T17), geometric mean (T18), rate of zero crossing (T22), Hjorth parameter—mobility (T27), normal negative log likelihood for single Gaussian (T30), and 75th percentile (T9)—is selected by majority of search techniques, where these 8 features are also among the top 10 average rank scorers in Table 6.

##### 4.3. Exp. 2: Fault Type Detection with Feature Selection

Rule-based classifiers, PART and JRIP, were used with the ensemble techniques boosting (multiboost [46] in our case) and bagging for performing fault type detection of the 4 classes: normal, inner race, outer race, and ball bearing. The ensemble techniques use a single learning algorithm to train multiple models on the dataset and then average these models to obtain a final prediction. While bagging yields an ensemble model with lower variance by assigning equal weights to the individual models during averaging, boosting reduces both variance and bias. Both help increase the stability of the models.

For training PART, the pruning confidence factor was set to 0.25, while for JRIP, the minimum number of instances per rule was set to 2 and two optimization runs were performed. These classifiers were trained on the reduced dataset A obtained by retaining only the 5 features found to be significant, as discussed in the previous section. The data augmentation achieved by the initial splitting of vibration signals into 1024-point segments resulted in a large enough dataset, removing the need for cross-validation while training the classifiers. For implementing bagging and multiboost, 10 iterations were performed with a batch size of 100.

Table 9 reports the performance of the classifiers in terms of accuracy obtained on the train and test datasets. Two cases were considered:

(a) The selected feature set contains T2, T18, T22, T27, and T30, corresponding to column 1 of Table 9.

(b) T30 is replaced with its Gaussian mixture model equivalent T30′, the feature set in this case being T2, T18, T22, T27, and T30′, corresponding to column 2 of Table 9.

This section discusses the results for case (a); the results for case (b) are discussed later in Section 4.7. It is observed that PART used with multiboost outperforms the other 5 classifiers and gives the best performance, with an accuracy of 96.51% on the test dataset and 100% on the train dataset. The performance of the features is further studied by inspecting the rules generated from them. Rules are propositional formulae consisting of one or more atomic formulae (enclosed in parentheses as shown below) joined together using the logical connectives AND (∧) and IMPLIES (⇒). The number at the end of a rule indicates the number of instances correctly classified by that rule. PART with multiboost generates a total of 87 rules from the 5 important features. With just this small number of rules, the entire training data of 13648 samples are fully and accurately classified. Some of the rules generated are listed as follows:

From the above rules, it can be observed that a total of 1635 out of 2649 instances were correctly classified as normal using just one rule, R4, generated from the features T2 and T22. Each of the three remaining rules also classifies a significant portion of the samples in the dataset, showing the good performance of the features forming these rules. The rules are also very simple in structure; for instance, R2, constructed from just 2 atomic formulae, correctly classifies 375 ball bearing instances. The ability to easily distribute the samples into the 4 classes using simple rules shows the efficiency of the selected time-domain features.

Since the number of samples belonging to the normal class is 2649, which is less than the number of samples in each of the faulty classes (around 3500 each), our dataset is slightly class unbalanced. Evaluating the performance of the rule-based classifiers on accuracy alone may therefore not be comprehensive, so their performance on the test dataset has been further studied using 2 other metrics: *F*-measure and geometric mean (*G*-mean). The per-class *F*-measure is the harmonic mean of precision and sensitivity; *G*-mean is the geometric mean of specificity and sensitivity. Table 10 reports the performance of the classifiers on dataset A using *F*-measure and *G*-mean scores. The scores for all the classifiers are high, and the PART + multiboost classifier in particular outperforms the others. Hence, our rule-based classifiers classify our slightly unbalanced dataset very well, in turn showing the good performance of the selected features.
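Both metrics follow directly from one-vs-rest confusion counts; a small sketch:

```python
def f_measure_and_g_mean(tp, fp, fn, tn):
    """Per-class F-measure and G-mean from one-vs-rest confusion counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)    # recall / true positive rate
    specificity = tn / (tn + fp)    # true negative rate
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = (sensitivity * specificity) ** 0.5
    return f_measure, g_mean
```

Unlike raw accuracy, *G*-mean collapses towards zero if either the sensitivity or the specificity of a class is poor, which is why it is informative on unbalanced data.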

To further validate the good performance of the selected features in fault type detection, confusion matrices were built. Figure 4(a) shows the confusion matrix obtained on the test dataset with the T30 parameter for PART with multiboost. The normal instances were accurately classified (with only 2 misclassifications), further confirming that they are well classified by our 5 important features even though the normal class is slightly underrepresented. Another important observation is that most of the misclassifications involve the ball bearing class: 22 instances each of inner race and outer race were misclassified as ball bearing, and a total of 56 ball bearing instances were incorrectly classified. To validate these observations, 3 features, T22, T27, and T30, were plotted pairwise using scatter plots, as shown in Figure 5. For the normal class, the plots show mostly nonoverlapping clusters, thereby confirming that the selected features play a very good role in classifying fault vs. no fault. The plots also show considerable overlap between the ball bearing, inner race, and outer race clusters, confirming our observation about the ball bearing class. This indicates diversity in the ball bearing instances that is not captured by our features; hence, features that can accurately distinguish this class from the other classes are needed. A modification of one of the features, resulting in better classification of ball bearing instances, is discussed in Section 4.7.


##### 4.4. Exp. 3: Fault Type Detection without Feature Selection

Table 11 reports the performance of the classifiers in fault type detection in terms of accuracy obtained on dataset A without feature selection. These classifiers achieve higher accuracies than their feature-selected counterparts. Again, in terms of overall performance, PART + multiboost outperforms the other 5 classifiers, giving the highest train and test accuracies of 100% and 99.13%, respectively. Comparing these with the train and test accuracies of 100% and 96.51% obtained with feature selection, we observe that, on the train data, the selected features alone were sufficient for attaining perfect accuracy, while on the test data too, a major contribution to the high accuracy can be credited to the selected features.

##### 4.5. Exp. 4: Fault Severity Detection with Feature Selection

This section addresses fault severity detection by performing a 12-class classification with the rule-based classifiers on the eight selected features obtained for dataset B. The 12 labels used for severity detection are shown in Table 4, where the severity of a fault is characterised by its size in inches. Table 12 shows the results obtained, wherein, once again, PART used with multiboost outperforms the other classifiers and gives the best accuracy of 100% on the train dataset and 97.92% on the test dataset. The results obtained with the parameter T30′ are discussed later in Section 4.7.

With few IR 028 and BB 028 samples and almost three times as many samples for the other severity labels, the dataset is class unbalanced; hence, we obtained the *F*-measure and *G*-mean values to further study the performance of the selected features. Table 13 reports the *F*-measure and *G*-mean values obtained with PART + multiboost for fault severity detection. We observe that both the underrepresented classes and the remaining classes are accurately classified. Hence, the features selected for fault severity detection perform well despite the class imbalance in the data.
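These two metrics can be sketched in Python as follows. This is a minimal illustration using scikit-learn; we assume the common definitions of macro-averaged *F*-measure and of *G*-mean as the geometric mean of per-class recalls, which may differ in detail from the averaging used in the paper's tables, and the function name is our own:

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

def f_measure_and_g_mean(y_true, y_pred):
    # Macro-averaged F-measure: unweighted mean of per-class F1 scores,
    # so minority classes count as much as majority ones.
    f = float(f1_score(y_true, y_pred, average="macro"))
    # G-mean: geometric mean of per-class recalls; a single poorly
    # recalled class drags the whole score down.
    recalls = recall_score(y_true, y_pred, average=None)
    g = float(np.prod(recalls) ** (1.0 / len(recalls)))
    return f, g
```

Both scores are insensitive to the class priors, which is why they are informative on class-unbalanced data where plain accuracy can be misleading.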

##### 4.6. Exp. 5: Fault Severity Detection without Feature Selection

In this experiment, we fed dataset B with all 30 statistical features to the rule-based classifiers to again study fault severity detection; the results are shown in Table 14. In comparison with Table 12, the highest train accuracy of 100% remains unaltered, while the test accuracy gains 0.76%. The perfect train accuracy and the small gain in test accuracy show that the selected eight features account for most of the discriminative power, confirming their large contribution towards the accurate classification results. This is in line with our observation for the selected features for fault type detection in Experiment 3.

##### 4.7. Exp. 6: Replacing the Normal Negative Log Likelihood for Single Gaussian Parameter with Its Gaussian Mixture Model Variant for Exp. 2–5

In Experiment 2, the feature T30, normal negative log likelihood for single Gaussian, appears in the set of important features for both fault type and fault severity detection. For instance, in the case of fault type detection, it occupies rank 3 in the list of average ranks given by the feature ranking methods and is selected by all 5 feature subset methods. This feature gives the negative log likelihood that a normal distribution of given mean and standard deviation fits the sample data, the sample data in our case being the representative IMF of the bearing vibration signal. However, on selecting some random samples from the set of representative IMF signals and plotting their frequency distribution histograms with fitted curves, shown in Figure 6, we found that fitting a single Gaussian curve does not work well for some of the signals. Some signals, like samples 1 and 2 in Figures 6(a) and 6(b), respectively, have more than one distinct maximum, and a single Gaussian curve cannot model them accurately. This trend was mostly seen for samples belonging to faulty classes. Hence, we replaced the T30 feature with its Gaussian mixture model variant, which we call T30′. The normal negative log likelihood for Gaussian mixture model (T30′) gives the negative log likelihood that a GMM of *K* means and *K* standard deviations fits the sample data. For samples $x_1, \ldots, x_N$, the normal negative log likelihood of a GMM can be defined as follows:

$$\mathrm{NLL} = -\sum_{i=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x_i \mid \mu_k, \sigma_k^2\right) \right),$$

where $\pi_k$ is the mixing coefficient of the *k*th Gaussian. Here, we instantiated *K* by the number of local maxima in the probability distribution of the sample data. Table 15 reports the normal negative log likelihood for GMM values obtained for the 3 samples shown in Figure 6. The value for Sample 1 is −466, which indicates a far better fit than its single-Gaussian counterpart of −143.347. The same holds for Sample 2. Sample 3 shows only a very slight change in value because it can already be fitted well with a single Gaussian.
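This feature can be sketched in Python as follows. This is a minimal illustration assuming scikit-learn and SciPy are available; the function name `gmm_nll` and the histogram-based peak count are our own illustrative choices, not the authors' implementation:

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.mixture import GaussianMixture

def gmm_nll(x, bins=50):
    """Negative log likelihood of a GMM fitted to a 1-D signal sample.

    The number of components K is instantiated from the number of
    local maxima in the sample's density-normalised histogram.
    """
    hist, _ = np.histogram(x, bins=bins, density=True)
    peaks, _ = find_peaks(hist)
    k = max(len(peaks), 1)  # fall back to a single Gaussian
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # score() is the mean log likelihood per sample; negate and rescale
    return -gmm.score(X) * len(X)
```

For a clearly multimodal signal, this value is substantially lower (i.e., a better fit) than the single-Gaussian negative log likelihood, mirroring the behaviour of Samples 1 and 2 reported in Table 15.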

*(Figure 6: frequency distribution histograms with fitted curves for Samples 1–3, panels (a)–(c).)*

Tables 9 and 11 show the results obtained for fault type detection when T30 was taken as one of the 5 selected features and when it was replaced by T30′. According to Table 9, PART with multiboost gives an accuracy of 100% on the train dataset for both T30 and T30′ and shows a slight jump in test accuracy from 96.51% with T30 to 96.75% with T30′. For the rest of the classifiers, we see a slight improvement in train and test accuracy with T30′ over T30. From the two confusion matrices in Figure 4, the number of true positives for the ball bearing class increased from 931 with T30 to 939 with T30′, accompanied by a decrease in ball bearing instances misclassified as inner race and outer race from 24 to 22 and from 28 to 22, respectively. This further confirms the better performance of T30′ as compared to T30. Similarly, Table 11 shows better accuracy values with the parameter T30′ than with T30 for all the classifiers, again demonstrating the effectiveness of T30′ over T30.

A similar trend is observed for fault severity detection. From Table 12, it is seen that on replacing T30 with T30′, the train accuracy remains at its peak of 100% while the test accuracy increases by 0.64% to 98.56%. T30′ also performs better than T30 in the case without feature selection, as can be seen from Table 14. A clearer picture is provided by the confusion matrices in Figure 7, where the misclassifications with T30 outnumber those with T30′. We believe that for bearing datasets in which a majority of vibration signals have multiple peaks, this feature can prove to be a key performer because of its ability to represent such signals well. Hence, the normal negative log likelihood of a GMM has the potential to be a more accurate and reasonable parameter for feature selection and bearing fault diagnosis than the normal negative log likelihood of a single Gaussian.

*(Figure 7: confusion matrices for fault severity detection, panels (a) and (b).)*

##### 4.8. Exp. 7: Testing Our Methodology on the Bearing Data of an Operating Power Plant

In this experiment, we used bearing data from the 500 MW Kosti Thermal Power Plant commissioned by Bharat Heavy Electricals Limited, India. The setup, shown in Figure 8, consists of a 7 hp vertical electric motor of a service water pump. Data for normal baseline bearings were recorded using the deep groove ball bearing SKF6303-2Z/C3 with dimensions 17 mm × 47 mm × 14 mm at the drive end, with a motor speed of 3000 rpm. Vibration data were collected at 12,000 samples/second, for 2 different timestamps, and for every timestamp in both the vertical and horizontal positions, leading to a total of 4 vibration signals. We performed the same data preprocessing on these samples as for the CWRU dataset, including splitting the signals into subsignals of length 1024, EMD, and feature extraction.
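The segmentation step of this preprocessing can be sketched as follows. This is a minimal Python version; the EMD decomposition and feature extraction stages, which would use an EMD library such as PyEMD, are omitted, and the function name is our own:

```python
import numpy as np

def segment_signal(signal, seg_len=1024):
    """Split a raw vibration signal into non-overlapping sub-signals
    of fixed length, discarding any trailing remainder. Each row would
    then be decomposed with EMD and reduced to statistical features."""
    signal = np.asarray(signal)
    n = len(signal) // seg_len
    return signal[: n * seg_len].reshape(n, seg_len)
```

At 12,000 samples/second, each 1024-sample sub-signal covers roughly 85 ms of vibration data, i.e., several shaft revolutions at 3000 rpm.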

Since the bearing configuration matches that used in the CWRU dataset, we mixed these test samples with the 20% test set separated from CWRU dataset A. The trained PART + multiboost model for fault type detection with feature selection, obtained in Exp. 2, was used to label instances as outer raceway defect, inner raceway defect, ball defect, or normal. The overall accuracy and average *F*-measure achieved were 96.75% and 0.972, respectively. Furthermore, per-instance classification results showed that all the normal samples taken from the power plant were correctly classified. Hence, the results were reasonably good. Mixing the power plant samples with the CWRU test samples ensured that the good accuracy was not due to any bias of the classifier towards the normal class, which also explains the good *F*-measure value. Furthermore, when T30 was replaced by T30′ for this dataset, we did not see any improvement in the accuracy and *F*-measure values. This might be due to the mostly single-Gaussian nature of the newly added normal vibration signals taken from the power plant.

The following points discuss and summarize the obtained results:

1. For the 4-class fault type detection problem, a subset of the features, namely, root mean square (T2), geometric mean (T18), rate of zero crossing (T22), Hjorth parameter mobility (T27), and normal negative log likelihood for Gaussian mixture model (T30′), performs well with rule-based classifiers, giving the maximum test accuracy of 96.75% obtained with PART + multiboost.
2. For the 12-class fault severity estimation problem, a subset of the features, namely, root mean square (T2), 75th percentile (T9), kurtosis factor (T17), geometric mean (T18), RSSQ (T19), rate of zero crossing (T22), Hjorth parameter mobility (T27), and normal negative log likelihood for single Gaussian (T30), performs very well, producing the maximum test accuracy of 98.56% with PART + multiboost.
3. An improvement in classification accuracy is observed when normal negative log likelihood for single Gaussian (T30) is replaced by its GMM variant (T30′). This is because some signals have more than one distinct maximum, which a single Gaussian curve cannot model accurately; fitting such signals with a Gaussian mixture model is therefore preferable.
4. The *F*-measure and *G*-mean scores for all the rule-based classifiers are significantly high, confirming the potential of the chosen features to accurately classify the unbalanced dataset.
5. On comparing the results with feature selection vs. without feature selection, the latter performs better for both fault diagnosis and fault severity estimation, giving 99.63% and 99.12% test accuracies, respectively. Despite this, the good accuracy values obtained with the selected features show that a major share of the high accuracy in the case without feature selection can be credited to the selected features. Hence, these selected features play a major role in the diagnostic ability of the classifiers.
6. The efficacy of the selected time-domain features is further confirmed by their good performance in classifying samples taken from an operating power plant.
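For reference, four of the five selected fault-type features can be sketched in Python as follows. This is a minimal illustration with our own function name; the geometric mean is computed on absolute values with a small offset to handle signed, zero-crossing vibration data, which is our assumption rather than the paper's stated definition, and T30′ is omitted since it requires a fitted GMM:

```python
import numpy as np

def fault_type_features(x, eps=1e-12):
    """Compute four of the selected statistical features for a 1-D signal."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    return {
        # T2: root mean square
        "rms": float(np.sqrt(np.mean(x ** 2))),
        # T18: geometric mean (of absolute values, our assumption)
        "geo_mean": float(np.exp(np.mean(np.log(np.abs(x) + eps)))),
        # T22: rate of zero crossing (fraction of adjacent sign changes)
        "zcr": float(np.mean(np.sign(x[:-1]) * np.sign(x[1:]) < 0)),
        # T27: Hjorth mobility = sqrt(var(x') / var(x))
        "mobility": float(np.sqrt(np.var(dx) / np.var(x))),
    }
```

For a pure sinusoid, the RMS is close to 1/√2 of the amplitude, the zero crossing rate is low, and the mobility is roughly the angular frequency per sample, which illustrates how these features jointly capture amplitude and spectral content.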

#### 5. Conclusions

So far, research in bearing fault diagnosis and severity estimation has mainly focussed on improving the performance of classification algorithms. However, the applicability of these approaches largely depends on the quality of the features extracted from the bearing signals. In light of this, this paper focussed instead on optimising the input features fed to these classification systems. We believe that if new input features are proposed and/or the input features are selected through an in-depth analysis, they can contribute more to bearing fault diagnosis, covering all types of fault samples exhaustively. This can, in turn, enable existing classification methods to perform much better. To this end, this paper demonstrated the usefulness of Hjorth parameters and normal negative log likelihood for Gaussian mixture model (GMM) for rolling element ball bearing fault detection and severity estimation. An in-depth two-step approach, based on filter-based ranking with 3 metrics and feature subset selection with 11 search techniques, was developed to select an optimal feature subset for bearing fault type detection and severity estimation. Filter-based selection techniques, being model-agnostic, ensure that the selected features work efficiently with any machine learning classifier. In this paper, we used rule-based classifiers and their ensembles to validate our results in terms of diagnostic accuracy.
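The filter-based ranking step can be illustrated with a single filter metric as follows. This minimal Python sketch uses mutual information via scikit-learn; the three metrics and eleven search techniques actually used in the paper are not reproduced here, and the function name is our own:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, names):
    """Rank features by mutual information with the class label,
    highest first. Being model-agnostic, such filter scores do not
    depend on any downstream classifier."""
    scores = mutual_info_classif(X, y, random_state=0)
    order = np.argsort(scores)[::-1]
    return [(names[i], float(scores[i])) for i in order]
```

Averaging the ranks produced by several such metrics, and then refining the top candidates with subset search, follows the two-step spirit of the approach described above.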

The obtained results show that root mean square, geometric mean, rate of zero crossing, Hjorth parameter mobility, and normal negative log likelihood are important statistical features for bearing fault diagnosis. For fault severity detection, these five parameters together with three additional ones (75th percentile, kurtosis factor, and RSSQ) form an important and useful feature subset. We also proposed a novel improvement to one of the selected high-ranked features, normal negative log likelihood for single Gaussian, by suggesting its replacement with its Gaussian mixture model (GMM) variant. This is because some vibration signals have more than one distinct maximum, which a single Gaussian curve cannot model accurately. This also leads us to conclude that existing features can be refined to achieve maximum possible coverage of the underlying data.

The performance of the feature set was validated by running experiments under different conditions, with and without feature selection, and comparing their results. The good performance on the class-unbalanced dataset indicates that the feature set does not depend on the class distribution; since real-world scenarios rarely replicate the conditions of a class-balanced dataset, this robustness to class imbalance is valuable. In addition to the CWRU test rig dataset, the performance was further validated on a bearing dataset obtained from an operational power plant. The results show that the methodology is robust, aligns well with the real-world situation of class-unbalanced datasets, and achieves good performance for both fault diagnosis and fault severity estimation. We also believe that the results of this research can encourage researchers to change their current method of developing diagnostic systems by first analysing the feature set comprehensively and only then shifting their focus to classification. For future studies, the high-performing statistical features obtained in this paper can be combined with features extracted from other domains to build a more robust multidomain feature set. This work can also be extended by considering different types of bearing specifications, preventing the diagnostic system from becoming dependent on the vibration dataset of a specific bearing.

#### Data Availability

The Case Western Reserve University (CWRU) Bearing Dataset is available in [30]. The bearing data from the operational power plant were obtained for academic research and are not publicly available.

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

#### Acknowledgments

The authors are thankful to Case Western Reserve University and Bharat Heavy Electricals Limited for providing them access to their ball bearing datasets.