Journal of Sensors

Volume 2018, Article ID 7467418, 21 pages

https://doi.org/10.1155/2018/7467418

## A Novel Feature Selection Scheme and a Diversified-Input SVM-Based Classifier for Sensor Fault Classification

School of Electrical Engineering, University of Ulsan, Ulsan 44610, Republic of Korea

Correspondence should be addressed to Insoo Koo; iskoo@ulsan.ac.kr

Received 26 May 2018; Accepted 29 July 2018; Published 5 September 2018

Academic Editor: Giuseppe Maruccio

Copyright © 2018 Sana Ullah Jan and Insoo Koo. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The efficiency of a binary support vector machine- (SVM-) based classifier depends on the combination and the number of input features extracted from raw signals. A combination of individually good features does not always discriminate a class well, because those features may also be highly relevant to a second class. Moreover, increasing the dimensionality of the input vector degrades the performance of a classifier in most cases. For efficient results, a classifier should be fed the smallest possible combination of discriminating features. In this paper, we propose a framework to improve the performance of an SVM-based classifier for sensor fault classification in two ways: firstly, by selecting the best combination of features for a target class from a feature pool and, secondly, by minimizing the dimensionality of input vectors. To obtain the best combination of features, we propose a novel feature selection algorithm that selects $k$ out of $N$ features having the maximum mutual information (or relevance) with the target class and the minimum mutual information with the nontarget classes. This technique ensures that the selected features are sensitive to the target class exclusively. Furthermore, we propose a diversified-input SVM (DI-SVM) model for multiclass classification problems to achieve our second objective, which is to reduce the dimensions of the input vector. In this model, the number of SVM-based classifiers equals the number of classes in the dataset; however, each classifier is fed a unique combination of features selected by the feature selection scheme for its target class. The efficiency of the proposed feature selection algorithm is shown by comparing the results obtained from experiments performed with and without feature selection. Furthermore, the experimental results in terms of accuracy, receiver operating characteristics (ROC), and the area under the ROC curve (AUC-ROC) show that the proposed DI-SVM model outperforms the conventional SVM model, the neural network, and the $k$-nearest neighbor algorithm for sensor fault detection and classification.

#### 1. Introduction

The sensors in industrial systems are the second major source of faults after rolling elements (e.g., bearings) [1–3]. These faults lead to intolerable consequences, including increased maintenance costs, compromised product reliability and, even more critically, reduced safety [3, 4]. Such issues can be largely avoided by detecting a fault as soon as it appears. Therefore, the output of a sensor is monitored to promptly identify anomalies. Once a fault is detected, its primary cause must be identified so that safety measures can be implemented. For this purpose, faults are categorized into predefined classes obtained from historical data, a process referred to as classification.

Recently, machine learning (ML) techniques, such as neural networks (NN), support vector machines (SVM), and $k$-nearest neighbors (KNN), have been favored for classification problems due to their efficient performance [5–12]. However, they require a feature extraction method to overcome the curse of dimensionality of input signals. This method characterizes the hundred- or even thousand-dimensional input signal with a few extracted features, which are then used as inputs to the classifiers instead of the raw signal. Although this technique may increase efficiency in some cases, often there is little or no improvement in classifier performance. A major reason is the use of features with low discriminating power between the samples of different classes. Good features are those that characterize the signals of all classes in a dataset such that a signal of one class is easily discriminated from the others. To pick such features from a feature pool, a feature selection (FS) algorithm is used. The feature selection step therefore plays a dominant role in the reliable performance of classifiers in pattern recognition and classification applications. Nevertheless, a combination of individually good features does not necessarily lead to good classification performance [13]. Therefore, a good FS algorithm seeks the best combination of features, not a combination of individually good features. Moreover, as mentioned earlier, the complexity of the system is directly related to the dimensionality of the input vectors. Selecting the fewest possible good features can therefore enhance classifier performance.

FS algorithms are classified as wrapper-based, embedded-based, and filter-based [14–16]. Wrapper-based methods select features by analyzing the performance of a classifier after each selection; the features that optimize classifier performance are kept. These techniques require high computational resources and a long time to find the best features; even so, optimality is not ensured. Embedded-based feature selection optimizes the classifier and the feature selection simultaneously. The problem with these approaches is that the features are selected for the classifier under consideration and may not transfer to other classifiers. In contrast, filter-based feature selection selects features irrespective of classifier optimization. In this approach, mutual information (MI) is a widely used measure for selecting the features most relevant to the target class.

The objective of the current work is to develop an efficient detection and classification algorithm for sensor faults using a supervised ML-based classifier. For this purpose, we first propose a novel MI-based FS scheme to select the exclusive relevant features with high discriminating power from a feature pool. Then, we propose a diversified-input SVM (DI-SVM) model to reduce the dimensions of input vectors to the classifiers ultimately improving the performance.

##### 1.1. Contributions

The major contributions of this work are summarized as follows.

(i) A filter-based FS algorithm is proposed to select a combination of features that exclusively discriminates one class from the others. For a target class, the measure of MI of a candidate feature with the nontarget classes is subtracted from the measure of MI of the same feature with the target class. This technique selects features with a high capability of discriminating the target class and, at the same time, low sensitivity to the nontarget classes. The results show the efficiency of the proposed FS scheme over schemes that utilize the redundancy between features, in addition to relevance, to measure the goodness of features.

(ii) Furthermore, we propose a DI-SVM model to reduce the dimensions of the input vectors to the classifiers. This model feeds a diversified combination of features to the different SVMs utilized for multiclass classification: the $i$th classifier is fed the combination of features selected by the feature selection algorithm for the $i$th class. In contrast, in the conventional way of using SVMs for multiclass classification, all classifiers are trained with exactly the same set of features. The experimental analysis shows that the proposed DI-SVM model further improves the classification performance of the conventional SVM.

(iii) The performance of the proposed methodology is analyzed using two datasets. In the first dataset, the faulty signals are obtained by keeping the index of the fault insertion point fixed at 500 during simulation, whereas the second dataset is obtained using faulty signals with the fault insertion point varying from 0 to 1000 in each sample. The latter case replicates a practical scenario in industrial systems. The results show that it is more challenging to detect and classify faults in the second case. However, the framework of the proposed FS scheme and DI-SVM achieves satisfactory results in this case as well.

(iv) A series of five experiments is performed to compare the performance of the proposed DI-SVM model with those of the conventional SVM, NN, and KNN classifiers. The first two experiments use the conventional SVM, NN, and KNN classifiers without feature selection. In the next experiment, we demonstrate the efficiency of the proposed FS algorithm using measures of accuracy, receiver operating characteristics (ROC), area under the ROC curve (AUC-ROC), and scatter plots in the selected feature spaces. In the last two experiments, we deploy the proposed FS and DI-SVM model along with the abovementioned classifiers and compare the performances. The results show that DI-SVM outperforms the three counterpart classifiers.
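To make the DI-SVM idea concrete, here is a minimal sketch using scikit-learn (not the authors' implementation): one binary SVM per class, where classifier $i$ sees only the feature columns chosen for class $i$, and the predicted class is the argmax over the per-class decision scores. The class name, kernel choice, and toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

class DiversifiedInputSVM:
    """One binary SVM per class; classifier i is trained one-vs-rest on
    only the feature columns selected for class i (illustrative sketch)."""
    def __init__(self, feature_subsets):
        self.subsets = feature_subsets          # subsets[i]: columns for class i
        self.models = [SVC(kernel="rbf", gamma="scale") for _ in feature_subsets]

    def fit(self, X, y):
        for i, (cols, model) in enumerate(zip(self.subsets, self.models)):
            model.fit(X[:, cols], (y == i).astype(int))   # one-vs-rest labels
        return self

    def predict(self, X):
        # Each model scores its own class; pick the class with the best score.
        scores = np.column_stack([m.decision_function(X[:, cols])
                                  for cols, m in zip(self.subsets, self.models)])
        return scores.argmax(axis=1)

# Toy data: each class is separable along its own dedicated feature.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (300, 3))
y = np.repeat(np.arange(3), 100)
for i in range(3):
    X[y == i, i] += 4.0

clf = DiversifiedInputSVM([[0], [1], [2]]).fit(X, y)
acc = (clf.predict(X) == y).mean()
```

Note the design choice this sketch highlights: each binary problem is lower-dimensional than in a conventional one-vs-rest SVM, where every classifier would see all columns.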

This paper is organized as follows. A review of related works and the proposed feature selection algorithm are presented in Section 2. The theory of the classification model and proposed DI-SVM model is presented in Section 3. The experimental results are illustrated in Section 4. Section 5 presents the discussion about the experimental results, and finally, the study is concluded in Section 6.

#### 2. Mutual Information-Based Feature Selection

##### 2.1. Preliminaries

Before reviewing the related works, we present fundamental background on MI-based feature selection schemes. For two continuous random variables $X$ and $Y$, the MI is defined as follows [17]:

$$I(X;Y) = \int_{X}\int_{Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy,$$

where $p(x)$ and $p(y)$ are the probability density functions of the continuous random variables $X$ and $Y$, respectively, and $p(x,y)$ is their joint probability density function. For discrete random variables, the integration is replaced with summation as follows:

$$I(X;Y) = \sum_{x\in X}\sum_{y\in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},$$

where $p(x)$ and $p(y)$ are the probability mass functions of the discrete random variables $X$ and $Y$, respectively, and $p(x,y)$ is their joint probability mass function. The MI can be expressed in terms of entropy as follows:

$$I(X;Y) = H(X) - H(X \mid Y),$$

where $H(X)$ and $H(X \mid Y)$ represent the entropy and the conditional entropy, defined as follows:

$$H(X) = -\sum_{x\in X} p(x)\log p(x), \qquad H(X \mid Y) = -\sum_{y\in Y} p(y)\sum_{x\in X} p(x\mid y)\log p(x\mid y).$$
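These identities are easy to check numerically. The sketch below uses empirical plug-in estimates with base-2 logarithms (so MI is in bits); the helper names are our own.

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy H(X) in bits."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to H(X) - H(X|Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Fully dependent variables share all their entropy ...
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0 bit
# ... while independent variables share none.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 bits
```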

##### 2.2. Related Works

In this paper, we focus on feature selection criteria based on the measure of MI. Different FS algorithms have been proposed in the past using different MI-based criteria to measure the goodness of features. Battiti [18] selected features by calculating the MI $I(C; f_i)$ of an individual feature $f_i$ with the class $C$. To avoid the selection of redundant features, the MI $I(f_i; f_s)$ between a candidate feature $f_i$ and an already-selected feature $f_s$ is also calculated. The MI-based feature selection (MIFS) criterion is given as [18]

$$f_{\mathrm{MIFS}} = \arg\max_{f_i \in F\setminus S}\Bigl[\, I(C; f_i) - \beta \sum_{f_s \in S} I(f_i; f_s) \Bigr],$$

where $\beta$ is the regularization parameter weighting the redundancy between a candidate feature $f_i$ and the already-selected features $f_s \in S$. Kwak and Choi [19] proved that a large value of $\beta$ can lead to the selection of suboptimal features. They improved the MIFS scheme to propose MIFS-U with the following modification:

$$f_{\mathrm{MIFS\text{-}U}} = \arg\max_{f_i \in F\setminus S}\Bigl[\, I(C; f_i) - \beta \sum_{f_s \in S} \frac{I(C; f_s)}{H(f_s)}\, I(f_i; f_s) \Bigr].$$
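Battiti's greedy loop can be sketched as follows, using empirical plug-in MI; the helper names, `beta` value, and toy data are illustrative assumptions, not the paper's setup.

```python
import math
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def _I(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def mifs_select(features, labels, k, beta):
    """Greedily pick k features maximizing I(C; f) - beta * sum_s I(f; f_s)."""
    selected, remaining = [], list(range(len(features)))
    for _ in range(k):
        best = max(remaining, key=lambda i: _I(features[i], labels)
                   - beta * sum(_I(features[i], features[s]) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# f1 duplicates f0; a large beta makes the redundancy penalty reject it.
C  = [0, 0, 1, 1]
f0 = [0, 0, 1, 1]; f1 = [0, 0, 1, 1]; f2 = [0, 1, 0, 1]
print(mifs_select([f0, f1, f2], C, 2, beta=1.5))  # [0, 2]
```

With `beta=0` the redundancy term vanishes and the duplicate `f1` is picked second instead, which is exactly the sensitivity to $\beta$ criticized below.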

In both of these algorithms, the selection of a feature depends on the user-defined parameter $\beta$ weighting the importance of redundancy between features. If $\beta$ is chosen too large, the feature selection will be dominated by the redundancy term. The authors of [13] proposed a parameter-free criterion, named minimal-redundancy-maximal-relevance (mRMR), given by

$$f_{\mathrm{mRMR}} = \arg\max_{f_i \in F\setminus S}\Bigl[\, I(C; f_i) - \frac{1}{|S|} \sum_{f_s \in S} I(f_i; f_s) \Bigr],$$

where $|S|$ is the number of already-selected features.
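The mRMR criterion replaces the $\beta$-weighted penalty with the mean redundancy over the selected set, so no tuning parameter remains. A sketch (the toy class labels carry two independent "bits", so the non-redundant second feature wins):

```python
import math
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def _I(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def mrmr_select(features, labels, k):
    """Greedily pick k features maximizing I(C; f) - mean_s I(f; f_s)."""
    selected, remaining = [], list(range(len(features)))
    for _ in range(k):
        best = max(remaining, key=lambda i: _I(features[i], labels)
                   - (sum(_I(features[i], features[s]) for s in selected)
                      / len(selected) if selected else 0.0))
        selected.append(best)
        remaining.remove(best)
    return selected

# f1 duplicates f0; f2 carries the class's other, independent bit.
C  = [0, 1, 2, 3]
f0 = [0, 0, 1, 1]; f1 = [0, 0, 1, 1]; f2 = [0, 1, 0, 1]
print(mrmr_select([f0, f1, f2], C, 2))  # [0, 2]
```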

Estevez et al. [20] pointed out that the right-hand sides of MIFS and MIFS-U, in (6) and (7), grow with the cardinality of the selected feature subset. The redundancy term thus comes to dominate, forcing the FS to select nonredundant features, which may lead to the selection of irrelevant features before relevant ones. Another problem is that there is no technique to optimize $\beta$, and its value depends highly on the problem under consideration. Although mRMR partly solves the first problem of MIFS and MIFS-U, its performance is still only comparable to those of MIFS and MIFS-U [21].

A normalized MIFS (NMIFS) was proposed to address the above problems and is given as [20]

$$f_{\mathrm{NMIFS}} = \arg\max_{f_i \in F\setminus S}\Bigl[\, I(C; f_i) - \frac{1}{|S|} \sum_{f_s \in S} \frac{I(f_i; f_s)}{\min\{H(f_i), H(f_s)\}} \Bigr].$$

An improved version of NMIFS (I-NMIFS) is given in [21].

The problem with NMIFS and I-NMIFS is that they both rely on the entropies of the selected and under-observation features. The entropy of a feature depends strongly on the number of samples from each class in the dataset. If one class has a higher number of samples than another, this imbalance affects the entropy values of the features, leading to degraded classifier performance.

Furthermore, all these algorithms rely on the MI of a candidate feature with the target class (relevance) and with the already-selected features (redundancy). In this paper, we assess the goodness of a candidate feature using only its relevance to the target class and to the nontarget classes. Indeed, a feature having high MI (i.e., relevance) with a target class might not be a good choice if it also has high relevance to at least one nontarget class; this may degrade rather than improve the performance of the classifier. Therefore, we propose a feature selection scheme that relies only on the relative relevance of a feature to the different classes in the dataset. Eliminating the redundancy measure simplifies the scheme, yet it achieves satisfactory performance compared with FS schemes that also include a redundancy factor in their selection criteria.

##### 2.3. The Proposed Feature Selection Scheme

Given a training dataset composed of $M$ samples and $N$ features, the aim is to select $k$ out of the $N$ features for each class $c_i$ individually, where $i = 1, \ldots, C$ and $C$ is the total number of classes. The features having maximum relevance to class $c_i$ and minimum relevance to the remaining classes are the most suitable for discriminating class $c_i$. Relevance is usually described by MI or correlation, of which MI is the more widely adopted measure of the dependence between two random variables. Mutually independent variables have zero MI; the higher the dependency between two variables, the higher their MI.

The proposed feature selection algorithm selects features by considering their MI with both the target class and the nontarget classes. A feature having maximum relevance to a target class fits that class, but it may also have high relevance to a nontarget class, making it less discriminating. Therefore, the features having maximum relevance to the target class and minimum relevance to the nontarget classes are selected. This approach selects the features that are sensitive to the target class exclusively. The exclusive relevance of a feature to the target class is obtained by calculating the difference between the MI of the candidate feature vector $f_j$, where $j = 1, \ldots, N$, with the target class $c_t$ and the MI of $f_j$ with the set of nontarget classes $c_{nt}$. Mathematically,

$$S(f_j) = I(f_j; c_t) - I(f_j; c_{nt}),$$

where $c_{nt}$ is the set of nontarget classes. The $k$ features with the maximum values of $S(f_j)$ are selected for target class $c_t$. The MI of $f_j$ with a class is calculated in terms of entropy as follows:

$$I(f_j; c) = H(c) - H(c \mid f_j),$$

where $H(c)$ is the entropy of the class variable and $H(c \mid f_j)$ is the conditional entropy of the class given $f_j$. Assuming $0\log 0 = 0$ from continuity, $H(c)$ and $H(c \mid f_j)$ are defined as

$$H(c) = -P(c_t)\log P(c_t) - P(c_{nt})\log P(c_{nt}), \qquad H(c \mid f_j) = -\sum_{x \in f_j} p(x) \sum_{c} P(c \mid x)\log P(c \mid x),$$

where $P(c_t)$ and $P(c_{nt})$ are the probabilities of the target class and the nontarget classes, which can be calculated as $M_t/M$ and $(M - M_t)/M$, respectively, where $M_t$ is the number of samples corresponding to class $c_t$ in the training dataset. The conditional probability is given as

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{\sum_{c' \in \mathcal{C}} P(x \mid c')\,P(c')},$$

where $\mathcal{C}$ corresponds to the set of all classes.
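The difference score can be sketched as below with empirical plug-in MI estimates. Averaging the relevance over the nontarget classes is our reading of the $c_{nt}$ term, and the function names and toy data are illustrative assumptions.

```python
import math
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def _I(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def exclusive_relevance(f, labels, target):
    """S = I(f; c_t) - I(f; c_nt): relevance to the target class minus the
    average relevance to the nontarget classes (sketch of the criterion)."""
    classes = sorted(set(labels))
    rel = _I(f, [int(c == target) for c in labels])
    others = [c for c in classes if c != target]
    return rel - sum(_I(f, [int(c == o) for c in labels])
                     for o in others) / len(others)

# f0 marks class 0 exclusively; f1 marks classes 0 and 1 jointly,
# so its relevance to class 0 is not exclusive and it scores lower.
y  = [0, 0, 1, 1, 2, 2]
f0 = [1, 1, 0, 0, 0, 0]
f1 = [1, 1, 1, 1, 0, 0]
print(exclusive_relevance(f0, y, 0) > exclusive_relevance(f1, y, 0))  # True
```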

The probability mass function $P(x \mid c)$ can be estimated using the Parzen window method as follows [13]:

$$\hat{p}(x \mid c) = \frac{1}{|I_c|} \sum_{i \in I_c} \phi\bigl(x - x^{(i)}, h\bigr),$$

where $I_c$ represents the set of indices of the training samples corresponding to class $c$ and $\phi(\cdot)$ is the window function. The commonly used Gaussian window function is expressed as

$$\phi(z, h) = \frac{1}{(2\pi)^{d/2}\, h^{d}\, \lvert\Sigma\rvert^{1/2}} \exp\!\left(-\frac{z^{T}\Sigma^{-1} z}{2h^{2}}\right),$$

where $\Sigma$ is the covariance matrix of the $d$-dimensional vectors of random variables and $h$ is the width parameter, which shrinks as the number of samples grows. The appropriate selection of $h$ and a large number of samples make the estimate converge to the true density [13]. Substituting this estimate into the above expression for the conditional probability yields the estimated conditional probability mass function $\hat{P}(c \mid x)$.
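A one-dimensional Gaussian Parzen estimate can be sketched as follows; the window width `h` and the sample values are illustrative, and the paper's multivariate form additionally involves the covariance matrix $\Sigma$.

```python
import math

def parzen_density(x, samples, h):
    """p_hat(x) = (1/n) * sum_i Gaussian(x; mean=x_i, std=h)."""
    norm = 1.0 / (h * math.sqrt(2.0 * math.pi))
    return sum(norm * math.exp(-(x - xi) ** 2 / (2.0 * h * h))
               for xi in samples) / len(samples)

samples = [0.0, 1.0, 2.0]
# Sanity check: the estimate integrates to ~1 over a wide grid.
dx = 0.01
total = sum(parzen_density(-6.0 + i * dx, samples, h=0.5) * dx
            for i in range(int(14.0 / dx)))
print(round(total, 3))  # 1.0
```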

A pseudocode of the steps in the proposed FS algorithm is given in Algorithm 1. A training dataset with $M$ samples and $N$ features, where $f_j^{(i)}$ is the $i$th element of the feature vector $f_j$, is given as input to the selection scheme. A matrix $F$ of size $C \times k$ is the output of the algorithm, where each row contains the $k$ features selected for one target class. In the first step, a target class $c_t$ is selected from the set of all classes $\mathcal{C}$. Then, the respective set of nontarget classes $c_{nt}$ is obtained. In the next step, each feature vector $f_j$ is examined in turn. For each feature vector, the elements belonging to class $c_t$ and to the classes in $c_{nt}$ are stored in separate vectors, and the mean values of these vectors are computed. After a series of calculations of the estimated conditional probability, the entropy, and the conditional entropy, the score $S(f_j)$ is calculated for the given class. Then, the $k$ features having the maximum values of $S$ are selected and stored in the corresponding row of the matrix $F$. Similarly, $k$ features are selected for each class and stored in the final matrix $F$.
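The loop structure of the algorithm can be sketched end to end as follows. This is a sketch only: empirical counts stand in for the Parzen-based probability estimates, the nontarget relevance is averaged over the nontarget classes, and the toy data are made up.

```python
import math
from collections import Counter

def _H(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def _I(xs, ys):
    return _H(xs) + _H(ys) - _H(list(zip(xs, ys)))

def select_features(X, y, k):
    """For each class, score every feature by relevance-to-target minus
    average relevance-to-nontarget, and keep the top k per class.
    Returns a C x k matrix of feature indices (one row per class)."""
    classes = sorted(set(y))
    n_feat = len(X[0])
    F = []
    for t in classes:                                 # target class c_t
        target = [int(label == t) for label in y]     # c_t vs the rest
        nontarget = [c for c in classes if c != t]    # the set c_nt
        scores = []
        for j in range(n_feat):
            f = [row[j] for row in X]
            rel = _I(f, target)                       # relevance to c_t
            irr = sum(_I(f, [int(label == c) for label in y])
                      for c in nontarget) / len(nontarget)
            scores.append(rel - irr)                  # difference score S(f_j)
        F.append(sorted(range(n_feat), key=lambda j: -scores[j])[:k])
    return F

# Columns: f0 marks class 0, f1 marks classes {0, 1} (i.e., "not class 2"),
# f2 is noise. Each class gets the feature that marks it exclusively.
y = [0, 0, 1, 1, 2, 2]
X = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 0],
     [0, 1, 1],
     [0, 0, 0],
     [0, 0, 1]]
F = select_features(X, y, 1)
print(F)  # [[0], [2], [1]]
```

Note that class 2 ends up with `f1`: a feature marking "not class 2" is exclusively informative about class 2, which the difference score captures.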