Computational and Mathematical Methods in Medicine

Volume 2017 (2017), Article ID 7907163, 18 pages

https://doi.org/10.1155/2017/7907163

## A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data

Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany

Correspondence should be addressed to Andrea Bommert; bommert@statistik.tu-dortmund.de

Received 22 February 2017; Revised 3 May 2017; Accepted 5 June 2017; Published 1 August 2017

Academic Editor: Benjamin Hofner

Copyright © 2017 Andrea Bommert et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Finding a good predictive model for a high-dimensional data set can be challenging. For genetic data, it is not only important to find a model with high predictive accuracy, but it is also important that this model uses only a few features and that the selection of these features is stable. This is because, in bioinformatics, the models are used not only for prediction but also for drawing biological conclusions, which makes the interpretability and reliability of the model crucial. We suggest using three target criteria when fitting a predictive model to a high-dimensional data set: the classification accuracy, the stability of the feature selection, and the number of chosen features. As it is unclear which measure is best for evaluating stability, we first compare a variety of stability measures. We conclude that the Pearson correlation has the best theoretical and empirical properties. Also, we find that the behaviour of a stability measure depends most on whether it contains a correction for chance or for large numbers of chosen features. Then, we analyse Pareto fronts and conclude that it is possible to find models with a stable selection of few features without losing much predictive accuracy.

#### 1. Introduction

In many applications of bioinformatics, the goal is to find a good predictive model for high-dimensional data. To avoid overfitting and to discover the relevant features, feature selection should be integrated into the model fitting process [1]. The feature selection should be stable; that is, the sets of chosen features should be similar for similar data sets, as an unstable feature selection would question the reliability of the results [2].

Over the past decade, a variety of frameworks for stability evaluation have been proposed. Overviews of existing stability measures are given in [3, 4]. The theoretical properties of different measures of stability are studied in [5]. Pitfalls with respect to interpreting the values of stability measures are discussed in [6] and experimental setups for stability evaluation are presented in [7]. Ensemble methods for making feature selection more stable than a single feature selection method are proposed in [8–10]. The research that has been done in all of the aforementioned aspects of stability assessment is reviewed in [11] and various feature selection methods including ensemble methods are analysed in [12–18]. It is shown that conducting a stable feature selection before fitting a classification model can increase the predictive performance of the model [19]. Most of these works consider both high stability and high predictive accuracy of the resulting classification model as target criteria but do not consider the number of selected features as a third target criterion.

In this paper, we pursue two goals. Firstly, we compare a variety of stability measures empirically. We aim at finding out which of the measures assess the stability similarly in practical applications. Also, we aim at choosing stability measures that are suitable for finding desirable models for a given data set. Secondly, we suggest a strategy for finding a desirable model for a given data set with respect to the following criteria:

(i) The predictive accuracy must be high.
(ii) The feature selection must be stable.
(iii) Only a small number of features must be chosen.

The predictive power of a model is obviously important and is usually the only criterion considered in model selection. However, when trying to discover relevant features, for example, to understand the underlying biological process, it is also necessary to keep the set of selected features both small and stable. To reach all three targets simultaneously, we combine feature selection and classification methods. For these “augmented” methods, we measure the three target criteria jointly during hyperparameter tuning and we choose configurations which perform well considering all three target criteria.

The rest of the paper is organised as follows. In Section 2, we describe the measures of stability, filter methods, and classification methods which are considered in this paper. In Section 3, the data sets used in our experiments are presented. Section 4 contains the empirical comparison of stability measures. Section 5 covers our second experiment, where we search for desirable configurations with respect to the three target criteria explained above. Section 6 summarizes the conclusions of our work.

#### 2. Methods

In this section, we explain different measures of stability, filter methods for feature selection, and classification methods. We also describe the concept of Pareto optimality.

##### 2.1. Measures of Stability

We use the following notation: assume that there is a data set containing $n$ observations of the features $X_1, \ldots, X_p$. Resampling is used to split the data set into $m$ subsets. The feature selection method is then applied to each of the $m$ subsets. Let $V_i$, $i = 1, \ldots, m$, denote the set of chosen features for the $i$-th subset of the data set and let $|V_i|$ be the cardinality of this set.

###### 2.1.1. Intersection Based Stability Measures

The following intersection based stability measures consider a feature selection to be stable if the cardinalities $|V_i \cap V_j|$ of all pairwise intersections are high. The measures standardise the cardinalities of the intersections in different ways. Three simple stability measures based on similarity indices are defined as

Jaccard [20]:
$$S_J = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{|V_i \cap V_j|}{|V_i \cup V_j|}$$

Dice [21]:
$$S_D = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{2\,|V_i \cap V_j|}{|V_i| + |V_j|}$$

Ochiai [22]:
$$S_O = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{|V_i \cap V_j|}{\sqrt{|V_i| \cdot |V_j|}}$$

Extending $S_J$ in a way that different but highly correlated variables count towards stability gives the stability measure

Zucknick et al. [23]:
$$S_Z = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{|V_i \cap V_j| + C(V_i, V_j) + C(V_j, V_i)}{|V_i \cup V_j|}$$

with
$$C(V_i, V_j) = \frac{1}{|V_j|} \sum_{X \in V_i} \sum_{Y \in V_j \setminus V_i} |\rho(X, Y)| \cdot 1_{[\theta, 1]}\left(|\rho(X, Y)|\right),$$

where $\rho(X, Y)$ is the Pearson correlation between $X$ and $Y$, $\theta \in (0, 1]$ is a threshold, and $1_M$ denotes the indicator function for a set $M$, that is, $1_M(x) = 1$ if $x \in M$ and $1_M(x) = 0$ otherwise.
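As a minimal sketch, the three similarity-index measures can be computed by averaging a pairwise score over all pairs of selected feature sets. The feature names and the sets below are invented for illustration; they are not from the paper's experiments.

```python
from itertools import combinations

def pairwise_stability(selected_sets, pair_score):
    """Average a pairwise similarity score over all m(m-1)/2 pairs of sets."""
    pairs = list(combinations(selected_sets, 2))
    return sum(pair_score(a, b) for a, b in pairs) / len(pairs)

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|
    return len(a & b) / len(a | b)

def dice(a, b):
    # 2|A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def ochiai(a, b):
    # |A ∩ B| / sqrt(|A| * |B|)
    return len(a & b) / (len(a) * len(b)) ** 0.5

# Hypothetical feature sets chosen on m = 3 resampled subsets
V = [{"g1", "g2", "g3"}, {"g1", "g2", "g4"}, {"g1", "g3", "g4"}]
print(round(pairwise_stability(V, jaccard), 3))  # 0.5
print(round(pairwise_stability(V, dice), 3))     # 0.667
print(round(pairwise_stability(V, ochiai), 3))   # 0.667
```

Note that Dice and Ochiai coincide here because all sets have equal cardinality; they differ once the $|V_i|$ vary.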

The idea of a stability measure that is corrected for chance was first proposed in [27]. The reason for a correction for chance is that $|V_i \cap V_j|$ necessarily becomes large if $|V_i|$ and $|V_j|$ are large. The idea is made applicable in situations in which the numbers of chosen features vary:

Lustgarten et al. [24]:
$$S_L = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{|V_i \cap V_j| - \frac{|V_i| \cdot |V_j|}{p}}{\min\left(|V_i|, |V_j|\right) - \max\left(0, |V_i| + |V_j| - p\right)}$$
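A sketch of this corrected measure: each pairwise intersection is reduced by the overlap $|V_i||V_j|/p$ expected for a random selection from $p$ features, and scaled by the range of achievable intersection sizes. The feature sets are again invented for illustration.

```python
from itertools import combinations

def lustgarten(selected_sets, p):
    """Pairwise intersection sizes, corrected for the overlap expected
    when the sets are drawn at random from p features."""
    pairs = list(combinations(selected_sets, 2))
    total = 0.0
    for a, b in pairs:
        expected = len(a) * len(b) / p       # E[|A ∩ B|] under random selection
        upper = min(len(a), len(b))          # largest possible intersection
        lower = max(0, len(a) + len(b) - p)  # smallest possible intersection
        total += (len(a & b) - expected) / (upper - lower)
    return total / len(pairs)

V = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}]
print(round(lustgarten(V, p=100), 3))  # 0.637
```

With only $p = 4$ candidate features instead of $p = 100$, the same sets would score much lower, because an overlap of two features is then largely expected by chance.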

###### 2.1.2. Frequency Based Stability Measures

Let $h_j$, $j = 1, \ldots, p$, denote the number of sets $V_1, \ldots, V_m$ that contain feature $X_j$ so that $h_j$ is the absolute frequency with which feature $X_j$ is chosen. Frequency based stability measures evaluate such situations as stable in which the features are chosen for either most of the subsets or not at all. The entropy-based measure of stability relies on $h_1, \ldots, h_p$ and is given by

Novovicová et al. [25]:
$$S_N = \frac{1}{q \log_2 m} \sum_{j : h_j > 0} h_j \log_2 h_j$$
with $q = \sum_{j=1}^{p} h_j$ and $h_1, \ldots, h_p$ as defined above.

Davis et al. [13]:
$$S_{\mathrm{Davis}} = \max\left\{0,\; \frac{1}{m \, |F|} \sum_{j=1}^{p} h_j \;-\; \frac{\alpha}{p}\, \operatorname{median}\left(|V_1|, \ldots, |V_m|\right)\right\}$$
with $F = \bigcup_{i=1}^{m} V_i$, penalty parameter $\alpha \in [0, 1]$, and $h_1, \ldots, h_p$ like before. $S_{\mathrm{Davis}}$ is a stability measure where the minuend rewards frequent choices of variables, while the subtrahend penalises large sets of chosen features.
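The entropy-based measure $S_N$ can be sketched directly from the frequencies $h_j$: it equals 1 when every chosen feature is selected in all $m$ subsets and decreases as the choices become more spread out. The feature sets are invented for illustration.

```python
import math
from collections import Counter

def novovicova(selected_sets):
    """Entropy-based stability of Novovicova et al.:
    S_N = (1 / (q * log2 m)) * sum over h_j > 0 of h_j * log2 h_j."""
    m = len(selected_sets)
    counts = Counter(f for s in selected_sets for f in s)  # the h_j with h_j > 0
    q = sum(counts.values())
    return sum(h * math.log2(h) for h in counts.values()) / (q * math.log2(m))

V = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}]
print(round(novovicova(V), 3))  # 0.754
```

A deterministic selection, for example `[{1, 2}, {1, 2}, {1, 2}]`, yields exactly 1.0, since every $h_j$ then equals $m$.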

The relative weighted consistency is defined as

Somol and Novovicová [26]:
$$S_S = \frac{CW - CW_{\min}}{CW_{\max} - CW_{\min}}$$
with
$$CW = \sum_{j=1}^{p} \frac{h_j}{q} \cdot \frac{h_j - 1}{m - 1}, \qquad CW_{\min} = \frac{q^2 - p\,(q - D) - D^2}{p\,q\,(m-1)}, \qquad CW_{\max} = \frac{H^2 + q\,(m-1) - H m}{q\,(m-1)},$$
where $D = q \bmod p$, $H = q \bmod m$, and $q = \sum_{j=1}^{p} h_j$ is like before. Calculating $(h_j - 1)/(m - 1)$ scales the positive absolute frequencies to $[0, 1]$. All scaled frequencies with $h_j > 0$ are assigned the weight $h_j / q$. The correction terms $CW_{\min}$ and $CW_{\max}$ cause the measure to lie within the range $[0, 1]$. As the correction terms depend on $q$, this measure contains a correction for chance.

###### 2.1.3. Correlation

The Pearson correlation can be used as a stability measure. To do so, Nogueira and Brown [5] define a vector $z_i \in \{0, 1\}^p$ for each set $V_i$ of selected features to indicate which features are chosen. The $j$-th component of $z_i$ is equal to 1 if $V_i$ contains $X_j$; that is, $z_{ij} = 1_{V_i}(X_j)$, $j = 1, \ldots, p$. The resulting stability measure is

Correlation [5]:
$$S_C = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \rho\left(z_i, z_j\right)$$
with $\rho(z_i, z_j)$ denoting the Pearson correlation between $z_i$ and $z_j$. The Pearson correlation measures the linear association between continuous variables. When applied to binary data like the vectors $z_1, \ldots, z_m$, the Pearson correlation is equivalent to the phi coefficient for the contingency table of each two of these vectors.
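A sketch of $S_C$: encode each selected set as a 0/1 indicator vector of length $p$ and average the Pearson correlation over all pairs. The sets and $p$ are invented for illustration; a plain Pearson helper is used to keep the example self-contained.

```python
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def correlation_stability(selected_sets, p):
    """Average Pearson correlation of the 0/1 selection-indicator vectors z_i."""
    Z = [[1 if j in s else 0 for j in range(p)] for s in selected_sets]
    pairs = list(combinations(Z, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

V = [{0, 1, 2}, {0, 1, 3}, {0, 2, 3}]
print(round(correlation_stability(V, p=10), 3))  # 0.524
```

Because the vectors are binary, each pairwise correlation here equals the phi coefficient of the corresponding 2x2 contingency table, as stated above.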

###### 2.1.4. Theoretical Properties

Nogueira and Brown [5] define four properties which are desirable for stability measures:

(i) Fully defined (fulfilled if the measure does not require the cardinalities $|V_1|, \ldots, |V_m|$ to be identical)
(ii) Upper/lower bounds (fulfilled if the upper and lower bounds of a measure are both finite)
(iii) Maximum (fulfilled if a deterministic selection of the same features achieves the maximum value and if the maximum value is only achieved by a deterministic selection)
(iv) Correction for chance (fulfilled if the expected value of the stability measure for a random feature selection is constant, that is, does not depend on the number of chosen features)

When features are chosen entirely at random, uncorrected measures usually attain higher values the more features are selected. For $S_J$, $S_D$, $S_L$, $S_S$, and $S_C$, these properties are analysed in [5]. We report these results in Table 1 and add the results for $S_O$, $S_Z$, $S_{\mathrm{Davis}}$, and $S_N$. Additionally, the theoretical ranges of the stability measures are given in Table 1. For all measures, high values indicate high stability and low values indicate low stability. Note that the upper bound for $S_{\mathrm{Davis}}$ depends on the penalty parameter $\alpha$.
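Property (iv) can be illustrated with a small simulation, sketched here under invented settings ($p = 200$ features, $m = 10$ subsets, $k$ features drawn uniformly at random per subset): the uncorrected Jaccard measure grows steadily with $k$, while the chance-corrected Lustgarten measure stays near 0 regardless of $k$.

```python
import random
from itertools import combinations

random.seed(1)
p, m = 200, 10  # hypothetical numbers of features and subsets

def jaccard_stab(sets):
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

def lustgarten_stab(sets, p):
    pairs = list(combinations(sets, 2))
    total = 0.0
    for a, b in pairs:
        num = len(a & b) - len(a) * len(b) / p
        den = min(len(a), len(b)) - max(0, len(a) + len(b) - p)
        total += num / den
    return total / len(pairs)

for k in (5, 50, 150):
    # purely random feature selection: k of p features per subset
    sets = [set(random.sample(range(p), k)) for _ in range(m)]
    print(k, round(jaccard_stab(sets), 2), round(lustgarten_stab(sets, p), 2))
```

The printed Jaccard values rise towards roughly $k / (2p - k)$ as $k$ grows, whereas the Lustgarten values fluctuate around 0, which is exactly the behaviour that motivates the correction for chance.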