BioMed Research International

Volume 2015, Article ID 320385, 10 pages

http://dx.doi.org/10.1155/2015/320385

## A Robust Supervised Variable Selection for Noisy High-Dimensional Data

Jan Kalina and Anna Schlenker

^{1}Institute of Computer Science of the Czech Academy of Sciences, Pod Vodárenskou Vĕží 2, 182 07 Prague 8, Czech Republic

^{2}Department of Biomedical Informatics, Faculty of Biomedical Engineering, Czech Technical University in Prague, Náměstí Sítná 3105, 272 01 Kladno, Czech Republic

Received 14 November 2014; Accepted 7 April 2015

Academic Editor: Rosalyn H. Hargraves

Copyright © 2015 Jan Kalina and Anna Schlenker. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, usual relevance and redundancy criteria have the disadvantages of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we perform the computations also for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.

#### 1. Introduction

Variable selection represents an important category of dimensionality reduction methods frequently used in the analysis of multivariate data within data mining and multivariate statistics. Variable selection with the aim of finding a smaller number of key variables is an indispensable tool in the analysis of high-dimensional data with the number of variables p largely exceeding the number of observations n (i.e., n ≪ p) [1, 2]. The requirement to analyze thousands of highly correlated variables measured on tens or hundreds of samples is very common, for example, in molecular genetics. If the observed data come from several different groups and the aim of the data analysis is learning a classification rule, supervised dimensionality reduction methods are preferable [3], because unsupervised methods such as principal component analysis (PCA) cannot take the information about group membership into account [4].

While real data are typically contaminated by outlying measurements (outliers) caused by various reasons [5], numerous variable selection procedures suffer from the presence of outliers in the data. Robust dimensionality reduction procedures resistant to outliers were proposed typically in the form of modifications of PCA [6–9]. Still, the importance of robust variable selection keeps increasing [10] as the amount of digital information worldwide grows at an enormous rate.

Most of the available variable selection procedures tend to select highly correlated variables [11]. This is also a problem of various Maximum Relevance (MR) approaches [12], which select sets of variables that are inefficient for classification tasks because of the undesirable redundancy within the selected set [13]. As an improvement, the Minimum Redundancy Maximum Relevance (MRMR) criterion was proposed [14] with various criteria for measuring the relevance of a given variable and the redundancy within the set of selected key variables. Its ability to avoid selecting highly correlated variables brings about benefits for subsequent analysis. However, these methods remain too vulnerable to outlying values and noise [15].

In this paper, we propose a new MRMR criterion combining principles of regularization and robust statistics, together with a novel optimization algorithm for its computation. It is called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR). For this purpose, we recommend using a highly robust correlation coefficient [16] based on the least weighted squares regression [17] as a new measure of relevance of a given variable. Further, we define a new regularized version of the coefficient of multiple correlation and use it as a redundancy measure. The regularization allows computing it in a numerically stable way even if the number of selected variables exceeds n, and the result can be advocated as a denoised measure with improved robustness properties.

This paper has the following structure. Section 2 describes existing approaches to the MRMR criterion. Sections 3.1 and 3.2 propose and investigate new methods for measuring redundancy and relevance. The MRRMRR method is proposed in Section 3.3. Section 4 illustrates the new method on three real high-dimensional data sets. There, we use various approaches to find the 10 most important genes and compare their ability to discriminate between two groups of samples. The discussion follows in Section 5.

#### 2. MRMR Variable Selection

This section critically discusses existing approaches to the MRMR criterion, overviews possible relevance and redundancy measures, and introduces useful notation. A total number of n p-dimensional continuous data vectors are assumed to be observed in K different groups, where p is allowed to largely exceed n. Let X denote the data matrix with X_{ij} denoting the i-th variable observed on the j-th sample, where i = 1, …, p and j = 1, …, n. The i-th variable observed across the n samples will be denoted by X_i for i = 1, …, p. Let Y denote the vector of group labels (true group membership), which are values from the set {1, …, K}. The aim is to find a small number of variables, which allow solving the classification task into the K groups reliably.

In its habitually used form, the MRMR variable selection can be described as a forward search. The set of selected variables will be denoted by S, starting with S = ∅. At first, the most relevant single variable is selected to be an element of S. Then, such a variable is added to S which maximizes a certain criterion combining relevance and redundancy. In such a way, one variable after another is added to S. Common criteria for combining relevance and redundancy include their difference or ratio [11, 14, 15] or, in a more flexible way, a weighted difference Rel(S) − δ · Red(S) with a fixed δ > 0, while choosing a fixed δ was recommended by [13].
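In code, this forward search can be sketched as follows. This is an illustrative reimplementation, not the authors' code: it uses the absolute Pearson correlation with the labels as a placeholder relevance measure and the mean absolute pairwise correlation as a placeholder redundancy measure, combined by the difference criterion with weight `delta`.

```python
import numpy as np

def mrmr_forward(X, y, d, delta=1.0):
    """Greedy MRMR: pick d columns of X maximizing relevance - delta * redundancy.

    Relevance: |Pearson correlation| between a variable and the labels y.
    Redundancy: mean |Pearson correlation| with the already selected variables.
    """
    n, p = X.shape
    rel = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    selected = [int(np.argmax(rel))]          # most relevant variable first
    while len(selected) < d:
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, k])[0, 1])
                           for k in selected])
            score = rel[j] - delta * red      # difference criterion
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

Any of the relevance or redundancy measures discussed below can be substituted for the two placeholder correlations without changing the structure of the search.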

Relevance of a set S of variables is commonly measured as

Rel(S) = (1/|S|) Σ_{X_i ∈ S} I(X_i, Y),

where I is a specified measure of similarity (suitable for measuring association between a continuous and a discrete variable), |S| is the number of variables in S, and the sum is computed over all variables of S. Common examples of I include measures based on mutual information [13, 14] or other approaches requiring a discretization (or even dichotomization) of the data [15], the F statistic of the analysis of variance [11], or the Spearman rank correlation coefficient. Specific ad hoc measures were proposed for K = 2 and cannot be easily generalized for K > 2.

Redundancy of a set S of variables is commonly measured only as a sum of contributions of individual variables,

Red(S) = (1/|S|²) Σ_{X_i ∈ S} Σ_{X_j ∈ S} J(X_i, X_j),

where J is a specified measure of similarity (suitable for measuring association between two continuous variables). Common examples of J include the mutual information or other measures based on information theory [11, 13, 14], test statistics or p values of the Kolmogorov–Smirnov or sign tests, or very simple ad hoc criteria [15]. To the best of our knowledge, no measure able to capture the multivariate structure of the data (e.g., the coefficient of multiple correlation) has been used in this context.

Disadvantages or limitations of the MRMR in its habitually used form include a high sensitivity of standard relevance and redundancy measures to the presence of outliers in the data. While nonparametric measures do not suffer from such sensitivity, they remain inefficient for data without contamination by severe noise. Moreover, the mutual information (as well as some other measures) is unsuitable for continuous data. Commonly, continuous data are discretized, which is strongly discouraged because it entails an unnecessary loss of information [18]. Besides, some authors performed the discretization of continuous data without giving a sufficient description of it [13], while the effect of discretizing the data has not been systematically examined [15]. In the next section, we propose a robust and efficient version of the MRMR criterion, which uses a suitable regularization and tools of robust statistics.

#### 3. Methodology

##### 3.1. Regularized Coefficient of Multiple Correlation

Redundancy is here a measure of association between a continuous variable Z and a whole set S of several continuous variables. The coefficient of multiple correlation is suitable to evaluate the linear association between Z and the variables in S jointly by finding the linear combination of the variables in S with maximal correlation with Z. In order to keep the method feasible also when the number of variables in S exceeds n, we resort to a regularized coefficient of multiple correlation, which can also be interpreted as a regularized coefficient of determination in the linear regression of Z against all variables included in S. While the regularized coefficient may be used as a self-standing correlation measure, it will serve as the redundancy measure within the MRMR criterion in Section 3.3.

Within the computation of the MRMR, the set S of selected variables is gradually constructed by including one variable after another, starting with selecting the most relevant single variable, which will be denoted by Z_1. In each step, it is necessary to measure the redundancy of S after adding a candidate variable Z observed across the n samples to S. After a certain number of steps of the algorithm, there will be exactly d variables in S. These will be denoted by Z_1, …, Z_d, where the k-th variable contains the data values Z_{1k}, …, Z_{nk}. Let us now consider d to be fixed; the aim is to measure association between Z and the variables Z_1, …, Z_d jointly. The idea of Tikhonov regularization [19, 20] will be used to obtain a formal definition of a regularized coefficient of multiple correlation.

*Definition 1. *Let R denote the empirical correlation matrix computed for the data Z_1, …, Z_d, Z. For a fixed λ ∈ (0, 1), we define its regularized counterpart as R* = λ I_{d+1} + (1 − λ) R, where I_{d+1} denotes a unit matrix of size d + 1.

The matrix R* is ensured to be regular even for d + 1 > n. In the whole work, we will work only with the asymptotically optimal value of λ, which minimizes the mean square error of R* over λ ∈ (0, 1). This value will be denoted by λ* and is obtained by modifying the general result of [21] to our context; for the sake of simplifying the notation, let Z_{d+1} denote the candidate variable Z. The resulting explicit expression for λ* is distribution-free.

Let r*_{ij} denote the elements of R* computed with λ = λ*; its diagonal elements are equal to 1. We will use these components to define the vector r*_Z = (r*_{1,d+1}, …, r*_{d,d+1})^T of regularized correlations between Z and Z_1, …, Z_d, and the matrix R*_d as the top left d × d submatrix of R*.

*Definition 2. *Let the regularized coefficient of multiple correlation between the vector Z and the set of vectors Z_1, …, Z_d be defined as r*(Z, S) = ((r*_Z)^T (R*_d)^{−1} r*_Z)^{1/2}.

We stress that r*(Z, S) can be computed only after computing the whole matrix R*. For example, r*_{12} depends also on Z_3 and Z_4, because λ* is determined by all the data. In other words, variables with a large variability borrow information from more stable (less variable) variables in a way analogous to [22], and r*(Z, S) can be considered to be a denoised version of its classical counterpart. Besides, the regularized matrix R* can be interpreted also from other points of view:

- It can be motivated as an attempt to correct for an excessive dispersion of sample eigenvalues of the empirical correlation matrix, similarly to [23].
- R* is a regularized estimator of the correlation matrix shrunken towards a unit matrix. This biased estimator with the optimal value of λ has a smaller quadratic risk compared to its classical counterpart thanks to Stein's paradox [24, 25]. This explains why a regularized estimator brings about benefits also if the set S is chosen to be relatively small (e.g., 10 variables).
- From the point of view of robust optimization [26], R* can be interpreted as locally robust against small departures in the observed data.
- R* can be derived as a Bayesian estimator, assuming the inverse of the population counterpart of R to follow a Wishart distribution with a diagonal expectation (cf. [21]).

*Remark 3. *The matrix R* is always regular. Denoting the eigenvalues of the empirical correlation matrix R computed from the data Z_1, …, Z_d, Z by e_1, …, e_{d+1}, the fact follows from the explicit formula for the eigenvalues of R*, which have the form λ* + (1 − λ*) e_i for i = 1, …, d + 1; that is, they are positive.

*Remark 4. *An efficient computation of r*(Z, S) can exploit the singular value decomposition of the empirical correlation matrix of Z_1, …, Z_d in the form R_d = QDQ^T, where D is diagonal and Q is an orthogonal matrix. Particularly, (R*_d)^{−1} = Q (λ* I_d + (1 − λ*) D)^{−1} Q^T, so that no direct matrix inversion is needed.
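The regularized coefficient of multiple correlation of Definitions 1 and 2 can be sketched as follows. This is an illustration under one simplifying assumption: the shrinkage intensity `lam` (the role played by λ*) is supplied by the user rather than estimated from the data as in [21].

```python
import numpy as np

def regularized_multiple_correlation(Z, z, lam):
    """Regularized coefficient of multiple correlation between the vector z
    and the columns Z_1, ..., Z_d of Z.  The shrinkage intensity lam stands
    in for the data-driven lambda* of the paper.
    """
    data = np.column_stack([Z, z])            # Z_1, ..., Z_d, Z
    R = np.corrcoef(data, rowvar=False)       # empirical correlation matrix
    d = Z.shape[1]
    R_star = lam * np.eye(d + 1) + (1.0 - lam) * R   # shrink towards identity
    r_vec = R_star[:d, d]                     # regularized correlations with z
    R_d = R_star[:d, :d]                      # submatrix for Z_1, ..., Z_d
    # spectral decomposition avoids a direct inversion, in the spirit of Remark 4
    eigval, Q = np.linalg.eigh(R_d)
    inv_R_d = Q @ np.diag(1.0 / eigval) @ Q.T
    return float(np.sqrt(r_vec @ inv_R_d @ r_vec))
```

Because the shrunken eigenvalues are bounded away from zero by `lam`, the computation remains numerically stable even when the number of columns of `Z` exceeds the number of rows.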

##### 3.2. Robust Correlation Coefficient

In this section, some properties of the robust correlation coefficient r_LWS [16] based on the least weighted squares (LWS) regression are derived, and we recommend using r_LWS as a relevance measure within the MRMR criterion for samples coming from K = 2 groups.

The LWS estimator [17] is a robust estimator of the parameters of the linear regression model with a high finite-sample breakdown point [5, 27]; that is, it is highly robust against severe outliers in the data. If the quantile-based adaptive (data-dependent) weights of [28] are used, the estimator attains the full asymptotic efficiency of least squares (i.e., asymptotic efficiency equal to 1 for noncontaminated normal data). The LWS estimator can be computed using a weighted version of the fast algorithm of [29].

Based on the LWS estimator for the linear regression, a robust correlation coefficient r_LWS was proposed by [16] as a measure of linear association between two data vectors u = (u_1, …, u_n)^T and v = (v_1, …, v_n)^T in the linear regression model v_i = β_0 + β_1 u_i + e_i, i = 1, …, n. Assuming the data to follow a continuous distribution, the appealing properties of r_LWS are inherited from the LWS estimator [16]. To avoid confusion, let us introduce a special notation for various versions of the robust correlation coefficient based on different choices of weights.

*Definition 5. *One uses the notation r_LWS to denote the coefficient computed with the adaptive weights of [28]. The notation r_LIN is used for the coefficient computed with linearly decreasing weights, and the notation r_LOG for the coefficient computed with weights defined by means of a logistic decreasing function [16].

The squared value of r_LWS is a measure of goodness of the linear fit in the regression model above. We will now derive some properties of r_LWS, which are inherited from properties of the LWS regression estimator. The computation of r_LWS requires computing an initial highly robust estimator of (β_0, β_1); this can be, for example, the least trimmed squares (LTS) estimator [30].
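The idea behind the LWS-based correlation can be sketched as follows. This is a simplified illustration, not a faithful implementation: a rigorous computation would start from the LTS estimator and use the fast algorithm of [29], whereas here plain least squares initializes a few reweighting iterations with linearly decreasing rank-based weights (roughly the r_LIN variant).

```python
import numpy as np

def lws_correlation(u, v, n_iter=20):
    """Sketch of a robust correlation coefficient in the spirit of least
    weighted squares: observations receive linearly decreasing weights
    according to the rank of their squared regression residuals, and the
    weighted correlation of u and v is returned.
    """
    n = len(u)
    w = np.ones(n)
    for _ in range(n_iter):
        # weighted least squares fit of v on u with the current weights
        A = np.column_stack([np.ones(n), u])
        W = np.diag(w)
        beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ v)
        resid2 = (v - A @ beta) ** 2
        ranks = np.argsort(np.argsort(resid2))   # 0 = smallest residual
        w = (n - ranks) / n                      # linearly decreasing weights
    # weighted correlation computed with the final weights
    u_bar = np.sum(w * u) / np.sum(w)
    v_bar = np.sum(w * v) / np.sum(w)
    num = np.sum(w * (u - u_bar) * (v - v_bar))
    den = np.sqrt(np.sum(w * (u - u_bar) ** 2) * np.sum(w * (v - v_bar) ** 2))
    return num / den
```

Observations with large residuals end up with weights close to zero, which is what makes the coefficient resistant to severe outliers while the classical Pearson coefficient breaks down.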

*Theorem 6. *Let (u_1, v_1), …, (u_n, v_n) be a sequence of independent identically distributed random vectors. One assumes any two observations to give a unique determination of (β_0, β_1) in the linear regression of v against u almost surely. Let ε* denote the finite-sample breakdown point of an initial estimator of (β_0, β_1). Then the finite-sample breakdown point of r_LWS is larger than or equal to the minimum of ε* and the finite-sample breakdown point of the LWS estimator.

*Proof. *The finite-sample breakdown point of r_LWS corresponds to the smallest percentage of the data that may be arbitrarily contaminated while causing r_LWS to take an arbitrarily aberrant value (to "break down") [31]. The robust correlation coefficient inherits the breakdown point of the LWS estimator, which was derived by [28] for the linear regression with p regressors.

Now we study the asymptotic distribution of the robust correlation coefficient based on the LWS estimator under technical (but very general) assumptions.

*Theorem 7. *One considers the data (u_1, v_1), …, (u_n, v_n) as a random sample from a bivariate normal distribution with correlation coefficient ρ. One assumes the assumptions of Theorem 3 of [28] to be fulfilled. Then, for n → ∞, r_LWS converges in distribution to a random variable following a normal distribution. Specifically, the asymptotic distribution of r_LWS can be approximated by N(ρ − ρ(1 − ρ²)/(2n), (1 − ρ²)²/n) under the assumption ρ ∈ (−1, 1).

*Proof. *The convergence to a normal distribution for n → ∞ follows from the asymptotic normality of the LWS estimator with adaptive weights [28] and from the expression

r_LWS = Σ_{i=1}^n w_i (u_i − ū_w)(v_i − v̄_w) / { [Σ_{i=1}^n w_i (u_i − ū_w)²]^{1/2} [Σ_{i=1}^n w_i (v_i − v̄_w)²]^{1/2} },

where w_1, …, w_n are the weights determined by the LWS regression and ū_w and v̄_w are weighted means computed with these weights. The asymptotic expectation and variance of r_LWS are equal to the expectation and variance of the sample correlation coefficient, which were approximated by [32].
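The variance approximation of Theorem 7 can be checked numerically. The following Monte Carlo sketch simulates the ordinary sample correlation coefficient of bivariate normal data (the noncontaminated case, in which r_LWS with adaptive weights is asymptotically equivalent to it) and compares its empirical variance with (1 − ρ²)²/n; the sample size, number of replications, and seed are arbitrary choices of ours.

```python
import numpy as np

# Monte Carlo check of var(r) ~ (1 - rho^2)^2 / n for the sample correlation
# coefficient of bivariate normal data.
rng = np.random.default_rng(42)
rho, n, reps = 0.5, 200, 2000
cov = np.array([[1.0, rho], [rho, 1.0]])
rs = np.empty(reps)
for b in range(reps):
    sample = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    rs[b] = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
theory = (1.0 - rho**2) ** 2 / n        # approximate asymptotic variance
print(rs.var(), theory)
```

For moderate n the empirical variance agrees with the approximation up to Monte Carlo error, and the empirical mean is close to ρ, consistent with the small bias term ρ(1 − ρ²)/(2n).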

Pearson's correlation coefficient is a valid relevance measure also if one of the two variables is binary. Indeed, robust correlation measures have been used in the context of logistic regression [33]. This makes r_LWS suitable also within the MRMR criterion for measuring association between a binary vector of labels Y (group membership) and a continuous data vector X_i for i = 1, …, p. In this context, r_LWS ensures a high robustness with respect to outliers in the continuous variable in the relevance measure, where the vector of labels is considered to be the response.

##### 3.3. MRRMRR Variable Selection

We introduce a new version of the MRMR criterion using the regularized redundancy measure of Section 3.1 and the robust relevance measure of Section 3.2. It is denoted as Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR) and can be interpreted as insensitive to the presence of outliers in the continuous measurements.

We search for the optimal value of δ in the combined criterion, namely, the value which allows the best classification performance over all possible δ ≥ 0. Because the relevance and redundancy may not be directly comparable or standardized to the same limits, we do not require δ ≤ 1.

*Algorithm 8. *Put S = ∅. First, the most relevant variable according to r_LWS is selected and is included in the set of variables S. Further, the following procedure is repeated. Let Z denote the values of a candidate variable, observed across the n samples, which is not yet included in S. We add to S the candidate Z which maximizes the criterion combining the relevance r_LWS and the redundancy r*(Z, S) over all candidate variables. Other variables are included step by step to S, until S contains a fixed number of variables, determined before the computations. This approach is repeatedly applied with different fixed values of δ, and such a value of δ is found optimal which allows the best classification performance.
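The overall structure of Algorithm 8 can be sketched as follows. This is a schematic reimplementation of ours, not the authors' code: the relevance and redundancy measures are passed in as functions (so that the robust and regularized measures of Sections 3.1 and 3.2 can be plugged in), and a nearest-centroid classifier stands in for the LDA used in Section 4 when scoring each fixed δ by leave-one-out accuracy.

```python
import numpy as np

def select_variables(X, y, d, delta, relevance, redundancy):
    """Forward selection maximizing relevance - delta * redundancy."""
    p = X.shape[1]
    rel = np.array([relevance(X[:, j], y) for j in range(p)])
    S = [int(np.argmax(rel))]                 # most relevant variable first
    while len(S) < d:
        cand = [j for j in range(p) if j not in S]
        scores = [rel[j] - delta * redundancy(X, S, j) for j in cand]
        S.append(cand[int(np.argmax(scores))])
    return S

def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier
    (a simple stand-in for LDA)."""
    n = len(y)
    hits = 0
    for i in range(n):
        mask = np.arange(n) != i
        Xt, yt = X[mask], y[mask]
        centroids = {g: Xt[yt == g].mean(axis=0) for g in np.unique(yt)}
        pred = min(centroids, key=lambda g: np.linalg.norm(X[i] - centroids[g]))
        hits += int(pred == y[i])
    return hits / n

def mrrmrr(X, y, d, deltas, relevance, redundancy):
    """Try each fixed delta; keep the subset with the best LOOCV accuracy."""
    return max(
        ((delta, select_variables(X, y, d, delta, relevance, redundancy))
         for delta in deltas),
        key=lambda t: loocv_accuracy(X[:, t[1]], y),
    )   # (optimal delta, selected variable indices)
```

Separating the measures from the search makes it easy to compare the classical and robust variants of the criterion on the same data.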

Concerning the optimal number of selected variables, we refer to [11] for a discussion. Basically, a fixed number of top-ranked genes is commonly selected, or the number is chosen so that the classification error equals a specified constant [14]. Other works applied an intuitive trial-and-error approach for specifying a fixed number of selected variables without supporting the choice by rigorous arguments.

#### 4. Results

We compare the performances of various MRMR criteria on three real data sets.

##### 4.1. Cardiovascular Genetic Study

We use a gene expression data set from a whole-genome study on 24 patients immediately after a cerebrovascular stroke (CVS) and 24 control persons. This study of the Center of Biomedical Informatics in Prague (2006–2011) had the aim of finding a small set of genes suitable for diagnostics and prognosis of cardiovascular diseases. The data for gene transcripts were measured using HumanWG-6 Illumina BeadChip microarrays. The study complies with the Declaration of Helsinki and was approved by the local ethics committee.

We perform all computations in R software. Variable selection (gene selection) is performed by means of various MRMR criteria with a fixed δ, with the requirement to find the 10 most important genes. We use the following relevance measures: mutual information, the Pearson correlation coefficient, the Spearman rank correlation coefficient, and the robust correlation coefficients r_LWS, r_LIN, and r_LOG (Definition 5). Redundancy is evaluated using the sum of pairwise similarities of Section 2, where the similarity measure has the form of mutual information, the Pearson correlation coefficient, the Spearman rank correlation coefficient, the p value of the Kolmogorov–Smirnov test, the p value of the sign test, or the regularized coefficient of multiple correlation of Definition 2.

Classification performance on a reduced set of variables obtained by various dimensionality reduction procedures is evaluated by means of a leave-one-out cross validation. For this purpose, the data are repeatedly divided into a training set (47 individuals) and a validation set (1 individual). The classification rule of the linear discriminant analysis (LDA) is learned over the training set and is applied to classify the validation set. This is repeated 48 times over all possible choices of the training set, computing the values of sensitivity and specificity of the classification procedures for each case. At the same time, we compute the classification accuracy with the optimal δ. Classification accuracy is equal to half of the sum of sensitivity and specificity; because the two groups have equal sizes, this equals the number of correctly classified cases divided by the total number of cases, obtained with the optimal δ.
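The accuracy measure used here can be sketched in a few lines; this generic helper of ours returns sensitivity, specificity, and their average, which coincides with the plain proportion of correctly classified cases when the two groups are of equal size, as in this study.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, positive=1):
    """Sensitivity, specificity, and their average for a two-group
    classification, matching the accuracy definition used in the text."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == positive
    sensitivity = np.mean(y_pred[pos] == positive)
    specificity = np.mean(y_pred[~pos] != positive)
    return sensitivity, specificity, (sensitivity + specificity) / 2.0
```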

Various other classification methods are used without a prior dimensionality reduction, including Prediction Analysis for Microarrays (PAM) [22], shrunken centroid regularized discriminant analysis (SCRDA) [19], and support vector machines (SVM). For comparison, we investigate also the effect of dimensionality reduction by means of PCA.

Table 1 presents results for some fixed values of δ as well as results obtained with the optimal value of δ according to Algorithm 8, that is, the nonnegative δ maximizing the classification accuracy over all its possible values. The results in Table 1 reveal that MRRMRR outperforms other approaches to MRMR variable selection. The mutual information turns out to perform even much worse than the correlation coefficient, which is a consequence of discretizing continuous data. Besides, we performed additional computations, including a 12-fold cross validation, which yielded analogous results.