Abstract

The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, the usual relevance and redundancy criteria have the disadvantages of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we also perform the computations for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.

1. Introduction

Variable selection represents an important category of dimensionality reduction methods frequently used in the analysis of multivariate data within data mining and multivariate statistics. Variable selection with the aim of finding a smaller number of key variables is an inevitable tool in the analysis of high-dimensional data with the number of variables p largely exceeding the number of observations n (i.e., n ≪ p) [1, 2]. The requirement to analyze thousands of highly correlated variables measured on tens or hundreds of samples is very common, for example, in molecular genetics. If the observed data come from several different groups and the aim of the data analysis is learning a classification rule, supervised dimensionality reduction methods are preferable [3], because unsupervised methods such as principal component analysis (PCA) cannot take the information about the group membership into account [4].

While real data are typically contaminated by outlying measurements (outliers) caused by various reasons [5], numerous variable selection procedures suffer from the presence of outliers in the data. Robust dimensionality reduction procedures resistant to outliers have typically been proposed in the form of modifications of PCA [6–9]. Still, the importance of robust variable selection keeps increasing [10] as the amount of digital information worldwide grows at an enormous rate.

Most of the available variable selection procedures tend to select highly correlated variables [11]. This is also a problem of various Maximum Relevance (MR) approaches [12], which select variables inefficient for classification tasks because of the undesirable redundancy in the selected set of variables [13]. As an improvement, the Minimum Redundancy Maximum Relevance (MRMR) criterion was proposed [14] with various criteria for measuring the relevance of a given variable and the redundancy within the set of selected key variables. Its ability to avoid selecting highly correlated variables brings about benefits for the subsequent analysis. However, these methods remain too vulnerable to outlying values and noise [15].

In this paper, we propose a new MRMR criterion combining principles of regularization and robust statistics, together with a novel optimization algorithm for its computation. It is called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR). For this purpose, we recommend using a highly robust correlation coefficient [16] based on the least weighted squares regression [17] as a new measure of relevance of a given variable. Further, we define a new regularized version of the coefficient of multiple correlation and use it as a redundancy measure. The regularization allows computing it in a numerically stable way even if the number of variables exceeds the number of observations and can be advocated as a denoising step improving robustness properties.

This paper has the following structure. Section 2 describes existing approaches to the MRMR criterion. Sections 3.1 and 3.2 propose and investigate new methods for measuring redundancy and relevance. The MRRMRR method is proposed in Section 3.3. Section 4 illustrates the new method on three real high-dimensional data sets. There, we compare various approaches for finding the 10 most important genes and evaluate their ability to discriminate between two groups of samples. The discussion follows in Section 5.

2. MRMR Variable Selection

This section critically discusses existing approaches to the MRMR criterion, overviews possible relevance and redundancy measures, and introduces useful notation. A total number of n p-dimensional continuous data vectors are assumed to be observed in K different groups, where p is allowed to largely exceed n. Let X denote the data matrix with X_{ij} denoting the jth variable observed on the ith sample, where i = 1, …, n and j = 1, …, p. The jth variable observed across the n samples will be denoted by X_j for j = 1, …, p. Let Y denote the vector of group labels (true group membership), which are values from the set {1, …, K}. The aim is to find a small number of variables which allow solving the classification task into the K groups reliably.

In its habitually used form, the MRMR variable selection can be described as a forward search. The set of selected variables will be denoted by S, starting with an empty set. At first, the most relevant single variable is selected to be an element of S. Then, the variable maximizing a certain criterion combining relevance and redundancy is added to S. In this way, one variable after another is added to S. Common criteria for combining relevance and redundancy include their difference or ratio [11, 14, 15] or, more flexibly, a weighted difference with a fixed trade-off parameter δ ≥ 0; choosing a fixed δ was recommended by [13].
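Although implementations differ in the particular relevance and redundancy measures, the greedy structure of the search is always the same. The following minimal sketch (in Python; not the authors' implementation) illustrates it under the assumption of the difference criterion with a trade-off parameter delta; the functions relevance and redundancy are generic placeholders whose common concrete choices are discussed in the next two paragraphs.

```python
import numpy as np

def mrmr_forward_search(X, y, k, relevance, redundancy, delta=1.0):
    """Greedy MRMR-style forward search (illustrative sketch).

    X : (n, p) data matrix, y : (n,) group labels, k : number of variables
    to select.  `relevance(x, y)` scores one variable against the labels;
    `redundancy(X_S)` scores a block of selected variables.  The difference
    criterion relevance - delta * redundancy is assumed here.
    """
    n, p = X.shape
    rel = np.array([relevance(X[:, j], y) for j in range(p)])
    selected = [int(np.argmax(rel))]          # the most relevant single variable
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            score = rel[j] - delta * redundancy(X[:, selected + [j]])
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```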

Relevance of a set S of variables is commonly measured as the average similarity between the variables of S and the vector of labels, Rel(S) = (1/|S|) Σ_{X ∈ S} sim(X, Y), where sim is a specified measure of similarity (suitable for measuring association between a continuous and a discrete variable), |S| is the number of variables in S, and the sum is computed over all variables of S. Common examples of sim include measures based on mutual information [13, 14] or other approaches requiring a discretization (or even dichotomization) of the data [15], the F statistic of the analysis of variance [11], or the Spearman rank correlation coefficient. Specific ad hoc measures were proposed for K = 2 groups and cannot be easily generalized for K > 2.

Redundancy of a set S of variables is commonly measured only as a sum of contributions of individual variables, that is, as an aggregate of pairwise similarities sim(X, X′) between variables of S, where sim is a specified measure of similarity (suitable for measuring association between two continuous variables). Common examples of sim include the mutual information or other measures based on information theory [11, 13, 14], test statistics or p values of the Kolmogorov–Smirnov or sign tests, or very simple ad hoc criteria [15]. To the best of our knowledge, no measure able to capture the multivariate structure of the data (e.g., the coefficient of multiple correlation) has been used in this context.
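For concreteness, the sketch below shows one common pair of choices, plugging the Spearman rank correlation in as the similarity measure for both relevance and redundancy (an illustrative choice from the list above; averaging over pairs is one standard convention for the redundancy term).

```python
import numpy as np
from scipy.stats import spearmanr

def relevance_spearman(x, y):
    # similarity between one continuous variable and the group labels
    return abs(spearmanr(x, y).correlation)

def redundancy_spearman(X_S):
    # average pairwise |Spearman correlation| within the selected set S
    q = X_S.shape[1]
    if q < 2:
        return 0.0
    total, pairs = 0.0, 0
    for a in range(q):
        for b in range(a + 1, q):
            total += abs(spearmanr(X_S[:, a], X_S[:, b]).correlation)
            pairs += 1
    return total / pairs
```

These two functions can be passed directly to the forward search sketched above.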

Disadvantages or limitations of the MRMR in its habitually used form include a high sensitivity of standard relevance and redundancy measures to the presence of outliers in the data. While nonparametric measures do not suffer from such sensitivity, they remain inefficient for data without contamination by severe noise. Moreover, the mutual information (as well as some other measures) is unsuitable for continuous data. Commonly, continuous data are discretized, which is strongly discouraged due to an unnecessary loss of information [18]. Besides, some authors performed the discretization of continuous data without describing it sufficiently [13], while the effect of discretization of the data has not been systematically examined [15]. In the next section, we propose a robust and efficient version of the MRMR criterion, which uses a suitable regularization and tools of robust statistics.

3. Methodology

3.1. Regularized Coefficient of Multiple Correlation

Redundancy is a measure of association between a continuous variable and a whole set of several continuous variables. The coefficient of multiple correlation is suitable to evaluate the linear association between a candidate variable and the variables in S jointly by finding the maximal linear combination of the variables in S. In order to allow the method to be feasible also when the number of variables in S exceeds n, we resort to a regularized coefficient of multiple correlation, which can also be interpreted as a regularized coefficient of determination in the linear regression of the candidate variable against all variables included in S. While the regularized coefficient may be used as a self-standing correlation measure, it will be used as a redundancy measure within the MRMR criterion in Section 3.3.

Within the computation of the MRMR, the set S of selected variables is gradually constructed by including one variable after another, starting with the selection of the most relevant single variable. In each step, it is necessary to measure the redundancy of S after adding a candidate variable, observed across the n samples, to S. After a certain number of steps of the algorithm, there will be exactly q variables in S, each of them containing n data values. Let us now consider q to be fixed; the aim is to measure the association between the candidate variable and the q variables of S jointly. The idea of Tikhonov regularization [19, 20] will be used to obtain a formal definition of a regularized coefficient of multiple correlation.

Definition 1. Let R denote the empirical correlation matrix computed for the candidate variable together with the variables of S. We define its regularized counterpart as a Tikhonov-type combination of R with a unit matrix of the corresponding size.

The regularized correlation matrix is ensured to be regular even if the number of variables exceeds n. Throughout this work, we use only the asymptotically optimal value of the regularization parameter, which minimizes the mean square error of the regularized matrix over all admissible values of the parameter; it is obtained by modifying the general result of [21] to our context. Under general assumptions, the explicit expression for this optimal value is distribution-free.
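The sketch below illustrates this regularization, assuming Definition 1 takes the usual shrinkage form λI + (1 − λ)R with λ ∈ [0, 1] and assuming a distribution-free Schäfer–Strimmer-type plug-in for the optimal shrinkage intensity; both the exact parametrization and the exact estimator of the optimal parameter in the paper may differ, so this is only a sketch of the idea.

```python
import numpy as np

def regularized_correlation(Z):
    """Shrinkage estimate lam * I + (1 - lam) * R of a correlation matrix.

    Z : (n, q) data block (e.g., the candidate variable together with the
    already selected variables).  The shrinkage intensity lam is the
    distribution-free plug-in toward the identity target in the spirit of
    Schaefer and Strimmer; this is an assumed form, not necessarily the
    exact estimator used in the paper.
    """
    n, q = Z.shape
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)     # standardized columns
    R = (Zs.T @ Zs) / (n - 1)                             # empirical correlation matrix
    W = np.einsum('ki,kj->kij', Zs, Zs)                   # elementwise products z_ki * z_kj
    var_r = n / (n - 1) ** 3 * ((W - W.mean(axis=0)) ** 2).sum(axis=0)
    off = ~np.eye(q, dtype=bool)
    lam = float(np.clip(var_r[off].sum() / (R[off] ** 2).sum(), 0.0, 1.0))
    return lam * np.eye(q) + (1.0 - lam) * R, lam
```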

Let us denote the elements of the regularized correlation matrix computed with this optimal regularization parameter; its diagonal elements are equal to 1. These elements will be used in Definition 2 to define the regularized coefficient of multiple correlation.

Definition 2. The regularized coefficient of multiple correlation between the candidate variable and the set of selected variables is defined from the elements of the regularized correlation matrix in full analogy to the classical coefficient of multiple correlation.

We stress that the regularized coefficient of multiple correlation of Definition 2 can be computed only after computing the whole regularized correlation matrix; its value depends also on the remaining variables of the set. In other words, variables with a large variability borrow information from more stable (less variable) variables in a way analogous to [22], and the regularized coefficient can be considered a denoised version of its classical counterpart. Besides, the regularized correlation matrix of Definition 1 can be interpreted also from other points of view:
(i) It can be motivated as an attempt to correct for an excessive dispersion of sample eigenvalues of the empirical correlation matrix, similarly to [23].
(ii) It is a regularized estimator of the correlation matrix shrunken towards a unit matrix. This biased estimator with the optimal value of the regularization parameter has a smaller quadratic risk compared to its classical counterpart thanks to Stein’s paradox [24, 25]. This explains why a regularized estimator brings about benefits also if the set S is chosen to be relatively small (e.g., 10 variables).
(iii) From the point of view of robust optimization [26], it can be interpreted as locally robust against small departures in the observed data.
(iv) It can be derived as a Bayesian estimator, assuming the inverse of the population counterpart of the correlation matrix to follow a Wishart distribution with a diagonal expectation (cf. [21]).
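A sketch of how the regularized coefficient of multiple correlation can be evaluated from the regularized matrix, assuming the classical partitioned formula r² = c′D⁻¹c in which c holds the (regularized) correlations between the candidate variable and the selected variables and D is the (regularized) correlation block of the selected variables; this is an assumed form consistent with Definition 2, not a verbatim transcription of it.

```python
import numpy as np

def regularized_multiple_correlation(R_star):
    """Regularized multiple correlation of variable 0 on the remaining variables.

    R_star : (q, q) regularized correlation matrix whose first row/column
    corresponds to the candidate variable and the rest to the selected set.
    Uses the classical partitioned formula for the multiple correlation.
    """
    c = R_star[0, 1:]        # regularized correlations: candidate vs. selected variables
    D = R_star[1:, 1:]       # regularized correlation block of the selected variables
    r2 = float(c @ np.linalg.solve(D, c))
    return np.sqrt(np.clip(r2, 0.0, 1.0))
```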

Remark 3. The regularized correlation matrix is always regular. Denoting the eigenvalues of the empirical correlation matrix computed from the data by l_1, …, l_q, the fact follows from the explicit formula for the eigenvalues of the regularized matrix: each of them is obtained by shrinking the corresponding l_j towards 1 and is therefore positive.

Remark 4. An efficient computation of the regularized coefficient of multiple correlation can exploit the singular value decomposition of the empirical correlation matrix into the product of an orthogonal matrix, a diagonal matrix, and the transpose of the orthogonal matrix. The inverse of the regularized matrix, which is needed in the computation, then follows directly from its (positive) eigenvalues.

3.2. Robust Correlation Coefficient

In this section, some properties of the robust correlation coefficient [16] based on the least weighted squares (LWS) regression are derived, and we recommend using this coefficient as a relevance measure for the MRMR criterion for samples coming from K = 2 groups.

The LWS estimator [17] is a robust estimator of regression parameters in the linear regression model with a high finite-sample breakdown point [5, 27], that is, highly robust against severe outliers in the data. If the quantile-based adaptive (data-dependent) weights of [28] are used, the estimator attains the full asymptotic efficiency of the least squares estimator (i.e., for noncontaminated normal data). The LWS estimator can be computed using a weighted version of the fast algorithm of [29].

Based on the LWS estimator for linear regression, a robust correlation coefficient was proposed by [16] as a measure of linear association between two data vectors, formulated within a simple linear regression model of one of the vectors against the other. Assuming the data to follow a continuous distribution, the appealing properties of the coefficient are inherited from the LWS estimator [16]. To avoid confusion, let us introduce special notation for the versions of the robust correlation coefficient based on different choices of weights.

Definition 5. One distinguishes three versions of the robust correlation coefficient according to the choice of weights in the underlying LWS regression: one computed with the adaptive weights of [28], one computed with linearly decreasing weights, and one computed with weights defined by means of a logistic decreasing function [16].

The value of the robust correlation coefficient is a measure of goodness of the linear fit in the underlying regression model. We will now derive some of its properties, which are inherited from properties of the LWS regression estimator. Its computation requires an initial highly robust estimator of the regression parameters; this can be, for example, the least trimmed squares (LTS) estimator [30].
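A simplified sketch of such an LWS-flavored robust correlation between two vectors follows: a simple regression is fitted by iteratively reassigning decreasing weights according to the ranks of the squared residuals (linearly decreasing weights here, corresponding to one of the variants of Definition 5), and the final weights then enter a weighted Pearson correlation. For brevity, the genuine data-adaptive weights of [28] and the LTS initialization are replaced by a crude least squares start, so this is only an illustration of the principle, not the estimator analyzed below.

```python
import numpy as np

def lws_robust_correlation(x, y, n_iter=20):
    """Sketch of an LWS-flavored robust correlation coefficient."""
    n = len(x)
    w = np.ones(n)                                   # crude (nonrobust) initialization
    rank_weights = np.linspace(1.0, 0.0, n)          # linearly decreasing weights
    A = np.column_stack([np.ones(n), x])
    for _ in range(n_iter):
        Aw = A * w[:, None]
        beta = np.linalg.lstsq(Aw.T @ A, Aw.T @ y, rcond=None)[0]   # weighted LS fit
        resid2 = (y - A @ beta) ** 2
        ranks = np.argsort(np.argsort(resid2))       # rank 0 = smallest squared residual
        w_new = rank_weights[ranks]                  # small residual -> large weight
        if np.allclose(w_new, w):
            break
        w = w_new
    # weighted Pearson correlation with the final weights
    wn = w / w.sum()
    mx, my = wn @ x, wn @ y
    cov = wn @ ((x - mx) * (y - my))
    sx = np.sqrt(wn @ (x - mx) ** 2)
    sy = np.sqrt(wn @ (y - my) ** 2)
    return cov / (sx * sy)
```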

Theorem 6. Let the data be a sequence of independent identically distributed random vectors following a continuous distribution. One assumes any two observations to give a unique determination of the regression parameters in the linear regression of one variable against the other almost surely. Then the finite-sample breakdown point of the robust correlation coefficient is bounded from below by a quantity determined by the finite-sample breakdown point of the initial estimator of the regression parameters.

Proof. The finite-sample breakdown point of an estimator corresponds to the smallest percentage of the data that may be arbitrarily contaminated to cause the estimator to take an arbitrarily large aberrant value (to “break down”) [31]. The robust correlation coefficient inherits the breakdown point of the LWS estimator, which was derived by [28] for the linear regression with a general number of regressors.

Now we study the asymptotic distribution of the robust correlation coefficient based on the LWS estimator under technical (but very general) assumptions.

Theorem 7. One considers the data as a random sample from a bivariate normal distribution with correlation coefficient ρ. One assumes the assumptions of Theorem 3 of [28] to be fulfilled. Then, as the number of observations tends to infinity, the robust correlation coefficient converges in distribution to a random variable following a normal distribution; its asymptotic expectation and variance can be approximated by those of the classical sample correlation coefficient.

Proof. The convergence to the normal distribution follows from the asymptotic normality of the LWS estimator with adaptive weights [28] and from the expression of the robust correlation coefficient as a weighted Pearson correlation, with weights determined by the LWS regression and with weighted means computed from these weights. The asymptotic expectation and variance of the coefficient are equal to the expectation and variance of the sample correlation coefficient, which were approximated by [32].

Pearson’s correlation coefficient is a valid relevance measure also if one of the two variables is binary. Indeed, robust correlation measures have been used in the context of logistic regression [33]. This makes the robust correlation coefficient suitable also within the MRMR criterion for measuring the association between a binary vector of labels (group membership) and a continuous data vector in the case of K = 2 groups. In this context, it ensures a high robustness with respect to outliers in the continuous variable, where the vector of labels is considered to be the response of the regression.

3.3. MRRMRR Variable Selection

We introduce a new version of the MRMR criterion using the regularized redundancy measure of Section 3.1 and the robust relevance measure of Section 3.2. It is denoted as Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR) and can be interpreted as insensitive to the presence of outliers in the continuous measurements.

We search for the optimal value of the trade-off parameter δ in the combined criterion, namely, the value allowing the best classification performance over all possible choices of δ. Because the relevance and redundancy may not be directly comparable or standardized to the same limits, we do not restrict δ to the interval [0, 1].

Algorithm 8. Start with an empty set of selected variables S. First, the most relevant variable according to the robust relevance measure is selected and included in S. Further, the following procedure is repeated: among all variables not included in S, the variable maximizing the MRRMRR criterion is added to S. Variables are included step by step into S, until S contains a fixed number of variables determined before the computations. This approach is repeatedly applied with different fixed values of δ, and the value of δ allowing the best classification performance is found to be optimal. A sketch of the whole procedure is given below.
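The sketch below puts the pieces together, reusing the illustrative helpers from the previous snippets (lws_robust_correlation, regularized_correlation, and regularized_multiple_correlation are the assumed names from those sketches) and assuming the difference form of the criterion with a nonnegative δ scanned over a small grid; in the paper, δ is tuned by the cross-validated classification accuracy.

```python
import numpy as np

def mrrmrr_select(X, y, k, deltas=(0.0, 0.5, 1.0, 2.0)):
    """Sketch of MRRMRR variable selection of a fixed number k of variables.

    Relevance: robust LWS-based correlation of each variable with the labels.
    Redundancy: regularized coefficient of multiple correlation of the
    candidate variable on the already selected set.
    """
    n, p = X.shape
    rel = np.array([abs(lws_robust_correlation(X[:, j], y)) for j in range(p)])
    selections = {}
    for delta in deltas:
        selected = [int(np.argmax(rel))]
        while len(selected) < k:
            best_j, best_score = None, -np.inf
            for j in range(p):
                if j in selected:
                    continue
                block = X[:, [j] + selected]          # candidate first, then selected set
                R_star, _ = regularized_correlation(block)
                red = regularized_multiple_correlation(R_star)
                score = rel[j] - delta * red
                if score > best_score:
                    best_j, best_score = j, score
            selected.append(best_j)
        selections[delta] = selected
    return selections   # keep the delta maximizing cross-validated accuracy
```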

Concerning the optimal number of selected variables, we refer to [11] for a discussion. Basically, a fixed number of the top-ranked genes is commonly selected so that the classification error equals a specified constant [14]. Other works applied an intuitive trial-and-error approach to specifying a fixed number of selected variables without supporting the choice by rigorous arguments.

4. Results

We compare the performances of various MRMR criteria on three real data sets.

4.1. Cardiovascular Genetic Study

We use a gene expression data set from a whole-genome study on 24 patients immediately after a cerebrovascular stroke (CVS) and 24 control persons. This study of the Center of Biomedical Informatics in Prague (2006–2011) had the aim of finding a small set of genes suitable for diagnostics and prognosis of cardiovascular diseases. The data for the gene transcripts were measured using HumanWG-6 Illumina BeadChip microarrays. The study complies with the Declaration of Helsinki and was approved by the local ethics committee.

We perform all computations in R software. Variable selection (gene selection) is performed by means of various MRMR criteria with a fixed number of selected variables, namely, the 10 most important genes. We use the following relevance measures: the mutual information, the Pearson correlation coefficient, the Spearman rank correlation coefficient, and the robust correlation coefficients with adaptive, linearly decreasing, and logistic weights (Definition 5). Redundancy is evaluated using the similarity measures of Section 2 (the mutual information, the Pearson and Spearman correlation coefficients, the p value of the Kolmogorov–Smirnov test, and the p value of the sign test) as well as the regularized coefficient of multiple correlation of Section 3.1.

Classification performance on a reduced set of variables obtained by various dimensionality reduction procedures is evaluated by means of a leave-one-out cross validation. For this purpose, the data are repeatedly divided into a training set (47 individuals) and a validation set (1 individual). The classification rule of the linear discriminant analysis (LDA) is learned over the training set and is applied to classify the validation set. This is repeated 48 times over all possible choices of the training set, and the sensitivity and specificity of the classification procedures are computed from the resulting classifications. At the same time, we compute the classification accuracy with the optimal δ. Classification accuracy is defined as half of the sum of sensitivity and specificity; because the two groups have equal sizes here, this equals the number of correctly classified cases divided by the total number of cases, obtained with the optimal δ (over δ ≥ 0).
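A sketch of this leave-one-out evaluation, using scikit-learn's LinearDiscriminantAnalysis as a stand-in for the LDA rule (variable and function names here are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def loo_lda_accuracy(X_sel, y):
    """Leave-one-out cross validation of LDA on the selected variables.

    X_sel : (n, k) matrix restricted to the selected variables,
    y : (n,) binary labels (1 = case, 0 = control).
    Returns sensitivity, specificity and their average (the accuracy).
    """
    y = np.asarray(y)
    n = len(y)
    pred = np.empty(n, dtype=int)
    for i in range(n):
        train = np.arange(n) != i                        # leave observation i out
        lda = LinearDiscriminantAnalysis()
        lda.fit(X_sel[train], y[train])
        pred[i] = lda.predict(X_sel[i:i + 1])[0]
    sens = np.mean(pred[y == 1] == 1)                    # correctly classified cases
    spec = np.mean(pred[y == 0] == 0)                    # correctly classified controls
    return sens, spec, (sens + spec) / 2.0
```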

Various other classification methods are used without a prior dimensionality reduction, including Prediction Analysis for Microarrays (PAM) [22], shrunken centroid regularized discriminant analysis (SCRDA) [19], and support vector machines (SVM). For comparison, we investigate also the effect of dimensionality reduction by means of PCA.

Table 1 presents results for some fixed values of δ as well as results obtained with the optimal value of δ according to Algorithm 8, that is, the nonnegative δ maximizing the classification accuracy over all its possible values. For each version of the MRMR approach, the reported optimal classification corresponds to this optimal δ. The results in Table 1 reveal that MRRMRR outperforms other approaches to MRMR variable selection. The mutual information turns out to perform much worse than the correlation coefficient, which is a consequence of discretizing continuous data. Besides, we also performed additional computations, including a 12-fold cross validation, which yielded analogous results.

Further, we investigate whether the new MRRMRR method can be followed by classification tools other than LDA. The results are overviewed in Table 2. Clearly, MRRMRR does not seem to be linked to any specific classification tool. SVM as well as SCRDA seem to perform very reliably if accompanied by MRRMRR. An attempt at an explanation follows in Section 5.

In addition, we perform a sensitivity study comparing various versions of the MRMR criterion on the same data artificially contaminated by noise, which was generated as a random variable, independently for each variable and observation, and added to each of the observed data values. For each of the following three distributional models, the noise was generated 100 times:
Noise 1: normal distribution.
Noise 2: contaminated normal distribution, that is, a mixture of two normal distributions, one of them with a much larger variance.
Noise 3: Cauchy distribution.
We again used the various MRMR criteria to find the 10 most relevant genes. The classification accuracy of LDA and other methods is compared in a leave-one-out cross validation study.
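A sketch of generating these three contamination schemes; the variances of the normal components and the contamination proportion are not reproduced from the paper, so sigma1, sigma2, and eps below are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(X, scheme, sigma1=1.0, sigma2=10.0, eps=0.1):
    """Add independent noise to every entry of X (illustrative sketch).

    sigma1, sigma2 and eps are placeholder values, not the paper's settings.
    """
    n, p = X.shape
    if scheme == "normal":
        noise = rng.normal(0.0, sigma1, size=(n, p))
    elif scheme == "contaminated_normal":
        # mixture (1 - eps) * N(0, sigma1^2) + eps * N(0, sigma2^2)
        wide = rng.random((n, p)) < eps
        noise = np.where(wide,
                         rng.normal(0.0, sigma2, size=(n, p)),
                         rng.normal(0.0, sigma1, size=(n, p)))
    elif scheme == "cauchy":
        noise = rng.standard_cauchy(size=(n, p))
    else:
        raise ValueError("unknown noise scheme")
    return X + noise
```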

Averaged results obtained with the optimal δ (requiring δ ≥ 0) are given in Table 3. They reveal a high vulnerability of available dimensionality reduction methods to the presence of noise. Here, MRRMRR outperforms MRMR with various classical relevance and redundancy measures. Besides, MRRMRR followed by LDA performs comparably to some other standard classification methods, although it actually uses only 10 genes, while the other methods (SCRDA, lasso-LR, and SVM) are allowed to use all genes. This performance is verified for noise under all three distributional assumptions, and the 10 genes selected by the MRRMRR method do not suffer from the noise. The difference between different choices of weights for the robust correlation coefficient seems to play only a marginal role, although one of the choices slightly outperforms the remaining two.

4.2. Metabolomic Profiles Study

We analyze the prostate cancer metabolomic data set of [34], which contains metabolites measured over two groups of patients, namely, those with a benign prostate cancer (16 patients) and with other cancer types (26 patients). The task in both examples is to learn a classification rule allowing discrimination between classes of individuals.

Standard classification methods are used on the raw data as well as after performing a dimensionality reduction. We use MRRMRR with the robust correlation coefficient as the relevance measure and the regularized coefficient of multiple correlation as the redundancy measure, because this choice turned out to provide the most reliable results in the study on contaminated data in Section 4.1. Results of the classification performance in a leave-one-out cross validation study are given in Table 2.

Standard classification methods are able to perform reliably on this data set [35] but do not allow a clear interpretation. Classification performed on the first 20 principal components loses its power due to the unsupervised nature of PCA. MRRMRR with 20 selected variables allows performing a reliable classification without losing important information for the classification task.

4.3. Keystroke Dynamics Study

Finally, we analyze our keystroke dynamics data of [36] from a study aiming at person authentication based on writing medical reports within a hospital. We proposed and implemented a software system based on keystroke dynamics measurements [37], inspired by biometric authentication systems for medical reports [38, 39].

The training data contain keystroke durations and keystroke latencies measured in milliseconds on 32 probands, who typed a short password (“kladruby”) 10 times at their habitual speed. In spite of the small number of variables, the data are high-dimensional, because the number of variables exceeds the number of measurements (10) available for each individual, and we must expect that learning the classification rule would suffer from the curse of dimensionality. In the practical application, one of the 32 individuals declares his/her identity and types the password. The aim of the analysis is to verify whether the individual typing on the keyboard is or is not the declared person. Thus, the authentication task is a classification problem assigning the individual to one of two groups.

Results of the classification performance in a leave-one-out cross validation study are given in the last column of Table 2. If the classification is performed with the raw data, an SVM outperforms other methods. However, its disadvantages include the difficulty of finding optimal values of its parameters as well as a large number of support vectors [1]. If MRRMRR is used to select 4 variables, with the robust correlation coefficient as the relevance measure and the regularized coefficient of multiple correlation as the redundancy measure, there seems to be no major loss of important information for the classification task.

5. Discussion

Variable selection represents an irreplaceable tool in the analysis of high-dimensional data, preventing numerous approaches of multivariate statistics and data mining from overfitting the data or even from being computationally infeasible due to the curse of dimensionality. Various versions of the Minimum Redundancy Maximum Relevance approach have been described in the literature as a supervised variable selection methodology tailor-made for classification purposes, while their primary disadvantage has been identified as a high sensitivity to the presence of outlying measurements [15].

This paper proposes a new version of the MRMR criterion in a form capturing the multivariate structure of the data. The new criterion, denoted as the Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), is constructed from two essential tools, and the robustness of the criterion is given by the robustness of both of them. One of them is a relevance measure in the form of a robust correlation coefficient, for which we investigate theoretical properties. The other is a redundancy measure in the form of a new regularized version of the coefficient of multiple correlation, which can be interpreted as a regularized coefficient of determination in linear regression. They are robust to the presence of noise in the data, numerically stable, and also statistically robust in terms of the breakdown point, that is, to the presence of outliers. Our work is a first attempt to investigate robust and regularized methods within the MRMR criterion; it is limited to two groups of samples only.

Section 4 of this paper illustrates the performance of MRRMRR on three real high-dimensional data sets with different numbers of variables and observations. Because the forward search of the MRMR criterion with various choices of relevance and redundancy depends on the trade-off parameter δ, the optimal result is obtained by maximizing the classification accuracy over different values of δ. MRRMRR yields very reliable results on the observed data, while there seems to be a negligible difference among the three choices of weights for the implicitly weighted relevance measure (adaptive, linearly decreasing, and logistic weights).

To show the robustness of MRRMRR, the data of Section 4.1 are analyzed again after being contaminated by severe noise. MRRMRR turns out to be the most robust among the considered variable selection procedures, while the choice of the weights for the robust relevance measure seems to play a negligible role. On the other hand, the vulnerability of some approaches (e.g., the mutual information within the MRMR variable selection) has not been sufficiently discussed in the literature.

In the numerical examples, we also inspected the question: which classification methods are the most recommendable to accompany the MRRMRR variable selection? Based on the results, SVM, LDA, and SCRDA seem to be suitable in this context, because they allow taking the covariance structure of the data into account. They are reliable also for highly correlated variables, while a prior use of MRRMRR avoids their specific disadvantages characteristic of high-dimensional data. On the other hand, MRRMRR does not bring benefits to classification methods which are based on one-dimensional principles. These include classification trees, PAM (i.e., diagonalized LDA), and others not used in our computations (e.g., the naïve Bayes classifier).

The regularization used in Section 3.1 is a popular tool for adapting statistical methods to the context of high-dimensional data. As Section 4.3 reveals, regularization brings about benefits for multivariate data also with a small number of variables. Thus, the regularization of Section 3.1 turns out to be suitable for data of any dimensionality. Also in a general setting, regularization has been described as a finite-sample (nonasymptotic) approach for multivariate data, not limited to the context of high-dimensional data [1, 24].

Every version of the MRMR method allows finding a set containing a fixed number of genes, which must be chosen before the computation. In the examples, we used an arbitrary choice mainly for comparison purposes. In practice, a more flexible approach would be to use the optimal number of variables according to a criterion evaluating the contribution of the variables to the classification problem taking the total number of variables into account [15].

Other possible relevance measures not studied in the literature include measures based on nonparametric analysis of variance (e.g., the Kruskal–Wallis, van der Waerden, and median tests [40]), logistic regression (the probability of belonging to group 1 or the deviance), or a coefficient of determination corresponding to ridge regression or lasso estimators [1]. A natural extension of our approach to several (K > 2) groups would be to replace the robust correlation coefficient with a highly robust version of the analysis of variance.

As a limitation of the MRRMRR approach compared to other MRMR approaches, its higher computational complexity compared to simple versions of the criterion with a fixed δ must be mentioned. Besides, the Tikhonov regularization of Section 3.1 is tailor-made for data with variables of the same type, for example, variables measured in the same units and with a similar level of variability. This may not be adequate if the observed variables are very heterogeneous. Other limitations of MRRMRR include those common to all MRMR approaches. Particularly, like other variable selection procedures, it does not possess a high stability [41], and selecting a too small number of variables in the MRRMRR approach may be criticized for its limited classification ability [18, 42].

The MRRMRR method is primarily designed as a variable selection tool, tailor-made for data which are observed in two different groups. Thus, if the very aim of the high-dimensional data analysis is classification without an explicit need for a variable selection, the user may prefer to use classification methods directly, that is, those which are reliable for n ≪ p. These direct classification methods not requiring a prior dimensionality reduction (the regularized LDA of [19] or SVM) may yield comparable (or possibly even better) results, but we stress their different primary aim. On the other hand, if the very aim of the analysis is comprehensibility of the classification approach, the user may want to avoid classifiers in the form of a black box. In such situations, the new MRRMRR variable selection represents a suitable tool, which is robust to the presence of outlying values.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The work was financially supported by the Neuron Fund for Support of Science and Grant GA13-17187S of the Czech Science Foundation.