Computational and Mathematical Methods in Medicine

Volume 2015 (2015), Article ID 370640, 14 pages

http://dx.doi.org/10.1155/2015/370640

## A Novel Hybrid Dimension Reduction Technique for Undersized High Dimensional Gene Expression Data Sets Using Information Complexity Criterion for Cancer Classification

^{1}Department of Statistics, Faculty of Science, Firat University, 23119 Elazig, Turkey

^{2}Department of Business Analytics and Statistics, The University of Tennessee, Knoxville, TN 37996, USA

Received 24 December 2014; Accepted 18 February 2015

Academic Editor: David A. Winkler

Copyright © 2015 Esra Pamukçu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Gene expression data typically are large, complex, and highly noisy. Their dimension is high, with several thousand genes (i.e., features) but only a limited number of observations (i.e., samples). Although the classical principal component analysis (PCA) method is widely used as a standard first step in dimension reduction and in supervised and unsupervised classification, it suffers from several shortcomings in the case of data sets involving undersized samples, since the sample covariance matrix degenerates and becomes singular. In this paper we address these limitations within the context of probabilistic PCA (PPCA) by introducing and developing a novel approach using the maximum entropy covariance matrix and its hybridized smoothed covariance estimators. To reduce the dimensionality of the data and to choose the number of probabilistic PCs (PPCs) to be retained, we further introduce and develop the celebrated Akaike’s information criterion (AIC), the consistent Akaike’s information criterion (CAIC), and the information theoretic measure of complexity (ICOMP) criterion of Bozdogan. Six publicly available undersized benchmark data sets were analyzed to show the utility, flexibility, and versatility of our approach with hybridized smoothed covariance matrix estimators, which do not degenerate, in performing PPCA to reduce the dimension and to carry out supervised classification of cancer groups in high dimensions.

#### 1. Introduction

The study of gene expression has been greatly facilitated by DNA microarray technology. Since DNA microarrays measure the expression of thousands of genes simultaneously, there is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data [1, 2]. With the wealth of gene expression data from microarrays being produced, more and more new prediction, classification, and clustering techniques are being used for the analysis of the data [3]. Dimension reduction techniques such as principal component analysis (PCA) and several extended forms of PCA, such as probabilistic principal component analysis (PPCA) and kernel principal component analysis (KPCA), have also been proposed to analyze gene expression data. For more on these methods we refer the readers to Raychaudhuri et al. [1], Yeung and Ruzzo [2], Chen et al. [4], Yang et al. [5], Ma and Kosorok [6], and Nyamundanda et al. [7]. Although these methods are commonly used in the literature, they all inherently have their own idiosyncratic statistical difficulties in analyzing undersized samples in high dimensions due to singularity of the covariance matrix, and these difficulties have not been satisfactorily addressed in the literature. Although SVM type kernel methods are useful tools, they have their own limitations in that they are not easily interpretable, since the kernel transformation is neither one-to-one and onto nor invertible. Moreover, for a given data set, the choice of the optimal kernel function and the tuning parameters in kernel-based methods has been arbitrary and remained an unresolved academic research problem in the literature until the recent work of Liu and Bozdogan [8] and Liberati et al. [9].

The main idea of the classical PCA, for example, is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible the variation present in the data set. This is achieved by transforming the data to a new set of variables, the principal components (PCs), which are uncorrelated and ordered [10]. By applying PCA, one implicitly assumes that the desired information is exactly provided by the percent variance explained. But such an assumption has been questioned and criticized by Scholz [11] in gene expression data analysis. Other, nonexhaustive limitations of PCA can be briefly described as follows.

(i) When the sample size $n$ is much smaller than the number of features (i.e., genes) $p$, that is, when we have $n \ll p$, the maximum likelihood (ML) estimator of the covariance matrix is neither invertible nor well conditioned.

(ii) Therefore, the classical PCA does not work well, since the estimated covariance matrix becomes rank deficient. Such a case is known in the literature as the *undersized sample problem* in high dimensions [12].

(iii) PCA lacks a probabilistic interpretation. That is, it does not have an underlying probability density model.

Estimation of covariance matrices for small sample sizes and high dimensions, that is, the $n \ll p$ problem, is a difficult problem that has recently attracted the attention of many researchers. This problem is prevalent in *genomics*, *microarray data*, *gene sequencing*, *medical data mining*, and *other bioinformatics areas*, as well as in *econometrics* and *predictive business modeling*. Numerically, it is one of the most challenging problems and requires new and efficient computational methods.

Due to the curse of dimensionality, almost all of the classical multivariate statistical methods break down and degenerate. This means that the covariance matrix of the data cannot be computed and, as a result, one obtains only poor classification and cluster analysis results. The main reason for this problem is the high noise created by the irrelevant or redundant genes (i.e., features) present in the data.

In this paper, therefore, our main objectives are severalfold: to address the limitations of the standard classical PCA and to develop and introduce a novel dimension reduction technique for cancer classification. These are as follows.

(i) To resolve the problem of small sample size and large number of dimensions, that is, the $n \ll p$ problem, we introduce several smoothed (or robust) covariance estimators and their hybridized forms with the neglected maximum entropy (ME) covariance matrix.

(ii) We introduce and use probabilistic principal component analysis (PPCA) as an alternative to the classical PCA. PPCA is a probabilistic formulation of PCA based on a Gaussian latent variable model. PPCA was developed in the late 1990s and popularized by the work of Tipping and Bishop [13, 14]. PPCA is flexible and has an associated likelihood measure as the quantum of information in the data, which is needed in the model selection criteria and their computations.

(iii) A central issue in gene expression data is dimension reduction before any classification or clustering procedures are meaningfully applied, especially when $n \ll p$. In the literature, the task of dimensionality selection has not been solved in a satisfactory way for undersized gene expression data in high dimensions. To this end, we introduce and develop the celebrated Akaike’s information criterion (AIC) [15], the consistent Akaike’s information criterion (CAIC) of Bozdogan [16], and the information theoretic measure of complexity (ICOMP) criterion of Bozdogan [17, 18] in the PPCA model to choose the number of probabilistic PCs to be retained.

(iv) Later, the PPCs chosen by the information criteria are used as inputs in cancer classification using linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA).
The performance of these methods is compared on six well-known benchmark gene expression data sets to emphasize the importance of the role of dimension reduction in resolving the *“curse of dimensionality”* of Bellman [19] in cancer classification problems.

Our method has distinct advantages over previously proposed methods in that we provide analytical means, via the novel application of information theoretic model selection criteria, of automatically choosing the number of PPCs to be retained and utilized in cancer classification, based on sound statistical modeling procedures. We use the entire data set with its high dimensions rather than clusterwise splitting or partitioning of the variables in the analysis process. Our approach generalizes to non-Gaussian latent variable models, probabilistic independent component analysis (PICA), sparse probabilistic principal component analysis (SPPCA), and other methods. The proposed approach is efficient and computationally cost effective; the results obtained are easy to interpret in the original data space and can be used in other supervised or unsupervised cancer classification procedures.

#### 2. Materials and Methods

##### 2.1. Smoothed and Hybridized Covariance Estimators

###### 2.1.1. Maximum Likelihood Covariance Estimation

Let $X$ be an $n \times p$ data matrix. When the data are modeled probabilistically as Gaussian or other elliptically contoured (EC) distributions (non-Gaussian), such as the multivariate $t$ (Mt), multivariate power exponential (MPE), multivariate Cauchy (MC), and multivariate Laplace (MLp), we must estimate the covariance matrix $\Sigma$. When the sample size $n$ is much smaller than the dimension or the number of variables (genes) $p$, the usual sample maximum likelihood (ML) estimator of $\Sigma$, given in matrix form by

$$\hat{\Sigma} = \frac{1}{n}\, X' \left( I_n - \frac{1}{n}\, \mathbf{1}_n \mathbf{1}_n' \right) X \tag{1}$$

becomes unstable, ill-conditioned, nonpositive definite, and even singular. In (1), $X'$ denotes the transpose of $X$, $I_n$ is the $n \times n$ identity matrix, and $\mathbf{1}_n$ is an $n$-dimensional column vector of ones. In such a case, we cannot compute the inverse covariance matrix, or what is referred to as the *“precision matrix,”* which is needed in practically all multivariate analyses, including supervised and unsupervised classification and kernel-based methods, among many others. This situation is especially true in many applications where we have the undersized sample problem.
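The rank deficiency described above is easy to verify numerically. The following Python sketch (an illustration only; the paper's own computations were done in Matlab) computes the ML covariance estimate via the centering matrix and shows that its rank cannot exceed $n - 1$ when $n \ll p$:

```python
import numpy as np

def ml_covariance(X):
    """ML covariance estimate: (1/n) * X' (I_n - (1/n) 1 1') X."""
    n = X.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n      # centering matrix I_n - (1/n) 1 1'
    return X.T @ C @ X / n

rng = np.random.default_rng(0)
n, p = 10, 50                                 # undersized sample: n << p
X = rng.normal(size=(n, p))
S = ml_covariance(X)

# Centering removes one degree of freedom, so rank(S) <= n - 1 = 9 < p = 50:
# S is singular and the precision matrix (its inverse) does not exist.
print(np.linalg.matrix_rank(S))
```

For continuous data the rank here is exactly $n - 1$, so any procedure that needs the precision matrix breaks down at this point.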

The *“precision matrix,”* that is, $\hat{\Sigma}^{-1}$, depends on the determinant of $\hat{\Sigma}$ and has a bias that needs to be reduced to regularize the estimated covariance matrix [20].

###### 2.1.2. Naïve Ridge Estimators of the Covariance Matrix

The usual initial resolution to the singular or ill-conditioned covariance matrix problem has been the “naïve” ridge regularization

$$\hat{\Sigma}_R = \hat{\Sigma} + \alpha\, I_p,$$

where $\alpha > 0$ is the ridge parameter and $\hat{\Sigma}_R$ indicates the ridge or regularized covariance estimator. This estimator counteracts the ill-conditioned covariance matrix by adjusting the eigenvalues of $\hat{\Sigma}$. Usually, the ridge parameter $\alpha$ is chosen to be very small. How large $\alpha$ should be and how small it can be have remained arbitrary, and this choice does not work well in *“large $p$, small $n$”* problems.
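A minimal sketch of this ridge fix in Python, with an arbitrarily chosen $\alpha$ (echoing the point above that the choice is ad hoc):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                    # singular ML estimate (n << p)

alpha = 1e-2                         # ridge parameter: the choice remains ad hoc
S_ridge = S + alpha * np.eye(p)      # shifts every eigenvalue up by alpha

# The regularized estimate is positive definite, so it can be inverted.
print(np.linalg.eigvalsh(S_ridge).min() > 0)
```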

###### 2.1.3. Smoothed Covariance Estimators

As an alternative to the “naïve” ridge regularization, many methods have been proposed to improve the estimation of the covariance matrix. All of these approaches rely on the concept of shrinkage estimators, dating back to the early work of James and Stein [21] and Stein [22], known as *“Steinian type shrinkage,”* which is implicit in many Bayesian methods as well as in maximum entropy (ME) covariance estimation.

The idea of shrinkage estimation of the covariance matrix, or what we call smoothed covariance estimators (SCEs), is to take a convex combination (i.e., weighted average) of the sample estimator of $\Sigma$, $\hat{\Sigma}$, with a suitably chosen target diagonal matrix $T$. The *shrinkage* or *smoothed estimator* of the covariance matrix then becomes a convex combination of $\hat{\Sigma}$ with some chosen target $T$, given by

$$\hat{\Sigma}_S = (1 - \lambda)\,\hat{\Sigma} + \lambda\, T,$$

where $\lambda$ is the optimal *shrinkage coefficient* (or *intensity*), a parameter between 0 and 1; that is, $0 \leq \lambda \leq 1$. It can be a function of the observations. The matrix $T$ is referred to as the shrinkage target. Its naïve form can be taken to be

$$T = \bar{\lambda}_a\, I_p = \frac{\operatorname{tr}(\hat{\Sigma})}{p}\, I_p, \qquad \bar{\lambda}_a = \frac{1}{p} \sum_{j=1}^{p} \lambda_j,$$

where $\operatorname{tr}(\cdot)$ denotes the trace of the matrix, $\lambda_j$, $j = 1, \ldots, p$, are the eigenvalues of the estimated sample covariance matrix, and $\bar{\lambda}_a$ is the arithmetic mean of the eigenvalues.

The interpretation of the general form of the smoothed covariance matrix estimator above is that it provides a more baseline level of variance and covariance estimation when the sample size is much smaller than the dimension of the data. By using such a weighted average, we put less weight on extremely high or low values in the estimated covariance matrix $\hat{\Sigma}$. This reduces the influence of extremely high or low values and provides a more robust and smoothed estimator. Such a structure minimizes the mean squared error (MSE); that is,

$$\operatorname{MSE}\left(\hat{\Sigma}_S\right) = E\left[ \left\| \hat{\Sigma}_S - \Sigma \right\|_F^2 \right],$$

where $\| \cdot \|_F^2$ denotes the squared Frobenius norm. It is difficult to compute the MSE of $\hat{\Sigma}$ without additional constraints, such as the shrinkage or smoothed covariance estimator $\hat{\Sigma}_S$.
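As a small illustration of the convex combination with the naive mean-eigenvalue target (a Python sketch, not the authors' Matlab code):

```python
import numpy as np

def smoothed_covariance(S, lam):
    """Smoothed estimator: (1 - lam) * S + lam * T, with the naive
    target T = (tr(S)/p) * I_p, i.e., the mean eigenvalue on the diagonal."""
    p = S.shape[0]
    T = (np.trace(S) / p) * np.eye(p)
    return (1 - lam) * S + lam * T

S = np.diag([4.0, 2.0, 0.0])        # toy covariance with a zero eigenvalue
S_s = smoothed_covariance(S, 0.5)

# Eigenvalues are pulled toward their mean (2.0): 4 -> 3, 2 -> 2, 0 -> 1,
# while the trace (total variance) is preserved.
print(np.diag(S_s))                  # [3. 2. 1.]
```

Note that the zero eigenvalue becomes positive, which is exactly what makes the smoothed estimator invertible in the undersized sample case.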

In what follows, we introduce several robust, regularized, or smoothed covariance estimators of the convex combination form given above, which have been developed under several linear and quadratic loss functions.

Selected improved (or smoothed) estimates of the covariance matrix via shrinkage from Bozdogan and Howe [23] and Bozdogan [18] are as follows.

*(i) Maximum Likelihood/Empirical Bayes (MLE/EB) Covariance Estimator.* This estimator, proposed by Haff [24], takes the ridge form

$$\hat{\Sigma}_{MLE/EB} = \hat{\Sigma} + c\, I_p,$$

where the perturbation constant $c$ is computed from $\operatorname{tr}(\hat{\Sigma})$, the trace of the estimated covariance matrix, and $I_p$ is the $p \times p$ identity matrix. When a small amount of perturbation is all that is required, $\hat{\Sigma}_{MLE/EB}$ has a certain appeal. It is clear that this is of the same form as the naïve ridge regularization.

*(ii) Maximum Entropy (ME) Covariance Matrix.* This estimator takes the form

$$\hat{\Sigma}_{ME} = \hat{\Sigma}_0 + D,$$

where $\hat{\Sigma}_0$ is the usual (nonnegative definite) dispersion matrix and $D$ is a (positive definite) diagonal matrix with positive elements on the diagonal. These positive elements take the form of a weighted sum of squared differences between successive primary midpoints of the variables, and they serve as a *ridge* in the ME covariance matrix. In other words, the construction of the ME covariance matrix automatically produces the *ridge* component directly from the data, without the worry of how to choose the ridge parameter as in the usual ridge type estimators. The main motivation for introducing the ME covariance matrix estimator, which has been ignored in the statistical literature, is that it makes the singular and ill-conditioned covariance matrix positive definite when we have undersized sample data, as is the case in gene expression data sets. What is also interesting about the ME covariance matrix is that it uses linear and nonlinear order statistics (OS) in its computation, fully exploiting the information in the data set. The computation of the ME covariance matrix is fast and efficient in terms of CPU time for high dimensional data. A Matlab module has been written for the computation of the ME covariance matrix and is utilized in our analysis in what follows.

For more on the ME covariance matrix we refer the readers to Theil and Laitinen [25], Fiebig [12], and Theil and Fiebig [26].
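The exact Theil–Laitinen weighting scheme is not reproduced here; the sketch below illustrates only the shape of the estimator: a positive diagonal ridge $D$, built from order statistics of each variable, is added to the usual dispersion matrix. The midpoint definition and the uniform weights are simplifying assumptions of this sketch, not the authors' formulas.

```python
import numpy as np

def me_covariance_sketch(X):
    """Illustrative stand-in for the ME estimator: S + D, where D is a
    positive diagonal ridge derived from the data. The midpoints and the
    uniform weighting below are assumptions, not the exact ME construction."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                          # usual dispersion matrix
    d = np.empty(p)
    for j in range(p):
        x = np.sort(X[:, j])                   # order statistics of variable j
        mid = (x[:-1] + x[1:]) / 2.0           # midpoints of successive order stats
        d[j] = np.mean(np.diff(mid) ** 2)      # squared differences, uniform weights
    return S + np.diag(d)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 50))                  # n = 10 << p = 50
S_me = me_covariance_sketch(X)

# The data-driven diagonal ridge makes the estimate positive definite.
print(np.linalg.eigvalsh(S_me).min() > 0)
```

The key design point carried over from the ME construction is that the ridge magnitudes come from the data itself, so no external ridge parameter needs to be tuned.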

*(iii) Stipulated Ridge Covariance Estimator (SRE).* This estimator again takes the ridge form

$$\hat{\Sigma}_{SRE} = \hat{\Sigma} + c_{SRE}\, I_p,$$

with a stipulated (data-driven) ridge constant $c_{SRE}$. We note that $\operatorname{bias}(\hat{\Sigma}_{SRE}) \to 0$ and $\hat{\Sigma}_{SRE} \to \hat{\Sigma}$ as $n \to \infty$.

*(iv) Stipulated Diagonal Covariance Estimator (SDE).* This estimator shrinks toward the diagonal of $\hat{\Sigma}$ rather than the identity:

$$\hat{\Sigma}_{SDE} = \hat{\Sigma} + c_{SDE}\, \operatorname{diag}(\hat{\Sigma}),$$

where the stipulated constant $c_{SDE}$ is computed from the data through $R$, the correlation matrix. For SDE, we also note that $\operatorname{bias}(\hat{\Sigma}_{SDE}) \to 0$ and $\hat{\Sigma}_{SDE} \to \hat{\Sigma}$ as $n \to \infty$.

The SRE and SDE covariance estimators are due to Shurygin [20] (the last student of Kolmogorov). SDE avoids dependence on the scale units of measurement of the variables.

*(v) Convex Sum Covariance Estimator (CSE).* Preceding the series of work of Ledoit and Wolf [27, 28], and based on the quadratic loss function used by Press [29], Chen [30] proposed a convex sum covariance matrix estimator (CSE) given by

$$\hat{\Sigma}_{CSE} = (1 - \gamma)\,\hat{\Sigma} + \gamma\, \frac{\operatorname{tr}(\hat{\Sigma})}{p}\, I_p,$$

where the mixing weight $\gamma \in [0, 1]$ is computed data adaptively from $n$, $p$, and the eigenvalues of $\hat{\Sigma}$.

This estimator improves upon the usual covariance estimator by shrinking all the estimated eigenvalues toward their common mean. One obvious advantage of this estimator is that it is operational even when $n \ll p$, that is, when the sample size is much smaller than the dimension.

*(vi) Bozdogan’s [31] Convex Sum Covariance Estimator (BCSE).* This estimator takes the convex sum form

$$\hat{\Sigma}_{BCSE} = (1 - \gamma)\,\hat{\Sigma} + \gamma\, \hat{T},$$

where the estimated shrinkage target $\hat{T}$ is a diagonal matrix built from the sum of the squared deviations of each dimension. As is well known, the sum of squared deviations allows the overall variability in a data set to be attributed to different types or sources of variability, with the relative importance of each being quantified by the size of each component of the overall sum of squares. We calculate the sum of squares per degree of freedom (i.e., the variance) and then divide by the total degrees of freedom to obtain the quantity used in the estimated shrinkage target.

*(vii) Eigenvalue Stabilization of the Covariance Matrix (Thomaz [32]) (STA)*. The stabilization algorithm is as follows.

(1) Find the eigenvectors $V = [v_1, \ldots, v_p]$ and eigenvalues $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$ of the covariance matrix.

(2) Compute the mean or average eigenvalue of the covariance matrix:

$$\bar{\lambda} = \frac{1}{p} \sum_{j=1}^{p} \lambda_j = \frac{\operatorname{tr}(\hat{\Sigma})}{p}.$$

(3) Form a new matrix of eigenvalues based on the largest dispersion values:

$$\Lambda^{*} = \operatorname{diag}\left( \max(\lambda_1, \bar{\lambda}), \ldots, \max(\lambda_p, \bar{\lambda}) \right).$$

(4) Finally, re-form the modified, newly stabilized covariance matrix:

$$\hat{\Sigma}_{STA} = V\, \Lambda^{*}\, V'.$$
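The four steps of the STA algorithm can be sketched in Python as follows (an illustration; the paper's computations were done in Matlab):

```python
import numpy as np

def stabilize_eigenvalues(S):
    """Thomaz-style eigenvalue stabilization: eigenvalues smaller than the
    mean eigenvalue are raised to the mean; larger ones are kept."""
    vals, vecs = np.linalg.eigh(S)             # step 1: eigendecomposition
    lam_bar = vals.mean()                      # step 2: mean eigenvalue = tr(S)/p
    vals_sta = np.maximum(vals, lam_bar)       # step 3: largest dispersion values
    return vecs @ np.diag(vals_sta) @ vecs.T   # step 4: re-form the matrix

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 50))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / 10                             # singular ML estimate (n << p)

S_sta = stabilize_eigenvalues(S)

# Zero eigenvalues are lifted to the mean, so S_sta is positive definite,
# while the largest eigenvalues are left unchanged.
print(np.linalg.eigvalsh(S_sta).min() > 0)
```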

There are other smoothed covariance matrices; for space considerations, the ones above will suffice for the results in this paper.

###### 2.1.4. Hybridized Smoothed ME Covariance Estimator

We can choose any of the smoothed covariance estimators and stabilize their eigenvalues with the STA algorithm above. In this paper, however, we focus our attention specifically on the ME covariance matrix, stabilizing its eigenvalues using the eigenvalue stabilization of Thomaz [32]. We then hybridize the result with the other smoothed covariance matrix estimators to reduce the dimension of undersized high dimensional data in the PPCA model using the information theoretic model selection criteria. The rationale and mathematical motivation of stabilization plus hybridization are to improve, in a straightforward way, the smaller and less reliable eigenvalues of the estimated covariance matrix while keeping most of its larger eigenvalues unchanged before smoothing, to guarantee that the eigenvalues of a nonnegative definite matrix do not become negative, and to achieve positive definiteness via shrinkage. These hybrid regularized covariance estimators greatly improve supervised and unsupervised classification error rates after dimension reduction and support general inference in multivariate modeling.

For example, we stabilize the ME covariance matrix and obtain

$$\hat{\Sigma}_{ME/STA} = V_{ME}\, \Lambda_{ME}^{*}\, V_{ME}'.$$

Then, we hybridize $\hat{\Sigma}_{ME/STA}$, say, with the convex sum covariance estimator (CSE) and compute the CSE smoothing with $\hat{\Sigma}_{ME/STA}$ in place of the usual sample covariance matrix.

We call such a process the “hybridized covariance estimator,” $\hat{\Sigma}_{Hyb}$. Similarly, we can hybridize the other smoothed covariance estimators. These hybridized smoothed (or robust) estimators of the covariance matrix overcome the singularity of the covariance matrix for undersized gene expression data sets and avoid negative eigenvalues.
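One plausible reading of the stabilize-then-hybridize pipeline is sketched below in Python; the ME stand-in, the choice of CSE as the smoothing partner, and the fixed mixing weight are all assumptions of the sketch, not the authors' exact recipe:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 10, 50
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                               # singular ML estimate (n << p)

# Stand-in for the ME covariance: S plus a small data-driven diagonal ridge.
d = np.array([np.mean(np.diff(np.sort(X[:, j])) ** 2) for j in range(p)])
S_me = S + np.diag(d)

# Step 1: stabilize the eigenvalues of the ME matrix (STA).
vals, vecs = np.linalg.eigh(S_me)
S_me_sta = vecs @ np.diag(np.maximum(vals, vals.mean())) @ vecs.T

# Step 2: hybridize with a convex-sum smoother (gamma is illustrative).
gamma = 0.3
S_hyb = (1 - gamma) * S_me_sta + gamma * (np.trace(S_me_sta) / p) * np.eye(p)

# The hybrid estimator is symmetric and positive definite.
print(np.linalg.eigvalsh(S_hyb).min() > 0)
```

Either step alone already yields a positive definite matrix here; combining them mirrors the paper's motivation of lifting the unreliable small eigenvalues before smoothing.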

As an illustration of our proposed approach to resolve the undersized sample problem, we discard the group structure of the colon benchmark data set for the time being and compute the usual sample covariance matrix, $\hat{\Sigma}$. Then, to remedy the singularity problem, we compute the maximum entropy (ME) covariance matrix, $\hat{\Sigma}_{ME}$. Later, we hybridize the ME covariance matrix with the other smoothed (or robust) covariance estimators; we denote this by $\hat{\Sigma}_{Hyb}$, as described in Section 2.1.4. Now we compute the eigenvalues of these covariance estimators and compare them with the eigenvalues of the MLE type smoothed covariance estimators. Our results for both the MLE based smoothed covariance matrices and the hybridization of the ME covariance with the smoothed covariances for the colon cancer data set are shown in Figures 1(a)-1(b).