#### Abstract

DNA microarrays provide rich profiles that are used in cancer prediction by considering the gene expression levels across a collection of related samples. Support Vector Machines (SVM) have been applied to the classification of cancer samples with encouraging results. However, they rely on Euclidean distances that fail to reflect accurately the proximities among sample profiles. Non-Euclidean dissimilarities provide additional information that should be considered to reduce the misclassification errors. In this paper, we incorporate a linear combination of non-Euclidean dissimilarities into the $\nu$-SVM algorithm. The weights of the combination are learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) using a Semidefinite Programming algorithm. This approach allows us to incorporate a smoothing term that penalizes the complexity of the family of distances and avoids overfitting. The experimental results suggest that the proposed method helps to reduce the misclassification errors in several human cancer problems.

#### 1. Introduction

DNA Microarray technology provides us a way to monitor the expression levels of thousands of genes simultaneously across a collection of related samples. This technology has been applied particularly to the prediction of different types of human cancer with encouraging results [1].

Support Vector Machines (SVM) [2] are powerful machine learning techniques that have been applied to the classification of cancer samples [3]. However, the categorization of different cancer types remains a difficult problem for classical SVM algorithms. In particular, the SVM is based on Euclidean distances that fail to reflect accurately the proximities among the sample profiles [4]. Classifiers based on different non-Euclidean dissimilarities frequently misclassify different subsets of patterns because each dissimilarity reflects complementary features of the data. Therefore, the dissimilarities should be integrated in order to reduce the fraction of patterns misclassified by the base dissimilarities.

In this paper, we introduce a framework to learn a linear combination of non-Euclidean dissimilarities that better reflects the proximities among the sample profiles. Each dissimilarity is embedded in a feature space using the Empirical Kernel Map [5, 6]. After that, learning the dissimilarity is equivalent to optimizing the weights of the linear combination of kernels. Several approaches have been proposed for this purpose. In [7, 8] the kernel is learnt by optimizing an error function that maximizes the alignment between the input kernel and an idealized kernel. However, this error function is not related to the misclassification error and is prone to overfitting. To avoid this problem, [9] learns the kernel by optimizing an error function derived from Statistical Learning Theory. This approach includes a term that penalizes the complexity of the family of kernels considered. However, this algorithm is not able to incorporate infinite families of kernels and does not overcome the overfitting of the data.

In this paper, the combination of distances is learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) following the approach of hyperkernels proposed in [10]. This formalism exhibits a strong theoretical foundation and is less sensitive to overfitting. Moreover, it allows us to work with infinite families of distances. The algorithm has been applied to the prediction of different kinds of human cancer. The experimental results suggest that the combination of dissimilarities in a Hyper Reproducing Kernel Hilbert Space improves the accuracy of classifiers based on a single distance, particularly for nonlinear problems. Besides, our approach outperforms the Lanckriet formalism, especially for multicategory problems, and is more robust to overfitting.

This paper is organized as follows. Section 2 introduces the proposed algorithm, the material, and the methods employed. Section 3 illustrates the performance of the algorithm in the challenging problem of gene expression data analysis. Finally, Section 4 draws conclusions and outlines future research trends.

#### 2. Material and Methods

##### 2.1. Distances for Gene Expression Data Analysis

An important step in the design of a classifier is the choice of a proper dissimilarity that reflects the proximities among the objects. However, the choice of a good dissimilarity is not an easy task. Each measure reflects different features of the data, and the classifiers induced by the dissimilarities frequently misclassify a different set of patterns. In this section, we briefly discuss the main differences among several dissimilarities proposed to evaluate the proximity between biological samples considering their gene expression profiles. For a deeper description and definitions see [11].

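Let $x = [x_1, \ldots, x_d]$ be the vectorial representation of a sample, where $x_i$ is the expression level of gene $i$, and let $y = [y_1, \ldots, y_d]$ be a second sample. The *Euclidean distance* evaluates whether the gene expression levels differ significantly across different samples:

$$d_{\text{euclid}}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}. \tag{1}$$

An interesting alternative is the *cosine dissimilarity*. This measure becomes small when the ratio between the gene expression levels is similar for the two samples considered. It differs significantly from the Euclidean distance when the data is not normalized by the $L_2$ norm:

$$d_{\text{cosine}}(x, y) = 1 - \frac{x^T y}{\|x\| \, \|y\|}. \tag{2}$$

The *correlation measure* evaluates whether the expression levels of genes change similarly in both samples. Correlation-based measures tend to group together samples whose expression levels are linearly related. The correlation differs significantly from the cosine if the means of the sample profiles are not zero. This measure is more sensitive to outliers:

$$d_{\text{cor}}(x, y) = 1 - \frac{\sum_{i=1}^{d} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{d} (x_i - \bar{x})^2 \, \sum_{i=1}^{d} (y_i - \bar{y})^2}}, \tag{3}$$

where $\bar{x}$ and $\bar{y}$ are the means of the gene expression profiles.

The *Spearman rank dissimilarity* is less sensitive to outliers because it computes a correlation between the ranks of the gene expression levels:

$$d_{\text{spearm}}(x, y) = 1 - \frac{\sum_{i=1}^{d} (x'_i - \bar{x}')(y'_i - \bar{y}')}{\sqrt{\sum_{i=1}^{d} (x'_i - \bar{x}')^2 \, \sum_{i=1}^{d} (y'_i - \bar{y}')^2}}, \tag{4}$$

where $x'_i = \operatorname{rank}(x_i)$ and $y'_i = \operatorname{rank}(y_i)$.

An alternative measure that helps to overcome the problem of outliers is the *Kendall*-$\tau$ *index*, which is related to the Mutual Information probabilistic measure [11]:

$$d_{\text{kend}}(x, y) = 1 - \frac{\sum_{i=1}^{d} \sum_{j=1}^{d} C_{x_{ij}} C_{y_{ij}}}{d(d-1)}, \tag{5}$$

where $C_{x_{ij}} = \operatorname{sign}(x_i - x_j)$ and $C_{y_{ij}} = \operatorname{sign}(y_i - y_j)$.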
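The five dissimilarities discussed in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, ties in the Spearman ranks are resolved with average ranks (an assumption, since the text does not discuss ties), and profiles are assumed to be non-constant so that no norm is zero.

```python
import math

def euclidean(x, y):
    # Square root of the summed squared differences between expression levels.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_dissimilarity(x, y):
    # One minus the cosine of the angle between the two profiles.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

def correlation_dissimilarity(x, y):
    # Pearson correlation is the cosine of the mean-centered profiles.
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_dissimilarity([a - mean_x for a in x], [b - mean_y for b in y])

def ranks(v):
    # Ranks starting at 1; tied values share their average rank (assumption).
    order = sorted(range(len(v)), key=lambda i: v[i])
    result = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for t in range(i, j + 1):
            result[order[t]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result

def spearman_dissimilarity(x, y):
    # Correlation dissimilarity computed on the ranks.
    return correlation_dissimilarity(ranks(x), ranks(y))

def kendall_dissimilarity(x, y):
    # One minus the average product of pairwise order signs over ordered pairs.
    d = len(x)
    sign = lambda t: (t > 0) - (t < 0)
    total = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
                for i in range(d) for j in range(d))
    return 1.0 - total / (d * (d - 1))
```

With these definitions, two profiles that differ only by a positive scale factor are at cosine, correlation, Spearman, and Kendall dissimilarity zero while their Euclidean distance can be large, which illustrates why the measures capture complementary information.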

Finally, the dissimilarities have been transformed using the inverse multiquadratic kernel because this transformation helps to discover certain properties of the underlying structure of the data [12, 13]. The inverse multiquadratic transformation is based on the inverse multiquadratic kernel defined as follows:

$$k(x, y) = \frac{1}{\sqrt{\|x - y\|^2 + c^2}}, \tag{6}$$

where $c$ is a smoothing parameter. Considering that $\|x - y\|$ is the Euclidean distance, (6) can be rewritten in terms of a dissimilarity $d(x, y)$ as follows:

$$k(x, y) = \frac{1}{\sqrt{d(x, y)^2 + c^2}}. \tag{7}$$

The above nonlinear transformation gives more weight to small dissimilarities, particularly when $c$ becomes small.
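The effect of the smoothing parameter can be checked numerically. The sketch below is a minimal illustration, not the authors' code: `inverse_multiquadratic`, `transform_matrix`, and `contrast` are hypothetical names, and the dissimilarity values 0.1 and 1.0 are arbitrary choices used only to compare a small and a large dissimilarity.

```python
import math

def inverse_multiquadratic(d, c):
    # Map a dissimilarity value d to a similarity using smoothing parameter c > 0.
    return 1.0 / math.sqrt(d * d + c * c)

def transform_matrix(D, c):
    # Apply the transformation entry-wise to a dissimilarity matrix.
    return [[inverse_multiquadratic(d, c) for d in row] for row in D]

def contrast(c, d_small=0.1, d_large=1.0):
    # Ratio between the weight given to a small and to a large dissimilarity.
    return inverse_multiquadratic(d_small, c) / inverse_multiquadratic(d_large, c)
```

Under these assumptions, `contrast(0.1)` is roughly seven while `contrast(10.0)` is close to one, which matches the statement that small dissimilarities are emphasized more strongly as the smoothing parameter decreases.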

##### 2.2. $\nu$-Support Vector Machines

Support Vector Machines [2] are powerful classifiers that are able to deal with high dimensional and noisy data while keeping a high generalization ability. They have been widely applied in cancer classification using gene expression profiles [1, 14]. In this paper, we focus on the $\nu$-Support Vector Machines ($\nu$-SVM). The $\nu$-SVM is a reparametrization of the classical $C$-SVM [2] that allows the regularization parameter to be interpreted in terms of the number of support vectors and margin errors. This property helps to control the complexity of the approximating functions in an intuitive way. This feature is desirable for the application we are dealing with because the sample size is frequently small and the resulting classifiers are prone to overfitting.

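Let $\{(x_i, y_i)\}_{i=1}^{m}$ be the training set codified in $\mathbb{R}^d$. We assume that each $x_i$ belongs to one of the two classes labeled by $y_i \in \{-1, 1\}$. The SVM algorithm looks for the linear hyperplane $f(x; w) = \langle w, x \rangle + b$ that maximizes the margin $2\rho / \|w\|$, where $\rho \geq 0$ is optimized together with $w$ and $b$. The margin determines the generalization ability of the SVM. The slack variables $\xi_i$ allow classification errors to be considered and are defined as $\xi_i = \max\{0, \rho - y_i f(x_i; w)\}$.

For the $\nu$-SVM, the hyperplane that minimizes the prediction error is obtained by solving the following optimization problem [2]:

$$\min_{w, \{\xi_i\}, \rho, b} \; \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad \rho \geq 0, \quad i = 1, \ldots, m, \tag{8}$$

where $\nu$ is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. Therefore, this parameter controls the complexity of the approximating functions.

The optimization problem can be solved efficiently in the dual space and the discriminant function can be expressed exclusively in terms of scalar products:

$$f(x) = \sum_{\alpha_i > 0} \alpha_i y_i \langle x, x_i \rangle + b, \tag{9}$$

where $\alpha_i$ are the Lagrange multipliers in the dual optimization problem. The $\nu$-SVM algorithm can be easily extended to the nonlinear case by substituting the scalar products by a Mercer kernel [2]. Besides, non-Euclidean dissimilarities can be incorporated into the $\nu$-SVM via the kernel of dissimilarities [5].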

Finally, several approaches have been proposed in the literature to extend the SVM to deal with multiple classes. In this paper, we have followed the one-against-one (OVO) strategy. Let $k$ be the number of classes. In this approach, $k(k-1)/2$ binary classifiers are trained and the appropriate class is found by a voting scheme. This strategy compares favorably with more sophisticated methods and is computationally more efficient than the one-against-rest (OVR) approach [15].
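The one-against-one voting scheme can be sketched as follows. This is a hedged illustration rather than the implementation used in the paper: `train_pairwise`, `ovo_predict`, `CENTERS`, and `toy_train_binary` are hypothetical names, the toy binary classifier (nearest class center on a one-dimensional input) merely stands in for a trained binary SVM, and ties in the vote are broken by class order, which is an assumption.

```python
from collections import Counter
from itertools import combinations

def train_pairwise(classes, train_binary):
    # One binary classifier per unordered pair of classes: k(k-1)/2 in total.
    return {(a, b): train_binary(a, b) for a, b in combinations(classes, 2)}

def ovo_predict(x, classes, classifiers):
    # Each pairwise classifier votes for one of its two classes.
    votes = Counter(classifiers[pair](x) for pair in combinations(classes, 2))
    # The most voted class wins; ties go to the class listed first (assumption).
    return max(classes, key=lambda c: votes[c])

# Toy stand-in for a trained binary SVM: pick the class with the nearer center.
CENTERS = {"A": 0.0, "B": 5.0, "C": 10.0}

def toy_train_binary(a, b):
    return lambda x: a if abs(x - CENTERS[a]) <= abs(x - CENTERS[b]) else b
```

For three classes this trains three pairwise classifiers, and a point such as 4.2 collects two votes for class "B" and one for class "A".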

##### 2.3. Empirical Kernel Map

The Empirical Kernel Map allows us to incorporate non-Euclidean dissimilarities into the SVM algorithm using the kernel trick [5, 13].

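Let $d: X \times X \rightarrow \mathbb{R}$ be a dissimilarity and $R = \{p_1, \ldots, p_n\}$ a subset of representatives drawn from the training set. Define the mapping $\phi: X \rightarrow \mathbb{R}^n$ as

$$\phi(z) = D(z, R) = [d(z, p_1), d(z, p_2), \ldots, d(z, p_n)]. \tag{10}$$

This mapping defines a dissimilarity space where feature $i$ is given by $d(\cdot, p_i)$.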

The set of representatives $R$ determines the dimensionality of the feature space. The choice of $R$ is equivalent to selecting a subset of features in the dissimilarity space. Due to the small number of samples in our application, we have considered the whole training set as representatives. Notice that it has been suggested in the literature [13] that, for small samples, reducing the set of representatives does not help to improve the classifier performance.
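A minimal sketch of the Empirical Kernel Map is given below. The names `empirical_kernel_map` and `dissimilarity_kernel` are hypothetical, and taking the kernel as the plain dot product in the dissimilarity space is an assumption made for illustration; any dissimilarity function of two samples can be passed in.

```python
import math

def empirical_kernel_map(z, representatives, dissimilarity):
    # Represent z by its dissimilarities to every representative.
    return [dissimilarity(z, p) for p in representatives]

def dissimilarity_kernel(a, b, representatives, dissimilarity):
    # Dot product of the two mapped samples in the dissimilarity space.
    fa = empirical_kernel_map(a, representatives, dissimilarity)
    fb = empirical_kernel_map(b, representatives, dissimilarity)
    return sum(u * v for u, v in zip(fa, fb))
```

Because the kernel is a dot product of explicit feature vectors, the resulting Gram matrix is positive semidefinite whatever dissimilarity is used, which is what makes the kernel trick applicable to non-Euclidean measures.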

##### 2.4. Learning a Linear Combination of Dissimilarities in an HRKHS

In order to learn a linear combination of non-Euclidean dissimilarities, we follow the approach of hyperkernels developed in [10]. To this end, each distance is embedded in an RKHS via the Empirical Kernel Map presented in Section 2.3. Next, a regularized quality functional is introduced that incorporates an $\ell_2$-penalty over the complexity of the family of distances considered. The solution to this regularized quality functional is searched for in a Hyper Reproducing Kernel Hilbert Space. This allows the quality functional to be minimized using an SDP approach.

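Let $X_{\text{train}} = \{x_1, \ldots, x_m\}$ and $Y_{\text{train}} = \{y_1, \ldots, y_m\}$ be a finite sample of training patterns where $y_i \in \{-1, +1\}$. Let $\mathcal{K}$ be a family of positive semidefinite kernels. Our goal is to learn a kernel of dissimilarities $k \in \mathcal{K}$ that represents the combination of dissimilarities and minimizes the following empirical quality functional:

$$Q_{\text{emp}}(f, X, Y) = \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, f(x_i)) + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}}, \tag{11}$$

where $l$ is a loss function, $\|\cdot\|_{\mathcal{H}}$ is the norm defined in a reproducing kernel Hilbert space, and $\lambda$ is a regularization parameter that controls the balance between the training error and the generalization ability.

By virtue of the representer theorem [2], we know that the minimizer of (11) can be written as a kernel expansion:

$$f^{*}(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x). \tag{12}$$

However, if the family of kernels $\mathcal{K}$ is complex enough, it is possible to find a kernel that achieves zero error by overfitting the data. To avoid this problem, we introduce a term that penalizes the kernel complexity in an HRKHS:

$$Q_{\text{reg}}(k, X, Y) = \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, f(x_i)) + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}} + \frac{\lambda_Q}{2}\|k\|^2_{\underline{\mathcal{H}}}, \tag{13}$$

where $\|\cdot\|_{\underline{\mathcal{H}}}$ is the norm defined in the Hyper Reproducing Kernel Hilbert Space generated by the hyperkernel $\underline{k}$, and $\lambda_Q$ is a regularization parameter that controls the complexity of the resulting kernel. A rigorous definition of the HRKHS is provided in the appendix.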

The following theorem allows us to write the solution to the minimization of this regularized quality functional as a linear combination of hyperkernels in an HRKHS.

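Theorem 1 (Representer theorem for Hyper-RKHS [10]). *Let $X$, $Y$ be the combined training and test set; then each minimizer $k \in \underline{\mathcal{H}}$ of the regularized quality functional $Q_{\text{reg}}(k, X, Y)$ admits a representation of the form*

$$k(x, x') = \sum_{i,j=1}^{m} \beta_{ij}\, \underline{k}((x_i, x_j), (x, x')) \tag{14}$$

*for all $x, x' \in X$, where $\beta_{ij} \in \mathbb{R}$ for each $1 \leq i, j \leq m$.*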

However, we are only interested in solutions that give rise to positive semidefinite kernels. The following condition over the hyperkernels [10] allows us to guarantee that the solution is a positive semidefinite kernel.

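*Property 1.* Given a hyperkernel $\underline{k}$ with elements such that for any fixed $\underline{x} \in \underline{X}$, the function $k(x_p, x_q) = \underline{k}(\underline{x}, (x_p, x_q))$, with $x_p, x_q \in X$, is a positive semidefinite kernel, and $\beta_{ij} \geq 0$ for all $i, j = 1, \ldots, m$, then the kernel

$$k(x_p, x_q) = \sum_{i,j=1}^{m} \beta_{ij}\, \underline{k}((x_i, x_j), (x_p, x_q)) \tag{15}$$

is positive semidefinite.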

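Now, we address the problem of combining a finite set of dissimilarities. As we mentioned in Section 2.3, each dissimilarity can be represented by a kernel using the Empirical Kernel Map. Next, the hyperkernel is defined as

$$\underline{k}(\underline{x}, \underline{x}') = \sum_{l=1}^{n} c_l\, k_l(\underline{x})\, k_l(\underline{x}'), \tag{16}$$

where $\underline{x} = (x_i, x_j)$ denotes a pair of samples, each $k_l$ is a positive semidefinite kernel of dissimilarities, and $c_l$ is a constant $\geq 0$.

Now, we show that $\underline{k}$ is a valid hyperkernel. First, $\underline{k}$ is a kernel because it can be written as a dot product $\langle \Phi(\underline{x}), \Phi(\underline{x}') \rangle$ where

$$\Phi(\underline{x}) = \left(\sqrt{c_1}\, k_1(\underline{x}), \sqrt{c_2}\, k_2(\underline{x}), \ldots, \sqrt{c_n}\, k_n(\underline{x})\right). \tag{17}$$

Next, the resulting kernel (15) is positive semidefinite because, for all $\underline{x}$, $\underline{k}(\underline{x}, (x_p, x_q))$ is a positive semidefinite kernel and $\beta_{ij}$ can be constrained to be $\geq 0$. Besides, the linear combination of kernels is a kernel and is therefore positive semidefinite. Notice that $\underline{k}(\underline{x}, (x_p, x_q))$ is positive semidefinite if $c_l$ and $k_l(\underline{x})$ are pointwise positive for the training data. Both RBF and multiquadratic kernels verify this condition.

Finally, we show that the resulting kernel is a linear combination of the original $k_l$. Substituting the expression of the hyperkernel (16) in (15), the kernel is written as

$$k(x_p, x_q) = \sum_{i,j=1}^{m} \beta_{ij} \sum_{l=1}^{n} c_l\, k_l(x_i, x_j)\, k_l(x_p, x_q). \tag{18}$$

Now the kernel can be written as a linear combination of base kernels:

$$k(x_p, x_q) = \sum_{l=1}^{n} \left[ c_l \sum_{i,j=1}^{m} \beta_{ij}\, k_l(x_i, x_j) \right] k_l(x_p, x_q). \tag{19}$$

Therefore, the above kernel introduces into the $\nu$-SVM a linear combination of base dissimilarities represented by $k_l$ with coefficients $\mu_l = c_l \sum_{i,j=1}^{m} \beta_{ij}\, k_l(x_i, x_j)$.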
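The regrouping of sums that turns the hyperkernel expansion into a linear combination of base kernels can be verified numerically. The sketch below is only an illustration: the function names, the two toy base kernel matrices, the nonnegative coefficients `beta`, and the constants `c` are all hypothetical values chosen for the check, not quantities from the paper.

```python
def combination_weights(beta, c, base_kernels):
    # Weight of base kernel l: c_l times the beta-weighted sum of its entries.
    m = len(beta)
    return [c_l * sum(beta[i][j] * K[i][j] for i in range(m) for j in range(m))
            for c_l, K in zip(c, base_kernels)]

def combined_kernel(beta, c, base_kernels):
    # Learned kernel written as a weighted sum of the base kernel matrices.
    weights = combination_weights(beta, c, base_kernels)
    m = len(beta)
    return [[sum(w * K[p][q] for w, K in zip(weights, base_kernels))
             for q in range(m)] for p in range(m)]

def combined_kernel_direct(beta, c, base_kernels):
    # Same kernel evaluated directly from the double sum over training pairs.
    m = len(beta)
    return [[sum(beta[i][j] * c_l * K[i][j] * K[p][q]
                 for i in range(m) for j in range(m)
                 for c_l, K in zip(c, base_kernels))
             for q in range(m)] for p in range(m)]
```

Both routes give the same matrix, so once the coefficients are learnt the combined kernel can be evaluated with one pass over the base kernels instead of a sum over all training pairs.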

The previous approach can be extended to an infinite family of distances. In this case, the space that generates the kernel is infinite dimensional. Therefore, in order to work in this space, it is necessary to define a hyperkernel and to optimize it using an HRKHS. Let $k$ be a kernel of dissimilarities. The hyperkernel is defined as follows [10]:

$$\underline{k}(\underline{x}, \underline{x}') = \sum_{i=0}^{\infty} c_i \left(k(\underline{x})\, k(\underline{x}')\right)^i, \tag{20}$$

where $c_i \geq 0$ and $i = 0, \ldots, \infty$. In this case, the nonlinear transformation to feature space is infinite dimensional. In particular, we are considering all powers of the original kernels, which is equivalent to transforming the original dissimilarities nonlinearly:

$$\Phi(\underline{x}) = \left(\sqrt{c_0}, \sqrt{c_1}\, k(\underline{x}), \sqrt{c_2}\, k^2(\underline{x}), \ldots, \sqrt{c_n}\, k^n(\underline{x})\right), \tag{21}$$

where $n$ is the dimensionality of the space, which is infinite in this case. As we mentioned in Section 2.1, nonlinear transformations of a given dissimilarity provide additional information that may help to improve the classifier performance.

As for the finite family, it can be easily shown that $\underline{k}$ is a valid hyperkernel provided that the kernels considered are pointwise positive. The inverse multiquadratic kernel satisfies this condition. Next, we derive the hyperkernel expression for the inverse multiquadratic kernel.

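Proposition 1 (Harmonic hyperkernel). *Suppose $k$ is a kernel with range $[0, 1]$ and $c_i = (1 - \lambda_h)\lambda_h^i$, $i \in \mathbb{N}$, $0 < \lambda_h < 1$. Then, computing the infinite sum in (20), one has the following expression for the harmonic hyperkernel:*

$$\underline{k}(\underline{x}, \underline{x}') = (1 - \lambda_h) \sum_{i=0}^{\infty} \left(\lambda_h\, k(\underline{x})\, k(\underline{x}')\right)^i = \frac{1 - \lambda_h}{1 - \lambda_h\, k(\underline{x})\, k(\underline{x}')}. \tag{22}$$

*The parameter $\lambda_h$ is a regularization term that controls the complexity of the resulting kernel. In particular, larger values of $\lambda_h$ give more weight to strongly nonlinear kernels while smaller values give coverage for wider kernels. In this paper, we have considered the inverse multiquadratic kernel defined in (6). Substituting it in (22), one gets the inverse multiquadratic hyperkernel:*

$$\underline{k}(\underline{x}, \underline{x}') = \frac{1 - \lambda_h}{1 - \lambda_h \left[\left(d(\underline{x})^2 + c^2\right)\left(d(\underline{x}')^2 + c^2\right)\right]^{-1/2}}, \tag{23}$$

*where $d(\underline{x})$ and $d(\underline{x}')$ denote the dissimilarities between the two samples that form the pairs $\underline{x}$ and $\underline{x}'$, respectively.*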
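The closed form of the harmonic hyperkernel follows from summing a geometric series, which can be checked with a short sketch. The function names and the argument `lam` (standing for the regularization term of the harmonic hyperkernel) are hypothetical, and the base kernel values are assumed to lie in the interval from 0 to 1 so that the series converges.

```python
def harmonic_hyperkernel(k_a, k_b, lam):
    # Closed form of the geometric series, valid when 0 < lam < 1 and
    # both base kernel values lie in [0, 1].
    return (1.0 - lam) / (1.0 - lam * k_a * k_b)

def harmonic_series(k_a, k_b, lam, terms):
    # Truncated version of the infinite sum over powers of the base kernel.
    return (1.0 - lam) * sum((lam * k_a * k_b) ** i for i in range(terms))
```

With a few hundred terms the truncated sum agrees with the closed form to machine precision, and the hyperkernel equals one when both base kernel values are one.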

##### 2.5. $\nu$-SVM in an HRKHS

In this section, we detail how to learn the kernel for a $\nu$-Support Vector Machine in an HRKHS. First, we introduce the optimization problem and next, we briefly explain how to solve it using an SDP approach.

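We start with some notation that is used in the $\nu$-SVM algorithm. For $p, q, r \in \mathbb{R}^n$, $n \in \mathbb{N}$, let $r = p \circ q$ be defined as element by element multiplication, $r_i = p_i \times q_i$. The pseudoinverse of a matrix $K$ is denoted by $K^{\dagger}$. Define the hyperkernel Gram matrix $\underline{K}$ by $\underline{K}_{ijpq} = \underline{k}((x_i, x_j), (x_p, x_q))$, the kernel matrix $K = \operatorname{reshape}(\underline{K}\beta)$ (reshaping an $m^2$ by $1$ vector, $\underline{K}\beta$, to an $m \times m$ matrix), $Y = \operatorname{diag}(y)$ (a matrix with $y$ on the diagonal and zero otherwise), $G(\beta) = YKY$ (the dependence on $\beta$ is made explicit), and $\mathbf{1}$ is a vector of ones.

The $\nu$-SVM considered in this paper uses an $L_1$ soft margin, in which the slack variables enter the objective linearly. This error is less sensitive to outliers, which is a convenient feature for microarray datasets. Let $\xi_i$ be the slack variables that allow for errors in the training set. Substituting the loss in (13) by the one optimized by the $\nu$-SVM (8), the regularized quality functional in an HRKHS can be written as

$$\min_{k \in \underline{\mathcal{H}}} \min_{f \in \mathcal{H}_k} \; \frac{1}{m}\sum_{i=1}^{m} \xi_i + \frac{1}{2}\|f\|^2_{\mathcal{H}} - \nu\rho + \frac{\lambda_Q}{2}\|k\|^2_{\underline{\mathcal{H}}} \quad \text{s.t.} \quad y_i f(x_i) \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad \rho \geq 0, \tag{24}$$

where $\nu$ is the regularization parameter that achieves a balance between the training error and the complexity of the approximating functions, and $\lambda_Q$ is a parameter that penalizes the complexity of the family of kernels considered. The minimization of the previous equation leads to the following SDP optimization problem [10]:

$$\min_{\beta, \gamma, \eta, \xi, \chi} \; \frac{1}{2} t_1 - \chi\nu + \frac{\xi^T \mathbf{1}}{m} + \frac{\lambda_Q}{2} t_2 \quad \text{s.t.} \quad \chi \geq 0, \; \eta \geq 0, \; \xi \geq 0, \; \beta \geq 0, \quad \|\underline{K}^{1/2}\beta\| \leq t_2, \quad \begin{bmatrix} G(\beta) & z \\ z^T & t_1 \end{bmatrix} \succeq 0, \tag{25}$$

where $z = \gamma y + \chi \mathbf{1} + \eta - \xi$.

The value of $\alpha$ which optimizes the corresponding Lagrange function is $G(\beta)^{\dagger} z$, and the classification function, $f = \operatorname{sign}(K(\alpha \circ y) - b_{\text{offset}})$, is given by $f = \operatorname{sign}(K G(\beta)^{\dagger}(y \circ z) - \gamma)$. Here $K$ is built from the hyperkernel defined in Section 2.4, which represents the combination of dissimilarities considered. Finally, the proposed algorithm can be easily extended to deal with multiple classes via a one-against-one (OVO) approach. This strategy is simple, computationally more efficient than the OVR, and compares well with more sophisticated multicategory SVM methods [15].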

##### 2.6. Implementation

The optimization problem (25) was solved using the SeDuMi [16] and YALMIP [17] SDP optimization packages running under MATLAB.

As there are $m^2$ coefficients $\beta_{ij}$ in the SDP problem, the computational complexity is high. However, it can be significantly reduced if the hyperkernel is approximated by a small fraction of terms, for a given error. In particular, we have chosen a truncated lower triangular matrix which approximates the hyperkernel matrix up to a prescribed error using the incomplete Cholesky factorization method [18].
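The idea of a low-rank approximation by incomplete Cholesky factorization can be sketched in plain Python. This is a generic pivoted variant written for illustration, not necessarily the exact procedure of [18]: the function names are hypothetical, the stopping rule (residual trace at most `tol`) is an assumption, and the factor is returned as a list of columns that is lower triangular only up to the pivot permutation.

```python
import math

def incomplete_cholesky(K, tol):
    # Greedy pivoted factorization of a symmetric positive semidefinite matrix.
    # Stops when the trace of the residual K - G G^T drops to tol or below.
    n = len(K)
    residual_diag = [K[i][i] for i in range(n)]
    columns = []
    while len(columns) < n and sum(residual_diag) > tol:
        # Pivot on the largest remaining diagonal entry.
        pivot = max(range(n), key=lambda i: residual_diag[i])
        root = math.sqrt(residual_diag[pivot])
        column = [(K[i][pivot] - sum(g[i] * g[pivot] for g in columns)) / root
                  for i in range(n)]
        columns.append(column)
        residual_diag = [max(r - v * v, 0.0)
                         for r, v in zip(residual_diag, column)]
    return columns

def reconstruct(columns, n):
    # Rebuild the approximation G G^T from the stored columns.
    return [[sum(g[i] * g[j] for g in columns) for j in range(n)]
            for i in range(n)]
```

For a matrix of rank one the loop stops after a single column, which illustrates how a matrix with fast decaying spectrum can be represented by far fewer terms than its size suggests.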

##### 2.7. Datasets and Preprocessing

The gene expression datasets considered in this paper correspond to several human cancer problems and exhibit different features, as shown in Table 1. We have considered both binary and multicategory problems with a broad range of signal-to-noise ratios (Var/Samp.), different numbers of samples, and varying priors for the larger category. All the datasets are available from the Broad Institute of MIT and Harvard at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi/. Next, we detail the features and preprocessing applied to each dataset.

The first dataset was obtained from patients with diffuse large B-cell lymphoma (DLBCL) or follicular lymphoma (FL), whose samples were subjected to transcriptional profiling using the oligonucleotide Affymetrix gene chip *hu*68000 containing probes for genes [19]. The second dataset consists of frozen tumor specimens from newly diagnosed, previously untreated MLBCL patients and DLBCL patients. They were hybridized to the Affymetrix *hgu*133*b* gene chip containing probes for genes [20]. In both cases the raw intensities have been normalized using the rma algorithm [21] available from the Bioconductor package [11].

The third problem we address concerns the clinically important issue of metastatic spread of the tumor. The determination of the extent of lymph node involvement in primary breast cancer is the single most important risk factor in disease outcome. Here the analysis compares primary cancers that have not spread beyond the breast to ones that have metastasized to axillary lymph nodes at the time of diagnosis. We identified tumors as “reported negative” (24 samples) when no positive lymph nodes were discovered and “reported positive” (25 samples) for tumors with at least three identifiably positive nodes [22]. All assays used the human HuGeneFL Genechip microarray containing probes for genes.

The fourth dataset [23] addresses the clinical challenge concerning medulloblastoma due to the variable response of patients to therapy. Whereas some patients are cured by chemotherapy and radiation, others have progressive disease. The dataset consists of 60 samples containing 39 medulloblastoma survivors and 21 treatment failures. Samples were hybridized to Affymetrix HuGeneFL arrays containing known genes and expressed sequence tags.

All the datasets have been standardized by subtracting the median and dividing by the interquartile range. The rescaling was performed based only on the training set to avoid bias.
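A minimal sketch of this robust standardization, with the statistics estimated on the training samples only, is shown below. The function names are hypothetical, the quartiles are computed with the "inclusive" method of Python's `statistics.quantiles` (one of several common conventions, chosen here as an assumption), and genes with zero spread are left unscaled, which is also an assumption.

```python
import statistics

def fit_robust_scaler(train_rows):
    # Per-gene median and interquartile range, estimated on training data only.
    columns = list(zip(*train_rows))
    medians = [statistics.median(col) for col in columns]
    spreads = []
    for col in columns:
        q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
        # Fall back to 1.0 for constant genes to avoid division by zero.
        spreads.append((q3 - q1) or 1.0)
    return medians, spreads

def apply_robust_scaler(rows, medians, spreads):
    # Rescale any samples (training or test) with the training statistics.
    return [[(v - m) / s for v, m, s in zip(row, medians, spreads)]
            for row in rows]
```

In a cross-validation loop, `fit_robust_scaler` would be called on each training fold and `apply_robust_scaler` on both that fold and its held-out samples, so that no information from the test samples leaks into the rescaling.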

Regarding the identification of multiple classes of cancer, we have considered three different datasets. The first one consists of samples of breast cancer generated using single-channel oligonucleotide Affymetrix HuGeneFL arrays [1]. The second and third datasets consist of samples from diffuse large B-cell lymphoma with survival data. Four different subclasses can be identified. Data preparatory steps have been performed by the authors of the primary study [1]. The oligonucleotides with the smallest interquartile range were filtered out to remove genes whose expression level is constant across samples.

##### 2.8. Performance Evaluation

In order to ensure an honest evaluation of all the classifiers, we have performed a double loop of cross-validation [15]. The outer loop is based on stratified tenfold cross-validation that iteratively splits the data into ten sets, one for testing and the others for training. The inner loop performs stratified ninefold cross-validation over the training set and is used to estimate the optimal parameters while avoiding overfitting. The stratified variant of cross-validation keeps the same proportion of patterns for each class in the training and test sets. This is necessary in our problem because the class proportions are not equal. Finally, the measure considered to evaluate the classifiers has been the accuracy. This metric computes the proportion of samples correctly classified. The accuracy is easy to interpret and allows us to compare with the results obtained in previously published studies.
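The double loop of stratified cross-validation can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, samples are assigned to folds in round-robin order within each class without shuffling (an assumption made to keep the example deterministic), and model fitting and parameter selection are left out.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    # Deal the indices of each class round-robin so that every fold keeps
    # approximately the class proportions of the full set.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

def nested_cv_splits(labels, outer=10, inner=9):
    # Outer loop: held-out test fold. Inner loop: folds over the training part,
    # intended for parameter selection.
    for test_idx in stratified_folds(labels, outer):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        train_labels = [labels[i] for i in train_idx]
        inner_folds = [[train_idx[j] for j in fold]
                       for fold in stratified_folds(train_labels, inner)]
        yield train_idx, test_idx, inner_folds
```

Any preprocessing that uses the labels or the data distribution, such as rescaling or gene ranking, would be recomputed inside each training split produced by this generator.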

##### 2.9. Parameters for the Classification Algorithm

The parameters for the $\nu$-SVM and for the classifiers based on a linear combination of dissimilarities have been set up by a nested stratified tenfold cross-validation procedure [15]. This method avoids overfitting, as described in Section 2.8, and takes into account the asymmetric distribution of class priors.

For the $\nu$-SVM we have considered both linear and inverse multiquadratic kernels. The optimal parameters have been obtained by a grid search strategy over a set of candidate values, some of which are expressed in terms of the dimensionality of the input space.

Additionally, for the finite family of distances the constants $c_l$ have been fixed as a function of the number of dissimilarities considered, and the regularization parameter $\lambda_Q$ that controls the kernel complexity has also been fixed because the misclassification errors are hardly sensitive to it. Finally, for the infinite family of dissimilarities, the regularization parameter $\lambda_h$ in the harmonic hyperkernel (22) has been set to a value that gives an adequate coverage of various kernel widths. Smaller values emphasize only wide kernels. All the base kernels of dissimilarities have been normalized so that they have the same scale.

Regarding the Lanckriet formalism [9], which allows a finite set of dissimilarities to be combined, several values for the regularization parameter have been tried. A grid search strategy has been applied to determine the best values for both the kernel parameters and the regularization parameter. The kernel matrices have been normalized by the trace as recommended in the original paper.

##### 2.10. Gene Selection

Gene selection can significantly improve the classifier performance [24]. Therefore, we have evaluated the classifiers for several subsets of top-ranked genes. The $\nu$-SVM is robust against noise and is able to deal with high dimensional data. However, the empirical evidence suggests that considering a larger subset of genes, or even the whole set of genes, increases the misclassification errors.

The genes are ranked according to the ratio of between-group to within-group sums of squares defined in [25]:

$$BW(j) = \frac{\sum_{i} \sum_{k} I(y_i = k)\,(\bar{x}_{kj} - \bar{x}_{\cdot j})^2}{\sum_{i} \sum_{k} I(y_i = k)\,(x_{ij} - \bar{x}_{kj})^2}, \tag{28}$$

where $\bar{x}_{kj}$ and $\bar{x}_{\cdot j}$ denote, respectively, the average expression level of gene $j$ for class $k$ and the overall average expression level of gene $j$ across all samples, $y_i$ denotes the class of sample $i$, and $I(\cdot)$ is the indicator function. Next, the top ranked genes are chosen. This feature selection method is simple but compares well with more sophisticated methods [24]. Finally, the ranking of genes has been carried out considering only the training set to avoid bias. Therefore, feature selection is repeated in each iteration of cross-validation.
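The ranking by the ratio of between-group to within-group sums of squares can be sketched in plain Python. The function names are hypothetical, the expression matrix is assumed to be a list of samples with one value per gene, and a gene with zero within-group spread is given an infinite score, which is an assumption for the degenerate case.

```python
def bw_ratio(expression, labels, gene):
    # Between-group over within-group sum of squares for a single gene.
    values = [row[gene] for row in expression]
    overall = sum(values) / len(values)
    class_mean = {}
    for k in set(labels):
        members = [v for v, y in zip(values, labels) if y == k]
        class_mean[k] = sum(members) / len(members)
    between = sum((class_mean[y] - overall) ** 2 for y in labels)
    within = sum((v - class_mean[y]) ** 2 for v, y in zip(values, labels))
    return between / within if within > 0 else float("inf")

def top_genes(expression, labels, n_top):
    # Indices of the n_top genes with the largest ratio, best first.
    scores = [bw_ratio(expression, labels, j) for j in range(len(expression[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n_top]
```

To avoid selection bias, `top_genes` would be called on the training samples of each cross-validation iteration and the resulting indices applied to the corresponding test samples.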

#### 3. Results and Analysis

The proposed algorithms have been applied to the identification of several human cancer samples using microarray gene expression data.

First, we address several binary categorization problems.

Table 2 reports the accuracy for the two combination approaches proposed in this paper. The first one considers the finite set of dissimilarities introduced in Section 2.1. The second one considers an infinite family of distances obtained by transforming the base dissimilarities nonlinearly to feature space. We have compared them with the $\nu$-SVM based on the best distance (linear and nonlinear kernel) and with the classical $\nu$-SVM. The performance of the Lanckriet formalism [9], which allows us to incorporate a finite linear combination of dissimilarities, is also reported.

Before computing the kernel of dissimilarities, all the distances have been transformed using the inverse multiquadratic kernel introduced in Section 2.1. This nonlinear transformation helps to improve the accuracy for all the techniques evaluated. From the analysis of Table 2, the following conclusions can be drawn.

(i) The $\nu$-SVM based on a finite set of distances improves on the $\nu$-SVM based on the best dissimilarity for the brain prognosis and Lymphoma datasets. The error is not reduced for Lymphoma cell B and Breast LN. This may be explained because the ratio in Table 1 suggests that both datasets are quite noisy and nonlinear. The combination of a finite set of dissimilarities is not able to improve the separation between classes and slightly increases the overfitting of the data. Similarly, our algorithm helps to improve on the SVM based on coordinates, particularly for the previous problems. We also report that working directly from a dissimilarity matrix may help to reduce the misclassification errors.

(ii) The infinite family of distances outperforms the $\nu$-SVM based on the best distance, regardless of the kernel considered, for all the datasets. The improvement is more relevant in brain cancer prognosis. Brain cancer prognosis is a complex problem according to the original study [23], and the nonlinear transformations of the dissimilarities help to reduce the misclassification errors. Besides, the infinite family improves on the accuracy of the finite family of distances, particularly for Lymphoma cell B and Breast LN. This suggests that both datasets are nonlinear.

(iii) The Lanckriet formalism and the finite family of dissimilarities perform similarly. However, the infinite family of distances outperforms the Lanckriet formalism, particularly for brain and Lymphoma cell B, which are more complex problems.

(iv) The best distance depends on the dataset considered.

Next we move to the categorization of multiple cancer types.

Table 3 compares the proposed algorithms with the $\nu$-SVM based on the best distance (linear and nonlinear kernel) and with the classical $\nu$-SVM. The accuracy for the Lanckriet formalism is also reported. Our approach considers an infinite family of distances obtained by transforming the base dissimilarities nonlinearly to feature space.

Before computing the kernel of dissimilarities, all the distances have been transformed using the inverse multiquadratic kernel introduced in Section 2.1. From the analysis of Table 3, the following conclusions can be drawn.

(i) The combination of non-Euclidean dissimilarities helps to improve on the SVM based on the best dissimilarity, regardless of the kernel considered, for the first two datasets. The error is slightly larger for the third dataset, which may suggest that the problem is linear.

(ii) Our algorithm improves on the SVM based on coordinates. The experimental results suggest that the nonlinear transformations of the dissimilarities help to increase the separation among classes.

(iii) The hyperkernel classifier outperforms the Lanckriet formalism for multicategory problems. As the number of classes grows, the number of samples per class decreases and the Lanckriet formalism seems to be less robust to overfitting.

Finally, notice that our algorithm allows us to work with applications in which only a dissimilarity is defined. Moreover, we avoid the complex task of choosing a dissimilarity that properly reflects the proximities among the sample profiles.

#### 4. Conclusions

In this paper, we propose two methods to incorporate a linear combination of non-Euclidean dissimilarities into the $\nu$-SVM algorithm. The family of distances is learnt in a Hyper Reproducing Kernel Hilbert Space (HRKHS) using a Semidefinite Programming approach. A penalty term has been added to avoid overfitting the data. The algorithm has been applied to the classification of complex human cancer samples. The experimental results suggest that the combination of dissimilarities in a Hyper Reproducing Kernel Hilbert Space improves the accuracy of classifiers based on a single distance, particularly for nonlinear problems. Besides, this approach outperforms the Lanckriet formalism, especially for multicategory problems, and is more robust to overfitting. Future research will focus on learning the combination of dissimilarities for other classifiers.

#### Appendix

In this section we define rigorously the Hyper-Reproducing Kernel Hilbert Spaces. First, we define a Reproducing Kernel Hilbert Space.

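*Definition 1 (Reproducing Kernel Hilbert Space).* Let $X$ be a nonempty set and $\mathcal{H}$ be a Hilbert space of functions $f: X \rightarrow \mathbb{R}$. Let $\langle \cdot, \cdot \rangle$ be a dot product in $\mathcal{H}$ which induces a norm as $\|f\| = \sqrt{\langle f, f \rangle}$. $\mathcal{H}$ is called an RKHS if there is a function $k: X \times X \rightarrow \mathbb{R}$ with the following properties: (i) $k$ has the reproducing property $\langle f, k(x, \cdot) \rangle = f(x)$ for all $f \in \mathcal{H}$, $x \in X$; (ii) $k$ spans $\mathcal{H}$, that is, $\mathcal{H} = \overline{\operatorname{span}\{k(x, \cdot) \mid x \in X\}}$, where $\overline{X}$ is the completion of the set $X$; (iii) $k$ is symmetric, that is, $k(x, y) = k(y, x)$.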

Next, we introduce the Hyper Reproducing Kernel Hilbert Space.

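*Definition 2 (Hyper-Reproducing Kernel Hilbert Space).* Let $X$ be a nonempty set and $\underline{X} = X \times X$ be the Cartesian product. Let $\underline{\mathcal{H}}$ be the Hilbert space of functions $k: \underline{X} \rightarrow \mathbb{R}$ with a dot product $\langle \cdot, \cdot \rangle$ and a norm $\|k\| = \sqrt{\langle k, k \rangle}$. $\underline{\mathcal{H}}$ is a Hyper Reproducing Kernel Hilbert Space if there is a hyperkernel $\underline{k}: \underline{X} \times \underline{X} \rightarrow \mathbb{R}$ with the following properties: (i) $\underline{k}$ has the reproducing property $\langle k, \underline{k}(\underline{x}, \cdot) \rangle = k(\underline{x})$ for all $k \in \underline{\mathcal{H}}$; (ii) $\underline{k}$ spans $\underline{\mathcal{H}} = \overline{\operatorname{span}\{\underline{k}(\underline{x}, \cdot) \mid \underline{x} \in \underline{X}\}}$; (iii) $\underline{k}(x, y, s, t) = \underline{k}(y, x, s, t)$ for all $x, y, s, t \in X$.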

#### Acknowledgments

The authors would like to thank two anonymous referees for their useful comments and suggestions. Financial support from Grant S02EIA-07L01 is gratefully acknowledged.