Abstract

We investigate the performance of different classification models and their ability to recognize prostate cancer at an early stage. We build ensembles of classification models in order to increase the classification performance. We measure the performance of our models in an extensive cross-validation procedure and compare different classification models. The datasets come from clinical examinations, and some of the classification models are already in use to support urologists in their clinical work.

1. Introduction

Prostate cancer is one of the most common types of cancer among male patients in the western world. The number of expected new cases in the USA for the year 2006 was 235,000 with 27,000 expected deaths [1]. Early detection of prostate cancer improves the chances of a curative treatment and a lot of progress has been made in this field during the last decade. The early detection is considerably enhanced by the measurement of prostate-specific antigen (PSA) in conjunction with other clinically available data like age, digital rectal examination (DRE), and transrectal ultrasonography (TRUS) variables like prostate volume. We compared several classification models and analyzed their performance on the clinical dataset with an extended cross-validation procedure. The models were linear discriminant analysis (LDA), penalized discriminant analysis (PDA) [2], logistic regression [3], classification and regression trees (CARTs) [4], multilayer perceptron (MLP) [5], support vector machines (SVMs) [6, 7], and nearest-neighbor classifiers [8]. All these models are implemented in an open-source MATLAB toolbox that is available on the Internet [9].

This study will help to improve the software package ProstataClass [10], which was developed at the Charité and currently uses an artificial neural network as its classification engine. The program has been used successfully in clinical practice for several years.

2. Data

We had access to the clinically available data of 506 patients with 313 cases of prostate cancer (PCa) and 193 non-PCa. The data were randomly selected from a group of 780 patients. The data entry for each patient included age, PSA, the ratio of free to total prostate-specific antigen (PSA-Ratio), TRUS, and the diagnostic finding from the DRE, which was a binary variable (suspicious or nonsuspicious). Blood sampling and handling were performed as described in Stephan et al. [11]. The samples were taken before any diagnostic or therapeutic procedures, and sera were stored at -80°C until analyzed. After thawing at room temperature, samples were processed within 3 hours. Prostate volume was determined by transrectal ultrasound using the prolate ellipse formula. The scatter plot of the variables under investigation is shown in Figure 1. PCa and non-PCa patients were histologically confirmed by 6–8 core prostate biopsy.

3. Ensembles

The average output of several different models is called an ensemble model:

\[ \bar{f}(x) = \sum_{k=1}^{K} w_k f_k(x), \tag{1} \]

where we assume that the model weights sum to one, \(\sum_{k=1}^{K} w_k = 1\). There are several suggestions concerning the choice of the model weights (see Perrone and Cooper [12]), but we decided to use uniform weights with \(w_k = 1/K\) for the sake of simplicity and in order not to run into the overfitting problems reported by Krogh and Sollich [13].
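
For illustration, the following is a minimal Python sketch of the uniform-weight ensemble average in (1). The model list, the predict_proba-style interface, and the use of probability outputs are assumptions made for this example and are not part of the MATLAB toolbox [9].

```python
# Minimal sketch of the ensemble average in (1) with uniform weights w_k = 1/K.
# Assumes each trained model exposes a scikit-learn-style predict_proba method.
import numpy as np

def ensemble_average(models, X, weights=None):
    """Weighted average of K model outputs for the samples in X."""
    outputs = np.stack([m.predict_proba(X)[:, 1] for m in models])  # shape (K, n_samples)
    if weights is None:
        weights = np.full(len(models), 1.0 / len(models))           # uniform weights, sum to one
    return weights @ outputs                                         # ensemble output per sample
```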

The central feature of the ensemble approach is the generalization ability of the resulting model. In the case of regression models (with continuous output values), it was shown that the generalization error of the ensemble is on average lower than the mean of the generalization errors of the single ensemble members (see Krogh and Vedelsby 1995 [14]). This holds in general, independent of the model class, as long as the models constituting the ensemble are diverse with respect to the hypothesis of the unknown function. In the case of (binary) classification models, the situation is not so clear because the classical bias-variance decomposition of the squared error loss in regression problems (Geman et al. [15]) has to be extended to the zero-one loss function. There are several approaches dealing with this problem; see Kong and Dietterich [16], Kohavi and Wolpert [17], or Domingos [18].

The zero-one loss function is not the only possible choice for classification problems. If we are interested in the likelihood that a sample belongs to one class or the other, we can use the squared error loss from regression and consider the binary classification problem as a regression problem with two possible outcomes. In practice, many classifiers are trained in that way.

Our ensemble approach is based on the observation that the generalization error of an ensemble model can be improved if the models over which the averaging is done disagree and if their residual errors are uncorrelated [13]. To see this, we have to investigate the contribution of the single model in the ensemble to the generalization error. We consider the case where we have a given dataset \(D = \{(x_i, y_i),\ i = 1, \dots, N\}\) and we want to find a function \(f(x; D)\) that approximates \(y\) at new observations of \(x\). These observations are assumed to come from the same source that generated the training set \(D\), that is, from the same (unknown) probability distribution \(P(x, y)\). It should be noted that \(f\) depends also on \(D\). The expected generalization error given a particular \(x\) and a training set \(D\) is

\[ E(x; D) = E\big[(y - f(x; D))^2 \mid x, D\big], \tag{2} \]

where the expectation is taken with respect to the probability distribution \(P(y \mid x)\). We are now interested in

\[ E(x) = E_D\big[E(x; D)\big], \tag{3} \]

where the expectation \(E_D\) is taken with respect to all possible realizations of training sets \(D\) with fixed sample size \(N\). According to Geman et al. [15], the bias/variance decomposition of \(E(x)\) is

\[ E(x) = \big(E_D[f(x; D)] - E[y \mid x]\big)^2 + E_D\big[(f(x; D) - E_D[f(x; D)])^2\big] + E\big[(y - E[y \mid x])^2 \mid x\big], \tag{4} \]

where \(E[y \mid x]\) is the deterministic part of the data and \(E[(y - E[y \mid x])^2 \mid x]\) is the variance of \(y\) given \(x\). Balancing between the bias and the variance terms is a crucial problem in model building. If we try to decrease the bias term on a specific training set, we usually increase the variance term and vice versa. We now consider the case of an ensemble average \(\bar{f}(x; D)\), consisting of \(K\) individual models as defined in (1). If we put this into (4), we get

\[ E_{\mathrm{ens}}(x) = \Big(\sum_{k=1}^{K} w_k E_D[f_k(x; D)] - E[y \mid x]\Big)^2 + E_D\Big[\big(\bar{f}(x; D) - E_D[\bar{f}(x; D)]\big)^2\Big] + E\big[(y - E[y \mid x])^2 \mid x\big], \tag{5} \]

and we can have a look at the effects concerning bias and variance. The bias term in (5) is just the (squared) average of the biases of the individual models in the ensemble. So we should not expect a reduction in the bias term compared to single models. According to Naftaly et al. [19], the variance term of the ensemble can be decomposed in the following way:

\[ E_D\Big[\big(\bar{f}(x) - E_D[\bar{f}(x)]\big)^2\Big] = \sum_{k=1}^{K} w_k^2\, E_D\big[(f_k(x) - E_D[f_k(x)])^2\big] + \sum_{k \neq l} w_k w_l\, E_D\big[(f_k(x) - E_D[f_k(x)])(f_l(x) - E_D[f_l(x)])\big], \tag{6} \]

where the expectation is taken with respect to \(D\). The first sum in (6) marks the lower bound of the ensemble variance and is the weighted mean of the variances of the ensemble members. The second sum contains the cross terms of the ensemble members and disappears if the models are completely uncorrelated [13]. So the reduction in the variance of the ensemble is related to the degree of independence of the single models [19].
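
The variance reduction expressed by (6) can be illustrated with a small synthetic simulation; the noise level and the number of models below are arbitrary assumptions, not values from the clinical data.

```python
# Synthetic illustration of (6): averaging K models with uncorrelated, zero-mean errors
# reduces the ensemble variance to roughly 1/K of the single-model variance.
import numpy as np

rng = np.random.default_rng(0)
K, n_trials, sigma = 11, 100_000, 0.5
single = 1.0 + rng.normal(0.0, sigma, size=(n_trials, K))  # K independent "models"
ensemble = single.mean(axis=1)                              # uniform weights w_k = 1/K

print("mean single-model variance:", single.var(axis=0).mean())  # approx. sigma^2
print("ensemble variance         :", ensemble.var())             # approx. sigma^2 / K
```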

4. Cross-Validation and Model Selection

Our model selection scheme is a mixture of bagging [20] and cross-validation. Bagging or bootstrap aggregating was proposed by Breiman [20] in order to improve the classification by combining classifiers trained on randomly generated subsets of the entire training set. We extended this approach by applying a cross-validation scheme for model selection on each subset, and after that we combine the selected models into an ensemble. In contrast to classical cross-validation, we use random subsets as cross-validation folds. In K-fold cross-validation, the dataset is partitioned into K subsets. Of these K subsets, a single subset is retained as the validation data for testing the model, and the remaining K-1 subsets are used for model training. The cross-validation process is then repeated K times, with each of the K subsets used only once as the validation data. The results from the K folds can then be averaged to produce a single estimate.
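
As a point of reference, the following sketch shows a plain K-fold cross-validation estimate of the classification error. scikit-learn and the logistic regression model are only stand-ins chosen for this illustration, since the actual experiments used the MATLAB toolbox [9].

```python
# Plain K-fold cross-validation: train on K-1 folds, validate on the held-out fold,
# and average the misclassification rates over the K folds.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def kfold_error(X, y, K=11, seed=0):
    kf = KFold(n_splits=K, shuffle=True, random_state=seed)
    errors = []
    for train_idx, val_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        errors.append(1.0 - model.score(X[val_idx], y[val_idx]))  # fold misclassification rate
    return float(np.mean(errors))
```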

If we lack relevant problem-specific knowledge, cross-validation methods could be used to select a classification method empirically [21]. This is a common approach because it seems to be obvious that no classification method is uniformly superior, see, for example, Quinlan [22] for a detailed study. It is also a common approach to select the model parameters with cross-validation [23]. The idea to combine the models from the cross-validation folds (stacking) was described by Wolpert [24].

We suggest training several models on each CV-fold, selecting the best performing model on the validation set, and combining the selected models from the K folds. If we train models of one type but with different initial conditions (e.g., neural networks with different numbers of hidden neurons), then we find proper values for the free parameters of the model. We can extend that by combining models from different classes in order to increase the model diversity. We call this a heterogeneous ensemble or mixed ensemble and have applied this method effectively to regression problems [25] and classification tasks [26].

Our model selection scheme works as follows: for the K-fold CV, the data is divided K times into a training set and a test set, both sets containing randomly drawn subsets of the data without replacement. The size of each test set was 25% of the entire dataset.

In every CV-fold, we train several different models with a variety of model parameters (see Section 5 for an overview of the models and the related model parameters). In each fold, we select only one model to become a member of the final ensemble (namely, the best model with respect to the test set). This means that all models have to compete with each other in a fair tournament because they are trained and validated on the same dataset. The models with the lowest classification error in each CV-fold are taken out and added to the final ensemble, each receiving the weight \(w_k = 1/K\) (see (1)). All other models in this CV-fold are discarded.
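
A compact sketch of this per-fold selection is given below; the candidate models, the 25% test-set size, and the scikit-learn interface are assumptions for this illustration and only mirror the scheme described above.

```python
# Sketch of the selection scheme: in each of the K folds the candidate models compete on
# the same randomly drawn test set (25% of the data); only the fold winner joins the final
# ensemble and later receives the weight w_k = 1/K in the ensemble average.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

candidates = [LinearDiscriminantAnalysis(),
              SVC(kernel="rbf", probability=True),
              KNeighborsClassifier(n_neighbors=5)]

def build_ensemble(X, y, K=11, seed=0):
    ensemble = []
    for fold in range(K):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed + fold, stratify=y)
        fitted = [clone(m).fit(X_tr, y_tr) for m in candidates]
        scores = [m.score(X_te, y_te) for m in fitted]   # accuracy on the fold test set
        ensemble.append(fitted[int(np.argmax(scores))])  # keep only the best model
    return ensemble
```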

We can use this model selection scheme in two ways. If we have no idea or prior knowledge about which classification or regression method should be used to cope with a specific problem, we can use this scheme in order to look for an empirical answer and to compare the performance of the different model classes. The other way is the estimation of model parameters for the different model classes described in Section 5.

5. Classification Models

In this section, we give a short overview of the model classes that we used for ensemble building. All models belong to the well-established collection of machine-learning algorithms for classification and regression tasks, so details can be found in textbooks such as, for instance, Hastie et al. [2]. The implementation of these models together with a more detailed description can be found in [9]. The toolbox is an open-source MATLAB toolbox that allows the integration of existing implementations of classification algorithms, and it contains more model classes than the few described in the text.

5.1. Linear Discriminant Analysis

The LDA is a simple but useful classifier. If we assume that the two classes \(c = 1, 2\) have a Gaussian distribution with means \(\mu_c\) and that they share the same covariance matrix \(\Sigma\), then the linear discriminant function \(\delta_c(x)\), \(c = 1, 2\), is given by

\[ \delta_c(x) = x^{T} \Sigma^{-1} \mu_c - \tfrac{1}{2}\, \mu_c^{T} \Sigma^{-1} \mu_c + \log \pi_c, \]

where \(\pi_c\) denotes the frequency of occurrence of the class labels. The predicted class labels are given by

\[ \hat{y}(x) = \arg\max_{c} \delta_c(x). \]

We also implemented two modifications: the quadratic discriminant analysis (QDA) and the PDA, as described in detail in Hastie et al. [2]. Linear methods are usually conceptually simple, robust, and fast, and, in particular in high-dimensional problems, they can be very powerful.
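
The following is a minimal sketch of these discriminant functions in Python/NumPy, an illustration only and not the toolbox implementation [9]; the pooled covariance estimate and the prior estimates are the usual sample quantities.

```python
# Sketch of LDA: estimate class means, a shared (pooled) covariance matrix, and priors,
# then classify by the maximal linear discriminant function delta_c(x).
import numpy as np

def lda_fit(X, y):
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1) for c in classes)
    pooled /= len(y) - len(classes)                         # shared covariance matrix
    priors = np.array([np.mean(y == c) for c in classes])   # class frequencies pi_c
    return classes, means, np.linalg.inv(pooled), priors

def lda_predict(X, classes, means, inv_cov, priors):
    # delta_c(x) = x^T S^-1 mu_c - 0.5 mu_c^T S^-1 mu_c + log(pi_c)
    deltas = (X @ inv_cov @ means.T
              - 0.5 * np.sum(means @ inv_cov * means, axis=1)
              + np.log(priors))
    return classes[np.argmax(deltas, axis=1)]
```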

5.2. Logistic Regression Model

Logistic regression (Log.Reg.) is a model for binomially distributed dependent variables and is used extensively in the medical and social sciences. Hastie et al. [2] pointed out that the logistic regression model has the same form as the LDA; the only difference lies in the way the linear coefficients are estimated. See Hosmer and Lemeshow for a detailed introduction [27]. We used the binary Log.Reg. to compute the probability of the dichotomous variable \(y\) (PCa or non-PCa) from the independent variables \(x_1, \dots, x_d\):

\[ P(y = 1 \mid x) = \frac{1}{1 + e^{-z}} \quad \text{with} \quad z = \beta_0 + \beta_1 x_1 + \dots + \beta_d x_d, \]

wherein the model coefficients \(\beta_i\) are estimated with a second-order gradient descent (quadratic approximation to the likelihood function). This can be a critical issue in high-dimensional problems because these calculations are time and memory consuming.
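
The following sketch fits the binary logistic regression with Newton-Raphson updates, which is one concrete form of the quadratic approximation to the likelihood mentioned above; it is an illustration, not the implementation used in [9], and omits any regularization.

```python
# Sketch of binary logistic regression fitted via Newton-Raphson updates
# (second-order, i.e., a quadratic approximation to the log-likelihood).
import numpy as np

def logreg_fit(X, y, n_iter=25):
    Xb = np.hstack([np.ones((len(X), 1)), X])            # prepend intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))              # P(y = 1 | x)
        W = p * (1.0 - p)                                 # curvature weights
        H = Xb.T @ (Xb * W[:, None])                      # Hessian of the negative log-likelihood
        beta = beta + np.linalg.solve(H, Xb.T @ (y - p))  # Newton step
    return beta

def logreg_proba(X, beta):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ beta))
```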

5.3. Multilayer Perceptron

We train a multilayer feed-forward neural network (MLP) with a sigmoid activation function. The weights are initialized with Gaussian-distributed random numbers having zero mean and scaled variances. The weights are trained with a gradient descent based on the Rprop algorithm [28] with the improvements given in [29]. The MLP works with a common weight decay with the penalty term

\[ \lambda \|w\|^2 = \lambda \sum_{i=1}^{m} w_i^2, \]

where \(w\) denotes the \(m\)-dimensional weight vector of the MLP and \(\lambda\) is a small regularization parameter. The number of hidden layers, the number of neurons, and the regularization parameter are adjusted during the CV-training. We further applied the concept of an \(\varepsilon\)-insensitive error loss that we introduced in the context of cellular neural networks (CNNs) [30].
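
A sketch of a comparable MLP configuration, assuming scikit-learn; note that scikit-learn provides neither the Rprop algorithm nor the ε-insensitive loss used in the paper, so a standard solver serves as a stand-in, with alpha playing the role of the weight-decay parameter λ.

```python
# Sketch of an MLP with sigmoid activations and weight decay (L2 penalty on the weights).
# Hidden-layer size and alpha (the weight-decay parameter lambda) are the CV-tuned parameters.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(8,),   # one hidden layer with 8 neurons
                    activation="logistic",     # sigmoid activation function
                    alpha=1e-3,                # weight-decay regularization
                    solver="lbfgs",            # stand-in for Rprop (not available here)
                    max_iter=2000,
                    random_state=0)
# Usage: mlp.fit(X_train, y_train); probabilities via mlp.predict_proba(X_test)[:, 1]
```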

5.4. Support Vector Machines

Over the last decade, SVMs have become very powerful tools in machine learning. An SVM creates a hyperplane in a feature space that separates the data into two classes with the maximum margin. The feature space can be a mapping of the original features into a higher-dimensional space using a positive semidefinite function:

\[ k(x, x') = \langle \Phi(x), \Phi(x') \rangle. \]

The function \(k\) is called the kernel function, and the so-called kernel trick uses Mercer's condition, which states that any positive semidefinite kernel can be expressed as a dot product in a high-dimensional space (see [31] for a detailed introduction). The theoretical foundations of this approach were given by Vapnik's statistical learning theory [6, 32] and later extended to the nonlinear case [7]. We use an implementation of SVMs that is based on the libsvm provided by Chang and Lin [33] with the standard kernels:

\[ k(x, x') = x^{T} x' \ \text{(linear)}, \qquad k(x, x') = (\gamma\, x^{T} x' + r)^{d} \ \text{(polynomial)}, \qquad k(x, x') = \exp(-\gamma \|x - x'\|^2) \ \text{(RBF)}. \]

The parameters of the model are, with respect to the kernel type, the polynomial degree \(d\), the width \(\gamma\) of the RBF kernel, and the value \(C\) concerning the cost of constraint violation during the SVM training.
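
For illustration, the corresponding scikit-learn configurations, whose SVC class also builds on libsvm [33]; the concrete values of C, γ, and d below are placeholders for the parameters tuned in the CV scheme.

```python
# Sketch of SVM classifiers with the standard kernels; C is the cost of constraint
# violation, gamma the RBF width, and degree the polynomial degree d.
from sklearn.svm import SVC

svm_linear = SVC(kernel="linear", C=1.0)
svm_poly   = SVC(kernel="poly", degree=3, gamma="scale", C=1.0)  # (gamma x.x' + r)^d
svm_rbf    = SVC(kernel="rbf", gamma=0.1, C=1.0)                 # exp(-gamma ||x - x'||^2)
# Usage: svm_rbf.fit(X_train, y_train); scores via svm_rbf.decision_function(X_test)
```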

5.5. Trees

Trees are conceptually simple but powerful tools for classification and regression. For our purpose, we use the classification and regression trees (CARTs) as described in Breiman et al. [4]. The main feature of the CART algorithm is the binary decision rule that is introduced at each tree node with respect to the information content of the split. In this way, the most discriminating binary splits are near the tree root, building a hierarchical decision scheme. A sketch of a decision tree is shown in Figure 3. It is known that trees have a high variance, so they benefit from the ensemble approach [20]. Such tree ensembles are also known as random forests. The parameters of the tree models are related to splitting the tree nodes (the impurity measure and the split criterion, see [2] for a detailed description).
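
A short sketch of a single CART and of a tree ensemble, assuming scikit-learn; the splitting criterion corresponds to the impurity measure mentioned above, and the ensemble size is an arbitrary example value.

```python
# A single CART (high variance) versus an ensemble of trees grown on bootstrap samples.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

cart   = DecisionTreeClassifier(criterion="gini", max_depth=4)      # single decision tree
forest = RandomForestClassifier(n_estimators=25, criterion="gini",  # ensemble of trees
                                random_state=0)
# Usage: forest.fit(X_train, y_train); probabilities via forest.predict_proba(X_test)[:, 1]
```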

5.6. Nearest-Neighbor Classifier

A \(k\)-nearest-neighbor classifier (KNN) takes a weighted average over the labels of those \(k\) observations in the training set that are closest to the query point \(x\). This can be written as

\[ \hat{y}(x) = \sum_{x_i \in N_k(x)} w\big(d(x, x_i)\big)\, y_i, \]

where \(N_k(x)\) denotes the \(k\)-element neighborhood of \(x\), defined in a given metric, and \(d(x, x_i)\) is the related distance. Common choices are the \(L_1\), \(L_2\), and the \(L_\infty\) metrics. The parameters of the model are the number of neighbors \(k\) and the choice of the metric. KNNs offer a very intuitive approach to classification problems because they are based on the concept of similarity. This works fine in lower dimensions but leads to problems in higher dimensions, known as the curse of dimensionality [34].
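
A corresponding scikit-learn sketch; the number of neighbors k, the Minkowski power p (p = 1 gives the L1 metric, p = 2 the Euclidean metric), and the distance weighting stand for the parameters tuned in the CV scheme.

```python
# Sketch of a k-nearest-neighbor classifier with a distance-weighted average over the
# labels of the k closest training samples.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7,        # k, tuned by cross-validation
                           p=2,                  # L2 (Euclidean) metric
                           weights="distance")   # weight neighbors by inverse distance
# Usage: knn.fit(X_train, y_train); probabilities via knn.predict_proba(X_test)[:, 1]
```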

6. Application to the Clinical Data

We compared the model classes described above in a unified framework under fair conditions. Thus, we trained an ensemble of each model class consisting of 11 ensemble members (11 CV-folds in the training scheme described in Section 4). The performance of each ensemble model was assessed on the 20% of the data (validation set) that was initially removed and never included in model training (see Figure 2). This procedure was independently repeated 20 times. This means that the entire model-building process, that is, the random removal of 20% of the data, the construction of a classification model ensemble on the remaining 80% of the data as outlined in Section 4, and the final test on the unseen validation data, was performed each time. Finally, the mean prediction values with respect to the validation set were calculated and are listed in Table 2.

In some cases, it is useful to apply a kind of data preprocessing like balancing. If the distribution of the two classes differs in the sense that one class is represented by only a small number of examples, we can balance the data in the training set. This can improve the convergence of several training algorithms and also has an impact on the classification error [35]. We apply balancing by reducing the number of samples in the overrepresented class until we have a balanced ratio of the class labels. The ratio of the class labels in the validation set was never changed because it reflects the real data distribution. Balancing was only applied to the training data.

We used three different performance measures in order to compare the different classification models. Therefore, we have to define the four possible outcomes of a classification (true positives TP, true negatives TN, false positives FP, and false negatives FN), which can be arranged in a confusion matrix, as shown in Table 1. The accuracy,

\[ \text{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \]

seems to be the canonical error measure for almost all classification problems if the dataset is balanced. Other important measures are the specificity, which quantifies how well a binary classification model correctly identifies the negative cases (non-PCa patients),

\[ \text{SPC} = \frac{TN}{TN + FP}, \]

and the sensitivity, which is the proportion of true positives among all diseased cases (PCa patients) in the population,

\[ \text{SEN} = \frac{TP}{TP + FN}. \]

A high sensitivity is required when early diagnosis and treatment are beneficial, which is the case in PCa.
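
For clarity, a small sketch of these confusion-matrix-based measures, assuming binary labels with 1 = PCa and 0 = non-PCa.

```python
# Accuracy, specificity, and sensitivity computed from the four confusion-matrix counts.
import numpy as np

def basic_measures(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)   # correctly identified non-PCa patients
    sensitivity = tp / (tp + fn)   # correctly identified PCa patients
    return accuracy, specificity, sensitivity
```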

The precision or positive predictive value (PPV) is given by

\[ \text{PPV} = \frac{TP}{TP + FP} \]

and is the proportion of patients with positive test results who are correctly diagnosed. The F-score is the harmonic mean of precision and sensitivity,

\[ F = \frac{2 \cdot \text{PPV} \cdot \text{SEN}}{\text{PPV} + \text{SEN}}, \]

and it is useful if the classes in the classification problem are not equally distributed. Another measure is the area under curve (AUC), wherein the curve is the receiver operating characteristic (ROC) curve. The ROC-curve is the graphical plot of the sensitivity versus (1 - specificity) for a binary classifier as its discrimination threshold is varied.
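
A corresponding sketch for precision, F-score, and AUC; the AUC computation assumes continuous classifier scores and scikit-learn's ROC utilities.

```python
# Precision (PPV), F-score, and the area under the ROC curve (AUC).
from sklearn.metrics import roc_curve, auc

def precision_fscore(tp, fp, fn):
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean
    return precision, f_score

def roc_auc(y_true, y_score):
    fpr, tpr, _ = roc_curve(y_true, y_score)   # ROC: sensitivity vs. 1 - specificity
    return auc(fpr, tpr)
```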

The ROC-curve offers the opportunity to calculate the specificity at a fixed sensitivity level and vice versa. This is important because, from the clinical point of view, a high sensitivity of at least 95% is wanted in order to detect all patients with PCa first. To avoid a high false-positive rate, we computed the specificity at the level of 95% sensitivity (SPS95) from the ROC-curve as another important performance measure.
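
A minimal sketch of the SPS95 computation from the ROC curve, assuming continuous classifier scores; the operating point chosen here is the first threshold that reaches at least 95% sensitivity.

```python
# Specificity at 95% sensitivity (SPS95) read off the ROC curve.
import numpy as np
from sklearn.metrics import roc_curve

def sps95(y_true, y_score, target_sensitivity=0.95):
    fpr, tpr, _ = roc_curve(y_true, y_score)
    idx = np.argmax(tpr >= target_sensitivity)   # first ROC point with >= 95% sensitivity
    return 1.0 - fpr[idx]                        # specificity at that operating point
```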

The SPS95 thus gives an impression of the proportion of correctly classified non-PCa patients at this sensitivity level. If we compare the outcome of the statistical analysis of the model performance as listed in Table 2 for the unbalanced case and in Table 3 for the balanced case, we can state that the differences between the classifiers are marginal. Even the more sophisticated classification models (SVMs or mixed ensembles) could not outperform the robust linear candidates (LDA/PDA).

Tables 2 and 3 present the main results, with only small differences between the classifiers. The standard deviations of the performance measures are given, except for the ROC-curve-based measures (AUC and SPS95), because estimating the variability of these measures is a rather complex problem. Most papers in this field do not discuss this problem, and it cannot be solved within the scope of this paper, but it should be mentioned. As an example of a particular solution to this problem, see the paper by Hilgers [36].

7. Conclusions

We compared several classification models with respect to their ability to recognize prostate cancer at an early stage. This was done in an ensemble framework in order to estimate proper model parameters and to increase the classification performance. It turned out that all models under investigation perform very well, with only marginal differences, and are comparable with similar studies, for example, Finne et al. [37], Remzi et al. [38], or Zlotta et al. [39]. In future research, it should be investigated whether these results are valid for other populations of patients (e.g., screening data) and other PSA test assays and whether the classification performance could be increased by including new variables or by splitting the group of patients into different PSA ranges.