We investigate the performance of different classification models and their ability to recognize prostate cancer in an
early stage. We build ensembles of classification models in order to increase the classification performance. We measure
the performance of our models in an extensive cross-validation procedure and compare different classification
models. The datasets come from clinical examinations and some of the classification models are already in use to support
the urologists in their clinical work.
1. Introduction
Prostate cancer is one of the most common types of cancer among male patients in the western world. The number of expected
new cases in the USA for the year 2006 was 235,000 with 27,000 expected deaths [1]. Early detection of
prostate cancer improves the chances of a curative treatment and a lot of
progress has been made in this field during the last decade. The early
detection is considerably enhanced by the measurement of prostate-specific
antigen (PSA) in conjunction with other clinically available data like age,
digital rectal examination (DRE), and transrectal ultrasonography (TRUS)
variables like prostate volume. We compared several classification models and
analyzed their performance on the clinical dataset with an extended
cross-validation procedure. The models were linear discriminant analysis (LDA),
penalized discriminant analysis (PDA) [2], logistic regression [3], classification and
regression trees (CARTs) [4], multilayer perceptron (MLP) [5], support vector machines
(SVMs) [6, 7], and nearest neighbour
classifiers [8]. All
these models are implemented in an open-source Matlab-toolbox that is available
on the internet [9].
This study will help to improve the software package ProstataClass [10] which was
developed at Charité and currently uses an artificial neural network as
classification engine. This program is successfully used in clinical practice
for several years.
2. Data
We had access to the clinically available data of 506 patients with 313 cases of prostate
cancer (PCa) and 193 non-PCa. The data were selected from a group of 780
patients randomly. The data entry for each patient included age, PSA, the ratio
of free to total prostate-specific antigen (PSA-Ratio), TRUS, and the
diagnostic finding from the DRE which was a binary variable (suspicious or
nonsuspicious). Blood sampling and handling were performed as described in
Stephan et al. [11].
The samples were taken before any diagnostic or therapeutic procedures, and
sera were stored at 80°C until analyzed.
After thawing at room temperature, samples were processed within 3 hours.
Prostate volume was determined by transrectal ultrasound using the prolate
ellipse formula. The scatter plot of the variables under investigation is shown
in Figure 1. PCa and non-PCa patients were histologically confirmed by 6–8
core prostate biopsy.
Figure 1: A scatterplot matrix of the data. Each box shows a pair of variables and the cases are
color-coded, a red cross marks PCa, and a blue circle non-PCa. The DRE is a binary variable (suspicious or nonsuspicious).
3. Ensembles
The average output of several different models is called an
ensemble model:where we assume that the model
weights sum to one . There are several suggestions concerning the choice
of the model weights (see Perrone and Cooper [12]) but we decided to use uniform weights with for the sake of
simplicity and not to run into overfitting problems as reported by Krogh and Sollich
[13].
The central feature of the ensemble approach is the
generalization ability of the resulting model. In the case of regression models
(with continuous output values), it was shown that the generalization error of
the ensemble is in the average case lower than the mean of the generalization
error of the single-ensemble members (see Krogh and Vedelsby 1995 [14]). This holds in general,
independent of the model class, as long as the models constituting the ensemble
are diverse with respect to the hypothesis of the unknown function. In the case
of (binary) classification models, the situation was not so clear because the
classical bias-variance decomposition of the squared error loss in regression
problems (Geman et al. [15]) had to be extended to the zero-one loss function.
There are several approaches dealing with this problem, see Kong and Dietterich
[16], Kohavi and Wolpert [17], or Domingos [18].
The zero-one loss function is not the only possible
choice for classification problems. If we are interested in a likelihood
whether a sample belongs to one class or not, we can use the error loss from
regression and consider the binary classification problem as a regression
problem that works on two possible outcomes. In practice, many classifiers are
trained in that way.
Our ensemble approach is based on the observation that
the generalization error of an ensemble model could be improved if the models
on which averaging is done disagree and if their residual errors are
uncorrelated [13]. To
see this, we have to investigate the contribution of the single model in the
ensemble to the generalization error. We consider the case where we have a
given dataset and we want to
find a function that
approximates at new
observations of . These observations are assumed to come from the same
source that generated the training set , that is, from the same (unknown) probability
distribution . It should be noted that depends also on . The expected generalization error given a
particular and a training
set iswhere the expectation is taken with
respect to the probability distribution . We are now interested inwhere the expectation is taken with
respect to all possible realizations of training sets with fixed
sample size . According to Geman et al. [15], the bias/variance
decomposition of iswhere is the
deterministic part of the data and is the variance
of given . Balancing between the bias and the variance
terms
is a crucial problem in model building. If we try to
decrease the bias term on a specific training set, we usually increase the
variance term and vice versa. We now consider the case of an ensemble average , consisting of individual
models as defined in (1). If we put this into (4), we getand we can have a look at the
effects concerning bias and variance. The bias term in (5) is just the average
of the biases of the individual models in the ensemble. So we should not expect
a reduction in the bias term compared to single models. According to Naftaly et
al. [19], the variance
term of the ensemble could be decomposed in the following way:where the expectation is taken
with respect to . The first sum in (6) marks the lower bound of the
ensemble variance and is the weighted mean of the variances of the ensemble
members. The second sum contains the cross terms of the ensemble members and
disappears if the models are completely uncorrelated [13]. So the reduction in the
variance of the ensemble is related to the degree of independence of the single
models [19].
4. Cross-Validation and Model Selection
Our model selection scheme is a mixture of bagging [20] and cross-validation. Bagging or Bootstrap
aggregating was proposed by Breiman [20] in order to improve the classification by combining
classifiers trained on randomly generated subsets of the entire training sets.
We extended this approach by applying a cross-validation scheme for model
selection on each subset and after that we combine the selected models to an
ensemble. In contrast to classical cross-validation, we use random subsets as
cross-validation folds. In -fold
cross-validation, the dataset is partitioned into subsets. Of
these subsets, a
single subset is retained as the validation data for testing the model, and the
remaining subsets are
used for model training. The cross-validation process is then repeated times with each
of the subsets used
only once as the validation data. The results from
the folds then can be averaged to produce a single estimation.
If we lack relevant problem-specific knowledge,
cross-validation methods could be used to select a classification method
empirically [21]. This
is a common approach because it seems to be obvious that no classification method
is uniformly superior, see, for example, Quinlan [22] for a detailed study. It is
also a common approach to select the model parameters with cross-validation
[23]. The idea to
combine the models from the cross-validation
folds (stacking) was described by Wolpert [24].
We suggest to train several models on each CV-fold, to
select the best performing model on the validation set, and to combine the
selected models from the -folds. If we
train models of one type but with different initial conditions (e.g., neural
networks with different numbers of hidden neurons), then we find proper values
for the free parameters of the model. We could extend that
by combining models from different classes in order
to increase the model diversity. We call this a heterogeneous ensemble or mixed ensemble and applied this method effectively to regression
problems [25] and
classification tasks [26].
Our model selection scheme works as follows: for the -fold CV, the
data is divided -times into a training
set and a test set, both sets containing randomly drawn subsets of
the data without replications. The size of each test set was 25% of the entire
dataset.
In every CV-fold, we train several different models
with a variety of model parameters (see Section 5 for an overview of the models
and the related model parameters). In each fold, we select only one model to
become a member of the final ensemble (namely, the best model with respect to
the test set). This means that all models have to compete with each other in a
fair tournament because they are trained and validated on the same dataset. The
models with the lowest classification error in each CV-fold are taken out and
added to the final ensemble, receiving the weight (see (1)). All
other models in this CV-fold are deleted.
We can use this model selection scheme in two ways. If
we have no idea or prior knowledge, where
classification or regression method should be used to cope with a specific
problem, we could use this scheme in order to look for an empirical answer and
to compare the performance of the different model classes. The other way is the
estimation of model parameters for the different model classes described in
Section 5.
5. Classification Models
In this section, we give a short overview of the model classes that we used for
ensemble building. All models belong to the well-established collection of
machine-learning algorithms for classification and regression tasks, so details
can be found in the textbooks like, for instance, Hastie et al. [2]. The implementation of these
models in an open-source toolbox together with a more detailed description can
be found in [9]. The
toolbox is an open-source MATLAB Toolbox which allows the integration of
existing implementations of classification algorithms and it contains more then
the few model classes described in the text.
5.1. Linear Discriminant Analysis
The LDA is a simple but useful classifier. If we assume that the two classes have a Gaussian
distribution with mean and they share
the same covariance matrix , then the linear discriminant function , is given by
where denotes the frequency of occurrence of the class labels. The predicted class labels are given by
We also implemented two modifications: the quadratic discriminant analysis (QDA) and the PDA, as
described in detail in Hastie et al. [2]. Linear method are usually conceptually simple,
robust, fast, and, in particular in high-dimensional problems, they could be
very powerful.
5.2. Logistic Regression Model
Logistic regression (Log.Reg.) is a model for binomial distributed dependent variables
and is used extensively in the medical and social sciences. Hastie et al.
[2] pointed out that
the Logistic Regression model has the same form as the LDA, the only difference
lies in the way, the linear coefficients are estimated. See Hosmer and Lemeshow for a
detailed introduction [27]. We used the binary Log.Reg. to compute the probability
of the dichotomic variable (PCa or non-PCa) from the independent
variables :
with
wherein the model coefficients
are estimated with a second-order gradient decent (quadratic approximation to
likelihood function). This could be a critical issue in high-dimensional problems because these calculations are time
and memory consuming.
5.3. Multilayer Perceptron
We train a multilayer feed-forward neural network “MLP” with
a sigmoid activation function. The weights are initialized with Gaussian-distributed
random numbers having zero mean and scaled variances. The weights are trained
with a gradient descend based on the Rprop algorithm [28] with the improvements given
in [29]. The MLP works
with a common weight decay with the penalty term
where denotes the -dimensional
weight vector of the MLP and a small regularization parameter . The number of hidden layers, the number of neurons,
and the number of regularization parameter are
adjusted during the CV-training. We further applied the concept of an -insensitive
error loss that we introduced in the context of cellular neural networks (CNNs)
[30].
5.4. Support Vector Machines
Over the last decade, SVMs have become very powerful tools in machine learning. An SVM
creates a hyperplane in a feature space that separates the data into two
classes with the maximum margin. The feature space can be a mapping of the
original features into a higher-dimensional space using a positive semidefinite function:
The function is called the kernel function and the so-called kernel trick uses Mercer's condition,
which states that any positive semidefinite kernel can be
expressed as a dot product in a highdimensional space (see [31] for a detailed
introduction). The theoretical foundations of this approach were given by
Vapnik's statistical learning theory [6, 32] and later extended to the nonlinear case [7]. We use an implementation of
SVMs that is based on the libsvm provided by Chang and Lin [33] with the standard
kernels:
The parameters of the model are, with respect to the kernel-type, the polynomial degree , the width of
the rbf and the value
concerning the cost of constrain violation during the SVM training.
5.5. Trees
Trees are conceptually simple but powerful tools for classification and regression. For
our purpose, we use the classification and regression trees (CARTs) as
described in Breiman et al. [4]. The main feature of the CART algorithm is the binary
decision role that is introduced at each tree node with respect to the
information content of the split. In this way, the most discriminating binary
splits are near the tree root building an hierarchical decision scheme. A
sketch of a decision tree is shown in Figure 3. It is known
that trees have a high variance, so they benefit from the ensemble approach
[20]. These trees
ensembles are also know as random forests. The parameters of the tree
models are related to splitting the tree nodes (the impurity measure and the
split criterion, see [2] for a detailed description).
5.6. Nearest-Neighbor Classifier
A -nearest-neighbor
classifier (KNN) takes a weighted average over the labels of those
observations in the training
set that are closest to the query point . This denotes as
where denotes the -element
neighborhood of , defined in a given metric, and is the related
distance. Common choices are the , , and the metrics. The
parameters of the model are the number of neighbors and the choice of the
metric. KNNs offer a very intuitive approach to classification problems because
they are based on the concept of similarity. This works fine in lower
dimensions but leads to problems in higher dimensions, known as the curse of
dimensionality [34].
6. Application to the Clinical Data
We compared the model classes described above in a unified framework under fair conditions.
Thus, we trained an ensemble of each model class consisting of 11 ensembles
members (11 CV-folds in the training scheme described in Section 4). The
performance of each ensemble model was assessed on the 20% of data (validation
set), which was initially removed and never included in model training (see
Figure 2). This procedure was independently repeated 20 times. This means that
all model-building processes, that is, the random removal of 20% of the data,
the construction of a classification model ensemble on the remaining 80% of the
data as outlined in Section 4, and the final test on the unseen validation data
were performed each time. Finally, the mean average prediction values with
respect to the validation set were calculated and are listed in Table 2. In some cases,
it is useful to apply a kind of data preprocessing
like balancing. If the distribution of the two classes differ in the sense,
that one class is only represented with a small number of examples, we can
balance the data in the training set. This can improve the convergence of
several training algorithms and has also an impact to the classification error
[35]. We apply
balancing in the way that we reduce the number of samples in the one class
until we have an balanced ratio of the class labels. The ratio of the class
labels in the validation set was never changed because it reflects the real
data distribution. Balancing was only applied to the training data. We used
three different performance measures in order to compare the different
classification models. Therefore, we have to define the four possible outcomes
of a classification that can be formulated in a confusion
matrix, as shown in Table 1. The
accuracy,
seems to be the canonical error
measure for almost all classification problems if the dataset is balanced.
Other important measures are the specificity that quantifies how well a binary
classification model correctly identifies the negative cases (non-PCa
patients),and the sensitivity, which is
the proportion of true positives of all diseased cases (PCa patients) in the
population,A high sensitivity is required
when early diagnosis and treatment are beneficial,
which is the case in PCa.
Table 1: The confusion matrix for a binary classification problem.
Table 2: The average performance of several classifier
ensembles with respect to the validation set which was initially removed and
never included in model training. We show the mean and the standard deviation
values from 20 independent validation runs, no preprocessing was used.
Figure 2: For every partition of the cross-validation, the data is divided in a training and a test set. The performance of
each ensemble model was assessed on validation set which was initially removed and never included in model training.
Figure 3: A sketch of a classification tree, wherein the leaves represent classes and the branches represent conjunctions
of features that lead to those classes.
The precision or positive predictive value (PPV) is
given byand is the proportion of
patients with positive test results who are correctly diagnosed. The F-Score is
the harmonic mean of precision and sensitivity,and it is
useful if the classes in the classification problem
are not equally distributed. Another measure is the area under curve (AUC)
wherein the curve is the receiver operating characteristic
(ROC-curve)curve. The ROC-curve is the graphical
plot of the sensitivity versus the (1-specificity) for a binary classifier as
its discrimination threshold is varied.
The ROC-curve offers the opportunity to calculate the
specificity at a fixed sensitivity level and vice versa. This is important
because, from the clinical point of view, a high sensitivity 95% is wanted to
detect all patients with PCa first. To avoid a high false-positive rate, we
computed the specificity at the level of 95% sensitivity
(SPS95) from the ROC-curve as another important performance measure.
To have an impression about the correct classified
non-PCa patients in this case, we computed the specificity at the level of 95% sensitivity
(SPS95) from the ROC-curve. If we compare
the outcome of the statistical analysis of the model performance as listed in
Table 1 for the unbalanced case and in Table 2 for the balanced case, we can
state that the differences between the different classifiers are marginal. Even
the more sophisticated classification models (SVMs or Mixed Ensembles) could
not outperform the robust linear candidates (LDA/PDA).
Tables 2 and 3 present the main results with only
small differences between the classifiers. The standard deviations of the
performance measures are given except for the ROC-curve-based measures (AUC and
SPS95). Most papers in this field do not discuss this really complex problem
and it cannot be solved in the scope of this paper, but it should be mentioned.
As an example of a special solution of this problem, see the paper of Hilgers
[36].
Table 3: The average performance of several classifier
ensembles with respect to the validation set which was initially removed and
never included in model training. We show the mean and the standard deviation
values from 20 independent validation runs wherein the training data was
balanced.
7. Conclusions
We compared several
classification models with respect to their ability to recognize prostate
cancer in an early stage. This was done in an
ensemble framework in order to estimate proper model parameters and to increase
classification performance. It turned out that all models under investigation
are performing very well with only marginal differences and are compareable
with similar studies, like, for example, Finne et al. [37], Remzi et al. [38], or Zlotta et al. [39]. In future research, it
should be investigated whether these results are valid for other populations of
patients (e.g., screening data) and other PSA test assays and whether the
performance of classification could be increased by including new variables or
by splitting the groups of patients into different PSA ranges.