Advances in Fuzzy Systems

Volume 2012 (2012), Article ID 920920, 7 pages

http://dx.doi.org/10.1155/2012/920920

## Classifying High-Dimensional Patterns Using a Fuzzy Logic Discriminant Network

^{1}Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada R3T 2N2

^{2}Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada T6R 2G7

Received 28 July 2011; Accepted 8 December 2011

Academic Editor: Maysam Abbod

Copyright © 2012 Nick J. Pizzi and Witold Pedrycz. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Although many classification techniques exist to analyze patterns possessing straightforward characteristics, they tend to fail when the ratio of features to patterns is very large. This “curse of dimensionality” is especially prevalent in many complex, voluminous biomedical datasets acquired using the latest spectroscopic modalities. To address this pattern classification issue, we present a technique using an adaptive network of fuzzy logic connectives to combine class boundaries generated by sets of discriminant functions. We empirically evaluate the effectiveness of this classification technique by comparing it against two conventional benchmark approaches, both of which use feature averaging as a preprocessing phase.

#### 1. Introduction

Biomedical spectroscopic modalities produce information-rich but complex, voluminous data [1]. For instance, magnetic resonance spectroscopy, which exploits the interaction between an external homogeneous magnetic field and a nucleus that possesses spin, is a reliable and versatile spectroscopic modality [2, 3]. Coupled with robust multivariate discrimination methods, it is especially useful in the interpretation and classification of high-dimensional biomedical spectra (patterns) of tissues and biofluids [4]. However, the ratio of the number of features to the number of patterns for these data is typically very large; the feature space dimensionality is O(10^{3}-10^{4}) while the number of patterns is O(10–100). This “curse of dimensionality” [5, 6] is a serious challenge for the classification of complex biomedical spectra: the excess degrees of freedom tend to cause overfitting, which significantly affects the reliability of the chosen classifier by diminishing its capability to determine effective generalizations.

We present a pattern classification technique, an extension to a method described in [7], that attenuates the confounding effects of the curse of dimensionality using an adaptive network of fuzzy logic connectives to combine pattern class boundaries generated by sets of discriminant functions based on sets of feature regions possessing high discriminatory power. We empirically evaluate the effectiveness of this classification technique by comparing it against two conventional benchmark approaches, both of which use feature averaging as a preprocessing phase.

Section 2 presents a brief discussion on pattern classification including pattern mapping, validation, discriminant analysis, and dimensionality reduction approaches. Details of our technique are presented in Section 3. Datasets, experiment design, and results are discussed in Section 4 followed by some concluding remarks.

#### 2. Biomedical Pattern Classification

##### 2.1. Mappings and Validation

We begin by defining some formal notation to precisely describe the problem of pattern classification, where n is the number of patterns (samples, vectors, individuals, or cases), p is the number of features (dimensions, attributes, or measurements), and c is the number of classes (groups). Let X = {(x_{i}, y_{i}) : i = 1, …, n} be a set of labeled patterns, where x_{i} ∈ ℝ^{p} and y_{i} is a class label. Typically, y_{i} ∈ {1, …, c}; however, it is often advantageous [8] to use 1-of-c encoding for the class labels for iterative classifiers such as artificial neural networks [2]; namely, y_{i} ∈ {0, 1}^{c}, where, for a pattern belonging to class j, y_{ij} = 1 and y_{ik} = 0 (k ≠ j). A classifier is a system that determines a mapping, f: ℝ^{p} → {1, …, c}. Using f, if a classifier predicts that the class label for x_{i} is ŷ_{i} = f(x_{i}), then a correct classification occurs when ŷ_{i} = y_{i}. It is considered a misclassification (a classification error) if ŷ_{i} ≠ y_{i}.

Unfortunately, many investigations involving pattern classification are biased as they use the entire dataset to determine the mapping. This approach leads to overly optimistic pattern classification results and does not take into account the possibility of overfitting; that is, the mapping becomes a simple table lookup between the given patterns and class labels, thereby possessing no generalized predictive power for new (unseen) patterns. To compensate for this bias, it is essential to perform some type of validation [9, 10]. For instance, patterns in X may be randomly allocated to a design (training) subset, X_{D}, containing n_{D} patterns, or a validation (test) subset, X_{V}, containing n_{V} patterns (n_{D} + n_{V} = n). Now, a mapping f is determined using only the design patterns, X_{D}, but the classification performance is measured using f with the validation patterns, X_{V}.

Classification performance is measured using the “confusion matrix”, C, of the desired class labels versus the predicted class labels. If the class prediction for a validation pattern with desired label i is j, then the element c_{ij} of the confusion matrix is incremented by one (perfect accuracy is reflected by zeroes on the off-diagonal and nonzeroes on the diagonal). The conventional performance measure is the ratio of correctly classified patterns to the total number of patterns, ψ = (∑_{i} c_{ii})/n_{V}, where c_{ii} is the number of class i validation patterns predicted, by the mapping f, to belong to class i. While other measures exist, such as the average class-wise accuracy, receiver operating characteristic graphs (ROC curves) [11], or the kappa score (a chance-corrected measure of agreement) [12], for the sake of clarity during the discussion of the experiment results, we will use ψ.
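As a minimal sketch of these performance measures (plain Python with NumPy; the function names are ours, introduced here for illustration, not taken from the paper):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, c):
    """Element C[i, j] counts validation patterns of desired class i
    that the classifier predicted as class j."""
    C = np.zeros((c, c), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

def psi(C):
    """Ratio of correctly classified patterns (the diagonal) to the
    total number of validation patterns."""
    return np.trace(C) / C.sum()
```

For example, five validation patterns with one error yield ψ = 4/5; perfect accuracy puts every count on the diagonal, giving ψ = 1.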

##### 2.2. Discriminant Functions

Linear discriminant analysis (LDA) [13] is a conventional classification approach that determines linear boundaries between classes while taking into account interclass and intraclass variances. If the error distributions for the classes are the same (identical covariance matrices), LDA constructs the optimal linear boundary between the classes. In real-world situations, this optimality is seldom achieved since different classes typically give rise to different distributions.

LDA allocates a pattern, x, to the class for which the posterior probability is greatest. That is, x is allocated to class j if p_{j}f_{j}(x) ≥ p_{k}f_{k}(x) for all k, where p_{j} is the class’ prior (or proportional) probability. The discriminant function for class j is

d_{j}(x) = x^{T}Σ^{−1}μ_{j} − (1/2)μ_{j}^{T}Σ^{−1}μ_{j} + ln p_{j}, (1)

where μ_{j} is the mean for class j and Σ is the covariance matrix of the patterns in X_{D}. The feature space hyperplane separating class j from class k is defined by d_{j}(x) = d_{k}(x). Figure 1 illustrates the class boundaries defined by a set of linear discriminant functions for a two-dimensional dataset with three classes (c = 3). As mentioned in Section 2.1, when LDA is used for pattern classification, it is imperative to define the discriminant functions using the design patterns, X_{D}, but to validate the performance using the validation patterns, X_{V}.
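The allocation rule can be sketched compactly; the version below assumes the standard pooled-covariance linear discriminant and uses a pseudo-inverse for numerical stability (all names are illustrative):

```python
import numpy as np

def lda_discriminants(X, y, priors):
    """Fit linear discriminant functions under the identical-covariance
    assumption, using a pooled covariance estimate.  Returns a function
    d(x) giving one discriminant value per class; allocate x to the
    class whose value is greatest."""
    classes = np.unique(y)
    n, p = X.shape
    mus = np.array([X[y == j].mean(axis=0) for j in classes])
    # pooled within-class covariance
    S = sum(np.cov(X[y == j].T, bias=False) * (np.sum(y == j) - 1)
            for j in classes) / (n - len(classes))
    Sinv = np.linalg.pinv(np.atleast_2d(S))  # pseudo-inverse for stability

    def d(x):
        return np.array([x @ Sinv @ m - 0.5 * m @ Sinv @ m + np.log(pj)
                         for m, pj in zip(mus, priors)])
    return d
```

A pattern near one class mean then produces the largest discriminant value for that class, which is the allocation rule described above.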

The support vector machine (SVM) [14, 15] is an important family of supervised learning algorithms that select models that maximize the error margin of a training subset. This approach has been successfully used in a wide range of data classification problems [16]. Given a set of patterns that belong to one of two classes, an SVM finds the hyperplane leaving the largest possible fraction of patterns of the same class on the same side while maximizing the distance of either class from the hyperplane. The approach is usually formulated as a constrained optimization problem and solved using constrained quadratic programming. While the original approach [17] could only be used for linearly separable problems, it may be extended by employing a “kernel trick” [18] that exploits the fact that a nonlinear mapping of sufficiently high dimension can project the patterns to a new parameter space in which classes can be separated by a hyperplane. In general, it cannot be determined a priori which kernel will contribute to producing the best classification results for a given dataset, and one must rely on heuristic (trial and error) experimentation. Common kernel functions K(x, y), for patterns x and y, are power, K(x, y) = (x·y)^{d}; polynomial, K(x, y) = (x·y + 1)^{d}; sigmoid, K(x, y) = tanh(κx·y + θ); and Gaussian, K(x, y) = exp(−‖x − y‖^{2}/2σ^{2}).
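These common kernel functions are easily written out; the sketch below uses standard forms, with default parameter values (d, κ, θ, σ) chosen purely for illustration rather than taken from the paper:

```python
import numpy as np

def power_kernel(x, y, d=2):
    """Power kernel: (x . y)^d."""
    return np.dot(x, y) ** d

def polynomial_kernel(x, y, d=2):
    """Polynomial kernel: (x . y + 1)^d."""
    return (np.dot(x, y) + 1) ** d

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    """Sigmoid kernel: tanh(kappa * x . y + theta)."""
    return np.tanh(kappa * np.dot(x, y) + theta)

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```

In practice one evaluates K over all training pairs to form the Gram matrix that the quadratic program operates on; trying several kernels, as noted above, is the usual heuristic.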

##### 2.3. Feature Reduction

As with any pattern classifier, LDA becomes unreliable when there are a large number of features. Even when using stable methods such as singular value decomposition, the inversion of the covariance matrix in (1) becomes unstable, so it becomes imperative to preprocess the features. A preprocessing strategy to use when p is very large (curse of dimensionality) is to reduce the dimensionality of the feature space of the patterns; that is, we find a mapping (transformation), g: ℝ^{p} → ℝ^{q}, where q ≪ p. Now, the classification mapping becomes the composition f ∘ g. A standard approach to feature space reduction is to take the averages of a fixed number of contiguous feature regions. Although this type of averaging may often work well in attenuating the effects of the curse of dimensionality, it also has a tendency to wash away information content. Other feature reduction approaches do not transform the original feature space but rather attempt to find those features that possess the greatest discriminatory power [19–22]. One example of this type of approach is stochastic feature selection.
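Contiguous feature averaging, the reduction g used by the benchmarks in Section 4, can be sketched in a few lines (a minimal version; the handling of trailing features is our assumption, as the paper does not specify it):

```python
import numpy as np

def average_features(X, window):
    """Reduce dimensionality by averaging contiguous, non-overlapping
    windows of `window` features.  Trailing features that do not fill a
    complete window are dropped in this sketch."""
    n, p = X.shape
    q = p // window  # reduced dimensionality
    return X[:, :q * window].reshape(n, q, window).mean(axis=2)
```

For the spectra discussed in Section 4, a window of 23 reduces 4255 original features to 185 averaged features, one of the benchmark feature-set sizes.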

##### 2.4. Stochastic Feature Selection

Stochastic feature selection (SFS) [23] is a feature selection/reduction preprocessing strategy that may be used with any homogeneous or heterogeneous set of classifiers (e.g., LDA, artificial neural networks, support vector machines). Essentially, SFS iteratively presents, in a highly parallelized fashion, many feature regions (contiguous subsets of pattern features) to the set of classifiers, retaining the best set of classifier/region pairs. While SFS has a rich set of parameters to control many different aspects of the classification process, here we present only those aspects that are relevant to this discussion and refer the reader to [23] for a thorough description of this strategy. For a pattern x, we define a region, r = (x_{a}, x_{a+1}, …, x_{b}), to be a contiguous subset of its features. The user specifies the minimum and maximum number of regions to be selected for each classification iteration as well as the minimum and maximum length for a feature region. SFS exploits the quadratic combination of (disjoint or overlapping) feature regions. The intent is that if the original feature space has nonlinear boundaries between classes, the new (quadratic) parameter space may have boundaries that are more linear. Given the feature region r, SFS has three categories of quadratic combinations: using the original feature region, r; squaring the feature values for r; or using all pair-wise feature cross products from two regions, r and s, producing all products x_{i}x_{j} with x_{i} ∈ r and x_{j} ∈ s. The fitness function (classification performance measure) is the classification accuracy, ψ. In this study, the only classifier that is used is LDA. When SFS is finished, it returns the best set of classifier results (the cardinality of the set is user-specified) where each result contains (i) the value of ψ, (ii) the indices (to the original features) of the set of feature regions selected, and (iii) the discriminant functions for each class as determined by LDA using the selected feature regions.
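The three categories of quadratic combination can be sketched directly (helper names are ours, for illustration only):

```python
import numpy as np

def region(x, a, b):
    """Contiguous feature region x_a, ..., x_b (inclusive, zero-based)."""
    return x[a:b + 1]

def squared(r):
    """Category 2: square each feature value in the region."""
    return r ** 2

def cross_products(r, s):
    """Category 3: all pair-wise feature cross products of two regions."""
    return np.outer(r, s).ravel()
```

Note how the cross-product combination of a length-m and a length-k region yields m·k derived features, which is why Section 4.2 reports counts of quadratically combined features rather than raw region lengths.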

##### 2.5. Fuzzy Adaptive Logic Network

Our approach builds upon the fuzzy adaptive logic network (cf. [24] for a thorough description). This approach, which can be used for pattern classification, combines two different subsystems within its general architecture. A neurocomputing subsystem uses a set of perceptrons to construct class boundaries to delineate patterns from different classes. Via a set of respective weights, w_{i}, and inputs, x_{i}, a perceptron is defined as z = φ(∑_{i} w_{i}x_{i} + w_{0}), where φ is a transfer function (any sigmoid function but often the logistic function), whose argument describes a hyperplane in the input space. This geometric information is then presented to the logic processing subsystem that comprises a layer of fuzzy conjunctions (“and” elements) and another layer of fuzzy disjunctions (“or” elements). The intent is to use these fuzzy logic connectives to combine the hyperplanes from the neurocomputing subsystem to form convex hull-like topologies. For instance, a convex region delineated by perceptrons may be represented by the compound logic predicate, P_{1} and P_{2} and … and P_{m}, which produces values close to one (meaning it becomes true) when all contributing predicates are *true* (i.e., the respective perceptrons produce high outputs). To capture the geometric notion of disjoint regions, one may take a union (in the set theoretic sense) of the individual regions described by the P’s: R_{1} or R_{2} or … or R_{q}. To implement these fuzzy predicates, one uses t-norms to model the *and* logic connectives and s-norms to model the *or* logic connectives. A t-norm, a t b, is a function from [0, 1]^{2} to [0, 1] that is commutative, associative, monotonic, and satisfies the boundary conditions a t 0 = 0 and a t 1 = a, while the boundary conditions for the s-norm, a s b, are a s 0 = a and a s 1 = 1. The fuzzy *or* and *and* connectives may now be defined as

y_{or} = S_{i}(x_{i} t w_{i}) and y_{and} = T_{i}(x_{i} s w_{i}),

where x_{i} is the input and w_{i} are the corresponding adjustable weights (connections) confined to the unit interval. In the case of y_{or}, the greater the weight value the more relevant the respective input (if all weights are 1, it becomes a standard *or* gate). In the case of y_{and}, the greater the weight value, the less relevant the respective input (if all weights are 0, it becomes a standard *and* gate). If we restrict ourselves to differentiable t- and s-norms, a gradient descent strategy can be used to train a fuzzy adaptive logic network (cf. [24] for details).
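These fuzzy connectives can be sketched concretely with the product t-norm and probabilistic-sum s-norm (the pair adopted by FLND in Section 3); the function names are ours:

```python
def t_norm(a, b):
    """Product t-norm: a t b = ab."""
    return a * b

def s_norm(a, b):
    """Probabilistic-sum s-norm: a s b = a + b - ab."""
    return a + b - a * b

def or_neuron(x, w):
    """y = S_i (x_i t w_i): weight 1 makes an input fully relevant,
    so with all weights 1 this reduces to a standard 'or' gate."""
    y = 0.0
    for xi, wi in zip(x, w):
        y = s_norm(y, t_norm(xi, wi))
    return y

def and_neuron(x, w):
    """y = T_i (x_i s w_i): weight 0 makes an input fully relevant,
    so with all weights 0 this reduces to a standard 'and' gate."""
    y = 1.0
    for xi, wi in zip(x, w):
        y = t_norm(y, s_norm(xi, wi))
    return y
```

On Boolean inputs with extreme weights the neurons behave exactly like their crisp counterparts, which is the property exploited when combining discriminant boundaries into convex, hull-like regions.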

#### 3. Fuzzy Logic Network with Linear Discriminants

Building upon the concepts described in Section 2, we now describe our pattern classification algorithm, FLND (fuzzy logic network with discriminants). There are four major steps to the FLND algorithm: (i) use SFS to find the best sets of feature regions using the patterns from the design subset, X_{D}; (ii) for each set of feature regions, compute the linear discriminant function for each class and then compute the discriminant values for each design pattern; (iii) use a genetic algorithm to determine the optimal weights for the fuzzy logic network given the design pattern discriminant values found in (ii); (iv) use the patterns from the validation subset, X_{V}, to assess the classification performance, ψ, using the selected feature regions and discriminant function values. Figure 2 illustrates the architecture of the FLND system.

Let us now look at each algorithmic step in more detail. In the experiments described in Section 4, SFS uses LDA as the sole classifier and the classification accuracy, ψ, is the performance measure. After a set number of iterations, SFS returns the m best sets of feature regions, R_{1}, R_{2}, …, R_{m}, and the respective discriminant functions for each class computed using those feature regions (feature region sets are sorted by ψ). A set of feature regions is of the form R_{k} = {r_{1}, r_{2}, …, r_{t_{k}}}, where t_{k} is the total number of regions for set k and each r is a single contiguous feature region as described in Section 2.4. The discriminant functions are computed using the features in R_{k} rather than all p features. The input space is now no longer the original p features but rather the respective discriminant function values for each class and each feature region set, which is a significant reduction in the dimensionality of the input space (c·m values rather than p features).

The fuzzy logic network component of FLND uses the product (a t b = ab) and probabilistic sum (a s b = a + b − ab) for the t- and s-norms, respectively, with a user-selected number of *and* connectives and *or* connectives. There are two deficiencies with this component that do not exist with the fuzzy adaptive logic network described in Section 2.5. First, while perceptron output maps onto the unit interval (due to the sigmoidal nature of its transfer function), which is necessary for input into a fuzzy logic connective, values from linear discriminant functions map onto the entire real line. This can be easily dealt with by rescaling the linear discriminant values prior to presentation to the fuzzy logic network ((d − min)/(max − min), where min and max are the respective minimum and maximum over all discriminant function values).

The second, more serious, issue is that a gradient descent strategy cannot be used to minimize the network error (i.e., optimize the weights) since the weight adjustments are now based on discrete sets of discriminant functions rather than differentiable perceptron output. We deal with this issue by using a straightforward implementation of a genetic algorithm (GA) [9, 25, 26] to perform the structural optimization of the network. While much slower than a gradient descent approach, it still provides more than adequate computational performance. We implemented a conventional genetic algorithm as described in [27], but other more sophisticated GA variants could certainly be explored. The crossover rate was set to 0.10, and the mutation rate was set to 0.007 for all experiments listed in Section 4.
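A minimal real-coded GA over the unit-interval weights might look like the following sketch; the crossover and mutation rates mirror those quoted above, but the selection and variation operators themselves are illustrative choices rather than the specific implementation of [27]:

```python
import numpy as np

def genetic_optimize(fitness, n_weights, pop_size=50, iters=40,
                     crossover_rate=0.10, mutation_rate=0.007, rng=None):
    """Maximize `fitness` over weight vectors on [0, 1]^n_weights using
    truncation selection, single-point crossover, and uniform mutation."""
    rng = np.random.default_rng(rng)
    pop = rng.random((pop_size, n_weights))
    for _ in range(iters):
        scores = np.array([fitness(ind) for ind in pop])
        pop = pop[np.argsort(scores)[::-1]]       # best individuals first
        parents = pop[:pop_size // 2]             # keep the top half (elitism)
        children = parents.copy()
        for child in children:
            if rng.random() < crossover_rate:     # single-point crossover
                mate = parents[rng.integers(len(parents))]
                cut = rng.integers(1, n_weights) if n_weights > 1 else 0
                child[cut:] = mate[cut:]
            mask = rng.random(n_weights) < mutation_rate
            child[mask] = rng.random(mask.sum())  # uniform mutation on [0, 1]
        pop = np.vstack([parents, children])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)]
```

In FLND the fitness of a candidate weight vector would be the design-subset accuracy ψ of the resulting fuzzy logic network; any black-box fitness works, which is what makes the GA applicable where gradient descent is not.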

Finally, all performance results, reported using ψ, are based on the class predictions of FLND using the patterns from the validation subset. Further, the results are also benchmarked against conventional applications of LDA and SVM.

#### 4. Experiments and Discussion

##### 4.1. Synthetic Datasets

We begin our experiments with the two-dimensional exclusive-or problem (n = 4, p = 2, c = 2). Intuitively, one expects that LDA would perform poorly in this case as no hyperplane can act as a class boundary to perfectly separate the two classes of patterns, {(0, 0), (1, 1)} versus {(0, 1), (1, 0)}. Using LDA, this is actually the case, with ψ = 0.50 (one pattern misclassification for each class). As this is a strictly pedagogical experiment, we skip the validation exercise, do not bother with SFS, and move directly to the fuzzy logic network. Setting the initial GA population to 200, the number of iterations to 100, and the number of connectives to 2, we now get perfect accuracy, ψ = 1.00. The weights for the two connectives are and . The weights for the subsequent two connectives are and .

The second synthetic dataset is a variant of the exclusive-or dataset described above (). A pattern belongs to the first class if all of its features are identical; otherwise, it belongs to the second class. Figure 3 is a plot of the first two features of this dataset. The initial GA population is 800, the number of iterations is 100, and the number of connectives is 10 (as with the previous experiment, we do not use SFS). In this case, LDA again performed extremely poorly, , while FLND produced a significantly superior classification accuracy, . Table 1 lists the confusion matrices for LDA and FLND using this dataset. For completeness, we also list the weights for the connectives, , , , , , , , , , , and the connectives, , .

##### 4.2. Magnetic Resonance Spectra

Magnetic resonance spectra (patterns) of a biofluid were acquired and used to measure the effectiveness of FLND for the classification of a complex, voluminous, “real world” biomedical dataset. In this case, n = 150, with 89 spectra belonging to class 1 (“normal”) and 61 spectra belonging to class 2 (“abnormal”). These spectra were randomly allocated to the design subset (n_{D} = 80, with 40 normal spectra and 40 abnormal spectra) or the validation subset (n_{V} = 70, with the remaining 49 normal spectra and 21 abnormal spectra).

For this dataset, the following SFS parameters were used with FLND: the range for the number of feature regions, 2–5; the range for the number of features within a region, 2–20; and 10^{4} iterations. The fuzzy logic network parameters were: crossover rate, 0.10; mutation rate, 0.008; size of GA population, 1200; and 50 GA iterations.

Table 2 lists the confusion matrices for FLND with the design patterns and validation patterns; for the validation patterns, ψ = 0.83. Moreover, 82% of the normal (class 1) validation patterns were correctly classified and 86% of the abnormal (class 2) validation patterns were correctly classified. The latter result is especially advantageous as, for many confirmatory biomedical data analysis problems, it is important to have a low false positive rate (i.e., predictions for abnormal conditions should be as accurate as possible).

Table 3 lists the best sets of discriminatory feature regions, **R**, found by FLND. For each entry, we list the specific regions selected, how those regions were combined, and the total number of individual features used. Interestingly, over half of the selected discriminatory regions fell in the approximate range 3050–3850, which likely indicates that the biological metabolites represented by this spectral region are particularly germane in distinguishing between normal and abnormal states for the underlying biofluid being investigated. Also important to note is that most of the entries used quadratic combinations of the corresponding feature regions, with the top three results using the pair-wise cross products of the respective regions. Finally, the dimensionality of the feature space is only 4% that of the original space (180 quadratically combined features versus 4255 original spectral features).

##### 4.3. Benchmark Comparisons

We now compare the FLND results from Section 4.2 with two classifier benchmarks, SVM and LDA. First, we use SVM and LDA to construct mappings using all 4255 features. Subsequently, for each classifier, feature averaging is used as a preprocessing technique, which is a typical strategy for voluminous biomedical spectra, in order to reduce the complexity of the classification problem [28–31]. By reducing the dimensionality of the feature space, we hope to address the curse of dimensionality. Furthermore, averaging has a tendency to attenuate noise signatures. In our specific case, the original features are contiguously averaged using varying window sizes (with no overlap) to produce six sets of averaged features of size 851, 185, 115, 37, 23, and 5, respectively. We use proportional class probabilities for LDA and all SVM kernels listed in Section 2.2. For clarity, in the case of SVM, we report only the best results for each averaged feature set. Table 4 lists the validation subset classification results (confusion matrices and ψ) using the benchmarks with feature averaging. In no case did the benchmarks outperform FLND. Using all of the original features, both benchmarks performed poorly. For each benchmark, the best results occurred with 185 averaged features (a window size of 23); the best of these was ψ = 0.77, for SVM. We also note that classification results begin to degrade as the window size increases (i.e., the number of averaged features decreases). This is not uncommon as feature averaging can cause a washing away of information content present in biomedical spectra.

#### 5. Conclusion

We have empirically demonstrated the effectiveness of a classification technique that uses an adaptive network of fuzzy logic connectives to combine class boundaries generated by sets of discriminant functions based on collections of feature regions possessing high discriminatory power. Using a complex, voluminous “real world” biomedical dataset, FLND outperformed all classifier benchmarks in the classification of patterns from a validation subset. It achieved an 8% improvement in classification accuracy compared against the best benchmark result (0.83 versus 0.77 for SVM using feature averaging with a window size of 23). This increase in classification accuracy is achieved by taking the class boundaries described by the discriminant functions and using layers of fuzzy logic connectives to combine these boundaries into convex, nonlinear boundaries. This new method also significantly reduces the dimensionality of the input space as the original set of spectral features is replaced by a much smaller set of class discriminant values. This is a particularly useful characteristic when dealing with the curse of dimensionality (large feature to sample ratio), which is a prevalent property of many complex biomedical datasets acquired using current spectroscopic modalities.

While this classification technique has demonstrated the utility of merging fuzzy logic connectives with multivariate statistical discrimination, the investigation has also led to the identification of future areas of research to potentially improve its overall effectiveness and computational performance. First, rather than setting the number of fuzzy *and* connectives by the user *a priori*, it would be worthwhile to investigate a cascade approach to determining an optimal number of *and* connections that would be completely data-driven. Second, alternate structural optimizations to the fuzzy logic network need to be examined beginning with more sophisticated evolutionary computational approaches or exploiting recent advances in stochastic optimization techniques. Finally, a more intelligent rescaling strategy for the discriminant function values needs to be investigated. For instance, this may include a fuzzified (weighted) distance measure based on the proximity (belongingness) of a sample to all class boundaries.

#### Acknowledgments

Conrad Wiebe and Aleksander Demko are gratefully acknowledged for the implementation of the stochastic feature selection algorithm. The authors also thank the Natural Sciences and Engineering Research Council (NSERC) for its support of this investigation.

#### References

- D. L. Pavia, G. M. Lampman, G. S. Kriz, and J. A. Vyvyan,
*Introduction to Spectroscopy*, Harcourt Brace College, Fort Worth, Tex, USA, 2008. - M. Anthony and P. L. Bartlett,
*Neural Network Learning: Theoretical Foundations*, Cambridge University Press, Cambridge, UK, 2009. - W. Pedrycz, D. J. Lee, and N. J. Pizzi, “Representation and classification of high-dimensional biomedical spectral data,”
*Pattern Analysis & Applications*, vol. 13, no. 4, pp. 423–436, 2010. View at Google Scholar - N. J. Pizzi and W. Pedrycz, “Aggregating multiple classification results using fuzzy integration and stochastic feature selection,”
*International Journal of Approximate Reasoning*, vol. 51, no. 8, pp. 883–894, 2010. View at Publisher · View at Google Scholar · View at Scopus - F. Y. Kuo and I. H. Sloan, “Lifting the curse of dimensionality,”
*Notices of the American Mathematical Society*, vol. 52, no. 11, pp. 1320–1328, 2005. View at Google Scholar · View at Scopus - I. V. Oseledets and E. E. Tyrtyshnikov, “Breaking the curse of dimensionality, or how to use SVD in many dimensions,”
*SIAM Journal on Scientific Computing*, vol. 31, no. 5, pp. 3744–3759, 2009. View at Publisher · View at Google Scholar - N. J. Pizzi and W. Pedrycz, “A fuzzy logic network for pattern classification,” in
*Proceedings of the Annual Meeting of the North American Fuzzy Information Processing Society*, pp. 53–58, Cincinnati, Ohio, USA, June 2009. - N. Pizzi, L. P. Choo, J. Mansfield et al., “Neural network classification of infrared spectra of control and Alzheimer's diseased tissue,”
*Artificial Intelligence in Medicine*, vol. 7, no. 1, pp. 67–79, 1995. View at Publisher · View at Google Scholar · View at Scopus - U. M. Braga-Neto and E. R. Dougherty, “Is cross-validation valid for small-sample microarray classification?”
*Bioinformatics*, vol. 20, no. 3, pp. 374–380, 2004. View at Publisher · View at Google Scholar · View at Scopus - B. Efron and G. Gong, “A leisurely look at the bootstrap, the jackknife, and cross-validation,”
*The American Statistician*, vol. 37, no. 1, pp. 36–48, 1983. View at Google Scholar - J. A. Swets, “Measuring the accuracy of diagnostic systems,”
*Science*, vol. 240, no. 4857, pp. 1285–1293, 1988. View at Google Scholar · View at Scopus - B. S. Everitt, “Moments of the statistics kappa and weighted kappa,”
*The British Journal of Mathematical and Statistical Psychology*, vol. 21, pp. 97–103, 1968. View at Google Scholar - G. A. F. Seber,
*Multivariate Observations*, John Wiley & Sons, Hoboken, NJ, USA, 2004. - B. Schölkopf and A. J. Smola,
*Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*, MIT Press, Cambridge, Mass, USA, 2002. - V. Vapnik,
*Statistical Learning Theory*, John Wiley & Sons, New York, NY, USA, 1998. - L. Wang,
*Support Vector Machines: Theory and Applications*, Springer, Berlin, Germany, 2005. - V. Vapnik and A. Lerner, “Pattern recognition using generalized portrait method,”
*Automation and Remote Control*, vol. 24, pp. 774–780, 1963. View at Google Scholar - W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery,
*Numerical Recipes: The Art of Scientific Computing*, Cambridge University Press, Cambridge, UK, 3rd edition, 2007. - I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for cancer classification using support vector machines,”
*Machine Learning*, vol. 46, no. 1–3, pp. 389–422, 2002. View at Publisher · View at Google Scholar · View at Scopus - N. K. Kasabov and Q. Song, “DENFIS: dynamic evolving neural-fuzzy inference system and its application for time-series prediction,”
*IEEE Transactions on Fuzzy Systems*, vol. 10, no. 2, pp. 144–154, 2002. View at Publisher · View at Google Scholar · View at Scopus - Q. Liu, A. H. Sung, Z. Chen, and J. Xu, “Feature mining and pattern classification for steganalysis of LSB matching steganography in grayscale images,”
*Pattern Recognition*, vol. 41, no. 1, pp. 56–66, 2008. View at Publisher · View at Google Scholar · View at Scopus - E. K. Tang, P. N. Suganthan, and X. Yao, “Gene selection algorithms for microarray data based on least squares support vector machine,”
*BMC Bioinformatics*, vol. 7, article 95, 2006. View at Publisher · View at Google Scholar · View at Scopus - N. J. Pizzi, “Classification of biomedical spectra using stochastic feature selection,”
*Neural Network World*, vol. 15, no. 3, pp. 257–268, 2005. View at Google Scholar · View at Scopus - W. Pedrycz, A. Breuer, and N. J. Pizzi, “Fuzzy adaptive logic networks as hybrid models of quantitative software engineering,”
*Intelligent Automation and Soft Computing*, vol. 12, no. 2, pp. 189–209, 2006. View at Google Scholar · View at Scopus - D. E. Goldberg,
*Genetic Algorithms in Search, Optimization, and Machine Learning*, Addison-Wesley, Reading, Mass, USA, 1989. - R. L. Haupt and S. E. Haupt,
*Practical Genetic Algorithms*, John Wiley & Sons, Hoboken, NJ, USA, 2004. - C. Jacob,
*Illustrating Evolutionary Computation with Mathematica*, Academic Press, San Diego, Calif, USA, 2001. - R. M. Rangayyan,
*Biomedical Signal Analysis: A Case-Study Approach*, Wiley-IEEE Press, New York, NY, USA, 2001. - R. L. Somorjai, M. E. Alexander, R. Baumgartner et al., “A data-driven, flexible machine learning strategy for the classification of biomedical data,” in
*Artificial Intelligence Methods and Tools for Systems Biology*, W. Dubitzky and F. Azuaje, Eds., pp. 67–85, Springer, Dordrecht, The Netherlands, 2004. View at Google Scholar - T. Hastie, R. Tibshirani, and J. Friedman,
*The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd edition, Springer, New York, NY, USA, 2009. - B. C. Wheeler and W. J. Heetderks, “A comparison of techniques for classification of multiple neural signals,”
*IEEE Transactions on Biomedical Engineering*, vol. 29, no. 12, pp. 752–759, 1982. View at Google Scholar · View at Scopus