Abstract
Feature space heterogeneity often exists in many real world data sets so that some features are of different importance for classification over different subsets. Moreover, the pattern of feature space heterogeneity might dynamically change over time as more and more data are accumulated. In this paper, we develop an incremental classification algorithm, Supervised Clustering for Classification with Feature Space Heterogeneity (SCCFSH), to address this problem. In our approach, supervised clustering is implemented to obtain a number of clusters such that samples in each cluster are from the same class. After the removal of outliers, relevance of features in each cluster is calculated based on their variations in this cluster. The feature relevance is incorporated into distance calculation for classification. The main advantage of SCCFSH lies in the fact that it is capable of solving a classification problem with feature space heterogeneity in an incremental way, which is favorable for online classification tasks with continuously changing data. Experimental results on a series of data sets and application to a database marketing problem show the efficiency and effectiveness of the proposed approach.
1. Introduction
In classification problems, feature space heterogeneity is the phenomenon that a data set consists of some heterogeneous subsets, and the optimal features for classification are distinct over different subsets. The challenge of this problem is that we do not know how many heterogeneous subsets exist in the data set or which subset each sample belongs to. In the last decade, the problem of feature space heterogeneity in data has been addressed under different names, such as local feature relevance [1], case-specific feature weights [2], relevance in context [3], feature space and class heterogeneity [4], and attribute instability [5].
Feature space heterogeneity exists widely in various application fields of classification techniques, such as marketing, customs inspection decision, credit scoring, and medical diagnosis. For example, in marketing, a major concern of the market managers is to develop and implement efficient marketing programs by fully utilizing the customer databases and identifying the households that are most likely to be interested in the marketing programs. The above process can be formulated as a classification problem, in which the features (attributes) are characteristics of the households such as demographic, psychographic, and behavioral information, and the target variable is whether a household responds to the marketing messages. After the responding probability of each household is predicted by using certain classification techniques, the marketing messages are sent to those households with the highest probabilities. However, it is stated in Allenby and Rossi [6] that, as consumer preferences and sensitivities become more diverse, it becomes less and less efficient to consider the market in the aggregate. Desarbo et al. [7] also argue that predictions typically made with a single set of parameter values may not fully capture individual consumer differences in the sample. In other words, relevant features for predicting the responding probabilities may vary in different groups of customers. Therefore, it is important to take the feature space heterogeneity into consideration when solving the above marketing problems. Other similar examples can be found in customs inspection decision [8], medical diagnosis [9], rushes editing [10], and accident analysis [11].
If significant feature space heterogeneity exists in the data set and global feature selection is implemented for constructing a classification system, the resulting model is inevitably diffused by an averaging effect over the entire problem, since the best features on which to base the classification model vary in different subsets [4]. Therefore, dealing with feature space heterogeneity is an important issue in classification problems. Apte et al. [4] suggest that a logical first step is to decompose the classification problem with feature space heterogeneity into its constituent subproblems. In their study, an Important Profile Angle (IPA) is defined to indicate the degree to which the importance of each feature varies between two subproblems. The IPA is then used to guide the data set partitioning in an iterative way. However, the practical stopping criterion is still under investigation, and these ideas are not readily applicable for classification problems with numeric features. Therefore, an effective approach to classification problems with feature space heterogeneity is needed.
On the other hand, in some real world applications of classification techniques, new data are presented in sequence and added to the historical data set. Consequently, feature space heterogeneity existing in the historical data set might dynamically change over time. For example, in the customs inspection decision problem, there are thousands of declared goods waiting to be exported or imported every day. The customs officials have to decide whether an inspection is needed for declared goods by solving a classification problem [8]. Due to the variety and diversity of export/import trades, even in the same merchandise category, the relevant features may vary in different subcategories, as we have found in a research project sponsored by China Customs. Therefore, feature space heterogeneity is exhibited in this classification problem. Meanwhile, as more and more historical data are accumulated in the customs database, the underlying feature space heterogeneity might change and, accordingly, the classifier has to be updated for better accuracy. However, if we reconstruct the classifier each time a batch of new data enters the database, the computational burden would be high due to the continuity of the data stream, and the information and patterns learned in the past would be wasted. Moreover, constraints on the time and resources available for processing the data could hardly be met. To deal with this problem, the classification approach should be capable of incremental learning, which is an active research direction [12]. The main characteristics of incremental learning are as follows [13]: (1) examples are not all available a priori but become available over time, usually one at a time; (2) since learning may need to go on (almost) indefinitely, a classifier needs to respond quickly in an online manner and process the data in a continuous way.
In this paper, we develop a novel classification approach, Supervised Clustering for Classification with Feature Space Heterogeneity (SCCFSH), to address the above problems, that is, feature space heterogeneity and incremental learning. Our approach is based on the ECCAS algorithm proposed by Li and Ye [14]. The main idea of our approach is to first divide the sample set into a number of subsets by supervised clustering such that samples in each subset are with the same class label and then calculate the relevance of features in each subset. The feature relevance is then incorporated in calculating the distances used for classification. The main advantage of SCCFSH lies in the fact that it is capable of solving a classification problem with feature space heterogeneity in an incremental way, which is favorable for online classification tasks with continuously changing data set. Experimental results on a series of data sets show that the proposed SCCFSH could achieve favorable classification performance and be capable of fast and incremental learning.
The rest of this paper is organized as follows. In Section 2, we briefly review previous research on feature space heterogeneity and incremental learning. Section 3 presents the proposed classification approach, SCCFSH. The experimental results on a series of benchmark data sets and a real world application are reported in Section 4. Conclusions and discussion are given in Section 5.
2. Related Works
The phenomenon that relevant features for classification vary across the data set has been observed by many researchers and practitioners [4, 15–17]. To date, a number of classification methods have been developed, which can be divided into two categories. In the first category, one of the best known methods is “bagging” [18]. In this approach, subsets are generated by randomly sampling from the original set of samples. Consequently, relevant features might be different in the obtained subsets. Based on this approach, Puuronen et al. [19] proposed a Meta-Level Classification (MLC) method, which can be used to deal with the problem of feature space heterogeneity. MLC first divides the training sample set into some subsets and obtains the component classifiers based on these subsets. In the application phase, testing samples are put into the training sample set, and MLC dynamically selects the optimal component classifier for a testing sample by comparing the performance of different classifiers in its neighborhood. Different from the method of sample partitioning, the Random Subspace Method (RSM) [20] divides the whole feature set into a number of feature subsets and constructs different classifiers based on the whole training sample set with the different feature subsets obtained. Feature space heterogeneity in testing samples is considered through synthesizing (usually by voting) the outputs of all classifiers. These methods deal with the problem of feature space heterogeneity by first dividing the sample set or feature set into different subsets in a random way and then training component classifiers on the subsets. These component classifiers are then combined for classification, mostly by majority voting or by selecting the optimal one.
A major problem of these methods lies in the random set (sample set or feature set) partitioning, which may result in seriously biased component classifiers due to the feature redundancy and irrelevance in some subsets, especially for high dimensional data sets.
In the second category, modified lazy learning methods are applied to classification problems with feature space heterogeneity. Friedman [1] addresses the problem of feature space heterogeneity by investigating the variability of feature relevance in different data subsets. In his method, the local relevance of features in each subset is measured by the estimated reduction in classification error. Hastie and Tibshirani [15] develop an adaptive form of nearest neighbor classification method for dealing with feature space heterogeneity. In their approach, the distance metric for each sample is adaptively calculated in an iterative process using local discriminative information of features. Therefore, different relevant features are taken into account for classification in different subsets. Although both works report favorable results on their local approaches compared to global ones, both of them are computationally expensive [16]. Paredes and Vidal [21] propose a locally weighted lazy learning approach for better classification accuracy. In their method, different samples would have different feature weights obtained by approximately minimizing the Leave-One-Out (LOO) classification error of the given training set. However, the computational complexity of this method is high because of the gradient descent algorithm employed to search for the optimal weights.
Although much research has been carried out on feature space heterogeneity in classification, we have not found any work among it that addresses incremental learning. Research on incremental classification is mainly focused on statistical methods [22], neural networks [23–25], and evolutionary algorithms [26]. Instance-based learning, especially nearest neighbor (NN) learning, is a widely used nonparametric incremental classification approach where training or learning does not take place until a query is made. In contrast to complex learning algorithms such as neural networks or support vector machines, NN learning does not require a complex function fitting process or model training procedure. Thus, incremental learning is easy to implement [27]. Nevertheless, once a query point with unknown class label is presented, conventional NN learning traverses the whole data set to find the nearest neighbors of the query point. Therefore, the computational time and storage requirements of NN do not scale to large amounts of data. To solve this problem, Li and Ye [14] propose a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for classification. This algorithm enables scalable and incremental learning of patterns from data with both numeric and nominal variables. However, it calculates the feature relevance by using the squared correlation coefficient between predictor variables and the target variable over the entire data set, regardless of the possible heterogeneity that exists in the feature space.
3. The Proposed Approach
The Supervised Clustering for Classification with Feature Space Heterogeneity (SCCFSH) proposed in this paper is based on the ECCAS [14]. However, SCCFSH differs significantly from ECCAS in that it takes feature space heterogeneity into consideration. SCCFSH first divides the data set into a number of subsets in a supervised way and then explores the feature relevance in each subset obtained. The main steps of SCCFSH include grid-based supervised clustering, supervised grouping of clusters, removal of outliers, calculation of feature relevance in each cluster, and distance-based classification.
3.1. Grid-Based Supervised Clustering
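Consider a data set $X = \{x_1, x_2, \ldots, x_n\}$ in which each $x_i$ is a $d$-dimensional sample. The label set of the samples in $X$ is $Y = \{y_1, y_2, \ldots, y_n\}$. Without loss of generality, we only consider the binary classification problem; that is, each label $y_i$ takes one of two class values.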
The grid-based supervised clustering procedure first divides the sample space into grid cells and then generates clusters within the grid cells, as suggested in Li and Ye [14]. This procedure aims to avoid the problem that different presentation orders of the same data points may generate different cluster structures. For example, a number of data points of the same class may appear consecutively in the data set. Without the above grid-based procedure, these data points would be grouped into one cluster, even though they are not close to each other at all. Consequently, the cluster structure is not robust to the presentation order of data points. With the above grid-based procedure, these data points can be prevented from joining into one single cluster.
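The grid step can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the ECCAS procedure itself: it assumes equal-width cells of a hypothetical size `cell_width` and the simplest possible within-cell rule, in which samples that share a cell and a class label form one cluster. The function name and the dictionary layout are illustrative choices.

```python
import math
from collections import defaultdict


def grid_supervised_clusters(samples, labels, cell_width=1.0):
    """Group samples by (grid cell, class label); one cluster per group.

    Assumption: `cell_width` and the one-cluster-per-cell-and-class rule
    are illustrative simplifications, not taken from the paper.
    """
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        # The cell index of a sample is the integer bin of each feature.
        cell = tuple(math.floor(v / cell_width) for v in x)
        groups[(cell, y)].append(x)

    clusters = []
    # Sorting the keys makes the output independent of presentation order.
    for (cell, y), members in sorted(groups.items()):
        dim = len(members[0])
        centroid = [sum(m[j] for m in members) / len(members) for j in range(dim)]
        clusters.append({"label": y, "centroid": centroid, "members": members})
    return clusters


X = [[0.2, 0.3], [0.4, 0.1], [0.3, 0.2], [2.5, 2.5]]
y = [1, 1, 0, 1]
clusters = grid_supervised_clusters(X, y)
```

Because the grouping depends only on which cell a sample falls in, feeding the same samples in a different order yields the same clusters, which is the robustness property the grid is meant to provide.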
In the proposed SCCFSH, the procedure of grid-based supervised clustering is similar to that in ECCAS. The main difference is that SCCFSH employs grid-based supervised clustering only for decomposing the classification problem into its constituent subproblems, and feature relevance in individual subproblems is considered in later steps. Thus, we simply use the conventional Euclidean distance metric in supervised clustering without imposing any weights on the features in distance calculation; for a sample $x$ and a cluster centroid $c$ with $d$ features,
$$\mathrm{dist}(x, c) = \sqrt{\sum_{j=1}^{d} (x_j - c_j)^2}. \tag{1}$$
In comparison, ECCAS first calculates feature relevance by using the squared correlation coefficient $r_j^2$ between each predictor variable $j$ and the class label over the entire data set, and then it incorporates them in distance calculation by using the following weighted Euclidean distance metric:
$$\mathrm{dist}_r(x, c) = \sqrt{\sum_{j=1}^{d} r_j^2\,(x_j - c_j)^2}. \tag{2}$$
In other words, ECCAS does not take any possible feature space heterogeneity into consideration.
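To make the contrast between the two metrics concrete, the following Python sketch computes global squared correlation coefficients and uses them as weights in a Euclidean distance. It is a hedged illustration of the idea described above: the function names are hypothetical, labels are assumed to be numeric (for example 0 and 1), and a constant feature is assigned zero relevance by convention.

```python
import math


def squared_correlation(feature_values, labels):
    """Squared Pearson correlation between one feature and numeric labels."""
    n = len(labels)
    mean_f = sum(feature_values) / n
    mean_y = sum(labels) / n
    cov = sum((f - mean_f) * (t - mean_y) for f, t in zip(feature_values, labels))
    var_f = sum((f - mean_f) ** 2 for f in feature_values)
    var_y = sum((t - mean_y) ** 2 for t in labels)
    if var_f == 0 or var_y == 0:
        return 0.0  # convention assumed here for constant columns
    return cov * cov / (var_f * var_y)


def euclidean(a, b):
    """Unweighted distance, as used by SCCFSH during clustering."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def weighted_euclidean(a, b, weights):
    """Distance with one global weight per feature, in the spirit of ECCAS."""
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, a, b)))


X = [[0.0, 5.0], [1.0, 3.0], [2.0, 5.0], [3.0, 3.0]]
y = [0, 0, 1, 1]
weights = [squared_correlation([row[j] for row in X], y) for j in range(2)]
plain = euclidean(X[0], X[3])
weighted = weighted_euclidean(X[0], X[3], weights)
```

In this toy data the second feature is uncorrelated with the label, so it receives zero weight over the whole data set, even if it were informative inside some subset. That global averaging is the limitation SCCFSH is designed to avoid.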
3.2. Supervised Grouping of Clusters and Removal of Outliers
Supervised grouping of clusters plays an important role in the proposed SCCFSH. If some underlying clusters cover the area of several grid cells, grid-based supervised clustering would divide these large clusters into small clusters. Therefore, refinement of the clustering results is needed. In SCCFSH, we iteratively group the clusters obtained from supervised clustering in the same way as ECCAS. In this grouping procedure, a single linkage method is used, where the distance between two clusters is defined as the distance between their nearest “points” [28]:
$$\mathrm{dist}(C_q, C_s) = \min\{\mathrm{dist}(C_i, C_s),\ \mathrm{dist}(C_j, C_s)\}. \tag{3}$$
In (3), $C_q$ is the newly formed cluster obtained by merging clusters $C_i$ and $C_j$, and $C_s$ is an old cluster. Note that, in (3), the Euclidean distance metric represented in (1) is still used for distance calculation in SCCFSH, since our main aim is to decompose the classification problem into its subproblems according to the spatial distribution of samples. The main procedure of supervised grouping of clusters is shown in Algorithm 1. In step 3 of Algorithm 1, Label($C$) refers to the class label of cluster $C$.
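A rough Python sketch of supervised grouping with single linkage is given below. It should be read as an illustration rather than as Algorithm 1: the stopping rule is an assumption (merging stops when no same-label pair lies within a hypothetical threshold `max_distance`), and clusters are represented as dictionaries with `label` and `members` keys purely for convenience. Taking the minimum over all member pairs gives the same value as applying the single linkage update recursively after each merge.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def single_linkage(members_a, members_b):
    """Distance between the nearest points of two clusters."""
    return min(euclidean(p, q) for p in members_a for q in members_b)


def supervised_grouping(clusters, max_distance):
    """Repeatedly merge the closest pair of clusters that share a label.

    Assumption: `max_distance` is a hypothetical stopping threshold;
    the paper's own stopping rule is not reproduced here.
    """
    clusters = [{"label": c["label"], "members": list(c["members"])} for c in clusters]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Supervised constraint: never merge across class labels.
                if clusters[i]["label"] != clusters[j]["label"]:
                    continue
                gap = single_linkage(clusters[i]["members"], clusters[j]["members"])
                if gap <= max_distance and (best is None or gap < best[0]):
                    best = (gap, i, j)
        if best is None:
            return clusters
        _, i, j = best
        merged = {
            "label": clusters[i]["label"],
            "members": clusters[i]["members"] + clusters[j]["members"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)


initial = [
    {"label": 1, "members": [[0.0, 0.0]]},
    {"label": 1, "members": [[1.0, 0.0]]},
    {"label": 0, "members": [[0.5, 0.0]]},
    {"label": 1, "members": [[10.0, 10.0]]},
]
grouped = supervised_grouping(initial, max_distance=1.5)
```

In this example the two nearby positive clusters are merged, while the distant positive cluster and the negative cluster stay separate.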

By applying the grid-based supervised clustering and supervised grouping of clusters, a number of data groups can be obtained. In each group, all samples share the same class label. Nevertheless, some groups may contain only a few samples. These groups often represent noise (outliers) in the training samples and thus should be removed. A common way is to check the number of samples in a cluster: clusters whose number of samples is no more than a threshold are removed. However, the appropriate threshold depends strongly on the characteristics of the data set. For example, in a data set where the number of samples in one class is much larger than that in another class, the threshold value should be different for clusters with different class labels. In this paper, we simplify our work by setting this threshold value to 1.
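The size-based filter is simple enough to state directly in code. The sketch below follows the rule described above (a cluster is dropped when its size is no more than the threshold, with a default of 1); the optional per-label thresholds are a hypothetical extension suggested by the remark on imbalanced classes, not something the paper implements.

```python
def remove_outlier_clusters(clusters, threshold=1, per_label_threshold=None):
    """Keep only clusters with more than `threshold` samples.

    `per_label_threshold` (hypothetical) maps a class label to its own
    threshold and overrides the default for clusters of that label.
    """
    per_label_threshold = per_label_threshold or {}
    kept = []
    for cluster in clusters:
        limit = per_label_threshold.get(cluster["label"], threshold)
        if len(cluster["members"]) > limit:
            kept.append(cluster)
    return kept


clusters = [
    {"label": 1, "members": [[0.0, 0.0], [1.0, 0.0]]},
    {"label": 0, "members": [[0.5, 0.0]]},
    {"label": 1, "members": [[10.0, 10.0]]},
]
kept_default = remove_outlier_clusters(clusters)
kept_per_label = remove_outlier_clusters(clusters, per_label_threshold={0: 0})
```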
3.3. Calculation of Feature Relevance for Classification
After the above three steps, we obtain a number of clusters, and the samples in each cluster share the same class label. Vucetic and Obradovic [29] argue that some features with the same values may result in quite different outputs (class labels) in different regions. Therefore, spatial characteristics of samples should be explored for better classification performance. Motivated by this statement, we investigate the feature relevance in the obtained clusters that represent different spatial distributions of samples. It is stated in Gennari et al. [30] that features are relevant if their values vary systematically with class membership. Theodoridis and Koutroumbas [28] also argue that relevant features would have large between-class variance and small within-class variance. Accordingly, in our approach, the variances of features in a cluster are utilized to calculate their relevance to the class label. If the variance of a feature is small, it is implied that this feature is informative for the class label and thus should carry more weight in classification. Therefore, the relevance $r_{jk}$ of feature $j$ in cluster $C_k$ is calculated from the variance of feature $j$ within $C_k$, such that a smaller variance yields a larger relevance.
Once the relevance $r_{jk}$ of each feature $j$ in cluster $C_k$ is obtained, we standardize them by requiring that the sum of $r_{jk}$ with respect to $j$ be equal to $d$, the number of features, in accordance with the sum of weights in the conventional Euclidean distance metric (the weight of each feature is 1). The relevance of feature $j$ in cluster $C_k$ is finally determined by
$$w_{jk} = \frac{d\, r_{jk}}{\sum_{l=1}^{d} r_{lk}}. \tag{4}$$
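The per-cluster relevance computation can be illustrated with the short Python sketch below. The exact relevance formula is not reproduced in this text, so the sketch assumes, purely for illustration, that the raw relevance is the inverse of the within-cluster variance; the small constant `eps` is likewise a hypothetical guard against division by zero. Only the final rescaling, which makes the weights sum to the number of features, follows directly from the description above.

```python
def cluster_feature_weights(members, eps=1e-12):
    """Return one weight per feature for a single cluster.

    Assumption: raw relevance = 1 / (variance + eps). The paper's own
    relevance formula may differ; only the rescaling step is taken from
    the text (weights sum to the number of features).
    """
    n = len(members)
    dim = len(members[0])
    means = [sum(m[j] for m in members) / n for j in range(dim)]
    variances = [sum((m[j] - means[j]) ** 2 for m in members) / n for j in range(dim)]
    raw = [1.0 / (v + eps) for v in variances]
    total = sum(raw)
    return [dim * r / total for r in raw]


# Feature 0 varies little inside the cluster, feature 1 varies a lot.
members = [[0.0, 0.0], [2.0, 4.0]]
weights = cluster_feature_weights(members)
```

With variances 1 and 4, the assumed inverse-variance relevance gives weights of about 1.6 and 0.4, so the tighter feature dominates the distance while the total weight stays equal to 2.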
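With all the cluster centroids and the feature relevance in each cluster, we can consider each cluster centroid as a sample in the feature space and classify a new sample $x$ simply using the nearest neighbor (NN) rule by defining the weighted distance between $x$ and the centroid $c_k$ of cluster $C_k$ as
$$\mathrm{dist}_w(x, c_k) = \sqrt{\sum_{j=1}^{d} w_{jk}\,(x_j - c_{kj})^2}, \tag{5}$$
where $d$ is the number of features and the weights $w_{jk}$, the relevance of feature $j$ in cluster $C_k$, are defined in (4). Clearly, if the relevance of a feature is strong, it carries high weight in distance calculation for classification.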
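A minimal Python sketch of this classification rule follows. Each cluster is assumed to be stored as a dictionary holding its label, centroid, and its own feature weights; that layout and the function names are illustrative. The example values are chosen so that the cluster-specific weights change the decision relative to an unweighted nearest-centroid rule.

```python
import math


def weighted_distance(x, centroid, weights):
    """Weighted Euclidean distance using the weights of one cluster."""
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, x, centroid)))


def classify(x, clusters):
    """Assign x the label of the nearest centroid under per-cluster weights."""
    nearest = min(
        clusters,
        key=lambda c: weighted_distance(x, c["centroid"], c["weights"]),
    )
    return nearest["label"]


clusters = [
    {"label": 1, "centroid": [0.0, 0.0], "weights": [0.4, 1.6]},
    {"label": 0, "centroid": [3.0, 0.0], "weights": [1.0, 1.0]},
]
unweighted = [dict(c, weights=[1.0, 1.0]) for c in clusters]

x = [1.4, 2.0]
label_weighted = classify(x, clusters)
label_unweighted = classify(x, unweighted)
```

Here the first cluster treats the second feature as highly relevant, so the deviation of `x` along that feature is penalized more heavily and the sample is assigned to the other cluster, whereas with unit weights it would be assigned to the first.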
In summary, the main steps of the proposed SCCFSH are shown in Algorithm 2.

4. Experimental Study
To verify the effectiveness of the proposed approach for classification problems with feature space heterogeneity, we implement two sets of experiments. In the first set of experiments, we artificially construct some mixed data sets with feature space heterogeneity and show that SCCFSH is effective and time-efficient. In the second set of experiments, we apply the proposed classification algorithm to a real world customer targeting problem.
4.1. Experiments on Benchmark Data Sets
In this set of experiments, we select 13 benchmark data sets from the ELENA project [31] and UCI machine learning repositories (available at http://archive.ics.uci.edu/ml/ and https://www.elen.ucl.ac.be/neuralnets/Research/Projects/ELENA/elena.htm). Basic characteristics of these selected data sets are briefly summarized in Table 1.
As shown in Table 1, these benchmark data sets are from different backgrounds and have different numbers of features with different units of feature values. To avoid the dominance of features with large values over those with small values, we standardize the selected data sets such that each feature has zero mean and unit standard deviation. In this paper, we mainly focus on binary classification problems. Thus, for data sets with more than two classes, we convert them into a number of two-class subproblems and choose one of the resulting subproblems for the experimental study. For example, the data set Letter Recognition has 26 classes. We select a subset of samples in “A” and “D” classes and denote it as Letter Recognition (A versus D).
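The standardization step corresponds to the usual z-score transform, sketched below with the Python standard library. The function name is illustrative, the population standard deviation is used, and mapping a constant column to zeros is a convention assumed here.

```python
import statistics


def standardize_columns(rows):
    """Rescale every feature to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in rows
    ]


rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize_columns(rows)
```

After the transform both columns take the same values, so the feature originally measured in hundreds no longer dominates a Euclidean distance.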
Since the main aim of this study is to investigate the effectiveness of our classification algorithm on data sets with significant feature space heterogeneity, we construct 12 mixed data sets by merging some benchmark data sets selected from the data sets listed in Table 1. When the component (benchmark) data sets to be merged have different numbers of features, we add random features to the data sets with fewer features to achieve equal dimensionality. Values of each added random feature are generated from a normal distribution with zero mean and unit standard deviation. Because some data sets have many more samples than others, we randomly select 1,000 samples from those component data sets with more than 1,000 samples to balance the component proportions in each mixed data set. Structures of the 12 mixed data sets are described in Table 2.
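The construction of a mixed data set can be sketched as follows. This is an illustrative reading of the description above, with hypothetical function and parameter names: each component is subsampled to at most `max_per_component` rows, and components with fewer features are padded with standard normal noise up to the largest dimensionality.

```python
import random


def build_mixed_data_set(components, max_per_component=1000, seed=0):
    """Merge (rows, labels) components into one data set of equal width."""
    rng = random.Random(seed)
    target_dim = max(len(rows[0]) for rows, _ in components)
    mixed_rows, mixed_labels = [], []
    for rows, labels in components:
        indices = list(range(len(rows)))
        if len(indices) > max_per_component:
            # Subsample large components to balance their proportions.
            indices = rng.sample(indices, max_per_component)
        for i in indices:
            # Pad with N(0, 1) noise so every row has target_dim features.
            padding = [rng.gauss(0.0, 1.0) for _ in range(target_dim - len(rows[i]))]
            mixed_rows.append(list(rows[i]) + padding)
            mixed_labels.append(labels[i])
    return mixed_rows, mixed_labels


a_rows = [[float(i), float(i)] for i in range(5)]
a_labels = [i % 2 for i in range(5)]
b_rows = [[10.0 + i, 0.0, 1.0] for i in range(4)]
b_labels = [1, 0, 1, 0]
rows, labels = build_mixed_data_set(
    [(a_rows, a_labels), (b_rows, b_labels)], max_per_component=3
)
```

The padded features carry no class information by construction, so they are irrelevant for the component that received them while the same positions may be informative for another component.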
To investigate the possible heterogeneity that exists in the feature space of these mixed data sets, we evaluate the feature relevance in each component data set using the squared correlation coefficients between the features and class label. The results are shown in Table 3.
Table 3 indicates that the order of feature relevance varies across the component (benchmark) data sets. Therefore, it can be concluded that most mixed data sets will have significant feature space heterogeneity, which makes them suitable for our experimental study.
We next apply the proposed SCCFSH to the mixed data sets. To demonstrate the effectiveness of SCCFSH on classification problems with feature space heterogeneity, we compare its performance with that of ECCAS and the Class Prototype Weight (CPW) learning method proposed by Paredes and Vidal [21]. In CPW learning, different training samples have different optimal feature weights. These weights are determined by approximately minimizing the Leave-One-Out NN classification error of the given training set. Therefore, CPW learning can be employed for classification problems with feature space heterogeneity. CPW learning has several parameters that need to be set; we use the values suggested in Paredes and Vidal [21].
In the experiments, $k$-fold cross validation [32] is applied to estimate the error rates. Each mixed data set is first divided into $k$ subsets randomly; $k - 1$ subsets are used as the training set and the remaining subset is used as the testing set. For simplicity, we arbitrarily set $k = 5$. The training-testing experiment for each mixed data set is run 30 times using different random 5-fold partitions. The classification error rates of ECCAS, CPW, and SCCFSH averaged over 30 runs and the results of the paired $t$-test are shown in Table 4.
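The evaluation protocol can be sketched as repeated k-fold cross validation. The Python code below is an illustration under assumptions: every fold serves as the testing set once per run (the standard protocol), `fit` and `predict` are hypothetical callables standing in for any of the compared classifiers, and the toy majority-class model is used only to make the example runnable.

```python
import random


def k_fold_indices(n, k, rng):
    """Yield (train, test) index lists for one random k-fold partition."""
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def repeated_cv_error(X, y, fit, predict, k=5, runs=30, seed=0):
    """Average error rate over `runs` random k-fold partitions."""
    rng = random.Random(seed)
    run_errors = []
    for _ in range(runs):
        wrong = 0
        for train, test in k_fold_indices(len(X), k, rng):
            model = fit([X[i] for i in train], [y[i] for i in train])
            wrong += sum(predict(model, X[i]) != y[i] for i in test)
        run_errors.append(wrong / len(X))
    return sum(run_errors) / runs


# Toy stand-in classifier: always predict the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
error = repeated_cv_error(X, y, fit_majority, predict_majority)
```

A paired test between two methods would then compare their per-run error rates on the same partitions, which is why a fixed seed is exposed as a parameter.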
It can be observed from Table 4 that, in comparison with ECCAS and the CPW learning method, the proposed classification approach SCCFSH obtains comparable results on the 12 mixed data sets. It is noteworthy that, in comparison to ECCAS, the proposed SCCFSH achieves better classification performance on most (11 out of 12) data sets. This may be attributed to the fact that SCCFSH takes the feature space heterogeneity that exists in the mixed data sets into consideration.
Since one main aim of our study is to develop a classification approach with incremental learning ability, we compare the average computational times of SCCFSH and CPW on each mixed data set over 30 runs. The results are shown in Table 5.
Table 5 shows that the average computational time of SCCFSH is much less than that of CPW learning. This is because CPW learning employs the gradient descent method to search for approximately optimal feature weights by minimizing the LOO NN error rate. After these weights are adjusted in one iteration, the algorithm has to traverse the whole data set to find the nearest neighbors for each sample in the next iteration. Besides, the convergence rate of the gradient descent method depends on the condition number of the Hessian (the ratio of its largest eigenvalue to its smallest one) and can be very slow, as presented, for example, by Luenberger and Ye [33]. In contrast, the computational complexity of SCCFSH is the same as that of ECCAS and is determined by the number of features, the number of samples, and the number of clusters.
4.2. Application to a Real World Customer Targeting Problem
The data set used in this set of experiments is taken from a solicitation of 9822 European households to buy insurance for their recreational vehicles (RV) (available online at http://www.liacs.nl/~putten/library/cc2000/). In this data set, each household’s record contains a target variable indicating whether they buy insurance and 93 predictor variables indicating information on both sociodemographic characteristics and ownership of various types of insurance policies. A more detailed description of the data set is presented in Kim and Street [34].
In the experiments, we use two separate data sets: a training set with 5,822 households and an evaluation set with 4,000 households. Of the 5,822 prospects in the training data set, 348 purchased RV insurance, resulting in a hit rate of $348/5822 \approx 5.98\%$. From the manager’s perspective, it is desirable to increase this hit ratio by selecting those households with the highest responding probabilities and sending mails to them. Therefore, an efficient classification model based on the training set is needed to predict the responding probability of each household. The evaluation data set is used to validate the predictive classification model. Hereafter, we define the households who purchase and do not purchase the RV insurance as positive and negative, respectively. Therefore, through supervised clustering, the households are partitioned into a number of clusters which are either positive or negative.
Since we are interested in the top-ranked customers with the highest probability to buy RV insurance in the evaluation data set, the method’s predictive accuracy is examined by computing the hit rate among the selected households. Consequently, we modified the SCCFSH shown in Algorithm 2 as follows. Instead of finding the nearest neighbor of each sample $x$ in the testing data set, we find its two nearest neighbors (clusters) from different classes. Denote by $d^{+}$ and $d^{-}$ the distances between $x$ and its nearest positive cluster and negative cluster, respectively. The probability $P(y = 1 \mid x)$ of $x$ belonging to a specific class (e.g., the positive class), where $y = 1$ indicates that household $x$ would buy RV insurance, is calculated from $d^{+}$ and $d^{-}$ by Equation (6). Equation (6) means that a greater distance between a testing sample and its nearest negative cluster implies a higher probability that this testing sample belongs to the positive class. Equation (6) is employed to modify the output of SCCFSH in application to the customer targeting problem in this experiment.
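A distance-based score with the stated property can be sketched in a few lines of Python. The exact equation is not reproduced in this text, so the ratio used below is an assumption chosen only because it increases with the distance to the nearest negative cluster and lies between 0 and 1; the function name and the tie convention for two zero distances are also illustrative.

```python
def positive_probability(d_pos, d_neg):
    """Score in [0, 1] that grows as the nearest negative cluster gets farther.

    Assumption: the ratio d_neg / (d_pos + d_neg) is one form consistent
    with the description; it is not confirmed to be the paper's equation.
    """
    total = d_pos + d_neg
    if total == 0:
        return 0.5  # convention assumed here when both distances are zero
    return d_neg / total


near_positive = positive_probability(d_pos=1.0, d_neg=3.0)
equidistant = positive_probability(d_pos=2.0, d_neg=2.0)
```

Under this assumed form, a household three times closer to a positive cluster than to a negative one receives a score of 0.75, and an equidistant household receives 0.5.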
Similar to the evaluation mechanism of prediction accuracy in Kim et al. [35], we estimate the probability of buying new insurance for each household in the evaluation data with SCCFSH. After sorting the households in descending order of the estimated probability, we compute the cumulative hit rate of a model over various target points. A comparison of cumulative hit ratios obtained by SCCFSH, ECCAS, and the method proposed in Kim et al. [35] is shown in Figure 1. Note that, for ECCAS, Equation (6) is also used to modify the outputs to estimate the probability.
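The cumulative hit rate computation can be sketched as follows. The sketch assumes that a target point is the fraction of households selected from the top of the ranking; the function name and the toy scores are illustrative and are not taken from the experiments.

```python
def cumulative_hit_rates(probabilities, bought, target_points):
    """Hit rate among the top fraction of households for each target point.

    `bought` holds 1 for a purchase and 0 otherwise; each target point is
    assumed to be a fraction in (0, 1] of households to contact.
    """
    order = sorted(
        range(len(probabilities)), key=lambda i: probabilities[i], reverse=True
    )
    rates = {}
    for t in target_points:
        n_selected = max(1, round(t * len(order)))
        selected = order[:n_selected]
        rates[t] = sum(bought[i] for i in selected) / n_selected
    return rates


scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.2, 0.6, 0.3, 0.5, 0.05]
bought = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
rates = cumulative_hit_rates(scores, bought, [0.2, 0.5, 1.0])
```

At a target point of 1.0 every household is selected, so the hit rate equals the overall response rate; a useful model should exceed that baseline at the smaller target points.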
It can be observed from Figure 1 that the proposed SCCFSH shows uniformly better performance at target point . Considering that, in this application, the market managers are more interested in targeting fewer customers with higher hit ratio, the result obtained by our approach is quite favorable. Moreover, our method takes much less time (less than 10 minutes on a computer with 1.5 GHz CPU and 256 M RAM) than that in ELSE/ANN (more than ten hours).
5. Conclusion
Feature space heterogeneity is the phenomenon that the optimal features for classification are distinct in different subsets of samples, but prior knowledge about these underlying subsets is unavailable. Moreover, in some real world applications of classification techniques, new data are presented in sequence and added to the historical data set after they are processed. Consequently, feature space heterogeneity existing in the historical data set might dynamically change and, accordingly, the classification system has to be updated for better accuracy. In this paper, we develop Supervised Clustering for Classification with Feature Space Heterogeneity (SCCFSH) to address this problem. Our approach consists of four main steps: grid-based supervised clustering, supervised hierarchical grouping of clusters, feature relevance evaluation in each cluster, and weighted distance calculation for classification. The main advantage of the proposed SCCFSH is that it can deal with feature space heterogeneity in classification problems in a scalable and incremental way. Computational results in the experiments verify the efficiency and effectiveness of the proposed approach. Although we only consider binary classification problems in this paper, our approach can be easily extended to multiclass classification problems.
In the proposed SCCFSH, a cluster with only one sample is considered an outlier and removed. When samples from different classes overlap heavily in the data space, it might be inappropriate to consider a cluster with one sample as an outlier. A possible direction for future research is to improve the proposed approach so that it can deal with overlapping samples.
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The author is grateful to the editor and the anonymous reviewer for providing many helpful comments and suggestions, which have significantly improved the exposition and focus of this paper. This research is supported by the National Natural Science Foundation of China (NSFC Grant no. 71001112), the Fundamental Research Funds for the Central Universities (Project no. CQDXWL2013083), and the Social Science Research Fund for Young Teachers in Chongqing University (Project no. CDSK200911).