Abstract
Classification of imbalanced data has received increasing attention. Many effective methods have been proposed and applied in a variety of fields, but more efficient methods are still needed. Although the hypergraph is an efficient tool for knowledge discovery, it may not be powerful enough to deal with data in the boundary region. In this paper, the neighborhood hypergraph is presented, which combines rough set theory with the hypergraph. On this basis, a novel classification algorithm for imbalanced data based on the neighborhood hypergraph is developed. It consists of three steps: initialization of hyperedges, classification of the training dataset, and replacement of hyperedges. In 10-fold cross validation experiments on 18 datasets, the proposed algorithm achieves higher average accuracy than the compared algorithms.
1. Introduction
The imbalanced dataset problem in classification occurs when the number of instances representing one class is much larger than that of the other classes. The minority class is usually the more interesting one from the point of view of the learning task. Class imbalance arises in many applications, such as satellite image classification [1], risk management [2], and medical diagnosis [3, 4]. When studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may well be a critical mistake [5]. At present, solutions to the imbalanced classification problem are developed at both the data level and the algorithm level [6]. At the data level, the objective is to rebalance the class distribution by resampling the data space, for example by oversampling the minority class or undersampling the majority class. At the algorithm level, solutions adapt existing classifier learning algorithms to strengthen learning with regard to the minority class, for example through cost-sensitive learning, ensemble learning, and hypernetworks [7].
Previous research has improved resampling methods in many respects and produced several effective resampling algorithms. SMOTE is an intelligent oversampling algorithm proposed by Chawla et al. [8]. Its main idea is to form new minority class samples by interpolating between minority class samples that lie close together. In this way, overfitting is alleviated and the decision space of the minority class is extended, while the decision space of the majority class is reduced. Many researchers have since proposed improved variants. Dong and Wang [9] proposed Random-SMOTE, which differs from SMOTE in that it obtains new minority class samples by interpolating among three minority class samples. Yang et al. [10] proposed the ASMOTE algorithm, which selects not only minority class samples but also the majority class samples near a minority class sample, so that synthetic samples do not overlap the majority class samples. Han et al. [11] proposed Borderline-SMOTE. Hu and Li [12] proposed the NRSBoundary-SMOTE algorithm, which expands the decision space of the minority class while shrinking that of the majority class.
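The interpolation idea behind SMOTE can be illustrated with a minimal Python sketch. It is not the reference implementation of [8]; the function name `smote_sample`, the parameter `k`, and the use of Euclidean distance are assumptions made for illustration, and at least two minority samples are required.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic minority sample by linear interpolation.

    minority: list of equal-length numeric feature vectors (minority class only).
    k: number of nearest minority neighbours to choose from (assumed value).
    """
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority samples by Euclidean distance to the base sample.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: math.dist(base, p))
    neighbour = rng.choice(others[:k])
    # Pick a random point on the segment between base and neighbour.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the region already occupied by the minority class.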
In recent years, with the rapid development of ensemble methods for classification, they have also been applied to imbalanced data classification. Ensemble learning is a machine learning paradigm in which multiple learners (called base learners) are trained to solve the same problem [13]. Owing to their outstanding performance, ensemble methods are applied to imbalanced datasets in combination with other techniques. Chawla et al. developed the SMOTEBoost algorithm by integrating AdaBoost (the most famous boosting algorithm) with the synthetic minority oversampling technique (SMOTE) [14]. Like SMOTEBoost, RUSBoost introduces data sampling into the AdaBoost algorithm; however, RUSBoost applies random undersampling to the majority class, whereas SMOTEBoost creates synthetic minority class instances by operating in the feature space [15]. Błaszczyński et al. integrated the selective data preprocessing method SPIDER with the Ivotes ensemble algorithm, yielding a framework called IIvotes [16]. In addition, cost-sensitive learning has become an effective tool for the class imbalance problem, covering both binary and multiclass classification. It can be implemented in two ways, rescaling and reweighting, both of which aim to make the trained classifier cost sensitive. Rescaling changes the distribution of samples in the training data. It has been used in cost-based sampling [17], REBALANCE [18], Rescalenew [19], and so on. In contrast, reweighting adjusts the class probability distribution in the classifier according to the costs. It has been used in MetaCost [20], proposed by Domingos, and in AdaCost [21], which Fan et al. derived from AdaBoost.
In the 1970s, Rumelhart and Norman proposed three types of human learning: accretion, tuning, and restructuring [22]. Building on their study, Zhang proposed three basic principles of cognitive learning [23]: continuity, glocality, and compositionality. He used a hypergraph as the form of representation and proposed the hypernetwork model, which can be used for cognitive learning and memory. A hypernetwork is a probabilistic graph with a large number of hyperedges. A hyperedge can be regarded as a component, a subject, or even a circuit. From the perspective of data, a hyperedge is a combination of sample attributes and a class. So far, the hypernetwork can only deal with discrete data, so a dataset must be discretized before it is used to build a hypernetwork classifier. Building and using a hypernetwork model involves three steps: (1) initializing the hypernetwork from the training dataset, (2) evolutionary learning of the hypernetwork, and (3) classifying the test dataset with the evolved hypernetwork. In step 1, each sample is used to generate many hyperedges, each of which inherits some attributes of the sample together with its class. In step 2, the operations of matching, selection, and replacement are repeatedly applied to the hyperedges. Starting from a randomly initialized hypernetwork, the fitness of each hyperedge is calculated in every iteration for evaluation and ranking, and hyperedges with low fitness are replaced with newly generated ones. In this way, the hypernetwork finds hyperedges with high class discernibility in the pattern space [7]. After these steps, the hypernetwork model is built, and it classifies the test data through joint probability.
Although the hypernetwork has been widely used to solve various machine learning problems, it usually produces poor overall classification performance on class-imbalanced problems. Like most traditional classification algorithms, the hypernetwork assumes that the class distribution of a dataset is balanced. The goal of hypernetwork learning is to extract hyperedges (or decision rules) that cover as many samples as possible. Hyperedges that are critical for differentiating class membership are copied and added, while hyperedges with poor discriminative ability are discarded. Under class imbalance, however, many minority class samples are treated as noise. Consequently, the hyperedges corresponding to the majority class significantly outnumber those corresponding to the minority class, and most minority samples are misclassified by a traditional hypernetwork. This paper therefore attempts to combine the hypernetwork with rough set theory to address the problem.
Rough set theory is a powerful mathematical tool introduced by Pawlak [24–27] to deal with imprecise, uncertain, and vague information. It has been successfully applied in fields such as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. The basic idea is to approximate a concept by three descriptive sets, namely, the lower approximation, the upper approximation, and the boundary region. Rough set theory places the uncertain samples in the boundary region, which is obtained by subtracting the lower approximation from the upper approximation; both approximations can be computed directly. Many researchers have applied rough set theory to the processing of imbalanced data [28, 29].
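For the classical (equivalence-based) rough set model, the three descriptive sets can be computed as in the following Python sketch. The function name `approximations` and the representation of the partition as a list of sets are assumptions for illustration; the neighborhood-based variant used later in this paper replaces equivalence classes with neighborhoods.

```python
def approximations(equivalence_classes, concept):
    """Return (lower, upper, boundary) of `concept` w.r.t. a partition."""
    concept = set(concept)
    lower, upper = set(), set()
    for block in equivalence_classes:
        block = set(block)
        if block <= concept:        # block lies entirely inside the concept
            lower |= block
        if block & concept:         # block overlaps the concept
            upper |= block
    return lower, upper, upper - lower
```

For the partition {1, 2}, {3, 4}, {5} and the concept {1, 2, 3}, the block {3, 4} only partly belongs to the concept, so it forms the boundary region.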
The remainder of this paper is organized as follows. Section 2 introduces the basic concepts of the hypergraph and the neighborhood rough set model. Section 3 develops the neighborhood hypergraph algorithm. Section 4 presents the experimental evaluation on 18 imbalanced UCI datasets [30] with 10-fold cross validation, which shows the validity of the proposed method. Section 5 concludes the paper.
2. Hypergraph and Neighborhood Hypergraph Model
2.1. The Definition of Hypergraph
In 1970, Berge [31] used the hypergraph to define the hypernetwork. This was the first systematic treatment of undirected hypergraph theory, and it was applied to operations research through matroids.
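Definition 1 (see [31, 32]). Let $V = \{v_1, v_2, \ldots, v_n\}$ be a finite set and let $E = \{e_1, e_2, \ldots, e_m\}$ be a family of subsets of $V$. If
(a) $e_i \neq \emptyset$ $(i = 1, 2, \ldots, m)$,
(b) $\bigcup_{i=1}^{m} e_i = V$,
then the binary relation $H = (V, E)$ is defined as a hypergraph. The elements $v_1, v_2, \ldots, v_n$ of $V$ are defined as the vertices of the hypergraph; $E$ is defined as the edge set of the hypergraph, and each $e_i$ is defined as a hyperedge (see Figure 1).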
2.2. The Neighborhood Hypergraph
Neighborhoods and neighborhood relations are important concepts in topology. Lin [33] pointed out that neighborhood spaces are more general topological spaces than equivalence spaces and introduced the neighborhood relation into rough set methodology. Hu et al. [34] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model, which they then used to build a uniform theoretical framework for neighborhood-based classifiers. For convenience of description, some basic concepts of the neighborhood rough set model are introduced first.
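Definition 2 (see [34]). Given arbitrary $x_i \in U$ and $B \subseteq C$, the neighborhood $\delta_B(x_i)$ of $x_i$ in the subspace $B$ is defined as
$$\delta_B(x_i) = \{x_j \mid x_j \in U, \ \Delta_B(x_i, x_j) \le \delta\},$$
where $\Delta$ is a metric function. For all $x_1, x_2, x_3 \in U$, it satisfies
(1) $\Delta(x_1, x_2) \ge 0$,
(2) $\Delta(x_1, x_2) = 0$ if and only if $x_1 = x_2$,
(3) $\Delta(x_1, x_2) = \Delta(x_2, x_1)$,
(4) $\Delta(x_1, x_3) \le \Delta(x_1, x_2) + \Delta(x_2, x_3)$.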
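Consider two objects $x_1$ and $x_2$ in an $N$-dimensional space $A = \{a_1, a_2, \ldots, a_N\}$, where $f(x, a_i)$ denotes the value of sample $x$ on the $i$th dimension $a_i$. Then a general metric, named the Minkowski distance, is defined as
$$\Delta_P(x_1, x_2) = \left( \sum_{i=1}^{N} \left| f(x_1, a_i) - f(x_2, a_i) \right|^{P} \right)^{1/P}.$$
When $P = 2$, it is the Euclidean distance $\Delta_2$.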
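The neighborhood and the Minkowski distance can be sketched in Python as follows. The function names `minkowski` and `neighborhood`, the parameter names, and the representation of samples as plain numeric lists are assumptions for illustration only.

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two numeric vectors; p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def neighborhood(center, samples, delta, p=2):
    """Indices of the samples whose distance to `center` is at most `delta`."""
    return [i for i, s in enumerate(samples) if minkowski(center, s, p) <= delta]
```

With `delta` set to 1.0, the neighborhood of the origin among the points (0, 0), (1, 0), and (3, 4) contains only the first two.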
However, the Euclidean distance applies only to continuous features and is not valid for nominal features. For nominal features, we use the value difference metric (VDM) proposed by Stanfill and Waltz [35] in 1986. The distance between two corresponding feature values is defined as follows:
$$\mathrm{vdm}(V_1, V_2) = \sum_{i=1}^{n} \left| \frac{C_{1i}}{C_1} - \frac{C_{2i}}{C_2} \right|^{k}.$$
In the previous equation, $V_1$ and $V_2$ are the two corresponding feature values, and $n$ is the number of classes. $C_1$ is the total number of occurrences of feature value $V_1$, and $C_{1i}$ is the number of occurrences of feature value $V_1$ for class $i$. A similar convention applies to $C_2$ and $C_{2i}$. $k$ is a constant, which is usually set to 1.
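A minimal Python sketch of the value difference metric for a single nominal feature is given below. The function name `vdm` and its argument layout are assumptions; both values being compared are assumed to occur at least once in the training data, otherwise the division fails.

```python
from collections import Counter


def vdm(values, labels, v1, v2, k=1):
    """Value difference metric between nominal values v1 and v2 of one feature.

    values: the feature's value for every training sample.
    labels: the class label of every training sample.
    """
    total = Counter(values)                    # occurrences of each value
    per_class = Counter(zip(values, labels))   # occurrences per (value, class)
    return sum(
        abs(per_class[(v1, c)] / total[v1] - per_class[(v2, c)] / total[v2]) ** k
        for c in set(labels)
    )
```

Two values are close under this metric when they are distributed over the classes in similar proportions, regardless of how the values are named.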
Definition 3. Given is a neighborhood hypergraph, then is the vertex set of , indicating that it has vertices, where denotes a sample. is hyperedge set, and each in is a hyperedge which connects vertices . is the attribute set, and is the decision set.
In some of the literature, such as [36], the vertices of a hypergraph represent the attributes of samples. In this paper, however, the vertices of the hypergraph represent samples, and different samples on one hyperedge share the same attribute set. An example of a neighborhood hypergraph is shown in Figure 2.
Definition 4. Given is a sample, where denote the values of at the attributes set , denotes the decisions of , and is the radius of a neighborhood.
Definition 5. Given and the attribute set , the hyperedges which are included by sample , consisting of the set of hyperedges, are defined as
Definition 6. Given and attributes set , for arbitrary , the sample set which related to is defined as . Given arbitrary and attributes set , the sample set related to is defined as .
Definition 7. Given , for arbitrary , one knows , where denotes minority decision and denotes majority decision. Then sets of decision and decision in hyperedge set are, respectively, defined as thus, the degree of imbalance for is defined as
Definition 8. Given , for arbitrary , assume , where denotes minority decision and denotes majority decision. is sample set related to hyperedge on attributes set . According to decisions , is divided into equivalence classes: ; when , the confidence degree of is defined as follows.(1)If , then (2)If , then
Definition 9. Given , is the attitudes set of samples, and is the samples decision. For arbitrary hyperedge set , according to decisions , the hyperedge set is divided into equivalence classes: . For arbitrary , the upper approximation, lower approximation, boundary region, and negative domains of decision related to set of attributes are, respectively, defined as
The lower approximation of decision that related to the set of attributes is also called positive domain, denoted by . The size of positive domain reflects the separable degree of classification problem in a given attribute space; the bigger the positive region, the smaller the border.
To explain how to divide the upper approximation, lower approximation, and boundary region, here we give an example (Example 1) in Figure 3.
In Figure 3, the hyperedge is simultaneous in the neighborhood of samples and ; in other words, it links and . From the graph, we can know easily that whether a hyperedge is in the neighborhood is up to the fact that whether the symbol of the hyperedge is inside of the neighborhood of the sample.
First, one calculates the sample set of each hyperedge: , , according to Formula (4) and Definition 6. Second, one calculates the confidence degrees of each hyperedge, according to Formula (8): , , . Third, according to Formula (9), one gets the final result on Figure 3: the upper approximation , the lower approximation , and the boundary region .
3. Neighborhood Hypergraph Classification Algorithm
The traditional hypernetwork model has two limitations: it requires discretized datasets, and it applies no special processing to the samples in the boundary region. Combining rough set theory with the hypernetwork brings several advantages. First, the hypernetwork can deal with numeric data directly, which avoids the information loss caused by discretization. Second, in the process of hyperedge learning, the hyperedge set is divided into three parts: the upper approximation, the lower approximation, and the boundary region. Third, hyperedges in the boundary region are processed specially, which improves classification accuracy.
The proposed algorithm tackles the imbalanced data classification problem in two respects.
(1) Improve the degree of imbalance of the hyperedge set. In a traditional hypernetwork, the class of a hyperedge is inherited directly from a sample, which does nothing to improve the degree of imbalance of the hyperedge set. In this paper, when hyperedges are initialized, the classes of a fraction of the hyperedges are determined according to the classes of samples, which reduces the degree of imbalance of the hyperedge set to some extent.
(2) Set a classification condition. The classification process of the traditional hypernetwork does not take the degree of imbalance of the hyperedge set into consideration, resulting in low accuracy on the minority class. Here, a threshold equal to the square of the degree of imbalance is set as a classification condition. This makes the classifier pay more attention to the minority class and thus deal with the class imbalance problem appropriately.
The flow chart of the algorithm is shown in Figure 4. Each part of the flow chart is analyzed in the remainder of this section.
3.1. Hyperedge Initialization
Hyperedges are generated from the samples, which preserves the real distribution of the sample set and thereby provides a foundation for hyperedge selection. Meanwhile, some attribute values can be changed while a hyperedge is generated. In this way, more decision rules are generated for sample classification, which can improve the accuracy of sample classification to some extent.
In this paper, the attribute set of a hyperedge is the universal attribute set of the samples; that is to say, a hyperedge has exactly the form of a sample and is denoted by a small dot in the figures. In hyperedge initialization, a value is assigned to each attribute and the class of each hyperedge is determined.
In order to process imbalanced dataset, two classes will be considered in the following definitions.
Definition 10. Given , for arbitrary , assume , where denotes minority decision and denotes majority decision. Then sets of decision and decision in hyperedge set are, respectively, defined as So the degree of imbalance for is defined as
In this paper, the process of hyperedge generation consists of two stages: attribute inheritance and class confirmation.
Attribute inheritance: a hyperedge and a sample have the same attribute set, and the attribute values of the hyperedge are assigned partly on the basis of the sample. Seven-tenths of the attributes of the hyperedge are selected at random, and they inherit the corresponding attribute values of the sample. The remaining attribute values are generated randomly within the range of the corresponding attribute values of the sample. Figure 5 illustrates a sample and a hyperedge generated in this way.
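The attribute inheritance stage can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Java implementation: attributes are assumed numeric, the range of each attribute is assumed to be its observed minimum and maximum in the training set, and the names `init_hyperedge`, `ranges`, and `inherit_ratio` are hypothetical.

```python
import random


def init_hyperedge(sample, ranges, inherit_ratio=0.7, rng=None):
    """Build the attribute values of one hyperedge from one sample.

    sample: list of numeric attribute values.
    ranges: list of (low, high) pairs, one per attribute (assumed to be the
            observed range of that attribute in the training set).
    """
    rng = rng or random.Random()
    n = len(sample)
    # Randomly choose which attributes are inherited unchanged from the sample.
    inherited = set(rng.sample(range(n), round(inherit_ratio * n)))
    return [
        sample[i] if i in inherited else rng.uniform(*ranges[i])
        for i in range(n)
    ]
```

The class of the hyperedge is not set here; it is determined separately in the class confirmation stage.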
Definition 11. Given , for arbitrary , the majority sample set and the minority sample set related to hyperedge in the sample set are defined as
Class confirmation: the class of a hyperedge is confirmed by the whole dataset. There are two cases as follows.(a)According to formula (11), if , then ;(b)otherwise, generate a random number in . If (, , respectively, represent majority sample set and minority sample set), then ; else, .
3.2. Classification on Training Set
The generated hyperedge set is used to classify the training set. From the classification result, one obtains the classification accuracy of the hyperedge set and decides whether hyperedges in the set need to be replaced. By repeating the process of training sample classification and hyperedge replacement, the distribution of the hyperedge set gradually approaches the real distribution of the training sample set.
Definition 12. Given , the majority hyperedge set and minority hyperedge set in neighborhood of sample are defined as
Two factors should be considered when the neighborhood hypergraph is used to classify samples: the numbers of majority and minority hyperedges in the neighborhood of a sample, and the degree of imbalance of the hyperedge set. Combining these two factors, the classification method is presented as follows.
Given sample set , for arbitrary and attribution set ,
(1) if , then ;
(2) if , then .
The classification rules above are used to classify the training set. If the accuracy is higher than 0.95, the hyperedge set is output. Otherwise, the hyperedge replacement operation is performed (see Section 3.3).
In experimental evaluation, we conclude that is a good choice to enhance the accuracy of minority class, while is a poor one to classify the minority class samples.
3.3. Hyperedge Replacement
In the process of hyperedge initialization, part of the attribute values are generated randomly. As a result, some hyperedges are not suitable for sample classification. To achieve better performance, the poor hyperedges should be replaced by newly generated hyperedges; this operation is called hyperedge replacement.
The algorithm divides the hyperedge set into the upper approximation, lower approximation, boundary region, and negative region. The confidence degree of hyperedges in the lower approximation is 1. The confidence degree of hyperedges in the boundary region is between 0 and 1. Hyperedges whose confidence degree is 0 belong to the negative region. Hyperedges in the lower approximation are all retained because they are very helpful for classification. In contrast, hyperedges in the negative region are counterproductive for classification, so they are replaced. Hyperedges in the boundary region are handled with a confidence degree threshold: a hyperedge whose confidence degree is less than the threshold is replaced. In this way, hyperedge replacement becomes more targeted and effective.
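The division by confidence degree and the resulting replacement decision can be sketched in Python as follows. The confidence degrees are taken as given inputs, since their computation follows Definition 8; the function name `select_for_replacement` and the dictionary representation are assumptions for illustration.

```python
def select_for_replacement(confidence, threshold):
    """Split hyperedges by confidence degree and list those to replace.

    confidence: dict mapping a hyperedge id to its confidence degree in [0, 1].
    threshold: cut-off applied to hyperedges in the boundary region.
    """
    lower = [e for e, c in confidence.items() if c == 1]
    negative = [e for e, c in confidence.items() if c == 0]
    boundary = [e for e, c in confidence.items() if 0 < c < 1]
    # Negative-region hyperedges are always replaced; boundary ones only
    # when their confidence degree falls below the threshold.
    replace = negative + [e for e in boundary if confidence[e] < threshold]
    return lower, boundary, negative, replace
```

Hyperedges in the lower approximation never appear in the replacement list, which mirrors the rule that they are all retained.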
It is composed of three steps.(1)Set the confidence degree threshold of each hyperedge (in this paper ).(2)Find out those hyperedges whose confidence degree is under the threshold from the hyperedge set.
According to Definitions 7 and 8, one can calculate the confidence degree of each hyperedge following the three cases below ( denotes a hyperedge).
Case 1. If , then is not in any neighborhood of samples, as shown in Figure 6.
In this case, one can assume that is in the overlapped neighborhood of the nearest five samples. Then the confidence degree of can be calculated according to Formula (6).
Case 2 (). It means that samples surrounding have the same class with . Thus is helpful for the sample classification and should remain.
Case 3 (). Now, let us give an example below to explain the situation (see Figure 7).
According to Formula (4) and Definition 6, we know , , and . Then according to formula (8), and can also be calculated. However, as it is difficult to determine the nearest five samples surrounding , we cannot calculate the confidence degree of .
Hyperedges of this kind share their class with only a few of the samples surrounding them and therefore contribute little to classification. Thus, they should be replaced.
(3) Generate new hyperedges and replace the poor hyperedges.
One selects a hyperedge from the hyperedge set which should be replaced and generates a sample . Then a new hyperedge is initialized by using . After that, one can replace with . Repeat the process above until no hyperedge needs to be replaced.
3.4. Neighborhood Hypergraph Algorithm
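In this paper, sample classification and hyperedge replacement are based on the neighborhood radius of each sample. According to Definition 2, the neighborhood radius of a sample, denoted by $\delta$, is computed as follows [34]:
$$\delta = \min(\Delta(x_i, s)) + w \cdot \operatorname{range}(\Delta(x_i, s)),$$
where $x_i$ is a training sample, $\min(\Delta(x_i, s))$ denotes the minimal distance between $x_i$ and the remaining samples $s$ excluding $x_i$, $\operatorname{range}(\Delta(x_i, s))$ denotes the value domain of $\Delta(x_i, s)$, and $w$ is a weighting parameter.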
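As a minimal sketch of the radius computation, the radius is taken here as the minimal distance plus a weight $w$ times the spread of the distances. Two assumptions are made: the distance is Euclidean on numeric attributes, and the value domain of the distances is interpreted as the difference between the largest and the smallest distance. The function name `neighborhood_radius` is hypothetical.

```python
import math


def neighborhood_radius(index, samples, w):
    """Radius for samples[index]: min distance + w * (max - min distance)."""
    x = samples[index]
    # Distances from the chosen sample to every other training sample.
    dists = [math.dist(x, s) for j, s in enumerate(samples) if j != index]
    return min(dists) + w * (max(dists) - min(dists))
```

With `w` equal to 0 the neighborhood reaches only the nearest sample, and it widens as `w` grows.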
Here we give the NHyperGraph (see Algorithm 1).

There are two main parameters in the algorithm: the neighborhood radius and the confidence degree threshold. The former controls the number of hyperedges, which increases as the radius increases. The latter ensures the quality of the hyperedges: the higher the threshold, the better the hyperedges that are obtained. However, when the threshold is too high, almost all hyperedges are replaced, which defeats its purpose.
4. Experimental Design and Analysis
4.1. Datasets
To test the proposed algorithm, 18 UCI datasets were downloaded from the machine learning data repository of the University of California, Irvine. The imbalance rate ranges from 1.37 to 28.10. There are seven multiclass datasets and eleven two-class datasets. The multiclass datasets are converted into two-class imbalanced problems by labeling the union of one or more classes as the minority class and the union of one or more of the remaining classes as the majority class. Missing values of continuous features are filled with the average value of the feature, and missing values of nominal features are filled with the most frequent value. The datasets are outlined in Table 1, sorted by imbalance rate from low to high.
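The missing-value treatment described above can be sketched in Python as follows. The function name `impute` and the use of `None` to mark a missing entry are assumptions for illustration.

```python
from collections import Counter


def impute(column, nominal):
    """Fill None entries with the mode (nominal) or the mean (continuous)."""
    present = [v for v in column if v is not None]
    if nominal:
        # Most frequent observed value of the nominal feature.
        fill = Counter(present).most_common(1)[0][0]
    else:
        # Average of the observed values of the continuous feature.
        fill = sum(present) / len(present)
    return [fill if v is None else v for v in column]
```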
4.2. Experimental Evaluation in Imbalanced Domains
Traditional evaluation usually uses the confusion matrix shown in Table 2, where TP is the number of positive samples classified as positive, TN is the number of negative samples classified as negative, FN is the number of positive samples misclassified as negative, and FP is the number of negative samples misclassified as positive.
From Table 2, the following evaluation measures can be derived:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F\text{-value} = \frac{2RP}{R + P},$$
where $R$ and $P$ refer to Recall and Precision, respectively.
These three measures are called Precision, Recall, and $F$-value. Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. According to the formulas, Precision increases when FP decreases, and Recall increases when TP increases. In practice, however, the two goals conflict, so the $F$-value is used to consider them jointly: it is high only when both Precision and Recall are high.
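The measures can be computed from the confusion matrix counts as in the following Python sketch. It assumes the equally weighted form of the F-value, 2 × Precision × Recall / (Precision + Recall), and the standard definition of the G-mean as the square root of the product of the true positive rate and the true negative rate; the G-mean is reported in the experiments but not defined in the text, so that definition is an assumption. The function name `imbalance_metrics` is hypothetical.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Precision, Recall, F-value and G-mean from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)       # true negative rate
    f_value = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    return precision, recall, f_value, g_mean
```

The sketch does not guard against empty denominators, for example when no sample is predicted positive.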
Another appropriate metric that could be used to measure the performance of classification over imbalanced datasets is the receiver operating characteristic (ROC) graphics [37]. In these graphics, the tradeoff between the benefits (TP) and costs (FP) can be visualized, and it acknowledges the fact that the capacity of any classifier cannot increase the number of true positives without also increasing the false positives. The area under the ROC curve (AUC) [38] corresponds to the probability of correctly identifying which of the two stimuli is noise and which is signal plus noise. AUC provides a single number summary of the performance of learning algorithms.
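The probabilistic reading of the AUC can be made concrete with a short Python sketch that counts correctly ordered pairs. The function name `auc` and the convention of counting ties as one half are assumptions for illustration; the quadratic loop is chosen for clarity rather than speed.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

A classifier that scores every positive sample above every negative sample reaches 1.0, while random scoring gives about 0.5.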
4.3. Experimental Methods
To evaluate the performance of NHyperGraph, it is compared with several algorithms from the related literature: SVM and J48 (C4.5) [39] as implemented in Weka [40], NRSBoundary-SMOTE [12] + C4.5, SMOTE-RSB* [29] + C4.5, and CSENHN [7]. NHyperGraph is implemented in the Java programming language. Among these algorithms, SVM classifies the source datasets directly. J48 (C4.5) classifies the datasets after oversampling by NRSBoundary-SMOTE and SMOTE-RSB, with an oversampling rate of 100%. CSENHN is an ensemble method of cost-sensitive hypernetworks; it deals with datasets that are discretized by optimal class-dependent discretization (OCDD) [41]. In NHyperGraph, the value of $w$ is taken from the range [0.001–0.6]. We use 10-fold cross validation as the validation method.
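The 10-fold cross validation protocol can be sketched in Python as follows. The function name `k_fold_indices` and the fixed shuffling seed are assumptions; the sketch splits the data without stratification, whereas for highly imbalanced data a stratified split that keeps the class ratio in every fold is often preferred. The text does not state which variant was used.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

Each sample appears in exactly one test fold, so averaging a measure over the ten folds uses every sample once for testing.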
4.4. Experimental Results and Analysis
The comparative results of the algorithms on Precision, Recall, $F$-value, $G$-means, and AUC are shown in Tables 3 to 7.
To compare the performance of the 5 algorithms, their average values on the different indicators are shown in Figure 8.
Tables 3, 4, 5, 6, and 7 show that NHyperGraph performs better than the other four algorithms on most of the datasets mentioned above. Its average Precision is 0.7374, whereas that of the other four algorithms ranges between 0.5233 and 0.6816. Its average Recall is 0.9785, compared with 0.2858 to 0.8068 for the others. Its average $F$-value is 0.8173, compared with 0.3235 to 0.6695. Likewise, its average $G$-means and AUC are 0.8968 and 0.9152, compared with 0.3118 to 0.7475 and 0.6398 to 0.8509, respectively.
Tables 3–7 and Figure 8 show that the Precision of the proposed algorithm NHyperGraph is unstable. As mentioned before, hyperedge initialization depends on the degree of imbalance of the training set, so the generated hyperedges are inclined toward the minority class. Consequently, the proposed algorithm is not stable on Precision. Nevertheless, it performs better than SVM and J48 (C4.5) on Recall, $F$-value, $G$-means, and AUC.
Overall, the experimental results of NHyperGraph are better than those of the other algorithms. Because rough set theory is used in NHyperGraph, uncertain samples are processed more effectively, especially in the boundary region of the hyperedge set. Moreover, weights are calculated through the neighborhood rough set model, so more hyperedges are involved in deciding the class of a hyperedge, which improves accuracy. These two aspects account for the improved classification results of the proposed algorithm.
Because SMOTE oversamples all minority class samples, it shrinks the decision space of the majority class. Although this improves the Recall of the minority class, many majority class samples are misclassified as minority class, which lowers Precision. SMOTE-RSB filters the synthetic samples more strictly than SMOTE, so few synthetic samples are retained when datasets are highly imbalanced; compared with SMOTE, its improvement is therefore not obvious. NRSBoundary-SMOTE takes the neighborhood rough set into consideration and emphasizes resampling of the minority class samples that belong to the boundary region, and thus improves the $F$-value. NHyperGraph, in contrast, repeatedly replaces hyperedges according to the neighborhood rough set. The distribution of the hyperedge set gradually approaches the true distribution of the samples, which yields a more obvious improvement in classification performance. In addition, since CSENHN can only deal with discrete data, the information lost in discretization makes its performance worse than that of NHyperGraph.
5. Conclusion
In this paper, a new hypernetwork-based algorithm called NHyperGraph is proposed for classifying imbalanced datasets. First, the hyperedge set is divided according to rough set theory. Then, poor hyperedges are replaced with reference to the degree of imbalance in order to improve accuracy. The experimental results on 18 UCI datasets with different degrees of imbalance show that NHyperGraph clearly improves classification performance compared with the other four algorithms. However, NHyperGraph is time consuming because it must calculate the distances between hyperedges and samples. Reducing the running time of the algorithm is therefore our future work.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC) under Grant nos. 61309014 and 61379114, Natural Science Foundation Project of CQ CSTC under Grant no. cstc2013jcyjA40063, and Doctor Foundation of Chongqing University of Posts and Telecommunications under Grant no. A201208.