A Novel Algorithm for Imbalance Data Classification Based on Neighborhood Hypergraph

Hu, Feng; Liu, Xiao; Dai, Jin; Yu, Hong

doi:https://doi.org/10.1155/2014/876875

The Scientific World Journal

Research Article | Open Access

Volume 2014 | Article ID 876875 | https://doi.org/10.1155/2014/876875

Academic Editor: Maria Jose del Jesus

Received 25 Mar 2014

Revised 29 Jun 2014

Accepted 21 Jul 2014

Published 11 Aug 2014

Abstract

The classification of imbalanced data has received increasing attention. Many significant methods have been proposed and applied in many fields, but more efficient methods are still needed. Although the hypergraph is an efficient tool for knowledge discovery, it may not be powerful enough to deal with data in the boundary region. In this paper, the neighborhood hypergraph is presented, combining rough set theory and hypergraph. A novel classification algorithm for imbalanced data based on the neighborhood hypergraph is then developed, which is composed of three steps: initialization of hyperedges, classification of the training data set, and substitution of hyperedges. In an experiment of 10-fold cross validation on 18 data sets, the proposed algorithm achieves higher average accuracy than the compared algorithms.

1. Introduction

The imbalanced dataset problem in classification domains occurs when the number of instances that represent one class is much larger than that of the other classes. The minority class is usually more interesting from the point of view of the learning task. There are many situations in which imbalance occurs between classes, such as satellite image classification [1], risk management [2], and medical diagnosis [3, 4]. When studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may well be a critical mistake [5]. At present, the solutions for the problem of imbalanced dataset classification are developed at both the data and algorithmic levels [6]. At the data level, the objective is to rebalance the class distribution by resampling the data space, such as oversampling the minority class and undersampling the prevalent class. At the algorithm level, solutions try to adapt existing classifier learning algorithms to strengthen learning with regard to the minority class, such as cost-sensitive learning, ensemble learning, and hypernetwork [7].

Previous research improved resampling methods in many aspects and proposed some effective resampling algorithms. SMOTE is an intelligent oversampling algorithm proposed by Chawla et al. [8]. Its main idea is to form new minority class samples by interpolating between several minority class samples that lie together. Thus, the overfitting problem is avoided and the decision space for the minority class is spread further; meanwhile, the decision space for the majority class is reduced. Many researchers have since proposed improved methods. Dong and Wang [9] proposed Random-SMOTE, which differs from SMOTE in that it obtains new minority class samples by interpolating among three minority class samples. Yang et al. [10] proposed the ASMOTE algorithm, which chooses not only the minority class samples but also the majority class samples that are near a minority class sample, preventing synthetic samples from overlapping the majority class samples. Han et al. [11] proposed Borderline-SMOTE. Hu and Li [12] proposed the NRSBoundary-SMOTE algorithm, which expands the decision space for the minority class while shrinking the decision space for the majority class.
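The interpolation idea behind SMOTE can be illustrated with a minimal Python sketch. The function name `smote_sample` and its arguments are hypothetical, the neighbor is assumed to be another minority class sample chosen beforehand, and the neighbor search itself is omitted.

```python
import random


def smote_sample(x, neighbor, rng):
    """Return one synthetic point on the segment between x and neighbor.

    x and neighbor are equal-length lists of continuous feature values,
    both assumed to belong to the minority class.
    """
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
synthetic = smote_sample([0.0, 0.0], [2.0, 4.0], rng)
```

Because the same `gap` is used for every feature, the synthetic point always lies on the straight segment joining the two minority samples.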

In recent years, with the rapid development of ensemble methods for classification, they have been applied to imbalanced data classification. Ensemble learning is a machine learning paradigm where multiple learners (called base learners) are trained to solve the same problem [13]. Owing to their outstanding performance, ensemble methods are applied to imbalanced datasets in combination with other techniques. Chawla et al. developed the SMOTEBoost algorithm by integrating Adaboost (the most famous boosting algorithm) and the synthetic minority oversampling technique (SMOTE) [14]. Like SMOTEBoost, RUSBoost introduces data sampling into the Adaboost algorithm; however, RUSBoost applies random undersampling to the majority class, whereas SMOTEBoost creates synthetic new minority class instances by operating in the feature space [15]. Błaszczyński et al. integrated the selective data preprocessing method SPIDER with the Ivotes ensemble algorithm, developing the framework called IIvotes [16]. Besides, cost-sensitive learning has become an effective tool to solve the class imbalance problem, which involves two types: the binary classification problem and the multiclass classification problem. It can be implemented in two ways, rescaling and reweighting. Both aim at making the trained classification algorithms cost sensitive. Rescaling changes the distribution of samples in the training data. It has been used in cost-based sampling [17], REBALANCE [18], Rescalenew [19], and so on. Differing from rescaling, reweighting adjusts the class probability distribution in the classifier based on costs. It has been used in MetaCost [20], proposed by Domingos, and in AdaCost [21], which Fan et al. developed as an improvement of AdaBoost.

In the 1970s, Rumelhart and Norman proposed three types of human learning: accretion, tuning, and restructuring [22]. Based on their study, Professor Zhang proposed three basic principles of cognitive learning [23]: continuity, glocality, and compositionality. He used a hypergraph as the presentation form and proposed a hypernetwork model, which can be used for cognitive learning and memory. A hypernetwork is a probabilistic graph with a number of hyperedges. A hyperedge can be regarded as a component, a subject, or even a circuit. From the perspective of data, a hyperedge is the combination of sample attributes and class. So far, hypernetwork can deal only with discrete data, so a dataset must be discretized before it is used to build a hypernetwork classifier. A hypernetwork model includes three steps: initialization of a hypernetwork model according to the training dataset, evolutionary learning of the hypernetwork, and classification of the test dataset using the evolved hypernetwork. In step 1, a sample is used to generate many hyperedges by inheriting some attributes of the sample and its class. In step 2, operations of match, selection, and replacement are repeatedly executed on hyperedges. Learning starts from a randomly initialized hypernetwork; in each iteration, the fitness of each hyperedge is calculated for evaluating and ordering. Hyperedges with low fitness are replaced with newly generated hyperedges. In this way, hyperedges with high class discernibility in pattern space can be found by the hypernetwork [7]. After the above steps, a hypernetwork model is built and classifies the test data through joint probability.

Although hypernetwork has been widely used in solving various machine learning problems, it usually produces poor overall classification performance when dealing with class imbalance problems. Like most traditional classification algorithms, hypernetwork assumes that the class distribution of datasets is balanced. The goal of hypernetwork learning is to extract hyperedges (or decision rules) that cover as many samples as possible. Hyperedges that are critical for differentiating class membership are copied and added, while hyperedges with poor differential ability are discarded. However, within the context of class imbalance, many samples in the minority class are usually viewed as noise. Therefore, the number of hyperedges corresponding to the majority class significantly surpasses that of hyperedges corresponding to the minority class. As a result, most of the minority samples are misclassified in a traditional hypernetwork. Thus, this paper attempts to combine hypernetwork with rough set theory to address the problem.

Rough set theory is a powerful mathematical tool introduced by Pawlak [24–27] to deal with imprecise, uncertain, and vague information. It has been successfully applied to such fields as machine learning, data mining, intelligent data analysis, and control algorithm acquisition. Basically, the idea is to approximate a concept by three description sets, namely, the lower approximation, the upper approximation, and the boundary region. Rough set theory places the uncertain samples in the boundary region, which is obtained by subtracting the lower approximation from the upper approximation. Many researchers have applied rough set theory to process imbalanced data [28, 29].
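The three description sets can be sketched in Python for the classical case in which the universe is partitioned into equivalence classes. This is only an illustration of Pawlak's approximations, not the neighborhood-based variant developed later in the paper; the function name `approximations` and the toy data are hypothetical.

```python
def approximations(partition, target):
    """Classical rough set approximations of `target`.

    partition: iterable of equivalence classes (each an iterable of objects).
    target:    the concept to approximate (an iterable of objects).
    Returns (lower, upper, boundary) as sets.
    """
    target = set(target)
    lower, upper = set(), set()
    for block in partition:
        block = set(block)
        if block <= target:      # block lies entirely inside the concept
            lower |= block
        if block & target:       # block overlaps the concept
            upper |= block
    return lower, upper, upper - lower


lower, upper, boundary = approximations([{1, 2}, {3, 4}, {5}], {1, 2, 3})
```

In this toy case the class `{3, 4}` only partly belongs to the concept, so it ends up in the boundary region.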

The remainder of this paper is organized as follows. The basic concepts on neighborhood rough set models are shown in Section 2. The neighborhood hypergraph algorithm is developed in Section 3. Section 4 presents the experimental evaluation on 18 imbalanced UCI datasets [30] by 10-fold cross validation, which shows the validity of the proposed method. The paper is concluded in Section 5.

2. Hypergraph and Neighborhood Hypergraph Model

2.1. The Definition of Hypergraph

In 1970, Berge [31] used hypergraph to define hypernetwork. This was the first systematic establishment of undirected hypergraph theory, and it was applied to operations research through matroids.

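Definition 1 (see [31, 32]). Given a finite set $X = \{x_1, x_2, \ldots, x_n\}$ and a family $E = \{e_1, e_2, \ldots, e_m\}$ of subsets of $X$, if (a) $e_i \neq \emptyset$ $(i = 1, 2, \ldots, m)$ and (b) $\bigcup_{i=1}^{m} e_i = X$,

then the binary relation $H = (X, E)$ is defined as a hypergraph. The elements $x_1, x_2, \ldots, x_n$ of $X$ are defined as vertices of the hypergraph; $E$ is defined as the edge set of the hypergraph. Each $e_i$ $(i = 1, 2, \ldots, m)$ is defined as a hyperedge (see Figure 1).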
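Berge's definition requires that every hyperedge be nonempty and that the hyperedges together cover the vertex set. A minimal Python sketch of this check follows; the function name `is_hypergraph` and the representation of hyperedges as sets are assumptions made for illustration.

```python
def is_hypergraph(vertices, hyperedges):
    """Check the two conditions of a hypergraph in Berge's sense.

    (a) every hyperedge is nonempty;
    (b) the union of all hyperedges equals the vertex set.
    """
    vertices = set(vertices)
    edges = [set(e) for e in hyperedges]
    all_nonempty = all(edges)            # an empty set is falsy
    covered = set().union(*edges)        # union of all hyperedges
    return all_nonempty and covered == vertices


ok = is_hypergraph({1, 2, 3}, [{1, 2}, {2, 3}])
```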

2.2. The Neighborhood Hypergraph

Neighborhoods and neighborhood relations are a class of important concepts in topology. Lin [33] pointed out that neighborhood spaces are more general topological spaces than equivalence spaces and introduced neighborhood relation into rough set methodology. Hu et al. [34] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model. And then they used the model to build a uniform theoretic framework for neighborhood based classifiers. For the convenience of description, some basic concepts of the neighborhood rough set model are introduced here at first.

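Definition 2 (see [34]). Given arbitrary $x_i \in U$ and $B \subseteq C$, the neighborhood $\delta_B(x_i)$ of $x_i$ in the subspace $B$ is defined as
$$\delta_B(x_i) = \{x_j \mid x_j \in U,\ \Delta_B(x_i, x_j) \le \delta\},$$
where $\Delta$ is a metric function. For all $x_1, x_2, x_3 \in U$, it satisfies (1) $\Delta(x_1, x_2) \ge 0$, (2) $\Delta(x_1, x_2) = 0$ if and only if $x_1 = x_2$, (3) $\Delta(x_1, x_2) = \Delta(x_2, x_1)$, (4) $\Delta(x_1, x_3) \le \Delta(x_1, x_2) + \Delta(x_2, x_3)$.

Consider that $x_1$ and $x_2$ are two objects in an $N$-dimensional space $A = \{a_1, a_2, \ldots, a_N\}$, where $f(x, a_i)$ denotes the value of sample $x$ on the $i$th dimension $a_i$. Then, a general metric, named the Minkowski distance, is defined as
$$\Delta_P(x_1, x_2) = \left( \sum_{i=1}^{N} \left| f(x_1, a_i) - f(x_2, a_i) \right|^{P} \right)^{1/P}.$$
When $P = 2$, it is the Euclidean distance $\Delta_2$.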
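The metric and the neighborhood it induces can be sketched in Python as follows. The function names are hypothetical, samples are assumed to be lists of continuous values, and the neighborhood is taken to be the closed ball of radius `delta`, which is an assumption of this sketch.

```python
def minkowski(x1, x2, p=2):
    """Minkowski distance of order p; p=2 gives the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


def neighborhood(center, samples, delta, p=2):
    """Indices of the samples whose distance to `center` is at most delta."""
    return [i for i, s in enumerate(samples) if minkowski(center, s, p) <= delta]


samples = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
near_origin = neighborhood([0.0, 0.0], samples, delta=1.0)
```

With these toy samples only the first two points fall inside the radius, since the third lies at Euclidean distance 5 from the origin.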

However, the Euclidean distance can only be computed on continuous features; it is not valid for nominal features. Here, we compute distances on nominal features by using the value difference metric (VDM) proposed by Stanfill and Waltz [35] in 1986. The distance between two corresponding feature values is defined as follows:

$$\delta(V_1, V_2) = \sum_{i=1}^{n} \left| \frac{C_{1i}}{C_1} - \frac{C_{2i}}{C_2} \right|^{k}.$$

In the previous equation, $V_1$ and $V_2$ are the two corresponding feature values and $n$ is the number of classes. $C_1$ is the total number of occurrences of feature value $V_1$ and $C_{1i}$ is the number of occurrences of feature value $V_1$ for class $i$. A similar convention applies to $C_{2i}$ and $C_2$. $k$ is a constant, which is usually set to 1.
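A small Python sketch of the value difference metric for one nominal feature is given below. It assumes the commonly used form of VDM, the sum over classes of the absolute difference between class-conditional frequencies raised to the power `k`; the function name `vdm` and the toy data are hypothetical.

```python
from collections import Counter


def vdm(values, labels, v1, v2, k=1):
    """Value difference metric between two values of one nominal feature.

    values: the feature value of every training sample.
    labels: the class of every training sample (same length as values).
    Both v1 and v2 are assumed to occur at least once in `values`.
    """
    total = Counter(values)                     # occurrences of each value
    per_class = Counter(zip(values, labels))    # occurrences per (value, class)
    return sum(
        abs(per_class[(v1, c)] / total[v1] - per_class[(v2, c)] / total[v2]) ** k
        for c in set(labels)
    )


values = ["a", "a", "b", "b"]
labels = [0, 1, 1, 1]
d = vdm(values, labels, "a", "b")
```

Two values that are distributed identically over the classes get distance 0, so the metric measures how differently two nominal values predict the class.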

Definition 3. Given is a neighborhood hypergraph, then is the vertex set of , indicating that it has vertices, where denotes a sample. is hyperedge set, and each in is a hyperedge which connects vertices . is the attribute set, and is the decision set.

In some of the literature, such as [36], vertices of a hypergraph represent the attributes of samples. In this paper, however, vertices of the hypergraph denote samples, and different samples on one hyperedge have the same attribute set. An example of a neighborhood hypergraph is shown in Figure 2.

Definition 4. Given is a sample, where denote the values of at the attributes set , denotes the decisions of , and is the radius of a neighborhood.

Definition 5. Given and the attribute set , the hyperedges which are included by sample , consisting of the set of hyperedges, are defined as

Definition 6. Given and attributes set , for arbitrary , the sample set which related to is defined as . Given arbitrary and attributes set , the sample set related to is defined as .

Definition 7. Given , for arbitrary , one knows , where denotes minority decision and denotes majority decision. Then sets of decision and decision in hyperedge set are, respectively, defined as thus, the degree of imbalance for is defined as

Definition 8. Given , for arbitrary , assume , where denotes minority decision and denotes majority decision. is sample set related to hyperedge on attributes set . According to decisions , is divided into equivalence classes: ; when , the confidence degree of is defined as follows.(1)If , then (2)If , then

Definition 9. Given , is the attribute set of samples, and is the samples decision. For arbitrary hyperedge set , according to decisions , the hyperedge set is divided into equivalence classes: . For arbitrary , the upper approximation, lower approximation, boundary region, and negative domains of decision related to set of attributes are, respectively, defined as

The lower approximation of decision related to the set of attributes is also called the positive domain, denoted by . The size of the positive domain reflects the separable degree of the classification problem in a given attribute space; the bigger the positive domain, the smaller the boundary.

To explain how to divide the upper approximation, lower approximation, and boundary region, here we give an example (Example 1) in Figure 3.

In Figure 3, the hyperedge is simultaneously in the neighborhoods of samples and ; in other words, it links and . As the figure shows, a hyperedge is in the neighborhood of a sample exactly when the symbol of the hyperedge lies inside that neighborhood.

First, one calculates the sample set of each hyperedge: , , according to Formula (4) and Definition 6. Second, one calculates the confidence degrees of each hyperedge, according to Formula (8): , , . Third, according to Formula (9), one gets the final result on Figure 3: the upper approximation , the lower approximation , and the boundary region .

3. Neighborhood Hypergraph Classification Algorithm

The traditional hypernetwork model is limited in two respects: (1) it requires discretized datasets; (2) it applies no special processing to the samples in the boundary region. Combining rough set theory with hypernetwork brings several advantages: (1) the hypernetwork can deal directly with numeric data, which avoids information loss; (2) in the process of hyperedge learning, the hyperedge set is divided into three parts, namely, the upper approximation, the lower approximation, and the boundary region; (3) hyperedges in the boundary region are processed specially, which improves classification accuracy.

The proposed algorithm aims at tackling imbalanced data classification problem including two aspects as follows.

(1) Improve the degree of imbalance of the hyperedge set. In the traditional model, the class of a hyperedge is inherited directly from the sample, which does nothing to improve the degree of imbalance of the hyperedge set. In this paper, when hyperedges are initialized, the classes of a fraction of the hyperedges are determined according to the classes of the surrounding samples, which reduces the degree of imbalance of the hyperedge set to some extent.

(2) Set a classification condition. The classification process of the traditional hypernetwork does not take the degree of imbalance of the hyperedge set into consideration, resulting in low accuracy on the minority class. In this paper, a threshold equal to the square of the degree of imbalance is set as a classification condition. This makes the classifier pay more attention to the minority class and thus deal with the class imbalance problem appropriately.

The flow chart of the algorithm is shown in Figure 4. Each part of the flow chart is analyzed in the remainder of this section.

3.1. Hyperedge Initialization

Hyperedges are generated based on the samples, which preserves the real distribution of the sample set and thereby provides a foundation for hyperedge selection. Meanwhile, some attribute values can be changed while a hyperedge is generated. Thus, more decision rules are generated for sample classification, which can improve the accuracy of sample classification to some extent.

In this paper, the attribute set of hyperedges is (namely, the universal set of attributes for samples); that is to say, a hyperedge has exactly the form of a sample and is denoted by a small dot in the figures. In hyperedge initialization, a value is assigned to each attribute and the class of each hyperedge is determined.

In order to process imbalanced dataset, two classes will be considered in the following definitions.

Definition 10. Given , for arbitrary , assume , where denotes minority decision and denotes majority decision. Then sets of decision and decision in hyperedge set are, respectively, defined as So the degree of imbalance for is defined as

In this paper, the process of hyperedge generation consists of two stages: attribute inheritance and class confirmation.

Attribute inheritance: a hyperedge and a sample have the same attribute set, and the attribute values of the hyperedge are assigned partly based on the sample. Seven tenths of the attributes of a hyperedge are selected randomly and inherit the corresponding attribute values of the sample. The remaining attribute values are generated randomly within the range of the corresponding attribute values of the sample. In Figure 5, is a sample and is a hyperedge.
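The attribute inheritance stage can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: attributes are assumed continuous, `ranges` is a hypothetical list of (low, high) bounds per attribute, the random values are drawn uniformly, and the number of inherited attributes is rounded to the nearest integer.

```python
import random


def init_hyperedge(sample, ranges, rng, inherit_ratio=0.7):
    """Generate the attribute values of one hyperedge from a sample.

    A random subset of about inherit_ratio of the attributes copies the
    sample's values; every other attribute gets a uniform random value
    inside its (low, high) range.
    """
    n = len(sample)
    n_inherit = round(inherit_ratio * n)
    inherited = set(rng.sample(range(n), n_inherit))
    return [
        sample[i] if i in inherited else rng.uniform(*ranges[i])
        for i in range(n)
    ]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sample = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7, 0.3, 0.8, 0.6, 0.0]
edge = init_hyperedge(sample, [(0.0, 1.0)] * 10, rng)
```

Class confirmation, the second stage, is not covered by this sketch.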

Definition 11. Given , for arbitrary , the majority sample set and the minority sample set related to hyperedge in the sample set are defined as

Class confirmation: the class of a hyperedge is confirmed by the whole dataset. There are two cases as follows.(a)According to formula (11), if , then ;(b)otherwise, generate a random number in . If (, , respectively, represent majority sample set and minority sample set), then ; else, .

3.2. Classification on Training Set

One uses the generated hyperedge set to classify the training set. Through analyzing the classification result, one can know the classification accuracy of hyperedge set and determine whether to replace the hyperedge for the hyperedge set or not. By repeating the process of training sample classification and hyperedge replacement, we can make the distribution of hyperedge set approach the real distribution of training sample set gradually.

Definition 12. Given , the majority hyperedge set and minority hyperedge set in neighborhood of sample are defined as

Two factors should be considered when the neighborhood hypergraph is used to classify samples: (1) the numbers of majority hyperedges and minority hyperedges in the neighborhood of a sample; (2) the degree of imbalance of the hyperedge set. Combining the two, the classification method is as follows.

Given sample set , for arbitrary and attribution set ,

(1) if , then ;

(2) if , then .

One uses the classification rules above to classify the training set. If the accuracy is higher than 0.95, one can output the hyperedge set. Otherwise, the hyperedge replacement operation should be adopted (see Section 3.3).

In experimental evaluation, we conclude that is a good choice to enhance the accuracy of minority class, while is a poor one to classify the minority class samples.

3.3. Hyperedge Replacement

In the process of hyperedge initialization, one generates part of the attribute values randomly. As a result, some of hyperedges are not suitable for sample classification. In order to acquire better performance, one should replace the poor hyperedges by generating new hyperedges, namely, hyperedge replacement.

The algorithm divided hyperedge set into upper approximation, lower approximation, boundary region, and negative region. The confidence degree of hyperedges in lower approximation is 1. The confidence degree of hyperedges in boundary region is between 0 and 1. Hyperedges whose confidence degree is 0 belong to negative region. Hyperedges in lower approximation are all retained because they are very helpful for classification. On the contrary, since hyperedges in negative region are counteractive for classification, they will be replaced. For the hyperedges in boundary region, they will be dealt with by a threshold . When the confidence degree of a hyperedge is less than , it will be replaced. Through the above, one can enhance the pertinence and validity of hyperedge replacement.
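The replacement rule described above can be sketched in Python. The confidence degrees are assumed to be already computed; the function names `region` and `to_replace` and the example values are hypothetical.

```python
def region(confidence):
    """Map a confidence degree in [0, 1] to its region."""
    if confidence == 1:
        return "lower"      # lower approximation: always retained
    if confidence == 0:
        return "negative"   # negative region: always replaced
    return "boundary"       # boundary region: decided by the threshold


def to_replace(confidences, threshold):
    """Indices of hyperedges that should be replaced."""
    return [
        i for i, c in enumerate(confidences)
        if region(c) == "negative" or (region(c) == "boundary" and c < threshold)
    ]


confidences = [1.0, 0.8, 0.3, 0.0]
poor = to_replace(confidences, threshold=0.5)
```

Here the third hyperedge is in the boundary region but falls below the threshold, so it is marked for replacement together with the one in the negative region.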

Hyperedge replacement is composed of three steps. (1) Set the confidence degree threshold of each hyperedge (in this paper ). (2) Find the hyperedges in the hyperedge set whose confidence degree is under the threshold.

According to Definitions 7 and 8, one can calculate the confidence degree of each hyperedge following the three cases below ( denotes a hyperedge).

Case 1. If , then is not in any neighborhood of samples, as shown in Figure 6.

In this case, one can assume that is in the overlapped neighborhood of the nearest five samples. Then the confidence degree of can be calculated according to Formula (6).

Case 2 (). It means that samples surrounding have the same class with . Thus is helpful for the sample classification and should remain.

Case 3 (). Now, let us give an example below to explain the situation (see Figure 7).

According to Formula (4) and Definition 6, we know , , and . Then according to formula (8), and can also be calculated. However, as it is difficult to determine the nearest five samples surrounding , we cannot calculate the confidence degree of .

Hyperedges of this kind share their class with only a few of the samples surrounding them, which leads to poor classification. Thus, they should be replaced.

(3) Generate new hyperedges and replace the poor hyperedges.

One selects a hyperedge from the hyperedge set which should be replaced and generates a sample . Then a new hyperedge is initialized by using . After that, one can replace with . Repeat the process above until no hyperedge needs to be replaced.

3.4. Neighborhood Hypergraph Algorithm

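In this paper, sample classification and hyperedge replacement are based on the neighborhood radius of a sample. According to Definition 2, the computational formula of the neighborhood radius of a sample, denoted by $\delta$, is as follows [34]:
$$\delta = \min(\Delta(x_i, s)) + w \cdot \mathrm{range}(\Delta(x_i, s)), \quad 0 \le w \le 1,$$
where $x_i$ is a training sample, $\min(\Delta(x_i, s))$ denotes the minimal value of distance between $x_i$ and the remaining samples excluding $x_i$, and $\mathrm{range}(\Delta(x_i, s))$ denotes the value domain of $\Delta(x_i, s)$.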
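A per-sample radius of this kind can be sketched in Python. The sketch assumes the form used for neighborhood classifiers in [34], namely the minimal distance to the other samples plus a weight `w` times the spread of those distances; taking the spread as maximum minus minimum, the name `neighborhood_radius`, and the Euclidean default metric are all assumptions of this illustration.

```python
import math


def neighborhood_radius(i, samples, w, dist=math.dist):
    """Radius of the neighborhood of samples[i].

    Computed as min(d) + w * (max(d) - min(d)), where d holds the
    distances from samples[i] to every other sample.
    """
    d = [dist(samples[i], s) for j, s in enumerate(samples) if j != i]
    return min(d) + w * (max(d) - min(d))


samples = [[0.0], [1.0], [4.0]]
radius = neighborhood_radius(0, samples, w=0.5)
```

A small `w` keeps the radius close to the nearest-sample distance, so each neighborhood stays local.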

Here we give the N-HyperGraph (see Algorithm 1).

Input: the training sample set: , the radius of neighborhood: .
Output: hyper-edge set: .
Step 1. (Initialization)
    ; //Initialize the hyper-edge set.
    According to the formula (2), (14), calculate the radius of each sample.
    FOR each in X DO
        ; //Count the number of hyper-edges generated by each sample.
        WHILE () //Each sample generates five hyper-edges.
            Generate hyper-edge according to and seven tenths of the attributes of inherit the attribute values of ;
            Calculate the distance between and each sample according to Formula (2) respectively;
            Calculate and respectively, according to Definitions 10 and 11;
            IF THEN ;
            ELSE ; // is a random number in .
                Calculate and respectively, according to Definition 10;
                IF THEN ;
                ELSE ;
                END IF
            END IF
            ++;
        END WHILE
    END FOR
Step 2. (Training Set Classification)
    Calculate of hyper-edge set according to Definition 7 and Formula (6);
    FOR each in DO
        Calculate and according to Definition 12;
        IF THEN ;
        ELSE ;
        END IF
    END FOR
    Calculate the classification accuracy of training data set: Train-accuracy.
    IF Train-accuracy > 0.95 THEN GOTO Step 4;
    ELSE GOTO Step 3;
    END IF
Step 3. (Hyper-edge Replacement)
    ; //Number of hyper-edges that should be replaced.
    FOR each in DO
        Calculate the confidence-degree of according to Definition 8 and Formula (8);
        IF THEN ++;
        END IF
    END FOR
    WHILE ()
        Generate a new hyper-edge according to Step 1;
        −−;
    END WHILE
    GOTO Step 2;
Step 4. (Return Hypergraph)
    RETURN .

There are two main parameters in the algorithm: the radius of the neighborhood and the threshold of the confidence degree. The radius controls the number of hyperedges in each neighborhood, which grows as the radius increases. The threshold ensures the quality of hyperedges: the higher the threshold, the better the retained hyperedges. However, when the threshold is too large, almost all hyperedges are replaced, which defeats its purpose.

4. Experimental Designing and Analysis

4.1. Datasets

In order to test the proposed algorithm, 18 UCI datasets were selected and downloaded from the machine learning data repository of the University of California at Irvine. The imbalance rate ranges from 1.37 to 28.10. There are seven multiclass datasets and eleven two-class datasets. Multiclass datasets are modified to obtain two-class imbalance problems: one or more classes are merged and labeled as the minority class, and one or more of the remaining classes are merged and labeled as the majority class. Missing values of continuous features are filled with the average value of the feature; missing values of nominal features are filled with the value that appears most frequently. The datasets are outlined in Table 1 and sorted by imbalance rate from low to high.
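The missing value treatment described above can be sketched in Python for a single feature column. Missing entries are assumed to be marked with `None`; the function name `impute` and the `nominal` flag are hypothetical.

```python
from collections import Counter


def impute(column, nominal):
    """Fill missing entries (None) of one feature column.

    Continuous features use the mean of the observed values;
    nominal features use the most frequent observed value.
    """
    present = [v for v in column if v is not None]
    if nominal:
        fill = Counter(present).most_common(1)[0][0]
    else:
        fill = sum(present) / len(present)
    return [fill if v is None else v for v in column]


ages = impute([1.0, None, 3.0], nominal=False)
colors = impute(["a", "b", "a", None], nominal=True)
```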

4.2. Experimental Evaluation in Imbalanced Domains

The traditional evaluation usually uses the confusion matrix, shown in Table 2, where TP means the number of positive samples that are classified as positive, TN means the number of negative samples that are classified as negative, FN means the number of positive samples that are misclassified, and FP means the number of negative samples that are misclassified.

From Table 2, some useful evaluation measures can be derived as follows.

$$\text{Precision} = \frac{TP}{TP + FP}; \quad \text{Recall} = \frac{TP}{TP + FN}; \quad F\text{-value} = \frac{2 \times P \times R}{P + R},$$
where $P$ and $R$ refer to Precision and Recall, respectively.

These three measures are called Precision, Recall, and $F$-value. Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved. From the previous formulas, decreasing FP increases Precision and increasing TP increases Recall. In practice, however, the two goals conflict, so the $F$-value is used to consider them jointly. The $F$-value is high only when both Precision and Recall are high.
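The three measures can be computed from the confusion matrix counts with a short Python sketch. The $F$-value is assumed here to be the usual harmonic mean of Precision and Recall; the function name and the convention of returning 0.0 for an empty denominator are choices made for this illustration.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, Recall, and F-value from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_value


p, r, f = precision_recall_f(tp=5, fp=0, fn=5)
```

In this example Precision is perfect but Recall is only one half, and the harmonic mean pulls the $F$-value toward the lower of the two.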

Another appropriate metric that could be used to measure the performance of classification over imbalanced datasets is the receiver operating characteristic (ROC) graphics [37]. In these graphics, the tradeoff between the benefits (TP) and costs (FP) can be visualized, and it acknowledges the fact that the capacity of any classifier cannot increase the number of true positives without also increasing the false positives. The area under the ROC curve (AUC) [38] corresponds to the probability of correctly identifying which of the two stimuli is noise and which is signal plus noise. AUC provides a single number summary of the performance of learning algorithms.
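The probabilistic reading of AUC can be made concrete with a small Python sketch that counts, over all positive and negative pairs, how often the positive sample receives the higher score. The function name `auc`, the 0/1 label encoding, and the convention of counting ties as one half are assumptions of this illustration; the quadratic pair loop is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: classifier scores, higher meaning more likely positive.
    labels: 1 for positive (minority) samples, 0 for negative samples.
    Ties between a positive and a negative score count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```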

4.3. Experimental Methods

In order to evaluate the performance of N-HyperGraph, it is compared with several algorithms from the related literature: SVM and J48 (C4.5) [39] as implemented in Weka [40], NRSBoundary-SMOTE [12]+C4.5, SMOTE-RSB* [29]+C4.5, and CS-EN-HN [7]. N-HyperGraph is implemented in the Java programming language. Among these algorithms, SVM classifies the source datasets directly. J48 (C4.5) classifies the datasets after oversampling by NRSBoundary-SMOTE and SMOTE-RSB*. The oversampling rate is 100%. CS-EN-HN is an ensemble method of cost-sensitive hypernetworks. It deals with datasets that are discretized by optimal class-dependent discretization (OCDD) [41]. In N-HyperGraph, the value of $w$ is taken in the range [0.001, 0.6]. 10-fold cross validation is used as the validation method.
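The 10-fold cross validation protocol can be sketched in Python as a generator of train and test index lists. This is a plain shuffled split; whether the original experiments used stratified folds is not stated, so no stratification is assumed, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    The n sample indices are shuffled once and dealt into k folds;
    each fold serves as the test set exactly once.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(25))
```

Each evaluation measure is then computed on every test fold and averaged over the ten folds.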

4.4. Experimental Results and Analysis

Comparative experimental results on Precision, Recall, $F$-value, $G$-means, and AUC for each algorithm are shown in Tables 3 to 7.

In order to compare the performance of the 5 algorithms, their average results on the different indicators are shown in Figure 8.

Tables 3, 4, 5, 6, and 7 show that N-HyperGraph performs better than the other four algorithms on most of the datasets mentioned above. The average Precision rises to 0.7374, while it ranges between 0.5233 and 0.6816 for the other four algorithms. The average Recall increases to 0.9785, while it ranges between 0.2858 and 0.8068 for the others. The average $F$-value reaches 0.8173, while it ranges between 0.3235 and 0.6695 for the others. Meanwhile, the average values of $G$-means and AUC increase to 0.8968 and 0.9152, while the values for the others range from 0.3118 to 0.7475 and from 0.6398 to 0.8509, respectively.

Tables 3–7 and Figure 8 show that the Precision of the proposed algorithm N-HyperGraph is unsteady. As mentioned before, the process of hyperedge initialization is based on the degree of imbalance of the training set. Because the generation of hyperedges depends on the degree of imbalance, the generated hyperedges incline toward the minority samples. Thus, the proposed algorithm is not steady on Precision. However, it works better than SVM and J48 (C4.5) on Recall, $F$-value, $G$-means, and AUC.

Overall, the experimental results of N-HyperGraph are better than those of all the other algorithms. Since rough set theory is used in N-HyperGraph, it processes uncertain samples more effectively, especially in the boundary region of the hyperedge set. Moreover, weights are calculated through the neighborhood rough set model, so more hyperedges are involved in the class decision of a hyperedge, which improves the accuracy. These two aspects of the proposed algorithm improve the classification results.

As SMOTE oversamples all minority class samples, it decreases the decision space of the majority class. Although it can improve the Recall of the minority class, many majority class samples are misclassified as minority class, which decreases Precision. SMOTE-RSB filters the synthetic samples more strictly than SMOTE, so few synthetic samples are retained when datasets are highly imbalanced. Thus, compared with SMOTE, its improvement is not obvious. NRSBoundary-SMOTE takes the neighborhood rough set into consideration and emphasizes resampling of minority class samples that belong to the boundary region and thus improves the $F$-value. In contrast, N-HyperGraph replaces hyperedges repeatedly according to the neighborhood rough set. The distribution of the hyperedge set gradually approaches the true distribution of samples, which yields a more obvious improvement in classification performance. Besides, since CS-EN-HN can deal only with discrete data, the information loss caused by discretization makes its performance worse than that of N-HyperGraph.

5. Conclusion

In this paper, a new algorithm based on hypernetwork, called N-HyperGraph, is proposed to solve the problem of classifying imbalanced datasets. First, the hyperedge set is divided according to rough set theory. Then, poor hyperedges are replaced, taking the degree of imbalance into account, in order to improve the accuracy. The experimental results on 18 UCI datasets with different degrees of imbalance show that the classification results of the proposed algorithm N-HyperGraph improve obviously in contrast with the other four algorithms. However, N-HyperGraph is time consuming because the distance between each hyperedge and each sample must be calculated. Thus, reducing the running time of the algorithm is our future work.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant nos. 61309014 and 61379114, Natural Science Foundation Project of CQ CSTC under Grant no. cstc2013jcyjA40063, and Doctor Foundation of Chongqing University of Posts and Telecommunications under Grant no. A2012-08.

References

J. M. Malof, M. A. Mazurowski, and G. D. Tourassi, “The effect of class imbalance on case selection for case-based classifiers: an empirical study in the context of medical decision support,” Neural Networks, vol. 25, pp. 141–145, 2012.
G. Wang, “Asymmetric random subspace method for imbalanced credit risk evaluation,” in Software Engineering and Knowledge Engineering: Theory and Practice, vol. 114 of Advances in Intelligent and Soft Computing, pp. 1047–1053, Springer, Berlin, Germany, 2012.
S. Suresh, N. Sundararajan, and P. Saratchandran, “Risk-sensitive loss functions for sparse multi-category classification problems,” Information Sciences, vol. 178, no. 12, pp. 2621–2638, 2008.
A. Addis, G. Armano, and E. Vargiu, “Experimentally studying progressive filtering in presence of input imbalance,” in Knowledge Discovery, Knowledge Engineering and Knowledge Management, vol. 272, pp. 56–71, Springer, Berlin, Germany, 2013.
F. Provost, “Machine learning from imbalanced data sets,” in Proceedings of the Workshop Learning from Imbalanced Data Sets (AAAI '00), pp. 1–3, 2000.
Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, “Cost-sensitive boosting for classification of imbalanced data,” Pattern Recognition, vol. 40, no. 12, pp. 3358–3378, 2007.
K. W. Sun, Evolutionary Hypernetwork Based Classification of High Dimensional and Imbalanced Data, Chongqing University of Posts and Telecommunications, Chongqing, China, 2013 (Chinese).
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, no. 1, pp. 321–357, 2002.
Y. Dong and X. Wang, “A new over-sampling approach: random-SMOTE for learning from imbalanced data sets,” in Proceedings of the 5th International Conference on Knowledge Science, Engineering and Management (KSEM '11), pp. 343–352, Springer, Berlin, Germany, 2011.
Z. M. Yang, L. Y. Qiao, and X. Y. Peng, “Research on data mining method for imbalanced dataset based on improved SMOTE,” Acta Electronica Sinica B, vol. 35, no. 2, pp. 22–26, 2007 (Chinese).
H. Han, W. Y. Wang, and B. H. Mao, “Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning,” Advances in Intelligent Computing, vol. 2, no. 5, pp. 878–887, 2005.
F. Hu and H. Li, “A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE,” Mathematical Problems in Engineering, vol. 2013, Article ID 694809, 10 pages, 2013.
Z. H. Zhou, “Ensemble learning,” in Encyclopedia of Biometrics, S. Z. Li, Ed., pp. 270–273, Springer, Berlin, Germany, 2009.
N. V. Chawla, A. Lazarevic, L. O. Hall, and K. W. Bowyer, “SMOTEBoost: improving prediction of the minority class in boosting,” in Knowledge Discovery in Databases: PKDD, vol. 2838 of Lecture Notes in Computer Science, pp. 107–119, Springer, Berlin, Germany, 2003.
C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “RUSBoost: improving classification performance when training data is skewed,” in Proceedings of the 19th International Conference on Pattern Recognition (ICPR '08), pp. 1–4, Tampa, Fla, USA, December 2008.
J. Błaszczyński, M. Deckert, J. Stefanowski, and S. Wilk, “IIvotes ensemble for imbalanced data,” Intelligent Data Analysis, vol. 16, no. 5, pp. 777–801, 2012.
P. K. Chan and S. J. Stolfo, “Toward scalable learning with non-uniform class and cost distribution: a case study in credit card fraud detection,” in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pp. 164–168, 1998.
C. Elkan, “The foundations of cost-sensitive learning,” in Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI '01), pp. 973–978, New York, NY, USA, August 2001.
Z. Zhou and X. Liu, “On multi-class cost-sensitive learning,” Computational Intelligence. An International Journal, vol. 26, no. 3, pp. 232–257, 2010.
P. Domingos, “A general method for making classifiers cost-sensitive,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155–164, 1999.
W. Fan, S. Stolfo, and J. Zhang, “AdaCost: misclassification cost-sensitive boosting,” in Proceedings of the 16th International Conference on Machine Learning, pp. 97–105, 1999.
D. E. Rumelhart and D. A. Norman, Accretion, Tuning and Restructuring: Three Modes of Learning, Defense Technical Information Center, Fort Belvoir, Va, USA, 1976.
B. Zhang, “Hypernetworks: a molecular evolutionary architecture for cognitive learning and memory,” IEEE Computational Intelligence Magazine, vol. 3, no. 3, pp. 49–63, 2008.
Z. Pawlak, “Rough sets,” International Journal of Computer and Information Sciences, vol. 11, no. 5, pp. 341–356, 1982.
Z. Pawlak and A. Skowron, “Rudiments of rough sets,” Information Sciences, vol. 177, no. 1, pp. 3–27, 2007.
Z. Pawlak and A. Skowron, “Rough sets: some extensions,” Information Sciences, vol. 177, no. 1, pp. 28–40, 2007.
Z. Pawlak and A. Skowron, “Rough sets and Boolean reasoning,” Information Sciences, vol. 177, no. 1, pp. 41–73, 2007.
J. Liu, Q. Hu, and D. Yu, “A weighted rough set based method developed for class imbalance learning,” Information Sciences, vol. 178, no. 4, pp. 1235–1256, 2008.
E. Ramentol, Y. Caballero, R. Bello, and F. Herrera, “SMOTE-RSB_*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory,” Knowledge and Information Systems, vol. 33, no. 2, pp. 245–265, 2012.
C. Blake, E. Keogh, and C. J. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998.
C. Berge, Graphs and Hypergraphs, Elsevier, New York, NY, USA, 1973.
C. Berge, Hypergraphs: The Theory of Finite Sets, Elsevier Science, Amsterdam, The Netherlands, 1989.
T. Y. Lin, “Granular computing on binary relations I: data mining and neighborhood systems,” in Rough Sets in Knowledge Discovery, vol. 1, pp. 289–318, 1998.
Q. Hu, D. Yu, and Z. Xie, “Neighborhood classifiers,” Expert Systems with Applications, vol. 34, no. 2, pp. 866–876, 2008.
C. Stanfill and D. Waltz, “Toward memory-based reasoning,” Communications of the ACM, vol. 29, no. 12, pp. 1213–1228, 1986.
J. Wang, L. X. Jin, and K. W. Sun, “Chinese text categorization based on evolutionary hyper network,” Journal of Jiangsu University, vol. 34, no. 2, pp. 196–201, 2013.
A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
J. Huang and C. X. Ling, “Using AUC and accuracy in evaluating learning algorithms,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
X. Wu, V. P. Kumar, and R. S. Quinlan, “Top 10 algorithms in data mining,” Knowledge and Information Systems, vol. 14, no. 1, pp. 1–37, 2008.
“Wikipedia Weka (machine learning),” http://en.wikipedia.org/wiki/Weka.
L. L. Liu, A. K. C. Wang, and Y. Wang, “A global optimal algorithm for class-dependent discretization of continuous data,” Intelligent Data Analysis, vol. 8, no. 2, pp. 151–170, 2004.

Copyright

Copyright © 2014 Feng Hu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
