Abstract

The problem of classification in incomplete information systems is a hot issue in intelligent information processing. The hypergraph is a new intelligent method for machine learning. However, it is hard to process an incomplete information system with the traditional hypergraph, for two reasons: (1) hyperedges are generated randomly in the traditional hypergraph model; (2) existing methods are not designed to handle the missing values in an incomplete information system. In this paper, we propose a novel classification algorithm for incomplete information systems based on the hypergraph model and rough set theory. First, we initialize the hypergraph. Second, we classify the training set using the neighborhood hypergraph. Third, under the guidance of rough set theory, we replace the poor hyperedges. After that, we obtain a good classifier. The proposed approach is tested on 15 data sets from the UCI machine learning repository and compared with existing methods such as C4.5, SVM, NaiveBayes, and NN. The experimental results show that the proposed algorithm performs better in terms of Precision, Recall, AUC, and F-measure.

1. Introduction

Many information systems in real life are incomplete information systems [1]. When the precise value of some attributes in an information system is not known, that is, missing or only partially known, such a system is called an incomplete information system (IIS). The problem of classification in incomplete information systems is a hot issue in the field of intelligent information processing. There are several approaches to dealing with incomplete information systems. One is to remove samples with missing values. Another is to replace each missing value with the most common value [2]. These approaches are simple, but they might distort the original distribution of the data [3]. More complex approaches have been presented in the literature. Among these data analysis theories and methods, rough sets [4] are the most frequently used. Several extended rough set models [5, 6] deal with incomplete information systems, based on relations such as the tolerance relation, the limited tolerance relation, and the nonsymmetric similarity relation.

The hypernetwork was first proposed by Sheffi [7]. It has been presented as a probabilistic model for learning higher-order correlations, using a hypergraph structure that consists of a large number of hyperedges. A hypernetwork can be represented as a hypergraph. Previous studies have shown that a hypernetwork can be evolved to solve various machine learning problems. Segovia-Juarez et al. and Wang et al. [8, 9] use the hypernetwork model to realize DNA molecules; Kim and Zhang [10] use the hypernetwork for pattern classification.

Previous research has shown that the original hypergraph model performs well in classification. However, it still has some shortcomings: (1) the conventional hypergraph can only deal with discrete data, so continuous data must be discretized first; (2) the traditional hypergraph model creates new hyperedges randomly. For an incomplete information system, the missing values in a new hyperedge must also be filled in. Because the hypergraph relies on strategies such as random filling of attribute values and random replacement of hyperedges during this process, the decision and classification ability on the training set is likely to suffer.

To address the problems mentioned above, we introduce the neighborhood rough set. Rough set theory, proposed by Pawlak in 1982 [11–13], can be seen as a mathematical approach to vagueness. It has been successfully applied to various fields such as pattern recognition, machine learning, signal analysis, intelligent systems, decision analysis, knowledge discovery, and expert systems. The core concepts of rough set theory are approximations. Using the lower and upper approximations, knowledge hidden in information systems may be discovered and expressed in the form of decision rules: certain rules can be induced directly from the lower approximation, and possible rules can be derived from the upper approximation. The study of approximation spaces has therefore developed widely. Once we apply rough set theory to the hypergraph, we can supervise the hyperedge replacement process and improve the generalization ability of the hyperedges as well. Lin [14] pointed out that neighborhood spaces are more general topological spaces than equivalence spaces and introduced the neighborhood relation into rough set methodology. Hu et al. [15] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model. They then used the model to build a uniform theoretic framework for neighborhood-based classifiers. The neighborhood-based rough set overcomes the inability of classic rough set theory to deal with continuous data.

In this paper, we employ the hypergraph model and rough set theory to build a neighborhood hypergraph model. We then propose a classification algorithm for incomplete information systems based on the neighborhood hypergraph. The algorithm is composed of the following three steps. (1) Initialize the hyperedge set: generate hyperedges for every sample in the training set, treating samples with missing values separately. (2) Classify the training set: classify the training set with the hyperedge set and decide whether to replace hyperedges according to the classification accuracy. (3) Replace hyperedges: under the guidance of rough set theory, replace the unsuitable hyperedges. In comparison with existing methods implemented on the WEKA platform, the experimental results show that the proposed classification algorithm performs better than the other algorithms.

The remainder of the paper is organized as follows. The basic concepts on hypergraph and neighborhood hypergraph models are shown in Section 2. The neighborhood hypergraph classifier algorithm for incomplete information system is developed in Section 3. Section 4 presents the experimental analysis. Finally, the paper is concluded in Section 5.

2. Hypergraph and Neighborhood Hypergraph Model

2.1. The Definition of Hypergraph

In 1970, Berge and Minieka [16] used the hypergraph to define the hypernetwork. This was the first systematic treatment of undirected hypergraph theory, and it was applied to operations research through matroids.

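Definition 1 (hypergraph [16]). Given a finite set $V = \{v_1, v_2, \ldots, v_n\}$ and a family $E = \{e_1, e_2, \ldots, e_m\}$ of subsets of $V$, if (1) $e_i \neq \emptyset$ $(i = 1, 2, \ldots, m)$ and (2) $\bigcup_{i=1}^{m} e_i = V$, then the binary relation $H = (V, E)$ is defined as a hypergraph. The elements of $V$ are defined as the vertices of the hypergraph and $E$ is defined as the edge set of the hypergraph. Each $e_i$ $(i = 1, 2, \ldots, m)$ is defined as a hyperedge (see Figure 1).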

2.2. Neighborhood Hypergraph Based on IIS

Definition 2 (the relevant degree between samples [17]). Let and be two samples in an incomplete system; each sample has attributes; and have the same attribute value in attributes, different value in attributes, and uncertain value in attributes (at least one of them has missing value at the attribute). We define as the homogeneous degree of and , as the antagonisms degree of and , and as the discrepant degree of and . We employ to represent the relevant degree of and ; namely,where , , , denotes the relevant degree of and .
However, the relevant degree defined in [17] can only be computed for discrete features; it is not applicable to continuous features. Here, we consider two continuous attribute values to be equal if they differ by no more than a certain range.
For instance, , is an attribute of , and denotes the value on attribute of sample ; , , , , and . So for the attribute , if only , is equal to on attribute . Obviously, we can find that , , and are equivalent on attribute , while they are not equal to on attribute .
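The counting step behind Definition 2, together with the tolerance rule for continuous attributes, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `compare_samples`, `degree_ratios`, and the tolerance parameter `delta` are hypothetical, missing values are represented by `None`, and each degree is assumed to be the corresponding count divided by the number of attributes. The formula that combines the three degrees into the relevant degree is not reproduced here.

```python
from typing import Optional, Sequence, Tuple, Union

Value = Optional[Union[float, str]]  # None marks a missing value (assumption)


def compare_samples(
    x: Sequence[Value],
    y: Sequence[Value],
    continuous: Sequence[bool],
    delta: float = 0.1,
) -> Tuple[int, int, int]:
    """Count attributes on which x and y are the same, different, or uncertain."""
    same = different = uncertain = 0
    for xv, yv, is_continuous in zip(x, y, continuous):
        if xv is None or yv is None:
            # At least one of the two values is missing.
            uncertain += 1
        elif is_continuous:
            # Continuous values count as equal when they differ by at most delta.
            if abs(xv - yv) <= delta:
                same += 1
            else:
                different += 1
        elif xv == yv:
            same += 1
        else:
            different += 1
    return same, different, uncertain


def degree_ratios(
    x: Sequence[Value],
    y: Sequence[Value],
    continuous: Sequence[bool],
    delta: float = 0.1,
) -> Tuple[float, float, float]:
    """Return (homogeneous, antagonistic, discrepant) degrees as count / N (assumed form)."""
    n = len(x)
    same, different, uncertain = compare_samples(x, y, continuous, delta)
    return same / n, different / n, uncertain / n
```

For two samples with one continuous and two discrete attributes, `compare_samples([1.0, "a", None], [1.05, "b", "c"], [True, False, False])` counts one equal attribute, one different attribute, and one uncertain attribute.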

Definition 3 (the neighborhood of a sample in IIS). Lin [14] introduced the neighborhood model in 1988. Hu et al. [15] discussed the properties of neighborhood approximation spaces and proposed the neighborhood-based rough set model. Hu et al. [18] employed the neighborhood rough set to classify data, using the Minkowski distance, which involves all the attribute values of a sample, to calculate the neighborhood threshold of the sample. However, for an incomplete information system, it is difficult to compute this distance because of the missing values. Thus, we present an extended neighborhood of a sample in an incomplete information system by combining it with the relevant degree.
Given arbitrary and , the neighborhood of on attribute set is defined aswhere is the relevant degree between and . denotes the sample set within the neighborhood of .
According to the definition, we can find out easily that(1);(2);(3).

Combining with the neighborhood rough set theory, we define the neighborhood hypergraph as follows.

Definition 4 (the neighborhood hypergraph of IIS). Let be a neighborhood hypergraph of incomplete information system, referred to as neighborhood hypergraph. Then is the vertex set of , indicating that it has vertices, is hyperedge set, and each in is a hyperedge which connects vertices (). is the attribute set, and is the decision set, where denotes sample.

In some of the literature, for example [19], the vertices of a hypergraph represent the attributes of samples. In this paper, however, the vertices of the hypergraph represent samples, and different samples on one hyperedge have the same attribute set (see Figure 2).

Definition 5 (the sample in neighborhood hypergraph). Given , denotes the attribute set of hyperedge, is a sample, where denotes the values of on the attribute , denotes the decisions of , and is the threshold of a neighborhood.

Definition 6 (the neighborhood hyperedge set of a sample). Given and the attribute set , the hyperedge set which is included by sample is defined as

Definition 7 (the sample set related to a hyperedge). Given , , and attributes set , for arbitrary , the sample set related to is defined as . Given arbitrary and attributes set , the sample set related to is defined as

Definition 8 (the confidence degree of a hyperedge). Given , for arbitrary , assume that , where denotes decision set. is the sample set related to hyperedge on attributes set . According to the decisions , is divided into equivalence classes: ; when , the confidence degree of is defined as follows:

Definition 9 (the upper approximation, lower approximation, boundary region, and negative domains of hyperedge set for the sample decision set). Given , is the attribute set of samples and is the decision set of samples. For arbitrary hyperedge set , according to the decisions , the hyperedge set is divided into equivalence classes: . For arbitrary , the upper approximation, lower approximation, boundary region, and negative domains of decision related to set of attributes are, respectively, defined as

The lower approximation of the decisions related to an attribute set is also called the positive domain. The size of the positive domain reflects how separable the classification problem is in a given attribute space: the bigger the positive domain is, the smaller the boundary region is.

To explain how to divide the upper approximation, lower approximation, and boundary region, here we give an example (see Example 10).

Example 10. Given , where , , (see Figure 3).

First, we calculate the sample set of each hyperedge: , , and , according to formula (4) and Definition 7. Second, we calculate the confidence degree of each hyperedge in terms of formula (5) and Definition 8: , , and . Third, according to formula (6) and Definition 9, we can obtain the upper approximation, the lower approximation, and the boundary region of , respectively: , , and .

3. A Classification Algorithm Based on Neighborhood Hypergraph for IIS

Classification with the hypergraph model is generally divided into three steps. (1) Step 1: initialize the hypergraph. (2) Step 2: classify the training set. (3) Step 3: replace hyperedges, that is, improve the classification accuracy by replacing the unsuitable hyperedges. Steps 2 and 3 are iterated until the accuracy reaches a specific threshold. In the traditional hypergraph classification method, hyperedges are generated randomly, which makes it difficult to find and replace poorly performing hyperedges in the hyperedge replacement phase. Therefore, we introduce rough set theory to separate the hyperedge set into three parts: the upper approximation, the lower approximation, and the boundary region. The hyperedges in the boundary region are handled specially, which improves the classification accuracy. The flow chart of the algorithm is shown in Figure 4.

3.1. Initialize Hypergraph

Initializing the hypergraph means generating hyperedges from samples. For each new hyperedge, we need to initialize both the condition attributes and the decision attribute.

In incomplete information system, there are two kinds of samples. One does not have missing values, while the other has missing values. We may call them type 1 and type 2, respectively. For type 1, we create three hyperedges for each sample while we create five new hyperedges for type 2. Furthermore, we need to fill the missing value in the new hyperedge. The specific process of creating new hyperedge is as follows.

(1) Inherit the Condition Attributes Randomly. A hyperedge has the same number of attributes as the sample, and its attribute values are inherited from the sample. Given a sample, we randomly select 70% of the attributes of the hyperedge and inherit their values from the sample. The remaining attributes of the hyperedge are processed as follows. If the attribute is continuous, the hyperedge takes the average of the values of this attribute over all samples. If the attribute is discrete, we generate the value randomly from the domain of the attribute.

For samples with missing values, we need to fill in a value in the new hyperedge whenever the value to be inherited from the sample is missing. If the attribute is continuous, we fill in the average value over all samples whose decision value is equal to that of the hyperedge. Otherwise, we fill in the most frequent value of the attribute.

(2) Inherit Decision Attribute Directly. The decision attribute of the new hyperedge is inherited directly from the sample which creates the hyperedge.
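The initialization rules above can be sketched in code as follows. This is an illustrative reading of the text rather than the authors' Java implementation: the function names, the `None` encoding of missing values, the seeded random generator, and the choice to take the most frequent discrete value over all samples are assumptions. The sketch also assumes every attribute has at least one observed value.

```python
import random
from collections import Counter
from statistics import mean


def make_hyperedge(sample, decision, data, decisions, continuous, rng, keep_ratio=0.7):
    """Build one (values, decision) hyperedge from a sample; None marks a missing value."""
    n = len(sample)
    # Randomly pick the attributes whose values are inherited from the sample.
    kept = set(rng.sample(range(n), round(keep_ratio * n)))
    values = []
    for a in range(n):
        observed = [row[a] for row in data if row[a] is not None]
        if a in kept and sample[a] is not None:
            values.append(sample[a])
        elif a in kept and continuous[a]:
            # Inherited value is missing: use the mean over samples with the same decision.
            same_class = [row[a] for row, d in zip(data, decisions)
                          if d == decision and row[a] is not None]
            values.append(mean(same_class or observed))
        elif a in kept:
            # Inherited discrete value is missing: use the most frequent observed value.
            values.append(Counter(observed).most_common(1)[0][0])
        elif continuous[a]:
            # Non-inherited continuous attribute: mean over all samples.
            values.append(mean(observed))
        else:
            # Non-inherited discrete attribute: random value from the attribute domain.
            values.append(rng.choice(sorted(set(observed))))
    # The decision attribute is inherited directly from the generating sample.
    return values, decision


def initialize_hyperedges(data, decisions, continuous, seed=0):
    """Create three hyperedges per complete sample and five per incomplete sample."""
    rng = random.Random(seed)
    hyperedges = []
    for sample, decision in zip(data, decisions):
        copies = 5 if any(v is None for v in sample) else 3
        for _ in range(copies):
            hyperedges.append(
                make_hyperedge(sample, decision, data, decisions, continuous, rng))
    return hyperedges
```

Passing `keep_ratio=1.0` makes the behaviour deterministic, which is convenient for checking the filling rules in isolation.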

3.2. Classify the Training Set

Given a sample , the neighborhood threshold of is defined as follows: where , is a fixed value.

For each sample in the training set, its classification can be determined by the voting result among its neighborhood hyperedge set. In order to get better hyperedge, we can compute the accuracy by analyzing the classification result of the training set. If the accuracy is higher than the threshold, we can output the hyperedge set. Otherwise, the hyperedge should be replaced.

Definition 11 (the decision subset of ). Given , is the neighborhood hyperedge set of sample ; the set whose decision is in is defined as follows:
Given a sample , its classification process is as follows:
(1) Compute the neighborhood hyperedge set of sample : .
(2) Classify the hyperedges in the set . For instance, add the hyperedge of classification into .
(3) Determine the classification of , , where is the decision attribute value.

Example 12. As it is shown in Figure 5, given a sample and hyperedges , we have the following.

According to Definition 6, we can obtain .

According to Definition 11, we can obtain , .

So, and the classification of sample is 1.

We use the classification rules above to classify the training set. Once the accuracy is no less than 0.95 or the number of iterations reaches 10, we output the hypergraph. Otherwise, we replace the unsuitable hyperedges.
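The voting rule used to classify a sample can be sketched as below. It is a simplified illustration under stated assumptions: `relevance` is a hypothetical callable standing in for the relevant degree, a hyperedge is taken to belong to the neighborhood hyperedge set when its relevance to the sample is at least the sample's threshold, and ties are broken in favour of the decision encountered first.

```python
from collections import Counter


def classify(sample, hyperedges, relevance, threshold):
    """Return the majority decision among hyperedges in the sample's neighborhood."""
    votes = Counter(
        decision
        for values, decision in hyperedges
        if relevance(sample, values) >= threshold  # neighborhood membership (assumed form)
    )
    if not votes:
        return None  # no hyperedge falls inside the neighborhood
    return votes.most_common(1)[0][0]


def matching_fraction(x, y):
    """Toy relevance measure: fraction of attributes with equal values."""
    return sum(a == b for a, b in zip(x, y)) / len(x)
```

With the toy `matching_fraction` measure, a sample `("a", "b", "c")` and a threshold of 0.6 admit only hyperedges that agree on at least two of the three attributes, and the decision held by most of those hyperedges is returned.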

3.3. Replace Hyperedges

During hyperedge initialization, we fill in missing values and replace some attribute values in each hyperedge. As a result, some hyperedges are not suitable for sample classification. To obtain better performance, we replace these poor hyperedges with newly generated ones; this is hyperedge replacement.

Rough set theory divides the hyperedge set into the upper approximation, lower approximation, boundary region, and negative region. The confidence degree of hyperedges in the lower approximation is 1. The confidence degree of hyperedges in the boundary region is between 0 and 1. Hyperedges whose confidence degree is 0 belong to the negative region. Hyperedges in the lower approximation are all retained because they are very helpful for classification. In contrast, hyperedges in the negative region work against classification, so they are replaced. Hyperedges in the boundary region are handled with a threshold: when the confidence degree of a hyperedge is less than the threshold, it is replaced. In this way, hyperedge replacement becomes more targeted and effective. When replacing hyperedges, priority is given to those generated from samples with missing values.
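The retention rule described above can be sketched as follows. The sketch is hedged: formula (5) is not reproduced in this text, so `confidence` assumes the confidence degree is the fraction of related samples that share the decision of the hyperedge, and the names `confidence`, `split_hyperedges`, and `from_incomplete` are hypothetical. The case of a hyperedge with no related samples (Case 1 in Section 3.3.1) is left to the caller.

```python
def confidence(edge_decision, related_decisions):
    """Assumed form of the confidence degree: share of related samples with the same decision."""
    related = list(related_decisions)
    if not related:
        raise ValueError("no related samples; handle this case separately")
    return sum(d == edge_decision for d in related) / len(related)


def split_hyperedges(confidences, threshold, from_incomplete=frozenset()):
    """Split hyperedge ids into (keep, replace) lists by confidence degree."""
    keep, replace = [], []
    for edge_id, c in confidences.items():
        if c >= 1.0:
            keep.append(edge_id)        # lower approximation: always retained
        elif c <= 0.0:
            replace.append(edge_id)     # negative region: always replaced
        elif c >= threshold:
            keep.append(edge_id)        # boundary region, confident enough
        else:
            replace.append(edge_id)     # boundary region, below the threshold
    # Hyperedges generated from samples with missing values are replaced first.
    replace.sort(key=lambda edge_id: edge_id not in from_incomplete)
    return keep, replace
```

Because Python's sort is stable, hyperedges with the same priority keep their original relative order in the replacement list.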

3.3.1. Search the Unsuitable Hyperedges

Consider the following:
(1) Set the confidence degree threshold for hyperedges: .
(2) Find the hyperedges in the hyperedge set whose confidence degree is below the threshold.

Given a hyperedge , according to Definitions 7 and 8, we can calculate the confidence degree of hyperedge by the following three cases.

Case 1. If , then no sample is related to the hyperedge (see Figure 6).
In this situation, we find the five samples whose relevant degree with is the highest (not all five samples are shown in Figure 6) and we assume that is related to these five samples. We then calculate the confidence degree according to formula (5). If , keep ; otherwise replace .

Case 2. If , we note that we can keep hyperedge .

Case 3. If , we note that hyperedge e needs to be replaced.

Now, we present an example to illustrate the process (see Example 13).

Example 13. Given , where attribute set , , (see Figure 7).

According to the formulas (3) and (4), we know and . After that, we calculate the confidence degree in terms of formula (5); and .

Hyperedges of this kind share their classification with only a minority of the samples surrounding them, which harms classification. Thus, they should be replaced.

3.3.2. Replace the Unsuitable Hyperedges

Take one hyperedge out of the hyperedge replacement set and find the sample that generated it. Generate a new hyperedge using this sample and replace with .

It is worth mentioning that, when replacing hyperedges, those created from samples with missing values are replaced first.

3.4. Algorithm Description

See Algorithm 1.

Input: Training set
Output: Hyper-edge set
Step 1. (Initialize hypergraph)
FOR each in X DO
 Create one hyper-edge of sample : First, Inherit attributes from the sample randomly
 and replace the values randomly on the rest attribute. Second, inherits the decision
 attribute of . Third, if has missing value, we fill the attribute value in terms with
 continuous attributes or discrete attribute.
;
END FOR
Step 2. (Classify the training set)
According to formula (7), calculating the neighborhood threshold for each sample;
FOR each in X DO
 FOR each in E DO
  According to formula (1), calculate the relevant degree of and , .
  IF THEN ; END IF
 END FOR
 FOR each in DO
  IF THEN
  ; // is the decision attribute value
  END IF
 END FOR
 Compute the classification of , .
END FOR
Compute the correctly classified ratio of the training set: accuracy;
IF accuracy ≥ 0.95 or iterations ≥ 10 THEN GOTO Step 4;
ELSE GOTO Step 3;
END IF
Step 3. (Replace hyper-edge)
; //the initialize of hyper-edge replacement set
FOR each in DO
According to Definition 7 and formula (5), calculate the confidence degree of : .
IF THEN ; ; END IF
END FOR
When replacing hyper-edges, first replace those generated by
samples with missing values.
WHILE ()
Generate a new hyper-edge through the process similar to Step 1.
; ;
END WHILE
GOTO Step 2;
Step 4. (Return)
RETURN E;
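The control flow of Algorithm 1 (classify, check the stopping condition, replace, repeat) can be sketched as a loop. This is only a skeleton: `evaluate` and `replace_poor` are hypothetical callables standing in for Step 2 and Step 3, and the default stopping values follow the accuracy of 0.95 and the limit of 10 iterations stated in Section 3.2.

```python
def train(hyperedges, evaluate, replace_poor, target_accuracy=0.95, max_iterations=10):
    """Iterate Step 2 (evaluate) and Step 3 (replace) until a stopping condition holds."""
    iterations = 0
    while True:
        accuracy = evaluate(hyperedges)           # Step 2: classify the training set
        iterations += 1
        if accuracy >= target_accuracy or iterations >= max_iterations:
            return hyperedges, accuracy, iterations  # Step 4: return the hyperedge set
        hyperedges = replace_poor(hyperedges)     # Step 3: replace unsuitable hyperedges
```

Keeping the two steps as injected functions lets the loop be checked independently of how the relevant degree or the confidence degree is computed.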

4. Experimental Analyses

4.1. Data Sets

Experimental analysis is conducted on 15 UCI [20] data sets. Four of them are incomplete data sets, namely, Mammographic, Credit Approval, Hungarian, and Postoperative; the other eleven are complete data sets. The complete data sets are modified to obtain incomplete data sets by randomly removing some attribute values. The missing degree ranges from 0.6% to 18.97%. The data sets are outlined in Table 1 (sorted by the size of the data set).

4.2. Experimental Evaluation

To evaluate the performance of the proposed classifier, we use accuracy, Precision, Recall, F-measure [21], and the area under the ROC curve.

We assume that $TP$ is the number of relevant samples that are predicted; $FP$ is the number of irrelevant samples that are predicted; $FN$ is the number of relevant samples that are not predicted; $TN$ is the number of irrelevant samples that are not predicted.

Precision or confidence denotes the proportion of predicted samples that are relevant samples:
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$

Recall or sensitivity denotes the proportion of relevant samples that are correctly predicted:
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$

We expect both the Precision and the Recall to be high. In practice the two often conflict, so it is difficult to make both high at the same time. We therefore use the F-measure to consider them together; the F-measure is the harmonic mean of Precision and Recall:
$$F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
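As a small worked illustration of these three measures, the following sketch computes them from confusion counts. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for the example; `tp`, `fp`, and `fn` denote the numbers of relevant predicted, irrelevant predicted, and relevant unpredicted samples.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, f_measure) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure
```

For example, with 8 relevant samples predicted, 2 irrelevant samples predicted, and 8 relevant samples missed, Precision is 0.8 and Recall is 0.5; the harmonic mean lies between them and closer to the smaller value.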

Another appropriate way to measure classification performance is the Receiver Operating Characteristic (ROC) [22] graph, in which the tradeoff between benefits and costs can be visualized. The area under the ROC curve (AUC) [23] corresponds to the probability of correctly identifying which of two stimuli is noise and which is signal plus noise. AUC provides a single-number summary of the performance of a learning algorithm. Most of the time, the value of AUC lies between 0.5 and 1.0: the higher, the better. An AUC value under 0.5 means that the classifier has no positive effect on classification and should be abandoned.
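The probabilistic reading of AUC can be made concrete with a pairwise count: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. This is a minimal sketch for binary labels, not the evaluation code used in the experiments; the function name and the 0/1 label encoding are assumptions.

```python
def pairwise_auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

The double loop is quadratic in the number of samples, which is acceptable for small data sets such as those used here; a rank-based formulation gives the same value more efficiently.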

4.3. Experimental Method and Results

To evaluate the performance of the proposed HyperGraph algorithm, we compared it with other algorithms from the related literature: C4.5, SVM, NaiveBayes (NBC), and NN [24]. C4.5, SVM, and NBC are all classic classification algorithms that perform well on the majority of data sets in both theory and practice. NN is a simple and classic algorithm; we chose it because its main idea, majority voting among nearby samples, is very similar to that of the proposed method. The source code of these algorithms is provided by the Weka software [25]. In addition, we compared against two rough set methods whose source code is implemented in RIDAS [26]. One of them handles the data sets directly by using rough set theory (labeled as “Rough (incomplete)” in Tables 2, 3, 4, 5, and 6) [27, 28]. The other fills in the missing values of the data sets before classification (labeled as “Rough (complete)” in Tables 2, 3, 4, 5, and 6) [29–31].

The proposed HyperGraph algorithm is implemented in Java. All the results are obtained from 10-fold cross validation [32].

Comparative results for accuracy, Precision, Recall, F-measure, and AUC for each algorithm are shown in Tables 2 to 7 and Figure 8.

We also compare the accuracy with and without missing values by using 22 data sets: 11 complete data sets and 11 incomplete data sets, each derived from one of the complete data sets by a random missing process (the missing degree is 5%). The final results are shown in Table 7.

To compare the performance of the 5 algorithms, the average value of each indicator for the 5 algorithms is shown in Figure 8.

From Tables 2 and 4 and Figure 8, we can see that HyperGraph has higher average accuracy, Recall, and F-measure. Furthermore, its average AUC value is superior to that of most of the algorithms, as shown in Tables 3 and 6 and Figure 8. Table 7 also indicates that the proposed method is suitable for incomplete information systems: it still performs well when the data set has missing values.

Tables 3 and 6 and Figure 8 show that NaiveBayes has a higher average Precision and AUC value than HyperGraph, because the NaiveBayes classifier classifies data by using the NBC model. In theory, when the properties of a sample are independent of each other, the NBC model has the minimum misclassification error rate compared with other classification methods [33]. The majority of the data sets we use in this paper have independent properties, which gives NaiveBayes a lower average misclassification error rate. Moreover, by their definitions, Precision and AUC generally decrease as the misclassification error rate increases. Thus, NaiveBayes has higher average Precision and average AUC than HyperGraph.

However, as indicated in Tables 3, 5, and 6, the Precision, F-measure, and AUC of the proposed classifier on the lenses data set are poorer than those of most of the other algorithms. By analyzing the distribution of the lenses data set, we find that most attribute values are extremely close between the two classes. Thus, for a given sample, its relevant degree with almost every hyperedge is high and exceeds the sample’s neighborhood threshold, so its neighborhood hyperedge set contains many hyperedges that have a negative effect on classification. Because the classification of the sample is the voting result of the hyperedges in that set, the error rate increases.

5. Conclusions

Processing an incomplete information system is rather complex. It is hard to do so with the traditional hypergraph, for two reasons: (1) the hyperedges are generated randomly; (2) an incomplete information system contains many missing values. In this paper, we propose a new algorithm based on the hypergraph and rough set theory to solve the problem of classifying incomplete data sets. Rough set theory is used to supervise the classification and replacement of hyperedges. The experimental results on 15 UCI data sets show that the proposed algorithm improves classification results in comparison with six other algorithms. However, the HyperGraph algorithm is time-consuming because it must calculate the relevant degree between each hyperedge and each sample. Reducing the running time of the algorithm is therefore our future work.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This project is supported by the National Natural Science Foundation of China (NSFC) under Grant nos. 61309014, 61379114, and 61472056 and Natural Science Foundation Project of CQ CSTC under Grant no. cstc2013jcyjA40063.