Abstract

This paper proposes a rough set model for incomplete information systems that defines an extended tolerance relation using the frequency of attribute values in such a system. It first reviews some rough set extensions for incomplete information systems. Next, a “probability of matching” is defined from the data in an information system and used to measure the degree of tolerance between objects. A rough set model is then developed using a tolerance relation defined with a threshold on that degree. The paper discusses the mathematical properties of the new model and introduces a method to derive reducts and the core.

1. Introduction

Rough set theory [1, 2] was first proposed by Pawlak as a means to analyze vague descriptions of items. The original rough set approach presupposes that all objects in an information system have precise attribute values. Problems arise when some of the values are unknown, which sometimes happens in the real world. Therefore, it is necessary to develop a theory which enables the classification of objects even if only partial information is available. The rough set model proposed by Kryszkiewicz [3, 4], for example, introduced indiscernibility based on a tolerance relation to deal with missing values in the information system. In this approach, a missing value is treated as a special value that may take any possible value.

However, the tolerance relation sometimes leads to poor approximations. Stefanowski and Tsoukiàs [5, 6] discussed this limitation and introduced a similarity relation to refine the results obtained with the tolerance relation approach. Wang [7] gave examples showing that the similarity relation may result in lost information and proposed the limited tolerance relation. Yang et al. [8] also proposed a reasonable and flexible classification in incomplete information systems by means of a “new binary relation.”

In fact, there is an array of methods to handle incomplete objects [9, 10]. Some approaches replace missing values with the most common value [11], while others consider “unknown” itself as a new value for the attribute and treat it in the same way as ordinary values [10]. The method of handling missing values should be chosen depending on the characteristics and requirements of the application. In general, approaches deal with unavailable values based on one of the following two interpretations [12]. The first is “lost value,” in which unknown values of attributes are already lost. The similarity relation [5] is one example of this semantics. The second is “do not care,” in which an unknown value may potentially be replaced by any value in the domain. Such incomplete decision tables have been broadly studied [3, 4]. Grzymala-Busse [13–17] built a characteristic relation based on both the “lost value” case and the “do not care” case.

In this paper, we study “probability of matching” and propose a new method of handling missing values in incomplete information systems based on tolerance degree. Our approach adopts the “lost value” interpretation. The approach is useful in knowledge acquisition from incomplete information systems, in which some object values appear frequently and others do not.

The paper is organized as follows. Section 2 discusses the tolerance relation for dealing with incomplete information and its drawback, and introduces some extensions that avoid the issues of the tolerance relation. Section 3 examines how the frequency of attribute values affects the probability of matching among objects on an attribute. Building on this, Section 4 proposes a tolerance relation called the extended tolerance relation and discusses the advantages of the approach. Section 5 introduces approximation spaces defined in three ways. Finally, Section 6 explains methods to derive reducts and the core.

2. Rough Set in Incomplete Tables

In this section, we discuss several rough set extensions in incomplete decision tables together with their issues. An information system is defined as a pair $S = (U, A)$, where $U$ is a nonempty finite set of objects called the universe and $A$ is a nonempty finite set of attributes. For every $a \in A$, there is a mapping $a: U \to V_a$ from $U$ into a space $V_a$, and $V_a$ is called the value set of $a$ [1, 2].

If   contains at least one object with an unknown (missing or null) value, then is called an incomplete information system, otherwise complete [3, 4]. In incomplete information systems, objects may contain several unknown attribute values, but we do not assume the case where all objects take the unknown value for an attribute. Unknown values are denoted by special symbol “*” in incomplete information systems and are supposed to be contained in the set.

A decision table defined by $DT = (U, A \cup \{d\})$ is an information system, where $d \notin A$ is a distinguished attribute called the decision [18]. In a similar manner to information systems, a decision table may be incomplete, otherwise complete. However, all decision values are known in both complete and incomplete decision tables.

In a complete decision table, the relation $IND(B)$, $B \subseteq A$, denotes a binary relation between objects that are equivalent in terms of the values of the attributes in $B$ [1]. The equivalence relation is reflexive, symmetric, and transitive. Let $[x]_B$ be the set of all objects that are equivalent to $x$ by $IND(B)$, and let it be called the equivalence class of $x$.

2.1. Tolerance Relation

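A tolerance relation $T_B$, $B \subseteq A$, denotes a binary relation between objects that are possibly equivalent in terms of the values of the attributes in $B$. In incomplete information systems [3, 4], the tolerance relation is defined by

$$T_B(x, y) \Leftrightarrow \forall a \in B,\ a(x) = a(y) \lor a(x) = * \lor a(y) = *, \tag{1}$$

where $\lor$ denotes disjunction.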

The relation is reflexive and symmetric but does not need to be transitive. Let $I_B(x) = \{y \in U \mid T_B(x, y)\}$ be the set of objects which are in relation with $x$ in terms of $B$ in the sense of the above tolerance relation. Due to the symmetric property, $x$ is also tolerant to every element of $I_B(x)$.
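The relation just described can be sketched in a few lines of Python. This is a minimal illustration, assuming the Kryszkiewicz-style rule stated above: two objects are related when, on every attribute considered, their values agree or at least one of them is missing. The table `TABLE`, the object names, and the helper names `tolerant` and `tolerance_class` are hypothetical and are not taken from Table 1.

```python
MISSING = "*"

# Hypothetical toy table (not Table 1 of the paper): object -> attribute values.
TABLE = {
    "x1": {"temperature": "high", "headache": "yes"},
    "x2": {"temperature": MISSING, "headache": "yes"},
    "x3": {"temperature": "normal", "headache": MISSING},
}
ATTRS = ["temperature", "headache"]


def tolerant(table, x, y, attrs):
    """True if, on every attribute, x and y agree or one value is missing."""
    return all(
        table[x][a] == table[y][a]
        or table[x][a] == MISSING
        or table[y][a] == MISSING
        for a in attrs
    )


def tolerance_class(table, x, attrs):
    """Set of all objects that are tolerant to x on the given attributes."""
    return {y for y in table if tolerant(table, x, y, attrs)}


for obj in TABLE:
    print(obj, sorted(tolerance_class(TABLE, obj, ATTRS)))
```

On this toy table, x1 is tolerant to x2 and x2 is tolerant to x3, but x1 is not tolerant to x3, which illustrates why the relation need not be transitive.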

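Rough sets based on the tolerance relation in incomplete information systems are defined in a similar way to those in complete information systems [1]. Let $X \subseteq U$ and $B \subseteq A$. Then the lower approximation [3, 4] $\underline{B}(X)$ of $X$ in terms of $B$ and the upper approximation $\overline{B}(X)$ of $X$ in terms of $B$ are given by

$$x \in \underline{B}(X) \Leftrightarrow I_B(x) \subseteq X, \qquad x \in \overline{B}(X) \Leftrightarrow I_B(x) \cap X \neq \emptyset,$$

where $I_B(x)$ is the set of objects tolerant to $x$ in terms of $B$.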
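As a rough illustration of these approximations, the sketch below assumes the usual tolerance-based reading: an object belongs to the lower approximation when its whole tolerance class lies inside the target set, and to the upper approximation when its tolerance class meets the target set. The dictionary `CLASSES` and the target set `X` are hypothetical toy data, and the function names are chosen only for this example.

```python
# Hypothetical tolerance classes (object -> set of objects tolerant to it).
CLASSES = {
    "x1": {"x1", "x2"},
    "x2": {"x1", "x2", "x3"},
    "x3": {"x2", "x3"},
}

# Hypothetical target concept.
X = {"x1", "x2"}


def lower_approximation(classes, target):
    """Objects whose tolerance class is entirely contained in the target."""
    return {x for x, cls in classes.items() if cls <= target}


def upper_approximation(classes, target):
    """Objects whose tolerance class shares at least one element with the target."""
    return {x for x, cls in classes.items() if cls & target}


print(sorted(lower_approximation(CLASSES, X)))
print(sorted(upper_approximation(CLASSES, X)))
```

Here x2 belongs to the target set but not to its lower approximation, because x3 is tolerant to it; this is the kind of loss of precision discussed in the text.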

Now, we illustrate the above concepts with an incomplete decision table from [10]. The decision table is shown in Table 1.

From Table 1, we can induce approximation space for -group of people such that the value of flu is no based on all condition attributes:

The approximations are quite poor. Moreover, there exist objects which intuitively could be classified in , while they are not in the lower approximation. Take, for instance, object; we have its complete description, and intuitively there is no other object perceived as very tolerant to it. However, it is not included into the lower approximation of . This is due to missing attribute values of objects  ,  which is actually tolerant toaccording to Equation (1).

2.2. Similarity Relation

In the approach proposed by Stefanowski and Tsoukiàs [5, 6], an object $x$ is considered similar to another object $y$ only if all known attribute values of $x$ are the same as those of $y$. Such a relation is not symmetric in general: if one object has a more complete description than the other, the inverse relation does not hold. More formally, given an information system $(U, A)$ and an attribute set $B \subseteq A$, the similarity relation $S_B$ is defined as follows:

$$S_B(x, y) \Leftrightarrow \forall a \in B,\ a(x) = * \lor a(x) = a(y). \tag{5}$$

It is easy to observe that this relation is reflexive and transitive although not necessarily symmetric. Now for each object, we can induce two similarity sets:, the set of objects similar to (note that the arguments of is not ), and, the set of objects to which is similar.

Clearly, and are two different sets. We can now introduce the definitions of approximation space of a set as follows:

By the definition of similarity relation and tolerance relation introduced in this section, we can see that the conditions for which similarity relation holds are a subset of the conditions for which tolerance relation holds (we can see that if then ). Hence, tolerance classes of elements in shall be “wider” than the respective similarity classes [5, 6].
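The asymmetry of the similarity relation and its two similarity sets can be illustrated with a short sketch. It assumes the rule described in this section (an object is similar to another when every one of its known values matches the other object's value); the toy table and the names `similar`, `similar_to`, and `similar_from` are hypothetical and not taken from Table 1.

```python
MISSING = "*"

# Hypothetical toy table (not Table 1 of the paper).
TABLE = {
    "x1": {"temperature": "high", "headache": "yes"},
    "x2": {"temperature": MISSING, "headache": "yes"},
    "x3": {"temperature": "normal", "headache": MISSING},
}
ATTRS = ["temperature", "headache"]


def similar(table, x, y, attrs):
    """True if every known value of x equals the corresponding value of y."""
    return all(
        table[x][a] == MISSING or table[x][a] == table[y][a] for a in attrs
    )


def similar_to(table, x, attrs):
    """Set of objects that are similar to x."""
    return {y for y in table if similar(table, y, x, attrs)}


def similar_from(table, x, attrs):
    """Set of objects to which x is similar."""
    return {y for y in table if similar(table, x, y, attrs)}


print(sorted(similar_to(TABLE, "x1", ATTRS)))
print(sorted(similar_from(TABLE, "x1", ATTRS)))
```

In this toy table, x2 is similar to x1 because its only known value matches, while x1 is not similar to x2 because the known temperature of x1 has no counterpart in x2.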

2.3. Limited Tolerance Relation

Let us compare the attributes of with those of in Table 1. According to our intuition, seems similar to because they have the same description in temperature and nausea. However, it is actually not, though is similar to according to Equation (5). In a large system, two objects may be considered distinct in terms of the similarity relation because of a small amount of missing information. For example, objects with and with , where the vectors are abbreviated representations of the attribute values of the objects, are tolerant according to Equation (1) and intuitively similar to each other. However, they do not satisfy the nonsymmetric similarity relation. To avoid such problems, Wang [7] developed a novel limited tolerance (LT) relation.

Let; limited tolerance relation is defined on as follows:

In the formula, the condition that is equivalent to. Thus, the two objects that satisfy but not are only those satisfying .

Generally speaking, two objects are in the limited tolerance relation in one of two cases. The first is that all attribute values of both objects are missing. The second is that at least one attribute has an ordinary (known) value for both objects, and the two objects have the same value on every such attribute. Obviously, the limited tolerance relation is reflexive and symmetric but not necessarily transitive.
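The two cases can be written down directly. The following is a minimal sketch of the verbal description above, not a transcription of Wang's formula; the toy table and the helper names `known` and `limited_tolerant` are hypothetical.

```python
MISSING = "*"

# Hypothetical toy table (not Table 1 of the paper).
TABLE = {
    "x1": {"temperature": "high", "headache": "yes", "nausea": "no"},
    "x2": {"temperature": MISSING, "headache": "yes", "nausea": MISSING},
    "x3": {"temperature": "high", "headache": MISSING, "nausea": MISSING},
    "x4": {"temperature": MISSING, "headache": MISSING, "nausea": "yes"},
    "x5": {"temperature": MISSING, "headache": MISSING, "nausea": MISSING},
    "x6": {"temperature": MISSING, "headache": MISSING, "nausea": MISSING},
}
ATTRS = ["temperature", "headache", "nausea"]


def known(table, x, attrs):
    """Attributes on which x has an ordinary (non-missing) value."""
    return {a for a in attrs if table[x][a] != MISSING}


def limited_tolerant(table, x, y, attrs):
    known_x, known_y = known(table, x, attrs), known(table, y, attrs)
    # Case 1: every attribute value of both objects is missing.
    if not known_x and not known_y:
        return True
    # Case 2: at least one attribute is known in both objects,
    # and the objects agree on every such attribute.
    common = known_x & known_y
    return bool(common) and all(table[x][a] == table[y][a] for a in common)


print(limited_tolerant(TABLE, "x1", "x2", ATTRS))
print(limited_tolerant(TABLE, "x2", "x4", ATTRS))
```

Objects x2 and x4 have no attribute that is known in both, so they are not in the limited tolerance relation even though the plain tolerance relation would relate them.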

Thus, limited tolerance class is denoted by

Based on that, approximation space is defined as follows:

Wang [7] also proved that the tolerance relation and the similarity relation are the two extremes for extending the indiscernibility relation, and that the limited tolerance relation lies between the tolerance and similarity relations.

3. Probability of Matching

“The most common attribute value of an attribute” is a method of handling missing values summarized by Grzymala-Busse [9, 10]. In this method, missing values are replaced by the most common value of the attribute. In other words, a missing attribute value is replaced by the most probable known attribute value, where such probabilities are represented by the frequencies of the corresponding attribute values. This method is implemented, for example, in the well-known machine learning algorithm CN2 [11]. Grzymala-Busse illustrated the method [10] with the example from Table 1. For case , the value of headache is replaced by yes, since in Table 1 the attribute headache has the value yes four times and the value no twice. Similarly, for case , the value of temperature is high, since the attribute temperature has the value very high once, normal twice, and high three times.
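The replacement rule can be sketched as follows. The column below reproduces only the value counts quoted above for temperature (very high once, normal twice, high three times) plus two missing entries; the order of the entries and the function name `impute_most_common` are assumptions made for this example.

```python
from collections import Counter

MISSING = "*"


def impute_most_common(column, missing=MISSING):
    """Replace each missing entry by the most frequent known value."""
    known = [value for value in column if value != missing]
    # Assumes at least one known value, as the paper does for every attribute.
    most_common_value = Counter(known).most_common(1)[0][0]
    return [most_common_value if value == missing else value for value in column]


# Same counts as quoted for temperature; the ordering is hypothetical.
temperature = ["high", "very high", MISSING, "high", "high", "normal", "normal", MISSING]

print(impute_most_common(temperature))
```

With these counts, both missing entries are replaced by high, in line with the example attributed to Grzymala-Busse.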

Using this notion, and supposing that the value domains are known, we first define the minimum probability that each value of an attribute appears, based on its frequency in the dataset for each concept. Then, the minimum probability that two objects have the same value is defined in order to propose a tolerance relation.

The probability that a value appears as a value of a certain object is between and , where and are sets of objects whose value of attribute “” is “” and “,” respectively. If , that is, the attribute value of an object is not missing, the probability that appears is between and .

Let us define probabilities and, which are the minimum probabilities that a value of attribute “” is “,” is an object whose value of attribute “” is “,” and the minimum probability that an attribute value appears, respectively. The minimum probabilities are given as follows:

The minimum probabilities take a value in in general, but they are greater than zero if   and and less than one when there is at least a missing value for “” in .

The minimum probabilities of attribute values are illustrated in Table 2. From this table, we can see that in the information system in Table 1, the value high of temperature occurs more frequently than the other values. The most frequent values of headache and nausea both happen to be “yes.”

Now, we define the probability of matching between objects and on an attribute if one of their attribute values is missing.

Definition 1. Let be an incomplete information system. Given that and, if the value of either or is missing on “,” probability of matching between and on “” denoted by is defined as the minimum estimation of probability that and take the same value on “” and is given by the following equation:

when . Otherwise, .

If one of the two objects has a certain value, ,  for example, the least probability value that appears in attribute of is assuming that the other objects with missing values on take another value . If both of them are missing, we take the sum of joint probability on all values in attribute domain within the same explanation.

It should be noted that and in the case because , and that at least for a value because we do not assume the case where all objects take the unknown value for an attribute. They are also less than 1.0, because or/and takes the missing value. Thus, in the case of is guaranteed.

Take the attribute = temperature, for example, in Table 1. The minimum probability that value is the same as is , and the minimum probability that the value is similar to is .

4. Extended Tolerance Relation

To determine whether two objects are tolerant or not, we introduce the concept of a tolerance degree between two objects, which combines two relation indexes. One takes a binary value representing a binary equivalence relation defined by the attributes whose values are known in both objects. The other is an index defined by the attributes whose value is missing in either of the objects. It is obtained from the probability of matching, assuming that the probabilities of matching on different attributes are independent of each other.

Limited tolerance relation was defined basically using attributes whose values are available in both and . We define a binary function that represents that LT relation can hold between the objects in the case of and utilize it.

Definition 2. Let be an incomplete information system and attribute set and . Let; the equivalence existence is defined by the following function: is assumed in any case.

It is clear that objects have LT relation if  , but does not hold necessarily even if have LT relation; for example, in the case where for all   for different and .

Now, we define a tolerance degree between and by combining the equivalence existence with probability of matching defined in the previous section.

Definition 3. Let be an incomplete information system and attribute set and . The parameterized tolerance degree of and in terms of is defined as follows:

where is a parameter taking a value in (0, 0.5]. If , is assumed. Thus, in the case. Obviously, and . Then, the reason why shall be explained below.

In (14), when , that is, when for all , it is satisfied that holds only in two cases; one is the case where , and the other is the case where and , .

When , there are two cases; one is a case where there is such that . In this case, . The other is the case where . In this case, , considering that for . Therefore, could be understood as a value that separates the following cases:(a): and , have the same value for all attributes in ; (b):  ;(c): and    have a different value at least for an attribute in .

In order to separate the cases between (a) and (b), should satisfy . From those above, we have the constraint of . If , never takes a value between and . Hence, we define the tolerance degree by fixing , though never takes the value of 0.5 as known from the conditions of (a) and (b):

The tolerance degree with lets us differentiate the three cases discussed before by seeing whether the degree is greater/less than 0.5 or whether it is greater than/equal to zero. This feature might be useful, because the users can control conditions of the tolerance based on equivalence existence and probability of matching with just a threshold value. This process shall be discussed in the next step.

Table 3 shows the tolerance degree among objects in terms of all attributes.

In fact, we can choose another probability of matching on an attribute for (14) and (15). For example, instead of using defined in (12), we can choose [5, 6]. The choice might depend on probability distribution of attribute values in each system.

The probabilistic terms in our tolerance degree look similar to those used by Stefanowski [6]. However, our approach uses probabilistic terms as pieces of evidence to derive tolerance relations, and these terms are combined with equivalence existence to define the relation. In the probabilistic approach proposed in [6], on the other hand, the authors assume a priori that there exists a uniform probability distribution on every attribute domain and compute tolerance classes based on the joint probability distribution. Their aim seems to be to define approximation spaces applicable in many cases. Such tolerance classes could be used in some applications, but we believe not in most.

Now, we define extended tolerance relation by controlling tolerance degree with a threshold.

Definition 4. Given that incomplete information system and attribute set and given a threshold , the extended tolerance relation is defined as follows:

It is easy to observe that this relation is reflexive and symmetric but not necessarily transitive. In Table 3, if a threshold is given, is tolerant to based on this relation.

By changing the threshold, we are able to get the same results as those by the relations discussed in the previous sections. For example, in the case of tolerance relation, the set of objects tolerant to is in Table 1. From Table 3, we also get as the set of objects tolerant to using extended tolerance relation with . Similarly, we have the same result as limited tolerance relation: , if .

Now, we can formalize these connections by the following propositions.

Proposition 5. Let be an incomplete information system. Given that and , if then .

This proposition shows that with , extended tolerance relation can get the same results as tolerance relation.

Proof. When is obtained as

Proposition 6. Let be an incomplete information system. Given that and , if , then for any , and except the case such that.

This proposition shows that with , extended tolerance relation is an expansion of limited tolerance relation.

Proof. When , is obtained as
Then, it is evident that except the case where .

Proposition 7. Let be an incomplete information system. Given that and , if   , then if .

This proposition shows that with , extended tolerance relation is able to get the same results as equivalence relation.

Proof. Consider that
As discussed before, holds only in the two cases where and where and for all,, which is equivalent to .

It should be noted that similarity/tolerance relations discussed in this paper are introduced to cope with incomplete information. However, we could also define those relations even in complete information tables. For example, the relation “subclass-of” is a similarity relation. It is clearly transitive, but not necessarily symmetric. We can also take the relation “friend-of” as an example of tolerance relation and examine its properties in the same way.

Definition 8. Let   be an incomplete information system. Given that   and , if , then    is more tolerant to    than    based on extended tolerance relation.

Property 1. Let be an incomplete information system. Given that and , if , and , then is more tolerant to than based on extended tolerance relation.

Proof. Since,, from (15) we have . Also, from (15) with , we have . Hence, . This is defined as    is more tolerant to    than    based on extended tolerance relation.

Now, with a relation, we can derive a neighbourhood, which consists of successor and predecessor sets, of an object [19, 20]. Due to symmetric property of extended tolerance relation, successor is the same set as predecessor. Hence, for this relation, we can introduce for any object a tolerant set:

In the example shown in Table 1, given the threshold,  , from Table 3, we have

Property 2. Let be an incomplete information system,  . Then, for all , if , then .

Proof. Consider the following:

Hence, the cardinality of the tolerance set of shall decrease if we increase the threshold to control the tolerance degree.
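Property 2 can be checked numerically on a small example. The sketch below assumes that two distinct objects are related when their tolerance degree is at least the threshold and that every object is tolerant to itself; the exact inequality is an assumption made for illustration, and the degree values in `DEGREE` are hypothetical, not those of Table 3.

```python
OBJECTS = ["x1", "x2", "x3"]

# Hypothetical symmetric tolerance degrees (not the values of Table 3).
DEGREE = {
    frozenset(("x1", "x2")): 0.8,
    frozenset(("x1", "x3")): 0.0,
    frozenset(("x2", "x3")): 0.3,
}


def degree(x, y):
    """Look up the (symmetric) tolerance degree of two distinct objects."""
    return DEGREE[frozenset((x, y))]


def tolerant_set(x, objects, threshold):
    """Objects related to x: x itself, plus objects whose degree reaches the threshold.

    The comparison `>=` is an assumption made for this sketch.
    """
    return {y for y in objects if y == x or degree(x, y) >= threshold}


low, high = 0.2, 0.5
for obj in OBJECTS:
    wide = tolerant_set(obj, OBJECTS, low)
    narrow = tolerant_set(obj, OBJECTS, high)
    # Raising the threshold can only remove objects from the tolerant set.
    print(obj, sorted(wide), sorted(narrow), narrow <= wide)
```

The monotonicity shown here does not depend on whether the comparison is strict, since any pair that passes the higher threshold also passes the lower one.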

5. Lower and Upper Approximations

For complete decision tables, lower and upper approximations are defined on the basis of the indiscernibility relation [1, 2]. They can also be defined in different ways, for example, using set elements or concepts represented by subsets. In the case of nonequivalence relations, which need not be reflexive, symmetric, or transitive, approximation spaces defined in such different ways may lead to different results [14]. This section introduces lower and upper approximation definitions based on the singleton, subset, and concept methods, which were first studied and generalized by Grzymala-Busse [14, 19].

5.1. Singleton Definition

Singleton lower approximation is

Singleton upper approximation is

In the example shown in Table 1, given the threshold and  , from Table 3, we have approximation space for the concept :

5.2. Subset Definition

Subset lower approximation is

Subset upper approximation is

5.3. Concept Definition

Concept lower approximation is

Concept upper approximation is

The difference between subset and concept definitions may be missed easily. In subset definition, extended tolerance classes of all elements in the universal set are examined, while only elements in are examined in the case of concept definition.

Obviously, singleton lower and upper approximations of are subsets of the subset lower and upper approximations of , respectively. The subset lower approximation is the same set as the concept lower approximation. The concept upper approximation, however, is a subset of the subset upper approximation.
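The three definitions and the inclusions just stated can be compared on a small example. The sketch follows the singleton, subset, and concept constructions as they are commonly stated in the work of Grzymala-Busse, applied to tolerant sets; the tolerant sets in `CLASSES` and the concept `X` are hypothetical, and the function names are chosen only for this illustration.

```python
# Hypothetical reflexive and symmetric tolerant sets (object -> tolerant set).
CLASSES = {
    "x1": {"x1", "x2"},
    "x2": {"x1", "x2", "x3"},
    "x3": {"x2", "x3", "x4"},
    "x4": {"x3", "x4"},
}
X = {"x1", "x2"}  # hypothetical concept


def union(sets):
    """Union of an iterable of sets (empty set if there are none)."""
    return set().union(*sets)


def singleton(classes, target):
    lower = {x for x in classes if classes[x] <= target}
    upper = {x for x in classes if classes[x] & target}
    return lower, upper


def subset(classes, target):
    # Tolerant sets of all objects in the universe are examined.
    lower = union(classes[x] for x in classes if classes[x] <= target)
    upper = union(classes[x] for x in classes if classes[x] & target)
    return lower, upper


def concept(classes, target):
    # Only tolerant sets of objects in the concept are examined.
    lower = union(classes[x] for x in target if classes[x] <= target)
    upper = union(classes[x] for x in target if classes[x] & target)
    return lower, upper


for name, method in [("singleton", singleton), ("subset", subset), ("concept", concept)]:
    lower, upper = method(CLASSES, X)
    print(name, sorted(lower), sorted(upper))
```

On this toy data the subset and concept lower approximations coincide, while the concept upper approximation is strictly smaller than the subset upper approximation.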

Rough set approximations could be generalized with some other approaches [20–22]. In fact, the above three definitions are classified as constructive rough set formulations by Yao [20], who divides rough set formulations into two groups: constructive and algebraic methods. The singleton definition is the same as the element-based definition suggested by Yao. The subset definition is an expansion of the concept definition and is the same as the granule-based definition in Yao's study. These definitions are special cases of the subsystem-based definition by Yao when the covering is the set of equivalence/similarity/tolerance classes.

5.4. Properties of Approximations

Approximation spaces defined based on extended tolerance relation have some properties suggested by Pawlak [1, 2] as well as other properties. We discuss them in detail below.

Property 3. Let be an incomplete information system, ,  and  . Table 4 shows which properties of the original rough set model are satisfied with singleton, subset, and concept definitions.

These properties can be proved for our approach in the same way as in the study by Grzymala-Busse and Rzasa [19] and in Pawlak's work [2]. Approximation spaces defined by these methods do not, in general, have properties 7a–7d. However, they are likely to satisfy the weaker versions of 7a–7d defined by Yao [21].

Besides this, our tolerance relation is controlled by the threshold of tolerance degree. Thus, new properties for the threshold can be introduced as shown below.

Property 4. Let be an incomplete information system, and  . The following properties shall hold for arbitrary lower approximation   and upper approximation defined by singleton, subset, and concept methods:(9a) if ,(9b) if .

Proofs.

Proof of (9a). Take an element . From any of the lower approximation definitions, is derived. Since from Property 2, we get that , and then . Thus, if , then (note that ).

Proof of (9b). Take an element . From any of the upper approximation definitions, is derived. Since from Property 2, we get that , and then . Thus, if , then (note that ).

6. Reducts and Core

The concept of reducts and core was introduced by Pawlak [2] for complete information systems. In this section, we propose a method to derive reducts and the core for incomplete information systems based on the extended tolerance relation. A subset $B$ of the condition attributes $A$ is a reduct of an incomplete information system if the tolerance classes induced by $B$ are the same as the tolerance classes induced by all attributes in $A$ and no attribute can be removed from $B$ without changing the tolerance classes.

Definition 9. The comparison Boolean function between two relations in terms of two attribute sets in an incomplete information system is defined as

If , the two relations developed from two different attribute sets make the same tolerance classes.

Definition 10. In the incomplete decision table, the function , where and is the power set of , is defined as

where is the decision value of an object .

Definition 11. The comparison Boolean function between two relations in terms of attribute sets in an incomplete decision table is defined as where

Proposition 12. The attribute is indispensable in if and only if for incomplete information system and for incomplete decision tables.

This proposition is applied to both incomplete information systems and incomplete decision tables. Consider that or   means that if is removed from , the tolerance classes based on extended tolerance relation in terms of are different from the tolerance classes based on . Hence, is indispensable in .

Proof. is indispensable in if and only if if and only if , from Definition 9, or from Definition 11.

Definition 13. The core of is the set of all indispensable attributes and defined by for incomplete information system and for incomplete decision table.

Proposition 14. A subset is a reduct of the incomplete information system (or decision table) with threshold if and only if(i)for incomplete information systems (or   for decision tables);(ii)For all   for incomplete information systems (or for decision tables).

Proof. Following the definition of reducts stated at the beginning of Section 6,    is a reduct if and only if (i)the tolerance classes induced by are the same as the tolerance classes induced by all attributes in set ;(ii)no attribute can be removed from set without changing the tolerance classes.
Based on Definitions 9 and 10, we have (i) for incomplete information systems (or for decision tables).
For the second condition, (ii) means all attributes in are indispensable. Consequently, (ii) for all for incomplete information systems (or for decision tables), according to Proposition 12.
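The two conditions of Proposition 14 suggest a straightforward brute-force search. The sketch below is only an illustration under stated assumptions: it uses the plain tolerance relation of Section 2.1 as a stand-in for the extended tolerance relation (whose tolerance degree is not reproduced here), it compares tolerance classes directly instead of evaluating the comparison functions of Definitions 9 and 11, and the toy table and all function names are hypothetical.

```python
from itertools import combinations

MISSING = "*"

# Hypothetical toy table (not Table 1 of the paper).
TABLE = {
    "x1": {"temperature": "high", "headache": "yes", "nausea": "no"},
    "x2": {"temperature": "high", "headache": "yes", "nausea": "yes"},
    "x3": {"temperature": "normal", "headache": "no", "nausea": "no"},
    "x4": {"temperature": MISSING, "headache": "no", "nausea": "yes"},
}
ATTRS = ("temperature", "headache", "nausea")


def tolerant(table, x, y, attrs):
    """Stand-in relation: values agree or one is missing on every attribute."""
    return all(
        table[x][a] == table[y][a] or MISSING in (table[x][a], table[y][a])
        for a in attrs
    )


def classes(table, attrs):
    """Tolerance class of every object under the given attributes."""
    return {
        x: frozenset(y for y in table if tolerant(table, x, y, attrs))
        for x in table
    }


def without(attrs, a):
    return tuple(b for b in attrs if b != a)


def core(table, attrs):
    """Attributes whose removal changes the tolerance classes."""
    full = classes(table, attrs)
    return {a for a in attrs if classes(table, without(attrs, a)) != full}


def reducts(table, attrs):
    """Subsets that (i) keep the classes and (ii) contain only indispensable attributes."""
    full = classes(table, attrs)
    found = []
    for size in range(1, len(attrs) + 1):
        for candidate in combinations(attrs, size):
            keeps = classes(table, candidate) == full
            minimal = all(
                classes(table, without(candidate, a)) != full for a in candidate
            )
            if keeps and minimal:
                found.append(candidate)
    return found


print(sorted(core(TABLE, ATTRS)))
print(reducts(TABLE, ATTRS))
```

The search is exponential in the number of attributes, so it is suitable only for small tables such as this one; for a decision table, the comparison of classes would additionally have to take the decision values into account.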

In the example shown in Table 1, given the threshold, the method of deriving the core shows that all attributes (temperature, headache, and nausea) are indispensable. Hence, in this system, {temperature, headache, nausea} is the core. The core also happens to be the only reduct of this incomplete information system with .

7. Conclusion

This paper studies a rough set theory for incomplete information systems and establishes a new model based on tolerance degree called extended tolerance relation based rough set model. Frequency of attribute values appearing in the decision table is used to estimate the probability of matching among data items on an attribute. Then, tolerance degree is calculated based on the existence of equivalence on some attributes and probability of matching. Given a threshold to control tolerance degree, a tolerance relation is defined.

The approach extends several existing rough set models and can mitigate the poor-approximation problem of Kryszkiewicz's tolerance relation. By adjusting the threshold, we are able to get the same results as the tolerance, limited tolerance, and equivalence relations. The variable threshold also gives us a means to widen or narrow the boundary region between the lower and upper approximations. Various lower and upper approximations can thus be obtained with the approach, and users can choose a threshold that suits their requirements.

The paper also discussed the mathematical properties of extended tolerance relation based rough set model and proposed a method to derive reducts and core.

Further research includes finding an algorithm to extract rules within the approach discussed here, which would be a significant application of rough set theory to knowledge acquisition from data.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.