Abstract

Grey theory is an essential method for acquiring uncertain knowledge from small samples with poor information. Classic grey theory does not adequately take the distribution of the data set into account and lacks effective methods for analyzing and mining large samples at multiple granularities. In view of the universality of the normal distribution, the normality grey number is proposed, and the corresponding definition and calculation method of the relational degree between normality grey numbers are constructed. On this basis, a multigranularity grey relational analysis method is put forward to realize automatic clustering at a specified granularity without any prior knowledge. Finally, experiments fully prove that it is an effective knowledge acquisition method for big data and multigranularity samples.

1. Introduction

Grey theory [1] is an effective uncertain knowledge acquisition model proposed by Professor Deng. It mainly focuses on small samples and limited information that is only partially known.

Grey relational analysis [2] is an important task of grey theory, by which the degree of similarity or difference among the development trends of various factors is measured. It has drawn more and more researchers' attention in recent years and achieved many research results. In paper [3], a grey relational degree of decision-making information was defined to handle situations where decision-makers are clustered. A decision method based on grey relational analysis and D-S evidence theory was proposed to significantly reduce the uncertainty of decisions [4]. In paper [5], the convexity of data was used to characterize the similarity of samples, and the concept of the 3D grey convex relational degree was put forward. The MYCIN uncertainty factor of fuzzy set theory and the grey relational method were combined to build an inferential decision model [6]. In paper [7], the grey degree was taken as the classification standard of objects' uncertainty, and a novel grey degree and grey number grading method was proposed based on set theory. The authors of [8] gave a decision method for quantitative and qualitative change in a scheme and a measure index of qualitative change judged by the evaluator, and a time-weighted ensuring model was then constructed based on the grey relational degree. A computing method of relational degree was defined based on an information reduction operator for interval grey number sequences, and a multicriteria interval grey number relational decision-making model was constructed [9].

From the above research achievements, it is clear that, with the development of the grey relational analysis field, researchers have paid more and more attention to the distribution of data in order to reduce uncertainty. Simultaneously, other uncertain knowledge acquisition methods have been combined with grey relational analysis for better application. However, there has been no research on multigranularity grey relational analysis for data sequences, especially for large data.

How to build information granules with strong data representation and efficient processing capability is the key question for multigranularity relational analysis. In this paper, combining the probability distribution of the data, the concept of the normality grey number is proposed, and the corresponding grey degree and grey relational degree are given. Finally, a method of grey relational analysis in multigranularity is constructed. Without any prior knowledge, it enables automatic clustering at a specified granularity. The experiments show the effectiveness of this method, which provides a novel grey-theory-based approach for big data knowledge acquisition.

2. Grey Theory Based on Normal Distribution

2.1. Grey Number and Grey Degree

Definition 1 (grey number [1]). A grey number is an uncertain number in an interval or a data set, denoted as "$\otimes$". That is, $\otimes \in [a^-, a^+]$, where $a^-$ and $a^+$ are, respectively, the infimum and supremum. When $a^- \to -\infty$ and $a^+ \to +\infty$, $\otimes$ is called a black number; when $a^- = a^+$, $\otimes$ is called a white number.

Grey numbers can be whitenized according to our understanding of them. Usually, the whitenization weight function [1] is used to describe the varying preference for the values of a grey number within its range. The whitenization weight function presents the subjective information about the grey number, and the grey degree of a grey number is the level that presents the amount of information it carries. Professor Deng gave the grey degree of a grey number in terms of its typical whitenization weight function.

Based on the length $\ell(\otimes)$ of the grey interval and the mean whitening number $\hat{\otimes}$ of a grey number $\otimes$, an axiomatic definition [10] of grey degree was given by Professor Liu et al.: $g^\circ(\otimes) = \ell(\otimes)/\hat{\otimes}$. However, there is a problem with the preceding definitions of grey degree: when $\hat{\otimes} \to 0$, the grey degree may tend to infinity, which is not normative. Consequently, a definition of grey degree based on the emergence background and domain was proposed by Professor Liu et al.: $g^\circ(\otimes) = \mu(\otimes)/\mu(\Omega)$, where $\mu(\otimes)$ is the measure of the grey number field and $\mu(\Omega)$ is the measure of the domain [10]. The entropy grey degree from Professor Zhang and Qin is $g^\circ(\otimes) = I(\otimes)/I_{\max}(\otimes)$, where $I(\otimes)$ and $I_{\max}(\otimes)$ express the entropy and maximum entropy of the grey number [11]. This definition requires the values of the grey number in the interval to be discrete; it is not appropriate for continuous values.
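As a brief worked illustration of the background-based definition (an assumed numeric example, not from the paper): for $\otimes \in [2, 6]$ on the domain $\Omega = [0, 10]$,

$$g^\circ(\otimes) = \frac{\mu(\otimes)}{\mu(\Omega)} = \frac{6 - 2}{10 - 0} = 0.4.$$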

2.2. Normal Distribution Grey Number and Its Grey Degree

A phenomenon is usually close to normally distributed when it is determined by the sum of several independent and slight random factors, each of whose effects is evenly small. The normal distribution exists widely in natural phenomena, social phenomena, science and technology, and production activities. Many random phenomena in practice obey, or approximately obey, the normal distribution.

Definition 2 (normality grey number). A normality grey number is an uncertain number in the interval $[a^-, a^+]$ or a data set, where $a^-$ and $a^+$ are the infimum and supremum. Its value obeys the normal distribution $N(\mu, \sigma^2)$ with mean $\mu$ and deviation $\sigma$ on $[a^-, a^+]$. The normality grey number is denoted as $\otimes_{N(\mu,\sigma^2)} \in [a^-, a^+]$, abbreviated as $\otimes_N$.

On the basis of the definition of the normality grey number, its typical whitenization weight function can be given (Figure 1).

The definition of grey degree based on the normal distribution has universal significance. According to the Lindeberg-Levy central limit theorem, random variables generated by any probability distribution, under sequence summing or the equivalent arithmetic mean, are uniformly drawn toward the normal distribution; that is,

$$\lim_{n \to \infty} P\!\left(\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \le x\right) = \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.$$

According to the properties of the normal distribution, the normality grey number has the following properties.

Given $\otimes_1 \sim N(\mu_1, \sigma_1^2)$ and a real number $k$, then $k\otimes_1 \sim N(k\mu_1, k^2\sigma_1^2)$.

Given independent $\otimes_1 \sim N(\mu_1, \sigma_1^2)$ and $\otimes_2 \sim N(\mu_2, \sigma_2^2)$, then $\otimes_1 + \otimes_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
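For instance, if $\otimes_1 \sim N(2, 1)$ and $\otimes_2 \sim N(3, 4)$ are independent, the two properties give

$$2\otimes_1 \sim N(4, 4), \qquad \otimes_1 + \otimes_2 \sim N(5, 5).$$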

The expectation $\mu$ and deviation $\sigma$ of the normal distribution describe the distribution of a continuous random variable. The probability density is higher the closer a value lies to $\mu$; conversely, it is lower. The deviation embodies the concentration of the variable's distribution: the smaller $\sigma$ is, the more concentrated the distribution. A striking feature of the values of a normal random variable is the "3$\sigma$" principle.

As shown in Figure 2, the distribution of values in $[\mu - 3\sigma, \mu + 3\sigma]$ reaches 99.74%, and this interval plays a crucial role in the whole distribution. So, this paper proposes a definition of grey degree based on the normal distribution.

Definition 3 (grey degree based on normal distribution). Given a normality grey number $\otimes_{N(\mu,\sigma^2)}$, its grey degree is $g^\circ(\otimes_N) = 6\sigma/\mu(\Omega)$, where $\mu(\Omega)$ is the domain measure.

According to Definition 3, the formal description of a grey number based on the normal distribution is $\otimes_N = \hat{\otimes}_{(g^\circ)}$, where the kernel $\hat{\otimes}$ is usually the expectation of the grey number over the domain, and $g^\circ$ is the grey degree, usually determined by the upper and lower bounds of the domain.
The grey number based on the normal distribution provides a novel grey-system-based approach for multigranularity knowledge acquisition under large data sets.
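To make Definition 3 concrete, here is a minimal sketch in Python, assuming the reading above (grey degree as the measure of the 3$\sigma$ field over the domain measure); the clipping of the field to the domain and all names are illustrative, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class NormalityGreyNumber:
    mu: float     # expectation (kernel of the grey number)
    sigma: float  # standard deviation

    def grey_degree(self, domain_low: float, domain_high: float) -> float:
        """Grey degree per Definition 3: measure of the 3-sigma field
        [mu - 3*sigma, mu + 3*sigma], clipped to the domain, divided by
        the domain measure."""
        lo = max(self.mu - 3 * self.sigma, domain_low)
        hi = min(self.mu + 3 * self.sigma, domain_high)
        return max(hi - lo, 0.0) / (domain_high - domain_low)

# Example: a sample summarized as N(5.2, 0.8^2) on the domain [0, 12].
g = NormalityGreyNumber(mu=5.2, sigma=0.8)
print(g.grey_degree(0.0, 12.0))  # 4.8 / 12 = 0.4
```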

2.3. Hypothesis Testing and Transformation of Normal Distribution

Although the normal distribution has favorable universality, many data distributions do not conform to the normality assumption. So, whether a given data set can be transformed into a grey number based on the normal distribution requires a normality hypothesis test. There are many methods for whole-sample normality testing, such as the Jarque-Bera test [12] (applied to large-scale samples) and the Lilliefors normality test [13] (generally used).

If the data set does not pass the normality test, it can be transformed toward a normal distribution through methods such as the power transformation of Box and Cox [14], the Johnson distribution curve [15], item grouping or packing [16], and Bootstrap resampling [17]. It should be noted that such data transformation is not entirely accurate: if the data set is not normally distributed itself, a new error is introduced by the normal transformation.
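As an illustrative sketch of this test-then-transform step (not the paper's code), using SciPy's implementations of the Jarque-Bera test and the Box-Cox transformation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=5000)  # clearly non-normal sample

# Jarque-Bera test (suited to large samples): H0 = data are normal.
jb_stat, jb_p = stats.jarque_bera(data)
if jb_p < 0.05:
    # Box-Cox power transformation toward normality (requires positive data).
    transformed, lam = stats.boxcox(data)
    print(f"normality rejected (p={jb_p:.3g}); Box-Cox lambda={lam:.3f}")
else:
    transformed = data
```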

3. Grey Relational Analysis Based on Normality Grey Number

3.1. Grey Relational Analysis

Grey relational analysis is a method for quantitatively describing and comparing the development tendencies of a system. The core idea is to compare the geometrical similarity between a reference data sequence and several comparative data sequences. The higher the grey correlation, the closer the sequences are in development direction and rate, and the closer their relationship.

Definition 4 (grey absolutely relational degree). Let $X_i$ and $X_j$ both be equally spaced sequences [10], and let $X_i^0$ and $X_j^0$, respectively, be their zeroed values of start point [10]. Then, the grey absolutely relational degree of $X_i$ and $X_j$ is

$$\varepsilon_{ij} = \frac{1 + |s_i| + |s_j|}{1 + |s_i| + |s_j| + |s_i - s_j|}, \qquad s_i = \int_1^n X_i^0\, dt,\quad s_j = \int_1^n X_j^0\, dt.$$

In addition, there is a relatively relational degree in grey relational analysis. Its construction is similar to that of the absolutely relational degree; the only difference is that $X_i$ and $X_j$ are preprocessed by their initial values before computing the zeroed values of start point.
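A minimal sketch of Definition 4 for two equally spaced sequences of the same length, assuming unit spacing between points and trapezoidal integration for the $s$ terms:

```python
import numpy as np

def absolute_grey_relational_degree(x_i, x_j):
    """Grey absolutely relational degree (Definition 4) of two equally
    spaced sequences of the same length."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    z_i = x_i - x_i[0]          # zeroed value of start point
    z_j = x_j - x_j[0]
    s_i = np.trapz(z_i)         # s_i = integral of the zeroed sequence
    s_j = np.trapz(z_j)
    return (1 + abs(s_i) + abs(s_j)) / (1 + abs(s_i) + abs(s_j) + abs(s_i - s_j))

print(absolute_grey_relational_degree([1, 2, 3, 4], [2, 4, 5, 7]))
```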

3.2. Grey Relational Degree Based on Normal Grey Number

The grey absolutely relational degree is the basis of grey relational analysis. However, carefully analyzing Definition 4 reveals the following problems with the grey absolutely relational degree.

First, the lengths of the sequences $X_i$ and $X_j$ must be equal; otherwise we need to fill in missing data or delete the excess data of the longer sequence, which increases the uncertainty of the system and has a direct impact on the values of the relational degree. Second, the grey absolutely relational degree is in fact defined for white numbers; it will not work if the sequence itself is uncertain or the elements of the sequence are directly grey numbers.

Based on the above analysis and combining the properties of the normality grey number, a grey relational degree based on the normality grey number is proposed. The normal random distribution is the core of this relational degree, which is obtained by calculating the area of intersection of two normal probability density curves. The intersecting situations of two normal distributions are shown in Figure 3 (the shaded area presents their mutual similarity).

Let $f_1(x)$ and $f_2(x)$ be the probability density functions of $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ (assume $\mu_1 \le \mu_2$), with distribution functions $F_1(x)$ and $F_2(x)$, and let $x_0$ be an intersection of the two curves $f_1$ and $f_2$. Then the intersecting area of $f_1$ and $f_2$ around a single crossing point is

$$S = F_2(x_0) + 1 - F_1(x_0) = \Phi\!\left(\frac{x_0 - \mu_2}{\sigma_2}\right) + 1 - \Phi\!\left(\frac{x_0 - \mu_1}{\sigma_1}\right),$$

where $\Phi$ is the standard normal distribution function. If $x_0$ is known, $(x_0 - \mu_1)/\sigma_1$ and $(x_0 - \mu_2)/\sigma_2$ are then obtained, and by consulting the table of the standard normal distribution the intersecting area can be calculated.

The intersections of the distribution curves are obtained from $f_1(x_0) = f_2(x_0)$; that is,

$$\frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\!\left(-\frac{(x_0 - \mu_1)^2}{2\sigma_1^2}\right) = \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\!\left(-\frac{(x_0 - \mu_2)^2}{2\sigma_2^2}\right),$$

which for $\sigma_1 \ne \sigma_2$ is a quadratic equation in $x_0$ with at most two roots $x_1 \le x_2$.

In light of the "3$\sigma$" principle of the normal distribution, 99.74% of the values lie in $[\mu - 3\sigma, \mu + 3\sigma]$. So, when calculating the similarity of two normal distributions, we only consider the distribution of the variables in this interval. Let $T_1 = \max(\mu_1 - 3\sigma_1, \mu_2 - 3\sigma_2)$ and $T_2 = \min(\mu_1 + 3\sigma_1, \mu_2 + 3\sigma_2)$; then there are the following three situations for the distributions of $f_1$ and $f_2$:

(1) If $T_1 \ge T_2$, the two 3$\sigma$ intervals do not overlap and the value distribution at the intersections can be neglected, so $S \approx 0$.

(2) If exactly one root $x_1$ or $x_2$ (denoted $x_0$) lies in the interval $[T_1, T_2]$, as shown in Figure 3(a), then $S = \Phi(u_2) + 1 - \Phi(u_1)$, where $u_1 = (x_0 - \mu_1)/\sigma_1$ and $u_2 = (x_0 - \mu_2)/\sigma_2$.

(3) If $x_1$ and $x_2$ are in the interval $[T_1, T_2]$ simultaneously, as shown in Figures 3(b) and 3(c), the narrower density exceeds the wider one only between the two roots; taking $\sigma_1 > \sigma_2$,

$$S = \Phi(v_1) + \left[\Phi(u_2) - \Phi(u_1)\right] + \left[1 - \Phi(v_2)\right],$$

where $u_k = (x_k - \mu_1)/\sigma_1$ and $v_k = (x_k - \mu_2)/\sigma_2$, $k = 1, 2$.

Considering the normalization of the similarity, $S$ must be normalized; the normalized area is then taken as the similarity of the two normal distributions.

Definition 5 (similarity of normal grey numbers). Given normality grey numbers $\otimes_1 \sim N(\mu_1, \sigma_1^2)$ and $\otimes_2 \sim N(\mu_2, \sigma_2^2)$, their relational degree is

$$\gamma(\otimes_1, \otimes_2) = \frac{S(\otimes_1, \otimes_2)}{S_{\max}},$$

where $S(\otimes_1, \otimes_2)$ is the intersection area between $f_1$ and $f_2$ and $S_{\max} = 0.9974$ is the largest attainable area under the 3$\sigma$ restriction. The similarity of normal grey numbers is abbreviated as the normality grey relational degree.
The similarity of normal grey numbers better accounts for the change of data within the interval, reflects the properties of the probability distributions, and is fundamental to grey relational analysis based on the normality grey number.
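The following is a numerical sketch of this similarity, computing the overlap area of two normal densities on the shared 3$\sigma$ window by direct integration rather than the closed-form cases above; the helper name is illustrative, and the constant 0.9974 reflects the 3$\sigma$ truncation:

```python
import numpy as np
from scipy import stats

def normality_grey_similarity(mu1, s1, mu2, s2, n=10001):
    """Overlap area of N(mu1, s1^2) and N(mu2, s2^2), restricted to the
    intersection of their 3-sigma intervals, normalized to [0, 1]."""
    t1 = max(mu1 - 3 * s1, mu2 - 3 * s2)
    t2 = min(mu1 + 3 * s1, mu2 + 3 * s2)
    if t1 >= t2:                      # 3-sigma windows do not overlap
        return 0.0
    x = np.linspace(t1, t2, n)
    overlap = np.trapz(np.minimum(stats.norm.pdf(x, mu1, s1),
                                  stats.norm.pdf(x, mu2, s2)), x)
    return overlap / 0.9974           # normalize by the 3-sigma mass

print(normality_grey_similarity(0.0, 1.0, 0.5, 1.2))
```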

3.3. Unsupervised Fast Grey Clustering Based on Grey Relational Degree

Grey clustering is a method of dividing observation indices or objects into several undefined categories according to the grey relational matrix or the whitenization weight function of grey numbers. The basic idea is to nondimensionalize the evaluation indicators of the original observation data, calculate the relational coefficients and relational degrees, and sort the evaluation indicators according to relational degree. Grey clustering only supports small-sample data at present, and the whole process needs manual intervention, which increases the complexity and uncertainty of the algorithm. Based on this analysis, unsupervised fast grey clustering based on grey relational degree is proposed in this paper. With the grey relational degree at its core and by constructing a grey relational matrix, the algorithm realizes unsupervised dynamic clustering using an improved k-means method.

The analysis of the improved algorithm is as follows. The most similar features are divided into the same category when the sample data are classified, and their relationship in space can be characterized by some norm measure. As the layers (i.e., the clustering number) are gradually refined, the relationship of the categories between layers changes, and this change can be portrayed by some rule; the clustering finishes once the requirement of the rule is met. In this paper, the standard deviation of the samples is used to characterize this change. With the increase of clustering layers, the categories become more and more aggregated, and their standard deviations continue to decrease. To judge whether the clustering has finished, the mean sample standard deviation $\bar{\sigma}$ is used as the convergence condition, with the category number bounded by $\sqrt{n}$ ($n$ is the number of samples); that is, the clustering is accomplished when $\bar{\sigma}$ falls below the threshold $\delta$.

Given the grey relational matrix

$$R = \begin{pmatrix} \gamma_{11} & \gamma_{12} & \cdots & \gamma_{1n} \\ \gamma_{21} & \gamma_{22} & \cdots & \gamma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{n1} & \gamma_{n2} & \cdots & \gamma_{nn} \end{pmatrix},$$

where $\gamma_{ij}$ is the grey relational degree of grey numbers $\otimes_i$ and $\otimes_j$, which may be grey numbers over irregular data or satisfying some distribution, for example, the normal distribution. According to the grey relational matrix, unsupervised fast grey clustering based on grey relational degree is proposed.

Algorithm 6 (unsupervised fast grey clustering based on grey relational degree). Input: $n$ sequences, each of length $m$. Output: aggregation matrix $G$ of pairs (Sample, Clusterid), where Clusterid is the category number.

Steps.
(1) Calculate the grey relational matrix $R$: $\gamma_{ij}$ is the grey relational degree of samples $i$ and $j$; Groupid is the clustering number, whose initial value is equal to zero.
(2) Extract all unrepeated $\gamma_{ij}$ in $R$ to construct an ascending category vector $V$.
(3) Calculate the threshold $\delta$. // the control condition for finishing the clustering
(4) Set the initial category number $k = 1$. // $k$ is the control variable of the loop
(5) Do:
(5.1) Construct the center category table $C$: divide $V$ into $k$ equal portions, and take the former value of each portion to join $C$ as the initial categories; set the mean standard deviation $\bar{\sigma} = 0$.
(5.2) Set $\bar{\sigma}'$ as a temporary control variable.
(5.3) While $\bar{\sigma} \ne \bar{\sigma}'$, execute the following loop: // after the clustering tends to stabilize, the standard deviation of all categories converges to a steady value
(a) $\bar{\sigma}' = \bar{\sigma}$;
(b) calculate the distance of every value in $V$ to all categories in $C$, and merge it into the category at minimum distance;
(c) revise the centers of all categories in $C$ according to the weighted average;
(d) compute the standard deviations of all categories in $C$, and set $\bar{\sigma}$ to their mean.
(5.4) $k = k + 1$. While $\bar{\sigma} > \delta$. // when $\bar{\sigma} \le \delta$, the aggregation of all categories is acceptable, and the clustering finishes
(6) Update Groupid in $R$ by the corresponding category in $C$.
(7) With Groupid in ascending order and $\gamma_{ij}$ in descending order, $R$ is processed as follows:
(7.1) Take each Groupid in turn, with $c$ the largest category number in $G$; the most similar sample set of that Groupid in $G$ receives Clusterid $c$. // the bigger the Clusterid, the more similar the samples within the cluster
(7.2) Record the head sample of each cluster. // the most similar sample of every cluster
(7.3) Append the results to $G$.
(8) Return $G$.

After clustering by Algorithm 6, we obtain the equivalent clusters of the samples under the grey relational degree; samples sharing the same Groupid are similar sample sequences.
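The following is a compact sketch of the spirit of Algorithm 6 under stated assumptions: samples are clustered from a pairwise grey relational matrix by growing $k$ in a k-means-style loop until the mean within-cluster standard deviation falls below a threshold. The function name, the representation of samples by matrix rows, and the exact stopping rule are illustrative, not the paper's exact procedure:

```python
import numpy as np

def unsupervised_grey_clustering(rel, delta=0.05, max_k=None):
    """Cluster samples from a symmetric grey relational matrix `rel`
    (entries = pairwise relational degrees); k grows until the mean
    within-cluster standard deviation drops to `delta` or below."""
    n = rel.shape[0]
    max_k = max_k or max(1, int(np.sqrt(n)))
    labels = np.zeros(n, dtype=int)
    for k in range(1, max_k + 1):
        # initial centers: k sample rows spread evenly over the data
        centers = rel[np.linspace(0, n - 1, k, dtype=int)].astype(float)
        for _ in range(100):                       # k-means-style refinement
            dist = np.linalg.norm(rel[:, None, :] - centers[None], axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array(
                [rel[labels == c].mean(axis=0) if (labels == c).any()
                 else centers[c] for c in range(k)])
            if np.allclose(new_centers, centers):  # assignments stabilized
                break
            centers = new_centers
        sigma = np.mean([rel[labels == c].std()
                         for c in range(k) if (labels == c).any()])
        if sigma <= delta:                         # convergence rule met
            return labels
    return labels
```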

When the classic grey relational degree is used as the basis for finding similar sequences, Algorithm 6 has strict requirements on the length and values of the sequences. Under the normality grey relational degree, however, it is the distributions of the sequences that are compared; there is no rigid requirement on sequence length, and the algorithm is suitable for grey relational analysis of large and multigranularity samples. For example, given the power consumption of one city over a year (by the day), we may need to analyze and make statistics of the power consumption by the week, the month, and the quarter. The traditional grey relational analysis method cannot do this.

3.4. Multigranularity Grey Clustering Algorithm Based on Normal Grey Domain

Granularity is a concept from physics, where it means a measure of the size of microparticles; in the artificial intelligence field, it measures the thickness of information. Physical granularity involves the refined partition of physical objects, while information granularity measures the refinement of information and knowledge [18, 19]. The essence of granular computing is to select the appropriate granularity and thereby reduce the complexity of problem solving.

Let $\mathcal{R}$ denote the set composed of all equivalence relations on a domain $X$. The order between equivalence relations can be defined as follows.

Definition 7 (the thickness of granularity). Let $R_1, R_2 \in \mathcal{R}$. If, for all $x, y \in X$, $x R_1 y \Rightarrow x R_2 y$, then $R_1$ is said to be finer than $R_2$, denoted as $R_1 \le R_2$.

Definition 7 expresses the "coarse" and "fine" concepts of granularity; for example, on $X = \{a, b, c, d\}$ the partition $\{\{a\}, \{b\}, \{c, d\}\}$ is finer than $\{\{a, b\}, \{c, d\}\}$. On this basis, the following theorem can be proved [4].

Theorem 8. Under the relation "$\le$" defined above, $\langle \mathcal{R}, \le \rangle$ forms a complete semiorder lattice.

Theorem 8 is very revealing about the core property of granularity. Based on the theorem, the following chain of equivalence relations, from the finest to the coarsest granularity, can be obtained:

$$R_1 \le R_2 \le \cdots \le R_n, \qquad R_i \in \mathcal{R}.$$

The above sequence corresponds visually to an $n$-level tree. Let $T$ be an $n$-level tree whose leaf nodes construct a set $X$; then the nodes in every level correspond to a partition of $X$. The hierarchical diagram obtained by clustering is also an $n$-level tree, so there must be a corresponding sequence of equivalence relations. This fully shows that there is a good correspondence between clustering and granularity, which is the theoretical basis of the proposed algorithm.

According to the properties of granular computing and Algorithm 6, a multigranularity grey clustering algorithm based on the normality grey domain is proposed. The algorithm partitions data at the designated granularity based on the time sequence and then performs dynamic clustering. It includes data partitioning of the sequence, normality grey sequence construction, and grey clustering. We take time sequence data as an example to elaborate the algorithm as follows.

Given a time sequence $X = (x_1, x_2, \ldots, x_m)$ and the granularity $g$ necessary for analysis, the original time sequence is partitioned as

$$X = (X_1, X_2, \ldots, X_k), \qquad X_i = (x_{(i-1)g+1}, \ldots, x_{ig}), \quad k = \lceil m/g \rceil.$$

It should be noted that the partition of the time sequence at the designated granularity $g$ may be incomplete: the numbers of values in the subsequences are not necessarily exactly equal, and the last subsequence may contain fewer than $g$ values. The partitioned sequence is composed of $k$ subsequences. Calculating the expectation and deviation of every subsequence, the sequence can be transformed into the normality grey sequence

$$X' = \left(\otimes_{N(\mu_1, \sigma_1^2)}, \otimes_{N(\mu_2, \sigma_2^2)}, \ldots, \otimes_{N(\mu_k, \sigma_k^2)}\right).$$
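A small sketch of this granulation step, assuming the partition just described (the trailing partial granule is kept); the helper name is illustrative:

```python
import numpy as np

def granulate(series, g):
    """Partition a 1-D series into granules of length g and summarize each
    granule as a normality grey number (mean, standard deviation)."""
    series = np.asarray(series, dtype=float)
    granules = [series[i:i + g] for i in range(0, len(series), g)]
    return [(chunk.mean(), chunk.std()) for chunk in granules]

# Example: daily readings summarized at a 7-day (weekly) granularity.
daily = np.sin(np.linspace(0, 12, 84)) + 5
weekly_grey = granulate(daily, 7)   # list of (mu, sigma) per week
```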

After the sequence is partitioned at the designated granularity, the grey clustering algorithm based on the normality grey domain can be given.

Algorithm 9 (grey clustering algorithm based on normality grey domain). Input: $n$ time sequences, each of length $m$, and the time granularity $g$. Output: aggregation matrix $G$, where Clusterid is the category number.

Steps.
(1) Sequence granularity partitioning: invoke the sequence partition formula above to partition the $n$ sequences at granularity $g$, obtaining the partitioned sequences $X_i = (X_{i1}, X_{i2}, \ldots, X_{ik})$, $i = 1, 2, \ldots, n$.
(2) Apply normality hypothesis testing to the subsequences; if they do not satisfy the normal distribution, make a normal distribution transformation.
(3) Transform every $X_i$ into the normality grey sequence $X_i'$.
(4) Invoke Algorithm 6 to dynamically cluster the sequences $X_i'$, obtaining $G$.
(5) Return $G$.
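Putting the pieces together, here is a sketch of the Algorithm 9 pipeline built from the illustrative helpers defined earlier (granulate, normality_grey_similarity, unsupervised_grey_clustering); daily_series stands for the raw data, and the normality test/transformation of step (2) is elided for brevity:

```python
import numpy as np

# 1. Partition the raw series at the designated granularity (e.g., monthly).
grey_seq = granulate(daily_series, g=30)            # [(mu, sigma), ...]

# 3. Grey relational matrix from pairwise normality grey similarities.
rel = np.array([[normality_grey_similarity(m1, s1, m2, s2)
                 for (m2, s2) in grey_seq]
                for (m1, s1) in grey_seq])

# 4. Dynamic clustering without specifying the category number upfront.
labels = unsupervised_grey_clustering(rel)
```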

4. Experiment Results and Analysis

In order to evaluate the performance of Algorithm 9, two data sets from the UCI database are selected for the experiments (details are shown in Table 1).

Dataset 1 records the power consumption of a family over four years (sampled at 1-minute intervals, from October 1, 2006, to November 1, 2010), containing 2,049,280 effective records. Dataset 2 gathers vehicle traffic data from a highway junction in Los Angeles, collected by sensor at 5-minute intervals; there are 47,497 effective records from April 10, 2005, to October 1, 2005.

According to the features of the datasets, we analyze the power consumption in Dataset 1 by the month to find the peaks and troughs in electricity demand, and we analyze the traffic situation of the highway in Dataset 2 by the hour.

Using Algorithm 9 to process Dataset 1, the time sequences are first transformed into normality grey sequences at the designated time granularity (by the month), as shown in Table 2.

Clustering the normality grey sequences by years at the designated granularity (by the month), the following clustering results can be obtained (sorted by expectation in descending order): {January, February, November, December}, {March, April, May, June, September, October}, and {July, August}.

It can be seen that power consumption was highest in January, February, November, and December and increased steadily; consumption was moderate in March, April, May, June, September, and October, with relatively stable change; consumption was lowest in July and August, where the fluctuation was most evident. The clustering result is shown in Figure 4.

Using Algorithm 9 to process Dataset 2, the time sequences are first transformed into normality grey sequences at the designated time granularity (by the hour), as shown in Table 3.

Clustering the normality grey sequences by years at the designated granularity (by the hour), the clustering results below can be obtained (sorted by expectation in descending order).

The clustering results show that daily traffic is heaviest at 9, 14, 15, 16, and 17 o'clock, with relatively stable volume; traffic is also heavy at 10, 11, 12, 13, and 19 o'clock, where the fluctuation of volume is not very violent; the least busy hours are 3 and 4 o'clock in the early morning, when the volume is lowest and most fluctuant. The clustering result is shown in Figure 5.

As shown in Figures 4 and 5, after multigranularity clustering using Algorithm 9, the results match the actual data distribution well. To further compare clustering performance, the classic k-means and DBSCAN [20] clustering algorithms are introduced into the experiment. For fairness, the datasets are all transformed into normality grey sequences. k-means and DBSCAN need the number of categories to be appointed or the optimum parameters to be set artificially, whereas Algorithm 9 uses no prior knowledge. The performance of all three algorithms is reported under their best experimental results. The comparison criteria are entropy and purity [20]: the best algorithm has the lowest entropy and the highest purity. The experimental results are in Table 4.

Table 4 shows that the performance of Algorithm 9 is obviously superior to DBSCAN and roughly equivalent to k-means. The number of clusters found by Algorithm 9 is almost the same as those of the other clustering algorithms, but Algorithm 9 clusters automatically and does not need any prior knowledge.

5. Conclusion

Although grey theory is maturing, it is not widely applied in practice. An important reason is the lack of effective grey relational analysis methods for large data or multigranularity samples. In this paper, the concept of the normality grey number is built and the corresponding grey degree is defined based on probability distributions. Meanwhile, a multigranularity grey relational analysis method based on the normality grey relational degree is proposed. The whole method realizes automatic clustering without prior knowledge. Finally, the proposed method shows more effective and significant performance than the other algorithms in the clustering experiments. In further research, we will focus on grey modeling and prediction models based on normality grey sequences and build a complete theoretical system for normality grey relational analysis and multigranularity simulation and prediction.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant no. 61309014; the Natural Science Foundation Project of CQ CSTC under Grant nos. cstc2013jcyjA40009, cstc2013jcyjA40063; the Natural Science Foundation Project of CQUPT under Grant no. A2012-96.