Abstract

The similarity between objects is a core research area of data mining. In order to reduce the interference caused by the uncertainty of natural language, a similarity measurement between normal cloud models is adopted for text classification research. On this basis, a novel text classifier based on cloud concept jumping up (CCJU-TC) is proposed, which can efficiently accomplish the conversion between qualitative concepts and quantitative data. Through the conversion from a text set to a text information table based on the VSM model, the qualitative text concepts extracted from the same category are jumped up into a whole category concept. According to the cloud similarity between the test text and each category concept, the test text is assigned to the most similar category. A comparison among different text classifiers over different feature selection sets fully proves that not only does CCJU-TC have a strong ability to adapt to different text features, but its classification performance is also better than that of the traditional classifiers.

1. Introduction

Text classification is a key technology in text mining, which denotes the task of assigning raw text documents to one or more predefined categories. It is a direct concept from machine learning, which implies the declaration of a set of labeled categories as a way to represent the documents and a statistical classifier trained with a labeled training set. Classification is the process in which objects are recognized, differentiated, and understood, and it implies that objects are grouped into categories, usually for some specific purpose. Classification is fundamental in prediction, inference, and decision making, and there are a variety of ways to approach the classification task [1].

In recent years, the continued development and innovation in information technology and natural language processing (NLP) have laid a solid theoretical and practical basis for text classification. An increasing number of supervised classification approaches have been developed for various types of classification tasks, such as decision trees (DT) [2, 3], neural networks (NN) [4, 5], naive Bayes (NB) [6-8], support vector machines (SVM) [9-12], and k-nearest neighbor (kNN) [13-15]. These classifiers have their own characteristics, but from the perspective of comprehensive performance, the SVM, kNN, and NB methods are the most excellent [16].

SVM can handle high-dimensional sparse text and is not sensitive to feature relevance. But its classification speed is slow, it lacks a native method for multiclass classification, and the choice of kernel function and parameters still depends on experience [9]. The kNN method has the advantages of simplicity and stable performance, and in particular good robustness to noisy data. But it requires a large number of samples to obtain a threshold, and at the same time the choice of the value of k has a direct impact on classification performance [14]. NB is simple, efficient, and stable. It requires few parameters and is not sensitive to missing data. However, NB assumes that the attributes are mutually independent, an assumption that often does not hold in practice. Especially when the number of attributes is large or the correlation between attributes is strong, its efficiency is low [17].

The cloud model [18-20] is an uncertainty conversion model between a qualitative concept and its quantitative value as expressed in natural language. The character of a cloud can be expressed by the expected value (Ex), entropy (En), and hyper entropy (He) [18]. Ex is the center value of the concept in the universe of discourse, which represents the value of the qualitative concept. En is the measure of the fuzziness of the qualitative concept, which reflects the numerical range that can be accepted by this concept in the universe of discourse: the bigger the entropy is, the bigger the numerical range accepted by the concept and the fuzzier the concept. He embodies the uncertainty of the qualitative concept.

Given the three digital features Ex, En, and He, the vector $C = (Ex, En, He)$ is called a cloud vector, which represents a qualitative concept.
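To make the representation concrete, a cloud vector can be held in a small data structure. The following is a minimal Python sketch; the type and field names are ours, introduced for illustration:

```python
from dataclasses import dataclass

@dataclass
class Cloud:
    """A normal cloud vector C(Ex, En, He) representing a qualitative concept."""
    Ex: float  # expected value: center of the concept in the universe of discourse
    En: float  # entropy: fuzziness, the numerical range the concept accepts
    He: float  # hyper entropy: uncertainty of the entropy itself
```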

With this efficient conversion function between qualitative concepts and quantitative data, the concept extraction method of the cloud model is applied to text classification. Through the conversion from a text collection to a text information table based on the VSM model, the qualitative text concepts extracted from the same category are jumped up into a whole category concept. On the basis of the cloud similarity between the test text and each category, the test text is assigned to the most similar category. A comparison among different text classifiers based on different feature selection methods shows that CCJU-TC not only has a strong ability to adapt to different text features, but also has better classification performance than the traditional classifiers.

2. Similarity Measurement between Normal Cloud Models

In cloud model theory, the cloud vector is used to describe a qualitative concept. When comparing the similarity between vectors, the Euclidean distance or the cosine angle is widely used. However, the physical meaning and the importance of each dimension of the cloud vector are completely different. Therefore, a novel similarity measurement between cloud vectors is required. Taking into account that the normal cloud model is the most universal, we propose a similarity degree between normal clouds.

The normal random distribution is the core of this similarity degree, which is obtained by calculating the area of two intersecting normal cloud curves. The intersecting situations of two normal clouds are shown in Figure 1 (the shadow area represents their mutual similarity).

Suppose $f_1(x)$ and $f_2(x)$ are, respectively, the density functions of $N(Ex_1, En_1^2)$ and $N(Ex_2, En_2^2)$, and let $x_0$ be an intersection of the two curves with $Ex_1 \le x_0 \le Ex_2$. Then the intersecting area of $f_1(x)$ and $f_2(x)$ is
$$S = 1 - \Phi(u_1) + \Phi(u_2),$$
where $u_1 = (x_0 - Ex_1)/En_1$, $u_2 = (x_0 - Ex_2)/En_2$, and $\Phi$ is the standard normal distribution function. If $x_0$ is known, $u_1$ and $u_2$ are then obtained. Consulting the table of the standard normal distribution, the intersecting area can be calculated.

On the basis of the intersection between the normal cloud curves, setting $f_1(x) = f_2(x)$ and taking logarithms, one obtains
$$\frac{(x - Ex_1)^2}{2En_1^2} + \ln En_1 = \frac{(x - Ex_2)^2}{2En_2^2} + \ln En_2,$$
a quadratic equation whose roots $x_1$ and $x_2$ are the intersection points of the two curves.

In light of the "$3En$" principle of the normal distribution, 99.74% of the values lie in $[Ex - 3En, Ex + 3En]$. So, when calculating the similarity of two normal distributions, we only consider the distribution of the variables in this interval. It is convenient to set $[a, b] = [\max(Ex_1 - 3En_1, Ex_2 - 3En_2), \min(Ex_1 + 3En_1, Ex_2 + 3En_2)]$; then there are the three following situations for the distribution of $x_1$ and $x_2$.

If $x_1, x_2 \notin [a, b]$, indicating that the value distribution at the intersections can be neglected, then $S \approx 0$.

If there is exactly one point $x_0$ ($x_0 = x_1$ or $x_0 = x_2$) in $[a, b]$, as shown in Figure 1(a), then
$$S = 1 - \Phi(u_1) + \Phi(u_2),$$
where $u_1 = (x_0 - Ex_1)/En_1$ and $u_2 = (x_0 - Ex_2)/En_2$.

If $x_1$ and $x_2$ (with $x_1 < x_2$) are in the interval simultaneously, as shown in Figures 1(b) and 1(c), then
$$S = \Phi\!\left(\frac{x_1 - Ex_1}{En_1}\right) + \Phi\!\left(\frac{x_2 - Ex_2}{En_2}\right) - \Phi\!\left(\frac{x_1 - Ex_2}{En_2}\right) + 1 - \Phi\!\left(\frac{x_2 - Ex_1}{En_1}\right),$$
where, without loss of generality, $En_1 < En_2$, so that $f_1$ lies above $f_2$ between the two intersection points.

Considering the normalization requirement of the similarity, $S$ must be normalized. The normalized area is then taken as the similarity of the two normal distributions.

Definition 1 (similarity of normal clouds). Given normal clouds $C_1(Ex_1, En_1, He_1)$ and $C_2(Ex_2, En_2, He_2)$, their similarity degree is $\delta(C_1, C_2) = S$, where $S$ is the (normalized) intersection area between $N(Ex_1, En_1^2)$ and $N(Ex_2, En_2^2)$.
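As a concrete illustration, the similarity can be computed programmatically. The sketch below reuses the Cloud structure introduced earlier and, instead of the table lookup and case analysis above, approximates the intersecting area by numerically integrating $\min(f_1, f_2)$ over the shared $3En$ interval; the numerical route is our simplification, not the paper's procedure:

```python
import math

def cloud_similarity(c1: Cloud, c2: Cloud) -> float:
    """Similarity of two normal clouds: the overlap area of N(Ex1, En1^2)
    and N(Ex2, En2^2), restricted to the 3En interval of both clouds."""
    # Interval where both distributions carry ~99.74% of their mass.
    a = max(c1.Ex - 3 * c1.En, c2.Ex - 3 * c2.En)
    b = min(c1.Ex + 3 * c1.En, c2.Ex + 3 * c2.En)
    if a >= b:
        return 0.0  # the 3En intervals do not overlap: S ~ 0

    def pdf(x: float, ex: float, en: float) -> float:
        return math.exp(-(x - ex) ** 2 / (2 * en ** 2)) / (en * math.sqrt(2 * math.pi))

    # Midpoint rule over min(f1, f2), i.e., the shadow area in Figure 1.
    n = 1000
    h = (b - a) / n
    return sum(min(pdf(a + (i + 0.5) * h, c1.Ex, c1.En),
                   pdf(a + (i + 0.5) * h, c2.Ex, c2.En)) for i in range(n)) * h
```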

3. Cloud Virtual Pan-Concept-Tree and Concept Jumping Up

3.1. Cloud Virtual Pan-Concept-Tree

The core idea of data mining is to discover and obtain potential knowledge. In this process, the knowledge layers differ with different knowledge granularities. To deal with the uncertainty of qualitative concepts, the cloud model can be used for text concept representation. A text concept set is composed of a series of basic concepts, each of which can be represented as a cloud vector $C(Ex, En, He)$. On this basis, a text virtual concept tree can be constructed layer by layer.

The virtual concept tree, which is based on the cloud model, is uncertain. On the same layer, the distinction between the various concepts is flexible, and a certain degree of overlap is allowed. In other words, attributes with the same value may belong to different concepts, and different attributes make different contributions to a concept. In the construction process, the layer of concept extraction is not fixed: concepts can be extracted from the bottom layer or from the upper layers.

In the data mining process, as the granularity and the degree of abstraction of concepts increase, there is no physical structure of the pan-concept-tree, and likewise no process of concepts climbing and jumping up layer by layer. The granularity of concepts is continuous, and a concept can jump up to any greater granularity.

Figure 2 shows a pan-concept-tree for person age distribution based on the cloud model.

On the basis of the cloud pan-concept-tree, a similar pan-concept-tree model for text classification is proposed (Figure 3).

3.2. Concept Jumping Up and Merger

The jumping up of a concept refers to directly raising the concept from the leaf nodes of the pan-concept-tree to the required granularity or layer.

The strategies for concept jumping up are as follows:
(a) jumping up with a user-specified number of concepts;
(b) automatic jumping up without a specified number of concepts;
(c) man-machine interactive jumping up.

The concept jumping up process of the pan-concept-tree usually visits the original dataset only a few times, which reduces the computing overhead when the original dataset is large. Strategies (a) and (b) access the dataset only once, while strategy (c) requires more passes, its I/O overhead depending on the number of human-machine interactions. In fact, strategy (c) can be achieved by invoking strategy (a) repeatedly.

In text classification practice, strategy (a) is mainly used for concept jumping up because the number of text categories is known. In the concept jumping up process, the merger between concepts is the most important operation: it merges two adjacent concepts and forms an upper-layer (or coarser-granularity) concept. In the cloud model, the concept merger operation is described as follows [7, 8].

Given two adjacent concepts, signed as cloud vectors $C_1(Ex_1, En_1, He_1)$ and $C_2(Ex_2, En_2, He_2)$, the merged cloud vector is $C(Ex, En, He)$, where
$$Ex = \frac{Ex_1 En_1 + Ex_2 En_2}{En_1 + En_2}, \qquad En = En_1 + En_2, \qquad He = \frac{He_1 En_1 + He_2 En_2}{En_1 + En_2}.$$
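In code, the merger reads as follows. This sketch assumes the entropy-weighted synthesis reconstructed above and reuses the Cloud structure from earlier:

```python
def merge_clouds(c1: Cloud, c2: Cloud) -> Cloud:
    """Merge two adjacent concepts into one upper-layer (coarser) concept."""
    en = c1.En + c2.En
    ex = (c1.Ex * c1.En + c2.Ex * c2.En) / en  # entropy-weighted center
    he = (c1.He * c1.En + c2.He * c2.En) / en  # entropy-weighted hyper entropy
    return Cloud(ex, en, he)
```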

4. Text Classification Approach Based on Cloud Concept Jumping Up

In order to apply cloud model theory to text mining, we need to apply the corresponding preprocessing to the text, which involves the construction and conversion of the text information table.

4.1. Text Information Table and Conversion

Definition 2. An information table [5] is defined as $S = (U, A, V, f)$, where $U$ is a nonempty finite set of objects and $A$ is a nonempty finite set of attributes, $A = C \cup D$, where $C$ is the set of condition attributes and $D$ is the set of decision attributes, $C \cap D = \emptyset$. $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain of the attribute $a$. $f: U \times A \to V$ is a total function such that $f(x, a) \in V_a$ for every $a \in A$, $x \in U$.

On the basis of the information table, the text information table is defined.

Definition 3. A text information table is an information table $S = (U, A, V, f)$, where $U$ is a text set and $A$ is a nonempty finite set of attributes, $A = C \cup D$, where $C$ is the set of text feature attributes and $D$ is the set of categories, $C \cap D = \emptyset$. $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain of the attribute $a$. $f: U \times A \to V$ is a total function such that $f(x, a) \in V_a$ for every $a \in A$, $x \in U$.

To use the cloud model to describe the uncertainty between texts, the values of the different attributes of a text must belong to the same domain; that is, the values of different attributes should have the same physical meaning. But the values and meanings of the attributes in the existing text information table are so different that we must find a way to convert them into the same physical domain. In this paper, a text information table conversion algorithm based on a statistical method is proposed.

Algorithm 4. The conversion algorithm for the text information table.

Input. Text information table $S = (U, A, V, f)$.

Output. Converted text information table $S' = (U, A, V', f')$.

Step 1. Set $S' = S$.

Step 2. For each attribute $a_i$ in $C$, where $D = \{d_1, \dots, d_m\}$ is the category set of $U$, $n_j$ is the number of texts whose category is $d_j$, and $v_{ij}$ is the value of the $i$th attribute of the $j$th text:
(1) compute the per-category statistics of the values of $a_i$;
(2) compute the fluctuation degree of the attribute values of all samples;
(3) normalize the fluctuation degrees.
Next.

Step 3. Return $S'$.
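Since the exact formulas of Algorithm 4 did not survive extraction, the following Python sketch fixes one plausible reading: the fluctuation degree is taken as the absolute deviation of each attribute value from its per-category mean, followed by per-attribute max-normalization. Treat both choices as our assumptions:

```python
from collections import defaultdict

def convert_information_table(values, categories):
    """values[j][i]: value of the i-th feature attribute of the j-th text;
    categories[j]: category label of the j-th text.
    Returns a table of normalized fluctuation degrees (same shape)."""
    n_attrs = len(values[0])
    # Per-category, per-attribute sums and counts, to get category means.
    sums = defaultdict(lambda: [0.0] * n_attrs)
    counts = defaultdict(int)
    for v, c in zip(values, categories):
        counts[c] += 1
        for i in range(n_attrs):
            sums[c][i] += v[i]
    means = {c: [s / counts[c] for s in sums[c]] for c in sums}
    # Fluctuation degree: absolute deviation from the category mean.
    fluct = [[abs(v[i] - means[c][i]) for i in range(n_attrs)]
             for v, c in zip(values, categories)]
    # Normalize each attribute column to [0, 1].
    for i in range(n_attrs):
        col_max = max(row[i] for row in fluct) or 1.0
        for row in fluct:
            row[i] /= col_max
    return fluct
```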

After the conversion by Algorithm 4, the attributes of the text information table are mapped into the same physical space. The new table shows the fluctuation degree of the values in the different categories and describes their statistical distribution.

4.2. Cloud Concept Jumping Up Classifier

On the basis of text information table conversion and similarity calculation, the text classifier based on cloud concept jumping up (CCJU-TC) is proposed. The classifier works as follows (Figure 4).

The whole CCJU-TC algorithm is divided into text preprocessing, text information table conversion, category concept extraction, cloud-model conceptual similarity computation, and several other components.

Algorithm 5. CCJU-TC (the text classifier based on cloud concept jumping up).

Input. Training text set $U$, test text $t$ (unknown category).

Output. The category of $t$.

Step 1. Segment the texts (including $U$ and $t$) and remove stop words.
Step 2. Compute the weight of items by the TF-IDF [9] formula.
Step 3. Select the text features (items).
Step 4. Construct the information table $S = (U, C \cup D, V, f)$, where $C$ is the text feature set and $D$ is the category set.
Step 5. Invoke Algorithm 4 to convert $S$ to $S'$.
Step 6. Loop over each category in $D$ and calculate its concept cloud vector, finally obtaining the category concept set $CS$:
(a) For $k = 1$ to $|D|$
(b) obtain the category concept $C_k$ by concept merging over the texts with the same category $d_k$ and their corresponding cloud models
(c) Next
(d) $CS = \{C_1, C_2, \dots, C_{|D|}\}$.
Step 7. Compute $\delta_k = \mathrm{Sim}(C_t, C_k)$, where $C_t$ is the cloud vector of $t$ and $\delta_k$ is the cloud similarity between $t$ and each category concept $C_k$ in $CS$. The category with the greatest similarity is the category of text $t$.
Step 8. Return.
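Putting the sketches together, a minimal end-to-end version of Steps 6 and 7 might look as follows. How a single text row becomes a leaf cloud (text_to_cloud) is our assumption, since the paper does not spell it out here:

```python
import math
from collections import defaultdict
from functools import reduce

def text_to_cloud(row) -> Cloud:
    """Estimate a leaf cloud from one converted text row (illustrative choice):
    Ex = row mean, En = row standard deviation, He = small constant."""
    m = sum(row) / len(row)
    std = math.sqrt(sum((x - m) ** 2 for x in row) / len(row))
    return Cloud(m, std or 1e-6, 0.01)

def ccju_tc(train_rows, train_labels, test_row):
    """Classify test_row by cloud similarity to jumped-up category concepts."""
    # Step 6: merge same-category leaf clouds into one category concept.
    by_cat = defaultdict(list)
    for row, label in zip(train_rows, train_labels):
        by_cat[label].append(text_to_cloud(row))
    concepts = {c: reduce(merge_clouds, clouds) for c, clouds in by_cat.items()}
    # Step 7: assign the category whose concept is most similar to the test text.
    t = text_to_cloud(test_row)
    return max(concepts, key=lambda c: cloud_similarity(t, concepts[c]))
```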

5. Experiments and Evaluations

5.1. Experiment on Different Datasets

In order to evaluate the performance of CCJU-TC, we acquired four datasets of varying characteristics. Each dataset has its own unique characteristics in terms of the degree of similarity between categories and the dimensionality of the categories, as shown in Table 1.

The Featured Articles dataset was designed and organized by our research group by extracting different types of articles from the Wikipedia website. A total of 1159 articles were acquired from twenty-three randomly selected categories. 10 documents from each category were randomly selected to build the training set, and the remaining documents were used for testing purposes. In other words, the training set of this dataset consists of 230 documents, while the testing set consists of a total of 929 documents.

The Vehicles dataset was built by extracting vehicle related articles from Wikipedia website. This dataset was acquired by extracting articles from four subcategories in the category of ‘‘Vehicles.’’ All the four categories are easily differentiated and each category has its own unique keywords. This dataset consists of 640 documents. Each category consists of 160 documents where 50 documents were used to build the training set and the remaining 110 documents were utilized for testing purposes. In other words, the training set consists of a total of 200 documents, while the testing set contains a total of 440 documents.

A dataset containing articles about mathematical topics has been acquired from arxiv.org. This dataset consists of eight categories regarding mathematical topics. 40 documents for each category have been collected, and the entire dataset consists of a total of 320 documents. 10 documents from each category were extracted randomly to build the training set, while the remaining 30 documents from each category were used for testing purposes.

The 20-Newsgroups dataset is one of the standard benchmark datasets used by many text classification research groups to evaluate the performance of their classification approaches. It is a collection of 20,000 Usenet articles from twenty different newsgroups, with 1000 articles per newsgroup, and has become a widely used dataset for experiments in text applications of machine learning techniques, such as text classification and text clustering. The 20-Newsgroups dataset used in our experiments was acquired from the CMU Text Learning Group's website. In our experiments, every category was divided into two subsets: 300 documents from each category were selected for training, while the remaining 700 documents were used for testing purposes. In other words, the training set consists of 6000 documents, and the remaining 14,000 documents are used for testing purposes.

Many research works in text document classification apply preprocessing such as stop word elimination, word stemming, and feature selection to the datasets used in their experiments and evaluations in order to obtain better experimental results.

As our research goal in this paper is to evaluate the performance of CCJU-TC without sacrificing the simplicity and low cost of the classification algorithm, we did not perform preprocessing such as stop word elimination or word stemming on the datasets used in our experiments. In order to reduce the interference of different feature sets, the IG, CHI [21, 22], and FAS [23, 24] feature selection methods are adopted.

In all experiments, we employ the SVM (Torch), kNN, and NB classifiers for comparison. To construct the training and test sets, we apply 3-fold cross validation, which randomly divides the text sets into three parts: two parts are used as the training set, and the remaining part is the test set. The average of the three classification results is taken as the experimental result, which is evaluated by the precision rate (P), recall rate (R), and F1 measure. The details of the experimental results are in Tables 2, 3, and 4.
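For reference, one common way to compute macro-averaged P, R, and F1 over all categories is sketched below (a generic routine; the paper does not specify its averaging convention, so macro-averaging is our assumption):

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision (P), recall (R), and F1 over all categories."""
    cats = set(y_true) | set(y_pred)
    ps, rs, fs = [], [], []
    for c in cats:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p)
        rs.append(r)
        fs.append(f)
    n = len(cats)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```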

In order to compare the performance of the classifiers more intuitively, Figure 5 is created based on the above tables.

5.2. Summary of Experiment

In the above classification experiments, CCJU-TC has the best performance, followed by kNN and SVM; the classification performance of NB is the worst (Figure 5). Meanwhile, different feature selection methods also have different impacts on text categorization performance. The performance of the text classifiers whose features are selected by FAS is higher and more stable (Figure 5(c)). The CHI feature selection method has a more significant impact on text classification, mainly due to its over-reliance on low-frequency features, especially on the 20-Newsgroups dataset (Figure 5(b)).

Through comparison tests among multiple text classifiers with a variety of feature selection methods on different datasets, CCJU-TC has shown excellent text classification capability. It not only adapts well to different feature sets, but also has better classification ability. This fully proves that CCJU-TC is an efficient classifier.

6. Conclusion

In present research on text mining (TM), traditional data mining methods still dominate. However, as research deepens, they face more severe challenges. Difficulties such as the huge dimensionality and sparsity of text objects, the high complexity of the algorithms, and the requirement of prior knowledge have seriously hampered the development of TM.

Through in-depth analysis, these problems in the TM process are due to the uncertainty of natural language. The uncertainty of natural language (especially text) comes, in essence, from the uncertainty of human thinking. Although this uncertainty strengthens people's spatial and cognitive understanding, it brings a series of problems to TM. Therefore, from the viewpoint of reducing the complexity of natural language, if we can carry out advanced innovation on the basis of making full use of existing technologies and find a novel uncertainty artificial intelligence approach for TM, it will greatly facilitate the rapid development of TM.

The CCJU-TC classifier is a novel attempt to apply an uncertain knowledge acquisition tool (the cloud model) to text classification in TM. The experimental results show that it is an excellent solution. In the future, more data mining methods based on the cloud model will be developed.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant no. 61309014; the Natural Science Foundation Project of CQ CSTC under Grant no. cstc2013jcyjA40009, no. cstc2013jcyjA40063; and the Natural Science Foundation Project of CQUPT under Grant no. A2012-96.