Abstract

With the era of Big Data and the rapid growth of textual data, text classification has become one of the key techniques for handling and organizing text data. Feature selection is the most important step in automatic text categorization. In order to choose a subset of the available features by eliminating features that are unnecessary for the classification task, a novel text categorization algorithm called chaos genetic feature selection optimization (CGFSO) is proposed. The proposed algorithm selects near-optimal feature subsets, grounded in both empirical and theoretical work in machine learning, and provides a general framework for text categorization. Experimental results show that the proposed algorithm simplifies the feature selection process effectively and obtains higher classification accuracy with a smaller feature set.

1. Introduction

With the era of Big Data and the rapid growth of textual data, feature selection (FS) has become very important for organizing the data. Feature selection, also called attribute selection, is a key step in automatic text categorization and machine learning systems, which automatically assign documents to a set of predefined classes based on their textual content. Feature selection is often used to deal with a high dimensional feature space; its main objective is to simplify a dataset by reducing its dimensionality and identifying the relevant underlying features. In practical machine learning applications, the number of features, many of which are irrelevant or interdependent, is usually very large. This easily leads to the following consequences. Firstly, feature analysis and model training consume more time as the number of features increases. Secondly, a growing number of features easily leads to the "curse of dimensionality" and makes the model more complicated. Feature selection has been widely applied in various fields including text categorization [1], signal processing [2], data mining [3], machine learning [4], neural networks, and pattern recognition [5].

Given a feature set of size $n$, there exist $2^n$ possible feature subsets, and each feature subset can be represented by a binary vector of dimension $n$. The FS problem is to find a minimal feature subset of size $m$ ($m < n$) while retaining a suitably high accuracy in representing the original features [6]. As redundant features affect the classification accuracy of the system and increase the computation time, we should eliminate features that carry little information and ignore redundant features that are strongly correlated. Feature selection can effectively deal with this problem because of its flexibility, computational efficiency, and capacity to handle high dimensional data [7].

In order to choose a subset of the available features by eliminating features that are unnecessary for the categorization task, this paper makes use of an FS method, together with machine learning knowledge, and proposes a novel heuristic feature selection algorithm called chaos genetic feature selection optimization (CGFSO). Chaos is a universal phenomenon in many nonlinear systems that exhibits sensitive dependence on initial conditions and contains infinitely many unstable periodic motions [8]. The chaotic optimization algorithm (COA) first converts the optimized variables into chaotic variables, examines points in the entire solution space according to the dynamics of the chaotic variables, and accepts the better point as the current optimum [9]. It then takes the current optimum as the kernel and continues searching for the optimum by adding a perturbation until the requirements are met. The CGFSO algorithm is applied to text features under the bag-of-words model, in which a document is treated as a set of words or terms and each position in the input feature vector corresponds to a given term in the original document. The proposed algorithm selects near-optimal feature subsets, grounded in both empirical and theoretical work in machine learning, and provides a general framework for text categorization. Compared with existing algorithms, the proposed algorithm simplifies the feature selection process effectively and obtains higher classification accuracy with a smaller feature set.
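To make the bag-of-words representation concrete, the following minimal sketch builds term-count vectors for two toy documents using scikit-learn; the documents and vocabulary are illustrative and not taken from the paper's corpus.

```python
# A minimal bag-of-words sketch: each position in the feature vector
# corresponds to one term, as in the model used by CGFSO.
# Assumes scikit-learn is installed; the documents are made-up examples.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks rise as markets rally",
    "team wins the championship game",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # one row per document

print(vectorizer.get_feature_names_out())   # the vocabulary (feature set)
print(X.toarray())                          # term-count feature vectors
```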

The rest of this paper is organized as follows. Section 2 reviews prior research on feature selection. Section 3 proposes the CGFSO algorithm. Section 4 presents the experimental results, and finally conclusions and future work are given in Section 5.

2. Related Work

In this section, we focus our discussion on prior research on feature selection. Many scholars have made great contributions to feature selection in both empirical and theoretical work, which is necessary for solving the text categorization problem.

In order to achieve minimum classification error, Kanan and Faez [10] presented an improved ant colony optimization algorithm for feature selection in face recognition. Their algorithm can select the optimal feature subset in terms of the shortest feature length and the best classifier performance. Cao et al. [11] further developed this approach by learning feature weights in kernel spaces; feature selection in their method is done as a data preprocessing step, independent of classifier construction. To address the problem of jointly learning SVM (support vector machine) parameters and kernels, Zhen et al. [12] proposed a method for choosing SVM parameters, including kernel parameters, by minimizing the leave-one-out cross-validation error.

The genetic algorithm (GA) is a parallel heuristic intelligent method and a popular technique for nonlinear optimization problems. Owing to its advantages, GA has been widely used as an effective tool for FS in text categorization. Zhu et al. [13] proposed a combined feature subset selection method called RICGA (ReliefF immune clonal genetic algorithm), based on the ReliefF algorithm, the immune clonal algorithm, and GA. In the RICGA method, ReliefF is first used to remove irrelevant features, and a modified genetic algorithm is then applied to obtain the final feature subset. In order to extract the feature set, Kim et al. [14] applied the genetic algorithm to the feature selection problem and proposed a novel genetic algorithm feature selection (GAFS). Muni et al. [15] presented an online feature selection algorithm using genetic programming (GP); the proposed GP method simultaneously selects a good subset of features and constructs a classifier using the selected features. Waqas et al. [16] focused on multiobjective genetic algorithms for feature subset selection; their research showed that independent subsets of features yield excellent accuracy. AlSukker et al. [17] presented a modified genetic algorithm based on enhanced population diversity, parent selection, and improved genetic operators; practical results showed the significance of the proposed GA variant in comparison to many other algorithms from the literature on different datasets. Mahrooghy et al. [18] employed a filter-based feature selection genetic algorithm (FFSGA) to find an optimal set of features from which redundant and irrelevant features are removed, using an entropy index fitness function to evaluate the feature subsets. Their results showed that the feature selection technique not only improves the equitable threat score by almost 7% at some threshold values for the winter season, but also greatly reduces the dimensionality.

3. Application of CGFSO Algorithm

In this section, we focus our discussion on algorithms that explicitly attempt to select an optimal feature subset. Finding an optimal feature subset is usually difficult and has been proven to be NP-hard. Therefore, many heuristic algorithms have been used to perform feature selection during training, including genetic algorithms, neural networks, and simulated annealing. In order to avoid the combinatorial search required to find an optimal subset of $m$ features, one of the most popular feature selection methods is the genetic algorithm, which generally provides a suboptimal solution.

Although GA has a powerful global search capability, it is liable to premature convergence in practical applications and has low search efficiency in the late evolutionary period [19]. Chaotic motion can cover every state in a certain range without repetition, according to its own dynamics [20]. COA shows promising performance on nonlinear function optimization. However, the local search capability of COA is poor, since its heuristic and stochastic properties often cause it to get stuck in local optima. Thus, this paper takes advantage of the merits of both GA and COA and proposes a novel FS algorithm for text categorization, namely, CGFSO. The experimental results show that the proposed CGFSO finds subsets that yield the best accuracy, while producing compact feature subsets and running faster than other traditional methods.

3.1. Chaotic Optimization Algorithm

COA is a novel approach to global optimization that has attracted widespread attention in recent years. In COA, the well-known logistic map is normally described as follows:

$$x_{k+1} = \mu x_k (1 - x_k), \quad k = 0, 1, 2, \ldots, \tag{1}$$

where $\mu$ is a control parameter, which cannot be bigger than 4, and $x_k \in (0, 1)$ is a variable. It is easy to see that (1) is a deterministic dynamic system without any stochastic disturbance. When $\mu = 4$, the system above is completely in a chaotic state.

The basic process of the chaos optimization algorithm includes two major steps. Firstly, define a chaotic sequence generator based on the logistic map, generate a sequence of chaotic points, and map it to a sequence of design points in the original design space. COA has a very sensitive dependence on its initial condition and parameter; chaotic sequences have been adopted instead of random sequences and have shown fairly good results in many applications. The objective function is then calculated at the generated design points, and the point with the minimum objective function value is chosen as the current optimum. Secondly, the current optimum is assumed to be close to the global optimum after a certain number of iterations, and it is treated as the reference point to which a small chaotic perturbation is added, exploring the descent direction along each axis in turn. The two steps above are repeated until some specified convergence criterion is satisfied, at which point the global optimum is obtained. However, further numerical simulations have shown that the method is effective only in a small design space.
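As an illustration of the two-step process above, here is a minimal one-dimensional COA sketch; the initial chaotic state, scan lengths, and perturbation radius are illustrative assumptions, not settings from the paper.

```python
import numpy as np

def coa_minimize(f, lo, hi, n_scan=2000, n_refine=500, radius=0.01, mu=4.0):
    """Two-phase COA sketch: a global scan driven by the logistic map,
    then a local search with small chaotic perturbations."""
    x = 0.2345  # initial state; avoid points with periodic orbits (0, 0.25, 0.5, 0.75, 1)
    best_x, best_f = None, np.inf
    for _ in range(n_scan):                 # step 1: global scan over [lo, hi]
        x = mu * x * (1.0 - x)              # logistic map, fully chaotic at mu = 4
        cand = lo + (hi - lo) * x
        fc = f(cand)
        if fc < best_f:
            best_x, best_f = cand, fc
    for _ in range(n_refine):               # step 2: chaotic perturbation around the optimum
        x = mu * x * (1.0 - x)
        cand = np.clip(best_x + radius * (hi - lo) * (x - 0.5), lo, hi)
        fc = f(cand)
        if fc < best_f:
            best_x, best_f = cand, fc
    return best_x, best_f

# Example: a multimodal test function on [-1, 2].
print(coa_minimize(lambda v: v * np.sin(10 * np.pi * v) + 1.0, -1.0, 2.0))
```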

3.2. Chaos Genetic Feature Selection Optimization

Generally, a text categorization system consists of several essential parts, including feature extraction and feature selection [21, 22]. In the feature selection stage, the proposed algorithm can be used to obtain a feature subset that increases the accuracy and simplicity of the classification system and reduces the learning effort. CGFSO is used to explore the space of all subsets of a given feature set. The performance of a selected feature subset is measured by invoking an evaluation function on the corresponding reduced feature space and measuring the resulting classification performance. Firstly, a feature subset is generated from the given feature set; then the evaluation function is used to evaluate it. The evaluation result is compared with the stopping criterion: if the result is better than the stopping criterion, the CGFSO algorithm stops automatically; otherwise, CGFSO continues to produce the next feature subset, as sketched below. The selected feature subset is generally also verified for validity.
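The generate-evaluate-stop wrapper described above can be sketched as follows; `generate_subset` and `evaluate` are hypothetical placeholders standing in for the CGFSO subset-generation operators and the classifier-based evaluation function.

```python
def wrapper_feature_selection(generate_subset, evaluate, stop_threshold, max_iters=100):
    """Generic wrapper loop: propose a subset, score it, and stop when the
    evaluation beats the stopping criterion (or the budget runs out)."""
    best_subset, best_score = None, float("-inf")
    for _ in range(max_iters):
        subset = generate_subset(best_subset)   # next candidate feature subset
        score = evaluate(subset)                # e.g. classifier accuracy on it
        if score > best_score:
            best_subset, best_score = subset, score
        if best_score >= stop_threshold:        # stopping criterion satisfied
            break
    return best_subset, best_score
```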

In the CGFSO algorithm, each individual in the population represents a candidate solution to the feature selection problem [23]. The first thing to consider is the coding scheme, and we set the number of features as the length of the chromosomes. An individual (chromosome) is represented as a binary string $x = (x_1, x_2, \ldots, x_n)$, where each gene $x_i$ corresponds to the $i$th feature. If $x_i = 1$, the corresponding feature is selected; if $x_i = 0$, the $i$th feature is ignored.
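A minimal sketch of this encoding: a chromosome is a 0/1 vector whose set bits pick out the columns of a document-term matrix (here a toy, randomly generated one).

```python
import numpy as np

# Chromosome encoding sketch: chromosome length equals the number of
# features; gene i = 1 means feature i is selected, 0 means it is ignored.
rng = np.random.default_rng(0)
n_features = 10
chromosome = rng.integers(0, 2, size=n_features)   # e.g. [1 0 1 1 0 ...]

selected = np.flatnonzero(chromosome)              # indices of selected features
X = rng.random((5, n_features))                    # toy document-term matrix (5 docs)
X_reduced = X[:, selected]                         # reduced feature space
print(chromosome, selected, X_reduced.shape)
```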

The solution quality in terms of classification accuracy is evaluated by classifying the training data sets using the selected features. Classification accuracy and feature cost are the two key factors used to design the fitness function. The test accuracy measures the number of examples that are correctly classified. Thus, an individual with high classification accuracy and low total feature cost receives a high fitness value, and an individual with a high fitness value has a high probability of being selected into the next generation. A solution achieving higher accuracy with fewer features therefore gets a greater fitness value. Accordingly, the fitness function can be defined as follows:

$$\mathrm{fitness}(x) = \frac{2PR}{P + R} + \lambda \cdot \frac{C_{\max} - \sum_{i=1}^{n} c_i x_i}{C_{\max}},$$

where $P$ is the test precision ratio, $R$ is the test recall ratio, $\sum_{i=1}^{n} c_i x_i$ is the sum of the measurement costs of the feature subset represented by $x$, and $\lambda$ is the adjustment coefficient. $C_{\max}$ is an upper bound on the costs of candidate solutions; in this case, $C_{\max}$ is simply the sum of the costs associated with all of the features. $x_i = 1$ indicates that feature $i$ is selected; otherwise, $x_i = 0$ indicates that feature $i$ is ignored.
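The following sketch implements the fitness function as reconstructed above; note that the F1-style combination of precision and recall and the unit feature costs are assumptions made for illustration, not verified settings from the paper.

```python
import numpy as np

def fitness(chromosome, precision, recall, costs, lam=0.1):
    """Higher accuracy and a cheaper (smaller) feature subset give a
    larger fitness value, as the paper requires."""
    c_max = costs.sum()                       # upper bound on subset cost
    c_x = costs[chromosome == 1].sum()        # cost of the selected subset
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return f1 + lam * (c_max - c_x) / c_max

costs = np.ones(8)                            # unit cost per feature (assumption)
chrom = np.array([1, 0, 1, 1, 0, 0, 1, 0])
print(fitness(chrom, precision=0.91, recall=0.88, costs=costs))
```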

The main steps of the CGFSO algorithm can be summarized as follows.

Step 1. Give the population size $N$, the crossover probability $P_c$, the mutation probability $P_m$, and the termination generation $T$. Then randomly initialize the initial population $P(0)$, and set the evolution generation $t = 0$.

Step 2. Evaluate the fitness of the population $P(t)$ according to the objective function.

Step 3. Select the individuals with larger fitness into the next generation population $P_1(t)$.

Step 4. Perform the crossover operation on $P_1(t)$ to generate $P_2(t)$.

Step 5. Perform the logistic chaotic mutation on the population $P_2(t)$ to generate the population $P_3(t)$.

Step 6. Compute each individual's fitness after the logistic chaotic mutation. If the fitness value after mutation is larger than the old one, substitute the old individual with the mutated one, and obtain the next generation population $P(t+1)$.

Step 7. Set $t = t + 1$; if the stopping conditions are satisfied, the algorithm ends and outputs the best feature subset; otherwise, go back to Step 2 until the maximum number of evolution iterations is reached.
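Putting Steps 1-7 together, a compact sketch of the CGFSO loop might look as follows. Tournament selection and one-point crossover are assumed concrete choices for Steps 3 and 4, which the paper does not fully specify, and `fitness` is a user-supplied function such as the one sketched above.

```python
import numpy as np

def cgfso(fitness, n_features, pop_size=100, T=500, pc=0.7, pm=0.2, mu=4.0, seed=0):
    """Sketch of Steps 1-7; selection/crossover operators are assumptions."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, (pop_size, n_features))     # Step 1: P(0), t = 0
    chaos = 0.2345                                       # chaotic state driving mutation
    for t in range(T):                                   # Steps 2-7
        fit = np.array([fitness(ind) for ind in pop])    # Step 2: evaluate fitness
        i, j = rng.integers(0, pop_size, (2, pop_size))  # Step 3: tournament selection
        pop = pop[np.where(fit[i] > fit[j], i, j)]
        for k in range(0, pop_size - 1, 2):              # Step 4: one-point crossover
            if rng.random() < pc:
                cut = int(rng.integers(1, n_features))
                pop[[k, k + 1], cut:] = pop[[k + 1, k], cut:]
        for k in range(pop_size):                        # Steps 5-6: chaotic mutation,
            mutant = pop[k].copy()                       # keep only improving mutants
            for g in range(n_features):
                chaos = mu * chaos * (1.0 - chaos)       # logistic map, see (1)
                if chaos < pm:
                    mutant[g] ^= 1
            if fitness(mutant) > fitness(pop[k]):
                pop[k] = mutant
    fit = np.array([fitness(ind) for ind in pop])        # Step 7 reached: T generations done
    return pop[int(fit.argmax())]                        # best feature subset found
```

In practice, `fitness` would wrap training and testing a classifier on the feature subset encoded by each individual.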

4. Experimental Results

In this section, a series of simulation experiments was conducted to show the effectiveness and superiority of the CGFSO algorithm for text categorization problems. In order to provide an overview of the baseline accuracy of the classifiers, the Reuters collection was used in our experiments. We use the Reuters-21578 corpus, with 5213 documents in the training set and 2016 documents in the test set, and adopt the top ten classes. The experimental platform is a Dell computer with a Xeon 3.06 GHz CPU (24P8122) and 2 GB of RAM. We implemented the proposed CGFSO algorithm and two other FS methods, GA and SVM. The parameters of CGFSO and GA are set as follows: the population size is 100, the maximum number of generations is 500, the crossover probability is 0.7, and the mutation probability is 0.2. Since the experimental results depend on the population randomly generated by the CGFSO and GA algorithms, we performed 20 simulations on each data set.

4.1. Precision and Recall

In most text categorization tasks, the performance of the feature selection technique is particularly important. Several measures, such as precision and recall, are often used to evaluate the performance of a feature selection algorithm. Precision is defined as the ratio of correctly predicted topic cases to the total predicted topic cases. Recall is defined as the proportion of correctly predicted topic cases to the total actual topic cases. Precision and recall are defined formally as follows.

Definition 1. Assume that $TP_i$ represents the number of test documents correctly classified under the $i$th category ($c_i$) and $FP_i$ denotes the number of test documents incorrectly classified under $c_i$; then the classification precision can be formulated as

$$P_i = \frac{TP_i}{TP_i + FP_i}.$$

Definition 2. Assume that $TP_i$ represents the number of test documents correctly classified under the $i$th category ($c_i$), and $FN_i$ is the number of test documents of $c_i$ incorrectly classified under other categories; these quantities may be estimated from the contingency table for $c_i$. Then the classification recall can be formulated as

$$R_i = \frac{TP_i}{TP_i + FN_i}.$$
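For concreteness, Definitions 1 and 2 translate directly into code; the counts below are made-up examples.

```python
# Per-category precision and recall from the contingency counts in
# Definitions 1 and 2 (TP_i, FP_i, FN_i); the numbers are illustrative.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(tp=45, fp=5, fn=3))   # -> (0.9, 0.9375)
```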

4.2. Simulation Experiment

To analyze the performance of the feature selection algorithms, we now present the results obtained using the proposed approach. Figures 1-5 show the performance of our proposed method against GA and SVM for the ten most frequent categories with respect to classification accuracy. Figure 1 shows the precision of GA, SVM, and CGFSO for the different categories. Figure 2 shows the recall of GA, SVM, and CGFSO. Figure 3 shows the average fitness of the solutions obtained by GA, SVM, and CGFSO. Figure 4 shows the precision of GA, SVM, and CGFSO with different numbers of features. Figure 5 shows the recall of GA, SVM, and CGFSO with different numbers of features.

From the experimental results in Figure 1, it can be seen that the precision of CGFSO is the highest in most cases, with a maximum value close to 96%, whereas the precision of GA is the lowest, with a minimum value close to 83.1%. From the experimental results in Figure 2, it can be seen that CGFSO is significantly better than the other two algorithms in terms of recall: the maximum recall of CGFSO is close to 98%, whereas the recall of GA is the lowest, with a minimum value close to 78.5%. From the experimental results in Figures 1 and 2, we can easily see that the CGFSO algorithm obtains better performance with a smaller feature set than the other two algorithms, especially in terms of recall.

From the experimental results in Figure 3, the average fitness of CGFSO is the highest in most cases, with a maximum value close to 0.78, while the performance of SVM and GA is relatively close. This is because CGFSO effectively combines the advantages of the chaotic optimization algorithm and the genetic algorithm and effectively expands the range of feasible solutions. As the number of features gradually increases, the precision and recall of the three feature selection algorithms increase accordingly. As can be seen from Figures 4 and 5, the overall performance of CGFSO is significantly superior to that of GA and SVM. It is worth noting that our approach requires the smallest number of support vectors compared with the other feature selection approaches.

It can be seen from the experimental results that the CGFSO learning process effectively and efficiently reduces the complexity of the system in the feature selection stage.

5. Conclusions

With the era of Big Data and the rapid growth of textual data, text classification has become an essential way to process and organize text data. To this end, we designed a new text classification algorithm based on the genetic algorithm and the chaotic optimization algorithm. The experimental results show that CGFSO yields the best results among the three methods compared. The experiments also demonstrated that CGFSO yields better accuracy even on a large data set, since it achieves better performance with a smaller number of features. In the future, we will design a new heuristic feature selection algorithm, apply it to the text classification field, and conduct experiments with other kinds of datasets.

Acknowledgment

This work is partially supported by the National Natural Science Foundation of China under Grant nos. 61370226 and 61272546.