Abstract

Filtering feature-selection algorithms are an important approach to dimensionality reduction in text categorization. Most filtering feature-selection algorithms evaluate the significance of a feature for a category under the assumption of a balanced dataset and do not take the imbalance of the dataset into account. In this paper, a new scheme is proposed that weakens the adverse effect caused by the imbalance factor in the corpus. We evaluated the improved versions of nine well-known feature-selection methods (Information Gain, Chi-square statistic, Document Frequency, Orthogonal Centroid Feature Selection, DIA association factor, Comprehensive Measurement Feature Selection, Deviation from Poisson Feature Selection, improved Gini index, and Mutual Information) using naïve Bayes and support vector machines on three benchmark document collections (20-Newsgroups, Reuters-21578, and WebKB). The experimental results show that the improved scheme significantly enhances the performance of the feature-selection methods.

1. Introduction

Text categorization [1], which assigns predefined categories to unlabeled text documents [2], has become a very efficient method for managing the vast volume of digital documents available on the Internet. In recent years, many sophisticated machine learning algorithms, such as the support vector machine (SVM) [3], naïve Bayes (NB) [4], and K-nearest neighbor (KNN) [2, 5], have been extensively applied to text categorization.

High dimensionality is the major characteristic of text categorization: the number of features can easily reach the order of tens of thousands even for a moderately sized dataset [6, 7]. Most of these features are irrelevant and lead to poor classifier performance [8]. Therefore, dimensionality reduction, which attempts to reduce the size of the feature space without sacrificing categorization performance, has become a critical step in text categorization [1, 9]. Feature selection [10], which selects a subset of the original feature space according to an evaluation criterion, is the most commonly used dimensionality reduction method in text categorization [11]. Feature-selection methods can be divided into three classes [12]: the embedded approach, in which the feature-selection process is embedded in the induction algorithm; the wrapper approach, in which an evaluation function selects the feature subset as a wrapper around the classifier [11, 13, 14]; and the filtering approach, in which the evaluation function used to select the feature subset is independent of the classifier [14]. In this paper, we focus on the filtering approach. Many efficient and effective filtering feature-selection methods have been applied to text categorization, such as Information Gain (IG) [7], Chi-square statistic (CHI) [7, 15], Mutual Information (MI) [16], Document Frequency (DF) [7], improved Gini index (GINI) [17], DIA association factor (DIA) [1, 6], Comprehensive Measurement Feature Selection (CMFS) [11], Orthogonal Centroid Feature Selection (OCFS) [18], and Deviation from Poisson Feature Selection (DFPFS) [15].

So far, almost all feature-selection algorithms evaluate the significance of a term as if the dataset were balanced, without considering the influence of the imbalance factor. In fact, most data in the real world is imbalanced. There are two reasons why imbalanced data exist. One is the intrinsic nature of the events: rare events yield fewer samples. The other is the expense of collecting samples and legal or privacy constraints [19]. The imbalance factors in a dataset degrade the performance of learning algorithms [20]. In recent years, the imbalanced learning problem has received broad attention from numerous experts and scholars [21–23]. In this paper, an improved scheme for existing feature-selection methods is proposed, which weakens the influence of the imbalance factors occurring in the dataset. In our experiments, we applied the improved scheme to NB and SVM using three benchmark corpora. We show the effectiveness of our approach by demonstrating that it significantly outperforms nine existing feature-selection algorithms.

The rest of this paper is organized as follows. Section 2 presents the nine existing feature-selection algorithms used in the paper. Section 3 describes the basic idea and implementation of the improved scheme for these nine feature-selection methods. The experimental details are given in Section 4, and the experimental results are presented in Section 5. Section 6 presents the statistical analysis and discussion. Our conclusions and directions for future work are provided in the last section.

2. Existing Feature-Selection Algorithms

2.1. Information Gain (IG)

Information Gain [24] is a criterion commonly used in machine learning [7]. The Information Gain of a feature $t_i$ over a class $C_j$ is the reduction in uncertainty about the value of $C_j$ when the value of $t_i$ is known. It can be calculated as follows:

$$IG(t_i, C_j) = P(t_i, C_j)\log\frac{P(t_i, C_j)}{P(t_i)P(C_j)} + P(\bar{t}_i, C_j)\log\frac{P(\bar{t}_i, C_j)}{P(\bar{t}_i)P(C_j)}, \quad (1)$$

where $P(C_j)$ is the fraction of documents in category $C_j$ over the total number of documents, $P(t_i, C_j)$ is the fraction of documents in category $C_j$ that contain the word $t_i$ over the total number of documents, $P(t_i)$ is the fraction of documents containing the term $t_i$ over the total number of documents, and $\bar{t}_i$ denotes the absence of $t_i$ [25].
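As a concrete illustration (not the authors' code), the Python sketch below estimates the probabilities in (1) from document counts; the count names A, B, C, D and the helper name local_ig are our own.

```python
import math

def local_ig(A, B, C, D):
    """Category-specific Information Gain of term t_i for category C_j.

    A: documents of C_j containing t_i      B: documents outside C_j containing t_i
    C: documents of C_j without t_i         D: documents outside C_j without t_i
    """
    N = A + B + C + D
    p_c = (A + C) / N            # P(C_j)
    p_t = (A + B) / N            # P(t_i)
    p_tc = A / N                 # P(t_i, C_j)
    p_ntc = C / N                # P(not t_i, C_j)
    score = 0.0
    if p_tc > 0:
        score += p_tc * math.log(p_tc / (p_t * p_c))
    if p_ntc > 0 and p_t < 1:
        score += p_ntc * math.log(p_ntc / ((1 - p_t) * p_c))
    return score

print(local_ig(A=30, B=5, C=10, D=155))
```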

2.2. Chi-Square (CHI)

The Chi-square test [7] is applied in mathematical statistics to evaluate the independence of two variables. In this paper, the independence of the feature $t_i$ and the category $C_j$ is measured by Chi-square. The greater the value of CHI, the more category information the feature contains. The Chi-square statistic is defined as follows:

$$\chi^2(t_i, C_j) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}, \quad (2)$$

where $N$ is the number of documents in the training set; $A$ is the frequency with which feature $t_i$ occurs in category $C_j$; $B$ is the frequency with which feature $t_i$ occurs in all categories except $C_j$; $C$ is the frequency with which category $C_j$ occurs without containing feature $t_i$; and $D$ is the number of times neither $C_j$ nor $t_i$ occurs.
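For reference, a minimal sketch of (2) using the same hypothetical counts A, B, C, D as above; the function name is ours.

```python
def chi_square(A, B, C, D):
    """Chi-square statistic between term t_i and category C_j (counts as defined above)."""
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    return numerator / denominator if denominator else 0.0

print(chi_square(A=30, B=5, C=10, D=155))
```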

2.3. Mutual Information (MI)

Mutual Information is a concept from information theory that measures the dependence between random variables and can be applied to measure the information content of a feature [26]. In feature selection, Mutual Information measures the dependence between the feature $t_i$ and the category $C_j$. The higher the Mutual Information of a feature with a category, the more information about that category the feature contains:

$$MI(t_i, C_j) = \log\frac{P(t_i \mid C_j)}{P(t_i)}, \quad (3)$$

where $P(t_i \mid C_j)$ is the probability that feature $t_i$ occurs in category $C_j$ and $P(t_i)$ is defined as in (1).
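A corresponding sketch of (3), again assuming the document counts A, B, C, D introduced above.

```python
import math

def mutual_information(A, B, C, D):
    """MI(t_i, C_j) = log( P(t_i | C_j) / P(t_i) ), estimated from document counts."""
    N = A + B + C + D
    p_t_given_c = A / (A + C)    # P(t_i | C_j)
    p_t = (A + B) / N            # P(t_i)
    return math.log(p_t_given_c / p_t) if p_t_given_c > 0 else float("-inf")

print(mutual_information(A=30, B=5, C=10, D=155))
```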

2.4. Document Frequency (DF)

Document Frequency counts the number of documents in which a feature occurs. The basic idea is that rare terms are not useful for category prediction and may even degrade global performance [7]. The larger the number of documents in a category that contain the feature, the more predictive information for that category the feature possesses [1]. The Document Frequency of a term $t_i$ with respect to category $C_j$ is calculated as follows:

$$DF(t_i, C_j) = \left|\left\{\, d \mid d \in C_j \ \text{and}\ t_i \in d \,\right\}\right|, \quad (4)$$

that is, the number of documents in category $C_j$ that contain the term $t_i$.

2.5. Improved Gini Index (GINI)

The Gini index was originally developed to find the best split in decision tree induction [15]. In order to use it in text categorization with a multiclass setting, the original Gini index was improved by Shang et al. [27]. The improved Gini index measures the purity of a feature $t_i$ with respect to the categories; the bigger the value of the purity, the better the feature. The formula of the improved Gini index is defined as follows:

$$GINI(t_i) = \sum_{j=1}^{|C|} P(t_i \mid C_j)^2\, P(C_j \mid t_i)^2, \quad (5)$$

where $P(t_i \mid C_j)$ is the probability that the feature $t_i$ occurs in category $C_j$, $P(C_j \mid t_i)$ is the conditional probability that a document belongs to category $C_j$ when the feature $t_i$ occurs, and $|C|$ is the number of categories.
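The following sketch illustrates (5) from per-category document counts; the argument names are hypothetical, and the estimator is the simple unsmoothed one implied by the definitions above.

```python
def gini_index(doc_counts_with_term, doc_counts_per_class):
    """Improved Gini index: sum_j P(t|C_j)^2 * P(C_j|t)^2.

    doc_counts_with_term[j]: documents of class j containing the term
    doc_counts_per_class[j]: total documents of class j
    """
    total_with_term = sum(doc_counts_with_term)
    score = 0.0
    for a_j, n_j in zip(doc_counts_with_term, doc_counts_per_class):
        p_t_given_c = a_j / n_j                                          # P(t_i | C_j)
        p_c_given_t = a_j / total_with_term if total_with_term else 0.0  # P(C_j | t_i)
        score += (p_t_given_c ** 2) * (p_c_given_t ** 2)
    return score

print(gini_index([30, 5, 2], [120, 90, 60]))
```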

2.6. DIA Association Factor (DIA)

The DIA association factor [1, 28] evaluates the conditional probability that a document is assigned to category $C_j$ when it contains the term $t_i$; it thus measures the significance of the term $t_i$ for the category $C_j$. The bigger the DIA of the term $t_i$ with respect to category $C_j$, the more significant the term is for that category. The DIA association factor is defined as

$$DIA(t_i, C_j) = P(C_j \mid t_i), \quad (6)$$

where $P(C_j \mid t_i)$ is the conditional probability that a document belongs to category $C_j$ when the feature $t_i$ occurs.

2.7. Comprehensive Measurement Feature Selection (CMFS)

CMFS [11] is a feature-selection algorithm proposed in our previous research work, in which the significance of a term is measured comprehensively both inter-category and intra-category. The experimental results in [11] show that CMFS can significantly improve the performance of the classifier:

$$CMFS(t_i, C_j) = P(t_i \mid C_j)\, P(C_j \mid t_i), \quad (7)$$

where $P(t_i \mid C_j)$ and $P(C_j \mid t_i)$ are defined as in (5) and (6).
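A minimal sketch of (7) under the same unsmoothed probability estimates; the argument names are ours, and the actual implementation in [11] may differ (e.g., in smoothing).

```python
def cmfs(a_j, n_j, a_total):
    """CMFS(t_i, C_j) = P(t_i | C_j) * P(C_j | t_i).

    a_j: documents of C_j containing t_i
    n_j: documents of C_j, a_total: documents containing t_i over all categories
    """
    return (a_j / n_j) * (a_j / a_total)

print(cmfs(a_j=30, n_j=120, a_total=37))
```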

2.8. Orthogonal Centroid Feature Selection (OCFS)

Orthogonal Centroid Feature Selection selects features optimally according to the objective function implied by the Orthogonal Centroid algorithm [17, 18]. The centroid of each category and the centroid of the entire training set are used to calculate the score of a term:

$$OCFS(t_i) = \sum_{j=1}^{|C|} \frac{n_j}{n}\left(m_j^i - m^i\right)^2, \quad (8)$$

where $n_j$ is the number of documents in category $C_j$, $n$ is the number of documents in the training set, $m_j^i$ is the $i$th element of the centroid vector of class $C_j$, $m^i$ is the $i$th element of the centroid vector of the entire training set, and $|C|$ is the number of categories in the corpus.
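The sketch below computes (8) for every term from a document-term matrix; it is an illustration under our own naming (ocfs_scores), not the implementation used in [18].

```python
import numpy as np

def ocfs_scores(X, y):
    """OCFS score for every term i: sum_j (n_j / n) * (m_j^i - m^i)^2.

    X: (documents x terms) feature matrix, y: class label per document.
    """
    n = X.shape[0]
    overall_centroid = X.mean(axis=0)                 # m^i for every term i
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        X_c = X[y == c]
        class_centroid = X_c.mean(axis=0)             # m_j^i
        scores += (X_c.shape[0] / n) * (class_centroid - overall_centroid) ** 2
    return scores

X = np.array([[2., 0., 1.], [1., 1., 0.], [0., 3., 1.], [0., 2., 2.]])
y = np.array([0, 0, 1, 1])
print(ocfs_scores(X, y))
```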

2.9. Deviations from Poisson Feature Selection (DFPFS)

The Poisson distribution has been successfully used to select effective query words in information retrieval. DFPFS is derived from the Poisson distribution and measures the degree to which a feature deviates from it [15]. The farther a feature departs from the Poisson distribution, the more effective it is; conversely, a feature whose occurrence is well predicted by the Poisson distribution is a poor one:

$$DFPFS(t_i) = \frac{\left(n_+ - N\left(1 - e^{-\lambda_i}\right)\right)^2}{N\left(1 - e^{-\lambda_i}\right)} + \frac{\left(n_- - N e^{-\lambda_i}\right)^2}{N e^{-\lambda_i}}, \qquad \lambda_i = \frac{F_i}{N}, \quad (9)$$

where $F_i$ is the total frequency of term $t_i$ in all messages, $N$ is the number of messages, and $n_+$ and $n_-$ are the numbers of messages in which $t_i$ occurs and from which it is absent, respectively.

3. Algorithms

3.1. Motivation

Prior to feature selection for text categorization, a term-to-category matrix [11], in which the rows are the features and the columns are the category vectors, must be generated. In fact, the term-to-category matrix is the foundation of most feature-selection algorithms. These algorithms consider only the term frequency of a feature occurring in a given category and do not take the influence of the imbalance problem into consideration. Table 1 shows 5 features of the term-to-category matrix for the top 10 categories of the Reuters-21578 corpus. The number in parentheses indicates the number of documents in the corresponding category. It can be seen from Table 1 that categories C1 and C4 have significantly more training documents than the other categories, and, hence, the term frequency of many features appearing in these two categories is significantly higher than their frequency in other categories; for example, the total term frequency of the five features occurring in categories C1 and C4 is 3853 and 5700, respectively. However, a high raw term frequency of a feature in a majority category does not necessarily indicate that the feature is characteristic of that category, and a low raw frequency of a feature in a minority category does not necessarily mean that the feature is unimportant for that category. Based on this observation, a scheme that can eliminate the influence of the imbalance problem on feature-selection algorithms is proposed in this paper.

3.2. The Improved Scheme

Feature selection consists of three steps. The first step is to calculate the significance $x(t_i, C_j)$ of a particular feature $t_i$ for a given category $C_j$; $x(t_i, C_j)$ is the local significance of the feature. The second step is to combine the category-specific scores of each feature into one score $x(t_i)$; $x(t_i)$ is the global significance of the feature [7]. The last step is to rank all features in the training set according to their global significance and then select the top-ranked features as the new feature subset. To eliminate the negative influence of the imbalance problem, the local significance of a feature is calculated as

$$x'(t_i, C_j) = \frac{x(t_i, C_j)}{P(C_j)}, \quad (10)$$

where $P(C_j)$ is the probability of category $C_j$ in the entire training set. Two alternative ways can be used to calculate the value of $P(C_j)$: one uses the number of documents in category $C_j$, as in (11); the other uses the total frequency of all features occurring in category $C_j$, as in (12). In this paper, (12) is used:

$$P(C_j) = \frac{N_j}{N}, \quad (11)$$

$$P(C_j) = \frac{F_j}{\sum_{k=1}^{|C|} F_k}, \quad (12)$$

where $N$ is the total number of documents in the entire training set, $N_j$ is the number of documents in category $C_j$, $F_j$ is the total frequency of the features occurring in category $C_j$, and $|C|$ is the number of categories.
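To illustrate the scheme, the sketch below estimates P(C_j) from the total term frequency of each category, as in (12), and normalizes a matrix of local scores as in (10); the function names and toy numbers are ours.

```python
def category_priors_from_term_frequency(term_category_matrix):
    """P(C_j) estimated from the total term frequency of each category, cf. (12)."""
    column_sums = [sum(col) for col in zip(*term_category_matrix)]
    total = sum(column_sums)
    return [s / total for s in column_sums]

def improved_local_scores(local_scores, priors):
    """x'(t_i, C_j) = x(t_i, C_j) / P(C_j), cf. (10)."""
    return [[x / p for x, p in zip(row, priors)] for row in local_scores]

# toy term-to-category frequency matrix: rows = terms, columns = categories
tf = [[120, 4], [3, 9], [40, 2]]
priors = category_priors_from_term_frequency(tf)
print(improved_local_scores([[0.8, 0.2], [0.1, 0.6]], priors))
```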

There are two alternative ways to calculate the global significance of a feature from its local significance. One takes the average value of the feature over all categories as the global score, as shown in (13); the other takes the maximum value of the feature over all categories as the global score, as shown in (14). In order to weaken the influence of the imbalance problem, we substitute (15) and (16), which are built on the corrected local significance $x'(t_i, C_j)$ of (10), for (13) and (14) in this paper:

$$x_{avg}(t_i) = \sum_{j=1}^{|C|} P(C_j)\, x(t_i, C_j), \quad (13)$$

$$x_{max}(t_i) = \max_{1 \le j \le |C|} x(t_i, C_j), \quad (14)$$

$$x'_{avg}(t_i) = \sum_{j=1}^{|C|} P(C_j)\, x'(t_i, C_j), \quad (15)$$

$$x'_{max}(t_i) = \max_{1 \le j \le |C|} x'(t_i, C_j). \quad (16)$$
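A short sketch of the two combination strategies and their improved counterparts, assuming the local scores and priors from the previous sketch; again the names and toy values are illustrative only.

```python
def global_avg(local_scores, priors):
    """x_avg(t_i) = sum_j P(C_j) * x(t_i, C_j), cf. (13)."""
    return [sum(p * x for p, x in zip(priors, row)) for row in local_scores]

def global_max(local_scores):
    """x_max(t_i) = max_j x(t_i, C_j), cf. (14)."""
    return [max(row) for row in local_scores]

def improved_global_max(local_scores, priors):
    """x'_max(t_i) = max_j x(t_i, C_j) / P(C_j), cf. (16)."""
    return [max(x / p for x, p in zip(row, priors)) for row in local_scores]

priors = [0.9, 0.1]
scores = [[0.8, 0.2], [0.1, 0.6]]
print(global_avg(scores, priors), global_max(scores), improved_global_max(scores, priors))
```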

Based on the idea proposed in this paper, the feature-selection algorithms listed in Section 2 can be improved. Table 2 shows the improved formulas of the nine existing feature-selection algorithms of Section 2. Since a category-specific score for GINI is not provided in the literature on the GINI algorithm, the improved version of the local score for GINI is not listed in Table 2. A category-specific score for OCFS is not described in the literature either; however, it can be deduced from the formula of OCFS that $\frac{n_j}{n}\left(m_j^i - m^i\right)^2$ is the local significance of the feature $t_i$ for category $C_j$.

4. Experiment Setup

4.1. Classifiers

In this paper, both NB and SVM are used to compare each of the nine existing feature-selection methods with its improved version.

NB [4] is an excellent algorithm for text categorization. It is based on the assumption that the occurrence of a term in a document is independent of the other terms. There are two commonly used models for the Bayesian classifier: one is the multivariate Bernoulli model; the other is the multinomial model, which is used in this paper.

SVM, which was applied to spam categorization by Drucker et al. [3] and to text categorization by Joachims [29], is a highly efficient classifier for text categorization. In our study, the LIBSVM toolkit [30] is used and the options of LIBSVM are assigned their default values.

4.2. Datasets

Three benchmark datasets (Reuters-21578, WebKB, and 20-Newsgroups) were used to evaluate the performance of the proposed method in our experiments. In the preprocessing step, all words were converted to lower case, punctuation marks were removed, stop words were removed using stop lists, and no stemming was applied. The document frequency of a term was used in the text representation, and 10-fold cross-validation was adopted in this paper.

The 20-Newsgroups dataset is one of the standard corpora for text categorization. It contains 19997 Newsgroup postings, and all documents were assigned evenly to 20 different UseNet groups.

The Reuters-21578 dataset contains 21,578 stories from the Reuters newswire, which are nonuniformly divided into 135 categories. In this paper, the top 10 categories are used.

The WebKB, which is a collection of web pages from four different college web sites, contains 8282 web pages. All web pages are nonuniformly assigned to 7 categories. In this paper, four categories (“course,” “faculty,” “project,” and “student”) are used.

4.3. Performance Measures

Text categorization effectiveness is usually measured using the F1 measure, accuracy, and AUC [1, 31]. The F1 measure is a combined effectiveness measure determined by precision and recall. Precision is the conditional probability that the decision is correct when a random document is classified under a specific category. Recall is the conditional probability that the decision is taken when a random document ought to be classified under a specific category. The precision and recall for category $C_j$ are defined as

$$precision_j = \frac{TP_j}{TP_j + FP_j}, \qquad recall_j = \frac{TP_j}{TP_j + FN_j}, \quad (17)$$

where $TP_j$ is the number of documents that are correctly classified into category $C_j$, $FP_j$ is the number of documents that are misclassified into category $C_j$, and $FN_j$ is the number of documents that belong to category $C_j$ but are misclassified into other categories. For evaluating average performance across categories, microaveraging was used in our experiments. The microprecision and microrecall are obtained as

$$p_{micro} = \frac{\sum_{j=1}^{|C|} TP_j}{\sum_{j=1}^{|C|} \left(TP_j + FP_j\right)}, \qquad r_{micro} = \frac{\sum_{j=1}^{|C|} TP_j}{\sum_{j=1}^{|C|} \left(TP_j + FN_j\right)}, \quad (18)$$

where $|C|$ is the number of categories. The micro-F1 and accuracy are defined as

$$\text{micro-}F1 = \frac{2\, p_{micro}\, r_{micro}}{p_{micro} + r_{micro}}, \qquad accuracy = \frac{\sum_{j=1}^{|C|} TP_j}{N}, \quad (19)$$

where $N$ is the total number of test documents.
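For clarity, a small sketch that computes (17)-(19) from per-category counts in the single-label setting; the toy counts are hypothetical.

```python
def micro_f1_and_accuracy(tp, fp, fn, n_documents):
    """Micro-averaged precision, recall, F1 and accuracy from per-category counts.

    tp, fp, fn: lists of TP_j, FP_j, FN_j for each category; n_documents: total test documents.
    """
    p_micro = sum(tp) / (sum(tp) + sum(fp))
    r_micro = sum(tp) / (sum(tp) + sum(fn))
    f1_micro = 2 * p_micro * r_micro / (p_micro + r_micro)
    accuracy = sum(tp) / n_documents          # single-label setting
    return p_micro, r_micro, f1_micro, accuracy

print(micro_f1_and_accuracy(tp=[50, 30, 15], fp=[5, 10, 8], fn=[8, 6, 9], n_documents=118))
```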

The receiver operating characteristic (ROC) curve provides a powerful method to visualize classifier performance [22]. The area under the ROC curve (AUC) has become a widely used measure of the performance of supervised classification rules. However, the simple form of AUC is only applicable to the two-class case [32]. To calculate the multiclass AUC, the method proposed by Provost and Domingos [33] is used in our experiments. First, the ROC curve of each class versus all other classes [34] is generated and its AUC is measured. Second, the expected AUC is computed as the weighted average of all the per-class AUCs, with each class weighted by its prevalence.
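A sketch of this weighted one-versus-rest averaging, assuming scikit-learn's roc_auc_score for the per-class AUCs and prevalence weights; it is our reading of the procedure, not the authors' evaluation code, and the toy inputs are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def weighted_multiclass_auc(y_true, y_score, classes):
    """Expected AUC: weighted average of one-vs-rest AUCs (Provost & Domingos style).

    y_true: class label per document, y_score: (documents x classes) score matrix.
    """
    aucs, weights = [], []
    for k, c in enumerate(classes):
        binary_truth = (y_true == c).astype(int)          # class c versus all others
        aucs.append(roc_auc_score(binary_truth, y_score[:, k]))
        weights.append(binary_truth.mean())               # prevalence of class c
    return float(np.average(aucs, weights=weights))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_score = np.array([[.7, .2, .1], [.6, .3, .1], [.2, .7, .1],
                    [.3, .5, .2], [.1, .2, .7], [.2, .2, .6]])
print(weighted_multiclass_auc(y_true, y_score, classes=[0, 1, 2]))
```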

5. Results

5.1. The Experimental Results on 20-Newsgroups

Tables 3 and 4 show the performance comparison of the nine improved and existing feature-selection algorithms in terms of micro-F1 and AUC on 20-Newsgroups, respectively. It can be seen from Tables 3 and 4 that the performance of the improved versions of CHI, DIA, MI, DF, GINI, CMFS, and OCFS is significantly superior to that of the old versions. Although the micro-F1 and AUC of NB based on the improved version of IG are inferior to those of the existing version of IG, the performance of SVM based on the improved version of IG is superior to that of IG. In contrast, the performance of the improved version of Deviation from Poisson Feature Selection is inferior to that of the old version.

Figures 1 and 2 show the accuracy curves of NB and SVM based on the nine pairs of feature-selection methods on 20-Newsgroups, respectively. The x-axis in Figures 1 and 2 is the number of features selected by the different feature-selection algorithms. Figure 1 indicates that the accuracy curves of NB based on CHIX, MIX, DFX, GINIX, DIAX, CMFSX, and OCFSX are significantly higher than those of CHI, MI, DF, GINI, DIA, CMFS, and OCFS. DIAX shows the largest performance gain, with a maximum growth rate of 165 percent. The accuracy curves of NB based on IGX and IG coincide completely in shape. However, the curve of NB based on DFPFSX is lower than that of DFPFS. It can be seen from Figure 2 that the curves of SVM based on the improved versions are higher than those of the existing versions, except for DFPFS.

5.2. The Experimental Results on Reuters-21578

Table 5 shows the comparison of the nine improved and existing feature-selection methods in terms of micro-F1 on Reuters-21578. It can be seen from Table 5 that the micro-F1 of NB based on CHIX, DFX, DIAX, OCFSX, and DFPFSX is superior to that of CHI, DF, DIA, OCFS, and DFPFS. The micro-F1 of NB based on IGX is superior to that of IG when the number of selected features is 800 or 1200. The micro-F1 of NB based on MIX is superior to that of MI when the number of selected features is 400, 1600, or 2000. The micro-F1 of NB based on GINIX is superior to that of GINI when the number of selected features is 400, 800, 1600, or 2000. The micro-F1 of NB based on CMFSX is superior to that of CMFS when the number of selected features is 400, 800, 1200, or 1600. The micro-F1 of SVM based on IGX, CHIX, DFX, GINIX, DIAX, CMFSX, OCFSX, and DFPFSX is significantly superior to that of IG, CHI, DF, GINI, DIA, CMFS, OCFS, and DFPFS. The micro-F1 of SVM based on MIX is superior to that of MI when the number of selected features is 1200, 1600, or 2000.

Table 6 indicates that the AUCs of SVM based on the improved feature-selection methods on Reuters-21578 are almost always superior to those of the nine existing methods. Although some of the AUCs of NB on Reuters-21578 based on the improved feature-selection methods are inferior to those of the existing feature-selection algorithms, there is no significant difference between them.

Based on the nine pairs of feature-selection methods and Reuters-21578, the accuracy curves of NB and SVM are shown in Figures 3 and 4, respectively. It can be seen from Figure 3 that the accuracy curve of NB based on IGX almost coincides with that of IG. The accuracy curve of NB based on CHIX is higher than that of CHI except when the number of features is 200, 400, 1200, or 1400. The accuracy curve of NB based on MIX is higher than that of MI when the number of features is greater than 1000. The accuracy curves of NB based on DFX, DIAX, and DFPFSX are higher than those of DF, DIA, and DFPFS, although the performance gain of DFX is quite small. The accuracy of NB based on GINIX is superior to that of GINI except when the number of features is 1400, 1600, or 1800. The accuracy of NB based on CMFSX is superior to that of CMFS except when the number of features is 200, 400, or 800. The accuracy curve of NB based on OCFSX is higher than that of OCFS when the number of features is greater than 400. Figure 4 indicates that the accuracy curves of SVM based on IGX, DFX, DIAX, CMFSX, OCFSX, and DFPFSX are significantly higher than those of IG, DF, DIA, CMFS, OCFS, and DFPFS, respectively. When the number of features is greater than 200, the accuracy curve of SVM based on CHIX is higher than that of CHI. The performance of SVM based on MIX is superior to that of MI when the number of features is greater than 1000. The accuracy curve of SVM based on DIAX is higher than that of DIA when the number of selected features is greater than 400.

5.3. The Experimental Results on WebKB

Table 7 shows the comparison of the nine improved and existing feature-selection methods with respect to the micro-F1 measure on WebKB. It can be seen from Table 7 that the micro-F1 of NB based on CHIX, DFX, GINIX, DIAX, CMFSX, OCFSX, and DFPFSX is significantly superior to that of CHI, DF, GINI, DIA, CMFS, OCFS, and DFPFS, respectively; the micro-F1 of NB based on IGX is superior to that of IG when the number of selected features is 400 or 2000; and the micro-F1 of NB based on MIX is superior to that of MI when the number of selected features is greater than 200. The micro-F1 of SVM based on IGX, CHIX, DFX, GINIX, CMFSX, OCFSX, and DFPFSX is significantly superior to that of IG, CHI, DF, GINI, CMFS, OCFS, and DFPFS.

Table 8 lists the AUCs of NB and SVM on WebKB based on the nine improved and existing feature-selection algorithms. The AUCs of SVM based on the improved feature-selection methods are superior to those of the existing methods, except for DIAX and MIX. The AUC of NB based on IGX is higher than that of IG when the number of selected features is 400 or 2000. The AUC of NB based on DIAX is superior to that of DIA when the number of features is 1200, 1600, or 2000.

Figures 5 and 6 show the accuracy curves of NB and SVM based on the nine pairs of feature-selection methods on WebKB, respectively. The accuracy curve of NB based on IGX is very close to that of IG. The accuracy curves of NB based on CHIX, MIX, DFX, GINIX, CMFSX, OCFSX, and DFPFSX are significantly higher than those of CHI, MI, DF, GINI, CMFS, OCFS, and DFPFS. When the number of features is greater than 800, the accuracy of NB based on DIAX is superior to that of DIA. The accuracy curves of SVM based on IGX, CHIX, DFX, GINIX, CMFSX, OCFSX, and DFPFSX are significantly higher than those of IG, CHI, DF, GINI, CMFS, OCFS, and DFPFS. However, the accuracy curves of SVM based on MIX and DIAX are lower than those of MI and DIA, respectively.

6. Discussion

Because the number of documents in every category is equal, 20-Newsgroups is a balanced dataset from the viewpoint of the number of documents per category. However, the lengths of the documents are not identical, and the number of terms contained in each document also differs. Figure 7 shows the total term frequency of each category in the 20-Newsgroups dataset. It can be seen from Figure 7 that the total term frequency of the category "talk.politics.mideast" is the largest, while that of the category "misc.forsale" is the smallest. Hence, as shown in Tables 3 and 4 and Figures 1 and 2, the performance of the improved feature-selection algorithms, which alleviate the effect of this imbalance factor, is significantly superior to that of the existing feature-selection methods.

The expected cross-entropy (ECE) is a feature-selection algorithm used by Zhang and Qiu [35], defined by (20). It can be concluded from the experiments that the performance of ECE is superior to that of most feature-selection algorithms. Table 9 lists the accuracy comparison of NB between ECE and the nine existing feature-selection algorithms on 20-Newsgroups when the number of selected features is 400, 800, 1200, 1600, or 2000. It can be seen from Table 9 that the performance of ECE is superior to that of CHI, DF, IG, MI, OCFS, DIA, and DFPFS and inferior to that of GINI and CMFS. By analyzing the formula of ECE, we find that the imbalance factor has already been taken into account by ECE; this is the reason why ECE is more effective than the others:

$$ECE(t_i) = P(t_i) \sum_{j=1}^{|C|} P(C_j \mid t_i) \log \frac{P(C_j \mid t_i)}{P(C_j)}. \quad (20)$$
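A minimal sketch of (20) estimated from document counts; the function and argument names are ours.

```python
import math

def expected_cross_entropy(counts_with_term, class_doc_counts, n_documents):
    """ECE(t_i) = P(t_i) * sum_j P(C_j | t_i) * log( P(C_j | t_i) / P(C_j) ).

    counts_with_term[j]: documents of class j containing t_i
    class_doc_counts[j]: documents of class j; n_documents: total documents.
    """
    df = sum(counts_with_term)
    p_t = df / n_documents
    score = 0.0
    for a_j, n_j in zip(counts_with_term, class_doc_counts):
        if a_j == 0:
            continue
        p_c_given_t = a_j / df
        p_c = n_j / n_documents
        score += p_c_given_t * math.log(p_c_given_t / p_c)
    return p_t * score

print(expected_cross_entropy([30, 5, 2], [120, 90, 60], n_documents=270))
```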

The time complexity of the improved feature-selection algorithms is higher than that of the old versions, because the cost of calculating the prior probability $P(C_j)$ in the improved feature-selection method has to be taken into account. There are two ways to estimate this additional cost, depending on how $P(C_j)$ is computed. We assume that the size of the vector space is $|V|$ and the number of categories is $|C|$. If $P(C_j)$ is evaluated with the number of documents in every category, the additional time complexity is $O(|C|)$; if $P(C_j)$ is evaluated with the sum of the term frequencies of all features in every category, the cost is $O(|V| \cdot |C|)$.

To learn more about our experiments, readers can visit the web site (http://pan.baidu.com/s/1y8z7K).

7. Conclusion

Most feature-selection algorithms are designed to measure the significance of a feature for categorization on the assumption of a balanced dataset. Even when a dataset is balanced in terms of the number of documents in every category, it may be imbalanced in terms of the total term frequency of the features in every category. Thus the traditional feature-selection algorithms do not achieve the best performance, owing to the adverse effect of the imbalance factor in the corpus. In this paper, we proposed an improved scheme that weakens this adverse effect. In our experiments, nine well-known feature-selection algorithms were improved using the proposed scheme. The experimental results indicate that the improved scheme can effectively enhance the performance of text categorization.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research is supported by the project development plan of science and technology of Jilin province under Grant no. 20140204071GX.