Abstract

Feature selection plays a critical role in text categorization. During feature selecting, high-frequency terms and the interclass and intraclass relative contributions of terms all have significant effects on classification results. So we put forward a feature selection approach, IIRCT, based on interclass and intraclass relative contributions of terms in the paper. In our proposed algorithm, three critical factors, which are term frequency and the interclass relative contribution and the intraclass relative contribution of terms, are all considered synthetically. Finally, experiments are made with the help of kNN classifier. And the corresponding results on 20 NewsGroup and SougouCS corpora show that IIRCT algorithm achieves better performance than DF, -Test, and CMFS algorithms.

1. Introduction

As the number of digital documents available on the Internet has been growing significantly in recent years, it is impossible to manipulate manually such enormous information [1]. More and more methods based on statistical theory and machine learning have been proposed, and they are applied successfully to information processing. An effective method for managing the vast amount of data is text categorization, which has been widely applied to many fields such as theme detection, spam filtering, identity recognition, web page classification, and semantic parsing.

The goal of text classification is to assign a new document automatically to a predefined category [2]. A typical text classification framework consists of preprocessing, document representation, feature selection, feature weighting, and classification stages [3]. In the preprocessing stage, it usually contains such tasks as tokenization, stop-word removal, lowercase conversion, and stemming. In the document representation stage, it generally utilizes the vector space model that makes use of the bag-of-words approach [4]. In the feature selection stage, it usually employs the filter methods such as document frequency (DF) [5], mutual information (MI) [6], information gain (IG) [7], and chi-square (CHI) [8]. In the feature weighting stage, it usually uses TF-IDF to calculate the weights of the selected features in each document. And in the classification stage, it always uses some popular classification algorithms, for example, decision trees [9], -Nearest Neighbors (kNN) [10], and support vector machine (SVM) [11].

The major characteristic of text categorization is that the feature number in the feature space can easily reach up to tens or hundreds of thousands. It can not only increase computational time but also degrade classification accuracy [12]. As a consequence, feature selection plays a critical role in text classification.

The existing experimental results show that IG is one of the most effective feature selection methods, the performance of DF is similar to IG, and MI is the worst [13]. Through comparative analysis, it is easy to find that the performances of DF and IG are good, which means that high-frequency terms are really essential to text classification, while the performance of MI is bad as it is inclined to select low-frequency terms as features. Besides, -Test method is also based on term frequency [14] and its performance is good. During feature selecting, Categorical Term Descriptor (CTD) method considers the document frequency of IDF and the category information of ICF particularly [15]. Similarly, Strong Class Information Words (SCIW) method selects the terms which have good abilities to distinguish categories [16] and it also considers the category information. Experimental results show that CTD and SCIW both have good accuracies. So we can easily know that feature selection methods based on category information always have good performances. As a result, we draw that high-frequency terms and category information are very important in improving the classification effectiveness. Comprehensively Measure Feature Selection (CMFS) method [1] considers high-frequency terms and category information simultaneously, and it also obtains good results. But it does not consider the interactions between categories. In view of these, we propose a new feature selection algorithm named as feature selection approach based on interclass and intraclass relative contributions of terms (IIRCT), in which term frequency and the interclass relative contribution and the intraclass relative contribution of terms are all considered synthetically.

To deal with massive documents corpora, many feature selection approaches have been proposed. And their purpose is to select the terms whose classification capabilities are stronger comparatively in feature space. After feature selection, the dimensionality of feature space can be reduced, and the efficiency and accuracy of classifiers can be improved. Its main idea is as follows. Firstly, it uses the feature selection function to compute some important indicators of each word in feature space. And then, it sorts the words in descending order according to above values. Finally, it selects the top m words to construct the feature vector.

In this section, we introduce some symbols used in the following firstly.

is the times that the term appears in document , namely, term frequency.

is the average frequency of the term within a single category , and the calculation formula is as follows:where is the document number in collection , is the document number in category , and , which is an indicator to discriminate whether document belongs to category .

is the average term frequency of the term in collection , and it is calculated according toSimilarly, is the document number in collection .

Then we give the definition of three feature selection methods, which are DF, -Test, and CMFS, respectively.

2.1. DF

DF method calculates the number of documents which contain the terms in the category to measure the relevance of the terms and the categories. And the terms can be reserved only when they appear in adequate documents. This measurement is based on such an assumption that the terms which have low values of DF have few effects on the classification performance [8]. So DF method always selects terms with high values of DF and removes terms with low values of DF.

DF method is a simple word reduction technology and has good performance. Due to its linear complexity, it can be easily scaled to be used in large-scale corpus.

2.2. -Test

-Test [14] is a feature selection approach based on term frequency, which is used to measure the diversity of the distributions of a term between the specific category and the entire corpus. And it is defined as follows:In (3), is the average frequency of the term within a single category , is the average term frequency of the term in collection , is the document number in category , is the document number in collection , , and is the category number in collection .

The following two ways are used alternatively when the main features are finally selected:where , is the document number in category , and is the document number in collection .

Generally, the method shown in (4) is always better than that shown in (5) for multiclass problem.

2.3. CMFS

When selecting features, DF method only computes the document frequency of each unique term in one category, and then the highest document frequency of a term in various categories is retained as the term’s score. DIA association factor method [17] only calculates the distribution probability of a term in various categories, and then the highest probability of the term can be used as the term’s score. Yang et al. [1] noticed that both DF and DIA methods only focus on one respect of the problems (row or column). Thus DF method concentrates on the column of the term-to-category matrix, while DIA focuses on the row of the term-to-category matrix. Based on such observation, a new feature selection algorithm, Comprehensively Measure Feature Selection (CMFS), is proposed by Yang et al. It comprehensively measures the significance of a term both in intercategory and intracategory. And it is defined as follows:Here, is the probability that the feature appears in category , and can be considered as the conditional probability that the feature belongs to category when the feature occurs.

To measure the goodness of a term globally, two alternate ways can be used to combine the category-specific scores of a term. And the formulae are as follows:where , is the document number in category , and is the document number in collection .

3. IIRCT

In this section, we propose a feature selection approach based on interclass and intraclass relative contributions of terms. In the proposed algorithm, three critical factors, which are term frequency and the interclass relative contribution and the intraclass relative contribution of terms, are all considered synthetically.

3.1. Motivation

At present, a large number of feature selection algorithms emerge. Through studying and analysing them, we can easily find that DF, IG, and -Test algorithms are inclined to select high-frequency terms as main features, and their performances are good. Among them, DF and IG algorithms are based on document frequency, and -Test algorithm is based on term frequency. CTD and SCIW algorithms consider the category information, and they both have good accuracies.

Therefore, we conclude the following ones:(1)A term, which frequently occurs in a single class and does not occur in the other classes, is distinctive. Therefore, it should be given a high score.(2)A term, which rarely occurs in a single class and does not occur in the other classes, is irrelevant. Therefore, it should be given a low score.(3)A term, which frequently occurs in all classes, is irrelevant. Therefore, it should be given a low score.(4)A term, which occurs in some classes, is relatively distinctive. Therefore, it should be given a relatively high score.

From points (1) and (2), it can be seen that high-frequency terms have effects on the classification performance. From points (3) and (4), it can be seen that category information is also a very important factor which influences the classification effect. As a result, we have a conclusion that high-frequency terms and category information are both very important factors in improving the classification performance. In view of these, high-frequency terms and category information are considered synthetically when constructing feature selection function in this paper. When judging whether a word is a high-frequency term, term frequency method is used. While considering category information, we notice that ① if the probability that the feature occurs in category is higher than other features, can represent more effectively, ② if the probability that the feature occurs in category is higher than occurs in other categories, can represent more effectively, ③ if the conditional probability that the feature belongs to category is higher than belongs to other categories when the feature occurs, can represent more effectively. So, the feature selection function constructed in this paper considers the interclass and intraclass relative contributions of terms to measure the category information.

Based on the above, we propose a new feature selection approach, IIRCT, in which term frequency and the interclass relative contribution and the intraclass relative contribution of terms are all considered synthetically.

3.2. Algorithm Implementation

In this section, we firstly introduce some symbols.

is the term frequency of term in category , and it is calculated according towhere is the document number in category and is the times that the term appears in document .

is the document frequency of term in category .

is the total term frequency of all terms in category , and the calculation formula is as follows:where is the term number in category .

is the total document frequency of term in all categories, and it is calculated according towhere is the category number.

IIRCT algorithm measures the significance of a term from three aspects comprehensively, which are term frequency and the interclass and intraclass relative contributions of terms. Thus, we define comprehensive measurement for each term with respect to category as follows:where is the category number, is the term frequency of term in category , is the total term frequency of all terms in category , is the document frequency of term in category , and is the total document frequency of term in all categories.

In view of the probability theory, we can regard in (11) as the probability that the feature occurs in category , that is, . in (11) can be considered as the conditional probability that the feature belongs to category when the feature occurs, that is, . in (11) can be considered as the probability that the feature occurs in category , that is, . in (11) can be considered as the conditional probability that the feature belongs to category when the feature occurs, that is, . So (11) can be further represented as follows:Here, is the probability that the feature occurs in category , and can be considered as the conditional probability that the feature belongs to category when the feature occurs.

To measure the goodness of a term globally, we construct the following function:where which is the probability that category occurs in the entire training set, is the document number in category , and is the document number in collection .

3.3. Algorithm Description

According to the above, we present a new feature selection algorithm, IIRCT, based on interclass and intraclass relative contributions of terms. Its pseudocode is as in Pseudocode 1.

Input: training set , selected feature number
Output: top features in
()   For each category
() Compute the total term frequency of all terms in category
()   End For
()   For each term
() Compute the total document frequency of a term in all categories
() For each category
()  Compute the term frequency of a term in category
()  Compute the document frequency of a term in category
() End For
() End For
() For each term
() For each category
()  Compute the significance of a term in category
() End For
() End For
() For each term
() Compute the value of
() End For
() Rank all terms descendingly based on
() Selest top terms as features

4. Experiments Setup

4.1. Experimental Data

In this paper, we use two popular datasets, 20 NewsGroup and SougouCS.

The 20 NewsGroup corpus, which is collected by Ken Lang, has been widely used in text classification. This corpus contains 19997 newsgroup documents which are nearly evenly distributed among 20 discussion groups, and every group consists of 1,000 documents. All letters are converted into lowercase, and the word stemming is applied. In addition, we use the stop words list to filter words. The details of 20 NewsGroup corpus are as shown in Table 1.

The SougouCS corpus is provided by Sogou Laboratory. The documents of the corpus are from Sohu news website which has a lot of classified information. As the number of web pages in some classes is too small, we only choose 12 classes. And the detail is as shown in Table 2.

4.2. Document Representation

Documents are represented by vector space model [4]. That is, the content of a document is represented by a vector in the term space. It is illustrated in detail as the following. Consider , where is the number of the features selected by feature selection algorithms and is the weight of feature in document . In experiments, Term Frequency-Inverse Document Frequency (TF-IDF) [18] is used to calculate the weights of the m selected features in each document.

4.3. Classifier Selection

In the experiments, -Nearest Neighbors (kNN) is used to classify and test documents. And it is also a case-based or instance-based categorization algorithm. At present, kNN is widely used in text classification as it is simple and has low error rate.

The principle of kNN classification algorithm is very simple and intuitive. Giving a test document whose category is unknown, the classification system will find the -nearest documents by computing the similarities between documents in training data. And then, we will get the category of the test documents according to the -nearest documents. The similarity measure used for the classifier is the cosine function [19].

In the paper, we set . And we randomly select 65% instances from each category as training data and the rest as testing data.

4.4. Performance Measures

We measure the effectiveness of classifiers in terms of the combination of precision () and recall () widely used in text categorization. That is, we use the well-known function [20] as follows:

For multiclass text categorization, is usually calculated in two ways. And they are the macroaveraged () and the microaveraged (). Here, we only use , as shown inwhere is the value of the predicted th category.

5. Results and Discussions

5.1. Results and Discussions on 20 NewsGroup

Figure 1 shows the precision and recall of IIRCT, DF, -Test, and CMFS on the 20 NewsGroup corpus when 1,500 features are selected in feature space. It can be seen from Figure 1(a) that the precision of IIRCT is higher than that of DF, -Test, and CMFS. And in some categories, the precision of IIRCT almost reaches up to 95%. Similarly, Figure 1(b) also indicates that the performance of IIRCT is better than that of DF, -Test, and CMFS, and the recall of most categories has some improvements.

The numbers 1–20 in Figure 1 can be referred to in Table 1.

Figure 2 shows the performance of IIRCT, DF, -Test, and CMFS on the 20 NewsGroup corpus with different feature dimensionalities. From Figure 2, we can conclude that the of IIRCT is close to that of CMFS when 100 features are selected. But if 200, 500, 1000, 1500, 2000, 2500, 3000, or 3500 terms are selected as features, the curve of IIRCT is higher than that of DF, -Test, and CMFS. This means that the performance of IIRCT is better than the other three algorithms. Besides, it can be found that the value of macro- decreases as the feature number increases. The reason for this is that the boundaries between categories are very clear in the 20 NewsGroup corpus. As a consequence, small amount of features can achieve good classification performance. But with the feature number increasing, many features have a negative impact on classification performance. And the classification effect gets poor.

5.2. Results and Discussions on SougouCS

Figure 3 shows the precision and recall of IIRCT, DF, -Test, and CMFS on the SougouCS corpus when 4,500 features are selected in feature space. It is clear that, in most categories, the precision and recall of IIRCT have some improvements compared to DF, -Test, and CMFS. And this means that IIRCT achieves better performance than that of DF, -Test, and CMFS.

The numbers 1–12 in Figure 3 can be referred to in Table 2.

Figure 4 depicts the performance of the four algorithms on the SougouCS corpus. From Figure 4, we can know that the curve of IIRCT lies above the other three curves, which also means IIRCT has better performance than that of DF, -Test, and CMFS. Besides, it can be found that the value of macro- is the largest when 4500 features are selected. And when the selected feature number increases or decreases from 4500, the value of macro- decreases. The reason for this is that, in the SougouCS corpus, some categories, such as fashion and entertainment, have many common words which make the boundaries between categories obscure. When small amount of features is selected, some documents cannot be classified correctly. And when the feature number increases to a certain value, these features make the boundaries between categories clear and improve the classification effect. When the feature number keeps increasing, many features have a negative impact on classification performance. And the classification effect gets poor.

6. Conclusions

Feature selection plays a critical role in text classification and has an immediate impact on text categorization. So we put forward a feature selection approach, IIRCT, based on interclass and intraclass relative contributions of terms in the paper. In our proposed algorithm, term frequency and the interclass and intraclass relative contributions of terms are all considered synthetically. The experimental results on 20 NewsGroup and SougouCS corpora show that IIRCT achieves better performance than DF, -Test, and CMFS. Therefore, the algorithm proposed in this paper is an effective feature selection method.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This research was supported by the National Science Foundation of China under Grants 61402363 and 61272284, Shaanxi Technology Committee Industrial Public Relation Project under Grant 2014K05-49, Natural Science Foundation Project of Shaanxi Province under Grant 2014JQ8361, Education Department of Shaanxi Province Key Laboratory Project under Grant 15JS079, Xi’an Science Program Project under Grant CXY1509(7), and Beilin district of Xi’an Science and Technology Project under Grant GX1625.