Computational Intelligence and Neuroscience

Volume 2016, Article ID 1715780, 8 pages

http://dx.doi.org/10.1155/2016/1715780

## A Feature Selection Approach Based on Interclass and Intraclass Relative Contributions of Terms

School of Computer Science and Engineering, Xi’an University of Technology, Xi’an, Shaanxi 710048, China

Received 29 March 2016; Revised 21 June 2016; Accepted 11 July 2016

Academic Editor: Elio Masciari

Copyright © 2016 Hongfang Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Feature selection plays a critical role in text categorization. During feature selection, high-frequency terms and the interclass and intraclass relative contributions of terms all have significant effects on classification results. We therefore put forward a feature selection approach, IIRCT, based on the interclass and intraclass relative contributions of terms. The proposed algorithm jointly considers three critical factors: term frequency, the interclass relative contribution of terms, and the intraclass relative contribution of terms. Experiments are conducted with a kNN classifier, and the results on the 20 NewsGroup and SougouCS corpora show that the IIRCT algorithm achieves better performance than the DF, t-Test, and CMFS algorithms.

#### 1. Introduction

As the number of digital documents available on the Internet has grown significantly in recent years, it has become impossible to process such an enormous amount of information manually [1]. More and more methods based on statistical theory and machine learning have been proposed and applied successfully to information processing. An effective method for managing this vast amount of data is text categorization, which has been widely applied in fields such as theme detection, spam filtering, identity recognition, web page classification, and semantic parsing.

The goal of text classification is to assign a new document automatically to a predefined category [2]. A typical text classification framework consists of preprocessing, document representation, feature selection, feature weighting, and classification stages [3]. The preprocessing stage usually includes tasks such as tokenization, stop-word removal, lowercase conversion, and stemming. The document representation stage generally uses the vector space model, which relies on the bag-of-words approach [4]. The feature selection stage usually employs filter methods such as document frequency (DF) [5], mutual information (MI) [6], information gain (IG) [7], and chi-square (CHI) [8]. The feature weighting stage usually uses TF-IDF to calculate the weights of the selected features in each document. The classification stage uses popular classification algorithms, for example, decision trees [9], k-Nearest Neighbors (kNN) [10], and support vector machines (SVM) [11].

The major characteristic of text categorization is that the number of features in the feature space can easily reach tens or hundreds of thousands. Such high dimensionality not only increases computational time but also degrades classification accuracy [12]. As a consequence, feature selection plays a critical role in text classification.

Existing experimental results show that IG is one of the most effective feature selection methods, that the performance of DF is similar to that of IG, and that MI performs worst [13]. Comparative analysis shows that DF and IG perform well, which indicates that high-frequency terms are essential to text classification, whereas MI performs poorly because it tends to select low-frequency terms as features. The t-Test method is also based on term frequency [14], and its performance is good. During feature selection, the Categorical Term Descriptor (CTD) method specifically considers the document frequency information of IDF and the category information of ICF [15]. Similarly, the Strong Class Information Words (SCIW) method selects terms that distinguish categories well [16], so it also considers category information. Experimental results show that CTD and SCIW both achieve good accuracy, which suggests that feature selection methods based on category information tend to perform well. We therefore conclude that high-frequency terms and category information are both very important for improving classification effectiveness. The Comprehensively Measure Feature Selection (CMFS) method [1] considers high-frequency terms and category information simultaneously, and it also obtains good results, but it does not consider the interactions between categories. In view of this, we propose a new feature selection algorithm, named feature selection approach based on interclass and intraclass relative contributions of terms (IIRCT), which jointly considers term frequency, the interclass relative contribution of terms, and the intraclass relative contribution of terms.

#### 2. Related Works

Many feature selection approaches have been proposed to deal with massive document corpora. Their purpose is to select the terms in the feature space whose classification capabilities are comparatively strong. After feature selection, the dimensionality of the feature space is reduced, and the efficiency and accuracy of classifiers can be improved. The main idea is as follows. First, a feature selection function is used to compute a score for each word in the feature space. Then the words are sorted in descending order of these scores. Finally, the top m words are selected to construct the feature vector.
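The three steps above (score, sort, keep the top m) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `select_top_m` and the toy scores are hypothetical and do not come from the paper, and any scoring function could be plugged in.

```python
from typing import Callable, List


def select_top_m(vocabulary: List[str], score: Callable[[str], float], m: int) -> List[str]:
    """Score every term, sort in descending order of score, keep the top m."""
    scored = {term: score(term) for term in vocabulary}
    ranked = sorted(vocabulary, key=lambda term: scored[term], reverse=True)
    return ranked[:m]


# Hypothetical scores for a toy vocabulary (illustration only).
toy_scores = {"goal": 0.9, "the": 0.1, "match": 0.7, "of": 0.05}
selected = select_top_m(list(toy_scores), lambda term: toy_scores[term], 2)
print(selected)  # ['goal', 'match']
```

Keeping the scoring function as a parameter means the same selection step can be reused with any of the feature selection functions discussed in this section.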

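We first introduce some symbols used in the following sections.

$tf_{ij}$ is the number of times that the term $t_i$ appears in document $d_j$, namely, the term frequency.

$\overline{tf}_{ki}$ is the average frequency of the term $t_i$ within a single category $C_k$, and it is calculated as

$$\overline{tf}_{ki} = \frac{1}{N_k} \sum_{j=1}^{N} I(d_j, C_k)\, tf_{ij}, \tag{1}$$

where $N$ is the document number in collection $D$, $N_k$ is the document number in category $C_k$, and $I(d_j, C_k)$ is an indicator of whether document $d_j$ belongs to category $C_k$: it equals 1 if $d_j \in C_k$ and 0 otherwise.

$\overline{tf}_{i}$ is the average term frequency of the term $t_i$ in collection $D$, and it is calculated according to

$$\overline{tf}_{i} = \frac{1}{N} \sum_{j=1}^{N} tf_{ij}, \tag{2}$$

where, as before, $N$ is the document number in collection $D$.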
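The two averages just defined, the average frequency of a term within one category and its average frequency over the whole collection, can be sketched as follows. The function names and the toy data are hypothetical; the per-category average assumes the category contains at least one document.

```python
from typing import List


def average_tf_in_category(tf_i: List[int], labels: List[str], category: str) -> float:
    """Average frequency of one term over the documents of a single category."""
    n_k = sum(1 for label in labels if label == category)  # documents in the category
    total = sum(tf for tf, label in zip(tf_i, labels) if label == category)
    return total / n_k


def average_tf_in_collection(tf_i: List[int]) -> float:
    """Average frequency of one term over all documents in the collection."""
    return sum(tf_i) / len(tf_i)


# tf_i[j] is the frequency of a single term in document j (toy data).
tf_i = [3, 1, 0, 0]
labels = ["sport", "sport", "politics", "politics"]
avg_sport = average_tf_in_category(tf_i, labels, "sport")
avg_all = average_tf_in_collection(tf_i)
print(avg_sport, avg_all)  # 2.0 1.0
```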

We then give the definitions of three feature selection methods: DF, t-Test, and CMFS.

##### 2.1. DF

The DF method counts the number of documents in a category that contain a term in order to measure the relevance between the term and the category. A term is retained only when it appears in a sufficient number of documents. This measure rests on the assumption that terms with low DF values have little effect on classification performance [8]. The DF method therefore selects terms with high DF values and removes terms with low DF values.

The DF method is a simple word reduction technique with good performance. Because of its linear complexity, it scales easily to large corpora.
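A minimal sketch of DF-based filtering is given below. For brevity it counts document frequency over the whole toy collection rather than per category, and the threshold of 2 is an arbitrary illustrative value; both choices, like the function name, are assumptions rather than details from the paper.

```python
from collections import Counter
from typing import List


def document_frequency(docs: List[List[str]]) -> Counter:
    """Count, for every term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document contributes at most 1 per term
    return df


docs = [["ball", "goal", "ball"], ["goal", "team"], ["vote", "team"]]
df = document_frequency(docs)

# Keep only the terms that appear in at least 2 documents (hypothetical threshold).
kept = sorted(term for term, count in df.items() if count >= 2)
print(kept)  # ['goal', 'team']
```

A single pass over the documents is enough, which is consistent with the linear complexity mentioned above.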

##### 2.2. t-Test

The t-Test [14] is a feature selection approach based on term frequency, which measures the diversity of the distributions of a term between a specific category and the entire corpus. It is defined in (3), where $\overline{tf}_{ki}$ is the average frequency of the term $t_i$ within a single category $C_k$, $\overline{tf}_{i}$ is the average term frequency of the term $t_i$ in collection $D$, $N_k$ is the document number in category $C_k$, $N$ is the document number in collection $D$, and $K$ is the category number in collection $D$.

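When the main features are finally selected, one of the following two combinations is used:

$$t\text{-Test}_{\mathrm{avg}}(t_i) = \sum_{k=1}^{K} P(C_k)\, t\text{-Test}(t_i, C_k), \tag{4}$$

$$t\text{-Test}_{\max}(t_i) = \max_{1 \le k \le K} t\text{-Test}(t_i, C_k), \tag{5}$$

where $P(C_k) = N_k / N$, $N_k$ is the document number in category $C_k$, and $N$ is the document number in collection $D$.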

Generally, the method shown in (4) performs better than that shown in (5) for multiclass problems.
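The combination of category-specific scores into one global score can be sketched as follows, assuming the two alternatives are the category-prior-weighted average and the maximum over categories, as is common in this literature. The function names and the toy numbers are hypothetical.

```python
from typing import Dict


def combine_avg(scores: Dict[str, float], n_docs: Dict[str, int]) -> float:
    """Weighted average of per-category scores, weighted by P(C_k) = N_k / N."""
    n = sum(n_docs.values())
    return sum((n_docs[c] / n) * s for c, s in scores.items())


def combine_max(scores: Dict[str, float]) -> float:
    """Maximum of the per-category scores."""
    return max(scores.values())


# Per-category scores of one term and category sizes (toy data).
scores = {"sport": 4.0, "politics": 1.0}
n_docs = {"sport": 30, "politics": 10}
avg_score = combine_avg(scores, n_docs)  # 0.75 * 4.0 + 0.25 * 1.0
max_score = combine_max(scores)
print(avg_score, max_score)  # 3.25 4.0
```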

##### 2.3. CMFS

When selecting features, the DF method only computes the document frequency of each unique term in one category, and the highest document frequency of a term across the categories is retained as the term’s score. The DIA association factor method [17] only calculates the distribution probability of a term in the various categories, and the highest probability of the term is used as the term’s score. Yang et al. [1] noticed that the DF and DIA methods each focus on only one aspect of the problem (row or column): the DF method concentrates on the column of the term-to-category matrix, while DIA focuses on the row. Based on this observation, Yang et al. proposed a new feature selection algorithm, Comprehensively Measure Feature Selection (CMFS), which comprehensively measures the significance of a term both across categories and within a category. It is defined as follows:

$$\mathrm{CMFS}(t_i, C_k) = P(t_i \mid C_k)\, P(C_k \mid t_i). \tag{6}$$

Here, $P(t_i \mid C_k)$ is the probability that the feature $t_i$ appears in category $C_k$, and $P(C_k \mid t_i)$ can be considered as the conditional probability that the feature $t_i$ belongs to category $C_k$ when the feature $t_i$ occurs.

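To measure the goodness of a term globally, two alternative ways can be used to combine the category-specific scores of a term:

$$\mathrm{CMFS}_{\mathrm{avg}}(t_i) = \sum_{k=1}^{K} P(C_k)\, \mathrm{CMFS}(t_i, C_k), \qquad \mathrm{CMFS}_{\max}(t_i) = \max_{1 \le k \le K} \mathrm{CMFS}(t_i, C_k), \tag{7}$$

where $P(C_k) = N_k / N$, $N_k$ is the document number in category $C_k$, and $N$ is the document number in collection $D$.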
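As an illustration of a category-specific score of this kind, the sketch below assumes that the score is the product of the probability that a term appears in a category and the conditional probability of the category given the term. Both probabilities are estimated here by plain relative term frequencies without smoothing, which is an assumption made for brevity; the estimator used in [1] may differ. The function name and the toy counts are hypothetical.

```python
from typing import Dict


def cmfs_score(tf: Dict[str, Dict[str, int]], term: str, category: str) -> float:
    """Product of estimated P(term | category) and P(category | term)."""
    tf_in_cat = tf[category].get(term, 0)
    total_in_cat = sum(tf[category].values())  # all term occurrences in the category
    total_of_term = sum(counts.get(term, 0) for counts in tf.values())
    if total_in_cat == 0 or total_of_term == 0:
        return 0.0
    p_term_given_cat = tf_in_cat / total_in_cat
    p_cat_given_term = tf_in_cat / total_of_term
    return p_term_given_cat * p_cat_given_term


# tf[category][term]: term frequency of each term in each category (toy data).
tf = {
    "sport": {"goal": 8, "the": 12},
    "politics": {"vote": 5, "the": 15},
}
goal_sport = cmfs_score(tf, "goal", "sport")  # (8 / 20) * (8 / 8)
the_sport = cmfs_score(tf, "the", "sport")    # (12 / 20) * (12 / 27)
print(goal_sport > the_sport)  # True
```

In this toy example the category-specific term "goal" scores higher than "the", which is more frequent but spread over both categories.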

#### 3. IIRCT

In this section, we propose a feature selection approach based on the interclass and intraclass relative contributions of terms. The proposed algorithm jointly considers three critical factors: term frequency, the interclass relative contribution of terms, and the intraclass relative contribution of terms.

##### 3.1. Motivation

A large number of feature selection algorithms have emerged. By studying and analysing them, we find that the DF, IG, and t-Test algorithms tend to select high-frequency terms as main features, and their performances are good. Among them, the DF and IG algorithms are based on document frequency, and the t-Test algorithm is based on term frequency. The CTD and SCIW algorithms consider category information, and both achieve good accuracy.

Therefore, we draw the following conclusions:

(1) A term that occurs frequently in a single class and does not occur in the other classes is distinctive. Therefore, it should be given a high score.

(2) A term that occurs rarely in a single class and does not occur in the other classes is irrelevant. Therefore, it should be given a low score.

(3) A term that occurs frequently in all classes is irrelevant. Therefore, it should be given a low score.

(4) A term that occurs in some classes is relatively distinctive. Therefore, it should be given a relatively high score.

Points (1) and (2) show that high-frequency terms affect classification performance. Points (3) and (4) show that category information is also a very important factor influencing classification. We therefore conclude that high-frequency terms and category information are both very important for improving classification performance. In view of this, high-frequency terms and category information are considered jointly when constructing the feature selection function in this paper. Term frequency is used to judge whether a word is a high-frequency term. For category information, we notice the following: ① if the probability that the feature $t_i$ occurs in category $C_j$ is higher than that of other features, $t_i$ can represent $C_j$ more effectively; ② if the probability that the feature $t_i$ occurs in category $C_j$ is higher than the probability that it occurs in other categories, $t_i$ can represent $C_j$ more effectively; ③ if, when the feature $t_i$ occurs, the conditional probability that it belongs to category $C_j$ is higher than the conditional probability that it belongs to other categories, $t_i$ can represent $C_j$ more effectively. So the feature selection function constructed in this paper uses the interclass and intraclass relative contributions of terms to measure category information.

Based on the above, we propose a new feature selection approach, IIRCT, in which term frequency and the interclass relative contribution and the intraclass relative contribution of terms are all considered synthetically.

##### 3.2. Algorithm Implementation

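We first introduce some symbols.

$tf(t_i, C_j)$ is the term frequency of term $t_i$ in category $C_j$, and it is calculated according to

$$tf(t_i, C_j) = \sum_{k=1}^{N_j} tf_{ik}, \tag{8}$$

where $N_j$ is the document number in category $C_j$ and $tf_{ik}$ is the number of times that the term $t_i$ appears in document $d_k$ of category $C_j$.

$df(t_i, C_j)$ is the document frequency of term $t_i$ in category $C_j$.

$tf(t, C_j)$ is the total term frequency of all terms in category $C_j$, and it is calculated as

$$tf(t, C_j) = \sum_{i=1}^{|V_j|} tf(t_i, C_j), \tag{9}$$

where $|V_j|$ is the term number in category $C_j$.

$df(t_i, C)$ is the total document frequency of term $t_i$ in all categories, and it is calculated according to

$$df(t_i, C) = \sum_{j=1}^{|C|} df(t_i, C_j), \tag{10}$$

where $|C|$ is the category number.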
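The four counts introduced above (per-category term frequency, per-category document frequency, total term frequency of a category, and total document frequency of a term) can be gathered in one pass over a labelled corpus. The following is a sketch under the assumption that documents are already tokenized; the function name `category_statistics` and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict
from typing import List


def category_statistics(docs: List[List[str]], labels: List[str]):
    """Collect per-category term and document frequencies and their totals."""
    tf = defaultdict(Counter)  # tf[c][t]: occurrences of term t in category c
    df = defaultdict(Counter)  # df[c][t]: documents of category c containing t
    for tokens, label in zip(docs, labels):
        tf[label].update(tokens)
        df[label].update(set(tokens))
    # Total term frequency of all terms in each category.
    total_tf = {c: sum(counts.values()) for c, counts in tf.items()}
    # Total document frequency of each term over all categories.
    total_df = Counter()
    for counts in df.values():
        total_df.update(counts)
    return tf, df, total_tf, total_df


docs = [["ball", "goal", "ball"], ["goal", "team"], ["vote", "team"]]
labels = ["sport", "sport", "politics"]
tf, df, total_tf, total_df = category_statistics(docs, labels)
print(tf["sport"]["ball"], df["sport"]["goal"], total_tf["sport"], total_df["team"])  # 2 2 5 2
```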

The IIRCT algorithm measures the significance of a term comprehensively from three aspects: term frequency and the interclass and intraclass relative contributions of terms. Thus, we define a comprehensive measurement for each term $t_i$ with respect to category $C_j$ in (11), where $|C|$ is the category number, $tf(t_i, C_j)$ is the term frequency of term $t_i$ in category $C_j$, $tf(t, C_j)$ is the total term frequency of all terms in category $C_j$, $df(t_i, C_j)$ is the document frequency of term $t_i$ in category $C_j$, and $df(t_i, C)$ is the total document frequency of term $t_i$ in all categories.

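From the viewpoint of probability theory, the ratios in (11) can be interpreted as follows. The ratio built from $tf(t_i, C_j)$ and $tf(t, C_j)$ can be regarded as the probability that the feature $t_i$ occurs in category $C_j$, that is, $P(t_i \mid C_j)$. The ratio built from $df(t_i, C_j)$ and $df(t_i, C)$ can be considered as the conditional probability that the feature $t_i$ belongs to category $C_j$ when the feature $t_i$ occurs, that is, $P(C_j \mid t_i)$. The corresponding ratios for another category $C_k$ can be considered, respectively, as the probability that the feature $t_i$ occurs in category $C_k$, that is, $P(t_i \mid C_k)$, and as the conditional probability that the feature $t_i$ belongs to category $C_k$ when the feature $t_i$ occurs, that is, $P(C_k \mid t_i)$. So (11) can be further represented as (12), in which $P(t_i \mid C_j)$ is the probability that the feature $t_i$ occurs in category $C_j$, and $P(C_j \mid t_i)$ can be considered as the conditional probability that the feature $t_i$ belongs to category $C_j$ when the feature $t_i$ occurs.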

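To measure the goodness of a term globally, we construct the following function:

$$\mathrm{IIRCT}(t_i) = \sum_{j=1}^{|C|} P(C_j)\, \mathrm{IIRCT}(t_i, C_j), \tag{13}$$

where $P(C_j) = N_j / N$ is the probability that category $C_j$ occurs in the entire training set, $N_j$ is the document number in category $C_j$, and $N$ is the document number in collection $D$.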

##### 3.3. Algorithm Description

Based on the above, we present a new feature selection algorithm, IIRCT, based on the interclass and intraclass relative contributions of terms. Its pseudocode is given in Pseudocode 1.