Abstract

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, which has caused increasing interest in text mining. Text classification is one of the most important subfields of text mining. In practice, text documents are often represented as a high-dimensional sparse document term matrix (DTM) before classification. Feature selection is essential and vital for text classification due to the high dimensionality and sparsity of the DTM. An efficient feature selection method is capable of both reducing the dimensionality of the DTM and selecting discriminative features for text classification. Laplacian Score (LS) is an unsupervised feature selection method that has been successfully used in areas such as face recognition. However, LS is unable to select discriminative features for text classification and to effectively reduce the sparsity of the DTM. To improve it, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses the feature distance contribution (a ratio) to rank the importance of features for text documents so as to select discriminative features. Experimental results indicate that DVS is able to select discriminative features and to reduce the sparsity of the DTM. Thus, it is much more efficient than LS.

1. Introduction

Text classification is one of the important subfields of text mining, and it has recently gained increasing attention with the rapid development of web applications such as social networks. The problem of text classification can be described as follows. In practice, text documents are often represented as a document term matrix $D$ (shown in (1)). The row vector $d_i$ denotes the $i$-th text document among the $n$ text documents. Each feature $t_j$ corresponds to a term in the text documents, and $f_{ij}$ is the term frequency of $t_j$ in $d_i$. Consider a simple example of two short text documents. The first document is "You are beautiful!" and the second is "Good morning, you guys!". The DTM of the two documents is shown in (2). Here $t_1$ and $f_{11} = 1$ show that the first feature "are" occurs once in the first text document ("You are beautiful!"). Moreover, each $d_i$ is labeled with a class $y_i$ in which $d_i$ should be; $y_i$ is drawn from a predefined $K$-class label set $C = \{c_1, c_2, \ldots, c_K\}$. The major task of text classification is to construct classification models using training text documents and then to predict a class label ($y \in C$) for other text documents whose class labels are unknown.

DTM of $n$ Documents:
$$D = \begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1m} \\ f_{21} & f_{22} & \cdots & f_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n1} & f_{n2} & \cdots & f_{nm} \end{pmatrix} \quad (1)$$

DTM of Two Documents:
$$D = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 & 1 \end{pmatrix} \quad (2)$$
where the columns correspond, in order, to the terms "are", "beautiful", "good", "guys", "morning", and "you".
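As an illustration (our own sketch, not code from the paper), the snippet below builds the DTM in (2) from the two example documents, using the alphabetical term order so that "are" is the first feature.

import re

docs = ["You are beautiful!", "Good morning, you guys!"]
tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocab = sorted(set(w for doc in tokens for w in doc))
dtm = [[doc.count(term) for term in vocab] for doc in tokens]
print(vocab)  # ['are', 'beautiful', 'good', 'guys', 'morning', 'you']
print(dtm)    # [[1, 1, 0, 0, 0, 1], [0, 0, 1, 1, 1, 1]]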

The most important characteristics of text data are its high dimensionality and sparsity [1]. A corpus of text documents often contains a large number of distinct words (high dimensionality), while a given text document may contain only a small number of them (sparsity). This can be described in terms of $D$ as follows. Consider $D$ in (1). In most cases, the number of terms $m$ is very large, which makes $D$ high-dimensional. For a document $d_i$, a large number of the entries $f_{ij}$ are 0 because $d_i$ contains only a small fraction of the total terms. This makes $D$ highly sparse. Due to the high dimensionality and sparsity of text data, it is essential to find an efficient feature selection method to reduce the feature space. The reduction of the feature space causes a loss of information for text classification. A feature selection method is efficient if it keeps the classification accuracy of classifiers when the feature space is reduced.
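To make the sparsity notion concrete, the sketch below (ours) computes the fraction of zero entries of a DTM; real corpora typically give values much closer to 1 than the small example in (2).

import numpy as np

def sparsity(dtm):
    """Fraction of zero entries in a document term matrix."""
    dtm = np.asarray(dtm)
    return (dtm == 0).mean()

# For the 2 x 6 example DTM in (2), 5 of the 12 entries are zero:
print(sparsity([[1, 1, 0, 0, 0, 1], [0, 0, 1, 1, 1, 1]]))  # 0.4166...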

Feature selection is one of the important tasks in text classification due to the high dimensionality of the feature space and the existence of indiscriminative features [1]. Feature selection methods are able to reduce the high-dimensional indiscriminative feature space into a low-dimensional discriminative feature subspace [2]. For example, in (2), the feature "beautiful" and the feature "morning" may be discriminative features, and the six-dimensional DTM can then be reduced to a two-dimensional one. This paper focuses on filter feature selection, and all the methods introduced in this paper are filter feature selection methods. Filter methods are chosen because they are more computationally efficient than wrapper methods [2]. Many feature selection methods have been proposed. Supervised feature selection methods include information gain (IG), mutual information (MI), Gini index (GI), and expected cross-entropy (ECE). Unsupervised feature selection methods include document frequency (DF), variance score (VS) (or term variance), and LS.

LS is an unsupervised feature selection method. Its key assumption is that data points from the same class are often close to each other, and the importance of a feature is evaluated by its locality preserving power [3]. More details of LS are covered in Section 2.2.3. LS has been successfully used in areas such as face recognition, but so far it has not been used in feature selection for text classification. We are interested in the efficiency of this proved-discriminative feature selection method in text classification. However, both experiments and analysis indicate that LS is inefficient for text classification due to the sparsity of the DTM: LS is unable to keep the accuracy of classifiers when the feature space is reduced and is unable to efficiently reduce the sparsity of the DTM. To improve LS, this paper proposes an unsupervised feature selection method, DVS. DVS uses the feature distance contribution (a ratio) to replace the similarity measure used in the original LS to rank feature importance so as to select discriminative features. DVS is efficient in selecting discriminative features from a DTM because the feature distance contribution pays much attention to nonzero values and is able to avoid the negative effects caused by 0 values. DVS is also able to reduce the sparsity of the DTM more efficiently than LS. More details of DVS are covered in Section 3.

The rest of this paper is organized as follows. Section 2 introduces current feature selection methods and gives details of LS. Section 3 formally introduces DVS and analyzes its details. In Section 4, experiments are performed using two DTMs from the UCI repository to validate the efficiency of DVS. Further discussion of the efficiency of DVS and LS in text classification is given in Section 4.2.3. Finally, Section 5 gives the conclusion.

2. Current Feature Selection Methods

This section discusses several supervised and unsupervised feature selection methods proposed in current studies.

2.1. Supervised Feature Selection Methods

For $D$ in (1), $P(c_k)$ denotes the prior probability that a text document belongs to class $c_k$, $P(t_j)$ denotes the prior probability that a text document contains the term $t_j$, and $P(t_j, c_k)$ denotes the joint probability that a text document belongs to $c_k$ and contains $t_j$ ($j = 1, 2, \ldots, m$; $k = 1, 2, \ldots, K$).

2.1.1. Information Gain (IG)

IG selects features according to the amount of information that a feature provides for classification [4-6]. The IG of $t_j$ in text classification is computed from the entropy of the class distribution and the conditional entropy of the classes given the presence or absence of $t_j$. Some studies have applied IG to feature selection for text classification [7-9], including improvements of IG [10, 11].
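Since the formula itself is not reproduced here, the sketch below (ours) shows one standard IG formulation for text, treating a term as present when its frequency in a document is nonzero; it is not necessarily the exact variant used in [4-6].

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(D, y, j):
    """IG of the j-th term for DTM D and class labels y."""
    D, y = np.asarray(D), np.asarray(y)
    present = D[:, j] > 0
    ig = entropy(y)
    for mask in (present, ~present):           # conditional entropies
        if mask.any():
            ig -= mask.mean() * entropy(y[mask])
    return ig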

2.1.2. Mutual Information (MI)

MI [12, 13] measures the dependence between two variables. In text classification, the MI of $t_j$ measures the dependence between the occurrence of $t_j$ and a class $c_k$ using the probabilities defined above. MI has been used in feature selection for text classification in current studies [14, 15]. Improvements on MI are also covered in current studies [16, 17].
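The sketch below (ours) computes the pointwise mutual information between the $j$-th term and a class $c$, a common MI variant in text feature selection; the exact formulation of [12, 13] may differ.

import numpy as np

def mutual_information(D, y, j, c):
    """Pointwise MI between presence of term j and class c."""
    D, y = np.asarray(D), np.asarray(y)
    present = D[:, j] > 0
    in_class = y == c
    p_tc = (present & in_class).mean()     # P(t_j, c)
    p_t, p_c = present.mean(), in_class.mean()
    return float(np.log(p_tc / (p_t * p_c))) if p_tc > 0 else float("-inf")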

2.1.3. Gini Index (GI)

GI was first introduced in CART (a typical decision tree algorithm) to evaluate the impurity of a data set [6, 18]. In text classification, the GI of $t_j$ evaluates the class impurity associated with the occurrence of $t_j$. GI has been used in feature selection for text classification [19]. Improvements of GI have also attracted the attention of current studies [20].
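As an illustration (ours) of the CART impurity idea, the sketch below scores the $j$-th term by the Gini impurity of the class labels of the documents containing it, with lower impurity suggesting a more class-specific term; the text-classification variant used in [19] may be formulated differently.

import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_score(D, y, j):
    """Class impurity of the documents that contain term j."""
    D, y = np.asarray(D), np.asarray(y)
    present = D[:, j] > 0
    return gini_impurity(y[present]) if present.any() else 1.0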

2.1.4. Expected Cross Entropy (ECE)

Like IG and MI, ECE [21] is a measure from information theory. The ECE of $t_j$ weighs how far the class distribution of the documents containing $t_j$ diverges from the prior class distribution. ECE has been applied to feature selection for text classification in some studies [22, 23], including improvements of ECE [24].
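The sketch below (ours) implements a standard ECE formulation, weighting by $P(t_j)$ the divergence between the class distribution of the documents containing $t_j$ and the prior class distribution; the exact variant in [21] may differ.

import numpy as np

def expected_cross_entropy(D, y, j):
    """ECE of the j-th term for DTM D and class labels y."""
    D, y = np.asarray(D), np.asarray(y)
    present = D[:, j] > 0
    if not present.any():
        return 0.0
    score = 0.0
    for c in np.unique(y):
        p_c = (y == c).mean()                  # prior class probability
        p_c_t = (y[present] == c).mean()       # class probability given t_j
        if p_c_t > 0:
            score += p_c_t * np.log(p_c_t / p_c)
    return float(present.mean() * score)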

2.2. Unsupervised Feature Selection Methods

For $D$ in (1), $\mu_j$ is the mean of the feature $t_j$, that is, $\mu_j = \frac{1}{n}\sum_{i=1}^{n} f_{ij}$.

2.2.1. Document Frequency (DF)

DF [7, 25] assumes that frequent terms are more informative than infrequent terms. The DF of $t_j$ is the total number of documents in which $t_j$ appears. DF is usually used as a criterion to evaluate the efficiency of other feature selection methods. A discriminative feature gets a high DF. Improvements on DF have been proposed in some studies [26, 27].
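Computing DF from a DTM is a one-liner; the sketch below (ours) counts, for every term, the number of documents with a nonzero frequency.

import numpy as np

def document_frequency(D):
    return (np.asarray(D) > 0).sum(axis=0)   # one DF value per term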

2.2.2. Variance Score (VS)

VS [25] ranks features by calculating the variance of each feature in $D$: $V(t_j) = \frac{1}{n}\sum_{i=1}^{n} (f_{ij} - \mu_j)^2$. A discriminative feature gets a high VS. VS is a simple feature selection method used in feature selection for text mining [28].
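VS is simply the per-column variance of the DTM; a minimal sketch (ours):

import numpy as np

def variance_score(D):
    return np.asarray(D, dtype=float).var(axis=0)   # one variance per term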

2.2.3. Laplacian Score (LS)

The key assumption of LS is that data points from the same class are often close to each other, and the importance of a feature is evaluated by its locality preserving power [3]. The LS of the $j$-th feature is described as follows:

$$L_j = \frac{\sum_{i,k} (f_{ij} - f_{kj})^2 S_{ik}}{\mathrm{Var}(f_j)},$$

where $S_{ik}$ evaluates the similarity between $d_i$ and $d_k$. $d_i$ and $d_k$ are considered similar if $d_i$ is among the $p$ nearest neighbors of $d_k$ or $d_k$ is among the $p$ nearest neighbors of $d_i$. $S_{ik}$ is given by the following formula:

$$S_{ik} = \begin{cases} e^{-\lVert d_i - d_k \rVert^2 / t}, & \text{if } d_i \text{ and } d_k \text{ are neighbors,} \\ 0, & \text{otherwise,} \end{cases}$$

where $t$ is a suitable constant. $S_{ik}$ is the $(i,k)$ element of the weight matrix $S$, which evaluates the similarities among the samples of $D$.

Define the vector $\mathbf{1} = (1, \ldots, 1)^{T}$ and the diagonal matrix $D_S = \mathrm{diag}(S\mathbf{1})$ with elements $D_{ii} = \sum_{k} S_{ik}$. The denominator $\mathrm{Var}(f_j)$ in the LS formula above is the weighted variance of the feature vector $f_j = (f_{1j}, \ldots, f_{nj})^{T}$, given by $\mathrm{Var}(f_j) = \tilde{f}_j^{T} D_S \tilde{f}_j$, where $\tilde{f}_j = f_j - \frac{f_j^{T} D_S \mathbf{1}}{\mathbf{1}^{T} D_S \mathbf{1}}\mathbf{1}$. The weighted variance can be simply replaced by the standard variance $V(t_j) = \frac{1}{n}\sum_{i=1}^{n} (f_{ij} - \mu_j)^2$.

For a discriminative feature, $(f_{ij} - f_{kj})^2$ is smaller whenever $S_{ik}$ is bigger and, on the other hand, $\mathrm{Var}(f_j)$ is bigger. Thus, the LS of a discriminative feature tends to be small (more details of LS are given in [3]).
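For reference, the following sketch (our own NumPy implementation, not code from [3] or this paper) computes LS as described above: a $p$-nearest-neighbor graph with heat-kernel weights, the graph Laplacian, and the ratio of the local difference term to the weighted variance. The defaults $p = 5$ and $t = 1.0$ are illustrative choices.

import numpy as np

def laplacian_score(X, p=5, t=1.0):
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    # pairwise squared Euclidean distances between documents
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # p-nearest-neighbor graph with heat-kernel weights, symmetrized
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(sq[i])[1:p + 1]            # skip the document itself
        S[i, nn] = np.exp(-sq[i, nn] / t)
    S = np.maximum(S, S.T)
    D = np.diag(S.sum(axis=1))
    L = D - S                                       # graph Laplacian
    one = np.ones(n)
    scores = np.empty(m)
    for j in range(m):
        f = X[:, j].copy()
        f = f - (f @ D @ one) / (one @ D @ one)     # subtract the weighted mean
        denom = f @ D @ f
        scores[j] = (f @ L @ f) / denom if denom > 0 else np.inf
    return scores                                   # smaller score = more important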

LS is able to evaluate and rank the importance of features, and it has been used in current studies. He et al. [3] propose LS for feature selection and apply it, together with data variance (VS), to select features from the CMU PIE face database for clustering. Experimental results show that LS performs better than data variance (VS) in several tests with different numbers of clusters. Hu and Man [29] apply LS to sparse coding. They use it to evaluate the locality preserving ability of candidate atoms from data examples and select highly ranked candidates. Experiments on UCI data demonstrate the effectiveness of their approach, which also indicates the efficiency of LS. Solorio-Fernández et al. [30] propose LS for supervised feature selection, where it is used to rank features in order to reduce the search space. This method outperforms the other feature selection methods on UCI data sets, and its good performance also demonstrates the robustness of LS in feature selection. Huang et al. [31] propose LS for face recognition. They use LS to rank the regular and irregular features in order to achieve the best performance more quickly, and it is demonstrated to be better than several other methods; their experiments show that it is capable of giving a better ranking of features. Chunlu et al. [32] study the classification of sMRI data in Alzheimer's disease. LS is used to select more discriminative features by evaluating their locality preserving abilities, and experimental results confirm the robustness of LS in feature selection. Bo et al. [33] apply LS to gene feature selection and develop a new method called locality sensitive Laplacian Score (LSLS) by combining LS and locality sensitive discriminant analysis. Experimental results indicate that the method is effective and that LS performs well.

Current studies also focus on improving LS. Liu et al. [34] propose an improved method called LSE (Laplacian Score combined with distance-based entropy measure) for unsupervised feature selection. LSE uses an entropy measure to replace the $k$-means clustering used with LS to overcome the drawbacks of LS. Experimental results on 6 UCI data sets demonstrate that LSE performs better than LS. Padungweang et al. [35] also propose an efficient method based on a strong constraint on the global topology of the data space. Experimental results indicate that, compared with conventional LS, this method performs better. Zhu et al. [36] propose an improved LS method called IterativeLS, which tries to improve LS by gradually refining the nearest neighbor graph, discarding the least relevant features at each iteration. Experimental results on both UCI data sets and face databases demonstrate that IterativeLS performs better than LS. Benabdeslem and Hindawi [37] address the problem of semisupervised feature selection for high-dimensional data and propose Constrained Laplacian Score (CLS) to improve LS. Experimental results on 5 UCI data sets demonstrate that CLS is better than LS for semisupervised feature selection. Moreover, Wang et al. [38] propose a new method termed Label Reconstruction based Laplacian Score (LRLS) to improve LS for semisupervised feature selection. Experimental results on three UCI data sets indicate that LRLS is clearly better than LS in most cases.

It is shown that LS is efficient and widely used in feature selection in current studies. It is used as a benchmark technique for comparison with new techniques in both unsupervised and supervised feature selection [35]. LS has been successfully used in areas such as sparse coding [29], face recognition [31], and gene feature selection [33]. This paper studies feature selection in text classification and argues that conventional LS is not appropriate for feature selection on a DTM. This paper proposes an unsupervised method termed DVS to improve LS for feature selection on a DTM. Section 3 introduces DVS and analyzes its details.

Current feature selection methods for text classification have been studied extensively, and many improvements have been proposed. LS has proved discriminative in several application areas; however, it has not yet been used in feature selection for text classification. We are interested in the performance of this proved-discriminative method in text classification. LS performs poorly in our experiments. Through further analysis, we found that LS is not appropriate for text classification because it is not able to select discriminative features from a usually sparse DTM. A new method called DVS is proposed to improve LS in text classification. DVS uses a difference measure to replace the similarity measure of the original LS. More discussion of DVS and LS is given in Section 4.2.3.

3. Distance Variance Score (DVS)

3.1. Feature Distance Contributions (DC)

The difference between DVS and LS is that DVS takes difference measures among the text documents in $D$ instead of similarity measures. Let $DC(t_j)$ denote the feature distance contribution of $t_j$. $DC(t_j)$ shows how much $t_j$ contributes to the differences among the text documents in $D$, so as to evaluate the importance of $t_j$ for text classification. $DC(t_j)$ is given by the following formula:

$$DC(t_j) = \sum_{i=1}^{n} \sum_{k=i+1}^{n} \frac{dist_j(d_i, d_k)}{dist(d_i, d_k)},$$

where $dist$ is a distance measure: $dist_j(d_i, d_k)$ evaluates the difference between $f_{ij}$ and $f_{kj}$, and $dist(d_i, d_k)$ evaluates the difference between $d_i$ and $d_k$. The ratio $dist_j(d_i, d_k)/dist(d_i, d_k)$ shows the importance of $t_j$ for the pair $(d_i, d_k)$. The value $dist_j(d_i, d_k)$ alone gives little information about how much $t_j$ contributes to $dist(d_i, d_k)$; however, if $dist(d_i, d_k)$ is taken into account, it is easy to understand how much $t_j$ contributes. City block distance is introduced to implement $dist$ because it is a well-known and easily computed distance measure [6]: $dist(d_i, d_k) = \sum_{j=1}^{m} \lvert f_{ij} - f_{kj} \rvert$ and $dist_j(d_i, d_k) = \lvert f_{ij} - f_{kj} \rvert$.

Consider a DTM in which two features make the same absolute contribution to two document distances. Looking at the absolute contributions alone, it is hard to tell whether the first feature contributes more to the first distance than the second feature contributes to the second distance. However, if the distances themselves are taken into account and, for example, the first feature contributes 40% of the first distance while the second feature contributes 50% of the second distance, it is clear that the second feature is more important for its pair of documents than the first feature is for its pair.

$DC(t_j)$ is therefore an efficient way to evaluate the importance of a given feature for the text documents.
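Applying this definition to the two-document DTM in (2) (with a single document pair, $DC(t_j)$ reduces to one ratio per feature) gives the following; the snippet is our own illustration.

d1 = [1, 1, 0, 0, 0, 1]   # "You are beautiful!"
d2 = [0, 0, 1, 1, 1, 1]   # "Good morning, you guys!"
dist = sum(abs(a - b) for a, b in zip(d1, d2))      # city block distance = 5
dc = [abs(a - b) / dist for a, b in zip(d1, d2)]    # [0.2, 0.2, 0.2, 0.2, 0.2, 0.0]
# "you" appears in both documents, contributes nothing to their difference, and
# gets DC = 0; each class-specific term contributes 20% of the distance.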


3.2. Algorithm of DVS

Let $D$ denote a document term matrix of $n$ text documents and let $d_i$ denote the $i$-th text document of $D$. Let $t_j$ denote the $j$-th term of $D$ and let $f_{ij}$ denote the term frequency of $t_j$ in $d_i$. Let $DVS(t_j)$ denote the DVS of $t_j$; it is computed from the standard variance $V(t_j)$ of $t_j$ and the feature distance contribution $DC(t_j)$ defined in Section 3.1.

DVS is constructed from two parts: the standard variance $V(t_j)$ (Section 2.2.2), where $\mu_j$ is the mean of $t_j$, and the feature distance contribution $DC(t_j)$. $V(t_j)$ reflects the global structure of a given feature and $DC(t_j)$ reflects its local structure. The key idea of DVS in choosing discriminative features is that a feature is discriminative if it contains much information ($V(t_j)$ is large) and it contributes much to the differences among documents ($DC(t_j)$ is large). DVS is thus able to capture differences among documents so as to separate text documents from different classes. The similarity terms $(f_{ij} - f_{kj})^2$ and $S_{ik}$ of the original LS (Section 2.2.3) are replaced by $DC(t_j)$ to evaluate differences instead of similarities. More details of $DC(t_j)$ are given in Section 3.1. For a discriminative feature $t_j$, if $V(t_j)$ and $DC(t_j)$ are large, $DVS(t_j)$ is large, too.

The feature selection algorithm using DVS is shown in Algorithm 1. The DVS of each $t_j$ is calculated by the procedure CalculateDVS, and all the features are sorted in decreasing order of DVS by the procedure DVSFeatureSelection in Algorithm 1. DVSFeatureSelection also creates a matrix $CB$ such that $CB[i][k] = dist(d_i, d_k)$. $CB$ stores the city block distances among documents to avoid recalculating each $dist(d_i, d_k)$ in CalculateDVS, so as to improve the efficiency of the algorithm. The $s$ features whose DVS values are the largest are chosen for text classification; $s$ is determined manually.

Procedure DVSFeatureSelection(D, s)
begin
let M denote the map storing each feature t_j and its DVS(t_j) (j = 1, 2, ..., m), M = ∅
 #CB is the n × n matrix storing city block distances among documents, CB[i][k] = dist(d_i, d_k)
 #when i = k, CB[i][k] = 0
for each d_i in D do
  for each d_k in D do
   CB[i][k] = dist(d_i, d_k);
  end
end
 #calculate DVS(t_j) for each t_j
for each t_j in D do
  DVS(t_j) = CalculateDVS(t_j, D, CB);
  M = M ∪ {(t_j, DVS(t_j))};
end
 #sort M in decreasing order according to DVS value
M = Sort(M);
 #select and return the first s features
return Select(M, s);
end
Procedure CalculateDVS(t_j, D, CB)
#CB is the n × n matrix storing city block distances among documents, CB[i][k] = dist(d_i, d_k)
#when i = k, CB[i][k] = 0
begin
let V(t_j) denote the standard variance of t_j
let DC(t_j) denote the feature distance contribution of t_j, DC(t_j) = 0
let I denote the index vector of samples in D, I = (1, 2, ..., n)
 #calculate DC(t_j). See Section 3.1
for each i in I do
  for each k in I with k > i do
   #the distance between f_ij and f_kj on feature t_j. See Section 3.1
   dist_j = |f_ij - f_kj|;
   #the city block distance between d_i and d_k. See Section 3.1
   dist = CB[i][k];
   DC(t_j) += dist_j / dist;
  end
end
return V(t_j) × DC(t_j);
end
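The following Python sketch mirrors Algorithm 1 on a NumPy array. The function and variable names, and the multiplicative combination of $V(t_j)$ and $DC(t_j)$ into the final score, reflect our reading of the algorithm above rather than code supplied with the paper.

import numpy as np

def dvs_select(D, s):
    """Return the indices of the s features with the largest DVS."""
    D = np.asarray(D, dtype=float)
    n, m = D.shape
    # CB[i][k]: city block distance between documents d_i and d_k
    CB = np.abs(D[:, None, :] - D[None, :, :]).sum(axis=2)
    V = D.var(axis=0)                       # global part: variance of each feature
    DC = np.zeros(m)
    for i in range(n):
        for k in range(i + 1, n):
            if CB[i, k] > 0:                # skip identical documents
                DC += np.abs(D[i] - D[k]) / CB[i, k]
    scores = V * DC                         # assumed combination of the two parts
    return np.argsort(scores)[::-1][:s]     # feature indices sorted by decreasing DVS

On the DTM in (2), for example, this ranks the five class-specific terms ahead of "you", whose variance and distance contribution are both 0.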

4. Experiments

4.1. Data

The DBWorld e-mails data set is taken from the UCI ML repository and contains 64 e-mails of two classes manually collected from the DBWorld mailing list. This data set is represented as a DTM in which each attribute corresponds to a term in the documents. The data set was donated on the 6th of November, 2011, for classification tasks. Stop words were removed during preprocessing. The donor has used it to train different classification algorithms.

The CNAE data set is also taken from the UCI ML repository. It contains 1080 documents of free-text business descriptions of Brazilian companies categorized into a subset of 9 categories. Each document of this data set is represented as a term vector, and the data set is transformed into a DTM for the experiments. The donor donated this data set on the 3rd of October, 2012.

Details about the two data sets are shown in Table 1.

4.2. Discriminative Features Experiments
4.2.1. Experiments of DVS

The DVS of each feature is calculated as defined in Section 3.2.

The DVS values are calculated by the procedure CalculateDVS in Algorithm 1, and all the features are sorted in decreasing order of DVS by the procedure DVSFeatureSelection in Algorithm 1. For both data sets, the first $s$ of the total features are selected by LS and by DVS from the feature spaces to compare the performance of LS and DVS. To show the performance of LS and DVS as $s$ rapidly decreases, $s$ = 2721, 1721, 721, 521, 321, 121, and 21 are chosen for the DBWorld data set, and $s$ = 656, 456, 256, 156, and 56 for the CNAE data set. DF can be used as a criterion to evaluate the performance of a feature selection method for text mining, so DF is also applied in the experiments to select features from the two data sets in order to evaluate the performance of DVS.

In order to compare the efficiency of LS and DVS for feature selection, the features selected by LS and DVS are tested with classifiers. Commonly used classifiers for text classification include Naïve Bayes [39, 40], $k$-nearest neighbor [40, 41], neural networks [42, 43], support vector machines (SVM) [40], and decision trees (DT) [40, 43]. This paper chooses DT (C5.0) and SVM to test the features selected by LS and DVS because DT has been used either as a main classification tool or as a baseline classifier and SVM offers two advantages for text classification, according to Sebastiani's paper [43].

Both data sets are separated into training sets and testing sets, with each training set containing 70% of the whole data set. DT (C5.0) and SVM classifiers are constructed to test the first $s$ features selected by LS and DVS. Meanwhile, DT (C5.0) and SVM classifiers constructed with all features are used as baselines to compare the efficiency of LS and DVS for feature selection.

For each $s$, 20 DT (C5.0) and 20 SVM classifiers are constructed using randomly separated training sets. A two-tailed paired $t$-test is used to test whether feature selection using LS or DVS affects classification accuracy. If the significances (Sig.'s) are lower than 0.05, the mean accuracy of the baseline and the mean accuracy of the classifiers using LS or DVS are considered different (the null hypothesis is rejected); that is, LS or DVS affects classification accuracy negatively or positively. If the Sig.'s are greater than 0.05, the means of the pairs are considered the same (the null hypothesis is not rejected); that is, LS or DVS has no effect on classification accuracy.
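The following sketch (ours) illustrates this protocol: 20 random 70/30 splits, classifiers trained on all features as the baseline and on the selected feature subset, and a two-tailed paired $t$-test on the paired accuracies. scikit-learn's CART DecisionTreeClassifier stands in for C5.0, which scikit-learn does not provide, and sklearn.svm.SVC() can replace it for the SVM runs.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_to_baseline(D, y, selected, runs=20):
    """Mean accuracy with all features vs. selected features, plus the paired p-value."""
    D, y = np.asarray(D), np.asarray(y)
    base_acc, sel_acc = [], []
    for r in range(runs):
        Xtr, Xte, ytr, yte = train_test_split(D, y, train_size=0.7, random_state=r)
        base_acc.append(DecisionTreeClassifier().fit(Xtr, ytr).score(Xte, yte))
        sel_acc.append(DecisionTreeClassifier()
                       .fit(Xtr[:, selected], ytr)
                       .score(Xte[:, selected], yte))
    _, p = ttest_rel(base_acc, sel_acc)   # Sig. > 0.05: no significant difference
    return float(np.mean(base_acc)), float(np.mean(sel_acc)), float(p)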

4.2.2. Experimental Results

Figure 1 shows the average accuracies for each $s$ of both the DT (C5.0) and SVM classifiers on the two data sets. Obviously, DVS outperforms LS, and the accuracy of classifiers using features selected by DVS stays at a high level as $s$ decreases from 2721 to 21 and from 656 to 56. Features selected by DVS are more effective than those selected by LS. In a very small feature subspace, classifiers using LS get low accuracies, whereas high accuracies can still be achieved using features selected by DVS. Figure 1 also shows that DVS performs as well as DF. Furthermore, Table 2 shows the average accuracies (mean) and standard deviations (std) of the classifiers on the testing sets and the results of the two-tailed paired $t$-test. For the DBWorld data set, most of the significances (Sig.'s) of DT (C5.0) or SVM using LS in Table 2(a) are much lower than 0.05, which means the accuracies of classifiers using features selected by LS are significantly different from the baseline. On the contrary, in Table 2(a), all the significances of the two classifiers using DVS are much higher than 0.05, showing that DVS achieves classification accuracies that are statistically the same as the baseline. For the CNAE data set, most of the significances (Sig.'s) of DT (C5.0) or SVM using LS in Table 2(b) are much lower than 0.05, which again means that the accuracies of classifiers using features selected by LS are significantly different from the baseline; classifiers using LS get very low accuracies when $s$ = 156 or 56. Table 2(b) shows that DT (C5.0) using DVS achieves accuracies statistically the same as the baseline. However, in Table 2(b), SVM using DVS does not achieve accuracies statistically the same as the baseline, but it still keeps relatively high accuracies. In summary, Figure 1 shows the low accuracies of classifiers using features selected by LS; features selected by LS have negative effects on text classification, especially when $s$ is very small. Figure 1 also shows the high accuracies of classifiers using features selected by DVS; features selected by DVS have positive effects on text classification. Finally, Figure 1 shows that DVS performs better with DT (C5.0) than with SVM.

The experimental results in this paper show that DVS is much better than LS for feature selection in text classification, especially as $s$ is reduced to a small number. Features selected by DVS have positive effects on text classification, while features selected by LS have negative effects. DVS is able to overcome the negative effects caused by the sparsity of the DTM.

4.2.3. More Discussion about DVS and LS

As mentioned above, the key assumption of LS is that documents of the same class are close, and the LS of a discriminative feature should be smaller. Consider the LS formula introduced in Section 2.2.3. LS uses $S_{ik}$ to measure the similarity between $d_i$ and $d_k$ and $(f_{ij} - f_{kj})^2$ to measure the similarity between $f_{ij}$ and $f_{kj}$. For a discriminative feature, $\sum_{i,k}(f_{ij} - f_{kj})^2 S_{ik}$ tends to be small, which in a sparse DTM indicates that each feature selected by LS contains many 0 values. Thus, LS is easily affected by the 0 values in $D$, and the sub-DTMs constructed from features selected by LS are also sparse. As $s$ decreases, the sub-DTMs contain less and less information, resulting in poor performance of classifiers in text classification. On the contrary, the key idea of DVS is that text documents of different classes are not close. DVS tries to discover differences instead of similarities among documents, and $DC(t_j)$ is used to evaluate how much a feature contributes to the differences among text documents. $DC(t_j)$ pays much attention to nonzero values and is able to avoid the negative effects caused by 0 values. Thus, DVS is capable of reducing the sparsity of the DTM and of selecting discriminative features that discover differences among text documents. By discovering differences among text documents, DVS is much more efficient than LS. The sparsity of the sub-DTMs of LS and DVS is shown in Table 3. Table 3 shows that as $s$ decreases, the sparsity of the sub-DTMs constructed from features selected by DVS also decreases, while the sparsity of the sub-DTMs constructed from features selected by LS stays at a high level.

5. Conclusion

This paper proposes an unsupervised feature selection method, DVS, for text classification. DVS aims to improve LS for feature selection in text classification. DVS uses feature distance contributions to evaluate the importance of a feature for text classification and overcomes the effects caused by the sparsity of the DTM. DVS efficiently reduces the sparsity of the DTM, and features selected by DVS have positive effects on text classification. Experimental results on the UCI DBWorld e-mails and CNAE data sets indicate that DVS performs much better than LS for feature selection in text classification.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by Project of National Social Sciences Foundation, Grant no. 13BTJ005.