Abstract

With the rapid development of web applications such as social networks, a large amount of electronic text data has accumulated and become available on the Internet, which has caused increasing interest in text mining. Text classification is one of the most important subfields of text mining. In practice, text documents are often represented as a high-dimensional sparse document term matrix (DTM) before classification. Feature selection is essential and vital for text classification due to the high dimensionality and sparsity of the DTM. An efficient feature selection method is capable of both reducing the dimensionality of the DTM and selecting discriminative features for text classification. Laplacian Score (LS) is an unsupervised feature selection method that has been successfully used in areas such as face recognition. However, LS is unable to select discriminative features for text classification and to effectively reduce the sparsity of the DTM. To improve it, this paper proposes an unsupervised feature selection method named Distance Variance Score (DVS). DVS uses the feature distance contribution (a ratio) to rank the importance of features for text documents so as to select discriminative features. Experimental results indicate that DVS is able to select discriminative features and to reduce the sparsity of the DTM. Thus, it is much more efficient than LS.

1. Introduction

Text classification is one of the important subfields of text mining, and it has recently gained increasing attention with the rapid development of web applications such as social networks. The problem of text classification can be described as follows. In practice, text documents are often represented as a document term matrix $D$ (shown in (1)). The row vector $d_i$ denotes the $i$-th text document among the $n$ text documents. Each feature $t_j$ corresponds to a term in the text documents, and $f_{ij}$ is the term frequency of $t_j$ in $d_i$. Consider a simple example of two short text documents. The first document is "You are beautiful!" and the second is "Good morning, you guys!". The DTM of the two documents is shown in (2). Here $t_1$ and $f_{11} = 1$ show that the first feature "are" occurs once in the first text document ("You are beautiful!"). Moreover, each $d_i$ is labeled with a class $y_i$ in which $d_i$ should be; $y_i$ is drawn from a predefined $K$-class label set $C = \{c_1, c_2, \ldots, c_K\}$. The major task of text classification is to construct classification models using training text documents and then to predict a class label ($y \in C$) for other text documents whose class labels are unknown.

DTM of $n$ Documents:
$$D = \begin{pmatrix} f_{11} & f_{12} & \cdots & f_{1m} \\ f_{21} & f_{22} & \cdots & f_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n1} & f_{n2} & \cdots & f_{nm} \end{pmatrix} \quad (1)$$

DTM of Two Documents:
$$D = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 1 & 1 \end{pmatrix} \quad (2)$$
where the columns correspond, in order, to the terms "are", "beautiful", "good", "guys", "morning", and "you".
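As an illustration (our own sketch, not code from the paper), the snippet below builds the DTM in (2) from the two example documents, using the alphabetical term order so that "are" is the first feature.

import re

docs = ["You are beautiful!", "Good morning, you guys!"]
tokens = [re.findall(r"[a-z]+", d.lower()) for d in docs]
vocab = sorted(set(w for doc in tokens for w in doc))
dtm = [[doc.count(term) for term in vocab] for doc in tokens]
print(vocab)  # ['are', 'beautiful', 'good', 'guys', 'morning', 'you']
print(dtm)    # [[1, 1, 0, 0, 0, 1], [0, 0, 1, 1, 1, 1]]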

The most important characteristics of text data are its high dimensionality and sparsity [1]. A corpus of text documents often contains a large number of distinct words (high dimensionality), while a given text document may contain only a small number of them (sparsity). This can be described in terms of $D$ as follows. Consider $D$ in (1). In most cases, the number of terms $m$ is very large, which makes $D$ high-dimensional. For a document $d_i$, a large number of the entries $f_{ij}$ are 0 because $d_i$ contains only a small fraction of the total terms. This makes $D$ highly sparse. Due to the high dimensionality and sparsity of text data, it is essential to find an efficient feature selection method to reduce the feature space. The reduction of the feature space causes a loss of information for text classification. A feature selection method is efficient if it keeps the classification accuracy of classifiers when the feature space is reduced.
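To make the sparsity notion concrete, the sketch below (ours) computes the fraction of zero entries of a DTM; real corpora typically give values much closer to 1 than the small example in (2).

import numpy as np

def sparsity(dtm):
    """Fraction of zero entries in a document term matrix."""
    dtm = np.asarray(dtm)
    return (dtm == 0).mean()

# For the 2 x 6 example DTM in (2), 5 of the 12 entries are zero:
print(sparsity([[1, 1, 0, 0, 0, 1], [0, 0, 1, 1, 1, 1]]))  # 0.4166...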

Feature selection is one of the important tasks in text classification due to the high dimensionality of the feature space and the existence of indiscriminative features [1]. Feature selection methods are able to reduce the high-dimensional indiscriminative feature space into a low-dimensional discriminative feature subspace [2]. For example, in (2), the feature "beautiful" and the feature "morning" may be discriminative features, and the six-dimensional DTM can then be reduced to a two-dimensional one. This paper focuses on filter feature selection, and all the methods introduced in this paper are filter feature selection methods. Filter methods are chosen because they are more computationally efficient than wrapper methods [2]. Many feature selection methods have been proposed. Supervised feature selection methods include information gain (IG), mutual information (MI), Gini index (GI), and expected cross-entropy (ECE). Unsupervised feature selection methods include document frequency (DF), variance score (VS) (or term variance), and LS.

LS is an unsupervised feature selection method. Its key assumption is that data points from the same class are often close to each other, and the importance of a feature is evaluated by its locality preserving power [3]. More details of LS are covered in Section 2.2.3. LS has been successfully used in areas such as face recognition, but so far it has not been used in feature selection for text classification. We are interested in the efficiency of this proved-discriminative feature selection method in text classification. However, both experiments and analysis indicate that LS is inefficient for text classification due to the sparsity of the DTM: LS is unable to keep the accuracy of classifiers when the feature space is reduced and is unable to efficiently reduce the sparsity of the DTM. To improve LS, this paper proposes an unsupervised feature selection method, DVS. DVS uses the feature distance contribution (a ratio) to replace the similarity measure used in the original LS to rank feature importance so as to select discriminative features. DVS is efficient in selecting discriminative features from a DTM because the feature distance contribution pays much attention to nonzero values and is able to avoid the negative effects caused by 0 values. DVS is also able to reduce the sparsity of the DTM more efficiently than LS. More details of DVS are covered in Section 3.

The rest of this paper is organized as follows. Section 2 introduces current feature selection methods and gives details of LS. Section 3 formally introduces DVS and analyzes its details. In Section 4, experiments are performed using two DTMs from the UCI repository to validate the efficiency of DVS. Further discussion of the efficiency of DVS and LS in text classification is given in Section 4.2.3. Finally, Section 5 gives the conclusion.

2. Current Feature Selection Methods

This section discusses several supervised and unsupervised feature selection methods proposed in current studies.

2.1. Supervised Feature Selection Methods

For $D$ in (1), $P(c_k)$ denotes the prior probability that a text document belongs to class $c_k$, $P(t_j)$ denotes the prior probability that a text document contains the term $t_j$, and $P(t_j, c_k)$ denotes the joint probability that a text document belongs to $c_k$ and contains $t_j$ ($j = 1, 2, \ldots, m$; $k = 1, 2, \ldots, K$).

2.1.1. Information Gain (IG)

IG selects features according to the amount of information that a feature provides for classification [4-6]. The IG of $t_j$ in text classification is computed from the entropy of the class distribution and the conditional entropy of the classes given the presence or absence of $t_j$. Some studies have applied IG to feature selection for text classification [7-9], including improvements of IG [10, 11].
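Since the formula itself is not reproduced here, the sketch below (ours) shows one standard IG formulation for text, treating a term as present when its frequency in a document is nonzero; it is not necessarily the exact variant used in [4-6].

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(D, y, j):
    """IG of the j-th term for DTM D and class labels y."""
    D, y = np.asarray(D), np.asarray(y)
    present = D[:, j] > 0
    ig = entropy(y)
    for mask in (present, ~present):           # conditional entropies
        if mask.any():
            ig -= mask.mean() * entropy(y[mask])
    return ig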

2.1.2. Mutual Information (MI)

MI [12, 13] measures the dependence between two variables. In text classification, the MI of $t_j$ measures the dependence between the occurrence of $t_j$ and a class $c_k$ using the probabilities defined above. MI has been used in feature selection for text classification in current studies [14, 15]. Improvements on MI are also covered in current studies [16, 17].
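The sketch below (ours) computes the pointwise mutual information between the $j$-th term and a class $c$, a common MI variant in text feature selection; the exact formulation of [12, 13] may differ.

import numpy as np

def mutual_information(D, y, j, c):
    """Pointwise MI between presence of term j and class c."""
    D, y = np.asarray(D), np.asarray(y)
    present = D[:, j] > 0
    in_class = y == c
    p_tc = (present & in_class).mean()     # P(t_j, c)
    p_t, p_c = present.mean(), in_class.mean()
    return float(np.log(p_tc / (p_t * p_c))) if p_tc > 0 else float("-inf")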

2.1.3. Gini Index (GI)

GI was first introduced in CART (a typical decision tree algorithm) to evaluate the impurity of a data set [6, 18]. In text classification, the GI of $t_j$ evaluates the class impurity associated with the occurrence of $t_j$. GI has been used in feature selection for text classification [19]. Improvements of GI have also attracted the attention of current studies [20].
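As an illustration (ours) of the CART impurity idea, the sketch below scores the $j$-th term by the Gini impurity of the class labels of the documents containing it, with lower impurity suggesting a more class-specific term; the text-classification variant used in [19] may be formulated differently.

import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def gini_score(D, y, j):
    """Class impurity of the documents that contain term j."""
    D, y = np.asarray(D), np.asarray(y)
    present = D[:, j] > 0
    return gini_impurity(y[present]) if present.any() else 1.0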

2.1.4. Expected Cross Entropy (ECE)

Like IG and MI, ECE [21] is a measure from information theory. The ECE of $t_j$ weighs how far the class distribution of the documents containing $t_j$ diverges from the prior class distribution. ECE has been applied to feature selection for text classification in some studies [22, 23], including improvements of ECE [24].
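The sketch below (ours) implements a standard ECE formulation, weighting by $P(t_j)$ the divergence between the class distribution of the documents containing $t_j$ and the prior class distribution; the exact variant in [21] may differ.

import numpy as np

def expected_cross_entropy(D, y, j):
    """ECE of the j-th term for DTM D and class labels y."""
    D, y = np.asarray(D), np.asarray(y)
    present = D[:, j] > 0
    if not present.any():
        return 0.0
    score = 0.0
    for c in np.unique(y):
        p_c = (y == c).mean()                  # prior class probability
        p_c_t = (y[present] == c).mean()       # class probability given t_j
        if p_c_t > 0:
            score += p_c_t * np.log(p_c_t / p_c)
    return float(present.mean() * score)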

2.2. Unsupervised Feature Selection Methods

For $D$ in (1), $\mu_j$ is the mean of the feature $t_j$, that is, $\mu_j = \frac{1}{n}\sum_{i=1}^{n} f_{ij}$.

2.2.1. Document Frequency (DF)

DF [7, 25] assumes that frequent terms are more informative than infrequent terms. The DF of $t_j$ is the total number of documents in which $t_j$ appears. DF is usually used as a criterion to evaluate the efficiency of other feature selection methods. A discriminative feature gets a high DF. Improvements on DF have been proposed in some studies [26, 27].
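Computing DF from a DTM is a one-liner; the sketch below (ours) counts, for every term, the number of documents with a nonzero frequency.

import numpy as np

def document_frequency(D):
    return (np.asarray(D) > 0).sum(axis=0)   # one DF value per term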

2.2.2. Variance Score (VS)

VS [25] ranks features by calculating the variance of each feature in $D$: $V(t_j) = \frac{1}{n}\sum_{i=1}^{n} (f_{ij} - \mu_j)^2$. A discriminative feature gets a high VS. VS is a simple feature selection method used in feature selection for text mining [28].
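VS is simply the per-column variance of the DTM; a minimal sketch (ours):

import numpy as np

def variance_score(D):
    return np.asarray(D, dtype=float).var(axis=0)   # one variance per term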

2.2.3. Laplacian Score (LS)

The key assumption of LS is that data points from the same class are often close to each other, and the importance of a feature is evaluated by its locality preserving power [3]. The LS of the $j$-th feature is described as follows:

$$L_j = \frac{\sum_{i,k} (f_{ij} - f_{kj})^2 S_{ik}}{\mathrm{Var}(f_j)},$$

where $S_{ik}$ evaluates the similarity between $d_i$ and $d_k$. $d_i$ and $d_k$ are considered similar if $d_i$ is among the $p$ nearest neighbors of $d_k$ or $d_k$ is among the $p$ nearest neighbors of $d_i$. $S_{ik}$ is given by the following formula:

$$S_{ik} = \begin{cases} e^{-\lVert d_i - d_k \rVert^2 / t}, & \text{if } d_i \text{ and } d_k \text{ are neighbors,} \\ 0, & \text{otherwise,} \end{cases}$$

where $t$ is a suitable constant. $S_{ik}$ is the $(i,k)$ element of the weight matrix $S$, which evaluates the similarities among the samples of $D$.

Define the vector $\mathbf{1} = (1, \ldots, 1)^{T}$ and the diagonal matrix $D_S = \mathrm{diag}(S\mathbf{1})$ with elements $D_{ii} = \sum_{k} S_{ik}$. The denominator $\mathrm{Var}(f_j)$ in the LS formula above is the weighted variance of the feature vector $f_j = (f_{1j}, \ldots, f_{nj})^{T}$, given by $\mathrm{Var}(f_j) = \tilde{f}_j^{T} D_S \tilde{f}_j$, where $\tilde{f}_j = f_j - \frac{f_j^{T} D_S \mathbf{1}}{\mathbf{1}^{T} D_S \mathbf{1}}\mathbf{1}$. The weighted variance can be simply replaced by the standard variance $V(t_j) = \frac{1}{n}\sum_{i=1}^{n} (f_{ij} - \mu_j)^2$.

For a discriminative feature, $(f_{ij} - f_{kj})^2$ is smaller whenever $S_{ik}$ is bigger and, on the other hand, $\mathrm{Var}(f_j)$ is bigger. Thus, the LS of a discriminative feature tends to be small (more details of LS are given in [3]).
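For reference, the following sketch (our own NumPy implementation, not code from [3] or this paper) computes LS as described above: a $p$-nearest-neighbor graph with heat-kernel weights, the graph Laplacian, and the ratio of the local difference term to the weighted variance. The defaults $p = 5$ and $t = 1.0$ are illustrative choices.

import numpy as np

def laplacian_score(X, p=5, t=1.0):
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    # pairwise squared Euclidean distances between documents
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # p-nearest-neighbor graph with heat-kernel weights, symmetrized
    S = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(sq[i])[1:p + 1]            # skip the document itself
        S[i, nn] = np.exp(-sq[i, nn] / t)
    S = np.maximum(S, S.T)
    D = np.diag(S.sum(axis=1))
    L = D - S                                       # graph Laplacian
    one = np.ones(n)
    scores = np.empty(m)
    for j in range(m):
        f = X[:, j].copy()
        f = f - (f @ D @ one) / (one @ D @ one)     # subtract the weighted mean
        denom = f @ D @ f
        scores[j] = (f @ L @ f) / denom if denom > 0 else np.inf
    return scores                                   # smaller score = more important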

LS is able to evaluate and rank the importance of features, and it has been used in current studies. He et al. [3] propose LS for feature selection and apply it, together with data variance (VS), to select features from the CMU PIE face database for clustering. Experimental results show that LS performs better than data variance (VS) in several tests with different numbers of clusters. Hu and Man [29] apply LS to sparse coding. They use it to evaluate the locality preserving ability of candidate atoms from data examples and select highly ranked candidates. Experiments on UCI data demonstrate the effectiveness of their approach, which also indicates the efficiency of LS. Solorio-Fernández et al. [30] propose LS for supervised feature selection, where it is used to rank features in order to reduce the search space. This method outperforms the other feature selection methods on UCI data sets, and its good performance also demonstrates the robustness of LS in feature selection. Huang et al. [31] propose LS for face recognition. They use LS to rank the regular and irregular features in order to achieve the best performance more quickly, and it is demonstrated to be better than several other methods; their experiments show that it is capable of giving a better ranking of features. Chunlu et al. [32] study the classification of sMRI data in Alzheimer's disease. LS is used to select more discriminative features by evaluating their locality preserving abilities, and experimental results confirm the robustness of LS in feature selection. Bo et al. [33] apply LS to gene feature selection and develop a new method called locality sensitive Laplacian Score (LSLS) by combining LS and locality sensitive discriminant analysis. Experimental results indicate that the method is effective and that LS performs well.

Current studies also focus on improving LS. Liu et al. [34] propose an improved method called LSE (Laplacian Score combined with distance-based entropy measure) for unsupervised feature selection. LSE uses an entropy measure to replace the $k$-means clustering used with LS to overcome the drawbacks of LS. Experimental results on 6 UCI data sets demonstrate that LSE performs better than LS. Padungweang et al. [35] also propose an efficient method based on a strong constraint on the global topology of the data space. Experimental results indicate that, compared with conventional LS, this method performs better. Zhu et al. [36] propose an improved LS method called IterativeLS, which tries to improve LS by gradually refining the nearest neighbor graph, discarding the least relevant features at each iteration. Experimental results on both UCI data sets and face databases demonstrate that IterativeLS performs better than LS. Benabdeslem and Hindawi [37] address the problem of semisupervised feature selection for high-dimensional data and propose Constrained Laplacian Score (CLS) to improve LS. Experimental results on 5 UCI data sets demonstrate that CLS is better than LS for semisupervised feature selection. Moreover, Wang et al. [38] propose a new method termed Label Reconstruction based Laplacian Score (LRLS) to improve LS for semisupervised feature selection. Experimental results on three UCI data sets indicate that LRLS is clearly better than LS in most cases.

It is shown that LS is efficient and widely used in feature selection in current studies. It is used as a benchmark technique for comparison with new techniques in both unsupervised and supervised feature selection [35]. LS has been successfully used in areas such as sparse coding [29], face recognition [31], and gene feature selection [33]. This paper studies feature selection in text classification and argues that conventional LS is not appropriate for feature selection on a DTM. This paper proposes an unsupervised method termed DVS to improve LS for feature selection on a DTM. Section 3 introduces DVS and analyzes its details.

Current feature selection methods for text classification have been studied extensively, and many improvements have been proposed. LS has proved discriminative in several application areas; however, it has not yet been used in feature selection for text classification. We are interested in the performance of this proved-discriminative method in text classification. LS performs poorly in our experiments. Through further analysis, we found that LS is not appropriate for text classification because it is not able to select discriminative features from a usually sparse DTM. A new method called DVS is proposed to improve LS in text classification. DVS uses a difference measure to replace the similarity measure of the original LS. More discussion of DVS and LS is given in Section 4.2.3.

3. Distance Variance Score (DVS)

3.1. Feature Distance Contributions (DC)

The difference between DVS and LS is that DVS takes difference measures among the text documents in $D$ instead of similarity measures. Let $DC(t_j)$ denote the feature distance contribution of $t_j$. $DC(t_j)$ shows how much $t_j$ contributes to the differences among the text documents in $D$, so as to evaluate the importance of $t_j$ for text classification. $DC(t_j)$ is given by the following formula:

$$DC(t_j) = \sum_{i=1}^{n} \sum_{k=i+1}^{n} \frac{dist_j(d_i, d_k)}{dist(d_i, d_k)},$$

where $dist$ is a distance measure: $dist_j(d_i, d_k)$ evaluates the difference between $f_{ij}$ and $f_{kj}$, and $dist(d_i, d_k)$ evaluates the difference between $d_i$ and $d_k$. The ratio $dist_j(d_i, d_k)/dist(d_i, d_k)$ shows the importance of $t_j$ for the pair $(d_i, d_k)$. The value $dist_j(d_i, d_k)$ alone gives little information about how much $t_j$ contributes to $dist(d_i, d_k)$; however, if $dist(d_i, d_k)$ is taken into account, it is easy to understand how much $t_j$ contributes. City block distance is introduced to implement $dist$ because it is a well-known and easily computed distance measure [6]: $dist(d_i, d_k) = \sum_{j=1}^{m} \lvert f_{ij} - f_{kj} \rvert$ and $dist_j(d_i, d_k) = \lvert f_{ij} - f_{kj} \rvert$.

Consider a DTM in which two features make the same absolute contribution to two document distances. Looking at the absolute contributions alone, it is hard to tell whether the first feature contributes more to the first distance than the second feature contributes to the second distance. However, if the distances themselves are taken into account and, for example, the first feature contributes 40% of the first distance while the second feature contributes 50% of the second distance, it is clear that the second feature is more important for its pair of documents than the first feature is for its pair.

$DC(t_j)$ is therefore an efficient way to evaluate the importance of a given feature for the text documents.
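Applying this definition to the two-document DTM in (2) (with a single document pair, $DC(t_j)$ reduces to one ratio per feature) gives the following; the snippet is our own illustration.

d1 = [1, 1, 0, 0, 0, 1]   # "You are beautiful!"
d2 = [0, 0, 1, 1, 1, 1]   # "Good morning, you guys!"
dist = sum(abs(a - b) for a, b in zip(d1, d2))      # city block distance = 5
dc = [abs(a - b) / dist for a, b in zip(d1, d2)]    # [0.2, 0.2, 0.2, 0.2, 0.2, 0.0]
# "you" appears in both documents, contributes nothing to their difference, and
# gets DC = 0; each class-specific term contributes 20% of the distance.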


3.2. Algorithm of DVS

Let $D$ denote a document term matrix of $n$ text documents and let $d_i$ denote the $i$-th text document of $D$. Let $t_j$ denote the $j$-th term of $D$ and let $f_{ij}$ denote the term frequency of $t_j$ in $d_i$. Let $DVS(t_j)$ denote the DVS of $t_j$; it is computed from the standard variance $V(t_j)$ of $t_j$ and the feature distance contribution $DC(t_j)$ defined in Section 3.1.

DVS is constructed from two parts: the standard variance $V(t_j)$ (Section 2.2.2), where $\mu_j$ is the mean of $t_j$, and the feature distance contribution $DC(t_j)$. $V(t_j)$ reflects the global structure of a given feature and $DC(t_j)$ reflects its local structure. The key idea of DVS in choosing discriminative features is that a feature is discriminative if it contains much information ($V(t_j)$ is large) and it contributes much to the differences among documents ($DC(t_j)$ is large). DVS is thus able to capture differences among documents so as to separate text documents from different classes. The similarity terms $(f_{ij} - f_{kj})^2$ and $S_{ik}$ of the original LS (Section 2.2.3) are replaced by $DC(t_j)$ to evaluate differences instead of similarities. More details of $DC(t_j)$ are given in Section 3.1. For a discriminative feature $t_j$, if $V(t_j)$ and $DC(t_j)$ are large, $DVS(t_j)$ is large, too.

The feature selection algorithm using DVS is shown in Algorithm 1. The DVS of each $t_j$ is calculated by the procedure CalculateDVS, and all the features are sorted in decreasing order of DVS by the procedure DVSFeatureSelection in Algorithm 1. DVSFeatureSelection also creates a matrix $CB$ such that $CB[i][k] = dist(d_i, d_k)$. $CB$ stores the city block distances among documents to avoid recalculating each $dist(d_i, d_k)$ in CalculateDVS, so as to improve the efficiency of the algorithm. The $s$ features whose DVS values are the largest are chosen for text classification; $s$ is determined manually.

Procedure DVSFeatureSelection(D, s)
begin
let M denote the map storing each feature t_j and its DVS(t_j) (j = 1, 2, ..., m), M = ∅
 #CB is the n × n matrix storing city block distances among documents, CB[i][k] = dist(d_i, d_k)
 #when i = k, CB[i][k] = 0
for each d_i in D do
  for each d_k in D do
   CB[i][k] = dist(d_i, d_k);
  end
end
 #calculate DVS(t_j) for each t_j
for each t_j in D do
  DVS(t_j) = CalculateDVS(t_j, D, CB);
  M = M ∪ {(t_j, DVS(t_j))};
end
 #sort M in decreasing order according to DVS value
M = Sort(M);
 #select and return the first s features
return Select(M, s);
end
Procedure CalculateDVS(t_j, D, CB)
#CB is the n × n matrix storing city block distances among documents, CB[i][k] = dist(d_i, d_k)
#when i = k, CB[i][k] = 0
begin
let V(t_j) denote the standard variance of t_j
let DC(t_j) denote the feature distance contribution of t_j, DC(t_j) = 0
let I denote the index vector of samples in D, I = (1, 2, ..., n)
 #calculate DC(t_j). See Section 3.1
for each i in I do
  for each k in I with k > i do
   #the distance between f_ij and f_kj on feature t_j. See Section 3.1
   dist_j = |f_ij - f_kj|;
   #the city block distance between d_i and d_k. See Section 3.1
   dist = CB[i][k];
   DC(t_j) += dist_j / dist;
  end
end
return V(t_j) × DC(t_j);
end
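The following Python sketch mirrors Algorithm 1 on a NumPy array. The function and variable names, and the multiplicative combination of $V(t_j)$ and $DC(t_j)$ into the final score, reflect our reading of the algorithm above rather than code supplied with the paper.

import numpy as np

def dvs_select(D, s):
    """Return the indices of the s features with the largest DVS."""
    D = np.asarray(D, dtype=float)
    n, m = D.shape
    # CB[i][k]: city block distance between documents d_i and d_k
    CB = np.abs(D[:, None, :] - D[None, :, :]).sum(axis=2)
    V = D.var(axis=0)                       # global part: variance of each feature
    DC = np.zeros(m)
    for i in range(n):
        for k in range(i + 1, n):
            if CB[i, k] > 0:                # skip identical documents
                DC += np.abs(D[i] - D[k]) / CB[i, k]
    scores = V * DC                         # assumed combination of the two parts
    return np.argsort(scores)[::-1][:s]     # feature indices sorted by decreasing DVS

On the DTM in (2), for example, this ranks the five class-specific terms ahead of "you", whose variance and distance contribution are both 0.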

4. Experiments

4.1. Data

The DBWorld e-mails data set is taken from the UCI ML repository and contains 64 e-mails of two classes manually collected from the DBWorld mailing list. This data set is represented as a DTM in which each attribute corresponds to a term in the documents. The data set was donated on the 6th of November, 2011, for classification tasks. Stop words were removed during preprocessing. The donor has used it to train different classification algorithms.

The CNAE data set is also taken from the UCI ML repository. It contains 1080 documents of free-text business descriptions of Brazilian companies categorized into a subset of 9 categories. Each document of this data set is represented as a term vector, and the data set is transformed into a DTM for the experiments. The donor donated this data set on the 3rd of October, 2012.

Details about the two data sets are shown in Table 1.

4.2. Discriminative Features Experiments
4.2.1. Experiments of DVS

The DVS of each feature is calculated as defined in Section 3.2.

The DVS values are calculated by the procedure CalculateDVS in Algorithm 1, and all the features are sorted in decreasing order of DVS by the procedure DVSFeatureSelection in Algorithm 1. For both data sets, the first $s$ of the total features are selected by LS and by DVS from the feature spaces to compare the performance of LS and DVS. To show the performance of LS and DVS as $s$ rapidly decreases, $s$ = 2721, 1721, 721, 521, 321, 121, and 21 are chosen for the DBWorld data set, and $s$ = 656, 456, 256, 156, and 56 for the CNAE data set. DF can be used as a criterion to evaluate the performance of a feature selection method for text mining, so DF is also applied in the experiments to select features from the two data sets in order to evaluate the performance of DVS.

In order to compare the efficiency of LS and DVS for feature selection, the features selected by LS and DVS are tested with classifiers. Commonly used classifiers for text classification include Naïve Bayes [39, 40], $k$-nearest neighbor [40, 41], neural networks [42, 43], support vector machines (SVM) [40], and decision trees (DT) [40, 43]. This paper chooses DT (C5.0) and SVM to test the features selected by LS and DVS because DT has been used either as a main classification tool or as a baseline classifier and SVM offers two advantages for text classification, according to Sebastiani's paper [43].

Both data sets are separated into training sets and testing sets, with each training set containing 70% of the whole data set. DT (C5.0) and SVM classifiers are constructed to test the first $s$ features selected by LS and DVS. Meanwhile, DT (C5.0) and SVM classifiers constructed with all features are used as baselines to compare the efficiency of LS and DVS for feature selection.

For each $s$, 20 DT (C5.0) and 20 SVM classifiers are constructed using randomly separated training sets. A two-tailed paired $t$-test is used to test whether feature selection using LS or DVS affects classification accuracy. If the significances (Sig.'s) are lower than 0.05, the mean accuracy of the baseline and the mean accuracy of the classifiers using LS or DVS are considered different (the null hypothesis is rejected); that is, LS or DVS affects classification accuracy negatively or positively. If the Sig.'s are greater than 0.05, the means of the pairs are considered the same (the null hypothesis is not rejected); that is, LS or DVS has no effect on classification accuracy.
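The following sketch (ours) illustrates this protocol: 20 random 70/30 splits, classifiers trained on all features as the baseline and on the selected feature subset, and a two-tailed paired $t$-test on the paired accuracies. scikit-learn's CART DecisionTreeClassifier stands in for C5.0, which scikit-learn does not provide, and sklearn.svm.SVC() can replace it for the SVM runs.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_to_baseline(D, y, selected, runs=20):
    """Mean accuracy with all features vs. selected features, plus the paired p-value."""
    D, y = np.asarray(D), np.asarray(y)
    base_acc, sel_acc = [], []
    for r in range(runs):
        Xtr, Xte, ytr, yte = train_test_split(D, y, train_size=0.7, random_state=r)
        base_acc.append(DecisionTreeClassifier().fit(Xtr, ytr).score(Xte, yte))
        sel_acc.append(DecisionTreeClassifier()
                       .fit(Xtr[:, selected], ytr)
                       .score(Xte[:, selected], yte))
    _, p = ttest_rel(base_acc, sel_acc)   # Sig. > 0.05: no significant difference
    return float(np.mean(base_acc)), float(np.mean(sel_acc)), float(p)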

4.2.2. Experimental Results

Figure 1 shows the average accuracies for each $s$ of both the DT (C5.0) and SVM classifiers on the two data sets. Obviously, DVS outperforms LS, and the accuracy of classifiers using features selected by DVS stays at a high level as $s$ decreases from 2721 to 21 and from 656 to 56. Features selected by DVS are more effective than those selected by LS. In a very small feature subspace, classifiers using LS get low accuracies, whereas high accuracies can still be achieved using features selected by DVS. Figure 1 also shows that DVS performs as well as DF. Furthermore, Table 2 shows the average accuracies (mean) and standard deviations (std) of the classifiers on the testing sets and the results of the two-tailed paired $t$-test. For the DBWorld data set, most of the significances (Sig.'s) of DT (C5.0) or SVM using LS in Table 2(a) are much lower than 0.05, which means the accuracies of classifiers using features selected by LS are significantly different from the baseline. On the contrary, in Table 2(a), all the significances of the two classifiers using DVS are much higher than 0.05, showing that DVS achieves classification accuracies that are statistically the same as the baseline. For the CNAE data set, most of the significances (Sig.'s) of DT (C5.0) or SVM using LS in Table 2(b) are much lower than 0.05, which again means that the accuracies of classifiers using features selected by LS are significantly different from the baseline; classifiers using LS get very low accuracies when $s$ = 156 or 56. Table 2(b) shows that DT (C5.0) using DVS achieves accuracies statistically the same as the baseline. However, in Table 2(b), SVM using DVS does not achieve accuracies statistically the same as the baseline, but it still keeps relatively high accuracies. In summary, Figure 1 shows the low accuracies of classifiers using features selected by LS; features selected by LS have negative effects on text classification, especially when $s$ is very small. Figure 1 also shows the high accuracies of classifiers using features selected by DVS; features selected by DVS have positive effects on text classification. Finally, Figure 1 shows that DVS performs better with DT (C5.0) than with SVM.

The experimental results in this paper show that DVS is much better than LS for feature selection in text classification, especially as $s$ is reduced to a small number. Features selected by DVS have positive effects on text classification, while features selected by LS have negative effects. DVS is able to overcome the negative effects caused by the sparsity of the DTM.

4.2.3. More Discussion about DVS and LS

As mentioned above, the key assumption of LS is that documents of the same class are close, and the LS of a discriminative feature should be smaller. Consider the LS formula introduced in Section 2.2.3. LS uses $S_{ik}$ to measure the similarity between $d_i$ and $d_k$ and $(f_{ij} - f_{kj})^2$ to measure the similarity between $f_{ij}$ and $f_{kj}$. For a discriminative feature, $\sum_{i,k}(f_{ij} - f_{kj})^2 S_{ik}$ tends to be small, which in a sparse DTM indicates that each feature selected by LS contains many 0 values. Thus, LS is easily affected by the 0 values in $D$, and the sub-DTMs constructed from features selected by LS are also sparse. As $s$ decreases, the sub-DTMs contain less and less information, resulting in poor performance of classifiers in text classification. On the contrary, the key idea of DVS is that text documents of different classes are not close. DVS tries to discover differences instead of similarities among documents, and $DC(t_j)$ is used to evaluate how much a feature contributes to the differences among text documents. $DC(t_j)$ pays much attention to nonzero values and is able to avoid the negative effects caused by 0 values. Thus, DVS is capable of reducing the sparsity of the DTM and of selecting discriminative features that discover differences among text documents. By discovering differences among text documents, DVS is much more efficient than LS. The sparsity of the sub-DTMs of LS and DVS is shown in Table 3. Table 3 shows that as $s$ decreases, the sparsity of the sub-DTMs constructed from features selected by DVS also decreases, while the sparsity of the sub-DTMs constructed from features selected by LS stays at a high level.

5. Conclusion

This paper proposes an unsupervised feature selection method, DVS, for text classification. DVS aims to improve LS for feature selection in text classification. DVS uses feature distance contributions to evaluate the importance of a feature for text classification and overcomes the effects caused by the sparsity of the DTM. DVS efficiently reduces the sparsity of the DTM, and features selected by DVS have positive effects on text classification. Experimental results on the UCI DBWorld e-mails and CNAE data sets indicate that DVS performs much better than LS for feature selection in text classification.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research was supported by Project of National Social Sciences Foundation, Grant no. 13BTJ005.