Research Article

An Embedded-Based Weighted Feature Selection Algorithm for Classifying Web Document

Table 1

The usage of machine learning algorithms for document classification—a summary.

Reference | Method | Summary

[30] | DragPushing | (i) Proposes a kNN optimization that automatically balances the data points evenly across all classes to avoid model misfit.
(ii) The value of k is set to 7 in their experiment.
(iii) Three datasets are used: Reuters-21578, Industry Sector, and TDT-5.
[31] | Prototype selection | (i) Much faster than [30].
(ii) Prototype selection recommends the most portable prototypes for training purposes.
(iii) [31] eliminates most of the data points in training to increase speed.
[32] | ForesTexter | (i) The Gini index can easily predict the skewness in the majority class and create many subtrees to balance the data points.
(ii) Can work faster than [30, 31].
(iii) [32] combines feature subspace selection with a splitting criterion to create multiple subtrees that balance the data.
[33] | Resampling | (i) Handles the imbalance problem better than [32] by performing resampling.
(ii) Instance weighting assigns weights to the imbalanced class so that the end performance (in terms of accuracy) is balanced.
(iii) The proposed method is validated with an SVM classifier.
[34] | Topic modeling | (i) Instead of balancing the data points across all the classes, this method uses a topic model to construct new data points in each class and so create a complete dataset.
(ii) This method considers more data points than [30, 32, 33] because topic modeling can construct new data points.
[35] | Bag of concepts | (i) Aims to reduce the dimensions of the document matrix representation.
(ii) Instead of recommending data points from the dataset (as in [31]), the bag of concepts groups one or more data points into topics.
(iii) Bag of concepts solves many problems of traditional bag-of-words models, such as high dimensionality and sparsity.
[36] | Ontology-based deep learning model | (i) Addresses the limitation of [35], which does not consider the relationships among the documents.
(ii) The features and their relationships are extracted based on deep learning.
(iii) The ontology enhancement proposed in [36] helps to reduce the high dimensions.
(iv) This method consumes more time in training the samples.
Proposed | Weighted feature selection | (i) Resolves the imbalance problem by assigning weights to the most important features.
(ii) Three classifiers are used, namely, kNN, SVM, and Naïve Bayes.
(iii) [35] fails to detect the relationships among the documents. In this paper, the proposed system detects these relationships and considers them for classification.
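
Two ideas that recur in Table 1 are measuring class skew with the Gini index ([32]) and weighting classes to counter imbalance ([33]). The table does not give the formulas those works use, so the sketch below is only an illustration under stated assumptions: it uses the standard Gini impurity, 1 - sum(p_c^2), and the common inverse-frequency weighting n / (num_classes * count_c), which is also the heuristic behind scikit-learn's "balanced" class weights. The function names and the toy labels are hypothetical and are not taken from any of the cited methods.

```python
from collections import Counter


def gini_impurity(labels):
    """Standard Gini impurity of a label list: 1 - sum over classes of p_c^2.

    0.0 means every label is the same class; larger values mean the
    labels are spread more evenly across classes.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def balanced_class_weights(labels):
    """Inverse-frequency class weights: n / (num_classes * count_c).

    Assumption: this is a generic weighting heuristic, not the exact
    instance-weighting scheme of [33].
    """
    n = len(labels)
    counts = Counter(labels)
    num_classes = len(counts)
    return {cls: n / (num_classes * count) for cls, count in counts.items()}


# Toy, made-up label distribution: 8 documents in one class, 2 in another.
labels = ["sports"] * 8 + ["politics"] * 2

print("Gini impurity:", gini_impurity(labels))
print("Class weights:", balanced_class_weights(labels))
```

In this toy example the minority class receives the larger weight, so a classifier that accepts per-class or per-instance weights is penalized more for errors on the rarer class.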