Research Article

Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization

Table 6

Comparison of the nine existing feature-selection methods and their improved versions with respect to AUC on Reuters-21578 for Naïve Bayes (NB) and support vector machines (SVM). Bold values indicate the best performance of each classifier across the feature-selection methods.

Feature selection  Naïve Bayes                              Support vector machines
                   400     800     1200    1600    2000     400     800     1200    1600    2000

IG                 0.8978  0.9068  0.9073  0.9088  0.9093   0.9005  0.9050  0.9065  0.9079  0.9086
IGX                0.8977  0.9068  0.9073  0.9087  0.9093   0.9083  0.9124  0.9143  0.9159  0.9168

CHI                0.8988  0.9055  0.9093  0.9091  0.9112   0.9053  0.9071  0.9066  0.9070  0.9079
CHIX               0.8864  0.9058  0.9074  0.9101  0.9116   0.9090  0.9126  0.9143  0.9144  0.9138

MI                 0.6923  0.8521  0.8799  0.8902  0.8979   0.7957  0.8541  0.8810  0.8904  0.8974
MIX                0.7282  0.8256  0.8815  0.8928  0.9007   0.7357  0.8341  0.8897  0.8970  0.9070

DF                 0.8977  0.9075  0.9098  0.9107  0.9107   0.9012  0.9060  0.9091  0.9086  0.9090
DFX                0.9008  0.9091  0.9105  0.9111  0.9116   0.9103  0.9138  0.9159  0.9169  0.9167

GINI               0.9082  0.9119  0.9135  0.9138  0.9123   0.9083  0.9068  0.9075  0.9088  0.9083
GINIX              0.9112  0.9129  0.9139  0.9135  0.9135   0.9165  0.9147  0.9152  0.9157  0.9164

DIA                0.7005  0.7146  0.7249  0.7334  0.7565   0.8501  0.8673  0.8750  0.8790  0.8811
DIAX               0.7884  0.8654  0.9040  0.9066  0.9088   0.7954  0.8778  0.9124  0.9143  0.9159

CMFS               0.9109  0.9141  0.9133  0.9139  0.9134   0.9095  0.9094  0.9084  0.9087  0.9094
CMFSX              0.9071  0.9135  0.9143  0.9147  0.9144   0.9150  0.9171  0.9153  0.9156  0.9162

OCFS               0.8983  0.9074  0.9088  0.9092  0.9104   0.9027  0.9029  0.9046  0.9057  0.9075
OCFSX              0.8914  0.9065  0.9091  0.9095  0.9102   0.9068  0.9094  0.9111  0.9127  0.9136

DFPFS              0.8159  0.8157  0.8159  0.8161  0.8162   0.8875  0.8869  0.8865  0.8871  0.8863
DFPFSX             0.8828  0.8882  0.8884  0.8880  0.8883   0.9032  0.9066  0.9064  0.9062  0.9067