Research Article

Improved Feature-Selection Method Considering the Imbalance Problem in Text Categorization

Table 4

Comparison of the nine existing feature-selection methods and their improved variants with respect to AUC on 20-Newsgroups for NB and SVM. Bold values indicate the best performance of each classifier across the feature-selection methods.

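| Feature selection | NB 400 | NB 800 | NB 1200 | NB 1600 | NB 2000 | SVM 400 | SVM 800 | SVM 1200 | SVM 1600 | SVM 2000 |
|---|---|---|---|---|---|---|---|---|---|---|
| IG | 0.7183 | 0.7579 | 0.7864 | 0.8048 | 0.8187 | 0.7746 | 0.7998 | 0.8144 | 0.8257 | 0.8318 |
| IGX | 0.7182 | 0.7578 | 0.7863 | 0.8045 | 0.8187 | 0.7774 | 0.8038 | 0.8183 | 0.8284 | 0.8361 |
| CHI | 0.8234 | 0.8545 | 0.8660 | 0.8717 | 0.8771 | 0.8485 | 0.8671 | 0.8702 | 0.8707 | 0.8690 |
| CHIX | 0.8346 | 0.8621 | 0.8710 | 0.8770 | 0.8815 | 0.8589 | 0.8726 | 0.8747 | 0.8763 | 0.8753 |
| MI | 0.5830 | 0.6283 | 0.7205 | 0.7405 | 0.7695 | 0.6165 | 0.6733 | 0.7569 | 0.7721 | 0.7905 |
| MIX | 0.6020 | 0.6597 | 0.7315 | 0.7677 | 0.7902 | 0.6269 | 0.6894 | 0.7700 | 0.7961 | 0.8089 |
| DF | 0.7685 | 0.8054 | 0.8302 | 0.8454 | 0.8531 | 0.8043 | 0.8224 | 0.8380 | 0.8464 | 0.8509 |
| DFX | 0.7763 | 0.8111 | 0.8338 | 0.8501 | 0.8585 | 0.8146 | 0.8297 | 0.8416 | 0.8500 | 0.8557 |
| GINI | 0.8569 | 0.8726 | 0.8778 | 0.8801 | 0.8829 | 0.8711 | 0.8678 | 0.8631 | 0.8609 | 0.8625 |
| GINIX | 0.8603 | 0.8749 | 0.8800 | 0.8843 | 0.8858 | 0.8720 | 0.8735 | 0.8690 | 0.8660 | 0.8672 |
| DIA | 0.5571 | 0.6040 | 0.6107 | 0.6165 | 0.6245 | 0.6331 | 0.6838 | 0.7086 | 0.7302 | 0.7419 |
| DIAX | 0.7429 | 0.7933 | 0.8219 | 0.8577 | 0.8728 | 0.7519 | 0.8038 | 0.8324 | 0.8585 | 0.8690 |
| CMFS | 0.8536 | 0.8671 | 0.8748 | 0.8828 | 0.8865 | 0.8689 | 0.8769 | 0.8747 | 0.8723 | 0.8705 |
| CMFSX | 0.8598 | 0.8736 | 0.8821 | 0.8887 | 0.8915 | 0.8722 | 0.8798 | 0.8788 | 0.8776 | 0.8766 |
| OCFS | 0.7083 | 0.7375 | 0.7892 | 0.8017 | 0.8113 | 0.7962 | 0.8164 | 0.8260 | 0.8301 | 0.8351 |
| OCFSX | 0.7550 | 0.7882 | 0.8091 | 0.8205 | 0.8290 | 0.7984 | 0.8193 | 0.8313 | 0.8365 | 0.8408 |
| DFPFS | 0.8175 | 0.8398 | 0.8499 | 0.8561 | 0.8600 | 0.8377 | 0.8446 | 0.8492 | 0.8521 | 0.8543 |
| DFPFSX | 0.7591 | 0.7873 | 0.8000 | 0.8056 | 0.8086 | 0.7952 | 0.8119 | 0.8181 | 0.8218 | 0.8240 |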
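
As a reading aid, each method can be compared with its "X" counterpart by averaging the AUC difference across the five NB columns. The snippet below is only an illustrative sketch and not part of the article's method. The names `nb_auc` and `mean_gain` are hypothetical, the pairing of rows is read off the row labels, and only three method pairs from the Naïve Bayes columns of Table 4 are copied in.

```python
# AUC values copied from Table 4 (Naive Bayes columns, 400 to 2000).
# Pairing a method with its "X" row is an assumption based on the row labels.
nb_auc = {
    "DIA":    [0.5571, 0.6040, 0.6107, 0.6165, 0.6245],
    "DIAX":   [0.7429, 0.7933, 0.8219, 0.8577, 0.8728],
    "CMFS":   [0.8536, 0.8671, 0.8748, 0.8828, 0.8865],
    "CMFSX":  [0.8598, 0.8736, 0.8821, 0.8887, 0.8915],
    "DFPFS":  [0.8175, 0.8398, 0.8499, 0.8561, 0.8600],
    "DFPFSX": [0.7591, 0.7873, 0.8000, 0.8056, 0.8086],
}


def mean_gain(table, base):
    """Mean AUC difference (X row minus base row) across the five columns."""
    baseline, variant = table[base], table[base + "X"]
    return sum(v - b for b, v in zip(baseline, variant)) / len(baseline)


# Hypothetical summary: one averaged difference per method pair.
gains = {name: mean_gain(nb_auc, name) for name in ("DIA", "CMFS", "DFPFS")}
for name, gain in gains.items():
    print(f"{name}: {gain:+.4f}")
```

A positive value means the "X" row scores higher on average than its base row for NB; a negative value means it scores lower. The same helper could be applied to the SVM columns by copying those values into a second dictionary.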