Research Article

Gamma-Poisson Distribution Model for Text Categorization

Table 1: Vocabulary size obtained by feature selection with CF.

Feature selection   20 Newsgroups  Reuters-21578  Industry Sector  TechTC-100
Initial vocabulary        103,135         22,792           64,680     103,003
CF ≥ 2                     63,285         13,809           38,107      55,250
CF ≥ 5                     34,152          7,268           21,681      29,070
CF ≥ 10                    21,845          4,455           14,767      18,946
CF ≥ 20                    13,792          2,626            9,885      12,254
CF ≥ 50                     7,166          1,224            5,722       6,593
CF ≥ 100                    4,056            689            3,580       3,937
CF ≥ 200                    2,091            345            2,057       2,249
CF ≥ 500                      693            110              854         936
CF ≥ 1000                     230             36              420         432
CF ≥ 2000                      57             18              159         195
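
The thresholding behind Table 1 can be illustrated with a minimal sketch. It assumes that CF denotes a term's collection frequency, that is, its total number of occurrences over all documents, and that a term is kept when its CF meets the threshold. The function names and the toy corpus are hypothetical and are not taken from the article or its datasets.

```python
from collections import Counter


def collection_frequency(docs):
    """Count total occurrences of each term over all tokenized documents.

    Assumption: CF is the corpus-wide occurrence count of a term.
    """
    cf = Counter()
    for tokens in docs:
        cf.update(tokens)
    return cf


def select_vocabulary(cf, min_cf):
    """Keep the terms whose collection frequency is at least min_cf."""
    return {term for term, count in cf.items() if count >= min_cf}


# Toy corpus (hypothetical, for illustration only).
docs = [
    ["gamma", "poisson", "model", "text"],
    ["poisson", "model", "text", "text"],
    ["text", "categorization", "model"],
]

cf = collection_frequency(docs)

# Vocabulary size for each threshold, analogous to one column of Table 1.
vocab_sizes = {t: len(select_vocabulary(cf, t)) for t in (1, 2, 3, 4)}
for threshold, size in vocab_sizes.items():
    print(f"CF >= {threshold}: {size} terms")
```

As in the table, raising the threshold can only shrink the vocabulary, since every term that passes a higher threshold also passes a lower one.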