The Scientific World Journal
Volume 2014, Article ID 517498, 7 pages
http://dx.doi.org/10.1155/2014/517498
Research Article

Simple-Random-Sampling-Based Multiclass Text Classification Algorithm

1Department of Language Engineering, PLA University of Foreign Languages, Luoyang, Henan 471003, China
2College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China
3College of Humanities and Social Sciences, National University of Defense Technology, Changsha, Hunan 410073, China

Received 6 December 2013; Accepted 11 February 2014; Published 19 March 2014

Academic Editors: F. Yu and G. Yue

Copyright © 2014 Wuying Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Multiclass text classification (MTC) is a challenging problem, and MTC algorithms can be used in many applications. In the era of big data, the space-time overhead of these algorithms is a critical concern. By investigating the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory that stores labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. Experimental results on the TanCorp data set show that the SRSMTC algorithm achieves state-of-the-art performance with greatly reduced space-time requirements.
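The general idea of a retrieval-style classifier backed by a token-level memory, with simple random sampling to cut the number of tokens stored and looked up, can be illustrated with a minimal Python sketch. This is not the SRSMTC algorithm itself: the abstract does not specify the sampling scheme, the index layout, or the scoring rule, so the class name `TokenMemoryClassifier`, the `sample_rate` parameter, and the label-voting rule below are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


class TokenMemoryClassifier:
    """Toy retrieval-style multiclass classifier (illustrative assumption,
    not the paper's SRSMTC implementation)."""

    def __init__(self, sample_rate=0.5, seed=0):
        self.sample_rate = sample_rate          # hypothetical parameter
        self.rng = random.Random(seed)          # seeded for repeatability
        # Token-level memory: token -> Counter mapping label -> count.
        self.memory = defaultdict(Counter)

    def _sample(self, tokens):
        # Simple random sampling without replacement over the token list.
        tokens = list(tokens)
        if not tokens:
            return []
        k = max(1, round(len(tokens) * self.sample_rate))
        return self.rng.sample(tokens, k)

    def train(self, tokens, label):
        # Store only the sampled tokens of a labeled document.
        for tok in self._sample(tokens):
            self.memory[tok][label] += 1

    def predict(self, tokens):
        # Look up sampled query tokens and let the stored labels vote.
        votes = Counter()
        for tok in self._sample(tokens):
            votes.update(self.memory.get(tok, {}))
        return votes.most_common(1)[0][0] if votes else None


# Usage with sample_rate=1.0, so every token is kept and the result
# does not depend on the random draw.
clf = TokenMemoryClassifier(sample_rate=1.0)
clf.train(["football", "match", "goal"], "sports")
clf.train(["stock", "market", "bank"], "finance")
print(clf.predict(["goal", "match", "referee"]))  # prints "sports"
```

Lowering `sample_rate` shrinks both the memory and the number of lookups per query, which is the kind of space-time trade-off the abstract refers to. How much accuracy is retained at a given rate is an empirical question that this sketch does not address.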