Computational Intelligence and Neuroscience

Special Issue: Explainable and Reliable Machine Learning by Exploiting Large-Scale and Heterogeneous Data

Research Article | Open Access

Kelin Shen, Peinan Hao, Ran Li, "A Compressive Sensing Model for Speeding Up Text Classification", Computational Intelligence and Neuroscience, vol. 2020, Article ID 8879795, 11 pages, 2020. https://doi.org/10.1155/2020/8879795

A Compressive Sensing Model for Speeding Up Text Classification

Academic Editor: Nian Zhang
Received: 25 Jun 2020
Revised: 07 Jul 2020
Accepted: 18 Jul 2020
Published: 07 Aug 2020

Abstract

Text classification plays an important role in various applications of big data by automatically classifying massive text documents. However, high dimensionality and sparsity of text features have presented a challenge to efficient classification. In this paper, we propose a compressive sensing- (CS-) based model to speed up text classification. Using CS to reduce the size of feature space, our model has a low time and space complexity while training a text classifier, and the restricted isometry property (RIP) of CS ensures that pairwise distances between text features can be well preserved in the process of dimensionality reduction. In particular, by structural random matrices (SRMs), CS is free from computation and memory limitations in the construction of random projections. Experimental results demonstrate that CS effectively accelerates the text classification while hardly causing any accuracy loss.

1. Introduction

With the advancement of information technology over the last decade, digital resources have penetrated all fields of society, generating big data, which present a new challenge to data mining and information retrieval [1]. Texts are ubiquitous in daily life, and, given their sheer numbers, how to organize and manage them remains an open question [2]. As one of the fundamental techniques in natural language processing (NLP), text classification assigns labels or categories to texts according to their content, and it is key to solving the problem of text overload [3]. In its broad applications, such as sentiment analysis, topic labeling, spam detection, and intent detection, text classification supports the efficient query and search of texts, attracting much attention from both academia and industry [4, 5].

Word matching (WM), the simplest method of text classification, determines the category of a text from the categories of most of its words [6]. However, due to the ambiguity of word meanings, WM fails to provide satisfying accuracy. By representing words as vectors, the vector space model (VSM) [7] improves the accuracy of text classification, thus replacing WM as the popular method, but it requires many handcrafted rules and great effort from professionals to label texts, which is costly. As machine learning (ML) [8] has developed, the accuracy of text classification has been further improved. By extracting features from a text to train a classifier, ML reforms VSM and avoids rule-based inference. Recently, the rapidly developing field of deep learning (DL) [9], a branch of ML, has made text classification more efficient. However, the high dimensionality and sparsity of text features pose a challenge to ML, restricting the practical use of ML-based text classification.

In ML, many classifiers can be used to classify texts, such as support vector machine (SVM) [10], decision tree [11], adaptive boosting (AdaBoost) [12], K-nearest neighbor (KNN) [13], and Naïve Bayes [14]. To train these classifiers, texts must be represented as feature vectors by a feature extraction model, the commonest of which is Bag of Words (BOW) [15]. BOW encodes every text by the term frequencies of the n-grams in the vocabulary constructed by N-Gram [16]. Because the vocabulary may run into millions of entries, BOW faces the curse of dimensionality; that is, it produces a sparse representation of huge dimensionality, making it impractical to train classifiers. Therefore, dimensionality reduction (DR) is used to reduce the size of the feature space. In DR, the most common techniques, including principal component analysis (PCA) [17], independent component analysis (ICA) [18], and nonnegative matrix factorization (NMF) [19], still introduce time and memory overhead because they must learn their transforms from the training data. Many DL networks use an autoencoder to compress the size of parameters. An autoencoder is a neural network trained to copy its input to its output. Popular architectures include the sparse autoencoder [20], denoising autoencoder [21], and variational autoencoder [22]. Internally, they have a hidden layer that describes a code used to represent the input. When embedded into a neural network, the autoencoder can end up learning a low-dimensional representation very similar to PCA's.

Compared with the above-mentioned DR techniques, random projection [23, 24] is a better choice, since it avoids model training, but storing the random projections remains a challenge due to the huge dimensionality of text features. Compressive sensing (CS) [25–27], which has been developing rapidly in recent years, can be regarded as a random projection technique specialized for sparse vectors, and its theory proves that a sparse vector can be perfectly recovered from a small number of random projections. CS retains the advantages of random projection in DR and further overcomes the memory problem with the help of structural random matrices (SRMs) [28, 29], which makes CS a promising DR technique for text classification. In view of these merits, we use CS to speed up the training of text classifiers in this paper. For low time and memory complexity, SRMs are selected as the CS measurement matrices to reduce the size of the sparse feature vectors. Experimental results demonstrate that CS effectively accelerates text classification while hardly causing any accuracy loss.

The rest of this paper is organized as follows. Section 2 briefly reviews text classification and CS theory. Section 3 describes the CS model for text classification in detail. Section 4 presents experimental results, and finally Section 5 concludes this paper.

2. Background

2.1. Text Classification

Given a text dataset D = {d1, d2, …, dL} of L documents and a set C = {c1, c2, …, cJ} of J predefined categories, the goal of text classification is to learn a mapping f from inputs di ∈ D to outputs cj ∈ C. If J = 2, it is called binary classification; if J > 2, it is called multiclass classification. The mapping f is called the classifier, and it is trained by being fed with a labeled dataset, where each document in D has been assigned a category from C by professionals in advance. The trained classifier f is used to make predictions on new documents which are not included in D. Because of the subjectivity of text labeling, a test dataset is still needed to evaluate the prediction accuracy of f.

A typical flow of text classification is illustrated in Figure 1. In text preprocessing, we tokenize each document in D, erase punctuation, and remove unnecessary words such as stop words, misspellings, and slang. To reduce the size of the vocabulary built from D, some operations, e.g., case folding, lemmatization, and stemming, can also be added. After text preprocessing, feature extraction is performed to represent the documents in D as feature vectors, which is a crucial step for the accuracy and complexity of text classification. By N-Gram, we collect n-grams from D as the vocabulary of the BOW model. It is very common to use unigrams and bigrams, where a unigram is a single word and a bigram is a word pair. Each document in D is encoded as a feature vector based on the frequency distribution of its n-grams over the BOW vocabulary. The size of the feature vector equals that of the BOW vocabulary, resulting in a huge-dimensional feature space. By using DR techniques, the dimensionality can be significantly decreased, reducing the time complexity and memory consumption when training the classifier. The feature vector of a document is also highly sparse because the number of its n-grams is far smaller than the size of the BOW vocabulary. This high sparsity makes it possible to realize DR by CS without loss of classification accuracy. Compared with traditional DR methods, CS not only avoids the computation invested in learning a transform but also reduces the memory burden of constructing random projections. In this paper, we use CS to reduce the feature dimensionality and demonstrate its efficiency in speeding up text classification.
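As a concrete illustration of the BOW encoding described above, the following Python sketch builds a vocabulary from the top-N1 unigrams and top-N2 bigrams and encodes a document as a term-frequency vector. The function names and the whitespace tokenizer are our own simplifications; the paper's pipeline additionally lemmatizes and removes stop words.

```python
from collections import Counter

def build_vocab(docs, n1, n2):
    """Build a BOW vocabulary from the top-n1 unigrams and top-n2 bigrams."""
    uni, bi = Counter(), Counter()
    for doc in docs:
        toks = doc.lower().split()
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    vocab = [w for w, _ in uni.most_common(n1)]
    vocab += [" ".join(p) for p, _ in bi.most_common(n2)]
    return {term: i for i, term in enumerate(vocab)}

def bow_vector(doc, vocab):
    """Encode a document as a term-frequency vector over the vocabulary."""
    vec = [0] * len(vocab)
    toks = doc.lower().split()
    for term in toks + [" ".join(p) for p in zip(toks, toks[1:])]:
        if term in vocab:
            vec[vocab[term]] += 1
    return vec
```

Note how bigrams let the representation keep negations such as "not good" as a single vocabulary entry, as discussed in Section 3.2.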

2.2. Compressive Sensing

CS is a novel sampling paradigm that goes beyond the traditional Nyquist/Shannon theorem, showing that a signal can be recovered precisely from only a small set of samples. The success of CS relies on two principles: sparsity and incoherence. The former defines an S-sparse signal s in R^N as one with at most S nonzero entries; the latter requires the measurement vectors to be incoherent with s. The following briefly describes the CS framework.

By ordering these measurement vectors as the rows of a matrix, a measurement matrix Φ ∈ R^(M×N) is constructed as follows:

Φ = [φ1, φ2, …, φM]^T,  φi ∈ R^N. (1)

By using Φ to linearly measure s, we obtain the sampled vector y ∈ R^M by

y = Φ·s. (2)

We define the ratio M/N as the subrate R; that is, R = M/N. DR is realized by setting R to be less than 1, but finding s from y then becomes an ill-posed problem. Based on the sparsity of s, this problem can be solved by the optimization model

ŝ = argmin_s ||s||0  subject to  y = Φ·s, (3)

where ||·||0 is the l0 norm counting the number of nonzero entries in s, and the solution ŝ is an estimate of s. The incoherence between the φi and s affects whether the solution converges to the original s, which presents a challenge for CS: how to construct incoherent measurement vectors. Fortunately, random vectors are largely incoherent with any fixed signal, so Φ can be produced by some random distribution, for example, Gaussian, Bernoulli, or uniform.

By performing incoherent measurement with random matrices, CS can be categorized as a random projection technique for DR. In particular, to enhance the robustness of recovery, CS requires Φ to further hold the restricted isometry property (RIP) for S-sparse signals. When RIP holds, Φ approximately preserves the Euclidean length of S-sparse signals, which implies that all pairwise distances between S-sparse signals are well preserved in the measurement space. In text classification, the feature vectors of the documents in a text dataset are highly sparse, so the RIP of CS allows the feature dimensionality to be significantly reduced while preserving the pairwise distances between feature vectors. Superior to traditional DR methods, CS ensures less memory consumption and faster computation through SRMs. In view of these merits, we exploit CS features extracted by SRMs to speed up text classification.
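The distance preservation that RIP guarantees can be checked numerically. The sketch below is our own illustration, using a plain Gaussian measurement matrix rather than an SRM: it projects two S-sparse vectors from R^N down to R^M and compares their pairwise distance before and after measurement.

```python
import numpy as np

# Check that a random projection with M >> S nearly preserves the
# distance between two S-sparse vectors (an illustration of RIP).
rng = np.random.default_rng(42)
N, M, S = 2000, 400, 20  # ambient dimension, measurements, sparsity

def sparse_vec():
    v = np.zeros(N)
    v[rng.choice(N, size=S, replace=False)] = rng.standard_normal(S)
    return v

x1, x2 = sparse_vec(), sparse_vec()
# Gaussian measurement matrix, scaled so E[||Phi v||^2] = ||v||^2
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

d_orig = np.linalg.norm(x1 - x2)
d_meas = np.linalg.norm(Phi @ (x1 - x2))
rel_err = abs(d_meas - d_orig) / d_orig  # small when M is large enough
```

Shrinking M toward S makes rel_err grow, which mirrors the subrate/accuracy trade-off evaluated in Section 4.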

3. Proposed CS-Based Text Classification

3.1. Framework Description

Figure 2 presents the framework of the proposed CS-based text classification. After text preprocessing, the text dataset is divided into a training dataset P and a testing dataset Q, where the former is used to train classifiers and the latter to evaluate classification accuracy. The core of our work is to extract CS features to represent the documents in the text dataset. In CS feature extraction, we represent each document pi in the training dataset P as a highly sparse vector xi by BOW and construct an SRM Φ ∈ R^(M×N) to linearly measure xi, producing the CS feature vector yi of xi. A CS feature is a low-dimensional and dense vector, which shortens the time of training a classifier, especially on a large-scale text dataset. In the following parts, we describe CS feature extraction, SRM construction, and the classifiers in detail.

3.2. CS Feature Extraction

We collect unigrams and bigrams from the training dataset P to create the vocabulary of the BOW model. Unigrams are single words from P, and most of them occur too few times to impact classification, so we only add the top N1 words among these unigrams to the BOW vocabulary. Bigrams are word pairs from P, and they are a good way to model negation such as “not good.” The total number of bigrams is very large, but most of them are noise at the tail of the frequency spectrum, so we use the top N2 word pairs among these bigrams, adding them to the BOW vocabulary. In the experiment part, we set suitable N1 and N2 for the different classification tasks.

After collecting unigrams and bigrams, we convert each document pi in P into a feature vector xi in sparse representation. The BOW feature xi is the frequency distribution of pi over the BOW vocabulary, and its size is N, the sum of N1 and N2. All BOW features form a feature matrix X as follows:

X = [x1, x2, …, xL1] ∈ R^(N×L1), (4)

where L1 is the number of documents in P. In ordinary classification, X is input into the classifier to train it. Because of its large size, X results in the curse of dimensionality; for example, when N and L1 are set to 25000 and 800000, respectively, the size of X is 25000 × 800000, and it needs a memory of 8 × 10^10 bytes (≈75 GB), assuming that 4 bytes encode each entry in X. That would lead to a heavy computational burden, so we reduce the size of X by CS measuring as follows:

Y = Φ·X, (5)

where Φ ∈ R^(M×N) is a CS measurement matrix and Y ∈ R^(M×L1) is the CS feature matrix, whose i-th column yi is the CS feature vector of the i-th document pi in the training dataset P.

To precisely recover signals, the CS measurement matrix is required to hold RIP. In practice, a random matrix, e.g., one produced by a Gaussian or Bernoulli distribution, obeys RIP for S-sparse signals provided that

M ≥ 4·S (6)

is satisfied [30]. M can be set far smaller than N since BOW features are highly sparse, so the size of Y can be significantly reduced. Importantly, RIP can be strengthened or weakened by widening or narrowing the gap between M and S; that is, when M is far larger than 4·S, the pairwise distances between S-sparse signals are well preserved in the CS feature space, and these pairwise distances are destroyed as M gradually decreases, so the subrate R becomes a key factor impacting the accuracy of text classification. In the experiment part, we will evaluate the effects of different R values on the pairwise distances between features and on the accuracy of classification. In general, random projections are dense, and a common computer does not have sufficient memory to store them, so CS-based DR is not applicable to a large-scale dataset if a traditional method is used to produce the random projections. However, CS offers measurement matrices for large-scale and real-time applications, among which the most famous are SRMs. The following describes how to construct SRMs so as to make CS-based DR feasible for a large-scale dataset.

3.3. SRM Construction

SRM, proposed by Do et al. [28], is a well-known sensing framework in the field of CS. With its fast and efficient implementation, it brings several benefits to CS-based DR: low complexity, fast computation, block-based processing support, and optimal incoherence. By using SRMs, the length of a BOW feature can be reduced quickly and with little memory consumption while RIP still holds.

SRM is defined as a product of three matrices; that is,

Φ = √(N/M)·D·F·E, (7)

where E ∈ R^(N×N) is a random permutation matrix that uniformly permutes the locations of vector entries globally; F ∈ R^(N×N) is an orthonormal matrix constructed by a popular fast computable transform, e.g., the Fast Fourier Transform (FFT), Discrete Cosine Transform (DCT), Walsh-Hadamard Transform (WHT), or their block diagonal versions; D ∈ R^(M×N) is a random subset of M rows of the N × N identity matrix that subsamples the input vector; and √(N/M) is a scale factor that normalizes the transform so that the energy of the subsampled vector is almost the same as that of the input vector. By plugging (7) into (5), the matrix product Φ·X can be performed by the sensing procedure shown in Algorithm 1. The SRM sensing algorithm is fast; its computational complexity is typically on the order of O(N) to O(N log N). If F is an FFT or DCT matrix, the implementation of SRM takes O(N log N) operations. The SRM is used to measure the L1 BOW features one by one, which takes O(L1·N log N) operations; that is, the total computational complexity of the proposed CS model is O(L1·N log N). Compared with existing random projection techniques, SRMs not only incur lower time and space complexity but also convert the sampled vector into a white-noise-like one by scrambling the vector structure to achieve universal incoherence. Therefore, SRMs can make CS-based text classification more efficient.

Algorithm 1: SRM sensing.

Task: Perform Φ·X in which Φ is an SRM.
Input: The BOW feature matrix X = [x1, …, xi, …, xL1], the measurement number M, and a fast transform operator F(·).
Main iteration: Iterate on i until i > L1 is satisfied.
(1) Prerandomization: randomize xi by uniformly permuting its sample locations. This step corresponds to multiplying xi by E.
(2) Transform: apply the fast transform F(·), e.g., FFT or DCT, to the randomized vector.
(3) Subsampling: randomly pick M samples out of the N transform coefficients. This step corresponds to multiplying the transform coefficients by D.
Output: The CS feature matrix Y = [y1, …, yi, …, yL1].
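The three steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration with names of our own choosing; the orthonormal DCT is built as an explicit matrix for clarity, which costs O(N^2), whereas a fast transform routine would bring one measurement down to O(N log N) as stated above.

```python
import numpy as np

def srm_measure(x, M, rng):
    """One SRM measurement y = sqrt(N/M) * D @ F @ E @ x (Algorithm 1),
    using an orthonormal DCT-II as the fast transform F."""
    N = x.size
    # Step 1 (E): uniformly permute the sample locations
    xp = x[rng.permutation(N)]
    # Step 2 (F): orthonormal DCT-II, built as an explicit matrix for
    # clarity; a fast transform would cost O(N log N) instead of O(N^2)
    k, n = np.arange(N)[:, None], np.arange(N)[None, :]
    F = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    F[0, :] /= np.sqrt(2.0)
    t = F @ xp
    # Step 3 (D): keep a random subset of M coefficients, with the
    # sqrt(N/M) scale that keeps the energy close to that of x
    return np.sqrt(N / M) * t[rng.choice(N, size=M, replace=False)]
```

With M = N the chain is a norm-preserving orthonormal transform; subsampling to M < N is what realizes the subrate R = M/N.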
3.4. Classifiers

Many popular classifiers can be used in our model, e.g., SVM, decision tree, AdaBoost, KNN, and Naïve Bayes. In the experiment part, these classifiers are applied and their classification accuracy is evaluated to verify the efficiency of our model. This section reviews these popular classifiers in text classification.

SVM [10] is a nonprobabilistic linear binary classifier. For a training set of points (yi, li), where yi is the CS feature vector and li is the category of the document di, we try to find the maximum-margin hyperplane that divides the points with li = 1 from those with li = −1. The equation of the hyperplane is as follows:

w^T·y + b = 0, (8)

where w is the normal vector of the hyperplane and b is the bias.

We maximize the margin, denoted by γ,

γ = 2/‖w‖, (9)

to separate the points well. By the error-correcting output codes (ECOC) model [31], SVM can also undertake multiclass classification tasks.

Decision tree [11] is a classifier model in which each internal node of the tree represents a test on an attribute of the dataset, its children represent the outcomes of the test, and the leaf nodes represent the final categories of the data points. The training dataset is used to build the decision tree, and the best decision has to be made for each node in the tree. A decision tree can be trained quickly, but it is also extremely sensitive to small perturbations in the dataset and can easily overfit. By cross validation and pruning, these effects can be suppressed.

AdaBoost [12] extracts a classifier from the set of weak classifiers at each iteration and assigns a weight to the classifier according to its relevance. AdaBoost weights each training sample according to how difficult previous classifiers have found it to classify correctly. At each iteration, a new classifier is trained on the training dataset, and the sample weights are modified based on how successfully each training sample has been classified so far. Training terminates after a number of iterations or when all training samples are classified correctly.

KNN [13] is a nonparametric technique used for classification. Given a CS feature yi, KNN finds the K nearest neighbors of yi among all CS features in the training dataset and scores each candidate category based on the labels of the K neighbors. The similarity between yi and a neighbor can serve as the score of that neighbor's category. After sorting the score values, KNN assigns yi to the category with the highest score. KNN is easy to implement and adapts to any kind of feature space; it can also handle multiclass cases. The performance of KNN depends on finding a meaningful distance function, and it is limited by data storage when searching for the nearest neighbors in large-scale problems.
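A minimal sketch of the KNN voting just described, assuming Euclidean distance and simple majority voting (the simplest scoring choice); the function name and interface are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(Y_train, labels, y, k=3):
    """Classify CS feature y by majority vote among its k nearest
    training features under Euclidean distance."""
    dists = np.linalg.norm(Y_train - y, axis=1)   # distance to each row
    nearest = np.argsort(dists)[:k]               # indices of k nearest
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```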

Naïve Bayes [14] has been widely used for text classification; it is a generative model based on Bayes' theorem. The model assumes that the value of a particular feature is independent of the value of any other feature; in our setting, this means assuming that any entry of a CS feature vector is independent of the other entries. Given a to-be-tested CS feature y, its category is predicted as follows:

l* = argmax_{l∈C} p(l|y). (10)

According to Bayes' inference, we see that

p(l|y) ∝ p(l)·∏_{m=1}^{M} p(ym|l), (11)

where ym is the m-th entry of the CS feature y. The probabilities p(l) and p(ym|l) can be estimated by maximum likelihood on the training dataset.
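A minimal sketch of this rule for dense CS features. The paper only says p(l) and p(ym|l) are estimated by maximum likelihood; since CS features are real-valued, we assume a Gaussian form for p(ym|l) here (our choice, not specified by the paper), and work in log space for numerical stability:

```python
import numpy as np

def nb_fit(Y, labels):
    """Estimate the prior p(l) and per-entry Gaussian parameters of
    p(y_m | l) by maximum likelihood, one (mean, var) pair per class."""
    labels = np.asarray(labels)
    return {l: (np.mean(labels == l),
                Y[labels == l].mean(axis=0),
                Y[labels == l].var(axis=0) + 1e-9)   # variance floor
            for l in set(labels)}

def nb_predict(model, y):
    """Pick the class maximizing log p(l) + sum_m log p(y_m | l)."""
    def log_posterior(prior, mu, var):
        return np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (y - mu) ** 2 / var)
    return max(model, key=lambda l: log_posterior(*model[l]))
```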

4. Experimental Results

4.1. Dataset and Setting

We conduct experiments on two datasets, one for a binary classification task and the other for a multiclass classification task. For the binary task, we use the Twitter sentiment dataset, whose tweets were crawled and labeled as positive or negative. For the multiclass task, we use the weather report dataset, which contains a text description and a category label for each event, including thunderstorm wind, hail, flash flood, high wind, and winter weather. The classes of both datasets are imbalanced, especially in the weather report dataset. To avoid the effects of imbalance on classification accuracy, the two datasets are preprocessed to balance their classes; i.e., for the Twitter sentiment dataset, we randomly remove some positive and negative observations so that each class has 10000 observations; for the weather report dataset, we delete the classes with few observations, and 9 classes remain (thunderstorm wind, hail, flash flood, high wind, winter weather, marine thunderstorm wind, winter storm, heavy rain, and flood), each of which has 1000 observations. Figure 3 presents the statistics of the Twitter sentiment dataset and the weather report dataset after balancing. For each dataset, 20% of the observations in each class are set aside at random for testing. In feature extraction, we first preprocess the documents in both datasets as follows: (1) tokenize the documents; (2) lemmatize the words; (3) erase punctuation; (4) remove a list of stop words such as “and,” “of,” and “the”; (5) remove words with 2 or fewer characters; (6) remove words with 15 or more characters. Then, for both datasets, we collect the top 8000 unigrams and the top 10000 bigrams from the training set to construct the BOW vocabulary, i.e., N1 = 8000 and N2 = 10000, and represent each training observation as a BOW feature vector of length N = 18000.
Finally, by setting different subrates, SRMs are used to measure the BOW feature vectors, producing the corresponding CS feature vectors. We train different classifiers on the BOW-based and CS-based training sets, respectively, tune parameters by cross validation, and evaluate the classifiers on the test sets. Due to the random partition of the dataset, training and testing are repeated five times, and the mean testing accuracy is used as the evaluation metric.

The experimental settings are as follows. To evaluate the effects of different SRMs on feature distance and classification accuracy, we construct five SRMs by using transform matrices F including DCT, FFT, Block DCT, Block WHT, and Block Gaussian, where the latter three are block diagonal matrices whose diagonal blocks are DCT, WHT, and Gaussian matrices of size 32 × 32, respectively. We use five classifiers, namely, SVM, decision tree, AdaBoost, KNN, and Naïve Bayes, to evaluate the classification accuracy of our model, and we compare the proposed CS model with three DR methods: PCA [17], ICA [18], and NMF [19]. The subrate R is set between 0.1 and 0.6; it is a preset parameter that decides the length of the CS feature vector. All experiments are conducted under the following computer configuration: Intel(R) Core(TM) i7 @3.30 GHz CPU, 8 GB RAM, Microsoft Windows 7 64-bit, and MATLAB Version 9.6.0 (R2019a). The datasets and experimental codes can be downloaded from the SIGMULL Team Website: http://www.scholat.com/showTeamScholar.html?id=1234&changeTo=Ch&nav=4.

4.2. Effects of SRMs

Feature distance measures the similarity between any two documents, which has a significant impact on training accuracy. If the features output by DR can well preserve their pairwise distances in the original space, DR suppresses the loss of training accuracy; therefore, we evaluate the effects of SRMs on the pairwise distances between text features. In the training set P, the average distances between the i-th BOW or CS feature and the others are computed as follows:

dX(i) = (1/(L1 − 1))·Σ_{j≠i} ‖xi − xj‖2, (12)

dY(i) = (1/(L1 − 1))·Σ_{j≠i} ‖yi − yj‖2, (13)

where xi and yi are, respectively, the i-th BOW and CS feature vectors in P and L1 is the number of documents in P. We select Block DCT as the core of the SRM and use (12) and (13) to compute the average distance of each BOW and CS feature, as shown in Figure 4. We can see that the tendencies of all distance curves are similar, and the curve of the CS features trends closer to that of the BOW features as the subrate increases, which indicates that the pairwise distances between BOW features correspond to those between CS features. To measure the distance differences between BOW and CS features, we compute the Mean Square Error (MSE) between their average distances as follows:

MSE = (1/L1)·Σ_{i=1}^{L1} (dX(i) − dY(i))^2. (14)
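The average-distance and MSE computations just described can be sketched as follows; this is a direct, memory-hungry implementation suitable only for small L1, with names of our own choosing:

```python
import numpy as np

def avg_pairwise_dist(F):
    """d(i) = (1/(L1-1)) * sum_{j != i} ||f_i - f_j||_2 for the column
    features f_i of F (direct O(L1^2) computation, fine for small L1)."""
    L1 = F.shape[1]
    diffs = F[:, :, None] - F[:, None, :]   # dim x L1 x L1
    G = np.linalg.norm(diffs, axis=0)       # pairwise distance matrix
    return G.sum(axis=1) / (L1 - 1)         # the j = i term is zero

def distance_mse(X, Y):
    """MSE between the average-distance curves of BOW features X and
    CS features Y (both stored column-wise)."""
    dx, dy = avg_pairwise_dist(X), avg_pairwise_dist(Y)
    return float(np.mean((dx - dy) ** 2))
```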

Table 1 presents the MSEs on multiclass classification dataset when using different subrates and SRMs. It can be seen from Table 1 that all SRMs provide similar MSEs at any subrate; e.g., the average MSE of each SRM at all subrates is about 11.00, and the MSEs of SRMs decrease as the subrate increases; e.g., the MSE of DCT is 18.78 at the subrate of 0.1, and it is reduced to 5.92 at the subrate of 0.6. These MSE results indicate that SRMs can preserve the approximate pairwise distances between BOW features in the CS feature space.


Table 1: MSEs between the average distances of BOW and CS features on the multiclass classification dataset.

Subrate R   DCT     FFT     Block DCT   Block WHT   Block Gaussian
0.1         18.78   18.81   18.38       18.48       18.62
0.2         14.37   14.40   14.38       14.08       14.51
0.3         11.39   11.39   11.17       11.11       11.78
0.4         9.14    9.12    9.018       9.01        9.33
0.5         7.36    7.34    7.19        7.21        7.33
0.6         5.92    5.91    5.96        5.90        6.03
Avg.        11.16   11.16   11.02       10.96       11.27

Then, we select SVM as the classifier in our model and evaluate the effects of SRMs on classification accuracy. With different SRMs, the accuracies of SVM classifier on binary and multiclass classification datasets are presented in Table 2. It can be seen that all SRMs provide similar accuracies in most cases at any subrate; e.g., with all subrates considered, the average accuracies of SRMs range from 0.7121 to 0.7203 on binary classification dataset, and similar results are obtained on multiclass classification dataset. We also see that the accuracy is gradually improved for any SRM as the subrate increases. The above results indicate that the selection of SRMs has little impact on classification accuracy, and the subrate is a key factor in controlling the accuracy. Therefore, any SRM can be used in our model, and we need to consider the balance between accuracy and subrate in practice.


Table 2: Accuracies of the SVM classifier with different SRMs.

Binary classification
Subrate R   DCT      FFT      Block DCT   Block WHT   Block Gaussian
0.1         0.6955   0.7220   0.6975      0.6880      0.6930
0.2         0.7185   0.7135   0.7135      0.7200      0.7055
0.3         0.7195   0.7140   0.7285      0.7215      0.7125
0.4         0.7285   0.7190   0.7265      0.7170      0.7185
0.5         0.7235   0.7195   0.7290      0.7270      0.7145
0.6         0.7255   0.7290   0.7265      0.7280      0.7285
Avg.        0.7185   0.7195   0.7203      0.7169      0.7121

Multiclass classification
Subrate R   DCT      FFT      Block DCT   Block WHT   Block Gaussian
0.1         0.8590   0.8575   0.8358      0.8444      0.8227
0.2         0.8616   0.8606   0.8651      0.8636      0.8585
0.3         0.8651   0.8737   0.8666      0.8737      0.8606
0.4         0.8686   0.8702   0.8712      0.8747      0.8712
0.5         0.8712   0.8732   0.8767      0.8691      0.8757
0.6         0.8747   0.8782   0.8803      0.8732      0.8762
Avg.        0.8668   0.8689   0.8660      0.8665      0.8609

4.3. Evaluation on Classifiers

To verify the validity of CS, we compare CS features and BOW features in terms of the accuracy and training time of the classifiers driven by them. Block DCT is selected as the SRM, and the accuracy results are presented in Table 3. It can be seen that, for binary classification, the accuracies of classifiers driven by CS features go up as the subrate increases. Though lower than those with the BOW feature when the subrate is small, they quickly catch up; e.g., for SVM, the CS feature overtakes the BOW feature at a subrate of 0.3 and outperforms it thereafter. With all classifiers considered, the average accuracy of the CS features is also comparable to that of the BOW features. The same holds for multiclass classification. As for the training time in Figure 5, whether for binary or multiclass classification, the CS feature costs far less time than the BOW feature, especially when the subrate is small. Table 4 presents the average accuracy, precision, recall, and F1 over all classifiers on the binary classification dataset. The precision, recall, and F1 of the CS features at any subrate are similar to those of the BOW features, which indicates that the classification accuracy is reliable for CS features. From the above results, it can be concluded that CS speeds up the training of classifiers while providing accuracy that matches the BOW feature.


Table 3: Accuracies of different classifiers driven by BOW and CS features (SRM: Block DCT).

Binary classification
Classifier      BOW      R=0.1    R=0.2    R=0.3    R=0.4    R=0.5    R=0.6
SVM             0.7220   0.6975   0.7135   0.7285   0.7265   0.7290   0.7265
Decision tree   0.6235   0.6365   0.6395   0.6460   0.6355   0.6465   0.6485
AdaBoost        0.7060   0.7020   0.6975   0.7075   0.7035   0.7020   0.7110
KNN             0.6040   0.5955   0.6120   0.6200   0.6140   0.6145   0.6125
Naïve Bayes     0.7275   0.7035   0.7130   0.7125   0.7170   0.7200   0.7150
Avg.            0.6766   0.6670   0.6751   0.6829   0.6793   0.6824   0.6827

Multiclass classification
Classifier      BOW      R=0.1    R=0.2    R=0.3    R=0.4    R=0.5    R=0.6
SVM             0.8732   0.8358   0.8651   0.8666   0.8712   0.8767   0.8803
Decision tree   0.8560   0.8454   0.8434   0.8510   0.8520   0.8525   0.8530
AdaBoost        0.7777   0.7535   0.7737   0.7732   0.7813   0.7808   0.7818
KNN             0.8252   0.8080   0.8146   0.8207   0.8242   0.8257   0.8252
Naïve Bayes     0.7737   0.7373   0.7404   0.7464   0.7429   0.7424   0.7454
Avg.            0.8212   0.7960   0.8074   0.8116   0.8143   0.8156   0.8171


Table 4: Average accuracy, precision, recall, and F1 over all classifiers on the binary classification dataset.

Metric      BOW      R=0.1    R=0.2    R=0.3    R=0.4    R=0.5    R=0.6
Accuracy    0.6766   0.6670   0.6751   0.6829   0.6793   0.6824   0.6827
Precision   0.6564   0.6658   0.6674   0.6722   0.6694   0.6670   0.6694
Recall      0.6817   0.6671   0.6775   0.6866   0.6824   0.6871   0.6864
F1          0.6679   0.6664   0.6723   0.6790   0.6756   0.6766   0.6774

4.4. Comparisons on DR Methods

We compare the performance of the proposed CS model with that of several popular DR methods: PCA, ICA, and NMF. PCA learns all principal components from the training set and, according to the preset subrate, selects a subset of them to construct the transform matrix. ICA and NMF learn their transform matrices at different subrates by iterative numerical algorithms; their maximum numbers of iterations are both set to 20 to keep the execution time moderate. Each of these transform matrices is used to project all training and testing observations onto a low-dimensional space. The proposed CS model uses Block DCT to reduce the dimensionality of the observations at each subrate. Table 5 presents the average accuracies of all classifiers on the binary and multiclass classification datasets when using the different DR methods. The proposed CS model obtains higher accuracies than PCA, ICA, and NMF at every subrate for both tasks. It is also more stable: its accuracy increases gradually as the subrate increases, whereas the accuracies of PCA, ICA, and NMF fluctuate; for binary classification, for example, the accuracy of PCA is 0.6221 at a subrate of 0.1 but drops to 0.6091 when the subrate is raised to 0.6. Table 6 presents the execution time of the DR methods on the binary and multiclass classification datasets at different subrates. Because PCA learns all principal components regardless of the subrate, its execution time does not vary with the subrate, costing 384.75 s for binary classification and 275.49 s for multiclass classification.
At the preset subrate, ICA and NMF determine the final dimensionality of the observations and learn the corresponding transform matrices, so their execution time grows with the subrate; e.g., for binary classification, NMF costs 187.33 s at a subrate of 0.1 and 2201.12 s at a subrate of 0.6. The accuracies of ICA and NMF can be improved by allowing more iterations, but their execution time then increases dramatically. Compared with PCA, ICA, and NMF, the proposed CS model costs far less execution time; e.g., for binary classification, CS costs only 3.32 s and 4.63 s at subrates of 0.1 and 0.6, respectively. From these results, it can be concluded that the proposed CS model achieves higher accuracy with less execution time than PCA, ICA, and NMF, making it a reliable DR method.
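To illustrate why a Block DCT SRM is so cheap compared with learned transforms, the following sketch performs the three standard SRM stages (randomize, fast blockwise transform, subsample); it is a simplified illustration under assumed details (block size, the specific sign-flip/permutation randomization, and the rescaling are our choices here, not necessarily the authors' exact implementation):

```python
import numpy as np

def block_dct_matrix(b):
    # Orthonormal DCT-II matrix of size b x b.
    n = np.arange(b)
    D = np.sqrt(2.0 / b) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * b))
    D[0, :] /= np.sqrt(2.0)
    return D

def srm_measure(x, subrate, block=32, seed=0):
    """Project x to round(subrate * len(x)) dimensions via a Block DCT SRM:
    random sign flip + permutation, blockwise DCT, random subsampling."""
    rng = np.random.default_rng(seed)
    N = len(x)
    x = np.concatenate([x, np.zeros((-N) % block)])  # pad to a multiple of block
    # Step 1: randomize the signal (sign flip + global permutation).
    x = x * rng.choice([-1.0, 1.0], size=len(x))
    x = x[rng.permutation(len(x))]
    # Step 2: apply the fast orthonormal transform block by block.
    D = block_dct_matrix(block)
    y = (D @ x.reshape(-1, block).T).T.ravel()
    # Step 3: keep a random subset of M coefficients, rescaled to preserve energy.
    M = int(round(subrate * N))
    idx = rng.choice(len(y), size=M, replace=False)
    return np.sqrt(len(y) / M) * y[idx]
```

Only the b x b block transform and a permutation are ever materialized, so memory and time stay essentially linear in N, unlike the full transform matrices that PCA, ICA, and NMF must learn and store.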


Table 5: Average accuracies of different DR methods at each subrate R.

DR method    Subrate R
             0.1       0.2       0.3       0.4       0.5       0.6

Binary classification
PCA          0.6221    0.6236    0.6206    0.6154    0.6222    0.6091
ICA          0.5754    0.5830    0.5862    0.5974    0.5903    0.6009
NMF          0.5926    0.6127    0.6193    0.6067    0.6157    0.6000
CS           0.6670    0.6751    0.6829    0.6793    0.6824    0.6827

Multiclass classification
PCA          0.7253    0.7213    0.7019    0.6845    0.6822    0.6726
ICA          0.4938    0.5170    0.5305    0.5448    0.5455    0.5479
NMF          0.7112    0.7080    0.7123    0.7123    0.7096    0.7063
CS           0.7960    0.8074    0.8116    0.8143    0.8156    0.8171

Note that the SRM in CS is Block DCT.

Table 6: Execution time (in seconds) of different DR methods at each subrate R.

DR method    Subrate R
             0.1        0.2        0.3        0.4        0.5        0.6

Binary classification
PCA          384.75     384.75     384.75     384.75     384.75     384.75
ICA          369.72     3094.00    17259.27   34511.16   35281.73   50355.25
NMF          187.33     456.67     1169.65    1873.32    2481.44    2201.12
CS           3.32       3.64       3.92       4.19       4.58       4.63

Multiclass classification
PCA          275.49     275.49     275.49     275.49     275.49     275.49
ICA          188.77     382.82     990.19     6592.11    10829.64   20559.64
NMF          159.21     327.14     652.83     1239.35    1529.07    2358.88
CS           3.10       3.77       3.94       4.03       4.25       4.41

Note that the SRM in CS is Block DCT.

5. Conclusion

In this paper, we develop a CS-based model for text classification. Traditionally, BOW features are extracted from the text dataset; these are highly sparse representations of very high dimensionality, and training classifiers on them is costly. Through the incoherent measurement of CS, we greatly reduce the dimensionality of BOW features, while the RIP of CS ensures that the pairwise distances between BOW features are well preserved in the low-dimensional CS feature space. CS also provides SRMs that are fast to compute with low memory consumption. In the proposed model, different SRMs linearly measure the BOW features at a preset subrate, generating the CS features used to train the classifiers. Experimental results show that the proposed CS model achieves classification accuracy comparable to the traditional BOW model while significantly reducing the space and time complexity of training on a large-scale dataset.
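The distance-preservation property that the RIP guarantees can be sanity-checked empirically. The sketch below (illustrative only: a dense Gaussian measurement matrix stands in for the SRMs, and the dimensions and sparsity level are made-up values) projects sparse BOW-like vectors at a subrate of 0.1 and compares pairwise distances before and after projection:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, M, docs = 5000, 500, 20          # BOW dim, CS dim (subrate 0.1), number of documents

# Sparse BOW-like features: mostly zeros with a few small term counts.
mask = rng.random((docs, N)) < 0.02
X = np.where(mask, rng.integers(1, 5, size=(docs, N)), 0).astype(float)

# Dense Gaussian measurement matrix, scaled so distances are preserved in expectation.
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
Y = X @ Phi.T                       # CS features

# Ratios of pairwise distances after/before projection should stay near 1.
ratios = [np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
          for i, j in combinations(range(docs), 2)]
print(min(ratios), max(ratios))     # typically close to 1
```

Because the ratios concentrate around 1, a classifier whose decisions depend on distances between feature vectors behaves almost the same in the 500-dimensional CS space as in the original 5000-dimensional BOW space.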

Data Availability

The datasets and experimental code used in this study are available from the SIGMULL Team Website: http://www.scholat.com/showTeamScholarEn.html?id=1234&changeTo=En&nav=4.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant nos. 61572417 and 31872704, in part by the Innovation Team Support Plan of Henan University of Science and Technology (no. 19IRTSTHN014), and in part by the Nanhu Scholars Program for Young Scholars of Xinyang Normal University.

References

  1. W. Zhu, P. Cui, Z. Wang, and G. Hua, “Multimedia big data computing,” IEEE Multimedia, vol. 22, no. 3, p. 96, 2015.
  2. G. Song, Y. M. Ye, X. L. Du, X. H. Huang, and S. F. Bie, “Short text classification: a survey,” Journal of Multimedia, vol. 9, pp. 635–643, 2014.
  3. K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, “Text classification algorithms: a survey,” Information, vol. 10, no. 4, p. 150, 2019.
  4. X. L. Deng, Y. Q. Li, J. Weng, and J. L. Zhang, “Feature selection for text classification: a review,” Multimedia Tools and Applications, vol. 78, pp. 3797–3816, 2019.
  5. L. Qing, W. Linhong, and D. Xuehai, “A novel neural network-based method for medical text classification,” Future Internet, vol. 11, no. 12, p. 255, 2019.
  6. C. C. Aggarwal and C. X. Zhai, Mining Text Data, Springer, Berlin, Germany, 2012.
  7. G. Salton, A. Wong, and C. S. Yang, “A vector-space model for information retrieval,” Communications of the ACM, vol. 18, pp. 613–620, 1975.
  8. C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, New York, NY, USA, 2006.
  9. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, The MIT Press, Cambridge, MA, USA, 2016.
  10. W. Zhang, T. Yoshida, and X. Tang, “Text classification based on multi-word with support vector machine,” Knowledge-Based Systems, vol. 21, no. 8, pp. 879–886, 2008.
  11. D. Coppersmith, S. J. Hong, and J. R. M. Hosking, “Partitioning nominal attributes in decision trees,” Data Mining and Knowledge Discovery, vol. 3, no. 2, pp. 197–217, 1999.
  12. Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
  13. S. Zhang, X. Li, M. Zong, X. Zhu, and R. Wang, “Efficient kNN classification with different numbers of nearest neighbors,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1774–1785, 2018.
  14. P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Machine Learning, vol. 29, no. 2/3, pp. 103–130, 1997.
  15. Y. Zhang, R. Jin, and Z.-H. Zhou, “Understanding bag-of-words model: a statistical framework,” International Journal of Machine Learning and Cybernetics, vol. 1, no. 1–4, pp. 43–52, 2010.
  16. G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. Chanona-Hernández, “Syntactic dependency-based n-grams as classification features,” in Mexican International Conference on Artificial Intelligence, pp. 1–11, Springer, Berlin, Germany, 2012.
  17. H. Abdi and L. J. Williams, “Principal component analysis,” Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 433–459, 2010.
  18. Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, “ICA with reconstruction cost for efficient overcomplete feature learning,” Advances in Neural Information Processing Systems, vol. 24, pp. 1017–1025, 2011.
  19. V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons, “Text mining using non-negative matrix factorizations,” in Proceedings of the 2004 SIAM International Conference on Data Mining, pp. 452–456, Lake Buena Vista, FL, USA, April 2004.
  20. C. Huang, L. Zhong, Y. Huang, G. Zhang, and X. Zhong, “A novel method for text recognition in natural scene based on sparse stacked autoencoder,” Journal of Computational Information Systems, vol. 11, pp. 1399–1406, 2015.
  21. E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks,” in Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1996–2000, Brisbane, QLD, Australia, April 2015.
  22. W. Xu and Y. Tan, “Semisupervised text classification by variational autoencoder,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 1, pp. 295–308, 2020.
  23. S. Chakrabarti, S. Roy, and M. V. Soundalgekar, “Fast and accurate text classification via multiple linear discriminant projections,” The VLDB Journal, vol. 12, no. 2, pp. 170–185, 2003.
  24. A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: replacing minimization with randomization in learning,” Advances in Neural Information Processing Systems, vol. 21, pp. 1313–1320, 2009.
  25. E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
  26. R. Baraniuk, “Compressive sensing [lecture notes],” IEEE Signal Processing Magazine, vol. 24, no. 4, pp. 118–121, 2007.
  27. R. Li, X. Duan, X. Li, W. He, and Y. Li, “An energy-efficient compressive image coding for green internet of things (IoT),” Sensors, vol. 18, no. 4, p. 1231, 2018.
  28. T. T. Do, L. Gan, N. H. Nguyen, and T. D. Tran, “Fast and efficient compressive sensing using structurally random matrices,” IEEE Transactions on Signal Processing, vol. 60, no. 1, pp. 139–154, 2012.
  29. R. Li, X. Duan, and Y. Li, “Measurement structures of image compressive sensing for green internet of things (IoT),” Sensors, vol. 19, p. 102, 2019.
  30. R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, pp. 253–263, 2008.
  31. S. Escalera, O. Pujol, and P. Radeva, “On the decoding process in ternary error-correcting output codes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 120–134, 2010.

Copyright © 2020 Kelin Shen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

