Abstract

Text classification has long been an important topic in natural language processing (NLP) research. As we enter the era of big data, a good text classifier is critical to achieving NLP for scientific big data analytics. The ever-increasing size of text data poses significant challenges for developing effective text classification algorithms. Given the success of deep neural networks (DNNs) in analyzing big data, this article proposes a novel text classifier using a DNN, in an effort to improve the computational performance of addressing big text data with hybrid outliers. Specifically, by combining a denoising autoencoder (DAE) and restricted Boltzmann machines (RBMs), our proposed method, named denoising deep neural network (DDNN), achieves significant improvement in antinoise performance and feature extraction compared to traditional text classification algorithms. Simulations on benchmark datasets verify the effectiveness and robustness of the proposed text classifier.

1. Introduction

As we enter the era of big data with the development of information technology and the Internet, the amount of data is growing geometrically and information overload has become a reality. The issue people face is no longer how to obtain information, but how to extract useful information quickly and efficiently from massive amounts of data. Therefore, how to effectively manage and filter information has always been an important research area in engineering and science.

With the rapid increase in the amount of data, information representation has also diversified, mainly including text, sound, and image. Compared with sound and image, text data uses fewer network resources and is easier to upload and download. Since other forms of information can also be expressed as text, text has become the main carrier of information and occupies a leading position in network resources.

Traditional text processing is time-consuming, struggles to achieve the desired results, and cannot adapt to the explosive growth of digital information. Hence, effectively retrieving information in accordance with user feedback can help users obtain the information they need quickly and accurately. Text classification has therefore become a critical technology for achieving natural human-machine interaction and contributing to artificial intelligence. It can address the problem of disorganized information to a large extent, so that users can locate information accurately.

1.1. Text Classification

The purpose of text classification is to assign large amounts of text to one or more categories based on the subject, content, or attributes of the documents. Text classification methods fall into two categories: rule-based and statistical methods [1, 2]. Rule-based classification methods require extensive domain knowledge and rule bases; the difficulty of developing and updating the rules makes their application relatively narrow and suitable for only specific fields. Statistical learning methods are usually based on statistics or statistical knowledge; they estimate the parameters of a corresponding data model from sample statistics computed on the training set and then train the classifier. In the test stage, the categories of the samples are predicted according to these parameters.

Recently, a large number of statistical machine learning methods have been applied to text classification. The earliest machine learning method applied was naive Bayes (NB) [3, 4]. Subsequently, almost all important machine learning algorithms have been applied to text classification, for example, k-nearest neighbor (KNN), neural networks (NNs), support vector machines (SVMs), decision trees, kernel learning, and others [5–10]. SVM uses a shallow linear model to separate the classes; when different types of data vectors cannot be separated in a low-dimensional space, SVM maps them to a high-dimensional space through a kernel function and finds the optimal hyperplane. In addition, NB, linear classifiers, decision trees, KNN, and other methods are relatively weak, but their models are simple and efficient, and they have accordingly been improved.

However, these models are shallow machine learning methods. Although they have been proven to address some problems efficiently under simple or restricted conditions, when facing complex practical problems, for example, biomedical multiclass text classification with noisy data and unevenly distributed classes, the classification and generalization ability of shallow machine learning models and ensemble classifiers becomes unsatisfactory. Therefore, exploring new methods, for example, deep learning, is necessary.

1.2. Deep Learning

With the success of deep learning methods [11, 12], further improvements to NNs, for example, the deep belief network (DBN) [13], have been developed. The DBN is built on the cascaded restricted Boltzmann machine (RBM) [14] learning algorithm, combining an unsupervised greedy layer-wise pretraining strategy with supervised fine-tuning. This tackles the optimization problem of complex deep models, and the deep neural network (DNN) has consequently witnessed rapid advancement.

Meanwhile, DNNs have been applied to many learning tasks, for example, speech and image recognition [15]. Since 2011, the speech recognition research teams at Microsoft and Google have reduced the speech recognition error rate by 20%–30% using DNN models, the largest step forward in that field in decades. In 2012, DNN technology reduced the error rate in the ImageNet [15] evaluation task (image recognition) from 26% to 15% [16].

Moreover, the autoencoder (AE) is a type of DNN that reproduces its input signal [17, 18]. Given an input, it first encodes the input signal with an encoder and then decodes the encoded signal with a decoder, minimizing the reconstruction error by continually adjusting the parameters of the encoder and decoder [19]. Additionally, there are several improvements to the AE, for example, the sparse AE and the denoising AE [17, 18]. The performance of some machine learning algorithms can be further improved through the use of these AEs [20].

Recently, deep learning methods have had a significant impact on the field of natural language processing (NLP) [11, 21].

1.3. Status Analysis

Due to the complex features of large text data and the varied effects of noise, the performance of traditional text classification algorithms is not satisfactory when dealing with large datasets.

More recently, deep learning has been successfully applied to a series of classification problems with multiple modalities. Using deep learning-based methods, users can effectively extract the complex semantic relations of text [11, 22]. With the popularity of deep learning algorithms, DNNs have shown advantages in dealing with large-scale datasets. In this article, motivated by DNNs, the denoising deep neural network (DDNN) is designed and feature extraction is conducted with this model.

Shallow text representation (feature selection) suffers from missing semantics. For deep text representations based on linear computation, threshold selection is added to classifier training, which actually destroys the self-taught learning ability on text. Meanwhile, for multilabel and multicategory text classification, there are also the problems of ignored label dependencies and a lack of generalization ability. To cope with these problems, several improvements have been achieved through deep learning methods. For example, a two-layer replicated softmax model (RSM) was proposed in [23], which outperforms latent Dirichlet allocation (LDA), a semantically consistent topic model [24]. However, that model relies on weight-sharing techniques and has only two layers; in the process of dimension reduction, relatively more document information is lost and its noise-handling ability is poor, so that different documents show little difference under the model.

In order to avoid such limitations and develop a better approach, this article proposes a DDNN model that combines state-of-the-art deep learning methods. Specifically, in our model, the data is denoised with the help of a denoising autoencoder (DAE), and the features of the text are then extracted effectively using RBMs. Compared with traditional text classification algorithms, the proposed algorithm achieves significant improvement in antinoise performance and feature extraction, due to the efficient learning ability of the hybrid deep learning methods used in this model.

The remainder of this article is organized as follows. In Section 2, we give a technical analysis of the DAE [25] and the RBM [26]. Then, our proposed text classifier is presented in Section 3, where particular attention is paid to the implementation of the DDNN. Section 4 provides simulation results and discussions. Finally, the conclusion is given in Section 5.

2. Background

In this article, we use two kinds of state-of-the-art deep learning models, that is, DAE and RBM [25, 26].

2.1. Denoising Autoencoder (DAE)

Generally, the structure of the AE [27] is shown in Figure 1. The whole system consists of two networks, that is, an encoder and a decoder. Its purpose is to make the output of the reconstruction layer as similar to the input as possible. The encoder network encodes the input, and the decoder then reconstructs the result from the code. The denoising autoencoder is developed from the ordinary autoencoder: by adding noise to the training data, it learns a more robust representation of the input signal and has stronger generalization ability than an ordinary encoder.
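As a concrete illustration, the following Keras sketch builds one denoising autoencoder layer in the spirit of Figure 1; the layer sizes, the Gaussian corruption, and the mean squared error loss are illustrative assumptions rather than the exact settings of this article.

```python
# A minimal sketch of one denoising autoencoder layer in Keras; layer sizes,
# the Gaussian corruption, and the MSE loss are illustrative assumptions.
from keras.layers import Input, Dense, GaussianNoise
from keras.models import Model

input_dim, code_dim = 2000, 1600
x_in = Input(shape=(input_dim,))
corrupted = GaussianNoise(0.01)(x_in)                      # corrupt only during training
code = Dense(code_dim, activation='sigmoid')(corrupted)    # encoder
x_rec = Dense(input_dim, activation='sigmoid')(code)       # decoder / reconstruction

dae = Model(x_in, x_rec)
dae.compile(optimizer='adam', loss='mse')                  # minimize reconstruction error
# dae.fit(X_train, X_train, epochs=10, batch_size=350)     # target is the clean input
encoder = Model(x_in, code)                                # reusable encoder part
```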

2.2. Restricted Boltzmann Machine (RBM)

As shown in Figure 2, the RBM network has two layers [28, 29]. The first layer is the visible layer (v), also called the input layer, which consists of the visible nodes. The second layer is the hidden layer (h), that is, the feature extraction layer, which consists of the hidden nodes. If v is known, then all hidden nodes are conditionally independent; similarly, all visible nodes are conditionally independent when the hidden layer is known. The nodes within a layer are not connected, while the nodes of different layers are fully connected.
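For illustration, the conditional distributions implied by this structure can be written compactly as follows; the sketch assumes binary units, and the variable names are illustrative.

```python
# A small sketch of the RBM conditional distributions implied by Figure 2:
# with no intra-layer connections, the hidden units are independent given v
# and the visible units are independent given h (names are illustrative).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b_h):
    # P(h_j = 1 | v) = sigmoid(b_h[j] + sum_i v[i] * W[i, j])
    return sigmoid(b_h + v @ W)

def p_v_given_h(h, W, b_v):
    # P(v_i = 1 | h) = sigmoid(b_v[i] + sum_j W[i, j] * h[j])
    return sigmoid(b_v + h @ W.T)
```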

3. The Proposed Text Classifier

3.1. Denoising Deep Neural Network (DDNN)
3.1.1. Framework

Here, a DDNN is designed using DAE and RBM, which can effectively reduce noise while extracting features.

The input of the DDNN model is a vector of fixed dimension. Firstly, we train the denoising module, which is composed of two layers named DAE1 and DAE2, using unsupervised training. Only one layer is trained at a time, and each training step minimizes the reconstruction error for its input data, that is, the output of the previous layer. Because the encoder, and thus its latent representation, can be computed from the previous layer, the kth layer can be processed directly using the output of the (k-1)th layer, until all the denoising layers are trained.

The operation of this model is shown in Figure 3.
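A possible sketch of this greedy layer-wise procedure is given below; it assumes a hypothetical helper train_dae that builds and fits one DAE (as in the earlier Keras sketch) and returns its encoder, and the layer sizes are placeholders.

```python
# Illustrative greedy layer-wise pretraining of the denoising module:
# each DAE layer is trained on the output of the layer below it.
# train_dae is a hypothetical helper that builds, fits, and returns the
# encoder of one DAE; the layer sizes are assumptions.
layer_sizes = [2000, 1600, 1500]     # input dimension and two DAE code sizes
encoders, data = [], X_train         # X_train: normalized TF-IDF vectors

for in_dim, out_dim in zip(layer_sizes[:-1], layer_sizes[1:]):
    encoder = train_dae(data, in_dim, out_dim)   # minimize reconstruction error
    encoders.append(encoder)
    data = encoder.predict(data)                 # output feeds the next layer
```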

After being processed by the denoising layers, the data enters the RBM part, which further extracts features beyond those of the denoising autoencoder layers. The features extracted by this part are more representative and essential. Figure 4 shows the diagram for RBM feature extraction.

This part is constructed by stacking two layers of RBM, which can be trained from bottom to top as follows.

(1) The input of the bottom RBM is the output of the denoising layers.

(2) The features extracted by the bottom RBM are taken as the input of the top RBM.

Because each RBM can be trained quickly by the contrastive divergence (CD) learning algorithm [30], this training framework avoids the highly complex computation of training a deep network in one pass by dividing it into the training of multiple RBMs. After this pretraining, initial parameter values for the model are obtained. Then, a backpropagation (BP) neural network is initialized with these parameters, and the network parameters are fine-tuned by a traditional global learning algorithm using the labeled dataset. Thus, the objective function can converge toward the global optimum.
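For reference, a minimal CD-1 update for a single RBM with binary units may look as follows; this is a sketch of the standard contrastive divergence recipe rather than the exact implementation used here, and the variable names and batching are illustrative.

```python
# A minimal CD-1 update for one RBM with binary units, following the standard
# contrastive divergence recipe; variable names and batching are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_v, b_h, lr=0.01):
    # positive phase: hidden activations driven by the data
    ph0 = sigmoid(b_h + v0 @ W)
    h0 = (np.random.rand(*ph0.shape) < ph0).astype(float)
    # negative phase: one Gibbs step gives the reconstruction statistics
    pv1 = sigmoid(b_v + h0 @ W.T)
    ph1 = sigmoid(b_h + pv1 @ W)
    # update parameters with the difference of data and model correlations
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h
```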

The reason for choosing the DAE here is that, in text classification, the data is inevitably mixed with noise of different types and intensities, which tends to affect model training and degrade the final classification performance. The DAE performs a preliminary extraction of the original features, and its learning criterion is noise reduction. In the pretraining stage, adding noise signals of various strengths and types to the original input makes the encoding process more stable and robust, as shown in Figure 5.

Moreover, the RBM is chosen because it can model the discrete distribution of arbitrary samples and is well suited to feature representation when the number of hidden units is sufficient.

3.1.2. Implementation

The DDNN model consists of four layers, that is, DAE1, DAE2, RBM1, and RBM2. The layer v is both the visible layer and the input layer of the DDNN model, and each document in this article is represented by a vector of fixed dimension. W1, W2, W3, and W4 denote the connection weights between successive layers, and h1, h2, h3, and h4 denote the hidden layers corresponding to the outputs of DAE1, DAE2, RBM1, and RBM2, respectively. The DAE2 layer is the output layer of the denoising module and also the input layer of the two-layer RBM module. RBM2 is the output layer of the DDNN model, which represents the features of the document and is compared with the visible layer v. This layer is the high-level feature representation of the text data, and the subsequent text classification task is addressed on the basis of this vector. There are no connections between nodes within the same layer, but the nodes of adjacent layers are fully connected.

Specifically, the energy model is introduced to capture the correlation between variables while optimizing the model parameters. Therefore, it is important to embed the optimization problem into an energy function when training the model parameters. Here, the RBM energy function is defined as

$$E(v, h) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n}\sum_{j=1}^{m} v_i w_{ij} h_j. \tag{1}$$

Equation (1) gives the energy of each configuration of connected visible and hidden nodes, where m is the number of hidden nodes, n is the number of visible nodes, and a and b are the biases of the visible layer and the hidden layer, respectively. The objective of the RBM model is to accumulate the energy over all visible and hidden nodes; therefore, for each sample, the values of all hidden nodes corresponding to it would have to be enumerated so that the total energy could be computed, which is computationally complex. An effective solution is to convert the problem into a probabilistic computation. The joint probability of the visible and hidden nodes is

$$P(v, h) = \frac{e^{-E(v, h)}}{\sum_{v, h} e^{-E(v, h)}}. \tag{2}$$

By introducing this probability, the energy function can be simplified, and the objective becomes minimizing the energy value. Statistical learning theory tells us that a state of low energy has higher probability than one of high energy, so we maximize this probability and introduce the free energy function, defined as follows:

$$\mathrm{FreeEnergy}(v) = -\ln \sum_{h} e^{-E(v, h)}. \tag{3}$$

Therefore,

$$P(v) = \frac{e^{-\mathrm{FreeEnergy}(v)}}{Z}, \qquad Z = \sum_{v} e^{-\mathrm{FreeEnergy}(v)}, \tag{4}$$

where $Z$ is the normalization factor. Then, the probability of the training set $S$ can be transformed into

$$\ln \prod_{v \in S} P(v) = -\sum_{v \in S} \mathrm{FreeEnergy}(v) - |S| \ln Z. \tag{5}$$

The first term on the right side of (5) is the negative of the sum of the free energy functions over the whole network, and the left side is the log-likelihood function. As described above, the model parameters can be obtained by maximum likelihood estimation.
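For reference, the gradient of the log-likelihood in (5) with respect to the RBM parameters takes the standard form below (a textbook result not stated explicitly in this article); the model expectations are what the CD algorithm of Section 3.1.1 approximates with one Gibbs step.

```latex
\frac{\partial \ln P(v)}{\partial w_{ij}}
  = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}},
\qquad
\frac{\partial \ln P(v)}{\partial a_i}
  = \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{model}},
\qquad
\frac{\partial \ln P(v)}{\partial b_j}
  = \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{model}}.
```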

Here, we first construct a denoising module for the original features, which is mainly composed of DAEs. The two DAE layers are placed at the bottom of the model so as to make full use of their denoising character. The input signal is denoised by reconstructing it through unsupervised learning, so that the signal entering the network is purer after being processed by the encoder. The impact of noisy data on the subsequent construction of the classifier is thereby reduced.

The second module is developed as in a DBN: it is built from stacked RBMs, which improves the feature extraction ability of the model. Furthermore, the model can capture the complex regularities in the data, and the high-level features extracted are more representative. In order to achieve better classification results, we use these representative features, further refined by the RBMs, as the input to the final classifier.

Considering the complexity of the training and the efficiency of the model, a two-layer DAE and a two-layer RBM will be used.

3.2. Text Classification Using DDNN

Here, the final DDNN-based text classifier is developed. Its architecture contains three key modules, as shown in Figure 6.

3.2.1. Text Preprocessing Module

First, the feature words are mapped into a vocabulary form [31–33]. Then, the weights are computed using the TF-IDF (term frequency-inverse document frequency) algorithm [34]. Each text is then represented as a vector, which is also normalized.
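A minimal sketch of this step with scikit-learn is shown below; the vocabulary size of 2000 follows the setting in Section 4.3.2, and train_texts and test_texts are assumed lists of raw document strings.

```python
# A minimal sketch of the preprocessing module with scikit-learn: TF-IDF
# weighting over the most frequent terms plus L2 normalization.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=2000, norm='l2')
X_train = vectorizer.fit_transform(train_texts).toarray()   # learn vocabulary on the training set
X_test = vectorizer.transform(test_texts).toarray()         # reuse it for the test set
```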

3.2.2. Feature Learning Module

The DDNN mentioned in Section 3.1 is used to implement feature learning.

3.2.3. Classification Identification Module

In this module, we use the softmax classifier, whose input is the feature learned by the feature learning module. Suppose the text dataset has $m$ texts from $k$ categories, and the training set is expressed as $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, where $x^{(i)}$ represents the $i$th training text and $y^{(i)} \in \{1, 2, \ldots, k\}$ represents its category. The main purpose of the algorithm is to calculate, for the given training set, the probability that $x^{(i)}$ belongs to each tag category, using the hypothesis function

$$h_{\theta}(x^{(i)}) = \begin{bmatrix} p(y^{(i)}=1 \mid x^{(i)}; \theta) \\ p(y^{(i)}=2 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)}=k \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^{T} x^{(i)}}} \begin{bmatrix} e^{\theta_1^{T} x^{(i)}} \\ e^{\theta_2^{T} x^{(i)}} \\ \vdots \\ e^{\theta_k^{T} x^{(i)}} \end{bmatrix}. \tag{6}$$

Each subvector of $h_{\theta}(x^{(i)})$ is the probability that $x^{(i)}$ belongs to the corresponding tag category, and the probabilities are normalized so that they sum to 1. Here, $\theta_1, \theta_2, \ldots, \theta_k$ represent the parameter vectors.

After obtaining $\theta$, we have the hypothesis function $h_{\theta}(x)$ assumed above. It can be used to calculate the probability that a text belongs to each category, and the category with the largest probability is taken as the final result of the classifier.
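The decision rule can be sketched as follows, assuming the feature vector x produced by the DDNN and a learned parameter matrix theta whose rows are the parameter vectors in (6).

```python
# A small sketch of the softmax decision rule in (6): turn the class scores
# into normalized probabilities and pick the most probable category.
import numpy as np

def softmax_predict(x, theta):
    # theta: (k, d) matrix of parameter vectors; x: (d,) DDNN feature vector
    scores = theta @ x
    probs = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs /= probs.sum()                    # probabilities sum to 1
    return probs, int(np.argmax(probs))     # most probable category wins
```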

4. Simulation Results and Discussions

In this article, simulations are conducted in two steps. First, we analyze the key parameters that affect the performance of the DAE and the RBM models (the basic components of DDNN model) and implement the simulation with appropriate parameters. Second, we compare the DDNN with NB, KNN, SVM, and DBN using the data with noise and the data without noise and verify the effectiveness of the proposed DDNN.

4.1. Evaluation Criterion of Text Classification Results

For the text classification results, we mainly use the accuracy as a classification criterion. This index is widely used to evaluate the performance in the field of information retrieval and statistical classification.

Suppose the original samples contain two categories of information: a number of samples belong to category 1, which is taken as positive, and the remaining samples belong to category 0, which is taken as negative.

After classification, TP samples belonging to category 1 are correctly assigned to category 1, while FN samples are incorrectly assigned to category 0. Similarly, TN samples belonging to category 0 are correctly assigned to category 0, while FP samples are incorrectly assigned to category 1.

Then, the accuracy is defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{7}$$

The accuracy reflects the overall performance of the classifier.

The recall is defined as

$$\text{Recall} = \frac{TP}{TP + FN}. \tag{8}$$

It reflects the proportion of positive samples classified correctly.

The F-score is defined as

$$F\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Precision} = \frac{TP}{TP + FP}. \tag{9}$$

It is a comprehensive reflection of the classification performance.
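These criteria can be computed, for example, with scikit-learn as sketched below; averaging the per-category values ("macro") mirrors the per-category averaging described in Section 4.3.2, although the exact averaging scheme is our assumption.

```python
# The criteria above computed with scikit-learn; y_true/y_pred are label arrays.
from sklearn.metrics import accuracy_score, recall_score, f1_score

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, average='macro')   # macro-averaged recall
f1 = f1_score(y_true, y_pred, average='macro')        # macro-averaged F-score
```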

4.2. Dataset Description

In our simulations, we test the algorithm performance using two news datasets, namely, 20-Newsgroups and BBC news datasets.

The 20-Newsgroups dataset consists of 20 different newsgroups, each representing a news topic. Three versions are available on the website (http://qwone.com/~jason/20Newsgroups/). We select the second version, with a total of 18846 documents, already divided into a training set of 11314 documents and a test set of 7532 documents. The distribution of documents over the 20 groups can be found on that website. Note that, in our simulations, the serial numbers of the 20 labels vary from 0 to 19.
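For reproducibility, this split can be loaded with scikit-learn as sketched below; its fetch_20newsgroups loader provides the "bydate" version, which matches the 11314/7532 split above.

```python
# Loading the 20-Newsgroups train/test split with scikit-learn.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')
train_texts, y_train = train.data, train.target   # 11314 documents, labels 0-19
test_texts, y_test = test.data, test.target       # 7532 documents
```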

The BBC news dataset consists of news documents from the BBC website (http://www.bbc.co.uk/news/business/market_data/overview/). The dataset includes a total of 2225 documents corresponding to five topics, that is, business, entertainment, politics, sports, and technology. We randomly select 1559 documents for the training set and 666 documents for the test set.

4.3. Simulation Results

All simulations are conducted in the following environment. The operating system is Ubuntu 16.04. The hardware is an NVIDIA Tesla M60 (GM204GL) GPU. The software environment is CUDA V8.0.61 with cuDNN 5.1. The deep learning framework is Keras, together with the sklearn and nltk toolkits.

4.3.1. Impact of Parameters

For all deep learning algorithms, the parameter tuning greatly affects the performance of simulation results. For the DDNN, the parameters which we mainly adjust include the plus noise ratio of the data, the number of hidden layer nodes, and the learning rate.

In order to test the robustness of the DDNN, we set the plus noise ratio of the training set to 0.01, 0.001, and 0.0001. The results are shown in Table 1.

As shown in Table 1, the stability of the model can be guaranteed when the plus noise ratio is within the range (0.001, 0.01). When the plus noise ratio is too high, that is, higher than 0.1, the data will be damaged, especially sparse data, and the classification performance is affected; when it is too low, the classifier's ability to extract robust features is weakened. Hence, we finally set the plus noise ratio to 0.001. In the subsequent simulations, we set the noise factor to 0.01, 0.02, 0.03, 0.04, and 0.05 to verify the denoising performance of the proposed model.

The number of input layer nodes is fixed according to the TF-IDF weighting result. Since the main purpose of the DAE is to reconstruct the original data, we set the numbers of input layer nodes and output layer nodes to the same value. Because the appropriate number of hidden layer nodes is unknown, we set the numbers of the two hidden-layer nodes in the DAE to 1600 and 1500, 1700 and 1500, and 1800 and 1500, respectively, and the numbers of the two hidden-layer nodes in the RBM to 600 and 100, 700 and 100, and 800 and 100, respectively. We also set the learning rate to 0.1, 0.01, and 0.001. The results are shown in Table 2.

As shown in Table 2, the performance of the DDNN model will be better when the numbers of two hidden-layer nodes are set to 1700 and 1500 for DAE and 700 and 100 for RBM, respectively. And the learning rate should be set to 0.01.
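Under this configuration, the fine-tuning stage of the DDNN can be sketched as the following Keras stack, where each Dense layer would be initialized from the corresponding pretrained DAE or RBM weights; the 2000-dimensional input and the sigmoid activations are assumptions for illustration.

```python
# A sketch of the fine-tuning stage implied by the selected configuration:
# Dense layers sized like DAE1/DAE2/RBM1/RBM2, trained end to end with BP.
from keras.models import Sequential
from keras.layers import Dense

num_classes = 5   # e.g., the five BBC news topics
model = Sequential([
    Dense(1700, activation='sigmoid', input_dim=2000),  # DAE1 (pretrained)
    Dense(1500, activation='sigmoid'),                   # DAE2 (pretrained)
    Dense(700, activation='sigmoid'),                    # RBM1 (pretrained)
    Dense(100, activation='sigmoid'),                    # RBM2 (pretrained)
    Dense(num_classes, activation='softmax'),            # softmax classifier
])
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_train, Y_train, epochs=100, batch_size=350)
```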

4.3.2. Comparisons and Analysis

In this article, we compare our DDNN model with NB, KNN, SVM, and DBN models.

In text preprocessing, we select the 2000 most frequent words for the simulation and set the batch size to 350. To be comparable with the DDNN model proposed in this article (two-layer DAE and two-layer RBM), the DBN model is also set to four layers. The number of iterations in the pretraining phase is 100, and the model update parameter is 0.01.

Here, we take the BBC news dataset as an example to show the training process. From Figures 7 and 8, we can see that, as the epoch increases, the training loss decreases and the accuracy on the test data increases, which shows that the training is effective.

Table 3 compares the results of the DDNN with other models on the BBC news dataset, and Table 4 compares them on the 20-Newsgroups dataset. Moreover, we compare these models on different types of data, including data without noise and data with noise factors of 0.01, 0.02, 0.03, 0.04, and 0.05. It is noted that, for each extracted text vector, standard normal noise multiplied by the noise factor is added, and any dimension that becomes less than 0 is set directly to 0. In this article, the accuracy (Accuracy), recall (Recall), and F-Score are used to evaluate the performance of the classifiers. Take the calculation of Accuracy as an example: for each classifier, we first calculate the accuracy of each category according to metric (7) and then take the average of these per-category accuracies as the result. The reported data are the optimal classification results obtained after running many times.
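The noise injection described above (standard normal noise scaled by the noise factor, with negative dimensions set to zero) can be sketched as follows; this is a minimal illustration rather than the exact code used.

```python
# Sketch of the noise injection: add standard normal noise scaled by the
# noise factor and set any dimension that falls below zero back to zero.
import numpy as np

def add_noise(X, noise_factor=0.01, seed=None):
    rng = np.random.RandomState(seed)
    X_noisy = X + noise_factor * rng.standard_normal(X.shape)
    X_noisy[X_noisy < 0] = 0.0
    return X_noisy
```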

Comparing the DDNN model with the shallow models, including KNN and SVM, the results in Tables 3 and 4 show that the DDNN achieves better performance. The reason is that, when the training set is sufficient, the DDNN can be fully trained, so that the network parameters can approach values that fit the distribution of the training data, and the high-level features extracted from the low-level features are more discriminative for the final classification.

Compared with the DBN model, the DDNN classifies more accurately when the two models have the same depth (both have four layers). This is because the first two layers of the DDNN are DAE layers, which can effectively reduce the impact of noisy data, and the DDNN model can adjust its parameters more flexibly. On the other hand, since the DAE is used as the initial layers, the dimension of the data is also reduced preliminarily.

As shown in Tables 3 and 4, the classification performance of NB, KNN, and SVM decreases obviously when noise is added to the dataset, while the DDNN has a better antinoise effect, with only about a 1% decline.

Furthermore, Table 5 shows the running time of the different models. For each sample, the NB classifier has the shortest running time and the SVM classifier has the longest. Meanwhile, the DDNN classifier maintains a good classification speed while achieving good classification performance.

5. Conclusion

This article combines the DAE and RBM to design a novel DNN model, named DDNN. The model first denoises the data based on the DAE and then extracts text features effectively based on the RBM. We conduct simulations on the 20-Newsgroups and BBC news datasets and compare the proposed model with traditional classification algorithms, for example, the NB, KNN, SVM, and DBN models, considering the impact of noise. It is verified that the DDNN proposed in this article achieves better antinoise performance and can extract more robust and deeper features while improving classification performance.

Although the proposed DDNN model has achieved satisfactory performance in text classification, the text used in the simulations is long-form data. Considering that text classification tasks also involve short text data, this issue should be addressed with the DDNN in future work. Moreover, to further improve the computational performance of deep learning methods, we can also design hybrid learning algorithms by incorporating advanced optimization techniques, for example, kernel learning and reinforcement learning, into the DDNN framework, while applying it in other fields.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research is funded by the Fundamental Research Funds for the China Central Universities of USTB under Grant FRF-BD-16-005A, the National Natural Science Foundation of China under Grant 61174103, the National Key Research and Development Program of China under Grants 2017YFB1002304 and 2017YFB0702300, the Key Laboratory of Geological Information Technology of Ministry of Land and Resources under Grant 2017320, and the University of Science and Technology Beijing-National Taipei University of Technology Joint Research Program under Grant TW201705.