Abstract

Special text has many distinctive features, such as professional terms, abbreviations, large datasets, diverse topics, and an uneven label distribution. Because existing text data mining classification methods rely on simple machine learning models, they perform poorly on such text. To address this drawback, a text data mining algorithm based on the fusion of a convolutional neural network (CNN) model and a deep Boltzmann machine (DBM) model is proposed in this paper. The method combines the strong feature extraction of the CNN and DBM models to realize dual feature extraction. By constructing a tag tree and designing an effective hierarchical network, it achieves hierarchical classification. At the same time, the model suppresses the effect of input noise on classification. Experimental results show that the improved algorithm achieves good classification results in special text data mining.

1. Introduction

With the in-depth study of text classification models in industry and academia, text information has grown exponentially. Mining massive amounts of information to extract useful knowledge has therefore become a hot issue.

Text data mining is defined as the process of inducing one or more categories for document data based on different document characteristics. Initially, text classification relied mainly on naive Bayes machine learning methods [1]. Subsequently, a series of machine learning algorithms, including the K-nearest neighbor algorithm [2], the support vector machine (SVM), neural networks [3], least squares [4], and decision trees [5], were widely applied to text classification. Recently, the application of the SVM has become a hot research direction in the field of text classification [6]. The K-nearest neighbor, least squares, and decision tree methods use simpler models with higher efficiency and can be further optimized and improved. Literature [7] proposed a graph neural network (GNN) based inductive text classification method: it first adopts a GNN to learn fine-grained word representations from their local structures and then aggregates the word nodes into document embeddings to obtain classification results. Literature [8] proposed a new term weighting strategy that makes more effective use of the nonoccurrence information of terms; it also performs intraclass document scaling to better represent the discrimination ability of terms that appear in different numbers of documents of the same class.

In terms of improving generalization ability, a selective ensemble theory has been proposed and has achieved good results in text classification applications. However, these models are shallow methods. When the data to be processed are massive and high-dimensional, the task is called complex data classification, and the limitations of algorithms based on this theory become obvious: their generalization ability is insufficient and cannot meet the requirements of text classification. Therefore, obtaining a deep machine learning method with strong generalization ability has become the mainstream of research.

The task of text classification can be divided into three steps: text preprocessing, text representation, and classification model construction. To manage text information well, it is necessary to extract and classify text content scientifically and reasonably. Earlier text representations are generally count-based, which has two drawbacks. First, such a method must assume that words are independent of each other, whereas words are in fact related, so the linguistic relations between words are ignored. Second, feature selection requires human intervention, so the extracted features are high-dimensional and sparse, and the representation and generalization ability of the text are very poor. In addition, special text contains a large number of professional terms and abbreviations, large datasets, diverse topics, and an uneven label distribution. Because existing special text categorization methods use simple machine learning models, their categorization performance is reduced.

The contributions of this paper are as follows: (1) To overcome the shortcomings of the abovementioned classification methods, a text data mining algorithm based on the fusion of a convolutional neural network (CNN) model and a deep Boltzmann machine (DBM) model is proposed. (2) The method combines the two models to achieve dual feature extraction, and hierarchical classification is realized by constructing a tag tree and designing an effective hierarchical network. (3) The effect of input noise on classification is suppressed. The method can effectively classify documents containing a large number of specialized terms, abbreviations, and short texts, and it performs well.

The structure of this article is as follows. Section 2 introduces common text classification methods in data mining. Section 3 presents the improved text data mining method based on CNN and DBM. Section 4 shows the experimental results and analysis. Section 5 concludes the paper.

2. Common Text Classification Methods in Data Mining

There are two main approaches to text classification: rule-based and statistics-based. At present, popular machine learning methods mainly include the support vector machine (SVM), logistic regression, naive Bayes, decision trees, K-nearest neighbor, artificial neural networks, ensemble learning, label-correlation classification, and hierarchical classification methods.

Support vector machine (SVM): the principle of the SVM is to find a hyperplane that meets the classification requirements so that the points in the training set are separated from the classification plane by as large a margin as possible. When trained to convergence on big data, the SVM is very slow and requires large parallel computing resources and equipment with large storage capacity. Its advantage is that it overcomes the influence of the sample distribution well, so the experimental effect and the generalization ability are very good. The SVM is a shallow linear model for classifying data; if the data cannot be separated in the low-dimensional vector space, a kernel mapping can be used to find the best hyperplane in a higher-dimensional space.

Logistic regression (LR): the LR model selects parameters according to the input variable z and computes the output variable, that is, the probability that the label is 1. The hypothesis of the logistic regression model is shown in equation (1), and the S-shaped sigmoid function is shown in equation (2):

h_θ(x) = g(θ^T x),  (1)

g(z) = 1 / (1 + e^(−z)).  (2)

The logistic regression model is obtained from equations (1) and (2), where x is the feature vector of the classification target.
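As a minimal illustration (not the paper's implementation), the logistic regression hypothesis in equations (1) and (2) can be sketched in Python with NumPy; the names `sigmoid` and `predict_proba` are chosen here for clarity:

```python
import numpy as np

def sigmoid(z):
    """S-shaped logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Probability that the label of feature vector x is 1: g(theta^T x)."""
    return sigmoid(np.dot(theta, x))
```

With all-zero parameters the model is maximally uncertain and outputs a probability of 0.5 for any input.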

3. Improved Text Data Mining Method Based on CNN and DBM

3.1. The Text Classification of CNN and DBM

The CNN is a deep learning model characterized by weight sharing and is an extension of the BP neural network [9, 10]. The CNN uses gradient descent to adjust its weights; the weights are adjusted in the direction of steepest gradient, which improves the network's convergence speed. In feature mapping, the neurons of a map share the same weights, which enables parallel learning of the network; this is a feature that distinguishes the CNN from an ordinary neural network.
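Weight sharing means one small kernel is reused at every position of the input. A minimal sketch (assuming word embeddings as input, not the paper's exact architecture) of a single shared-weight 1-D convolution over a text sequence:

```python
import numpy as np

def conv1d_text(embeddings, kernel, bias=0.0):
    """Slide one shared kernel over a sequence of word embeddings.

    embeddings: (seq_len, emb_dim) matrix, one row per word.
    kernel: (window, emb_dim) weights, reused at every position
    (this reuse is the weight sharing that distinguishes the CNN).
    Returns a ReLU-activated feature map of length seq_len - window + 1.
    """
    seq_len, _ = embeddings.shape
    window = kernel.shape[0]
    out = np.empty(seq_len - window + 1)
    for i in range(seq_len - window + 1):
        out[i] = np.sum(embeddings[i:i + window] * kernel) + bias
    return np.maximum(out, 0.0)
```

Because the same kernel weights are applied everywhere, the number of trainable parameters is independent of the sequence length.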

The DBM is a model structure composed of restricted Boltzmann machines (RBMs) connected as an undirected graph, with the RBM as its basic modeling unit. A schematic diagram of the DBM is shown in Figure 1. Its training consists mainly of unsupervised pretraining and supervised fine-tuning [11], which follow the network structure chosen when selecting the network nodes. The DBM can efficiently combine local and global feature information [12]. A set of visible units makes up its input layer (v); the hidden layers (h) consist of a number of hidden units in sequence; and the output layer completes the DBM model. Adjacent layers in the model are connected by undirected edges.

The DBM has three main advantages. First, the weights can be updated using prior knowledge, which enables good feature extraction. Second, updating the weights with prior knowledge effectively suppresses input noise. Third, the weights of neighboring nodes can be sampled and computed simultaneously [13, 14], which yields a more accurate text representation. The DBM also has its own shortcoming: as the number of network layers and connected nodes increases, the computational complexity grows exponentially.
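The RBM building block described above alternates between sampling the hidden units given the visible units and vice versa. A minimal sketch of one such Gibbs sampling step (illustrative only; variable names and the binary-unit assumption are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_gibbs_step(v, W, b_h, b_v):
    """One Gibbs sampling step in a binary RBM, the DBM's building block.

    v: visible vector; W: (n_visible, n_hidden) weight matrix shared by
    the undirected connections; b_h, b_v: hidden and visible biases.
    Returns P(h=1|v), a sampled h, and the reconstruction P(v=1|h).
    """
    p_h = sigmoid(v @ W + b_h)                     # activate hidden layer
    h = (rng.random(p_h.shape) < p_h).astype(float)  # sample binary states
    p_v = sigmoid(h @ W.T + b_v)                   # reconstruct visible layer
    return p_h, h, p_v
```

Note that the same matrix W is used in both directions, reflecting the undirected connections between adjacent layers.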

3.2. The Improved Text Classification Method

The improved text categorization method proposed in this paper is based on improvements to the CNN and DBM models. The improvement of the CNN and DBM model components consists of three steps. The overall framework is shown in Figure 2.

In order to improve the classification accuracy of the CNN and DBM model, the third step of the model, namely, hierarchical classification, is improved.

Figure 3 is a detailed framework diagram of the improved model; the middle part is the feature extraction layer. In this step, the CNN model is adopted to extract local features from the input text representation and to supplement them with global features. Then, the DBM fuses the two features and finally classifies them.

In this framework, the output of the CNN is the extracted local text feature x_l, and x_g is the entity feature, which serves as the global feature. The input dimensions of both are the same, and together they constitute the input of the DBM model, which is expressed as follows:

v = [x_l; x_g].

Then, each time the signal passes through a hidden layer, the corresponding weights are obtained. The model is then pretrained and fine-tuned, tested, and finally outputs the label of the target sample. In addition, in order to speed up training, the ReLU activation function is adopted in this paper.
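The fusion and layerwise propagation described above can be sketched as follows (an illustrative NumPy sketch, assuming concatenation as the fusion step and deterministic ReLU forward passes rather than the paper's full DBM training):

```python
import numpy as np

def relu(z):
    """ReLU activation used to speed up training."""
    return np.maximum(z, 0.0)

def dbm_forward(x_local, x_global, weights, biases):
    """Fuse the CNN local feature with the global entity feature
    (both assumed to have the same dimension) and propagate the fused
    vector through successive hidden layers with ReLU activations."""
    assert x_local.shape == x_global.shape
    h = np.concatenate([x_local, x_global])  # v = [x_l; x_g]
    for W, b in zip(weights, biases):        # one weight matrix per layer
        h = relu(h @ W + b)
    return h
```

Each (W, b) pair corresponds to one hidden layer of the DBM; stacking more pairs deepens the network.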

3.3. Hierarchical Classification

The output label classification of the DBM model is realized by designing a label tree hierarchy (LTA). The LTA organizes the labels in a tree structure and renames them along the tree paths to form new labels. According to the characteristics of the experimental datasets adopted in this paper, all labels are divided into two layers. The first layer is a coarse classification, corresponding to the parent nodes; the second layer is a fine classification, corresponding to the leaf nodes.
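The renaming along tree paths can be sketched as follows (the category names in the usage example are hypothetical, not taken from the paper's datasets):

```python
def rename_labels(label_to_parent):
    """Rename each fine (leaf) label with its coarse (parent) prefix so
    that the two-layer tree path becomes the new label."""
    return {leaf: f"{parent}/{leaf}" for leaf, parent in label_to_parent.items()}

def coarse_label(tree_label):
    """Recover the first-layer (parent) class from a tree-path label."""
    return tree_label.split("/", 1)[0]
```

For example, `rename_labels({"cardiology": "medicine"})` maps the leaf `cardiology` to `medicine/cardiology`, so the coarse class can be read off the prefix at classification time.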

This layering introduces some errors. The error is the difference between the model's classification and the true classification, and it is fed back to the CNN in the first step of the model. The model receives the feedback error, corrects it, and adjusts the weights until accurate classification is achieved.

3.4. The Evaluation Index of Text Classification Performance

The evaluation index of text classification method is based on the prediction of text classification. In general, there are three categories of indicators, namely, basic indicators, macro- and microaverage indicators, and ROC curve indicators.

The basic performance indicators of text classification include the precision P, the recall R, the F-measure, and the similarity S. For a given category, let TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The precision P measures the exactness of the retrieval system and is defined as

P = TP / (TP + FP).  (3)

The recall R measures the completeness over the whole document collection and is defined as

R = TP / (TP + FN).  (4)

The F1 measure is selected as the classification index; it is the harmonic mean of the precision and recall in equations (3) and (4):

F1 = 2PR / (P + R).  (5)

The similarity S is defined as

The basic indicators P and R measure performance on a single category. The metrics for the entire dataset are the macroaverage and the microaverage. The macroaverage is the arithmetic mean of the per-category indicators, while the microaverage pools the counts of all categories before computing the indicators, for example,

Macro_P = (1/n) Σ_{i=1}^{n} P_i,  Micro_P = Σ_i TP_i / Σ_i (TP_i + FP_i),

where n represents the number of categories. It can be seen that the macroaverage has the characteristic of weight sharing: each category receives the same weight. The pooled counts of the microaverage make it more susceptible to the large categories.
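As an illustration (not the paper's evaluation code), the per-category counts and the macro/micro F1 averages described above can be computed as follows:

```python
def per_class_counts(y_true, y_pred, label):
    """Count TP/FP/FN for one category."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def macro_micro_f1(y_true, y_pred):
    """Macro F1 averages per-class F1 scores; micro F1 pools the counts."""
    labels = sorted(set(y_true) | set(y_pred))
    tps = fps = fns = 0
    f1s = []
    for lab in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, lab)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
        tps += tp; fps += fp; fns += fn
    macro = sum(f1s) / len(labels)
    micro_p = tps / (tps + fps) if tps + fps else 0.0
    micro_r = tps / (tps + fns) if tps + fns else 0.0
    micro = 2 * micro_p * micro_r / (micro_p + micro_r) if micro_p + micro_r else 0.0
    return macro, micro
```

On an imbalanced dataset, the macro score is pulled down by poorly classified small categories, whereas the micro score mostly reflects the large ones.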

The ROC curve is a comprehensive indicator of continuous variables of sensitivity and specificity. If the area formed by the indicator curve is larger, it reflects the higher accuracy of the algorithm.
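The area under the ROC curve can be computed by ranking: it equals the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal sketch (ignoring tied scores, which the paper does not discuss):

```python
def roc_auc(scores, labels):
    """Rank-based AUC: the fraction of positive-negative pairs in which
    the positive sample receives the higher score. labels are 0/1."""
    pairs = sorted(zip(scores, labels))          # ascending by score
    pos = sum(labels)
    neg = len(labels) - pos
    rank_sum = 0.0
    for rank, (_, y) in enumerate(pairs, start=1):
        if y == 1:
            rank_sum += rank                     # sum ranks of positives
    # Mann-Whitney U statistic normalized by the number of pairs
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)
```

A larger value corresponds to a curve closer to the upper-left corner, i.e., higher accuracy of the algorithm.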

4. Experimental Results and Analysis

Dataset 1 used in this experiment is a medical document dataset used for performance comparison. The total number of samples is 9666, divided into 39 categories; the corresponding type is multiclass. Dataset 2 is selected from the BioTex dataset; the total number of samples is 1000, divided into 168 categories, and the corresponding type is multicategory. Dataset 3 is selected from the Medline dataset; the total number of texts reaches 1,000,000, divided into 150 categories, and the corresponding type is multilabel. These three experimental subjects can better verify the generalization ability of the model proposed in this section. Table 1 compares the three datasets.

For the experimental subjects selected in this paper, the ratio of training samples to test samples is 7 : 3. In addition, the sliding window of the CNN model shifts with a step size of 50, so as to avoid changing the meaning of the represented words.
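The 7 : 3 partition can be sketched as a simple shuffled split (an illustrative sketch; the paper does not specify its splitting procedure or random seed):

```python
import random

def split_7_3(samples, seed=42):
    """Shuffle the corpus and split it into training and test sets at 7:3."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(samples) * 0.7)
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test
```

Every sample lands in exactly one of the two sets, so no document is both trained on and evaluated.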

The performance indicators for different features of dataset 1 are compared; the specific data are shown in Tables 2 and 3. From the different indicators, it can be concluded that the improved model proposed in this paper outperforms the other models regardless of whether BOW+ or DSE features are used. For the shallow model approaches, BOW+ performs better than DSE, while DSE performs better for the model proposed in this paper and for the improved CNN model.

Through experiments on different models and feature representation methods of dataset 2, Tables 4 and 5 show that the performance of the improved model proposed in this paper is superior to other models under different feature representation methods. For the shallow model method, BOW+ has better performance than DSE, while DSE represents better performance for the model proposed in this paper and the improved CNN model.

Following the analysis results in Tables 2–5, more sample data can be obtained from dataset 3. The BOW+ and DSE feature representations are used to conduct performance comparison experiments on the 9 models, and the experimental results are shown in Tables 6 and 7.

ROC performance experiments were conducted on the three datasets. The ROC curves of the five models on the three medical abstract datasets are shown in Figures 4–6. The abscissa is the specificity and the ordinate is the sensitivity; the closer the curve is to the upper left, the better the performance. It is not difficult to see from the figures that the improved method has the best performance on the medical, BioTex, and Medline datasets.

Figures 4–6 show the ROC performance comparison of the SVM, LDA, CNN_H, C-B_FLAT, and improved methods on the three datasets. We can draw the following conclusions. First, deep learning models outperform shallow learning methods. Second, hierarchical classification outperforms flat classification. Third, the improved model proposed in this paper achieves the best performance on the different datasets.

5. Conclusion

Owing to the characteristics of special text, existing special text data mining methods that use simple machine learning models for classification perform poorly. To solve this problem, a new improved data mining method based on the CNN and DBM models is proposed. The method combines the strong feature extraction of the CNN and DBM models to achieve dual feature extraction. It realizes the reclassification of labels by constructing a tree-structured label hierarchy and designing an effective hierarchical network, and the model suppresses the influence of input noise on classification. The experimental results show that the improved model works well on special-domain text. Classification is only one part of mining, and further information mining will be analyzed in future research.

Data Availability

The labeled dataset used to support the findings of this study is available from the author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work was supported in part by the Civil Aviation Safety Capability Fund (0242008), the Key Laboratory of Civil Aviation Flight Technology and Flight Safety (FZ2020ZZ02), the National Natural Science Foundation of China Civil Aviation Joint Fund (U2033213), and the National Key Laboratory Project "Research on Nonlinear Dynamic Characteristics of Helicopter High Power Density Gear Transmission System" (No. HTL-0-19K01).