AI-enabled Decision Support System: Methodologies, Applications, and Advancements 2021
Chinese Language and Literature Online Resource Classification Algorithm Based on Improved SVM
With the rapid development of network technology and the explosive growth of online information resources, a large number of Chinese language and literature resources have appeared on the Internet and have become an increasingly important source of information. However, existing search engines often return large amounts of irrelevant content. How to quickly and effectively retrieve useful resources from this mass of information and classify Chinese language and literature resources is therefore the focus of this paper. Starting from the current state of the field, the paper surveys a large body of literature to review the research status of improved SVM algorithms at home and abroad, demonstrates the idea and advantages of SVM through its algorithm and experimental process, and then further improves the SVM algorithm. The improved SVM proposed in this paper substantially increases classification efficiency and facilitates the rapid retrieval of closely matching information from the many online resources of Chinese language and literature.
1. Introduction
Since the 1990s, computer networks and database technologies have gradually matured, providing an effective platform for information sharing in many fields. Abundant information, on the one hand, brings unprecedented convenience to people's lives; on the other hand, it raises questions that people must face: how can the required information be extracted effectively from such a large variety of information, and how can the extraction process be made more accurate and efficient? Data-based machine learning is one result of efforts to solve such problems. Its basic idea is to analyze known data, discover regularities in it, and use them to predict unknown data. Text categorization is an important application of machine learning theory and an effective means of obtaining information. From a mathematical point of view, it is a mapping process: text of unknown category is mapped to a known category. After classification, originally disordered Chinese language and literature resources are assigned to definite categories, so that people can follow consistent rules when looking them up. When the number of Chinese language and literature resources is large, the significance of such classification is obvious. The text classification method based on SVM overcomes the problem of data redundancy: it works from a limited number of samples and obtains the best classification effect available from the information contained in the training texts. At the same time, SVM is derived from VC-dimension theory and the structural risk minimization (SRM) principle of statistical learning theory, which effectively alleviates the overfitting problem seen in other machine learning algorithms; that is, by minimizing the structural risk on a limited number of training samples, SVM bounds the test error rather than merely the training error.
2. Related Work
Research abroad on automatic text classification began in the late 1950s, when H. P. Luhn first proposed automatic text classification based on the idea of word-frequency statistics. In 1960, Maron published the first paper on automatic text classification, and G. Salton, K. Spärck Jones, and others subsequently conducted research in this field, after which a variety of new models and methods for automatic text classification emerged. Compared with English text classification, the biggest difference in Chinese text classification lies in the preprocessing stage: Chinese language and literature texts require word segmentation, while English texts are delimited by spaces. In fact, once preprocessing is complete, text of either kind is converted into a matrix of sample vectors, and the subsequent classification process is similar. Research on text classification technology in China started relatively late but has gradually matured, progressing from simple dictionary-lookup methods to the current statistics-based methods. Chinese research on automatic text classification began in the early 1980s and has generally passed through three development stages: feasibility studies, auxiliary classification systems, and automatic classification systems. In 1981, Hou Hanqing first explored the automatic classification of Chinese texts. More and more domestic scholars then entered this field, and the main object of classification gradually shifted from English to Chinese. After years of research, the gap between Chinese automatic text classification technology and the international state of the art has been narrowing steadily. The classic text classification algorithms, including the decision tree method, neural network method, genetic algorithm, Bayesian method, and K-nearest neighbor algorithm, have been widely used in web text classification.
There has been much domestic research on text representation and text classification. For example, Chinese text classification based on N-grams removes the dependence on dictionaries and word segmentation and achieves domain-independent, timely text classification. In recent years, domestic scholars have applied text classification techniques in different fields and produced many new methods: group-based classification methods, fuzzy-rough-set-based methods, multi-classifier fusion methods, and text classification models based on RBF networks. Ma and Tian select high-frequency words that appear in the same window unit and use the Apriori algorithm to mine the largest frequent word co-occurrence sets from these high-frequency words, thereby extending the VSM representation of documents. They also compared the effectiveness of the support vector machine algorithm and the traditional KNN algorithm in text classification through experiments, proposed an improved text preprocessing method addressing the shortcomings of the support vector machine algorithm, and demonstrated the effectiveness of the new method. Addressing the vector space model in text representation, Liu Haifeng et al. used a category space model to represent text as a matrix and effectively exploited the category information of the text to implement a text classification algorithm based on category information. Li Xiangdong et al., based on an analysis of the characteristics of journals' permanent topic columns, defined the training text set, test set, and classification (column) effect evaluation required for automatic classification and used the Jensen–Shannon divergence to measure the distance between texts; they also improved the way the basic KNN algorithm determines the k value according to column dynamics and achieved good results.
Key objectives of the study include the following:
(i) To design and develop an efficient methodology to quickly and effectively obtain useful resource information, such as Chinese language and literature, from information resources and classify it.
(ii) To improve the performance of the SVM algorithm and demonstrate the idea of the improved SVM along with its advantages.
3.1. Support Vector Machine (SVM)
The basic idea of the support vector machine is to establish an optimal hyperplane as the decision surface, so that the margin between positive and negative samples is maximized. Because different kernel functions lead to different SVM variants, choosing the right kernel function is a crucial step; common forms include the polynomial kernel, the Gaussian radial basis kernel, and the sigmoid kernel. The advantage of the support vector machine lies in its ability to learn in sparse, high-dimensional spaces: it uses relatively few training samples to jointly minimize the empirical error and the complexity of the classifier, avoids local minima, and offers good generalization and classification accuracy. Importantly, in contrast to neural networks, its objective function is unimodal, so it can largely be driven to the global optimal solution.
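As a concrete illustration of the kernel choices mentioned above, the following self-contained Python sketch computes the common kernel forms on toy vectors. This is illustrative only (the paper's system itself is implemented in Java); the function names and parameter defaults are our own choices.

```python
import math

# Common SVM kernel forms, computed on toy vectors. Illustrative only.

def linear_kernel(x, z):
    # plain inner product <x, z>
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian radial basis kernel exp(-||x - z||^2 / (2 sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid_kernel(x, z, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * linear_kernel(x, z) + coef0)

x, z = [1.0, 0.0], [0.0, 1.0]
print(rbf_kernel(x, x))  # a point compared with itself always gives 1.0
print(rbf_kernel(x, z))
```

Note that the Gaussian RBF kernel used later in this paper depends only on the distance between the two points, which is why its value for identical inputs is always 1.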
3.2. Improved Multi-Class SVM Algorithm
In the process of training on an SVM dataset, singular points (outliers) tend to receive large Lagrange multipliers and thus become support vectors. Many improved support vector machines therefore exist. RSVM (robust support vector machine) establishes an adaptive decision surface by calculating the distance between the centers of the training data classes and the data points, but its additional parameters are difficult to determine. SVND (support vector novelty detection) can effectively extract and detect singularities in normal data, but this method usually applies only to one-class classification problems. FSVM (fuzzy support vector machine) uses fuzzy memberships of the training data to reduce the influence of outliers, but how to choose the membership function remains a difficult problem. Therefore, this paper proposes an improved multi-class SVM algorithm (WSVM) to address the singular-value sensitivity of the classic SVM. The basic idea is to assign different weights to different training data, so that the SVM constructs the decision surface according to the relative importance of each item of data. As in the standard SVM model, the original optimization problem is transformed into a quadratic programming problem to construct the optimal decision surface. Among the available kernels, we use the Gaussian kernel in this study.
The training dataset is given in formula (1):

\[ T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}, \quad x_i \in \mathbb{R}^n,\ y_i \in \{+1, -1\},\ i = 1, \ldots, l. \tag{1} \]
The original optimization problem is given in formulas (2)–(4):

\[ \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i, \tag{2} \]
\[ \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, l, \tag{3} \]
\[ \xi_i \ge 0, \quad i = 1, \ldots, l, \tag{4} \]

where \(C > 0\) is a penalty parameter: the objective minimizes not only \(\frac{1}{2}\|w\|^2\) but also the total slack \(\sum_{i=1}^{l} \xi_i\).
In order to derive the dual problem of (2)–(4), the Lagrange function (5) is introduced:

\[ L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \bigl[ y_i (w \cdot x_i + b) - 1 + \xi_i \bigr] - \sum_{i=1}^{l} \beta_i \xi_i, \tag{5} \]

where \(\alpha = (\alpha_1, \ldots, \alpha_l)\) and \(\beta = (\beta_1, \ldots, \beta_l)\) are both multiplier vectors of the Lagrange function. Setting the partial derivatives of \(L\) with respect to \(w\), \(b\), and \(\xi_i\) to zero gives the optimality conditions:

\[ w = \sum_{i=1}^{l} \alpha_i y_i x_i, \tag{6} \]
\[ \sum_{i=1}^{l} \alpha_i y_i = 0, \tag{7} \]
\[ C - \alpha_i - \beta_i = 0, \quad i = 1, \ldots, l. \tag{8} \]
Using constraint (8) to eliminate \(\beta\), the problem becomes one in the variable \(\alpha\) alone. It reduces to a minimization problem, that is, a convex quadratic programming problem:

\[ \min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j (x_i \cdot x_j) - \sum_{i=1}^{l} \alpha_i, \]
\[ \text{s.t.} \quad \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, l. \]
The weighted SVM algorithm treats the training data differently according to per-datum weights: data carrying important information receive higher weight, and data carrying less important information receive lower weight. The weighted training dataset is

\[ T = \{(x_1, y_1, s_1), (x_2, y_2, s_2), \ldots, (x_l, y_l, s_l)\}, \tag{15} \]

where \(s_i \in [\varepsilon, 1]\) is the weight of \((x_i, y_i)\) and \(\varepsilon\) is a sufficiently small positive number. Like the SVM, the weighted SVM achieves a high correct classification rate by maximizing the margin while minimizing the classification error rate. The difference is that the improved SVM uses the weighting to reduce the impact of unimportant data on the classification result and enhance the impact of important data. With the data weighted, the optimal decision surface is constructed by transforming the optimization problem into formulas (15)–(17):

\[ \min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} s_i \xi_i, \tag{16} \]
\[ \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l. \tag{17} \]
It can be seen from formula (16) that introducing the weight \(s_i\) scales the slack penalty: a small \(s_i\) greatly reduces the influence of the slack variable \(\xi_i\) in the optimization problem, so the corresponding data point matters little to the classification. The weighted optimization problem can likewise be transformed into a convex quadratic programming problem, formulas (18)–(20):

\[ \min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j) - \sum_{i=1}^{l} \alpha_i, \tag{18} \]
\[ \text{s.t.} \quad \sum_{i=1}^{l} y_i \alpha_i = 0, \tag{19} \]
\[ 0 \le \alpha_i \le s_i C, \quad i = 1, \ldots, l. \tag{20} \]
\(K(x_i, x_j)\) is the kernel function, i.e., an inner product in the feature space. This paper adopts the Gaussian radial basis function (RBF), \(K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)\), for its strong learning ability and wide convergence range. When \(s_i = 1\) for all \(i\), the improved SVM reduces to the original support vector machine problem, and the trade-off made by the system can be tuned through different values of \(s_i\): the smaller \(s_i\) is, the smaller the contribution of \((x_i, y_i)\) to the construction of the maximum-margin hyperplane, and vice versa.
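The weighting idea can be illustrated with a toy example. The sketch below is a minimal linear, subgradient-descent stand-in for the weighted SVM (the paper solves the kernelized quadratic program instead): each sample's weight \(s_i\) scales its slack penalty \(s_i C\), so a downweighted outlier barely influences the decision surface. The data and all names are illustrative, not the paper's implementation.

```python
# Toy weighted soft-margin SVM, trained by subgradient descent on
# (1/2)||w||^2 + C * sum_i s_i * hinge_i. Illustrative sketch only.

def train_weighted_svm(X, y, s, C=1.0, lr=0.01, epochs=200):
    n_feat = len(X[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        for i in range(len(X)):
            margin = y[i] * (sum(w[j] * X[i][j] for j in range(n_feat)) + b)
            for j in range(n_feat):
                # regularizer gradient plus weighted hinge subgradient
                grad = w[j] / len(X)
                if margin < 1:
                    grad -= C * s[i] * y[i] * X[i][j]
                w[j] -= lr * grad
            if margin < 1:
                b += lr * C * s[i] * y[i]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Two clean clusters plus one mislabeled outlier whose weight s_i is tiny.
X = [[2, 2], [3, 2], [2, 3], [-2, -2], [-3, -2], [-2, -3], [2.5, 2.5]]
y = [1, 1, 1, -1, -1, -1, -1]   # the last point is a mislabeled outlier
s = [1, 1, 1, 1, 1, 1, 0.01]    # ...so it is almost ignored
w, b = train_weighted_svm(X, y, s)
print(predict(w, b, [2.5, 2.5]))  # sides with the positive cluster
```

With the outlier's weight near \(\varepsilon\), the learned separator follows the two clean clusters; with \(s_i = 1\) everywhere, the same code reduces to an ordinary soft-margin SVM, mirroring the reduction noted above.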
SVM is inherently a binary classifier: it separates data points into two classes and does not directly support multi-class classification. For multi-class problems, the same principle is therefore applied after decomposing the multi-class problem into multiple binary classification problems, for example by training each class against all the others.
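The decomposition just described can be sketched as follows: a multi-class task is split into one binary task per class (one-vs-rest), and the class whose binary scorer responds most strongly wins. A toy perceptron stands in for the binary SVM here, since the reduction is independent of the base learner; the labels and data points are invented for illustration.

```python
# One-vs-rest reduction of a multi-class problem to binary problems.
# A toy linear scorer stands in for the binary SVM. Illustrative only.

def train_binary(X, y_bin, lr=0.1, epochs=300):
    # y_bin in {+1, -1}; returns parameters of a linear scoring function
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y_bin):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def train_one_vs_rest(X, labels):
    models = {}
    for cls in sorted(set(labels)):
        y_bin = [1 if lab == cls else -1 for lab in labels]  # cls vs. rest
        models[cls] = train_binary(X, y_bin)
    return models

def predict_ovr(models, x):
    def score(wb):
        w, b = wb
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    # the class with the largest binary decision value wins
    return max(models, key=lambda cls: score(models[cls]))

X = [[0, 1], [1, 1], [5, 5], [6, 5], [0, -5], [1, -6]]
labels = ["poetry", "poetry", "prose", "prose", "drama", "drama"]
models = train_one_vs_rest(X, labels)
print(predict_ovr(models, [5.5, 5.0]))
```

One-vs-rest trains one binary model per class; the alternative one-vs-one scheme trains a model per class pair. Either way, each binary subproblem can be solved by the (weighted) SVM described above.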
4. Experiments and Discussion
4.1. Performance Analysis of Chinese Language and Literature Online Resource Classification Algorithm Based on Improved SVM
In the experiments, 1329 articles were downloaded from the Internet and divided into three first-level categories; the training set contained 1038 articles and the test set contained the remaining articles. The performance of a classifier is usually measured by evaluation indicators, and those commonly used in the classification of Chinese language and literature are recall, precision, and the F-measure, each with its own evaluation objective and calculation method. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents, and precision is the ratio of the number of relevant documents retrieved to the total number of documents returned. The F-measure is the most commonly used method of measuring the overall classification effect, and the micro-average evaluates overall classification performance as the average of the indicators over all instance documents, as shown in (24):

\[ P = \frac{A}{A + B}, \qquad R = \frac{A}{A + C}, \qquad F = \frac{2PR}{P + R}, \tag{24} \]

where A is the number of documents correctly classified into a category, B is the number of documents that do not actually belong to the category but are classified into it, and C is the number of documents that belong to the category but are not classified into it. Recall (R) reflects how many of the documents that meet the requirements are actually retrieved, while precision (P) indicates how many of the retrieved documents truly meet the requirements, i.e., the accuracy of the classification results. Precision and recall constrain each other: as precision increases, recall tends to decrease, and vice versa.
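These measures can be made concrete with a small worked example; the category names and predictions below are invented for illustration. For each category, A counts correctly assigned documents, B the false positives, and C the false negatives; micro-averaging pools A, B, and C over all categories before computing the ratios.

```python
# Per-category counts and micro-averaged F-measure. Illustrative example.

def per_class_counts(true_labels, pred_labels, cls):
    A = sum(1 for t, p in zip(true_labels, pred_labels) if p == cls and t == cls)
    B = sum(1 for t, p in zip(true_labels, pred_labels) if p == cls and t != cls)
    C = sum(1 for t, p in zip(true_labels, pred_labels) if p != cls and t == cls)
    return A, B, C

def micro_f1(true_labels, pred_labels):
    A = B = C = 0
    for cls in set(true_labels):
        a, b, c = per_class_counts(true_labels, pred_labels, cls)
        A, B, C = A + a, B + b, C + c
    precision = A / (A + B)
    recall = A / (A + C)
    return 2 * precision * recall / (precision + recall)

true_labels = ["poetry", "poetry", "prose", "prose", "drama", "drama"]
pred_labels = ["poetry", "prose", "prose", "prose", "drama", "poetry"]
print(micro_f1(true_labels, pred_labels))  # 4 of 6 correct
```

In single-label classification, the pooled false positives and false negatives coincide, so micro-precision, micro-recall, and the micro-F-measure all equal the overall accuracy (here 4/6).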
The F-measure combines these two indicators to evaluate the overall performance of the classifier. This experiment adopts the commonly used evaluation measures, namely recall, precision, and the F-measure, as shown in Figure 1.
4.2. System Design of Chinese Language and Literature Online Resource Classification Based on Improved SVM
The traditional SVM-based Chinese language and literature text classification system is divided into two modules, classification function training and test text classification, each of which includes Chinese word segmentation, feature extraction, feature selection, and text vector representation. As a result, the two modules duplicate a large amount of functionality and are strongly coupled. In implementing the SVM-based Chinese language and literature text classification system, this paper adopts a more rational module design: the whole system is divided into a text preprocessing module, a classification function training module, and a text classification module, with Chinese word segmentation, feature extraction, feature selection, and text vector representation integrated into the text preprocessing module. The preprocessing module provides a common input interface for training text and test text and outputs the feature vectors of either according to the situation. In this way, the text vector representation function no longer needs to appear in the training and classification modules, which benefits the development and maintenance of the system and gives it better performance. This system is a Chinese language and literature text classification system based on the nonlinearly separable SVM algorithm; it uses the polynomial kernel function and solves the optimization with the feasible direction method.
4.3. Implementation of the System of Chinese Language and Literature Online Resource Classification Based on Improved SVM
System development is implemented in the Windows environment using the Java language, specifically as follows. Development platform: Microsoft Windows, Eclipse 3.2 + MyEclipse 5.0, jsdk1.5. Development language: Java.
The traditional SVM-based Chinese language and literature text classification system mainly comprises a classification function training module and a text classification module, corresponding to the training of the classification function and the classification of the test text, both of which include converting text into feature vectors. In the classification function training module, the training text is converted into feature vectors with category information; in the text classification module, the test text is converted into feature vectors without category information. These two processes are in fact very similar, and the only difference is whether the resulting feature vector carries known category information. In view of this, this paper adjusts and improves the structure of the traditional system: the process of representing training text and test text as feature vectors is integrated into the text preprocessing module. Once text vector representation is completed in the preprocessing module, its outputs can be fed into the classification function training module and the text classification module for calculation, which benefits the development and maintenance of the system. After this modular design, the system consists mainly of the text preprocessing module, the classification function training module, and the text classification module.
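The shared preprocessing interface described above can be sketched as follows: one pipeline turns either training or test text into term-frequency feature vectors, attaching the category label only when it is known. Real Chinese word segmentation is assumed to happen upstream (tokens arrive pre-split here), and all names are illustrative rather than the paper's Java classes.

```python
# Shared preprocessing: one entry point for training and test text.
# Illustrative sketch; segmentation is assumed done upstream.

def build_vocabulary(token_lists):
    vocab = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(vocab)}

def to_feature_vector(tokens, vocab):
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:                 # unseen test words are ignored
            vec[vocab[tok]] += 1
    return vec

def preprocess(token_lists, vocab, labels=None):
    vectors = [to_feature_vector(toks, vocab) for toks in token_lists]
    if labels is None:                   # test text: no category information
        return vectors
    return list(zip(vectors, labels))    # training text: keep the label

train_tokens = [["唐诗", "格律"], ["小说", "叙事"]]
vocab = build_vocabulary(train_tokens)
train_data = preprocess(train_tokens, vocab, labels=["poetry", "fiction"])
test_data = preprocess([["格律", "唐诗", "唐诗"]], vocab)
print(test_data[0])
```

The single `preprocess` entry point mirrors the common input interface of the paper's preprocessing module: the same code path produces labeled vectors for training and unlabeled vectors for classification.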
4.4. Experimental Results and Algorithm Performance Analysis
The improvement of this system is mainly reflected in the structure of the system, and its classification efficiency and classification accuracy should be equivalent to the traditional SVM-based Chinese language and literature text classification system. In order to accurately understand its performance, this paper selects 3 sets of data and 1200 Chinese language and literature texts related to various subjects for testing and compares the test results with the test results of the traditional Chinese language and literature text classification system based on SVM algorithm. The specific distribution of text data is shown in Table 1.
During the experiment, when one type of training text is regarded as the positive class, all other types are regarded as the negative class. The specific experimental steps are as follows:
(1) The training text is input into the preprocessing subsystem; after Chinese word segmentation, feature extraction, feature selection, and text vector representation, the training text feature vectors with category labels are obtained and the training vocabulary is established. The feature vectors and training vocabulary are then saved as text files in the specified folder.
(2) The training text feature vectors are read and input into the classification function training subsystem, which solves the quadratic programming problem, determines the classification function parameters (the multipliers and the offset b), and constructs the specific classification function.
(3) The test text is input into the preprocessing subsystem; after Chinese word segmentation, feature extraction, feature selection, and text vector representation, the test text feature vectors without category labels are obtained and saved as text files in the specified folder.
(4) The test text feature vectors are read and input into the classification function, the function value is calculated, and the text category is output according to the function value.
(5) The experimental results are collected, and the three sets of data are tabulated for comparative analysis.
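Steps (2) and (4) above hinge on evaluating the trained classification function: once the quadratic program is solved, the stored multipliers \(\alpha_i\) and offset \(b\) define \(f(x) = \operatorname{sgn}(\sum_i \alpha_i y_i K(x_i, x) + b)\). The sketch below evaluates such a function; the support vectors and multiplier values are made-up stand-ins for a solved model, not results from the paper.

```python
import math

# Evaluating an SVM classification function from a (hypothetical) solved
# dual: f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b).

def rbf(x, z, sigma=1.0):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2 * sigma ** 2))

def decision_value(x, support_vectors, alphas, ys, b, sigma=1.0):
    return sum(a * y * rbf(sv, x, sigma)
               for sv, a, y in zip(support_vectors, alphas, ys)) + b

def classify(x, *model):
    return 1 if decision_value(x, *model) >= 0 else -1

# Hypothetical solved model: two support vectors, one per class.
svs = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [0.8, 0.8]
ys = [1, -1]
b = 0.0
print(classify([0.9, 1.1], svs, alphas, ys, b))  # lands near the positive SV
```

Only the support vectors (points with nonzero \(\alpha_i\)) enter this sum, which is why classification remains cheap even when the training set is large.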
The final result is shown in Figures 2 and 3.
From the experimental results, it can be found that the recall and precision of the SVM-based Chinese text classification system before and after the improvement are equivalent; the differences between them are attributable to the experimental environment and other factors. After adjusting and improving the structure of the system, however, the internal cohesion of each module is stronger and the coupling between modules is weaker, which is conducive to the development and maintenance of the system.
5. Conclusion
With the development and improvement of computer network and database technology, abundant resources have gradually made text classification technology a focus of attention. Because of their solid theoretical foundation and excellent performance, support vector machines have become a research hotspot in text classification and other data mining fields. This article focuses on the key technologies and algorithms of Chinese language and literature text classification. Compared with the traditional SVM-based Chinese language and literature text classification system, this system is distinctive for the following reasons. (1) Chinese word segmentation, feature selection, feature extraction, and text vector representation are integrated into the preprocessing module, which provides a common input interface for the training text and the test text and outputs the feature vector of either according to the situation. The text vector representation function therefore does not need to be involved in the training and classification modules, so the coupling between the modules is weakened and their independence is enhanced, which benefits the development and maintenance of the system. (2) The feasible direction method is used to solve the quadratic programming problems involved in the training process. Since training the classification function is the key step of this system, its time complexity is briefly analyzed below. Suppose the number of training texts is n, the average number of feature words per text is k, and each text is represented as a t-dimensional vector.
In the process of training the SVM classification function, the running time from text segmentation to text vector representation is spent mainly in establishing the training vocabulary and representing the texts as vectors, and it grows with the number of texts n, the average number of feature words k, and the vector dimension t. The feasible direction method is then used to solve the quadratic programming problem involved in the SVM algorithm and complete the training of the classification function; its running time depends on n and a constant factor C, where C is a positive integer much smaller than n, so the total time complexity of training the classification function in this system is dominated by the preprocessing and quadratic programming steps. In addition, in actual application, the training time of the SVM algorithm is also related to the quality of the training library, the dimension of the text vector, and the configuration of the computer [15–23].
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The author declares that there are no conflicts of interest.
References
[1] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony, "Structural risk minimization over data-dependent hierarchies," IEEE Transactions on Information Theory, vol. 44, no. 5, pp. 1926–1940, 1998.
[2] X. Y. Zhang, "Research on automatic classification of Chinese text based on rough set theory," Master's thesis, vol. 6, no. 5, pp. 10–11, 2005.
[3] J. B. Tan, "Design and implementation of an automatic text classification system for online education resources," Technology Application, vol. 4, no. 4, pp. 68–69, 2009.
[4] X. G. Zhang, "About statistical learning theory and support vector machines," Acta Automatica Sinica, vol. 26, no. 1, pp. 32–38, 2000.
[5] F. Huang, "Research on web document feature extraction in basic education search engine," Master's thesis, Nanjing Normal University, Nanjing, China, 2006.
[6] X. Yao, X. D. Wang, Y. X. Zhang, and W. Quan, "Feature selection algorithm based on approximate Markov blanket and dynamic mutual information," Computer Science, vol. 39, no. 8, pp. 1046–1050, 2012.
[7] T. H. Zhang, H. T. Geng, and Q. S. Cai, "An improved SVM and its application in automatic text classification," Microelectronics & Computer, vol. 22, no. 12, pp. 24–27, 2005.
[8] J. N. Ma and D. G. Tian, "Research on automatic classification of Chinese text based on support vector machine," System Engineering and Electronic Technology, vol. 29, no. 3, pp. 475–478, 2007.
[9] H. F. Liu, S. S. Liu, et al., "An automatic text classification model based on category information," Modern Library and Information Technology, vol. 5, no. 4, pp. 72–76, 2010.
[10] X. D. Li, P. Xu, et al., "Research on automatic text classification method based on the KNN algorithm: taking automatic classification of academic journals as an example," Library, Information and Knowledge, vol. 5, no. 4, pp. 71–76, 2010.
[11] J. Yuan, J. Li, and B. Zhang, "Learning concepts from large scale imbalanced data sets using support cluster machines," in Proceedings of the 14th Annual ACM International Conference on Multimedia, pp. 441–450, ACM, Santa Barbara, CA, USA, October 2006.
[12] Q. Song, W. Hu, and W. Xie, "Robust support vector machine with bullet hole image classification," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 32, no. 4, pp. 441–447, 2002.
[13] C.-F. Lin and S.-D. Wang, "Training algorithms for fuzzy support vector machines with noisy data," in Proceedings of the 2003 IEEE Workshop on Neural Networks for Signal Processing, pp. 517–526, Toulouse, France, September 2003.
[14] L. J. Cao, H. P. Lee, and W. K. Chong, "Modified support vector novelty detector using training data with outliers," Pattern Recognition Letters, vol. 24, no. 14, pp. 2480–2484, 2003.
[15] G. Y. Jiang, Y. Zhang, and M. Yu, "Multi-mode and multi-view video coding based on correlation analysis," Chinese Journal of Computers, vol. 30, no. 12, pp. 2206–2208, 2007.
[16] J. Q. Wang, "Research on the construction of a featured database of a university library: taking Minjiang University as an example," Journal of Minjiang University, 2013.
[17] C.-F. Lin and S.-D. Wang, "Fuzzy support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 464–471, 2002.
[18] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.
[19] B. Fei and J. Liu, "Binary tree of SVM: a new fast multiclass training and classification algorithm," IEEE Transactions on Neural Networks, vol. 17, no. 3, pp. 696–704, 2006.
[20] L. Yu and H. Liu, "Efficient feature selection via analysis of relevance and redundancy," Journal of Machine Learning Research, vol. 5, pp. 1205–1224, 2004.
[21] D. Koller and M. Sahami, "Toward optimal feature selection," in Proceedings of the International Conference on Machine Learning, pp. 162–187, Bari, Italy, July 1996.
[22] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, NY, USA, 1995.
[23] J. Pei, K. Zhong, J. Li, and X. Wang, "ECNN: evaluating a cluster-neural network model for city innovation capability," Neural Computing & Applications, vol. 6, pp. 1–13, 2021.