Abstract

The Chinese language is a national symbol and the accumulated product of a long and rich cultural history. It is rich and complex, and many of its topics still merit sustained discussion in academic circles. This study proposes an emotion polarity classification method based on reliability analysis in order to identify emotional tendency in Chinese literature. The combined classifier includes a support vector machine (SVM), a class-center classifier, and K-nearest neighbor (KNN), which together effectively improve the accuracy and efficiency of emotion polarity classification. A Chinese literary emotion analysis model based on unbalanced K-nearest neighbor classification (UKNNC) is also proposed by analysing the characteristics of text structure and emotion expression. The experimental results show that, compared with traditional machine learning methods, the UKNNC method can analyse text sentiment at fine-grained and multiple levels while also improving the accuracy of Chinese literary sentiment analysis.

1. Introduction

The term “Chinese language and literature” refers to literary works created through Chinese editing and communicated through Chinese writing. Analysing the features of language use and artistic conception in Chinese language and literature helps improve human language expression and application. Studying the content covered in Chinese language and literature can help students improve their language skills, communicate better with others, expand their thinking abilities, and better appreciate literary works. In essence, students can grasp the applied context effectively only with a broad understanding of Chinese language and literature, which helps to improve their overall Chinese literacy.

Emotion analysis of Chinese language and literature refers to judging its emotional tendency by analysing and mining information such as emotions, positions, and opinions in the text [1]. It involves many fields such as data mining and information retrieval, and has a wide range of application value. Data-level processing is very important for unbalanced classification [2, 3]: regardless of whether the importance of samples is consistent in practical problems, data-level processing has a relatively positive impact on the final classification results. Traditional learning analysis techniques in the field of education frequently focus on analysing structured data but rarely consider unstructured data, making it difficult to accurately identify learners’ attitudes, emotions, and psychological states [4, 5]. The support of classification methods is essential for successful classification of imbalanced data sets, and a good classification method is the key to success. When traditional classification methods are used on imbalanced data sets, however, test samples from minority classes are frequently misclassified into majority classes, making it easy to overlook minority classes and resulting in poor classification performance [6]. Because the complex distribution of unbalanced data is difficult to capture, current technology in this field still has many flaws. As a result, investigating an effective unbalanced data classification technology [7] is extremely important.

The K-nearest neighbor (KNN) method has become one of the most famous algorithms in the fields of pattern recognition [8–10] and statistics because of its simple algorithm, easy realization, lack of parameters to estimate, and high classification accuracy; it is also one of the earliest nonparametric algorithms applied to automatic text classification in machine learning [11, 12]. However, the KNN method needs to store all the training sample data when calculating the nearest neighbors of each sample to be tested, which leads to a large number of similarity calculations and significantly increases the computational complexity of classification as the sample data set grows [13], thus reducing classification efficiency. Based on this, this study puts forward a Chinese literary sentiment analysis model based on the unbalanced K-nearest neighbor classification (UKNNC) method for learners’ learning experience texts, which classifies the learning experience text information at multiple levels to improve the performance of learners’ sentiment analysis and provide more effective support for teaching design and management.

Emotion analysis of Chinese literature belongs to the category of computational linguistics, which involves many research areas such as artificial intelligence, machine learning, information extraction, information retrieval, and data mining. It is closer to the goal of artificial intelligence than traditional technology and has important research value in both theory and application. Therefore, this study innovatively proposes an emotion analysis model of Chinese language literature based on the UKNNC method. The difference between this method and the traditional voting strategy of combined classifiers is that it analyses the credibility of each sample to decide whether the classifiers need to vote on that sample’s category. Experiments show that the classification speed of this method exceeds that of SVM.

2. Related Work

As far as the nature of the Chinese language is concerned, it involves a wide range of majors; as a whole, liberal arts students are the main teaching objects, and some students do not study Chinese language and literature until they enter university. This also reflects the charm of Chinese language and literature and has brought corresponding attention to the analysis of language application and artistic conception. Literature [14] puts forward the Fisher discrimination rate, which calculates the distance between the positive and negative classes of each feature; the larger the distance, the easier the data are to classify. Literature [15] puts forward the complexity measure (CM) as a data set measurement index; CM calculates, for the samples in the data set, the proportion whose nearest neighbors contain fewer than half same-class samples. The larger the CM, the greater the overlap between positive and negative samples. When the imbalance ratio (IR) is used as the data set evaluation index, IR has no fixed value range, whereas the CM range is 0–1, which makes it more comparable. Literature [16] holds that each sample has its own neighbor distribution: when a sample is surrounded by samples of the same class, its classification is simple, while when it is surrounded by samples of different classes, its classification is more difficult. In literature [17], the authors used Naive Bayes and the maximum entropy method to study emotion classification of news and commentary corpora; experiments found that binary values performed better as feature weights. Literature [18] studied the correlation between extracted subjects and rhetorical evaluation words.
They introduce the relationship extraction into data mining, take the evaluation words and subjects co-occurring in the same sentence as candidate sets, and apply the maximum entropy model to extract the relationship with various features, such as words, parts of speech, semantics, and positions. The results show that degree adverbs can help improve the performance of subjective relation extraction.

In terms of classification method selection and design, the work is primarily based on a thorough examination of how the indexes (parameters, structures, and so on) of traditional methods affect unbalanced data sets, with corresponding improvements to existing methods or the creation of new classification methods. In literature [19], the original training samples are first clustered or blocked, and then representative training samples are selected to replace them. Literature [20] divides samples based on density. According to literature [21], the distance-weighted KNN rule is based on the observation that nearest neighbor points close to the test sample contribute much to classification, whereas distant ones contribute little. Literature [22, 23] proposes a generalised nearest neighbor classification method that assigns different weights to each dimension of the classified data. Literature [24] increases the importance of boundary samples. Literature [25] defines a representative function from two factors, the distance between two samples and their included angle, and replaces the category attribute function with this defined function when calculating weights in the KNN method. In literature [26, 27], by constructing praise and derogatory word lists, a table of influencing factors, a matching word list, and a clear word list for the field of electronic products, an evaluation system for electronic products was built and opinion mining of electronic products was realized, with a correct rate reaching 93%.

Emotion classification and extraction, as the basic research of emotion analysis, provide effective support for the application of emotion analysis, but there is still a big gap between the results of classification and extraction and the specific needs of users. Therefore, in order to narrow the gap with users’ needs, the application research of emotion information becomes essential.

3. Research Method

This study examines the problem of Chinese literary emotion analysis, adopts the classification method of emotion polarity based on reliability analysis and the model of Chinese literary emotion analysis based on UKNNC method, and makes an in-depth study with various linguistic features, which greatly improves the effect of Chinese literary emotion analysis as a whole.

3.1. Traditional KNN

On the basis of the nearest neighbor method, the K-nearest neighbors (KNN) method was developed. It is one of the most widely used classification methods in the fields of data mining and machine learning, as well as the US Census Bureau’s default data preprocessing method. It is now widely used in a variety of fields, including classification, clustering, and regression. This study will focus on how it can improve its classification.

Emotion tendency includes two categories, positive and negative; therefore, emotion recognition can be regarded as a two-class classification problem, namely, the recognition of positive and negative meanings. Based on this, the KNN algorithm is used to identify the emotion of the text. The algorithm is a simple, effective, nonparametric method; in essence it is a supervised predictive algorithm whose decision rules are derived directly from the data samples [28].

KNN’s classification idea is very simple: suppose that for a given sample to be classified s = (s_1, s_2, ..., s_n) (where n is the dimension of the sample), the computer uses the training data sets of known categories, finds the k nearest neighbor samples most similar to s in these training data sets, and then determines the category of s by a classification vote.

Suppose the training set is T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}. Here, x_i is a training sample, y_i ∈ {c_1, c_2, ..., c_m} is the category of the training sample, and the test set is S = {s_1, s_2, ..., s_M}.

The traditional KNN method mainly finds, for each sample s to be tested, the k training samples with the highest similarity to s from the training sample set T of known categories, denoted kNN(s).

The similarity of each test sample s and sample x_i in the training sample set is calculated by the cosine measure:

sim(s, x_i) = (Σ_t w_t(s) · w_t(x_i)) / (sqrt(Σ_t w_t(s)²) · sqrt(Σ_t w_t(x_i)²))

Here, w_t(s) and w_t(x_i) represent the weight of the t-th feature item in samples s and x_i, respectively, and the weight of s belonging to class c_j is calculated as follows:

W(s, c_j) = Σ_{x_i ∈ kNN(s)} f(s, x_i) · δ(y_i, c_j)

Among them, f(s, x_i) is the voting weight function, generally taking 1 or sim(s, x_i), and the indicator function δ is as follows: δ(y_i, c_j) = 1 if y_i = c_j, and δ(y_i, c_j) = 0 otherwise.

The sample s is classified into the category with the largest class weight W(s, c_j). This process is repeated until all samples to be tested are classified. Finally, each assigned category is compared with its real class label to measure the performance of this classification method.
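The classification and voting process described above can be sketched as follows. This is a minimal illustrative implementation with assumed names (`cosine_sim`, `knn_classify`), using cosine similarity both as the neighbor measure and as the voting weight:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature-weight vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_classify(s, X_train, y_train, k=3):
    # Find the k training samples most similar to s, then vote:
    # each neighbor contributes its similarity as the voting weight.
    sims = [cosine_sim(s, x) for x in X_train]
    nearest = np.argsort(sims)[-k:]
    votes = {}
    for i in nearest:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + sims[i]
    # Assign s to the class with the largest accumulated weight.
    return max(votes, key=votes.get)
```

Under this sketch, a test vector close to the positive training vectors is voted into the positive class, mirroring the class-weight rule above.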

The original KNN algorithm’s classification principle is simple and straightforward to use, but there is no systematic consideration of the impact of sample data set imbalance on classification, that is, each selected nearest neighbor sample’s representative degree to its class is treated the same. To solve this problem, an improved KNN algorithm is proposed in this study. This method primarily improves the algorithm in two areas: sample set imbalance between classes and sample set imbalance within classes.

3.2. Text Emotion Information Classification

The classification of text emotion information is primarily used to identify opinions, attitudes, preferences, and other related information expressed in natural language. Emotion classification research focuses on unstructured text information such as forums and blogs that come in a variety of formats and are written in a colloquial style. Emotion classification, unlike traditional topic-based text classification, focuses on subjective and objective classification, as well as emotion polarity classification. It is mainly divided into word-level emotion information classification, text-level emotion information classification, and sentence-level emotion information classification, depending on the granularity of the research.

Text-level sentiment information classification includes subjective and objective text classification and text sentiment polarity classification. The process is shown in Figure 1.

The mainstream method for text-level emotion polarity classification is based on statistical learning. Existing related research shows that statistical classification methods are currently also the most effective way to solve emotion polarity classification. In existing research, scholars have applied various statistical classification methods and linguistic features such as words and parts of speech to emotion polarity classification and achieved good results [29]. In order to further improve classification accuracy, this study considers two aspects: optimising the classification model and selecting effective features.

This study proposes a combined classifier method based on reliability analysis for optimising classification models. This study employs a method based on category attribute analysis for feature selection. Experiments show that the method used in this study can effectively improve classification accuracy without slowing down the classification process.

In the process of emotion polarity classification, analysis of the emotion information shows that not all samples need to be discriminated by the combined classifier: for samples that are easy to distinguish, there is no obvious difference among the classification results of the various single classifiers.

From Figure 2, we can see that the hollow circles and squares represent sample points of different classes, and the circles and squares far from the classification hyperplane represent samples with obvious classification characteristics.

Therefore, if we can find a way to determine which samples belong to easily distinguishable samples and which samples belong to indistinguishable samples, then we can treat the two types of samples with different methods and improve the accuracy and speed of classification. In this section, a classifier fusion strategy based on reliability analysis is proposed to solve the polarity classification problem of these samples.

Assume that a credibility function f(s) is determined according to the classification principle of the main classifier, and a discrimination threshold λ is set. If f(s) ≥ λ when the main classifier discriminates sample s, the discrimination result of the main classifier is considered to have high credibility, and the result can be used as the final classification result of the sample.

Otherwise, the classification of the sample to be marked is determined by the main and auxiliary classifiers voting together. The discrimination process is shown in Figure 3.

We define the credibility function according to the distances between the test sample and the class centers in the main classifier, as shown in the following formula:

f(s) = d_2(s) − d_1(s)

where d_1(s) is the distance between the nearest class center and the test sample, and d_2(s) is the distance between the next nearest class center and the test sample.

When the cosine similarity is used to calculate the distance between the sample and the class center, if f(s) ≥ λ, the test sample is relatively close to the nearest class center and relatively far from the next nearest class center, and such a sample is considered relatively easy to distinguish.

If f(s) < λ, the distances between the test sample and the nearest and next nearest class centers may both be relatively close, and such a sample is considered not easy to distinguish.
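Assuming the credibility function is the gap between the distances to the two nearest class centers, the discrimination process above can be sketched as follows. All function names, and the exact form of the credibility function, are illustrative assumptions rather than the paper's verbatim definitions:

```python
import numpy as np

def cosine_distance(a, b):
    # Distance derived from cosine similarity (smaller = closer).
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def credibility(sample, centers):
    # d1: distance to the nearest class center; d2: to the next nearest.
    # A large gap means the main classifier's decision is trustworthy.
    dists = sorted(cosine_distance(sample, c) for c in centers)
    return dists[1] - dists[0]

def classify_with_fallback(sample, centers, labels, lam, fallback):
    # If credibility exceeds the threshold lam, trust the class-center
    # (main) classifier; otherwise defer to the joint-voting fallback.
    dists = [cosine_distance(sample, c) for c in centers]
    if credibility(sample, centers) >= lam:
        return labels[int(np.argmin(dists))]
    return fallback(sample)
```

A sample near one class center is classified directly by the main classifier; a sample equidistant from two centers falls through to the auxiliary vote.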

In classification problems, apart from the classifier used, feature selection also strongly influences the classification results. The feature selection algorithm adopted in this section is based on category attribute analysis [30], which has achieved good results in traditional text classification. The formal description of the feature selection algorithm based on category attribute analysis is as follows:

Let us assume that d represents a text and D = {d_1, d_2, ..., d_N} represents a text set with N texts and category marks C = {c_1, c_2, ..., c_m}. The following formulas represent the intraclass distribution law and interclass distribution law of feature t in text class c_i.

Assuming that the word frequency of feature t in c_i is tf_i(t) and the document frequency is df_i(t), the distribution within the class is as follows:

P_i(t) = tf_i(t) / TF(t)

Here, TF(t) = Σ_i tf_i(t) is the number of times t appears in all categories, and DF(t) = Σ_i df_i(t) is the number of all texts in which t appears.

The interclass distribution is recorded as follows:

P(t) = (P_1(t), P_2(t), ..., P_m(t))

Generally, when the components of the interclass distribution vector differ greatly, the feature has a strong ability to distinguish between classes. When the components of the interclass distribution are similar or identical, the feature’s ability to distinguish between classes will be very low, so the feature’s resolution can be expressed as the variance of the components:

R(t) = Σ_{i=1}^{m} (P_i(t) − P̄(t))²

In the above formula, P̄(t) = (1/m) Σ_{i=1}^{m} P_i(t).

From the above formula, we can find that the feature selection method based on category attribute analysis pays attention to the distribution of features within and between classes and quantitatively describes this distribution through variance mechanism. The features retained by the algorithm usually have obvious category attributes, so these features have strong representation ability to the corresponding categories, whereas those filtered out features have weak or zero representation ability.
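As an illustrative sketch, the intraclass distribution and its variance-based resolution score described above can be computed from the per-class raw frequencies of a feature. The function name and interface are assumptions for illustration:

```python
import numpy as np

def feature_resolution(tf_per_class):
    # tf_per_class[i] = raw frequency of the feature in class i.
    # Intra-class distribution: the share of the feature's total
    # occurrences falling in each class; the resolution score is the
    # variance of that distribution, so a feature concentrated in one
    # class scores high and an evenly spread feature scores near zero.
    tf = np.asarray(tf_per_class, dtype=float)
    p = tf / tf.sum()
    return float(np.var(p))
```

A feature occurring only in one class is retained (high variance), while a feature spread evenly across classes is filtered out (variance near zero), matching the selection behaviour described above.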

3.3. An Analysis of Chinese Literary Emotion Based on UKNNC

Traditionally, the oversampling algorithm and the classification algorithm are separate: new samples are first generated and added to the training set, the classifier is then trained, and the test results reflect the quality of the samples only after training. However, this framework does not fully exploit the oversampling algorithm’s advantages in generating samples, and the effect of the generated samples on the classifier cannot be guaranteed.

This study starts directly from the goal, draws on the idea of adversarial architectures, and proposes the concept of an expected classifier, that is, a new classifier obtained by updating the current classifier with synthesised samples. It is used to accurately measure the quality of synthesised samples and judge whether the samples can truly improve classifier performance.

Assuming that the classifier used is a single-layer neural network with weights w, the influence of a new sample on the performance of the classifier comes from its update to w. That is, if the gradient ∇_w L(x_new) calculated from the new sample points in the same direction as the gradient calculated from the original samples, the update will also reduce the loss function value of the original samples under the new network parameters. The update of the parameters on the new sample is shown in the following formula:

w′ = w − η · ∇_w L(x_new)

where η is the learning rate. Let the classification cost function be the cross entropy, which is defined in the following formula:

L = −Σ_i [y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i)]

where y_i is the true label and ŷ_i is the classifier’s predicted probability.
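Assuming a single-layer logistic classifier with binary cross-entropy, the parameter update on a synthesised sample and the same-direction gradient test described above can be sketched as follows. Names such as `expected_update` and `helps_original` are illustrative, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_cross_entropy(w, x, y):
    # Gradient of the binary cross-entropy loss of a single-layer
    # (logistic) classifier with respect to the weights w.
    return (sigmoid(np.dot(w, x)) - y) * x

def expected_update(w, x_new, y_new, lr=0.1):
    # One gradient step on a synthesised sample; the resulting weights
    # define the "expected classifier" discussed above.
    return w - lr * grad_cross_entropy(w, x_new, y_new)

def helps_original(w, x_new, y_new, x_orig, y_orig):
    # A synthesised sample is useful when its gradient points in the
    # same direction as the gradient of the original sample (positive
    # dot product), so the update also lowers the original loss.
    g_new = grad_cross_entropy(w, x_new, y_new)
    g_orig = grad_cross_entropy(w, x_orig, y_orig)
    return float(np.dot(g_new, g_orig)) > 0.0
```

A synthesised sample whose gradient agrees with that of the original data moves the expected classifier in a direction that benefits both.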

The specific steps of the improved KNN classification are as follows:

Input: training sample set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}.

Among them, y_i ∈ {c_1, c_2, ..., c_m} is the sample category, and s is the sample to be tested.

Output: the class to which the sample to be tested belongs.
(1) Standardize the sample features.
(2) Calculate the class representation and sample representation of each class.
(3) Calculate the similarity of each test sample s and sample x_i in the training sample set T.
(4) Calculate the weight of the class to which the selected nearest neighbor samples belong, and assign s to the class with the largest weight.
(5) Repeat the above steps until all samples to be tested are classified.
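The paper does not spell out the exact class-representation weights, so the following sketch only illustrates the idea behind steps (3) and (4): each neighbor's vote is weighted so that minority-class neighbors are not drowned out by majority classes. The inverse-class-frequency weighting used here is an assumption for illustration, not the authors' exact scheme:

```python
import numpy as np
from collections import Counter

def weighted_knn(s, X_train, y_train, k=3):
    # Illustrative imbalance-aware variant: each neighbor's vote is its
    # cosine similarity scaled by the inverse frequency of its class,
    # so minority-class neighbors carry more individual weight.
    counts = Counter(y_train)
    sims = [float(np.dot(s, x) / (np.linalg.norm(s) * np.linalg.norm(x)))
            for x in X_train]
    nearest = np.argsort(sims)[-k:]
    votes = {}
    for i in nearest:
        c = y_train[i]
        votes[c] = votes.get(c, 0.0) + sims[i] / counts[c]
    return max(votes, key=votes.get)
```

With an imbalanced training set, a plain majority vote among the nearest neighbors tends to pick the majority class, while this weighted vote can recover the minority class.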

The UKNNC algorithm uses a variational autoencoder as the model in the oversampling stage. Because an incremental classifier is used during training, the direction in which a new sample modifies the classifier can be calculated. When the expected classification result of the classifier is poor, it means that the new sample modifies the classifier in the wrong direction, so the parameters of the generator are modified. The flow of the UKNNC algorithm is shown in Figure 4.

4. Analysis and Discussion

4.1. Improved KNN Experimental Analysis

In the experiment, the traditional KNN method and the improved KNN method are tested with different values of the nearest neighbor parameter k, and the results are shown in Figure 5.

When the traditional KNN method is used to classify the experimental data set, the accuracy is highest at one value of the nearest neighbor parameter k, while a different value of k is best for the improved KNN method; a single value of k is therefore selected uniformly after comprehensive consideration.

With the uniformly selected k, the accuracy, recall, and comprehensive classification rate obtained by the improved classification method on the above data sets are shown in Figure 6.

It can be seen from Figure 6 that the improved KNN method proposed in this study achieves higher accuracy than the existing KNN method on data sets 3, 4, 5, 6, 7, and 8. In particular, on data sets 3, 4, and 8, the accuracy is improved from the original 0 to a measurable level; only on data set 9 does the accuracy remain unchanged, mainly because this class contains few samples and its similarity to category 8 is too great.

To summarise, the improved KNN method proposed in this study adopts the idea of weighted calculation of nearest neighbor samples, aiming to solve the problem of equal weight treatment of nearest neighbor samples in the existing KNN method. As a result, each nearest neighbor sample has a specific weight, reducing the impact of unbalanced data sets on classification results and improving classification accuracy.

4.2. Analysis of Polarity Classification Results

In the experiment of Chinese language and literature emotion corpus, documents are represented by vector space model, feature selection algorithm based on category attribute analysis is used, term frequency-inverse document frequency (TF-IDF) value is used as feature item weight, and Chinese word segmentation system uses ICTCLAS3.0 of Chinese Academy of Sciences. The value of k in KNN method is 100.
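The TF-IDF feature weighting mentioned above can be sketched as follows; this is a minimal raw-frequency variant for illustration, not necessarily the exact formula used in the experiments:

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per
    # document, using raw term frequency times the log inverse
    # document frequency log(N / df(t)).
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights
```

A term that appears in every document gets weight 0 (it carries no discriminative information), while a term unique to one document gets the highest weight for its frequency.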

Figure 7 shows the classification comparison results of the reliability analysis method at the chosen threshold, the three single classifiers, and the voting method under 50% cross-validation. For the credibility analysis method, 44.83% of the texts required the auxiliary classifiers to participate in the voting decision.

It can be seen from Figure 7 that among the three single classifiers, support vector machine (SVM) has the best classification effect, followed by the classification results of class center and KNN. The F value of reliability analysis is 1.7% higher than that of SVM, which is basically the same as that of voting method.

A chi-square significance test shows that the performance improvement of the method proposed in this study over the other methods is statistically significant, which indicates that the classification accuracy of the classification fusion strategy based on reliability analysis can exceed that of the SVM model when a reasonable threshold is set.

In addition, in order to verify the classification speed of reliability analysis method, the classification speed was tested on a computer with CPU frequency of 2.6 GHz and memory capacity of 3.8 GB. The average test result is shown in Figure 8.

Referring to the classification speeds of each single classifier in Figure 8, we can see that the class center method is the fastest and the SVM method is the slowest, while the speed of the credibility analysis method is obviously higher than that of SVM, lying between the class center method and SVM.

Compared with the voting method, the speed of the credibility analysis method is obviously improved, which shows that the combined classification method based on credibility analysis has an obvious improvement effect compared with SVM and voting method in classification speed under the condition of setting a reasonable threshold.

In the evaluation, two groups of evaluation results of credibility analysis method and single classifier SVM are presented, respectively. The specific evaluation results are shown in Figure 9:

From the evaluation results, it can be seen that the reliability analysis method scores lower than SVM on R-Accuracy and Acc_1000. The reason is that the evaluation did not provide a training set, while the test set covers a wide range of fields, which leads to a great difference in feature distribution between the test set and the training set.

The threshold obtained from the development set therefore cannot reflect the sample distribution characteristics of the test set well, which may cause the reliability analysis method to fall below SVM on some indicators.

To sum up, when the reasonable threshold is determined, the credibility analysis method can get better results than SVM method in classification speed and accuracy. At the same time, because the combination classification method based on credibility analysis can ensure faster classification speed and better classification accuracy, it is more suitable for emotion classification of large-scale online texts.

4.3. Validity Verification of Unbalanced Classification Algorithm

In this section, in order to verify the rationality of the UKNNC algorithm, the classifier uses a single-layer neural network. The network architecture is consistent with that of the UKNNC algorithm, and it is consistent with the UKNNC algorithm in terms of training times and learning rate.

The VAEOS algorithm is used to generate minority-class samples that are added to the training set to train the LR classifier directly, so that the joint oversampling algorithm can be compared with the independent oversampling algorithm. To avoid randomness, the averages over 10-fold cross-validation are used in this section for this comparison.

Because it is necessary to make assumptions about the form of data distribution in an oversampling algorithm based on data distribution, this type of algorithm has higher requirements for the form of data distribution, as shown in Figure 10.

Ideally, the oversampling algorithm based on data distribution first reconstructs the probability distribution function of data and then conducts more intensive sampling according to the distribution function. The credibility of sampling results depends on the credibility of modeling. The improvement of classifier performance by the oversampling algorithm based on data distribution depends on whether the current sample meets the hypothesis. Therefore, in UKNNC, when there is no need to set the hypothesis of the true distribution form of samples, the modeling reliability of the algorithm for minority classes will increase, so the generated samples have higher reliability, which shows that the classification performance of minority classes is better.

Figure 11 compares the classification performance of CGMOS with a Naive Bayes classifier and with a logistic regression classifier. The experimental results show that, because data characteristics differ, each method performs better than the other frameworks on the data sets that fit its classifier.

However, CGMOS synthesizes samples for a Bayesian-framework classifier, so its classification results on multiple data sets are similar to those of the VAEOS algorithm under Naive Bayes. Under the LR classifier, all results of the UKNNC algorithm are better than those of the VAEOS algorithm, which proves the effectiveness of the UKNNC algorithm and of the joint oversampling algorithm.

Because the data sets in this study are all numerical, the average classification value of Naive Bayes is slightly lower than that of LR. To obtain a better classification effect, the classifier should first be selected according to the data characteristics; second, the oversampling algorithm has a high probability of improving the classification effect. However, a joint oversampling algorithm based on the characteristics of the classifier has a better promotion effect on the classification algorithm within its own framework.

After comparison, all classification indexes are superior to the emotion classification results of other emotion dictionaries, which not only shows the effectiveness of introducing the expression dictionary into the analysis of Chinese literary emotion but also verifies the application value of the UKNNC method proposed in this study.

5. Conclusion

Chinese development has been shaped by thousands of years of accumulation, resulting in unique aesthetic characteristics that are extremely appealing. We should pay attention to the beauty of temperament when writing poetry; music is frequently used in the creation of ancient poems. The beauty of images should be considered when describing lyricism: different images are used to convey specific emotions, and the same images can convey joy and sorrow on multiple levels. This study proposes the use of reliability analysis to classify emotion polarity. Experiments show that, when an appropriate credibility threshold is selected, this voting method based on credibility analysis can outperform SVM in both classification accuracy and classification speed. Simultaneously, a UKNNC-based emotion analysis model of Chinese language literature is proposed, with the emotions of each paragraph in the text obtained through the two-layer model. The experimental results show that, compared with the traditional machine learning analysis method, the emotion analysis method proposed in this study significantly improves the accuracy of Chinese literary emotion recognition.

Currently, the expected classifier is computed using synthetic samples and stochastic gradient descent. Much analysis and design work remains on how to use a more accurate gradient descent method and how to avoid prematurely falling into a local optimum that halts training.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.