Abstract

With the increasing application of the perceptron genetic algorithm neural network to Chinese-English bidirectional translation, many translation problems remain to be solved. To address the translation problem of Chinese-English parallel corpora, this paper designs a multilayer perceptron method, a genetic word alignment model (GA), a language model, and a neural network method (comprising a translation model and a bilingual pretraining model) and combines them into the GA-MLP-NN model, which measures the parallelism of Chinese and English sentences from different emphases. The results show that the GA-MLP-NN model performs well at filtering high-quality parallel corpora. The final experimental results show that, compared with any single system, the improved multisystem fusion method based on weight multiplication achieves better results on the test set. In the last five groups of evaluation results, the system submitted in this paper ranks second or first on multiple datasets, which provides a useful reference for research on corpus filtering.

1. Introduction

With the development and deep application of information technologies such as the mobile Internet and mobile communication, the number of online users on major social media and e-commerce platforms has increased sharply. As one of the important channels of information interaction among enterprises, platforms, and consumers, online comments have emerged in huge numbers. An online comment is a user's remark on a specific topic posted through a network platform; it records the user's feelings and evaluation based on their own use or experience of a product or service [1]. Because online comments are rich in consumers' feelings and expectations after the product/service experience, covering preferences, needs, and evaluations of quality, design, price, and service, their commercial value has become increasingly prominent and has attracted great attention from enterprises, platforms, and consumers. However, owing to cross-platform interaction and information diffusion effects, the volume of comments on some topics, especially hot issues, is growing explosively. Faced with such a large amount of comment information, it is often difficult to pick out the valuable comments and correctly judge the true quality of products or services. In addition, because users are anonymous, their motivations for evaluating products or services differ, and interaction on network platforms is contactless, the quality and authenticity of comments are uneven. Given the overload and uneven authenticity of online comments, the cost of information screening rises greatly while its efficiency falls, and manually screening useful information from online comments is impractical [2]. Therefore, how to scientifically and effectively mine the useful information in online comments has become an important research topic, one that matters greatly for the decision-making efficiency and effectiveness of enterprises, platforms, and consumers.

Most existing studies on comment usefulness take the number of comments, text depth, and text emotion as the main influencing factors and rarely consider the impact of the professionalism of the comment content. The innovative contributions of this paper are as follows: (1) the multilayer perceptron method and statistical methods are used to screen and filter the corpus; (2) the GA-MLP-NN network is used as a classifier for comment usefulness recognition, and this paper analyzes the influence of the proposed features on comment usefulness and obtains the best feature combination; and (3) compared with any single system, the improved multisystem fusion method based on weight multiplication achieves better results on the test set, providing a useful reference for the translation of Chinese-English parallel corpora.

Based on this, this paper uses the knowledge adoption model and a multilayer perceptron neural network to identify the usefulness of online comment information. The paper is divided into five parts. The first part gives the research background, and the second part is a literature review analyzing existing results on the problem. The third part introduces the multilayer perceptron genetic algorithm neural network and related systems. The fourth part presents the experimental analysis, showing how the multilayer perceptron genetic algorithm neural network handles the translation of Chinese-English parallel corpora and comparing it with traditional methods. The fifth part concludes the article.

2. Literature Review

Theoretically, the usefulness of comment information depends on whether one thinks that the comment information is valuable and helpful to varying degrees [3]. Among existing studies, Yuanyuan et al. [4], Mudambi and Schuff [5], and Baek et al. [6] studied the factors affecting the usefulness of online comments from the perspective of text characteristics and found that factors such as comment emotion, comment depth, comment grade, and reviewer reputation have a significant impact on comment usefulness. Chen and Xie constructed a normative model to judge the usefulness of comments and help consumers determine the best matching products [7]. Liu et al. and Zhang et al. constructed a theoretical model of two-way analysis based on the information acceptance model (IAM) to effectively identify the key factors affecting the usefulness of online comments on e-commerce platforms [8, 9]. Rui and Jian studied, on the basis of attribution theory, how the inconsistency between a comment's star rating and the product's average star rating affects comment usefulness [10]. Liu et al. analyzed the impact of the number of comments, reviewer professionalism, reviewer reputation, and other factors on the usefulness of community online comments. In selecting the influencing factors of comment usefulness, most existing studies consider internal factors such as the star rating, number of comments, comment depth, and comment emotion, and external factors such as the identity, professionalism, and reputation of the comment publisher, but few analyze influencing factors from the professionalism of the comment content [11–18]. Because the comments in each field contain field-specific domain words, there are serious domain barriers, and at present there is no general method to effectively extract comment-quality factors from comment text. Existing studies show that, compared with feature extraction that ignores domain words, feature extraction based on a domain dictionary can enhance the expressive power of text features and improve classification performance [19]. Therefore, this paper proposes a feature extraction method based on a domain dictionary to construct a measurement index of comment text quality, so as to improve the recognition of comment usefulness.

Among existing research, many theories can be used to explain the usefulness of online comments, one of which is the knowledge adoption model (KAM) [20]. The model holds that the information receiver's perception of information usefulness directly determines knowledge adoption and that, for a given piece of information, the determinants of perceived usefulness are the quality of the information itself and the credibility of the information source. Many scholars have studied the usefulness of online comments on the basis of the KAM; for example, Cheung et al. and Huang et al. used it to study the usefulness of online comments [21, 22], and Erkan and Evans analyzed the usefulness of online comments and its effect on purchase intention on the same basis [23]. It is not difficult to see that the KAM explains the usefulness of online comments well. In order to analyze the usefulness of comment information more comprehensively, this paper uses the KAM to construct the features of the comment usefulness classification model from two aspects: the quality of the comment text and the credibility of the comment source.

At present, many researchers formulate the judgment of comment usefulness as a binary text classification problem. This kind of problem can be solved with machine learning based text classification methods, mainly including the support vector machine (SVM), decision tree, neural network, and naive Bayes [24, 25].

3. Construction of the Multilayer Perceptron Genetic Algorithm Neural Network System (GA-MLP-NN)

Although a single perceptron cannot solve the XOR problem, combining multiple perceptrons can realize the segmentation of a complex space. Two layers of perceptrons are combined according to a certain structure and set of coefficients: the first layer realizes two linear classifiers that divide the feature space, and adding one more perceptron on top of their outputs realizes the XOR operation. In other words, the network is composed of multiple perceptrons. Multilayer neural networks are likewise trained by the gradient descent algorithm, which searches for an extreme point of the loss function so as to minimize its value. The so-called "learning" is to improve the model parameters over a large number of training steps in order to minimize the loss.
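To make this concrete, the following minimal sketch (ours, not the paper's implementation) hard-codes one possible set of weights under which two hidden perceptrons and one output perceptron compute XOR:

```python
import numpy as np

def step(x):
    # Heaviside step activation used by the classical perceptron.
    return (x > 0).astype(float)

def xor_mlp(x1, x2):
    """XOR built from three perceptrons: two hidden linear
    classifiers (OR and NAND) and one output perceptron (AND)."""
    h1 = step(x1 + x2 - 0.5)    # OR:   fires unless both inputs are 0
    h2 = step(-x1 - x2 + 1.5)   # NAND: fires unless both inputs are 1
    return step(h1 + h2 - 1.5)  # AND of the two hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_mlp(np.array(a), np.array(b))))
# 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```

In practice the weights are not set by hand but learned by gradient descent, as described above.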

The methods used in the system submitted in this paper fall into three categories: the multilayer perceptron method, statistical methods, and neural network methods. The overall architecture of the system is shown in Figure 1. The multilayer perceptron method mainly designs a series of rules to filter out corpus whose quality obviously does not meet the requirements. The statistical methods include the Zipporah system, the genetic word alignment model, and the language model, which achieve filtering through statistical features computed on a large amount of clean corpus. The neural network methods include a translation model and a bilingual pretraining model: models with strong generalization ability are trained on the clean corpus and then used to filter the translation corpus. Finally, according to the results of the different methods, the best-performing ones are weighted and fused to obtain the final clean corpus.

3.1. Multilayer Perceptron

Filtering methods based on the sentence length ratio, maximum sentence length, and sentence uniqueness are used to filter the corpus. Following previous work, this paper formulates four rules:

(1) Length filtering rule: sentence pairs longer than 80 words at either the source or the target end score 0; otherwise, they score 1.
(2) Length ratio restriction rule: if the length ratio between the source and target sentences exceeds 1.7, the pair scores 0; otherwise, it scores 1.
(3) Language identification rule: langid (https://github.com/saffsd/langid.py) is used to identify the source and target languages; if either language is incorrect, the pair scores 0; otherwise, it scores 1.
(4) Deduplication rule: the first occurrence of a repeated sentence scores 1; every later occurrence scores 0.

With the help of the four rules above, a four-dimensional feature vector can be obtained for a given sentence pair, where each dimension takes the value 0 or 1.
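A minimal sketch of these rules follows, assuming whitespace-tokenized input (a real system would first segment the Chinese side with a tool such as Jieba); the function and parameter names are ours:

```python
import langid  # https://github.com/saffsd/langid.py

def rule_features(src, tgt, seen, max_len=80, max_ratio=1.7,
                  src_lang="zh", tgt_lang="en"):
    """Return the 4-dimensional 0/1 feature vector described above."""
    src_toks, tgt_toks = src.split(), tgt.split()
    # Rule 1: length filter on both sides.
    f_len = int(len(src_toks) <= max_len and len(tgt_toks) <= max_len)
    # Rule 2: length-ratio filter.
    ratio = max(len(src_toks), 1) / max(len(tgt_toks), 1)
    f_ratio = int(1 / max_ratio <= ratio <= max_ratio)
    # Rule 3: language identification on both sides.
    f_lang = int(langid.classify(src)[0] == src_lang and
                 langid.classify(tgt)[0] == tgt_lang)
    # Rule 4: deduplication -- only the first occurrence scores 1.
    key = (src, tgt)
    f_dup = int(key not in seen)
    seen.add(key)
    return [f_len, f_ratio, f_lang, f_dup]
```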

3.2. Statistical Method
3.2.1. Zipporah System

The Zipporah system, which forms part of the fusion system and achieved good results, is a fast and scalable system that can select "good data" of any size from a large pool of translation data for training a neural machine translation model. Its principle is as follows: first, each sentence pair is mapped into a feature space containing two features, an adequacy score and a fluency score; then, logistic regression performs binary classification into "good data" and "bad data"; finally, equation (1) normalizes the classifier output to obtain the parallelism score $s_{\mathrm{zip}}$, where $s_{\mathrm{zip}}$ is the score assigned by the Zipporah system.
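The following sketch illustrates only the classification step with scikit-learn; the toy adequacy/fluency features and labels are placeholders, since Zipporah's actual features come from its dictionary and language-model components:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder (adequacy, fluency) features and good/bad labels.
X_train = np.array([[0.9, 0.8], [0.8, 0.9], [0.1, 0.3], [0.2, 0.1]])
y_train = np.array([1, 1, 0, 0])  # 1 = "good data", 0 = "bad data"

clf = LogisticRegression().fit(X_train, y_train)
# The positive-class probability serves as a normalized
# parallelism score in [0, 1] for each candidate pair.
score = clf.predict_proba(np.array([[0.7, 0.6]]))[:, 1]
print(float(score))
```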

3.2.2. Genetic Word Alignment Model (GA)

Nonparallel sentence pairs have few word alignments, so this paper considers using word alignment for corpus filtering. First, the fast_align (https://github.com/clab/fast_align) word alignment tool is trained on the non-translated Chinese-English parallel corpus provided by the 16th China Conference on Machine Translation (CCMT 2020), and the trained model then predicts alignments for the translation corpus, from which the word alignment score of each sentence pair can be obtained directly. In fast_align, the word alignment score is the logarithmic sum of the word alignment probabilities, so the longer the sentence, the smaller the score, which means the raw score prefers short sentences. To reduce the impact of sentence length on the word alignment score, equation (2) is used to calculate the parallelism score:

$$\mathrm{score}(x, y) = \frac{a(x, y)}{l_x + l_y},$$

where $a(x, y)$ is the word alignment score of the sentence pair and $l_x$ and $l_y$ are the lengths of the source and target sentences, respectively.

After the word alignment scores of the sentence pairs are processed according to equation (2), the pairs are sorted from high to low. Statistics show that about 4 million sentence pairs, containing about 100 million words, have a score greater than or equal to −4.5. This paper judges these sentence pairs to be of good quality, so their normalized scores should be high, and equation (3) is designed accordingly to normalize the scores:
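A sketch of the scoring and thresholding described above; dividing the log score by the total sentence length is our reading of equation (2), so treat the exact form as illustrative:

```python
def normalized_alignment_score(log_score, src_len, tgt_len):
    # fast_align returns a sum of log word-alignment probabilities,
    # which shrinks with sentence length; dividing by the total
    # length removes that bias.
    return log_score / (src_len + tgt_len)

def keep_pair(log_score, src_len, tgt_len, threshold=-4.5):
    # Pairs scoring >= -4.5 after normalization (about 4M pairs /
    # 100M words in the paper's statistics) are treated as good.
    return normalized_alignment_score(log_score, src_len, tgt_len) >= threshold
```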

3.2.3. Language Model

Because a language model can filter out ungrammatical data, this paper considers using a language model to filter the corpus. The non-translated corpus is selected to train the language model, which is then used to calculate the perplexity (PPL) score of the dataset to be filtered.

Specifically, the SRILM (https://github.com/BitSpeech/SRILM) tool is used to train 5-gram language models for Chinese and English on the non-translated bilingual corpus, and these models are used to calculate the perplexity scores of the Chinese and English sentences in the bilingual corpus to be filtered. This paper uses two scoring strategies: a sentence-level perplexity score and a word-level perplexity score.
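The paper trains its language models with SRILM from the command line; the sketch below uses kenlm, an n-gram toolkit with a Python API, as a stand-in, and the model path is hypothetical:

```python
import kenlm  # stand-in for an SRILM-trained 5-gram model

lm = kenlm.Model("zh.5gram.arpa")  # hypothetical model path

def sentence_level_score(sentence):
    # Total log10 probability of the tokenized sentence.
    return lm.score(sentence)

def word_level_ppl(sentence):
    # Per-word perplexity, which normalizes out sentence length.
    return lm.perplexity(sentence)
```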

To make the scores from the two strategies comparable, this paper standardizes the perplexity scores with a series of piecewise functions designed from experience. For the sentence-level perplexity score of the Chinese corpus to be filtered, the normalized piecewise function designed in this paper is given in equation (4):

For the sentence-level perplexity score of the English corpus to be filtered, the designed normalized piecewise function is given in equation (5):

In addition, this paper considers the word-level perplexity score: it calculates the average per-word perplexity of each sentence in the Chinese-English dataset and the average per-word perplexity over the whole dataset and designs two piecewise functions to normalize the difference between them. To keep the overall average stable, sentences with a perplexity of more than 10000 are ignored when calculating the average per-word perplexity of the whole dataset.

For the word-level perplexity score of the Chinese corpus to be filtered, the designed normalized piecewise function is given in equation (6):

Finally, each parallel sentence pair obtains four feature scores from the language model.
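Since the bodies of equations (4)-(6) are hand-tuned piecewise functions whose breakpoints the text does not spell out, the sketch below is only a hypothetical normalizer of the same shape; the 10000 cutoff is the one mentioned above, while `low` and the linear decay are placeholders:

```python
def normalize_ppl(ppl, low=50.0, high=10000.0):
    """Hypothetical piecewise normalizer in the spirit of
    equations (4)-(6)."""
    if ppl <= low:
        return 1.0  # fluent sentence: full score
    if ppl >= high:
        return 0.0  # cf. the 10000 perplexity cutoff above
    # Linear decay between the two breakpoints.
    return (high - ppl) / (high - low)
```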

3.3. GA-MLP-NN Neural Network

A neural network is a highly parallel information processing system with very strong adaptive learning ability; it does not depend on a mathematical model of the research object and is robust to changes in system parameters and to external disturbances of the controlled object. GA-MLP-NN is a deep neural network that overcomes the perceptron's inability to recognize linearly inseparable data and achieves better expressiveness. The model is shown in Figure 2.

3.3.1. Translation Model

The translation model is based on the following assumption: if sentences $x$ and $y$ are a parallel sentence pair, then the semantics of $x$ and $y$ are similar, and when $x$ is translated into $y'$, the semantics of $y$ and $y'$ remain similar. To realize this assumption, we first train an English-Chinese translation model, then use it to translate the English sentences into corresponding translations, and finally calculate the similarity between each translation and its reference. For the similarity calculation, this paper uses two indicators, the word-based editing distance and the cosine similarity based on pretrained word vectors, which together form a two-dimensional similarity feature.

3.3.2. Model Design

According to the above, calculating the similarity between the machine translation and the reference translation first requires producing the machine translation, so a translation model is needed. This paper adopts THUMT (https://github.com/THUNLP-MT/THUMT.git), an open-source neural machine translation toolkit from Tsinghua University; the toolkit has few dependencies and is simple to train, making it suitable for quickly training a neural machine translation system.

The training set comes from the parallel corpus provided by the CCMT 2020 Chinese-English translation task, which is word-segmented and lowercased; sentence pairs longer than 150 words are filtered out, leaving about 10 million pairs of training data. The development set is the one specified by the CCMT 2020 Chinese-English parallel corpus filtering task.

The main training parameters are left at their defaults, and training runs for about 20 rounds. The five checkpoints with the highest BLEU (bilingual evaluation understudy) scores on the development set are saved and then averaged into a final English→Chinese model. This model is then used to decode the English sentences of the translation-corpus sentence pairs to obtain the corresponding Chinese translations.

3.3.3. Word-Based Editing Distance

This index is essentially the editing distance, but the granularity for matching two sentences is words rather than single characters. Let $y$ and $y'$ be two word-segmented Chinese sentences, where $y'$ is the translation of the English source sentence $x$; the editing distance can then be calculated iteratively by equation (7):

$$d_{i,j} = \min\big(d_{i-1,j} + 1,\; d_{i,j-1} + 1,\; d_{i-1,j-1} + \mathbb{1}[y_i \neq y'_j]\big),$$

where $m$ and $n$ are the numbers of words in $y$ and $y'$, respectively, $d_{i,j}$ is the distance between the first $i$ words of $y$ and the first $j$ words of $y'$, and $y_i$ and $y'_j$ are the $i$th word of $y$ and the $j$th word of $y'$, respectively.

In the calculation, $y'$ is regarded as the translation and $y$ as the reference. In the translation data, the target side does not necessarily correspond to the source side. When $y$ does not correspond to $x$, the distance between $y$ and $y'$ is large, the sentence pair is considered less parallel, and it can be filtered out. Conversely, a small distance means the model translation is very similar to the actual reference and the parallelism between the source and target sentences is high. The parallelism score of the sentence pair derived from the editing distance is given in equation (8):
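Below is a direct implementation of the recurrence in equation (7), plus one plausible normalization for equation (8); the text gives the direction of the score but not its exact formula, so the normalization here is an assumption:

```python
def word_edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two segmented
    sentences (lists of tokens), per the recurrence in equation (7)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def parallelism_score(ref, hyp):
    # Hypothetical form of equation (8): map the distance into
    # [0, 1] so that identical sentences score 1.
    return 1.0 - word_edit_distance(ref, hyp) / max(len(ref), len(hyp), 1)
```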

3.3.4. Cosine Similarity

Since the translation model can translate the English source sentence $x$ into a corresponding Chinese translation $y'$, the semantic similarity between $y$ and $y'$ can be calculated with Chinese word vectors alone. Two separate sets of Chinese and English word vectors are not used because differences between the languages would skew the semantic space and make the similarity calculation inaccurate. The data used for training the Chinese word vectors is the same as the Chinese data in the machine translation training set, and the training tool is the gensim (https://radimrehurek.com/gensim/models/word2vec.html) toolkit. The training window is 5, and words with frequency lower than 5 are removed. Considering the computational cost of similarity calculation, the dimension is set to 128 and training runs for 10 rounds before the final model is saved.

For a sentence pair $(x, y)$ with $y'$ the Chinese translation of $x$, the parallelism score of the pair is obtained from the word vectors and the cosine function, as shown in equation (9):
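A sketch of the cosine scoring with gensim; the model path is hypothetical, and averaging word vectors into a sentence vector is our assumption, since the paper does not say how word vectors are pooled:

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical path; 128-dim vectors, window 5, min_count 5 as above.
w2v = Word2Vec.load("zh_word2vec.model")

def sent_vec(tokens):
    # Average the vectors of in-vocabulary words; zero vector if none.
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine_score(y, y_prime):
    # Equation (9): cosine similarity of the two Chinese sentence vectors.
    a, b = sent_vec(y), sent_vec(y_prime)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```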

3.3.5. Bilingual Pretraining Model

Considering that a pretraining model contains a great deal of semantic knowledge, this paper fine-tunes the sentence-BERT (sentence bidirectional encoder representations from transformers) model on the Chinese and English monolingual corpora given by CCMT 2020 to obtain Chinese and English sentence vectors. However, sentence vectors obtained in this way may be misaligned across languages in the vector space; that is, sentences with the same meaning in different languages may be mapped to different positions. Therefore, when evaluating the parallelism between sentences in two different languages, this paper uses a ratio of squared Mahalanobis distances as the measurement index.

The Mahalanobis distance represents the covariance-aware distance between data and is an effective way to measure the similarity of two unknown sample sets. Using the Mahalanobis distance is equivalent to transforming the data to eliminate the correlations and scale differences between feature dimensions, so that the Euclidean distance in the transformed space effectively measures the distance from a sample point to the distribution. Suppose the random vector $z$ has mean $\mu$ and covariance matrix $\Sigma$; the Mahalanobis distance from $z$ to the center is calculated as shown in equation (10):

$$D_M(z) = \sqrt{(z - \mu)^{\top} \Sigma^{-1} (z - \mu)},$$

where $\Sigma^{-1}$ is the inverse of the covariance matrix.

In this system, each sentence vector is first re-centered so that it has zero mean. For each re-centered Chinese-English sentence vector pair $(u, v)$, three cases in the transformed space are considered, as in formula (11): the Mahalanobis distance of the concatenated vector $[u; v]$ and the Mahalanobis distances of the individual vectors $u$ and $v$. From these three cases, the ratio of squared Mahalanobis distances given in formula (12) measures the parallelism between the sentences in the two languages.

If two sentences have the same meaning, the likelihood of the concatenated vector $[u; v]$ under the Mahalanobis measure should not be less than that of the isolated single-sentence vectors $u$ and $v$. The greater the ratio, the higher the parallelism between the two sentences.

Finally, the ratio is normalized, and equation (13) is used to measure the parallelism between the two sentences:

That is, the smaller the normalized value, the higher the parallelism between the two sentences.
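A numpy sketch of the Mahalanobis machinery described above; because the exact arrangement of equations (11)-(13) is not recoverable from the text, the ratio below is one plausible realization, and the precomputed inverse covariance matrices are assumed inputs:

```python
import numpy as np

def mahalanobis_sq(z, cov_inv):
    # Squared Mahalanobis distance of a re-centered (zero-mean) vector.
    return float(z @ cov_inv @ z)

def parallelism_ratio(u, v, cov_u_inv, cov_v_inv, cov_uv_inv):
    """Hypothetical form of equation (12): compare the concatenated
    vector [u; v] against the two sentences taken in isolation."""
    joint = np.concatenate([u, v])
    r = (mahalanobis_sq(u, cov_u_inv) + mahalanobis_sq(v, cov_v_inv)) \
        / max(mahalanobis_sq(joint, cov_uv_inv), 1e-12)
    return r  # larger ratio -> higher parallelism, per the text
```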

4. Results and Discussion

4.1. Data Processing

The development set, training set, and test set of the corpus filtering system in this paper are, respectively, the Chinese-English news test sets from WMT 2018 and WMT 2019 (containing 3981 and 2000 source sentences with corresponding reference translations), the non-translated Chinese-English parallel corpus of CCMT 2020 (9.02 million Chinese-English sentence pairs), and the translation parallel corpus of CCMT 2020 (34.32 million Chinese-English sentence pairs).

The Jieba (https://github.com/fxsjy/jieba) tool performs word segmentation for the Chinese corpus, and Moses (https://statmt.org/moses/) scripts perform tokenization and lowercasing for the English corpus. Because the data volume is very large, the lowercased translation data is truncated to prevent GPU memory overflow during decoding. To alleviate the out-of-vocabulary (OOV) problem and improve the model's handling of rare words, byte pair encoding (BPE, https://github.com/rsennrich/subword-nmt) subword segmentation is applied to both the Chinese and English corpora. In addition, to avoid the memory shortage and long decoding time that loading and decoding 34 million sentences at once would cause, the translation data is split into shards of 2 million sentences each. Finally, sentences longer than 150 words are removed, followed by sentences with language identification errors.
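A minimal sketch of the Chinese-side preprocessing and sharding (the helper names and generator structure are ours):

```python
import jieba  # https://github.com/fxsjy/jieba

def preprocess_zh(line, max_len=150):
    # Segment a Chinese sentence; drop it if longer than 150 words.
    toks = list(jieba.cut(line.strip()))
    return " ".join(toks) if len(toks) <= max_len else None

def shards(path, size=2_000_000):
    # Split the 34M-pair translation corpus into 2M-line shards so
    # decoding never has to hold everything in memory at once.
    buf = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) == size:
                yield buf
                buf = []
    if buf:
        yield buf
```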

4.2. Evaluation Method

After the translation corpus is scored, it is sorted from high to low to realize corpus filtering. This paper selects parallel sentence pairs containing about 100 million words, takes them as a training set for Marian, the neural machine translation tool designated by the CCMT 2020 organizers, trains on them, and then tests on the test set designated by the organizers. BLEU, the evaluation index commonly used in machine translation, is used to evaluate the quality of the filtered corpus.

The final contestant shall provide the CCMT 2020 organizer with two filtered corpora of 100 million words and 500 million words. The CCMT 2020 organizer will take the corpora submitted by the contestants as the training set, use the Marian tool for training, ensure that all parameters are consistent during the training process, and conduct the test on the specified test set as the final result of the contestant.

4.3. GA-MLP-NN Single-System Experiment

Each single system, including the Zipporah system, is evaluated as follows. The scores that the system assigns to the translation data are sorted from high to low; if a system produces multiple scores, they are added or multiplied with weight 1.0 to obtain a comprehensive score. Then Marian, the machine translation tool provided by CCMT 2020, is used to train a neural machine translation system on the top-ranked data, and the BLEU value between the translation results on the development set and the reference translations is calculated. Dominant features are selected according to each system's BLEU value, and combinations of dominant features are tried to obtain a better ranking.

Due to limited computing resources, this paper trains each system for only 10 rounds and takes the highest BLEU value on the development set as the system's final score; see Table 1. Among them, the random systems randomly shuffle the data and sample a parallel corpus of 100 million words: random system 0 shuffles the data once, and random system 1 shuffles it five times. In addition, to explore the influence of domain on performance, this paper collects 1409 Chinese news samples and 1434 Chinese non-news samples from the non-translated parallel corpus, splits off 200 news and 200 non-news samples as a development set, and trains a binary domain classifier based on a convolutional neural network (CNN). As Table 1 shows, the results of random system 1 even exceed those of most systems. The best system is the translation-model-based similarity between translation and reference. The domain classifier is the worst: it mainly selects news corpus, and the results suggest that the proportion of news in the test set may not be high, leading to poor performance. It is noted that the top-ranked sentences in the corpus filtered by the translation model are not very sensitive to sentence length, so a large number of sentences of moderate length rank first. Although the rule system treats long and short sentences indiscriminately, it cannot measure the degree of parallelism, so its effect is not prominent when used alone.

The domain classifier is applied to the translation data, taking the predicted probability of the news class as the score. Table 2 reports the performance of the binary domain classifier. The classifier itself performs well, but Table 1 shows that translation performance based on it is very low, so it can be concluded that domain has little influence on the translation model in this task. The classifier is therefore used only for verification and is not included in the final system.

4.4. Parameter Setting of the GA-MLP-NN Model

In the parameter setting of the GA-MLP-NN model, hyperparameter optimization proceeds as follows (a minimal sketch of this loop follows the explanation below):

(1) Select a set of hyperparameters (automatically)
(2) Build the corresponding model
(3) Fit the model on the training data and measure its performance on the validation data
(4) Select the next set of hyperparameters to try (automatically)
(5) Repeat the above process
(6) Finally, measure the performance of the model on the test data

The key to this process is that, given many sets of hyperparameters, the history of validation performance is used to select the next set to evaluate, systematically and automatically exploring the possible decision space. In this way, the best-performing configuration of the GA-MLP-NN classifier is found empirically.
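A minimal sketch of the search loop; plain random search stands in for the history-guided selection the paper describes, and all names below are placeholders:

```python
import random

def random_search(param_space, build_model, train_eval, n_trials=20):
    """Sample hyperparameters, fit on training data, and keep the
    configuration with the best validation performance."""
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {k: random.choice(v) for k, v in param_space.items()}
        model = build_model(**params)       # step (2)
        score = train_eval(model)           # step (3): validation score
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Example search space (illustrative values only).
space = {"hidden_units": [64, 128, 256],
         "max_iter": [100, 200, 300, 400, 500]}
```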

The hyperparameter setting has an important impact on the recognition performance of the GA-MLP-NN classifier, so this paper runs many experiments for each group of settings. Taking the number of iterations as an example: since a smaller model loss indicates better robustness, the number of iterations can be chosen from the loss value. Specifically, the maximum number of iterations is set to 1, 50, 100, …, 500, and the smallest maximum iteration count at which the loss converges is selected. As shown in Figure 3, the loss no longer decreases once the maximum number of iterations reaches 300 and stabilizes at about 0.3. Therefore, the maximum number of iterations of the GA-MLP-NN model is finally set to 300, as shown in Figure 4.

4.5. GA-MLP-NN System Experiment after Fusion

This paper regards the translation model system, the genetic word alignment model system, the language model system, and the bilingual pretraining model system as the most promising systems, so combinations of these systems are fused and tested first, forming the GA-MLP-NN combination model. The multisystem fusion method is relatively simple: the scores of the individual systems are fused and the corpus is reordered. There are two fusion methods, weight multiplication and weight addition; in most cases, only fusion with weight 1.0 is tried. Table 3 shows some experimental results. The overall performance of the fused system exceeds that of any single system, mainly because different systems measure the parallelism of sentence pairs from different starting points, so sentence pairs can be evaluated more comprehensively after fusion, which demonstrates the effectiveness of method fusion.
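A sketch of the two fusion modes; treating "multiply by weight" as multiplying the weighted scores across systems is our reading of the text:

```python
def fuse_and_rank(system_scores, weights, mode="multiply"):
    """Fuse per-pair scores from several systems, then re-rank.
    system_scores: one list of scores per system, aligned by pair."""
    fused = []
    for per_pair in zip(*system_scores):
        if mode == "multiply":
            s = 1.0
            for score, w in zip(per_pair, weights):
                s *= score * w  # weight-multiplied fusion
        else:
            s = sum(score * w for score, w in zip(per_pair, weights))
        fused.append(s)
    # Return sentence-pair indices sorted by fused score, best first.
    return sorted(range(len(fused)), key=lambda i: -fused[i])

ranking = fuse_and_rank([[0.9, 0.2], [0.8, 0.4], [0.7, 0.3]],
                        weights=[1.0, 1.0, 1.0])
```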

4.6. Submission System

It is not the case that the more systems are fused, the better the performance. After extensive testing, the combination "1,3,4" shows high robustness and a high BLEU value. Considering system complexity, this paper selects the "1,3,4" combination as the main system, because the multilayer perceptron method has been proved an effective means of improving translation performance in the WMT 2018 and WMT 2019 corpus filtering tasks. In addition, since the pretraining model has advantages in semantic extraction, the combination "1,2,3,4,6" is selected as the subsystem. The final evaluation results are shown in Table 4. The main system (system2) submitted in this paper ranks second or first on all datasets except IWSLT 2020. Because the IWSLT 2020 dataset is a spoken language corpus, which differs from news corpus, the system performs poorly on it, indicating that the training domain affects the filtering results.

4.7. Comparison between GA-MLP-NN and Classical Methods

This paper compares the recognition results of the GA-MLP-NN model with those of classical methods such as naive Bayes and the support vector machine (SVM), analyzing the $F_1$ value, recall, precision, and accuracy of the experimental results; the better these values, the better the translation filtering effect. The results show that the GA-MLP-NN model has good filtering performance for high-quality parallel corpora.

To further verify the recognition effect of the proposed method, its recognition results are compared with those of classical methods such as naive Bayes and the SVM. Specifically, the optimal feature combination (four features) and the full feature set (six features) obtained on the basis of the KAM, together with the feature representation obtained by the bag-of-words method, are used to train the SVM classifier, the naive Bayes classifier, and the GA-MLP-NN classifier. The $F_1$ value, recall, precision, and accuracy of the experimental results are shown in Figures 5 and 6.

As can be seen in Figure 6, the $F_1$ value, precision, and accuracy of the three classifiers trained on the bag-of-words representation are lower than those of the classifiers trained on the two KAM-based feature representations. The $F_1$ values of the three classifiers based on the optimal feature combination are better than those based on the full feature set and the bag-of-words method. Specifically, the $F_1$ value of the GA-MLP-NN classifier based on the optimal feature combination is 3% higher than with the full feature set and 33% higher than with the bag-of-words method; for the SVM classifier, the margins are 2% and 11%; and for naive Bayes, 1% and 7%. The selection of the optimal feature combination therefore significantly improves classification and further verifies the feasibility and superiority of the method proposed in this paper.

As can be seen in Figure 7, GA-MLP-NN is much better than the other classifiers in $F_1$ value, recall, and accuracy. With the optimal feature combination, the $F_1$ value of the SVM classifier is 73.3%, while that of the GA-MLP-NN classifier is 13.5 percentage points higher, reaching 86.8%; likewise, the $F_1$ value of the naive Bayes classifier is 70%, and that of the GA-MLP-NN classifier is 16.8 percentage points higher. This fully reflects the superiority of the GA-MLP-NN classifier trained with the feature combination proposed in this paper.

5. Conclusion

In this paper, the GA-MLP-NN network is used as a classifier for comment usefulness recognition. The paper analyzes the impact of the proposed features on comment usefulness, measures the parallelism of Chinese and English sentences from different emphases, and obtains the best feature combination. The experimental results show that the $F_1$ value of naive Bayes based on the optimal feature combination is 1% higher than that based on the full feature set and 7% higher than that based on the bag-of-words method, indicating that selecting the optimal feature combination significantly improves classification and further verifies the feasibility and superiority of the proposed method. Compared with any single system, the improved multisystem fusion method based on weight multiplication achieves better results on the test set, addressing the application of the perceptron genetic algorithm neural network to Chinese-English bidirectional translation. The system presented in this paper ranks second or first on multiple datasets, which provides a useful reference for corpus filtering research. However, this paper does not carry out actual simulation verification of the adopted model, so some limitations remain and further research is needed.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This study is supported by the Hainan Provincial Social Science Project of Foreign Language Base: a study on the construction of Hainan’s local translation talent team under the background of new liberal arts (HNSK(JD)21-41).