Complexity

Volume 2019, Article ID 1712569, 10 pages

https://doi.org/10.1155/2019/1712569

## Analysis of College Students’ Public Opinion Based on Machine Learning and Evolutionary Algorithm

Correspondence should be addressed to Jinqing Zhang; zhangmt0391@163.com

Received 21 April 2019; Accepted 9 September 2019; Published 11 November 2019

Guest Editor: Gonzalo Farias

Copyright © 2019 Jinqing Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The recent information explosion may have many negative impacts on college students, such as distraction from learning and addiction to meaningless and fake news. To counter these effects, it is necessary to assess students’ state of mind and give them appropriate guidance. However, this task has several peculiarities, including its focus on the subject, its multiaspect nature, and the low consistency of interests across samples, which pose great challenges for mainstream opinion mining methods. To solve this problem, this paper proposes a new approach that uses a questionnaire covering most aspects of a student’s life to collect comprehensive information and feeds that information into a neural network. With reliable predictions of students’ state of mind and knowledge of feature importance, colleges can give students guidance tailored to their own experience and make macroscopic policies more effective. A pipeline is proposed to relieve overfitting when training on the collected information. First, singular value decomposition is used to pretreat the data set, which includes outlier detection and dimension reduction. Then, a genetic algorithm is introduced into the training process to find proper initial parameters for the network, which helps prevent the network from falling into a local minimum. A method of calculating the importance of students’ features is also proposed. The experimental results show that the new pipeline works well and that the predictor achieves high accuracy on fresh samples. The design procedure and the predictor provide suggestions for dealing with students’ state of mind and public opinion in colleges.

#### 1. Introduction

Youth is the most important period for college students to establish a mature outlook on life and values. In college, students’ perceptions of life and of various matters make up public opinion, which in turn influences students’ ideology. The advent of the Internet has diversified mass media, making it possible for people to obtain information that interests them anytime and anywhere. However, the quality and reliability of information vary more and more widely. Some untrue and negative information may pollute public opinion in college and harm students’ state of mind. At the personal level, studies have shown that students’ addiction to the Internet and to wireless mobile devices such as smartphones is related to increased stress and anxiety and to decreased academic performance and satisfaction with life [1, 2]. These impacts could make students pessimistic and make them feel that their lives are meaningless, feelings that are strongly related to depressive disorder and even suicide. At the societal level, the spread of rumors could make students more suspicious and lead them to regard social media and the government as liars [3]. When students enter society after graduation, their distrust of the government may give rise to disharmony. A student’s state of mind is the cell of public opinion in college, and there is strong evidence that students in positive environments are more likely to make great achievements [4]. To protect students from the negative impact of the information explosion, colleges should focus on guiding students with troubled minds, take responsibility for helping them correct their outlook on life and values, and encourage them to work for the development of the whole human race.

However, students are usually unwilling to seek guidance on their state of mind because many of them do not want to be regarded as “sick.” Colleges therefore need to offer guidance actively. But if students tend to hide their problems, it is difficult for colleges to know who needs guidance among thousands of students. One solution is to use machine learning (ML) tools such as the neural network (NN) to predict students’ state of mind. ML tools can automatically learn the function that maps students’ features to their state of mind and make predictions quickly and accurately as long as there are enough training data. With precise predictions of students’ state of mind, colleges can adjust their guidance according to each student’s own features to enhance its effectiveness.

ML has been widely used to predict people’s opinions by analyzing text collected from the Internet, but it may be less useful for predicting students’ state of mind. The reason is that predicting a student’s state of mind has several peculiarities: (1) focus on subject: the work focuses on the people who make judgements, not on the judgements they have made; (2) multiaspect: to make the analysis more accurate, the predictor should learn plenty of information from different aspects, but students might not voluntarily publish some of this information on the Internet; (3) low consistency on aspects: different students pay attention to different matters, so it is arbitrary to treat the answer to one particular question as a common criterion. To accommodate these peculiarities, more abundant data covering most aspects of opinion related to a student’s daily life should be collected for each sample, and the data of different samples should be consistent in content. If only text-based data from the Internet are collected, the data set will not be effective enough. In contrast, the traditional method of collecting data with a questionnaire meets these requirements better. The questionnaire used here is designed to cover most aspects of college students’ lives, and its scale questions help quantify students’ sentiment on different issues. A questionnaire also forces students to answer the same questions, so the data of different samples are highly consistent in the aspects they cover.

The ML tool used as the predictor is the NN. For a predictor, one of the most important criteria is generalization performance, that is, the prediction accuracy on fresh samples. However, when the samples are high-dimensional, they are too sparse to fill the sample space. In NN training, this lack of samples can cause overfitting [5]. An overfitted NN fits the training set well but predicts fresh samples poorly. As a result, a new approach is needed. This paper introduces an approach that uses singular value decomposition (SVD) to reduce the dimension directly and adds a closed loop based on the genetic algorithm (GA) to the training process to relieve overfitting. After a NN with good generalization performance is obtained, a method of calculating the importance of each feature is also proposed, which can help colleges combine macroscopic policies with microscopic guidance and strengthen overall effectiveness.

Section 2 reviews the related work. Section 3 introduces the process of using SVD to pretreat the data set. Section 4 introduces the method of obtaining a predictor with good generalization performance as well as the way of calculating feature importance. Section 5 describes the details of the experiment and shows the results. Section 6 concludes our study and introduces future work.

#### 2. Related Works

Many studies have addressed the mining of human opinions. Pang et al. [6] collected review data from IMDb and used different machine learning tools, such as naive Bayes classification, maximum entropy classification, and support vector machines, to classify audiences’ sentiment towards movies. Khan et al. [7] analyzed abundant Twitter text related to specific products and services and summarized users’ overall views of them to help producers and service providers improve their work. Zhan et al. [8] designed an algorithm that not only mined opinions from customer reviews but also automatically identified the salient topics in these opinions, which makes the analysis more targeted. Zhou et al. [9] converted customers’ reviews into answers to an automatically generated questionnaire and analyzed the collected data to identify the main points for improving user experience. Besides studies that focus on objects, several others focus on people. For example, Kosinski et al. [10] used “Facebook Likes” to predict a range of highly sensitive personal attributes and achieved high accuracy on some classification problems. Baik et al. [11] used buying behaviors to predict people’s scores on four different personality traits and showed better precision than previous studies. Besides the abovementioned application studies, some researchers have surveyed the whole field of public opinion mining. Pang and Lee [12] focused on improving the methods to address the new challenges raised by opinion mining. Tsytsarau and Palpanas [13] tried to define opinion mining to clarify the basic work required to mine public opinion. Ravi and Ravi [14] divided research studies into different levels and summarized the characteristics of each level. These summaries provide researchers with powerful tools for opinion mining and give criteria for assessing their work.

The questionnaire has been widely used to collect data when a comprehensive personality profile of a person is needed. Topp et al. [15] reviewed 213 relevant articles to check the utility of a questionnaire named the WHO-5 Well-Being Index and confirmed its validity both in depression screening and in outcome measurement in clinical trials. Garfinkel et al. [16] used a questionnaire to measure interoceptive sensibility, an important dimension of interoception, which could help explain the cognitive, emotional, and clinical associations of interoceptive ability. Duckworth and Yeager [17] considered a self-report questionnaire more efficient than other measures in studies assessing internal psychological states such as feelings of belonging.

Previous research makes clear that a questionnaire is good at collecting comprehensive data from a single person and that the data of different persons are highly consistent in the aspects covered. The collected data are good training material for human-focused opinion mining, which can learn the inner connection between students’ behaviors and their state of mind. In this paper, the combination of the two methods addresses the peculiarities described above and enables precise prediction of students’ state of mind.

#### 3. Data Collection and Pretreatment

This section introduces the source of the data on college students’ state of mind and describes the data pretreatment, which includes outlier detection and dimension reduction. Both steps are based on SVD.

##### 3.1. Data Source

The data used in the experiment come from a survey on students’ state of mind conducted by Northwestern Polytechnical University in September 2017. The surveyed students were from different grades (including some master’s and doctoral students). After screening and checking, the total number of valid samples is 953.

The questionnaire consists of 30 questions designed to cover most aspects of students’ daily life and their opinions. In terms of content, the questions can be divided as follows: (1) basic information: gender, grade, subject, and so on; (2) individual development: personal development since entering university and plans after graduation; (3) focus of attention: recent events that the student has followed; (4) mind identity: agreement with certain policies and opinions; (5) school work evaluation: satisfaction with school work and directions for improvement. In terms of form, the questions can be divided into single-choice, multiple-choice, scale, and essay questions.

Different types of questions need different primary pretreatments to produce the original data set. Each option of a single-choice or multiple-choice question is expanded into an independent variable whose value is determined by whether the option is selected; the answers to scale questions are added to the data set directly; essay questions are ignored because most respondents left them blank. After primary pretreatment, the sample vector has 160 dimensions. One of the variables is selected as the sample label, and the rest are the students’ features. The sample label is given according to the students’ evaluation of their own state of mind: label 1 is positive, meaning the student does not need guidance; label 0 means the student is not mature and needs guidance.
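The primary pretreatment described above can be sketched as follows. The question names, option lists, and the helper `encode_response` are hypothetical and serve only to illustrate how choice questions expand into 0/1 variables while scale answers are copied directly; they are not the survey's actual questions.

```python
# Hypothetical questionnaire schema (illustrative only; not the survey's real questions).
SCHEMA = [
    ("gender", "single", ["male", "female"]),
    ("news_sources", "multiple", ["social_media", "tv", "newspaper"]),
    ("satisfaction", "scale", None),
]

def encode_response(response):
    """Expand one student's answers into a flat numeric feature vector."""
    vector = []
    for name, kind, options in SCHEMA:
        answer = response[name]
        if kind == "single":
            # One 0/1 variable per option; exactly one is set.
            vector.extend(1.0 if answer == opt else 0.0 for opt in options)
        elif kind == "multiple":
            # One 0/1 variable per option; any number may be set.
            chosen = set(answer)
            vector.extend(1.0 if opt in chosen else 0.0 for opt in options)
        elif kind == "scale":
            # Scale answers enter the data set unchanged.
            vector.append(float(answer))
    return vector

student = {"gender": "female", "news_sources": ["tv", "social_media"], "satisfaction": 4}
features = encode_response(student)
print(features)  # [0.0, 1.0, 1.0, 1.0, 0.0, 4.0]
```

Three questions become six variables here, which is the same mechanism by which the 30 survey questions grow into a 160-dimensional sample vector.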

##### 3.2. Meaning of SVD

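SVD can be considered the generalization of eigendecomposition from square matrices to matrices of any size [18]. In this case, the original data set is $S \in \mathbb{R}^{m \times n}$, which means there are $m$ samples in the data set and each sample has $n$ features. After the SVD process, there are orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ that represent $S$ as follows:

$$S = U \Sigma V^{T}. \tag{1}$$

In (1), $\Sigma \in \mathbb{R}^{m \times n}$ has the structure $\Sigma = \begin{bmatrix} \Sigma_1 \\ 0 \end{bmatrix}$, where $\Sigma_1 = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$ is a diagonal matrix and 0 is the zero matrix. $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$ are the singular values of $S$ sorted in descending order. If 0 is removed, the related vectors in $U$ can be deleted so that $U \in \mathbb{R}^{m \times n}$ and $\Sigma \in \mathbb{R}^{n \times n}$.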

An $n$-dimensional coordinate system can be established in the space of student samples, with axes corresponding to the sample features, so that every student sample is represented by a point. The coordinate of sample $i$ is $s_i = (s_{i1}, s_{i2}, \ldots, s_{in})$, the $i$th row vector of $S$. The process of SVD can then be considered a coordinate transformation within the sample space, and each column vector of $V$ represents a base vector of the new coordinate system. The new base vectors can be given abstract meanings according to their relationship with the original features. All the new base vectors are perpendicular to each other because $V$ is an orthogonal matrix. Let $S' = SV$, so that

$$S' = SV = U \Sigma V^{T} V = U \Sigma. \tag{2}$$

From (2), each row vector in $S'$ represents the coordinate of a sample in the new coordinate system. Meanwhile, the singular values related to the different base vectors represent the dispersion of the samples along these directions. If a singular value is large, the samples’ projections on its related base vector are widely distributed, which means that abundant information is stored along that direction.
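The coordinate-transformation view of SVD can be checked numerically. The sketch below uses NumPy on random toy data (not the survey data) and verifies that the new coordinates $SV$ equal $U\Sigma$ and that the norm of the projections on each new axis equals the corresponding singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(8, 3))          # toy data: m = 8 samples, n = 3 features

# Thin SVD: U is m x n, sigma holds the n singular values in descending order.
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
V = Vt.T

S_new = S @ V                        # coordinates of every sample in the new basis

# V is orthogonal, so its columns are mutually perpendicular unit vectors.
assert np.allclose(V.T @ V, np.eye(3))
# S V = U Sigma: each row is one sample expressed in the new coordinate system.
assert np.allclose(S_new, U * sigma)
# The Euclidean norm of the projections on each new axis equals its singular value.
assert np.allclose(np.linalg.norm(S_new, axis=0), sigma)
```

Passing `full_matrices=False` corresponds to removing the zero block of the singular value matrix, which leaves an $m \times n$ matrix of left singular vectors.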

##### 3.3. Application of SVD in Outlier Detection

A larger singular value corresponds to a base vector along which the samples are more widely scattered, so a given bias along a base vector with a small singular value contributes more to a sample’s deviation. Consequently, the bias on a base vector with a small singular value should be given a high weight when calculating the total deviation of a sample. Before the deviation is calculated, the singular values are sorted in descending order as $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$, and each base vector is assigned a weight that grows as its singular value shrinks.

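The bias of student sample $i$ on the $j$th new base vector can be represented by its *Z*-score:

$$z_{ij} = \frac{s'_{ij} - \bar{s}'_{j}}{\operatorname{std}(s'_{j})}, \tag{4}$$

where $s'_{ij}$ is the element of $S' = SV$ in row $i$ and column $j$, $\bar{s}'_{j}$ represents the mean of all elements in the column vector $s'_{j}$, and $\operatorname{std}(s'_{j})$ is their standard deviation. The total deviation of the sample is then the weighted combination of its *Z*-scores over all new base vectors.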

After the deviations of all samples have been calculated, a self-adapting threshold is set. If a sample’s deviation exceeds the threshold, the sample is deleted as an outlier to make the data set more credible. A training set with high reliability improves the generalization performance of the predictor.
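A minimal sketch of this outlier-detection procedure follows. Three choices in it are assumptions made only for illustration and are not taken from the paper: the weights are proportional to $1/\sigma_j$, the total deviation is the weighted sum of absolute *Z*-scores, and the "self-adapting" threshold is the mean of the deviations plus `n_std` standard deviations. The function name `svd_outlier_mask` is likewise hypothetical.

```python
import numpy as np

def svd_outlier_mask(S, n_std=2.0):
    """Return a boolean mask marking samples whose weighted deviation is too large.

    ASSUMPTIONS (not taken from the paper): weights are proportional to
    1 / sigma_j, the total deviation is the weighted sum of |Z-scores|, and
    the threshold is mean + n_std * std of the deviations.
    """
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    S_new = S @ Vt.T                               # samples in the new basis
    keep = sigma > 1e-10 * sigma[0]                # skip numerically null directions
    S_new, sigma = S_new[:, keep], sigma[keep]

    weights = (1.0 / sigma) / np.sum(1.0 / sigma)  # small sigma -> high weight
    z = (S_new - S_new.mean(axis=0)) / S_new.std(axis=0)
    deviation = np.abs(z) @ weights                # one score per sample

    threshold = deviation.mean() + n_std * deviation.std()
    return deviation > threshold

rng = np.random.default_rng(1)
# Toy data lying close to a 2-D plane inside a 3-D feature space.
S = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
S += 0.01 * rng.normal(size=S.shape)
S[0] += np.array([1.0, 1.0, -1.0])                 # push one sample off the plane

mask = svd_outlier_mask(S)
print(bool(mask[0]), int(mask.sum()))
```

The displaced sample is unremarkable along the two dominant directions but extreme along the direction with the smallest singular value, which is the situation the weighting is meant to expose.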

##### 3.4. Application of SVD in Dimension Reduction

A larger singular value corresponds to more information, so singular values can be used to reduce the dimension of the data set. Specifically, the small singular values and their related vectors in $U$ and $V$ are deleted. The matrices are then reconstructed as $U_k \in \mathbb{R}^{m \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, and $V_k \in \mathbb{R}^{n \times k}$, where $k$ is the number of retained singular values, and formula (1) becomes $S \approx S_k = U_k \Sigma_k V_k^{T}$.

However, even some new base vectors with small singular values might be highly correlated with the label, which means they can help increase the classification accuracy of the predictor. To protect them, the correlation between a base vector and the sample label is added to the selection criterion. The importance score of a base vector is calculated from the correlations between the original features and the label and from the elements of $V$, which represent the relationship between the original features and the new base vectors.

The amount of information carried by a matrix can be measured by its Frobenius norm (*F*-norm). The *F*-norm of $S$ is

$$\|S\|_F = \sqrt{\sum_{i=1}^{n} \sigma_i^{2}}, \tag{7}$$

where the singular values $\sigma_i$ are sorted in descending order of the importance score of their base vectors. After the base vectors with smaller scores have been deleted, the amount of remaining information can be represented by the *F*-norm of the reduced matrix $S_k$, and the percentage of information retained is

$$\eta = \frac{\|S_k\|_F}{\|S\|_F} = \sqrt{\frac{\sum_{i=1}^{k} \sigma_i^{2}}{\sum_{i=1}^{n} \sigma_i^{2}}}, \tag{8}$$

where $k$ is the number of retained base vectors.
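The selection of base vectors and the retained-information ratio can be sketched as below. The helper `reduce_dimension` and its `scores` argument are hypothetical: `scores` stands in for the importance score of each base vector, and when it is omitted the singular values themselves are used, which is an assumption made only for this example.

```python
import numpy as np

def reduce_dimension(S, k, scores=None):
    """Keep the k new base vectors with the highest scores.

    `scores` stands in for the importance score of each base vector; when it
    is omitted the singular values themselves are used (an assumption made
    only for this sketch).
    Returns the reduced data (m x k) and the fraction of the F-norm retained.
    """
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    if scores is None:
        scores = sigma
    kept = np.argsort(scores)[::-1][:k]            # indices of the k best vectors
    S_reduced = S @ Vt[kept].T                     # coordinates on the kept vectors
    retained = np.sqrt(np.sum(sigma[kept] ** 2) / np.sum(sigma ** 2))
    return S_reduced, retained

rng = np.random.default_rng(2)
# Toy data whose six features have very different spreads.
S = rng.normal(size=(50, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
S_reduced, retained = reduce_dimension(S, k=3)
print(S_reduced.shape, round(float(retained), 3))
```

Because the retained columns are orthogonal projections, the *F*-norm of `S_reduced` equals `retained` times the *F*-norm of `S`, so the ratio can be monitored while choosing $k$.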

Reducing the dimension of the sample space can prevent the overfitting caused by the sparsity of samples and strengthen the generalization performance of the predictor. Furthermore, because the noise carried by the data set is likely to have smaller variance than the useful information, dimension reduction can also weaken the impact of random noise on the data set.

#### 4. Prediction on Students’ State of Mind

This section describes how the back-propagation (BP) algorithm can be used to train a NN for predicting students’ state of mind. However, using only the BP algorithm is found to lead to overfitting, so a new algorithm that incorporates GA is proposed to relieve it. After a NN with good generalization performance is obtained, a method of calculating the importance of different features is also proposed.

##### 4.1. BP-NN

The BP algorithm is a common algorithm in ML, so a NN trained by the BP algorithm is first established to predict students’ state of mind. After dimension reduction, the student samples can be represented by $X \in \mathbb{R}^{m \times k}$. Here, $m$ is not the total number of student samples but the number remaining after outliers are deleted from the data set, and $k$ is the number of retained new features of each student sample. Strictly, $X$ should hold the samples’ coordinates on the retained base vectors, but in practice each feature is *Z*-scored by (4) so that it fits the standard normal distribution. This pretreatment balances the learning rates of the parameters at different nodes. A data set $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ is then obtained, where $\mathbf{x}_i$ is a row vector of $X$ and $y_i$ is the label of the $i$th student sample.

The NN used for prediction has three layers. The input layer consists of $k$ nodes that receive the data vector $\mathbf{x} = (x_1, \ldots, x_k)$. The output layer has a single node that outputs the prediction for the sample. The number of hidden nodes $l$ is adjustable to fit the actual demand. $x_i$, $b_h$, and $O$ denote the $i$th input node, the $h$th hidden node, and the output node, respectively. The parameters of the NN include the connection weights $v_{ih}$ between $x_i$ and $b_h$, the connection weights $w_h$ between $b_h$ and $O$, the thresholds $\gamma_h$ of $b_h$, and the threshold $\theta$ of $O$. The node thresholds make the NN a nonlinear function; an activation function $f(\cdot)$ is used as their equivalent function, and the output of the NN is

$$\hat{y} = f\left(\sum_{h=1}^{l} w_h\, f\left(\sum_{i=1}^{k} v_{ih} x_i - \gamma_h\right) - \theta\right). \tag{9}$$
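The forward pass of such a three-layer network can be sketched as follows. The sigmoid activation is an assumption of this sketch, as are the parameter names `V`, `gamma`, `w`, and `theta` and the toy layer sizes; the values are random and do not come from the trained predictor.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, V, gamma, w, theta):
    """Three-layer network: k inputs -> l sigmoid hidden nodes -> 1 sigmoid output."""
    b = sigmoid(x @ V - gamma)        # hidden activations, shape (l,)
    return sigmoid(b @ w - theta)     # scalar prediction in (0, 1)

k, l = 4, 3                           # toy sizes; the number of hidden nodes is adjustable
rng = np.random.default_rng(3)
V = rng.normal(size=(k, l))           # input-to-hidden weights
gamma = rng.normal(size=l)            # hidden thresholds
w = rng.normal(size=l)                # hidden-to-output weights
theta = rng.normal()                  # output threshold

x = rng.normal(size=k)                # one Z-scored sample
y_hat = forward(x, V, gamma, w, theta)
print(float(y_hat))
```

With a sigmoid output, a prediction above 0.5 can be read as label 1 and a prediction below 0.5 as label 0.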

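The optimization goal of the BP algorithm is usually the mean square error (MSE) between the output and the label:

$$E = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^{2}, \tag{10}$$

where $\hat{y}_i$ is the NN output for the $i$th sample and $y_i$ is its label.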

The BP algorithm adjusts the parameters along the negative gradient of $E$ to decrease the error between the prediction and the real label. For example, the change in the hidden-to-output weight $w_h$ in each training round is

$$\Delta w_h = -\mu \frac{\partial E}{\partial w_h}, \tag{11}$$

where $\mu$ is the learning rate, which determines the speed of training.
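One round of this update can be sketched for all parameters at once. As before, the sigmoid activation, the parameter names, the learning rate of 0.5, and the toy data are assumptions made for illustration; the gradients are derived by the chain rule from the MSE.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mse(X, y, V, gamma, w, theta):
    B = sigmoid(X @ V - gamma)
    return np.mean((sigmoid(B @ w - theta) - y) ** 2)

def bp_step(X, y, V, gamma, w, theta, mu=0.5):
    """One full-batch gradient-descent step on the MSE; returns updated parameters."""
    m = X.shape[0]
    B = sigmoid(X @ V - gamma)                     # (m, l) hidden outputs
    y_hat = sigmoid(B @ w - theta)                 # (m,) predictions
    # dE/d(output pre-activation), using sigmoid'(a) = f(a) * (1 - f(a)).
    g = 2.0 / m * (y_hat - y) * y_hat * (1.0 - y_hat)
    # Error propagated back to the hidden pre-activations.
    e = np.outer(g, w) * B * (1.0 - B)             # (m, l)
    # Move every parameter against its gradient, scaled by the learning rate mu.
    # Thresholds enter with a minus sign, so their updates carry a plus sign.
    return (V - mu * (X.T @ e), gamma + mu * e.sum(axis=0),
            w - mu * (B.T @ g), theta + mu * g.sum())

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)          # toy labels, not survey data
params = (0.1 * rng.normal(size=(4, 3)), np.zeros(3), 0.1 * rng.normal(size=3), 0.0)

before = mse(X, y, *params)
for _ in range(200):
    params = bp_step(X, y, *params)
after = mse(X, y, *params)
print(before > after)
```

The quality of the result depends on the randomly drawn initial parameters, which is the sensitivity that motivates searching for better initial values.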

Denote the true mapping from a student’s features to his or her state of mind by $y = g(\mathbf{x})$ and the mapping realized by the NN by $\hat{y} = \hat{g}(\mathbf{x})$. The BP algorithm rapidly decreases the difference between $\hat{g}$ and $g$, so that the trained NN can be used as a predictor of students’ state of mind.

##### 4.2. Description and Analysis on Overfitting

However, the BP algorithm did not work well in the preliminary experiment. To test the usefulness of the predictor, the data set $D$ was randomly divided into a training set and a test set. Figure 1 shows that the MSE of the NN’s predictions evolves differently on the training set and on the test set.
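A random division of this kind can be sketched with the standard library alone. The helper `split_dataset`, the 30% test fraction, and the fixed seed are assumptions for illustration; the split ratio actually used is not stated here.

```python
import random

def split_dataset(D, test_fraction=0.3, seed=0):
    """Randomly partition D into a training set and a test set."""
    indices = list(range(len(D)))
    random.Random(seed).shuffle(indices)           # reproducible shuffle
    n_test = int(round(test_fraction * len(D)))
    test = [D[i] for i in indices[:n_test]]
    train = [D[i] for i in indices[n_test:]]
    return train, test

D = [((i, i + 1), i % 2) for i in range(10)]       # toy (features, label) pairs
train, test = split_dataset(D)
print(len(train), len(test))  # 7 3
```

Evaluating the MSE on both subsets after every training round gives the two curves whose divergence signals overfitting.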