Abstract

Mental health issues are rising alarmingly among undergraduates and have gradually become a focus of social attention. With the emergence of abnormal events such as suspension and even suicide caused by mental health issues, public concern about undergraduates’ mental health has intensified. Based on questionnaires about undergraduates’ mental health issues, this paper uses keyword extraction to analyze the management and planning of undergraduates’ mental health. Building on the classical TextRank algorithm, this paper proposes an improved TextRank algorithm based on upper approximation rough data-deduction. The experimental results show that the precision, recall, and F1 of the proposed algorithm are significantly improved, and they also demonstrate that the proposed algorithm performs well in running time and physical memory occupation.

1. Introduction

The mental health and wellbeing of undergraduates have deteriorated over the last decade. Even before the COVID-19 pandemic, higher education was facing a “mental health crisis” [1, 2]. The rapid onset of the pandemic has introduced countless additional stressors, and faculty concern over student wellbeing has increased. Over the past decade, depression has increased from about 25% of undergraduates in 2010 to almost 30% in 2020, anxiety has increased from 22% in 2014 to 31% in 2020, and suicidal ideation has increased from 6% in 2010 to 11% in 2020 [3]. How often mental health management is organized for undergraduates varies from university to university. The influence of the pandemic on undergraduates’ mental health is therefore a major concern. The pandemic has affected the economic development of many countries, and cooperation among countries on the pandemic has also led to conflicts. Widespread reports on the Internet and in the media make it difficult for inexperienced undergraduates to distinguish reliable information from unreliable information. Therefore, the management and planning of undergraduates’ mental health are important under the COVID-19 pandemic [4, 5].

Keywords are words that express the central content of a document. Keywords extracted from a document can accurately describe its content and facilitate fast information processing. There are two main types of keyword extraction algorithms: unsupervised and supervised keyword extraction methods [6–8]. An unsupervised keyword extraction method does not need a manually labeled corpus; instead, it uses heuristics to find important words in the text as keywords. In an unsupervised method, candidate words are first extracted, each candidate word is scored, and the top-K candidate words with the highest scores are output as keywords. Depending on the ranking strategy, different algorithms exist, such as term frequency-inverse document frequency (TF-IDF), TextRank, and latent Dirichlet allocation (LDA). A supervised keyword extraction method regards keyword extraction as a binary classification problem. First, candidate words are extracted and labeled, and a keyword extraction classifier is trained. When a new document arrives, all candidate words are extracted, and the trained classifier is used to classify each candidate word. Finally, the candidate words labeled as keywords are returned as keywords [9, 10].
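
To make the unsupervised pipeline above concrete, the following minimal sketch scores the words of one document by TF-IDF and returns the top-K as keywords; the toy documents, the whitespace tokenizer, and the plain log(N/df) weighting are illustrative assumptions, not the exact formulation of any of the cited systems.

```python
import math
from collections import Counter

def tfidf_keywords(documents, target_index, top_k=5):
    """Score the words of one document by TF-IDF and return the top-K as keywords."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Term frequency in the target document.
    tf = Counter(tokenized[target_index])
    total = sum(tf.values())
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = [
    "anxiety and academic stress affect undergraduate mental health",
    "the survey collects a self evaluation of academic stress",
    "keyword extraction ranks candidate words by importance",
]
print(tfidf_keywords(docs, target_index=0, top_k=3))
```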

Accordingly, the main contributions of this paper are summarized as follows. (i) I study the TextRank keyword extraction algorithm. (ii) An improved TextRank algorithm based on upper approximation rough data-deduction is proposed.

The rest of this paper is structured as follows. Section 2 reviews the related work. In Section 3, I propose an improved TextRank algorithm based on upper approximation rough data-deduction. The experimental results are shown in Section 4. Section 5 concludes this paper.

2. Related Work

Many strategies for the management and planning of undergraduates’ mental health have been proposed. In [11], the authors examined student perspectives on college mental health, including the primary mental health issues affecting students, common student stressors, student awareness of campus mental health resources, and the mental health topics students wanted more information about. Little research existed on trends in on-campus service utilization for the mental health concerns of college students; rates of broad service utilization had been reported, but no published study had examined the direct relationship between a range of common mental health symptoms and on-campus service utilization. In [12], the authors explored which common mental health concerns were associated with the utilization of specific on-campus services by undergraduate students and whether endorsement of more mental health concerns predicted a higher number of services utilized. In [13], the study investigated the moderating role of perceived social support in the relationship between academic demands (measured as perceived academic stress) and the mental health of undergraduate students in full-time employment. A growing number of developing countries have experienced worsening air pollution, which has been shown to cause significant health problems, yet few studies have explored its impact on the mental health of university students, particularly in the Chinese context. To address this gap, the study in [14] examined the effects of air pollution on the mental health of final-year Chinese undergraduates through a large-scale cross-sectional survey and multivariable logistic regression.

The TextRank algorithm plays an important role in keyword extraction. In [15], the author presented an automatic keyword extraction algorithm based primarily on a weighted TextRank model, in which word embedding vectors were used to compute a similarity measure as an edge weight. As a typical keyword extraction technology, TextRank has been used in a wide variety of commercial applications, including text classification, information retrieval, and clustering; in these applications, the parameters of TextRank, including the cooccurrence window size, the iteration number, and the decay factor, are usually set roughly. In [16], the authors conducted an empirical study on TextRank to find optimal parameter settings for keyword extraction. The keyword weight propagation in TextRank focuses only on word frequency. To improve the performance of the algorithm, in [17], the authors proposed semantic clustering TextRank, a semantic-clustering-based news keyword extraction algorithm built on TextRank. In [18], the authors introduced a new human-annotated Chinese patent dataset and proposed a sentence-ranking-based term frequency-inverse document frequency algorithm for patent keyword extraction, motivated by the idea that “the keywords are in the key sentences”; in the algorithm, a sentence-ranking model filters the top-K-s percent of sentences from each patent based on a sentence semantic graph and heuristic rules. In [19], the authors introduced a word network whose nodes represent the words in a document, defined any keyword extraction method based on a word network as a Word-net method, and then proposed a new network model that considers the influence of sentences together with a new word-sentence method based on that model. In [20], the authors proposed an ontology and enhanced word-embedding-based methodology for automatic keyphrase extraction from geoscience documents.

There are also some other methods for keyword extraction. In [21], an enhancement of the term weighting was proposed particularly in the form of a series of modified term frequency-inverse document frequencies, for improving keyword extraction. In [22], the authors proposed an improved rapid automatic keyword extraction method, which used the word string matching feature in the dictionary method to correspond to the relevant execution action function. In [23], a novel text mining approach based on keyword extraction and topic modeling was introduced to identify key concerns and their dynamics of on-site issues for better decision-making process.

3. Improved Algorithm Based on Rough Data-Deduction

The TextRank keyword extraction algorithm is a graph-based ranking algorithm derived from Google’s PageRank algorithm [24]. TextRank first segments the target text into meaningful words and constructs a candidate word graph, and then uses a voting mechanism to rank the candidate words and extract keywords. The task of keyword extraction is to extract several important words from the target text. The TextRank algorithm uses the local correlation between words (i.e., a cooccurrence sliding window) to determine the correlation between candidate words and then iteratively computes and ranks the candidate keywords. Rough set theory has been used in text classification to speed up classification and improve accuracy. Rough data-deduction is based on rough set theory and integrates approximate information from the upper approximation concept into the data reasoning process. This paper introduces upper approximation-based rough data-deduction into the TextRank keyword extraction algorithm, and the extracted keywords are used in the management and planning of undergraduates’ mental health.
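
As a reference point for the improvements described below, the following sketch implements the classical TextRank procedure just summarized: a cooccurrence graph is built with a sliding window and candidate words are ranked by an iterative PageRank-style vote. The window size, damping factor, iteration count, and toy token list are illustrative assumptions.

```python
from collections import defaultdict

def textrank(tokens, window=5, d=0.85, iterations=50):
    """Rank candidate words by iterative voting over a cooccurrence graph."""
    # Build an undirected cooccurrence graph: words inside the sliding window vote for each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)
    # PageRank-style update, repeated a fixed number of times as a stand-in for convergence.
    score = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        score = {
            w: (1 - d) + d * sum(score[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)

tokens = ["academic", "stress", "anxiety", "sleep", "stress", "academic", "pressure"]
print(textrank(tokens)[:3])
```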

In the TextRank keyword extraction algorithm, a graph model is constructed over the candidate keywords in the text according to their cooccurrence correlation, and the weight of each node is then updated repeatedly through the transition probability matrix until convergence. After convergence, words are ranked in descending order of weight, and the first N words are selected as the extracted keywords. This method is concise and effective, but it has certain limitations. In [25], the convergent operation utilizes a clustering strategy to group the population into multiple clusters. The use of a cooccurrence window considers only the correlation between local words, so some words closely related to a certain keyword may be ignored, but the keywords of a document are not limited to words that cooccur within a local window. When extracting text keywords, I should fully consider the words in the text as well as potentially related words. Words with potential correlation have an important impact on the whole iterative ranking process, and this potential relation can be discovered by the theory of rough data-deduction. Therefore, this paper proposes an improved TextRank algorithm based on upper approximation rough data-deduction.

The candidate keywords are grouped according to the word sense similarity of mental health words. Since a document may contain a group of words with similar senses, the weight of such a group should be increased when the words describe the same important content, so as to improve the accuracy of the extraction results. The TextRank algorithm considers only the words themselves and ignores the contribution of words with similar senses. Therefore, the improved algorithm takes word sense into account and divides the candidate words by sense, which allows keywords to be extracted more effectively.

The rough data-deduction space M = (U, N, D) is introduced to describe the keyword extraction of undergraduates’ mental health issues structurally. U is the universe of discourse (UOD), the dataset composed of candidate keywords about undergraduates’ mental health. N is a set of equivalence relations, and E ∈ N; for p, q ∈ U, <p, q> ∈ E if and only if p is similar to q. D ⊆ U × U is defined as D = {<p, q> | p, q ∈ U and there is a relation between p and q}.

Assuming that the deduction correlation is defined as equation (1) by using rough data-deduction, where cw1–cw6 are the candidate keywords obtained from the text through word segmentation and filtering, the deduction correlation is determined by the degree of the association rules, that is, pointwise mutual information (PMI).

At the same time, for the equivalence relation E ∈ N, the equivalence division is based on the similarity between the candidate words.

In rough data-deduction, for candidate word cw1, the algorithm obtains cw2 and cw3 by the similarity rule and therefore places cw1, cw2, and cw3 into one equivalence class; cw4–cw7 are divided similarly. Then, cw4 can be obtained from cw1 based on the degree of the association rules of PMI, and cw5, cw6, and cw7 are associated in the same way. According to rough data-deduction, for cw1, [cw1]E = {cw1, cw2, cw3}, D([cw1]E) = {cw4, cw6}, and the upper approximation Ē(D([cw1]E)) = {cw4, cw6}, so cw1 roughly deduces cw6 under E (written cw1 ⊨E cw6). For candidate word cw6, [cw6]E = {cw4, cw6}, D([cw6]E) = {cw5}, and Ē(D([cw6]E)) = {cw5, cw7}, so cw6 ⊨E cw7. From cw1 ⊨E cw6 and cw6 ⊨E cw7, cw1 ⊨E cw7 can be obtained. Thus, there is also a potential correlation between cw1 and cw7, which can provide a certain contribution rate to the calculation. The association between candidate keywords is established by the above rules, and the association weight can be added to the iterative calculation process as a contribution rate to improve the accuracy of keyword extraction.
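
Under one reading of this example, the whole chain can be checked mechanically. The sketch below encodes the equivalence classes and the direct deduction relation D as plain sets; the helper names and the concrete pairs in D are assumptions made only to reproduce the cw1–cw7 example.

```python
# Equivalence classes under E, taken from the cw1-cw7 example above.
classes = [{"cw1", "cw2", "cw3"}, {"cw4", "cw6"}, {"cw5", "cw7"}]
# Direct deduction relation D (assumed pairs with nonzero PMI, chosen to match the example).
D = {("cw1", "cw4"), ("cw1", "cw6"), ("cw6", "cw5")}

def eq_class(word):
    """Return [word]_E, the equivalence class of a candidate word."""
    return next(c for c in classes if word in c)

def upper_approximation(subset):
    """E-upper approximation: the union of all classes that intersect the subset."""
    return set().union(*(c for c in classes if c & subset)) if subset else set()

def deduces(p, q):
    """p roughly deduces q under E when q lies in the upper approximation of D([p]_E)."""
    deduced = {y for (x, y) in D if x in eq_class(p)}
    return q in upper_approximation(deduced)

print(deduces("cw1", "cw6"))  # True, as in the example
print(deduces("cw6", "cw7"))  # True, via the class {cw5, cw7}
# Chaining the two steps gives the potential correlation between cw1 and cw7.
print(deduces("cw1", "cw6") and deduces("cw6", "cw7"))
```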

The upper approximation-based rough data-deduction to TextRank keyword extraction algorithm is summarized as follows.

Step 1. Based on the TextRank algorithm, the text related to undergraduates’ mental health is preprocessed, including sentence segmentation, word segmentation, and part-of-speech (POS) tagging, and the candidate keywords are obtained.
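
The paper does not fix a particular toolkit for this preprocessing step, so the sketch below uses NLTK as one possible implementation (it assumes the punkt and POS tagger resources have been downloaded) and keeps only nouns and adjectives as candidates, which is a common but assumed filtering choice.

```python
import nltk
# One-time setup (assumed): nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def candidate_keywords(text, keep_tags=("NN", "NNS", "JJ")):
    """Step 1: sentence segmentation, tokenization, POS tagging, and POS-based filtering."""
    candidates = []
    for sentence in nltk.sent_tokenize(text):
        tokens = nltk.word_tokenize(sentence)
        for word, tag in nltk.pos_tag(tokens):
            # Keep alphabetic nouns and adjectives as candidate keywords.
            if tag in keep_tags and word.isalpha():
                candidates.append(word.lower())
    return candidates

text = "Academic stress and poor sleep increase anxiety. Counseling services help students."
print(candidate_keywords(text))
```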

Step 2. The candidate keywords are divided into different equivalence classes according to their similarities. In this paper, the division is based on WordNet and Wikitext. For any two candidate words cw1 and cw2, the division rule combines the two similarities as ω1s1 + ω2s2, where s1 and s2 are the similarities calculated by WordNet and Wikitext, respectively, and ω1 and ω2 are the weights assigned to s1 and s2, with ω1 + ω2 = 1.
Assuming that candidate word cw1 is distributed in WordNet WN and cw2 is distributed in Wikitext WT, the intersection of WN and WT is WW. The value strategy for ω1 and ω2 is summarized as follows:
(i) When cw1 ∈ WW and cw2 ∈ WW, the similarity between cw1 and cw2 is calculated based on WN and WT, respectively, which gives s1 and s2. In this paper, ω1 = ω2 = 0.5.
(ii) When cw1 ∈ WN and cw2 ∈ WN, or cw1 ∈ WT and cw2 ∈ WT, the similarity between cw1 and cw2 is calculated as s1 or s2 based on WN or WT alone, where one of ω1 and ω2 is 1 and the other is 0.
(iii) When cw1 ∈ WN and cw2 ∈ WT, the synonym set of cw2 is searched in WT, the similarity with cw1 is then calculated based on WN, and the maximum value is denoted as s1. If cw2 has no synonym in WT, then s1 = 0.2, ω1 = 1, and ω2 = 0.
(iv) When cw1 ∈ WN and cw2 ∈ WW, the similarity between cw1 and cw2 is calculated based on WN and denoted as s1. Then, the synonym set of cw2 is searched in WT, the similarity with cw1 is calculated based on WN, and the maximum value is denoted as s2. If cw2 has no synonym in WT, then s2 = s1 and ω1 > ω2. In this paper, ω1 = 0.6 and ω2 = 0.4.
(v) When cw1 ∈ WT and cw2 ∈ WW, the similarity between cw1 and cw2 is calculated based on WT and denoted as s2. Then, the synonym set of cw1 is searched in WT, the similarity with cw2 is calculated based on WN, and the maximum value is denoted as s1. If cw1 has no synonym in WT, then s1 = s2 and ω2 > ω1. In this paper, ω1 = 0.4 and ω2 = 0.6.
Here, the calculation of word similarity based on WordNet combines three components with adjustable weights, as in equation (4): δ1s1(WW1, WW2) + δ2s2(WW1, WW2) + δ3s3(WW1, WW2), where s1(WW1, WW2) is the similarity calculated from the set of independent minimum semantic units, s2(WW1, WW2) is the similarity of the feature structure of the minimal semantic units of correlation, and s3(WW1, WW2) is the similarity of the feature structure of the relational sign. The parameters δm (1 ≤ m ≤ 3) are adjustable and satisfy δ1 + δ2 + δ3 = 1; in this paper, δ1, δ2, and δ3 are set to 0.6, 0.25, and 0.15, respectively. Equation (5) gives the sense similarity: when a word has multiple senses, equation (5) takes the maximum similarity over all combinations of senses as the similarity of the two words, where i is the sense index of word cw1 and j is the sense index of word cw2.
The calculation of word similarity based on WT is defined in equation (6), where dt(WW1, WW2) is the distance function between the word codes WW1 and WW2 in the tree structure, j is the total number of nodes in the branch layer (i.e., the number of direct child nodes of the nearest common parent node of the two words), and di represents the distance between the branches in which the two words are located under the nearest common parent node.
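
The sketch below illustrates only the combination logic of Step 2: a maximum over sense pairs (the role of equation (5)) and the weighted sum ω1s1 + ω2s2. NLTK's WordNet path similarity is used as a stand-in for the paper's WordNet measure, and the Wikitext similarity is passed in as a precomputed number, so the values are illustrative rather than those produced by equations (4)-(6).

```python
from itertools import product
from nltk.corpus import wordnet as wn  # assumes the WordNet corpus has been downloaded

def wordnet_similarity(word1, word2):
    """Maximum similarity over all sense pairs of the two words (the role of equation (5)),
    using NLTK path similarity as a stand-in for the paper's WordNet measure."""
    pairs = product(wn.synsets(word1), wn.synsets(word2))
    scores = [s1.path_similarity(s2) or 0.0 for s1, s2 in pairs]
    return max(scores, default=0.0)

def combined_similarity(word1, word2, wikitext_sim, w1=0.5, w2=0.5):
    """Weighted combination of the two resource similarities: w1 * s1 + w2 * s2."""
    return w1 * wordnet_similarity(word1, word2) + w2 * wikitext_sim

# Case (i): both words are assumed to appear in WordNet and Wikitext, so w1 = w2 = 0.5.
print(combined_similarity("anxiety", "stress", wikitext_sim=0.7))
```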

Step 3. The correlation of the association rules in rough data-deduction is measured by pointwise mutual information, defined as PMI(cw1, cw2) = log[p(cw1, cw2)/(p(cw1)p(cw2))], where cw1 and cw2 are two candidate keywords in the text, p(cw1, cw2) is the probability of cw1 and cw2 appearing in the same sentence, and p(cw1) and p(cw2) are the probabilities of occurrence of cw1 and cw2, respectively.
According to this correlation, the candidate keywords with direct correlation are determined: when PMI(cw1, cw2) ≠ 0, there is a direct correlation between cw1 and cw2, and cw1, cw2, and their correlation degree are stored in the correlation set. Meanwhile, the rough data-deduction relation D can be established according to the correlation degrees.
Then, by using the rules of rough data-deduction, I obtain the correlations between the other candidate keywords across the different equivalence classes, and these words and their correlation degrees are also stored in the correlation set.
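
A minimal sketch of the PMI computation in Step 3 is given below: the probabilities are estimated from sentence-level cooccurrence counts, and pairs with nonzero PMI form the direct part of the correlation set. The toy sentences are assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_correlations(sentences):
    """Pointwise mutual information between candidate words, estimated from
    sentence-level cooccurrence: PMI(a, b) = log(p(a, b) / (p(a) * p(b)))."""
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for tokens in sentences:
        unique = set(tokens)
        word_count.update(unique)
        pair_count.update(frozenset(p) for p in combinations(sorted(unique), 2))
    pmi = {}
    for pair, c in pair_count.items():
        a, b = tuple(pair)
        pmi[(a, b)] = math.log((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
    return pmi

sentences = [
    ["anxiety", "stress", "exam"],
    ["stress", "sleep"],
    ["anxiety", "exam"],
]
print(pmi_correlations(sentences))
```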

Step 4. According to the correlation set obtained in Step 3, a weighted candidate keyword graph is constructed. Then, according to the TextRank update equation, the weight of each candidate keyword is calculated iteratively until convergence.
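
The sketch below shows how the iteration of Step 4 can run over such a weighted graph: the correlation degrees from Step 3 are used as edge weights in the TextRank update. The damping factor, iteration count, and the toy correlation set are assumptions.

```python
from collections import defaultdict

def weighted_textrank(edges, d=0.85, iterations=50):
    """Iterate the TextRank update over a weighted candidate-keyword graph,
    where edge weights stand in for the correlation degrees of Step 3."""
    weight = defaultdict(dict)
    for (a, b), w in edges.items():
        weight[a][b] = w
        weight[b][a] = w
    out_sum = {v: sum(weight[v].values()) for v in weight}
    score = {v: 1.0 for v in weight}
    for _ in range(iterations):
        score = {
            v: (1 - d) + d * sum(score[u] * weight[u][v] / out_sum[u] for u in weight[v])
            for v in weight
        }
    return sorted(score, key=score.get, reverse=True)

# Toy correlation set: direct PMI links plus an assumed rough-deduction link.
edges = {("anxiety", "exam"): 0.8, ("anxiety", "stress"): 0.5, ("stress", "sleep"): 0.4}
print(weighted_textrank(edges))
```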

4. Experiment and Results Analysis

4.1. Experimental Data and Evaluation Criteria

The experiment selects 26300 questionnaires on the mental health management of undergraduates from Xinxiang University, covering 23 schools and 60 majors; each questionnaire covers psychological distress, depression, suicidal tendency, and a self-evaluation related to mental health of 300 to 1000 words. The undergraduates are distributed uniformly across grades. After excluding questionnaires whose self-evaluation is shorter than 300 words, 19000 valid questionnaires are obtained to test the effect of the method proposed in this paper. The questionnaires use silver ink with a metal oxide [26]. Ten keywords of each questionnaire are extracted and ranked by importance. In this paper, ω1 and ω2 are both set to 0.5.

In addition, for comparison, TextRank, keyword extraction using supervised cumulative TextRank (KESCT) [27], scientific research project TF-IDF (SRP-TF-IDF) [28], and high representation tags LDA (HRT-LDA) [29] are selected as baselines. Three evaluation indexes commonly used in classification are adopted to compare and evaluate the quality of the experimental results: precision (P), recall (R), and F1 (F). P is the accuracy of the extraction results, R is the coverage of the correct keywords by the extraction results, and F is a comprehensive evaluation index, the harmonic mean of P and R.
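
For reference, the three indexes can be computed per document as in the short sketch below; the example keyword lists are invented for illustration.

```python
def evaluate_keywords(extracted, gold):
    """Precision, recall, and F1 of one document's extracted keywords against the gold keywords."""
    extracted, gold = set(extracted), set(gold)
    hit = len(extracted & gold)
    p = hit / len(extracted) if extracted else 0.0
    r = hit / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(evaluate_keywords(
    extracted=["anxiety", "stress", "exam", "sleep"],
    gold=["anxiety", "stress", "depression"],
))
```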

4.2. Experimental Results

The experiment shows that two important parameters affect the keyword extraction result of the TextRank algorithm: the cooccurrence window size and the number of keywords, whereas the TF-IDF algorithm, which is based on statistical features, and the algorithm proposed in this paper are not affected by the cooccurrence window size. I set the number of extracted keywords to 10 and vary the cooccurrence window size within [4, 10]. The F1 values under different cooccurrence window sizes are shown in Figure 1.

It can be seen from Figure 1 that the TextRank algorithm yields different extraction effects under different cooccurrence window sizes. On the same test set, this paper compares the effect of different cooccurrence window sizes; when the cooccurrence window size is 5, the original TextRank algorithm achieves the best extraction effect with the highest F value. Therefore, in order to ensure the effectiveness of the proposed algorithm, the cooccurrence window size is set to 5.

The initial window value is set to 5, and P, R, and F are calculated with the number of keywords within [3, 10]. The calculation results are shown in Table 1.

At the same time, in order to observe the experimental results of the five algorithms conveniently, the P, R, and F values of the algorithms are plotted, as shown in Figures 2–4.

Figure 2 describes the variation trend of the precision of the five algorithms when extracting different numbers of mental health keywords. As can be seen from Figure 2, as the number of extracted mental health keywords increases, the precision of each algorithm decreases, but the precision of the algorithm proposed in this paper is always higher than that of the other four baselines. The TextRank algorithm based on rough data-deduction proposed in this paper integrates upper approximation information into the data-deduction process, so that mutual deduction between data exhibits approximate entailment or imprecise association, and the potential associations between candidate keywords can be mined. When these potential associations are added to the iterative calculation of the weight of each candidate keyword, more accurate extraction results can be obtained. Therefore, the precision of the proposed algorithm is theoretically higher, and its P value is higher than that of the other four baselines.

Figure 3 describes the change in the recall of the five algorithms when extracting different numbers of mental health keywords. In Figure 3, the recall of the algorithm proposed in this paper is higher than that of the other four baselines, and the recall increases with the number of mental health keywords. The SRP-TF-IDF algorithm relies too heavily on word frequency and does not use the correlation between words at all. The KESCT algorithm adopts the cooccurrence window principle; although the relation between words is considered, the algorithm tends to propose frequent words because of this limitation and may ignore important low-frequency words that describe the topics. In contrast, the rough data-deduction used in this paper expands the correlation range and enhances the coverage of correctly correlated keywords, thereby improving the recall of the algorithm. The influence of word frequency decreases as the number of keywords increases, so the advantage of the proposed algorithm becomes more obvious.

Figure 4 describes the F values of the five algorithms when extracting different numbers of mental health keywords. When evaluating the experimental results, both P and R are expected to be as high as possible; however, in most cases, the two values are in tension, so the F value is used to consider them jointly, which reflects the effectiveness of the whole algorithm. Keyword extraction based on rough data-deduction can theoretically mine the potential associations between candidate keywords, which increases the number of candidate words and the range of associations. Because these potential associations are added to the iterative calculation of the weight of each candidate keyword, the extraction results are more accurate, that is, the algorithm is also more effective.

According to the experimental results, the proposed algorithm has higher P and R than the other four baselines. F increases with P and R, and a higher F indicates a more effective algorithm. In conclusion, the precision, recall, and comprehensive evaluation index F1 of the proposed algorithm are higher than those of the four baselines, which indicates that the improved TextRank algorithm based on upper approximation rough data-deduction is more effective for mental health keyword extraction.

The test set of this paper consists of the self-evaluations in the questionnaires of undergraduates’ mental health. The text length is generally less than 500 words and is mainly concentrated in the 300–500 word range. This paper divides the test set by the number of words in the self-evaluation. For each subset, 30 texts of the corresponding length are randomly selected to compare the running time and physical memory occupation of the five algorithms.

As can be seen from Figure 5, when the number of words in the text is 300–400, the number of deduction and semantic calculations is small, and the running time of the algorithm is also short. The number of deduction and semantic calculations increases with the number of words. Compared with TextRank, the running time of the proposed method is similar, and it is still shorter than that of the other three baselines.

It can be seen from Figure 6 that the physical memory occupation of the proposed method is small with good efficiency. When the number of words in the text is 600–800 or 800–1000, the physical memory occupation of SRP-TF-IDF and KESCT is nearly the same.

This paper supports the management of undergraduates’ mental health through keyword extraction. The results show that academic problems, emotional problems, interpersonal problems, anxiety, sexual problems, and adaptation to college life are the universal mental health issues of undergraduates. Currently, how to deal with mental health crises is an urgent problem that colleges cannot avoid. The bounded area elimination algorithm proposed in [30] analyzes feature extraction, and its idea of feature extraction is similar to the TextRank keyword extraction algorithm proposed in this paper. Timely planning for mental health crises provides support and help to those who have experienced a personal crisis so that they can restore their mental balance and regain confidence in life.

5. Conclusions

In a fast-paced society, there is more competition among undergraduates; they face the dual pressure of enrollment and employment, so mental health is very important for them. Mental health is a necessary condition and foundation for everyone’s all-round development in today’s society, and it is also a necessary psychological quality for undergraduates. This paper introduces upper approximation-based rough data-deduction into the TextRank keyword extraction algorithm, and the extracted keywords are used in the management and planning of undergraduates’ mental health. The comparison experiments reveal that the proposed algorithm outperforms four baselines in terms of precision, recall, F1, running time, and physical memory occupation.

Future work is stated as follows. (i) The deduction rules of rough data will be further refined and improved to obtain better extraction effects. (ii) The words related to mental health may be incomplete in WordNet and Wikitext, which results in unsatisfactory keyword extraction; future research will consider using a mental-health-related corpus for keyword extraction.

Data Availability

All data used to support the findings of the study are included within the article.

Conflicts of Interest

The author declares no conflicts of interest.