Abstract
Online posts have gradually become a major carrier of network public opinion in social media, and social network hotspots are an important basis for the study of network public opinion. Therefore, it is significant to extract hotspots for monitoring Internet public opinion from the textual big data of online posts. However, current hotspot extraction methods focus on user features derived from textual big data that contains spam and low-quality content. Meanwhile, these methods seldom consider the time span of posts and the popularity of users. Accordingly, this article presents a hybrid solution for extracting hotspot information from the textual data of online posts. Firstly, a filtering strategy is designed to obtain higher-quality textual data. Secondly, a topic hot degree is defined by considering the average number of replies and the popularity of the participants. Thirdly, an improved co-word analysis technique is used to find posts on the same topic, and a bisecting k-means clustering algorithm that uses repliers’ popularity and key posts is designed for studying and monitoring the hotspots of online posts in a valid big data environment. Finally, the proposed algorithms are verified in experiments by extracting the hotspots of online posts from a dataset. The results show that the data filtering strategy helps to obtain more valuable information and decreases the computing time. The results also demonstrate that, compared with traditional methods, the proposed solution helps to obtain hotspots and the hot degree reflects the trend of online posts.
1. Introduction
With the rapid development of mobile communications and networks, the Internet increasingly integrates into our life. It is reported that there are now more than 4 billion Internet users around the world. Most Internet users spend an average of six hours surfing the Internet, and 3 billion people now use social media, such as Twitter, blogs, Bulletin Board System (BBS), and podcasts [1, 2]. It is known that online posts have gradually become an important tool in social media for the exchange of information. An increasing amount of public opinion is now spread by social media, especially through BBS [3–5]. Since hotspots directly reflect public opinion, studying and monitoring the hotspots of social media becomes more important for public affairs.
Social media has become one of the most important and popular carriers and distributors of the current online public opinion [6, 7]. Compared to the traditional public opinion channels, online posts have some unique features, such as a wider audience range, greater influence, faster propagation speed, and large amount of data [8–10]. For obtaining and monitoring public opinion hotspots, an increased number of studies focus on this field from different perspectives. In general, the current studies mainly use natural language processing, data mining technologies, machine learning, and other methods to monitor hotspots and explore propagation [11–14].
Currently, text data is still an important medium for information dissemination on social networks [15, 16]. To study complex dynamics in social networks, the extraction of hotspots from massive textual data becomes one of the important steps. On the one hand, current hotspot extraction methods simply collect user-related features and are mostly based on textual big data that contains spam, irrelevant, and low-quality content. In social media, there is a large amount of spam information [17–19], such as paid posters and fake replies, as shown in Figure 1. Advertising posts and replies are a good example in BBS. User features derived from such invalid or incomplete data can be very different from the real ones, especially for hotspots and public opinions. On the other hand, the main methods seldom take into account the time span of posts and the popularity of repliers. Firstly, the time span of posts is a significant factor in hotspot extraction, as hotspots of social networks are the collective action of users in a short time (for example, a collective response to an event in BBS). Secondly, it is reported that popular users (repliers and main posters) play a significant role in Internet public opinion [20]. However, few studies address these problems. Due to the complexity and features of social media such as BBS, monitoring of public opinion hotspots still faces the following challenges:
(1) How to obtain more valuable data by filtering a large amount of spam textual data
(2) How to find the key posts according to the association among multiple posts for the same topic
(3) How to search real hotspots by considering the valuable repliers and key posts
Accordingly, to solve these problems, a hybrid solution for extracting hotspot information from the textual data of online posts is proposed based on the features of users in social networks. The solution contains three main steps. Firstly, a textual data filtering strategy is used to obtain a more valid dataset, and an improved co-word analysis technique is used to find posts on the same topic. Secondly, a bisecting k-means clustering algorithm based on poster popularity and key posts is proposed to obtain the hotspots of online posts. Thirdly, the hot degree is proposed to find the real hotspots. The proposed methods are implemented in a real experiment, and the results demonstrate the effectiveness of the solution.
The rest of the paper is structured as follows. Section 2 discusses related studies on Internet public opinion and current challenges. Section 3 builds the mathematical model of online posts that is used for hotspot monitoring. Section 4 introduces the spam data filtering mechanism and the cluster hotspot monitoring based on PR values and the bisecting k-means algorithm. Section 5 presents the results of an experiment using the proposed methods and our dataset. Section 6 concludes the paper.
2. Related Work
In this section, we present existing studies on the public opinion analysis of BBS and monitoring hotspots. These studies are used as a basis for our work. We review related research from two aspects: network public opinion and hotspots.
2.1. Public Opinion
In [21], natural language processing and machine learning techniques are used to interpret sentimental tendencies related to users’ opinions and predict real events. In [22], a public opinion dynamics model for an online-offline social network context is provided, and the conditions to form a consensus in the proposed model are analyzed. In [23], the authors propose a method to recognize network public opinion leaders by using Markov logic networks, and a recognition system is designed and implemented. In [24], a cross-network public opinion spreading model is created in a combined social network environment. Two network nodes are assumed in this paper. In [25], the author constructs a dictionary monitoring sentiment computing model using text words and labels as the input parameters. In [26], a new method is provided for sentiment computing for news and events by constructing a word emotion association network. The authors provide a word emotion computation method to obtain initial words. These studies mostly focus on public opinion based on the assumption that the dataset is always valid.
2.2. Hotspots
Zhao et al. [27] present a Social Sentiment Sensor (SSS) system on Sina Weibo to detect daily hotspots and analyze sentiment distributions related to these topics. Clusters of topics that describe the same issue are formed and ranked based on popularity to exploit the resulting hotspots. In [28], the authors use a clustering method to obtain candidate topics on BBS and the evolution theory to calculate the heat of candidate topics and obtain hotspots based on it. Hao and Hu [29] propose a method based on a baseline model to solve the topic drift problem of network BBS. Liu and Li [30] adopt text mining approaches based on a vector space model and k-means clustering to group Internet public opinion hotspots. Li [31] uses an emotion analysis technology to analyze the emotional polarity of network BBS Chinese texts and a k-means algorithm and the SVM to cluster the contents of posts, considering each class as a hot topic. Chen et al. [32] design a similarity analysis algorithm of Internet public opinion based on information entropy, which can cluster and identify hotspots and crisis events.
The above studies provide a useful basis for this study, but there are some gaps that need to be filled, especially regarding the public opinion analysis of online posts. The unresolved issues are related to the validity and usefulness of data and the popularity of posters when used in the clustering of hotspots. Based on the above research, a data filtering strategy is introduced to improve the quality of data. Improved co-word analysis and bisecting k-means clustering algorithms are designed using time spans and popularity to obtain more accurate results.
3. Mathematical Model of Online Post
In this section, we first build a mathematical model of online posts based on their characteristics and then use it to study hotspots.
Let $S_{all}$ be the set of all posts during period t. Let $S_{valid}$ and $S_{invalid}$ be the sets of valid and invalid posts during period t, respectively. So, $S_{all}$ can be represented as follows: $S_{all} = S_{valid} \cup S_{invalid}$. (1)
Assume that there are m valid posts, so $S_{valid}$ can be expressed as $S_{valid} = \{s_1, s_2, \ldots, s_m\}$. Each post usually contains a set of first replies. Each first reply has its replying content, its replying time, and its second reply number (the number of second replies made to it). The details are shown in Figure 2.
For time period t, the set of all topics is $T = \{T_1, T_2, \ldots, T_n\}$, (2) where n is the size of the valid topic set T. Another topic set in period t is considered similar to T when the two sets have similar keywords.
It is known that all posts ($S_{all}$) during a period cover multiple topics; therefore, finding the posts relevant to one topic is the basis for extracting the hotspots. Each topic corresponds to a keyword set, which also contains sets of similar keywords. The valid keyword set of a topic contains its valid keywords.
We say that post s is relevant to $T_i$ when s contains the keywords of $T_i$. Equation (3) defines this relationship between a post and a topic.
According to (3), the relevant posts are the posts whose contents contain the keywords of the topic. Therefore, we can obtain the relevant post set of topic $T_i$ as $S(T_i) = \{ s \in S_{valid} \mid s \text{ is relevant to } T_i \}$. (4)
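As a minimal sketch of equations (3) and (4), the following Python snippet selects the posts relevant to one topic. The substring matching, the choice to require at least one keyword, and the names `is_relevant` and `relevant_posts` are assumptions made for illustration; the text only states that a relevant post contains the keywords of the topic.

```python
def is_relevant(post_text, keywords):
    """Assumed reading of equation (3): a post is relevant to a topic
    if its text contains at least one of the topic's keywords."""
    return any(kw in post_text for kw in keywords)


def relevant_posts(valid_posts, keywords):
    """Assumed reading of equation (4): the valid posts relevant to the topic."""
    return [p for p in valid_posts if is_relevant(p, keywords)]


# Hypothetical post texts, for illustration only.
posts = ["final exam schedule released", "new game patch notes", "exam tips"]
exam_posts = relevant_posts(posts, {"exam"})
print(exam_posts)
```

A stricter variant could require all keywords of the topic to appear, which would replace `any` with `all`.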
Definition 1. Post lifespan is the time between the first and the last reply of post s within the observed period, counted in whole days; hours and minutes are disregarded.
Definition 2. The reply number (the total number of replies of one post) is the sum of the first replies of post s and the second replies made to them.
Definition 3. Post participants (p) are the post creator and the repliers. The participation number of a post counts its participants, which include the first repliers and the second repliers.
Definition 4 (degree of participation (DoP)). The degree of participation of post s_{i} is the ratio of the reply number and the participant reply number.
Definition 5 (minimum number of replies (MNR)). The MNR is the lower limit on the number of replies of post s_{i}. The MNR aims to simplify the valid post dataset by excluding posts with no replies or only a few replies.
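Definitions 1 to 3 can be illustrated with a small data model. This is a sketch under stated assumptions: the class names `Post` and `FirstReply`, the field names, and the choice to measure the lifespan from the first-level reply times only are hypothetical and are not taken from the original formulas.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FirstReply:
    author: str
    time: datetime
    # One author name per second-level reply made to this first reply.
    second_reply_authors: list = field(default_factory=list)


@dataclass
class Post:
    creator: str
    first_replies: list


def lifespan_days(post):
    """Definition 1: whole days between the first and last reply,
    ignoring hours and minutes (dates are compared, not timestamps)."""
    dates = [r.time.date() for r in post.first_replies]
    return (max(dates) - min(dates)).days if dates else 0


def reply_number(post):
    """Definition 2: first replies plus all second replies."""
    return sum(1 + len(r.second_reply_authors) for r in post.first_replies)


def participants(post):
    """Definition 3: the creator plus the distinct first and second repliers."""
    people = {post.creator}
    for r in post.first_replies:
        people.add(r.author)
        people.update(r.second_reply_authors)
    return people


# Hypothetical post, for illustration only.
post = Post("alice", [
    FirstReply("bob", datetime(2018, 3, 1, 23, 50), ["carol", "bob"]),
    FirstReply("dave", datetime(2018, 3, 4, 0, 5)),
])
print(lifespan_days(post), reply_number(post), len(participants(post)))
```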
It is clear that different posters and repliers have different influences. To quantify such influence, we use the popularity of a participant. The popularity of a participant mostly depends on two factors: the frequency and the popularity of replies. The frequency can be expressed as the average number of participants in posts, and the popularity can be expressed as the number of replies per day or per post. Accordingly, the popularity of post participant p is expressed in terms of the following quantities: fr is the frequency of the posts of participant p, N is the total number of the discussion posts of p, TSP is the total number of repliers to all discussion posts of p, and a and b are the coefficients corresponding to the frequency and the total number of repliers, respectively.
We use different values to denote how much different posts contribute to a topic. The value of a post is mainly determined by two criteria: the average number of replies and the popularity of the participant. Accordingly, the post value in equation (10) combines these two criteria, each weighted by its own coefficient, where the two coefficients correspond to the average number of replies and the popularity, respectively.
Based on the above discussion, we can calculate the topic hot degree (HD) in equation (11) from the valid post set about topic T: it combines the total number of posts and the post values, each weighted by its own coefficient.
Definition 6. A hotspot is the topic with the maximum hot degree. For the topic set T, the problem of hotspot search becomes $\max_{T_i \in T} HD(T_i)$, (12) where $S(T_i)$ is the valid post set of topic $T_i$ and the constraint conditions in equation (12) restrict the admissible ranges of the posts, topics, and keywords, respectively.
Definition 7. A hot post is the post with the maximum post value. The hot post of topic T_{i} in equation (13) is the post in S(T_{i}) with the largest value. It is clear that the main problem related to formulas (12) and (13) is to obtain valid posts, replies, and keywords.
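To make the chain from participant popularity to hotspot selection concrete, the following Python sketch strings the quantities together. The linear weighted forms, the default coefficient values, and all function and parameter names are assumptions made for illustration; the text only names the inputs and the coefficients, so this should not be read as the exact formulas (9) to (13).

```python
def popularity(fr, n_posts, tsp, a=0.5, b=0.5):
    """Assumed linear form: weight a on the frequency fr and weight b on
    the average number of repliers per discussion post (TSP / N)."""
    repliers_per_post = tsp / n_posts if n_posts else 0.0
    return a * fr + b * repliers_per_post


def post_value(avg_replies, participant_popularity, alpha=0.5, beta=0.5):
    """Assumed linear form of the post value: average number of replies
    and participant popularity, each with its own coefficient."""
    return alpha * avg_replies + beta * participant_popularity


def hot_degree(post_values, gamma=0.5, delta=0.5):
    """Assumed linear form of the hot degree: number of valid posts of
    the topic and the sum of their values, each with its own coefficient."""
    return gamma * len(post_values) + delta * sum(post_values)


def hotspot(topic_to_values):
    """Definition 6: the topic with the maximum hot degree."""
    return max(topic_to_values, key=lambda t: hot_degree(topic_to_values[t]))


def hot_post(post_to_value):
    """Definition 7: the post with the maximum post value."""
    return max(post_to_value, key=post_to_value.get)


# Hypothetical values, for illustration only.
topics = {"T1": [1.0, 2.0], "T2": [6.0]}
print(hotspot(topics), hot_post({"s1": 1.0, "s2": 2.75}))
```

With other coefficient choices the ranking of topics can change, which is why the coefficients are kept as explicit parameters here.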
4. Spam Data Filtering Mechanism and Improved Cluster Hotspot Monitoring Algorithm
To determine the hotspots of BBS, we use a filtering mechanism to obtain more valuable data and an improved cluster hotspot monitoring algorithm to find hotspots. We focus on text data filtering, extracting keywords, constructing the common word matrix, and searching the hotspots and hot posts.
The main process involves the following steps: the identification of post spamming and fake replies to increase the value of the remaining posts and replies, the application of a TextRank-based keyword extraction algorithm to calculate the PageRank (PR) values of the candidate keywords, the determination of the keywords based on these PR values, and the construction of the co-word matrix for these keywords, as shown in Figure 3. Finally, we determine the hotspots by sorting the above results.
4.1. Post Filtering
Paid posts and invalid replies are well-known phenomena in BBS networks, and they distort the results of hotspot and hot post search. Accordingly, we must filter out invalid data. We adopt the following rules.
Rule 1 (degree of participation): if the degree of participation (DoP) of a post is above a predefined constant, we consider the post a paid post or spam post and delete it from the post set. The DoP is the ratio of the participant’s reply number and the total reply number of the post. Rule 1, formulated in equation (14), removes every post whose DoP exceeds the DoP constant.
Rule 2 (MNR): if the MNR of a post is less than a predefined constant, we consider the post an invalid post or spam post and delete it from the post set. The MNR is the ratio of the total reply number and the participant’s number of the post. Rule 2, formulated in equation (15), removes every post whose MNR is below the MNR constant.
We use Rules 1 and 2 to filter out paid or spam posts. In a real BBS, there are also many fake replies that are not related to the topic, such as advertising. Such replies must be deleted as well.
Rule 3 (lexical filtering): a predefined vocabulary set A is specified for topic T_{i}. If a reply text does not contain any element of A, it is considered invalid and is deleted from the reply set. The valid reply set F therefore consists of every reply f_{ij} in F_{all} that contains a predefined vocabulary element of A, where F_{all} is the set of all replies for topic T_{i} and f_{ij} is the jth reply text of topic T_{i}.
Algorithm 1 shows the details of the post text filtering for topic T_{i}. In Algorithm 1, the text data filtering can be divided into the following steps. Firstly, the relevant post set of T_{i} is obtained with equations (3) and (4). Then, according to Rules 1 and 2, the degree of participation and the minimum number of replies are used to delete the invalid posts that violate the constraints of (14) and (15). Moreover, using Rule 3, the invalid replies are selected and removed from the reply set F_{all}. Finally, the valid post and reply sets (S, F) are returned.
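The three rules can be sketched as one filtering pass. This is an illustration rather than a transcription of Algorithm 1: the dictionary layout of a post, the precomputed `dop` and `mnr` fields, the threshold names `dop_max` and `mnr_min`, and the substring test for Rule 3 are all assumptions.

```python
def filter_posts(posts, dop_max, mnr_min, vocabulary):
    """Apply Rules 1 to 3 and return the valid posts S and valid replies F."""
    valid_posts, valid_replies = [], []
    for post in posts:
        if post["dop"] > dop_max:   # Rule 1: DoP above the constant
            continue
        if post["mnr"] < mnr_min:   # Rule 2: MNR below the constant
            continue
        # Rule 3: keep only replies containing a predefined vocabulary element.
        kept = [r for r in post["replies"] if any(w in r for w in vocabulary)]
        valid_posts.append({**post, "replies": kept})
        valid_replies.extend(kept)
    return valid_posts, valid_replies


# Hypothetical posts with precomputed DoP and MNR values.
raw_posts = [
    {"id": 1, "dop": 0.9, "mnr": 5, "replies": ["buy cheap coins"]},
    {"id": 2, "dop": 0.3, "mnr": 1, "replies": ["exam?"]},
    {"id": 3, "dop": 0.3, "mnr": 6, "replies": ["the exam was hard", "visit my shop"]},
]
S, F = filter_posts(raw_posts, dop_max=0.8, mnr_min=3, vocabulary={"exam", "score"})
print([p["id"] for p in S], F)
```

Keeping the thresholds as parameters makes it easy to rerun the filter with the constants listed in an experiment's parameter table.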

4.2. TextRank-Based Keyword Extraction Algorithm
Based on Section 4.1, we can obtain valid posts and replies by adopting the filtering mechanism. In this section, we further extract keywords based on their text ranks. The main steps are as follows.
Step 1: we divide the text of a reply into a word list and order the words in the list, which gives list(V).
Step 2: after filtering the elements of list(V) according to the filtering rules, we obtain the list of candidate keywords.
Step 3: we use the following synonym processing rule to build candidate keywords. Rule 4 (synonym processing): let C be the synonym keyword set, in which each entry pairs a main word with its synonym set. If a word is a synonym, we replace it with the main word and then merge identical words to build the candidate keywords (list(X)).
Step 4 (build candidate keyword map): the candidate keyword map is G = (i, list(i)), where i is a candidate keyword and list(i) is the set of words co-occurring with i in the window. For a word j in list(i), the co-occurrence number between i and j is denoted as the weight $w_{ij}$.
Step 5 (iterate operation): we set the number of iterations (L). According to the PageRank algorithm [33, 34], we calculate the PageRank value (PR value) of each word and then sort the words in descending order of their PR values. The formula for the PR value is
$$PR(i) = (1 - d) + d \sum_{j \in list(i)} \frac{w_{ji}}{\sum_{l \in list(j)} w_{jl}} PR(j),$$
where PR(i) refers to the PR value of keyword i, j denotes the keywords co-occurring with i, l denotes the keywords co-occurring with j, and d is the damping coefficient.
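Steps 3 to 5 can be sketched in a few lines of Python. This is a minimal illustration, assuming a symmetric co-occurrence window, an initial PR value of 1.0 for every word, and the weighted TextRank update with damping coefficient d; the function names, the window size, and the sample words are hypothetical.

```python
from collections import defaultdict


def merge_synonyms(words, main_word_of):
    """Rule 4: replace each synonym with its main word."""
    return [main_word_of.get(w, w) for w in words]


def cooccurrence_graph(words, window=2):
    """Step 4: weight[i][j] counts how often j appears within `window`
    positions of i (counted in both directions)."""
    weight = defaultdict(lambda: defaultdict(int))
    for pos, w in enumerate(words):
        for other in words[pos + 1: pos + 1 + window]:
            if other != w:
                weight[w][other] += 1
                weight[other][w] += 1
    return {w: dict(nbrs) for w, nbrs in weight.items()}


def text_rank(weight, d=0.85, iterations=30):
    """Step 5: iterate the weighted PR update, then sort by descending PR."""
    pr = {w: 1.0 for w in weight}
    for _ in range(iterations):
        new_pr = {}
        for i in weight:
            total = 0.0
            for j in weight[i]:
                total += weight[j][i] / sum(weight[j].values()) * pr[j]
            new_pr[i] = (1 - d) + d * total
        pr = new_pr
    return sorted(pr.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical word list after segmentation and filtering.
words = merge_synonyms(
    ["exam", "score", "test", "college", "exam", "score"], {"test": "exam"}
)
graph = cooccurrence_graph(words)
ranked = text_rank(graph)
print(ranked)
```

The top n entries of `ranked` would then serve as the keywords used to build the common word matrix.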
4.3. Common Word Matrix for Obtaining the Same Hotspots and Hot Posts
Common word matrix: n keywords are selected according to their PR values (Section 4.2), and these n keywords form the keyword set. Each entry of the co-word matrix corresponds to the semantic distance between two keywords. The semantic distance between two keywords is computed from the number of co-occurrence events between them. The smaller the semantic distance between two keywords is, the more likely the two keywords belong to the same hotspot. Therefore, the common word matrix (CA) is the n × n matrix whose entry in row i and column j is the semantic distance between the ith and jth keywords.
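The construction of CA can be sketched as follows. The distance formula used here, 1 / (1 + co-occurrence count), is a stand-in chosen only because it decreases as two keywords co-occur more often; it is an assumption and not the distance formula of the original equation. Counting co-occurrence per post and the function name `coword_matrix` are likewise hypothetical.

```python
def coword_matrix(posts, keywords):
    """Build an n x n matrix of assumed semantic distances between keywords."""
    n = len(keywords)
    co = [[0] * n for _ in range(n)]
    for text in posts:
        # Indices of the keywords that occur in this post.
        present = [i for i, kw in enumerate(keywords) if kw in text]
        for i in present:
            for j in present:
                if i != j:
                    co[i][j] += 1
    # Assumed distance: more co-occurrence events give a smaller distance.
    return [
        [0.0 if i == j else 1.0 / (1 + co[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical posts and keywords, for illustration only.
sample_posts = ["exam score out", "exam score appeal", "game patch"]
CA = coword_matrix(sample_posts, ["exam", "score", "game"])
for row in CA:
    print(row)
```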
Searching the hotspots and hot posts: the common word matrix CA is transformed into a point set. Then, all the points are treated as the first cluster, and the first cluster is divided into two parts. At each subsequent step, the cluster whose split minimizes the SSE (sum of squared errors) value is selected and divided into two new clusters. This loop continues until the number of clusters equals the predefined number K.
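The bisecting loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: each row of the point set is a list of numbers, the inner split is an ordinary 2-means with a deterministic farthest-point seeding (a choice made here so the example is reproducible), and the helper names are hypothetical.

```python
def _mean(points):
    return [sum(c) / len(points) for c in zip(*points)]


def _sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _sse(points):
    center = _mean(points)
    return sum(_sq_dist(p, center) for p in points)


def _two_means(points, iterations=20):
    """Split one cluster into two parts with 2-means."""
    center = _mean(points)
    c1 = max(points, key=lambda p: _sq_dist(p, center))  # farthest from centroid
    c2 = max(points, key=lambda p: _sq_dist(p, c1))      # farthest from c1
    left, right = list(points), []
    for _ in range(iterations):
        left = [p for p in points if _sq_dist(p, c1) <= _sq_dist(p, c2)]
        right = [p for p in points if _sq_dist(p, c1) > _sq_dist(p, c2)]
        if not left or not right:
            break
        c1, c2 = _mean(left), _mean(right)
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly split the cluster whose split gives the lowest total SSE."""
    clusters = [list(points)]
    while len(clusters) < k:
        best = None
        for idx, cluster in enumerate(clusters):
            if len(cluster) < 2:
                continue
            left, right = _two_means(cluster)
            if not left or not right:
                continue
            others = sum(_sse(c) for i, c in enumerate(clusters) if i != idx)
            total = others + _sse(left) + _sse(right)
            if best is None or total < best[0]:
                best = (total, idx, left, right)
        if best is None:  # nothing left that can be split
            break
        _, idx, left, right = best
        clusters[idx:idx + 1] = [left, right]
    return clusters


# Hypothetical 2-D points standing in for rows of the point set.
points = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [21, 0]]
clusters = bisecting_kmeans(points, 3)
print(clusters)
```

Libraries such as scikit-learn also ship a bisecting k-means implementation, which would be the more practical choice for a large point set.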
The hotspots and keywords are obtained based on the above steps. Then, we can obtain the hot post using equation (10). By using the strategy explained in equation (11), we can get the hot degree of every topic. Finally, the hotspot and the hot post can be identified by sorting.
5. Experimental Results
Experiments were conducted to evaluate the performance of the proposed algorithm using a real dataset. The results of the experiment are used to analyze the proposed approach. This section covers the simulation parameters, setup, and results.
5.1. Dataset and Experiment Setup
Dataset: the dataset is gathered from three online post websites (W1 (Baidu Tieba post): https://tieba.baidu.com; W2: https://bbs.tianya.cn; W3: http://www.xici.net). The three websites are the most famous online BBS platforms in China and had more than 150 million active users in 2020. The post textual data was collected from these BBS and covers the whole year of 2018. Figure 4 shows a screenshot of an online post of W1, which is a classical online community post based on textual data.
To test the proposed strategies, we select three typical subjects (“Computer game” (S1), “Exam” (S2), and “Nanfang College” (S3)) from the above BBS websites. The data was obtained using a crawler. Meanwhile, data visualization software was designed for analyzing these textual data of BBS, and the data filtering algorithm was used in the software, as shown in Figure 5. The dataset obtained has 16,373 posts and 100,197 replies from January 1, 2018, to December 31, 2018. The parameters of the experiment are shown in Table 1.
5.2. Results’ Analysis
Valid posts and replies: the proposed data filtering mechanism is used to obtain valid data from the dataset. Figure 6 shows the results of the valid posts and replies of the above three subjects (S1, S2, S3). By using the filtering strategies, we can effectively delete spam posts and replies. Figure 6(a) compares the results of our filtering rules with the raw data for the posts of the different subjects. In S1, our mechanism decreases the number of posts by 109 and 513 by using Rules 1 and 2, respectively, compared to the unfiltered raw data. Accordingly, by using our filtering rules, more than 13% of the posts are identified as invalid. Figure 6(b) shows the results of filtering the replies of online posts in the different subjects by using Rule 3. Similarly, the method can effectively filter out invalid replies. In particular, Rule 3 filters out more than 30% of the replies as invalid.
The results demonstrate that the proposed filtering mechanisms can decrease the number of invalid posts and replies. Also, the filter can reduce datasets and improve the efficiency of searching for hotspots. Furthermore, the results show that the proposed data filtering algorithm affects the posts and replies of different subjects differently. In other words, the larger the scope of the subject, the larger the number of posts and replies. Subjects with a wide scope are more likely to have spam posts and replies.
Precision: to verify the precision of the data filtering algorithm, a part (10%) of the raw data of subject S3 is selected. Then, these BBS data are filtered manually and by the proposed data filtering algorithm, respectively. Figure 7 shows the precision of filtering the posts and replies of subject S3 with the proposed method on the different BBS websites. The results show that the precision of filtering posts is more than 92%, and the precision of filtering replies is larger than 85%. The results demonstrate that the proposed data filtering algorithm identifies spam posts and replies with good precision.
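The precision reported here can be read as the share of items removed by the algorithm that the manual review also marked as spam. A minimal sketch, assuming this standard definition (the text does not spell out the formula) and using hypothetical post IDs rather than the experimental data:

```python
def precision(flagged_by_algorithm, flagged_manually):
    """Share of algorithm-flagged items that the manual review also flagged."""
    flagged_by_algorithm = set(flagged_by_algorithm)
    if not flagged_by_algorithm:
        return 0.0
    true_positives = flagged_by_algorithm & set(flagged_manually)
    return len(true_positives) / len(flagged_by_algorithm)


# Hypothetical IDs of posts flagged as spam by each procedure.
algorithm_flags = {1, 2, 3, 4}
manual_flags = {2, 3, 4, 5}
print(precision(algorithm_flags, manual_flags))
```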
Computing time: computing time is an important metric to evaluate the performance of the data filtering algorithm. Therefore, the computing time needed to count the number of users in the different subjects is given for the raw and the filtered data, as shown in Figure 8. Among the three subjects, S1 requires the most computing time. The proposed filtering decreases the computing time by more than 15%. In other words, counting the number of users takes less time on the filtered data than on the raw dataset in all subjects. The results show that the presented strategy saves computing time by using the data filtering algorithm.
Hot degrees: we use hot degrees to search for hotspots and posts. When the hot degree of a post reaches 3, we consider it a hotspot. After calculating hot degrees and searching for hotspots, five hotspots were selected based on their hot degrees. Hot degrees of different topics can be obtained using the hot degree calculation method. Meanwhile, for the same five hotspots, the maximum numbers of posts and repliers are calculated in the same dataset. The results of the different metrics are shown in Figure 9. Figure 9(a) provides the hot degrees of different topics from the 90th to the 105th day. The values of the hot degrees of the five topics are 3.6, 5.6, 5.6, 11.75, and 5.08. It is noticeable that topic 4 is the hottest topic during this period. Figures 9(b) and 9(c) show the total numbers of posts and repliers in the same time period. It can be seen that topic T1 is the hotspot under the different metrics, and it has the highest values of the three metrics. Namely, the hot degree can reflect the hotspots of online posts.
The values of the hot degrees on subject S1 are obtained in different datasets from different websites. Figure 10 shows the hot degree results from the 5th to the 235th day related to topic 1. From Figure 10, it can be seen that topic 1 has three peaks at the 60th, 120th, and 190th days, and the hot degree follows the same trend on the three online post websites during the monitored period. The proposed hot degree can directly reflect fluctuation trends. Specifically, the hot degree can demonstrate the trend in terms of repliers and users, as our strategies merge post users and replies. In summary, the proposed method can effectively solve the social media hotspot problem.
6. Conclusion
Online posts have become public platforms for expressing personal opinion, so monitoring them and searching for online hot topics have gained more significance. Considering the weight of different users, the extraction of hotspots from massive textual data that contains spam becomes one of the important bases for studying the public opinion of social networks. By collecting and analyzing text information on online posts, current hotspots can be obtained. This article adopts a data filtering mechanism, common words, and clustering technology for online hotspot search, using a time span, poster popularity, and PR values. Then, the hot degree is used to evaluate the hotspots of online posts based on the number of replies and the popularity of the participant. The proposed methods are implemented and applied to a BBS dataset. The results show that the proposed method can effectively filter out invalid data, compress datasets, save computing time, and improve performance. At the same time, the results demonstrate that the proposed method and hot degree can also reflect changes in the trend of the hotspots of online posts.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflicts of interest.
Authors’ Contributions
HuiRu Cao conceived and wrote the manuscript; Songyao Lian analyzed the data and performed the experiments; ChouJun Zhan and Xiaomin Li analyzed the experimental results.
Acknowledgments
This work was supported by the Ministry of Education in China Liberal Arts and Social Sciences Foundation under Grant no. 20YJCZH004, Natural Science Foundation of Guangdong Province of China under Grant 2019A1515011346, Featured Innovation Projects of Guangdong Province universities of China under Grant no. 2019GKTSCX075, and National Science Foundation of China Project under Grant no. 61703355. This work was also partially supported by colleagues at the Department of Electronic Communication and Software Engineering of Sun Yat-sen University.