Abstract

Bots are now part of the social media landscape and, thus, a threat to cyber-physical-social systems (CPSSs). A better understanding of their characteristic behaviors and an estimation of their impact on public opinion could help improve the algorithms that identify bots and help develop strategies to reduce their influence. Our cosine function-based algorithm compares the similarity between tweets and restores the course of information circulation. Combined with the malicious features of an account, our method can effectively detect bots. We apply the SEIR model to tweets with the hashtag #Huawei 5G and divide trend propagation into four phases: formation, fermentation, explosion, and decay. Sentiment analysis reveals the change of emotion and opinion among normal users in different stages and the manipulation attempts of bots behind it. Experimental results show that bots have very limited influence on users' stances overall. In the early phase, bots can affect those who are neutral; their influence declines in later stages, and polarized views can hardly be changed.

1. Introduction

For the past few years, devices and systems have been transformed into smart connected ones, widely known as the Internet of Things (IoT) and cyber-physical systems (CPSs). Integrating social networks into CPSs results in a new paradigm called cyber-physical-social systems (CPSSs). CPSSs include human-to-device and device-to-device communications and create continuous interaction in human-device relationships [1]. In recent years, with the development of artificial intelligence technology, machine algorithms can automatically generate fake news stories, spam, misinformation, and other deceptive content. Bots are used to distribute misleading information over social media networks and serve as a tool for public opinion manipulation. By selling anxiety, creating conflicts, spreading rumors, and slandering, bots have been used to manipulate online discussion and shape public life. Their activities are also associated with spamming and harassment. Bots are regarded as initiators of current challenges to CPSSs, such as privacy concerns, ethical issues, safety, and security.

There are three types of bot accounts: hecklers, hacker bots, and honeypots. Hecklers attack opponents or promote their own interests by using analogous talking points. Hacker bots compromise the social media accounts of celebrities to post disinformation or gain access to private communications. Honeypot accounts are used to contrive friendships between normal users and bots, building trust through direct messaging or email conversations and then tricking users into making payments into fraudsters' accounts [2]. Thus, malicious social media bots are a cyber threat to information security and to our society.

Machine-learning algorithms can help identify bots. Their judgement is based mainly on the behavioral characteristics of accounts. To improve the efficiency of bot detection, we need a better understanding of bots' activity on social media and their actual impact on public opinion. Therefore, we choose Twitter, one of the most popular social media platforms in the world, as our sample. A recent Pew Research study shows that Twitter is by far the most used social platform among journalists [3]. While much reporting is influenced by opinions expressed in tweets, researchers estimate that 9% to 15% of active Twitter accounts are bots [4]. Automated accounts can have an outsized impact: the 500 most active bots shared 22% of the total links on Twitter [5]. Our research focuses first on distinguishing between bots and normal users, and second on bots' activity in the debate over the hot topic #Huawei 5G and their impact on public opinion.

2. Related Work

2.1. Feature-Based Bot Detection

Compared with normal users, bots show great differences in behavior patterns, in social networks, and in the way they disseminate information. Using extracted features to train a bot detection classifier is the current major practice. Features fall into three categories: text content, behavior patterns, and relationships to others. The choice of features has a direct impact on classifier quality. Researchers worldwide are looking to optimize algorithms through better feature classification.

Content-based bot detection may be accomplished in two ways: (1) through text similarity calculation. Automation controls a large number of fake accounts, and the appearance of the same or similar tweets or hashtags at the same instant in large numbers is a typical act of bots. (2) Through specific content in postings, such as URLs. Bots often induce users to visit a target website by sending them advertisements, phishing, or pornographic information containing URLs that point to the same target. Content-based bot detection methods generally involve natural language processing, text similarity calculation, text sentiment analysis, word segmentation, and other technologies. Andriotis and Takasu [6] presented a supervised approach to detect automated accounts on Twitter using four datasets that contain users' metadata, content, and sentiment features. Kumar et al. [7] proposed a neural network ensemble of Text CNN and LSTM models with BERT embeddings to classify tweets as bot tweets or not based on their textual content. Heidari et al. [8] created new features based on the textual information of online comments; this new set of sentiment features is extracted from a tweet's text and used to train bot detection models. The advantage of content-based bot detection is that the trained model can quickly and effectively distinguish bots with known characteristics. The disadvantage is that bots can evade detection by disguising URLs, changing text generation templates, and using automated tools to produce large numbers of texts with different words but similar meanings.

Behavior-based bot detection depends on abnormal behavioral features such as abnormal posting time distribution, frequency, and time series. Methodologically, pattern matching, time series analysis, user behavior modeling, and statistical analysis are used. Chu et al. [9] proposed a detection system consisting of two main components, a client-side logger and a server-side classifier, to distinguish bots from humans. Dorri et al. [10] presented a semisupervised collective classification technique that combines the structural information of the social graph with information on the social behavior of users in a unified manner to detect social botnets in a Twitter-like SNS. Behavior-based bot detection is suitable for bots with typical abnormal behaviors; it has higher robustness and is hard for bots to evade.

SNS-based bot detection relies on uncovering abnormal connections among accounts. Bots do not have normal social relationships like humans; their social networks differ in local structure and attributes. Social network-based bot detection involves technologies such as network construction, social network analysis, complex networks, graph mining, community detection, network representation, and cluster analysis. Sengar et al. [11] compiled activity and profile information of users on Twitter and used NLP and supervised machine learning to classify accounts. Zhang et al. [12] proposed a rectified linear postsynaptic potential function for spiking neurons and a spike-timing-dependent BP-learning algorithm for DeepSNNs. Their model is based on statistical features and user bidirectional voting; the statistical features include bidirectional propagation between trusted users and neighbor nodes. Social network-based bot detection can discover both individual bots and collaborative armies of bots.

2.2. Information Diffusion in the SEIR Model

The Susceptible-Infective-Removed (SIR) model is one of the most widely used models for information diffusion research in social networks. Many researchers have devoted themselves to improving the classic SIR model in different respects. Rui et al. [13] proposed a Susceptible-Potential-Infective-Removed (SPIR) model that analyzes the diffusion process in discrete time through simulation. Sang and Liao [14] proposed a novel information dissemination model for mobile social networks based on the traditional SIR model (SEIRD) to capture the evolution of information over time. Jia et al. [15] distinguished two propagation channels of rumor spreading on social networks, proposed an improved SIR model, and established the corresponding mean-field equation. Fu et al. [16] investigated the dynamics of competitive information diffusion over a connected social network using a modified SIR model and found that innovators and a larger network degree can help enlarge the coverage of information among the population but cannot help one piece of information compete with another.

3. Data Collection and Methods

3.1. Data Sets

We crawled all tweets tagged with the hashtag #Huawei 5G from November 1, 2019 to December 31, 2019 through the Twitter API and obtained 22,950 samples. The data include content features (Table 1) and account features (Table 2).
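As a hedged illustration only, a crawl of this kind might look as follows with the Tweepy library. The credentials are placeholders, and note that the standard search endpoint only returns recent tweets, so collecting a fixed 2019 window would in practice require full-archive (premium or academic) search access.

```python
import tweepy

# Placeholder credentials; replace with real keys.
auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

tweets = []
for status in tweepy.Cursor(api.search_tweets,
                            q='"#Huawei 5G"',
                            lang="en",
                            tweet_mode="extended").items(1000):
    tweets.append({
        "id": status.id,
        "time": status.created_at,
        "text": status.full_text,
        "user": status.user.screen_name,
        "followers": status.user.followers_count,   # account features
        "friends": status.user.friends_count,
    })
```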

3.2. Filtering Nonoriginal Tweets by Cosine Function-Based Algorithm

Since there is a lot of repetitive content, lexical and semantic filtering is necessary. We use the cosine similarity algorithm to analyze the similarity of contents and eliminate duplications. Cosine similarity compares two tweets in terms of content. Each tweet is represented by an attribute vector, A or B, and the string similarity is given by the dot product and vector lengths as follows:

$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}},$$

where $A_i$ and $B_i$ represent the components of vectors $A$ and $B$. The similarity is measured by the cosine of the angle between the vectors. When the two vectors point in the same direction and overlap, the value is 1, indicating that the contents of the two tweets are identical; when the angle is 90°, the value is 0, meaning that the contents are independent. When the value is greater than 0.8, a tweet is regarded as a semantic duplicate and is eliminated.
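A minimal sketch of this deduplication step, using scikit-learn's TF-IDF vectorizer and cosine similarity. The 0.8 threshold follows the text; the TF-IDF representation and tokenization defaults are our assumptions, and the full pairwise matrix is quadratic in the number of tweets, which is acceptable for a sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(tweets, threshold=0.8):
    """Drop tweets whose cosine similarity to an earlier kept tweet exceeds the threshold."""
    vectors = TfidfVectorizer().fit_transform(tweets)   # one TF-IDF vector per tweet
    sim = cosine_similarity(vectors)                    # pairwise cosine matrix
    keep = []
    for i in range(len(tweets)):
        # keep tweet i only if no already-kept tweet is a near-duplicate
        if all(sim[i, j] <= threshold for j in keep):
            keep.append(i)
    return [tweets[i] for i in keep]

unique = deduplicate(["Huawei 5G is rolling out",
                      "Huawei 5G is rolling out!",
                      "Ban 5G now"])
```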

3.3. Bot Identification Based on Text Content and Account Feature

Bots are problematic because they manipulate information, spread misinformation, promote unverified information, and adversely affect public opinion. To detect bots, we first start from behavioral characteristics to find suspicious accounts and then combine them with malicious account characteristics to identify bots.

First, we exclude tweets with identical content; then, we sort tweets with a cosine similarity greater than 0.6 by posting time to find the earliest atomic sentence. Based on that, the later added statements are shown on a timeline and the organizational development of the content is restored. Next, we mark the time point at which replies increase rapidly and label the statement that caused this change as the key sentence. Finally, we trace the account that first posted the key sentence. The more often the key sentence repeats, the larger its value, and the more suspicious the account. The flow is shown in Figure 1 and sketched in code below.
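A simplified sketch of this flow, under stated assumptions: it reuses a pairwise similarity matrix like the one above, the 0.6 grouping threshold comes from the text, and the reply-spike heuristic (flagging the group with the largest jump in reply counts) is our own simplification of "replies increase rapidly".

```python
from collections import defaultdict

def find_suspicious_account(tweets, sim, threshold=0.6):
    """tweets: list of dicts with 'time', 'text', 'user', 'replies';
    sim: pairwise cosine similarity matrix over the same tweets."""
    # Group near-duplicate tweets (similarity > 0.6) around their earliest instance.
    order = sorted(range(len(tweets)), key=lambda i: tweets[i]["time"])
    groups = defaultdict(list)
    for i in order:
        root = next((r for r in groups if sim[r][i] > threshold), i)
        groups[root].append(i)

    # The key sentence belongs to the group whose reply counts jump the most.
    def reply_jump(members):
        counts = [tweets[i]["replies"]
                  for i in sorted(members, key=lambda i: tweets[i]["time"])]
        return max((b - a for a, b in zip(counts, counts[1:])), default=0)

    key_root = max(groups, key=lambda r: reply_jump(groups[r]))
    # The account that first posted the key sentence is the most suspicious;
    # the group size measures how often the key sentence was repeated.
    return tweets[key_root]["user"], len(groups[key_root])
```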

Considering that accounts that cause a large number of retweets could be either organized bots or influential opinion leaders, further detection is needed. We extract user account characteristics including all postings, original postings, video postings, fan-following ratio, number of replies, number of likes, number of favorites, length of registration, and whether the account is certified. We propose a logistic regression function to perform machine learning on the extracted user features to detect bots.
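The account-feature classifier can be sketched with scikit-learn's logistic regression. The column names mirror the features listed above but are hypothetical, as are the scaling and train/test split choices.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column names mirroring the account features listed above.
FEATURES = ["all_posts", "original_posts", "video_posts", "fan_following_ratio",
            "replies", "likes", "favorites", "registration_days", "certified"]

df = pd.read_csv("accounts.csv")            # hypothetical labeled account table
X, y = df[FEATURES], df["is_bot"]           # 1 = bot, 0 = normal user
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")
```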

3.4. Information Diffusion on Hot Topics According to the SEIR Model

The SEIR model of infectious disease dynamics is divided into four stages. Applied to information diffusion, they are: Stage S (formation period), in which the news spreads on the Internet but has not yet triggered public attention; Stage E (fermentation period), in which the public gradually forms opinions, attitudes, and emotions about the event; Stage I (outbreak period), characterized by drastic fluctuation, a large volume of information, and clear emotional tendencies; and Stage R (dissipation period), in which public attention decreases and the heat of the topic vanishes.

Most social networks are scale-free, so a scale-free model reflects reality better than one based on random networks. We use the Fitness model as the archetype, with $m_0$ bots in the network at the beginning. In each additional time interval, a new bot with rank $m$ joins, and an existing bot $i$ can connect to the new bot. The probability of connection depends on the rank and adaptability of the existing bot, as follows:

$$\Pi_i = \frac{\eta_i k_i}{\sum_j \eta_j k_j},$$

where $\Pi_i$ is the probability of connection, $k_i$ is the rank, and $\eta_i$ is the adaptability. Let $N(t)$ be the total number of bots at time $t$, equal to $m_0$ plus $t$. Although infectious diseases and information do not spread in exactly the same way, they share some similarities: both need connections and disseminators. The Susceptible-Exposed-Infective-Recovered-Susceptible (SEIRS) model is used to describe information diffusion. Let $S(t)$, $E(t)$, $I(t)$, and $R(t)$ be the numbers of bots at time $t$ in each state: S has not been contacted; E may have interest in the topic; I has joined the discussion; R has quit the issue. Their transformation is shown in Figure 2.
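To make the network construction step above concrete, here is a minimal numpy sketch of fitness-model growth: each new node attaches to m existing nodes with probability proportional to the product of their rank (degree) and adaptability (fitness). The value of m, the uniform fitness distribution, and the seed configuration are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def grow_fitness_network(n_nodes, m=3):
    """Fitness-model growth: attachment probability ~ degree * fitness."""
    degree = np.ones(m)                 # small seed of m nodes, degree 1 each (assumed)
    fitness = rng.random(m)             # adaptability drawn uniformly (assumed)
    edges = []
    for new in range(m, n_nodes):
        weights = degree * fitness      # eta_i * k_i
        targets = rng.choice(new, size=m, replace=False,
                             p=weights / weights.sum())
        for t in targets:
            edges.append((new, t))
            degree[t] += 1
        degree = np.append(degree, m)   # the new node arrives with rank m
        fitness = np.append(fitness, rng.random())
    return edges, degree, fitness
```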

Let $\beta$, $\sigma$, $\gamma$, and $\xi$ denote the triggering rates of bots in the different stages. The triggering rate of a bot depends on the topic and on the rank of the other bots in the same scale-free network.

$\beta$ is the transition probability from S to E, and $\lambda$ is the infection rate of the topic. Suppose the total number of bots in E induced by bot $i$ is proportional to its rank; in one time unit, this number can be expressed as $\lambda k_i S(t)/N(t)$.

Averaging over the network, the model of information diffusion on scale-free social networks can be expressed as follows:

$$\begin{aligned}
\frac{dS}{dt} &= \xi R(t) - \beta \frac{S(t)I(t)}{N(t)},\\
\frac{dE}{dt} &= \beta \frac{S(t)I(t)}{N(t)} - \sigma E(t),\\
\frac{dI}{dt} &= \sigma E(t) - \gamma I(t),\\
\frac{dR}{dt} &= \gamma I(t) - \xi R(t).
\end{aligned}$$
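To illustrate the stage dynamics, the SEIRS system above can be integrated numerically; this is a sketch only, and all parameter values here are illustrative assumptions, not values fitted to the #Huawei 5G data.

```python
import numpy as np
from scipy.integrate import odeint

def seirs(y, t, beta, sigma, gamma, xi, N):
    S, E, I, R = y
    dS = xi * R - beta * S * I / N      # quitters lose interest, become susceptible again
    dE = beta * S * I / N - sigma * E   # contacted users become potentially interested
    dI = sigma * E - gamma * I          # interested users join the discussion
    dR = gamma * I - xi * R             # participants quit the issue
    return dS, dE, dI, dR

N = 10_000                                           # assumed network size
t = np.linspace(0, 60, 600)                          # sixty days
y0 = (N - 10, 0, 10, 0)                              # ten initial spreaders (assumed)
S, E, I, R = odeint(seirs, y0, t, args=(0.5, 0.3, 0.2, 0.05, N)).T
```

Plotting I(t) against t reproduces the qualitative formation-fermentation-outbreak-dissipation shape described above.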

4. Results and Analyses

4.1. Experimental Environment

The experimental hardware environment is shown in Table 3.

4.2. Trend in the SEIR Model

We analyze the topic “Huawei 5G” on Twitter according to the SEIR model. The timeline is shown in Table 4.

We used natural language processing to classify tweet content in the four stages into three tendencies: positive (blue), negative (green), and neutral (yellow). Figure 3 shows the trend of emotional tendencies over the four stages.
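The paper does not specify the sentiment tool used; as one hedged possibility, a lexicon-based classifier such as NLTK's VADER can map each tweet to the three tendencies. The ±0.05 cut-offs are VADER's conventional defaults, not values taken from this study.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def tendency(text, cutoff=0.05):
    score = sia.polarity_scores(text)["compound"]   # compound score in [-1, 1]
    if score >= cutoff:
        return "positive"
    if score <= -cutoff:
        return "negative"
    return "neutral"

print(tendency("Huawei 5G rollout looks promising"))   # -> positive
```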

This shows that (1) neutral tweets were the most numerous in all phases; (2) opinions are most divergent in Stage I, where the battle of views drives the discussion to its peak; (3) in Stage E, participation drops slightly compared with the beginning, as the public waits for further information to form its own judgement; thus, Stage E is an important time point for manipulating or guiding public opinion; and (4) in Stage R, neutral netizens quit the discussion while dissenters stay, indicating that social network users with a clear emotional orientation are more willing to follow the topic.

4.3. Verification of Bot Detection

We manually marked 10,179 accounts in the dataset. An account considered a bot is marked as 1, otherwise as 0. The dataset is used for algorithm training and performance verification.

We used TP (true positives), FN (false negatives), FP (false positives), and TN (true negatives) to evaluate the prediction model on 2,000 accounts. TP indicates that the model correctly detected a bot; TN indicates that the model correctly identified a normal user; FP indicates that the model incorrectly detected a normal user as a bot; and FN indicates that the model incorrectly detected a bot as a normal user. Accuracy tells how often the model was correct overall; precision tells how good the model is at predicting a bot account; recall tells how many of the bot accounts the model was able to detect. The calculation formulas are shown in Table 5.
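For reference, the standard definitions behind Table 5 are:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$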

In this experiment, the values of TP, TN, FP, and FN are 347, 1637, 6, and 10, respectively, which yields an accuracy of 99.2% ((347 + 1637)/2000), a precision of 98.3% (347/353), and a recall of 97.2% (347/357). This indicates that the bot detection model used in the experiment has high accuracy and can provide support for further experiments.

4.4. Bot Activities

First, we counted the number of tweets over the whole process and visualized them as shown in Figure 4.

We can see that (1) bot activities are found in all four stages; (2) among the total 22,950 tweets, 19,111 (83.27%) were from normal users and 3,839 (16.73%) were from bots; and (3) the trend peaked at 18:01:27 on December 5, 2019, when 437 tweets were posted, 86% from normal users and 14% from bots.

Figures 5–8 show the bot activities at each stage:

We can see that (1) bots were most active in the S and R stages, contributing 18.54% and 18.74% of all tweets, about 2% more than the average, while normal users were more active in the E and I stages, posting 85.57% and 84.31% of all tweets; this shows a clear difference between their behaviors; and (2) bots were highly involved in the early discussion and promoted the formation of the hot topic. When the heat of the topic drops, bots rush back into the discussion and try to reverse the trend.

4.5. Bot Influence on Public Opinion

In the period from November 1, 2019 to December 31, 2019, we collected 22,950 tweets in total. We categorized them by affective tendency into three groups: red for affirmative, yellow for negative, and green for neutral. The emotional tendencies of normal users and bots are shown in Figure 9.

Visualizing the data of Figure 9 as a parallel coordinate chart yields Figure 10. For the same color, the dark line represents the affective tendency of normal users, while the light line shows public opinion including bots as a whole. The fact that the two lines are almost parallel means that the influence of bots is minimal.
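A parallel-coordinate comparison of this kind can be reproduced with pandas; the stage values in this sketch are placeholders, not the measured sentiment shares.

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Placeholder sentiment shares per stage; replace with the measured values.
df = pd.DataFrame({
    "group": ["normal users", "users + bots"],
    "S": [0.42, 0.44], "E": [0.40, 0.41], "I": [0.38, 0.40], "R": [0.36, 0.37],
})
parallel_coordinates(df, "group", color=["#1f4e79", "#9ecae1"])  # dark vs. light
plt.ylabel("share of neutral tweets")   # one tendency shown for simplicity
plt.show()
```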

To better understand the difference between bots and humans in sentiment tendencies, we show them as percentages. The light colors indicate normal users and the dark colors indicate bots (Figure 11).

We assume that bots only perform tasks and do not change their minds, while normal users do. Normal users with neutral attitudes continuously decline from Stage E onward, while those with opposing attitudes increase, indicating that neutrals become opponents. There are two possible reasons for the change: first, approving bots decreased by 19.73% from Stage S to Stage E; second, negative bots increased significantly from Stage E to Stage I. Normal users with a supportive attitude remain stable: they are less influenced by external opinion and stick to their own views.

5. Conclusions

Bots are automated accounts used to engage on social media. They are often blamed for opinion manipulation on divisive issues because they can spread information rapidly and amplify specific content strategically. Working in unison, they can maximize impact and give the illusion of large-scale consensus.

Our study focuses on improving bot detection techniques and evaluating how bots diffuse discussions on social media. First, we used a cosine function-based algorithm to judge the similarity between texts and extract content-related features. Combined with malicious account features in a machine learning process, we are able to identify bots efficiently. Then, we applied the SEIR model from epidemiology to trend analysis. A controversial topic generally experiences formation (S), fermentation (E), explosion (I), and decay (R) of a trend. Our question is whether and to what degree bots can influence users' perspectives. Bots have very limited influence on people's views if we take the trend as a whole, but their influence varies across stages. In the early phase, bots can affect users who take a neutral position; bots can win over the amorphous middle by setting up an echo chamber around them. By contrast, late-moving birds catch no worm, because the influence of bots declines in later stages.

Not all bots engage in public opinion manipulation. From the sentiment analysis, we can see that even within the bot group, neutral ones are the majority. A possible explanation is that those neutral tweets were posted by good bots and fake good bots. The latter post a combination of promotion, criticism, and general content in order to better imitate a human user and thus evade detection methods. Good bots often identify themselves clearly as bots and tweet useful information and the latest news. Making good bots more influential on social media networks is one way to combat malicious automation.

Bots exist on all kinds of social media platforms. Current research focuses mainly on Twitter bots. One reason for this is the unwillingness of other platforms to share data on account activity, which makes it difficult for researchers to analyze message frequency, networks, or employ other techniques to identify bots. Some bot behavior may be universal, but there are platform-specific characteristics that deserve more attention.

Data Availability

The data supporting the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was part of the project “Computational Communication Strategy for Chinese Technology Image on Twitter” (no. 20CXW016) which is supported by National Social Science Fund of China. This work was also supported by the National Natural Science Foundation of China (no. 62102049) and the Open Fund of Advanced Cryptography and System Security Key Laboratory of Sichuan Province (Grant no. SKLACSS-202110).