Special Issue: Data-driven Modeling and Dynamic Analysis of Complex Networks: Applications to Social Networks
Modeling the Interaction Networks about the Climate Change on Twitter: A Characterization of its Network Structure
This work studies the interaction networks (replying, retweeting, and quoting) that arise on Twitter in relation to such a relevant topic as climate change. We detected that the largest connected component of these networks presents low values of average degree and betweenness, as well as a small diameter compared to the total number of nodes in the network. The largest connected component of retweeting and quoting networks also exhibits very low negative assortativity. The quoting and retweeting networks have a more hierarchical structure than the replying network. We also find that the process of emergence of new links in the interaction networks can be properly modeled (with high accuracy) through a Support Vector Machine model using the embeddings provided by the Node2Vec algorithm. A Random Forest model using certain similarity measures as explanatory variables between nodes also provides high accuracy. In addition, we analyze the communities existing in each interaction network by means of the Louvain method. The cumulative probability distributions of hashtags per community are also examined.
Many real systems can be characterized as graphs, where nodes symbolize objects and links represent the relations between them. Social networks, which consist of individuals and their relations and whose analysis has drawn interest from several fields, can be studied as graphs.
Some investigations propose methods for classifying interactions on social networks according to emotions, or frameworks for categorizing or comparing approaches that use social context [2–4]. Link prediction is also a relevant matter, since it allows the identification of hidden links from the observable part of the interaction networks or the anticipation of future links from the current network topology. Comprehensive reviews exist that analyze and discuss the state of the art of link prediction on social networks, covering node-based, topology-based, and social-theory-based metrics. Additionally, several growth models have been proposed in the literature [6, 7]. Barabási and Albert’s growth model, also known as the preferential attachment model, has frequently been used to generate scale-free networks and has provided a foundation for understanding the mechanisms that give rise to certain properties in various real networks. Other investigations propose that degree is not the only key factor influencing the growth of scale-free networks, since each node also has a “fitness” that symbolizes its propensity to attract links [9, 10]. However, research indicates that social networks are at best weakly scale-free and exhibit different characteristics. Of the 251 social networks analyzed in that research, half lack any direct evidence (the power law is itself a good model of the degrees) or indirect evidence (the power-law distribution is not necessarily a good model of the degrees, but it is a relatively better model than the alternatives) of scale-free structure (50% not scale-free), while indirect evidence is slightly less prevalent (41% superweak).
This research aims to examine the interaction networks on Twitter which are concerned with climate change. In these networks, the nodes are Twitter users, and the links are the exchanges that exist between them. A total of 631,027 tweets were analyzed. The objectives of this investigation are to describe the structure and to characterize the link formation process in replying, retweeting, and quoting networks. Additionally, we aim to discover if within these networks there are communities of users with common interaction patterns. To our knowledge, this has not been done previously in relation to a matter such as climate change.
2. Materials and Methods
2.1. Overview of Used Resources
2.1.1. T-Hoarder Tool
For downloading data from Twitter, the T-Hoarder tool  was utilized. It is a software program that is able to perform tweet crawling and data filtering and display summary information about Twitter activity on a specific topic. The tool provides two APIs (Application Programming Interfaces) to download data: the Rest API and the Streaming API. The first works in a synchronous way, with the restriction of searching for data from the previous week. The second makes it possible to carry out real-time data downloading.
T-Hoarder runs on UNIX and is written in Python. Using T-Hoarder, the following data can be collected for each tweet: tweet ID, timestamp, author, text, app, author ID, followers’ author, following author, statuses author, location, URL, geolocation, name, bio, URL media, type media, and lang. A more detailed description of these fields can be found in the Supplementary Materials Document.
A total of 839,968 tweets were downloaded from Twitter from March 2021 to May 2021.
2.1.2. Software Programs
Several programs in Python and R were developed to carry out the following functions:
(i) Data handling, which was performed using the R language and the pandas library in Python.
(ii) Network characterization and graph handling, which were performed using the Networkx package in Python and the igraph package in R.
(iii) Modeling, which was carried out using the scikit-learn and StellarGraph packages in Python. The caret, LiblineaR, and e1071 packages were also used in R. The similarities between nodes were estimated using the link prediction package in R.
(iv) Community detection, which was performed using CDLIB.
The Gephi platform was used to draw the interaction networks .
2.2. Overview of Used Methods
2.2.1. Obtaining the Tweets
A total of 839,968 tweets were downloaded from Twitter using the T-Hoarder tool from March to May 2021. The tweets include messages produced by Twitter users as well as their interactions (retweeting, replying, and quoting). Only tweets in the English language are considered. To decide on the most appropriate keywords for filtering out tweets unrelated to climate change, a small group of individuals was gathered to perform the selection. If more than 15 keywords were proposed, they were grouped using an affinity diagram and then filtered using a multiple voting system. In the end, seven keywords were taken: global warming, greenhouse effect, climate change, climate crisis, climate disaster, climate emergency, and climate action, each both with and without spacing in order to cover hashtags. A total of 589,272 retweets, 94,084 replies, and 156,612 quotes were obtained. Additionally, a filtering step was executed to eliminate possible duplicates arising from the way the data were acquired (Streaming and Rest APIs). After this, 631,027 tweets remained.
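The keyword filtering and duplicate removal described above can be sketched as follows. This is only an illustration under stated assumptions: the record fields `tweet_id` and `text` and the helper names are hypothetical, and the actual filtering was performed with T-Hoarder and the authors' own programs.

```python
import re

# The seven keywords reported in the text.
KEYWORDS = [
    "global warming", "greenhouse effect", "climate change", "climate crisis",
    "climate disaster", "climate emergency", "climate action",
]


def keyword_variants(keywords):
    """Return each keyword with and without spaces (the latter matches hashtags)."""
    variants = set()
    for keyword in keywords:
        variants.add(keyword)
        variants.add(keyword.replace(" ", ""))
    return variants


# Case-insensitive pattern matching any variant.
PATTERN = re.compile(
    "|".join(re.escape(v) for v in sorted(keyword_variants(KEYWORDS))),
    re.IGNORECASE,
)


def filter_and_deduplicate(tweets):
    """Keep on-topic tweets and drop repeated tweet IDs, preserving order."""
    seen = set()
    kept = []
    for tweet in tweets:
        if tweet["tweet_id"] in seen:
            continue  # the same tweet was already collected (e.g. by both APIs)
        if PATTERN.search(tweet["text"]):
            seen.add(tweet["tweet_id"])
            kept.append(tweet)
    return kept


# Hypothetical records for illustration only.
sample = [
    {"tweet_id": 1, "text": "We need #ClimateAction now"},
    {"tweet_id": 1, "text": "We need #ClimateAction now"},
    {"tweet_id": 2, "text": "Global warming is accelerating"},
    {"tweet_id": 3, "text": "Nice weather today"},
]
kept = filter_and_deduplicate(sample)
print([t["tweet_id"] for t in kept])
```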
2.2.2. Building the Interaction Network
We use graphs to study the interaction networks on climate change on Twitter. For each tweet, node1 is the author’s ID and node2 is the ID of the user with whom the author interacts. Because a single tweet can contain interactions with several users, as many links are generated as interactions exist. The types of interaction are replying, quoting, and retweeting. The generated graphs are undirected and unweighted. One graph is created for each kind of interaction, giving three graphs in total.
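As a minimal sketch of this construction, the snippet below builds one undirected, unweighted NetworkX graph per interaction type. The tuple layout `(author_id, target_id, kind)` and the function name are assumptions made for illustration; they are not taken from the authors' programs.

```python
import networkx as nx


def build_interaction_graphs(interactions):
    """Build one undirected, unweighted graph per interaction type."""
    graphs = {kind: nx.Graph() for kind in ("reply", "retweet", "quote")}
    for author_id, target_id, kind in interactions:
        # Adding an existing link again has no effect in an unweighted nx.Graph.
        graphs[kind].add_edge(author_id, target_id)
    return graphs


# Hypothetical interactions: (node1 = author, node2 = user interacted with, type).
interactions = [
    ("u1", "u2", "reply"),
    ("u1", "u3", "reply"),    # one tweet replying to two users yields two links
    ("u2", "u1", "reply"),    # same pair in the other direction: no new link
    ("u4", "u1", "retweet"),
    ("u5", "u1", "quote"),
]
graphs = build_interaction_graphs(interactions)
for kind, graph in graphs.items():
    print(kind, graph.number_of_nodes(), graph.number_of_edges())
```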
The main statistical network parameters are shown in Table 1. These are the number of nodes and links, as well as the maximum and the mean degree. Information on the total connectivity of the network and the size of the giant component (GC) is also included. Various characteristics of the GC are analyzed, such as the number of nodes and links, diameter ($d$), average path length ($\ell$), average degree ($k$), betweenness centrality ($bc$), and assortativity coefficient ($r$). These metrics are defined as follows.
(i) The degree of a node $l$, $k(l)$, for an undirected graph $G$, such as an interaction network on Twitter, is [20, 21]
$$k(l) = \sum_{j=1}^{N} a_{lj},$$
where $a_{lj}$ is the element of the adjacency matrix: $a_{lj} = 1$ if node $l$ is connected to node $j$ and 0 otherwise. $N$ symbolizes the total number of nodes in the network.
(ii) The betweenness centrality of node $l$ in $G$, $bc(l)$, is [21, 22]
$$bc(l) = \sum_{s \neq l \neq t} \frac{\sigma_{st}(l)}{\sigma_{st}},$$
where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(l)$ symbolizes the number of those paths that pass through $l$.
(iii) The degree assortativity is defined as a Pearson correlation between the “expected degree” distribution $q_k$ and the “joint degree” distribution $e_{j,k}$ in $G$. The first distribution represents the probability of passing through the links of $G$ and discovering nodes with degree $k$ at the termination of the links. The second distribution symbolizes the probability of a link having degree $j$ on one termination and degree $k$ on the other. In the undirected case, the normalized Pearson coefficient of $e_{j,k}$ and $q_k$ provides the assortativity coefficient of the network, which can be described as [23, 24]
$$r = \frac{1}{\sigma_q^{2}} \left[ \sum_{j,k} jk \, e_{j,k} - \mu_q^{2} \right].$$
Here $\mu_q$ and $\sigma_q$ are the expected value or mean and the standard deviation of $q_k$. If a network is perfectly assortative ($r = 1$), then its nodes join only with other nodes of analogous degree.
(iv) The average path length of $G$ is described as the average number of links that must be traversed along the shortest path between any pair of nodes $i$ and $j$. Let $d(i,j)$ be the number of links in the shortest path between $i$ and $j$. If it is considered that $d(i,j) = 0$ when no path between $i$ and $j$ exists, then $\ell$ can be described as
$$\ell = \frac{1}{N(N-1)} \sum_{i \neq j} d(i,j).$$
(v) The diameter of $G$, $d$, symbolizes the length (in number of links) of the longest geodesic path between any two nodes.
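The metrics above can be computed on the giant component with NetworkX, as in the following sketch. The toy graph and the function name `giant_component_metrics` are illustrative assumptions. NetworkX returns normalized betweenness by default, which may differ from the normalization used for the values reported in Table 2.

```python
import networkx as nx


def giant_component_metrics(G):
    """Compute the structural metrics of the largest connected component of G."""
    gc_nodes = max(nx.connected_components(G), key=len)
    gc = G.subgraph(gc_nodes).copy()
    degrees = [degree for _, degree in gc.degree()]
    betweenness = nx.betweenness_centrality(gc)  # normalized by default
    return {
        "nodes": gc.number_of_nodes(),
        "links": gc.number_of_edges(),
        "diameter": nx.diameter(gc),
        "avg_path_length": nx.average_shortest_path_length(gc),
        "avg_degree": sum(degrees) / len(degrees),
        "avg_betweenness": sum(betweenness.values()) / len(betweenness),
        "assortativity": nx.degree_assortativity_coefficient(gc),
    }


# Toy graph: a hub linked to four leaves, plus a small separate component.
G = nx.star_graph(4)
G.add_edge(10, 11)
metrics = giant_component_metrics(G)
for name, value in metrics.items():
    print(name, value)
```

A star is a strongly disassortative structure (the hub only joins low-degree nodes), so the assortativity coefficient of this toy graph is negative.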
2.2.3. Modeling the Interaction Network
(1) Overview of the Process. The process of link formation between pairs of nodes is modeled using two mechanisms:
(i) The first procedure uses the Node2Vec algorithm. Node2Vec is a semisupervised method for scalable feature learning in networks, which optimizes a custom graph-based objective function using the Stochastic Gradient Descent method. The algorithm provides a feature representation that maximizes the likelihood of preserving network neighborhoods of nodes in an $s$-dimensional feature space. A 2nd-order random walk approach is used to produce the network neighborhoods for each node. To generate a feature representation of links for two nodes $u$ and $v$, the authors of the algorithm define a binary operator $\circ$ over the corresponding feature vectors $f(u)$ and $f(v)$ with the purpose of generating a representation $g(u,v)$ such that $g: V \times V \rightarrow \mathbb{R}^{s'}$, where $V$ is the set of nodes and $s'$ is the space dimension for the pair $(u,v)$. The operators are defined for any pair of nodes, even if a link does not exist between them. All operators produce link embeddings that have the same dimensionality as the input node embeddings. Then, given any two nodes $u$ and $v$ and their feature vectors $f(u)$ and $f(v)$, the operators are defined componentwise as follows:
    (i) Hadamard: $[f(u) \circ f(v)]_i = f_i(u) \cdot f_i(v)$
    (ii) l1: $[f(u) \circ f(v)]_i = |f_i(u) - f_i(v)|$
    (iii) l2: $[f(u) \circ f(v)]_i = |f_i(u) - f_i(v)|^{2}$
    (iv) Average: $[f(u) \circ f(v)]_i = \frac{f_i(u) + f_i(v)}{2}$
(ii) The second procedure uses certain similarity indexes. In this method, analogously to previous work, we calculated local, quasi-local, and global similarity metrics between nodes. Specifically, the local measures were resource allocation, Leicht-Holme-Newman, common neighbors, cosine, cosine similarity on L+, hub promoted, Jaccard, hub depressed, preferential attachment, and Sørensen. The global similarity measures used were average commute time, Katz, L+ directly, matrix forest, and random walk with restart. Finally, the following quasi-local similarity measures were used: graph distance and local path. These indexes are explained in detail in the Supplementary Materials Document.
As input variables, the model has either the characteristics obtained for each link through the Node2Vec algorithm or the similarity indexes between pairs of nodes. The output variable of the model indicates whether or not a link exists between a pair of nodes (its value is 1 or 0). Using supervised learning on examples in which both the input and output variables are known, the model predicts the value of the output for new inputs, that is, for cases not used in the training process.
Cross-validation is applied to generate the model. Instead of splitting the dataset once into training and testing subsets, this mechanism divides the dataset into $k$ equal partitions (folds). The model is trained $k$ times: each time, one of the partitions is chosen as the test set, and the model is trained with the remaining $k-1$ folds. Each fold is used once as a test set. Hence, at the end, there are predictions for the whole dataset, and $k$ evaluations $E_1, \ldots, E_k$ of any parameter $E$ determining the efficiency of the model exist. This parameter can be averaged:
$$\bar{E} = \frac{1}{k} \sum_{i=1}^{k} E_i.$$
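The four binary operators of the first procedure can be written in a few lines of NumPy. This is a minimal sketch: the function name `link_embedding` and the example vectors are illustrative, and in the study the node vectors come from Node2Vec.

```python
import numpy as np


def link_embedding(f_u, f_v, operator):
    """Combine two node embeddings into a link embedding of the same dimension."""
    if operator == "hadamard":
        return f_u * f_v
    if operator == "l1":
        return np.abs(f_u - f_v)
    if operator == "l2":
        return (f_u - f_v) ** 2
    if operator == "average":
        return (f_u + f_v) / 2.0
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical 3-dimensional node embeddings.
f_u = np.array([1.0, -2.0, 0.5])
f_v = np.array([3.0, 1.0, 0.5])
for name in ("hadamard", "l1", "l2", "average"):
    print(name, link_embedding(f_u, f_v, name))
```

All four operators are symmetric in the two nodes, which suits the undirected graphs used here.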
In this investigation, we consider , , , , and . All these parameters are defined in Section 2.2.2.
Finally, an independent evaluation of the previously indicated parameters is performed utilizing a validation set.
As a dataset to which the cross-validation procedure was applied, we took an amount corresponding to 75% of the total links (t75). The same value for nonexistent links between unconnected pairs of randomly selected nodes was considered (t75). As a validation set, 25% of the total number of links (t25) was contemplated and an analogous value (t25) was taken for nonexistent links.
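The construction of the two datasets can be sketched as below: the same number of nonexistent links as existing links is sampled at random, and both sets are split 75%/25%. The random toy graph, the seed, and the helper names are assumptions made for illustration.

```python
import random

import networkx as nx


def sample_non_links(G, n_samples, rng):
    """Randomly sample distinct pairs of nodes that are not joined by a link."""
    nodes = list(G.nodes())
    non_links = set()
    while len(non_links) < n_samples:
        u, v = rng.sample(nodes, 2)
        if not G.has_edge(u, v):
            non_links.add(tuple(sorted((u, v))))
    return sorted(non_links)


def split_75_25(pairs, rng):
    """Shuffle the pairs and split them into a 75% part and a 25% part."""
    pairs = list(pairs)
    rng.shuffle(pairs)
    cut = int(0.75 * len(pairs))
    return pairs[:cut], pairs[cut:]


rng = random.Random(0)
G = nx.gnm_random_graph(30, 40, seed=0)  # toy graph with 30 nodes and 40 links
links = sorted(tuple(sorted(edge)) for edge in G.edges())
non_links = sample_non_links(G, len(links), rng)

cv_links, val_links = split_75_25(links, rng)        # existing links (label 1)
cv_non, val_non = split_75_25(non_links, rng)        # nonexistent links (label 0)
print(len(cv_links), len(val_links), len(cv_non), len(val_non))
```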
(2) Obtaining Embeddings through Node2Vec. Node embeddings are calculated through Node2Vec in such a manner that nodes which are close in the graph remain close in the embedding space. It is a two-stage process: first, random walks are run on the graph to obtain context pairs; second, these walks are used to train a Word2Vec model. To compute the embeddings, we use the StellarGraph package in Python, and several parameters must be specified:
(i) p, which controls the probability in a walk of going back to the node from which one comes. Values in [0.1, 2] were taken in steps of size 0.1.
(ii) q, which manages the probability of exploring undiscovered parts of the graph. Values in [0.1, 2] were taken in steps of size 0.1.
(iii) dimensions, which determines the dimensionality of the Node2Vec embeddings, that is, the size of the feature vector.
(iv) num. walks, which defines the number of walks made from each node. Values in [0.1, 10] were chosen in steps of size 0.5.
(v) walk length, which symbolizes the length of each random walk. Values in [19, 38] were chosen in steps of size 5.
(vi) window size, which specifies the context window size for Word2Vec. Values in [39, 40] were selected in steps of size 1.
(vii) num iter, which represents the number of SGD iterations (epochs) to run. Values in [38, 41] were selected in steps of size 1.
To optimize the hyperparameters, we calculate the closeness centrality (or nearness centrality) of all nodes in $G$. This parameter can be defined as
$$C(l) = \frac{1}{\sum_{j} d(l,j)}.$$
In the above formula, $\sum_{j} d(l,j)$ is named farness centrality, and $d(l,j)$ symbolizes the number of links in the shortest path between nodes $l$ and $j$.
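To illustrate the role of p and q in the first stage, the following sketch generates second-order biased random walks in pure Python. It is not the StellarGraph implementation used in the study. The function names and the toy graph are assumptions, and the second stage (training Word2Vec on the walks) is not shown.

```python
import random

import networkx as nx


def node2vec_walk(G, start, walk_length, p, q, rng):
    """Generate one second-order biased random walk starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = list(G.neighbors(current))
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)      # return to the previous node
            elif G.has_edge(candidate, previous):
                weights.append(1.0)          # stay close to the previous node
            else:
                weights.append(1.0 / q)      # move to an unexplored region
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(G, num_walks, walk_length, p, q, seed=0):
    """Run `num_walks` walks from every node, visiting nodes in random order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(G.nodes())
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(G, node, walk_length, p, q, rng))
    return walks


G = nx.karate_club_graph()  # toy graph standing in for an interaction network
walks = generate_walks(G, num_walks=5, walk_length=50, p=1.0, q=1.0)
print(len(walks), len(walks[0]))
```

With p = q = 1, the values reported as selected in the text, the walk reduces to an unbiased random walk.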
Considering the embeddings obtained for each node , according to each selected hyperparameter option, we calculate the metric , which can be described asIn the above formula, is the Euclidean distance between the vectors corresponding to and nodes. Finally, the correlation between and is obtained.
The Pearson correlation or the Spearman correlation is used, depending on whether or not both variables exhibit a normal distribution. The normality of the distribution is checked using the Anderson-Darling test with a significance level $\alpha$. The considered hypotheses are as follows:
(i) $H_0$: “the samples derive from a normal distribution.”
(ii) $H_1$: “the samples do not derive from a normal distribution.”
If the $p$ value is lower than $\alpha$, $H_0$ must be rejected and $H_1$ is taken; otherwise, $H_0$ must be accepted.
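The choice between the two correlation coefficients can be sketched with SciPy as follows. Two points are assumptions: the 5% level is used only for illustration, and `scipy.stats.anderson` reports critical values rather than a p value, so the decision below compares the statistic with the critical value at that level. The function names and data are hypothetical.

```python
import numpy as np
from scipy import stats


def is_normal(sample, level_percent=5.0):
    """Anderson-Darling normality check at the given significance level (%)."""
    result = stats.anderson(sample, dist="norm")
    index = list(result.significance_level).index(level_percent)
    # Normality is rejected when the statistic exceeds the critical value.
    return result.statistic < result.critical_values[index]


def correlation(x, y):
    """Use Pearson if both samples look normal, Spearman otherwise."""
    if is_normal(x) and is_normal(y):
        value, _ = stats.pearsonr(x, y)
        return "pearson", value
    value, _ = stats.spearmanr(x, y)
    return "spearman", value


# Hypothetical, strongly skewed variables that are monotonically related.
x = np.exp(np.linspace(0.0, 5.0, 200))
y = x ** 2
method, value = correlation(x, y)
print(method, value)
```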
One combination of hyperparameters for which a correlation greater than 0.9 is obtained was selected. We use p = 1.0, q = 1.0, dimensions = 64, num. walks = 5, walk length = 50, window size = 10, and num iter = 1.
It must be noted that if a new node is added to the network, the execution of Word2Vec is required on the whole graph to generate the new embeddings.
(3) Obtaining Similarities between Nodes. As we have previously indicated, to calculate the similarity between nodes, the link prediction package in R is used. The similarity indexes are described in the Supplementary Material Document.
2.2.4. Building the Model
Three models were tested: Random Forest (RF) , Logistic Regression (LR) , and Support Vector Machines (SVM)  model. Each model is optimized by 5-fold cross-validated grid search over a parameter grid in order to find the best parameters for each one.
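A 5-fold cross-validated grid search with an independent validation set can be sketched with scikit-learn as shown below. The synthetic features stand in for link embeddings, and the parameter grid is a hypothetical example rather than the grid used in the study.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for link features: label 1 = link, label 0 = no link.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=1.0, scale=1.0, size=(100, 8)),
    rng.normal(loc=-1.0, scale=1.0, size=(100, 8)),
])
y = np.array([1] * 100 + [0] * 100)

# 75% for cross-validation, 25% kept aside as the validation set.
X_cv, X_val, y_cv, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid, for illustration only.
param_grid = {"C": [0.01, 1.0, 100.0], "gamma": [0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_cv, y_cv)

val_accuracy = search.score(X_val, y_val)
print(search.best_params_, val_accuracy)
```

The same pattern applies to the RF and LR models by swapping the estimator and its grid.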
The link prediction package is only available in R language. As a result, we implement the model in R, if the similarity indexes are used as explanatory variables, whereas if the embeddings are obtained from Node2Vec, it is developed in Python language.
In Python, the hyperparameters tuned for the RF model are as follows: the number of trees in the forest (n_estimators), the maximum depth of each tree (max_depth), and the minimum number of samples required to split an internal node (min_samples_split). In R, the hyperparameters are as follows: the number of trees in the forest (num.trees) and the number of variables randomly sampled as candidates at each split (mtry); the minimal node size (min.node.size) is equal to 1. Gini is taken as the splitting rule in both R and Python.
For LR and SVM models, the hyperparameters are identical in R and Python. For the LR model, they are the inverse of regularization strength (Cs) and an L1 or L2 penalty term (penalty). For the SVM model, the inverse of regularization strength (C) and the kernel coefficient (gamma) are used. The parameter grids are as follows:
(i) RF model (Python): n_estimators: [100, 150, 300], max_depth: [3, 5, 10, None], and min_samples_split: [2, 5, 10]
(ii) RF model (R): num.trees: [100, 150, 300], mtry: [2, 5, 10], and max_depth: none (the caret package with the ranger method, which we used to build the RF, does not allow us to specify limits for max_depth)
(iii) LR model (Python and R): Cs: [1, 5, 10, 20], penalty: [l1, l2]
(iv) SVM model (Python and R): C: , gamma:
To measure the performance of the model, the following metrics are computed:
(i) Area under curve (AUC). The Receiver Operating Characteristic (ROC) curve is a probability curve where each point symbolizes a true positive rate (TPR)/false positive rate (FPR) pair corresponding to one decision threshold. TPR and FPR can be described as
$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}.$$
In the above formulas, TP, TN, FP, and FN symbolize the true positives, the true negatives, the false positives, and the false negatives, respectively. If ROC(t) is the function associated with the ROC curve, the area under curve (AUC) can be denoted as
$$\text{AUC} = \int_{0}^{1} \text{ROC}(t) \, dt.$$
The AUC allows us to estimate the performance of the classifier by establishing its ability to discriminate between classes, where 0 indicates that there is no link between a pair of nodes and 1 indicates that a link between a pair of nodes exists.
(ii) Accuracy, in binary classification such as the one we are dealing with, symbolizes the proportion of correct predictions (both true positives (TP) and true negatives (TN)) among the total number of cases examined. It is computed as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
(iii) Sensitivity or Recall, which represents the ability of the classifier to correctly identify the positive samples (1: a link between a pair of nodes exists), can be defined as
$$\text{Sensitivity} = \frac{TP}{TP + FN}.$$
(iv) Specificity or Selectivity, which symbolizes the ability of the classifier to identify a negative sample (no link exists between the nodes), is defined as
$$\text{Specificity} = \frac{TN}{TN + FP}.$$
(v) Precision is defined as
$$\text{Precision} = \frac{TP}{TP + FP}.$$
(vi) F1 score can be interpreted as the harmonic mean of the precision, that is, the ability of the classifier not to label as positive a sample that is actually negative, and the recall, that is, the ability of the classifier to find all the positive samples. This score is in the range [0, 1] and it is defined as
$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
(vii) Geometric mean (GMean) measures the balance between classification performances in both the majority and minority classes. A low GMean indicates poor performance in the classification of the positive cases even if the negative cases are correctly classified. This score is defined as
$$\text{GMean} = \sqrt{\text{Sensitivity} \cdot \text{Specificity}}.$$
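The threshold-based metrics above follow directly from the four confusion-matrix counts, as in this short sketch. The function name and the counts are hypothetical. The AUC is omitted because it requires the classifier scores at every threshold.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    gmean = math.sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "gmean": gmean,
    }


# Hypothetical counts for a link/no-link classifier.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```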
2.2.5. Community Analysis
One of the most analyzed features of large-scale networks is their modular structure. In this context, a community is a dense subnetwork, that is, a set of densely connected nodes within a larger network. These conglomerates can be revealed using the information encoded in the network topology. To analyze the presence of communities in the replying, retweeting, and quoting interaction networks, various algorithms are evaluated.
(1) Louvain Algorithm. This method consists of a modularity optimization phase and a community aggregation phase. In the first stage, each node is initially allocated to its own community. After that, the modularity gain $\Delta Q$ obtained by removing node $l$ from its community and placing it in the community of its neighbor $j$ is estimated. If a gain in modularity exists, $l$ is moved to this new community; otherwise, it remains in its original community. This mechanism is repeated for all nodes in the network. The gain is computed as
$$\Delta Q = \left[ \frac{\Sigma_{in} + 2 k_{l,in}}{2m} - \left( \frac{\Sigma_{tot} + k_l}{2m} \right)^{2} \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^{2} - \left( \frac{k_l}{2m} \right)^{2} \right],$$
where $\Sigma_{in}$ symbolizes the sum of all the weights of the links inside the community that $l$ is moving into. $\Sigma_{tot}$ represents the sum of all the weights of the links to the nodes of the community to which $l$ moves. $k_l$ is the weighted degree of $l$. $k_{l,in}$ is the sum of the weights of the links between $l$ and other nodes in the community that $l$ is moving into. $m$ is the sum of the weights of all links in the network. If the network is unweighted, the weight of each of its links is 1.
In a second stage, a new network is constructed in which the nodes are the communities obtained in the previous stage. The two stages are repeated until modularity cannot be increased further .
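The two-stage procedure is available in NetworkX, which is used here for a minimal sketch; the study itself reports using CDLIB. The toy graph and the seed are assumptions made for illustration.

```python
import networkx as nx

# Toy graph standing in for an interaction network; links are treated as unweighted.
G = nx.karate_club_graph()

# Both stages are repeated internally until modularity cannot be increased further.
communities = nx.community.louvain_communities(
    G, weight=None, resolution=1.0, seed=42
)
Q = nx.community.modularity(G, communities, weight=None)

print(len(communities), "communities, modularity =", round(Q, 3))
```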
(2) Leiden Algorithm. It is based on the Louvain method. This procedure introduces a refinement phase in addition to the modularity optimization and community aggregation phases, making it slightly more complex . Analogously to the Louvain algorithm, this algorithm also begins by allocating each node to a community. After that, individual nodes are moved from one community to another to obtain a gain in modularity. The next step involves the refinement of the individual communities found in the previous step. This refined partition is obtained as follows.
Initially, the refined partition $\mathcal{P}_{\text{refined}}$ is set to a singleton partition, in which each node is in its own community. The algorithm then locally merges nodes in $\mathcal{P}_{\text{refined}}$: nodes that are on their own in a community in $\mathcal{P}_{\text{refined}}$ can be merged with a different community. It should be noted that mergers are performed only within each community of the partition $\mathcal{P}$ obtained in the first stage. In addition, a node is joined to a community in $\mathcal{P}_{\text{refined}}$ only if both are sufficiently well connected to their community in $\mathcal{P}$.
After the refinement step, a community of the first stage might be split into multiple subcommunities, which ensures well-connected communities. After that, the network is aggregated based on the refined partition. These steps are repeated until no further improvement in modularity can be made.
(3) Label Propagation Algorithm (LPA). It operates as follows: first, the network is initialized so that each node is allocated a unique label. Then, every node adopts the label carried by the greatest number of its neighbors. If more than one label is carried by the same maximum number of neighbors, one of them is selected at random. After several iterations, the same label tends to be associated with all members of a conglomerate. LPA reaches convergence when each node has the majority label of its neighbors.
(4) Surprise Algorithm. [39, 41, 53] The authors propose a different global performance measure, named “Surprise,” to evaluate the computation of conglomerates. They establish the community structure of a network by computing the distributions of intra- and intercommunity links with a cumulative hypergeometric distribution. The method assumes that a null model exists in which links between nodes emerge at random. The departure of the observed partition from the expected distribution of nodes and links into conglomerates is measured against this null model. The following cumulative hypergeometric distribution is used:
$$S = -\log \sum_{j=p}^{\min(M, n)} \frac{\binom{M}{j} \binom{F - M}{n - j}}{\binom{F}{n}},$$
where $F$ is the maximum possible number of links in a network ($F = N(N-1)/2$), with $N$ being the number of nodes. $n$ is the observed number of links, $M$ is the maximum possible number of intracommunity links for a specific partition, and $p$ is the total number of intracommunity links observed in that specific partition.
This parameter makes it possible to estimate the exact probability of the distribution of links and nodes among the established communities in the network for a specific partition.
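The Surprise value of a given partition can be evaluated directly from the cumulative hypergeometric sum, as in the sketch below. The function name and the example are illustrative, and the natural logarithm is an assumption, since the base is not stated in the text.

```python
import math


def surprise(n_nodes, n_links, community_sizes, intra_links):
    """Surprise of a partition from the cumulative hypergeometric distribution."""
    F = n_nodes * (n_nodes - 1) // 2                      # maximum possible links
    M = sum(s * (s - 1) // 2 for s in community_sizes)    # maximum intracommunity links
    n, p = n_links, intra_links
    total = math.comb(F, n)
    tail = sum(
        math.comb(M, j) * math.comb(F - M, n - j)
        for j in range(p, min(M, n) + 1)
    )
    # -log(tail / total), written as a difference so large integers are handled.
    return math.log(total) - math.log(tail)


# Toy example: two triangles joined by one link, split into the two triangles.
# N = 6 nodes, n = 7 links, of which p = 6 are intracommunity links.
S = surprise(n_nodes=6, n_links=7, community_sizes=[3, 3], intra_links=6)
print(S)
```

In this example $F = 15$ and $M = 6$, so the sum has the single term $\binom{6}{6}\binom{9}{1}/\binom{15}{7} = 9/6435$.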
The four aforementioned algorithms have been chosen because they have been shown to be effective in several forms of research [39, 50, 55, 56] and because, for the analyzed network, their execution is completed in a short time (less than 2 minutes), when run on a computer with the following characteristics:(i)Processor: 11th Gen Intel(R) Core(TM) i7-1160G7 @ 1.20 GHz 2.11 GHz(ii)RAM: 16.0 GB
Three metrics are calculated to assess the performance of the community detection algorithms:
(i) Modularity, which evaluates the strength of divisions. A large modularity represents dense connections between nodes within communities and sparse connections between nodes located in different communities. The modularity $Q$ of a partition is defined as
$$Q = \frac{1}{2m} \sum_{i,j} \left[ a_{ij} - \gamma \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$
where $m$ is the number of links, $a_{ij}$ is the element of the adjacency matrix of $G$, $k_i$ and $k_j$ are the degrees of nodes $i$ and $j$, and $\delta(c_i, c_j)$ is equal to 1 if $i$ and $j$ are in the same community and 0 otherwise. $\gamma$ is a resolution parameter. If it is lower than 1, $Q$ inclines towards larger communities. If, on the contrary, its value is higher than 1, $Q$ favors smaller communities. A value equal to 1 is taken in this investigation. The value of modularity is in the range [−1/2, 1] for unweighted and undirected graphs [13, 58].
(ii) Performance:
$$P = \frac{\left| \{ (i,j) \in E : c_i = c_j \} \right| + \left| \{ (i,j) \notin E : c_i \neq c_j \} \right|}{N(N-1)/2},$$
where $i$ and $j$ are nodes, $E$ is the set of links of $G$, $(i,j) \in E$ with $c_i = c_j$ represents two nodes belonging to the same community and joined by a link, $(i,j) \notin E$ with $c_i \neq c_j$ represents two nodes belonging to different communities and not joined by a link, and $c_i$ and $c_j$ are the communities where nodes $i$ and $j$ are located. $N$ is the number of nodes in $G$.
(iii) Coverage. It can be defined as the ratio of the number of intracommunity links to the total number of links.
High values of $Q$ correspond to appropriate partitions. Consequently, the partition corresponding to its maximum value in $G$ should be the best. Therefore, to study the interaction networks, the partition provided by the method with the highest value of $Q$ is selected, provided that the method also yields good values for the performance and coverage metrics. All this is done with the smallest possible number of communities.
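The selection criterion can be sketched by scoring candidate partitions with the three metrics. Only the Louvain and label propagation implementations shipped with NetworkX are compared here, since Leiden and Surprise are not available in that library. The toy graph, the seed, and the helper name are assumptions.

```python
import networkx as nx


def evaluate_partition(G, communities):
    """Return modularity, coverage, and performance of a partition of G."""
    coverage, performance = nx.community.partition_quality(G, communities)
    return {
        "n_communities": len(communities),
        "modularity": nx.community.modularity(G, communities, weight=None),
        "coverage": coverage,
        "performance": performance,
    }


G = nx.karate_club_graph()  # toy graph standing in for an interaction network
candidates = {
    "louvain": nx.community.louvain_communities(G, weight=None, seed=1),
    "label_propagation": list(nx.community.label_propagation_communities(G)),
}
scores = {
    name: evaluate_partition(G, [set(c) for c in communities])
    for name, communities in candidates.items()
}

# Prefer the partition with the highest modularity; coverage, performance, and
# the number of communities are printed so they can be weighed as well.
best = max(scores, key=lambda name: scores[name]["modularity"])
for name, score in scores.items():
    print(name, score)
print("selected:", best)
```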
2.2.6. Analysis of Probability Cumulative Distributions of Hashtags
We also analyze the cumulative probability distributions of hashtags for all types of interactions, globally and by community. The Kolmogorov-Smirnov test with a significance level equal to 0.05 is used to compare the distributions. The following hypotheses are considered:
(i) Null hypothesis $H_0$: “the samples derive from the identical distribution.”
(ii) Alternative hypothesis $H_1$: “the samples derive from different distributions.”
If a $p$ value < 0.05 is obtained in the test, the null hypothesis must be rejected; otherwise, it cannot be rejected.
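The comparison of two distributions can be sketched with the two-sample Kolmogorov-Smirnov test in SciPy. The Poisson samples below are synthetic stand-ins for per-community hashtag counts, and the helper name is hypothetical.

```python
import numpy as np
from scipy import stats


def same_distribution(sample_1, sample_2, alpha=0.05):
    """Return True when the null hypothesis of identical distributions is kept."""
    result = stats.ks_2samp(sample_1, sample_2)
    return bool(result.pvalue >= alpha)


# Synthetic hashtag counts for three hypothetical communities.
rng = np.random.default_rng(0)
community_a = rng.poisson(lam=2.0, size=500)
community_b = rng.poisson(lam=2.0, size=500)
community_c = rng.poisson(lam=6.0, size=500)

print("a vs b:", same_distribution(community_a, community_b))
print("a vs c:", same_distribution(community_a, community_c))
```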
3. Results and Discussion
3.1. Main Structural Properties of the Interaction Networks
Figure 1 shows the graphs corresponding to replying, retweeting, and quoting interactions networks for the utilized tweets from March 11 to May 26 .
Tables 1 and 2 show the structural properties of the interaction networks from March 11 until May 26, 2021, as well as those of their GC. As previously mentioned, the betweenness centrality estimates the number of times a node lies on the shortest path between other nodes. With respect to the GC, as can be observed in Table 2, the average betweenness is low (0.008) in all analyzed networks. This means that few users play an intermediation role linking other users. The average degree is also low (2.8), as shown in Figure 2, where many nodes have a low degree, while only a few exhibit a high degree. Both average path length and diameter are similar in the retweeting and quoting networks. However, they are much higher in the replying interaction network, indicating worse connectivity. The connectivity of a network is characterized by its diameter, which reflects the capacity of any two nodes to interact with each other.
We also analyze the k-core decomposition of the networks, which allows us to detect specific subsets (k-cores) in the graph. These are calculated by recursively eliminating all the nodes of degree lower than $k$, until the degree of all remaining nodes is higher than or equal to $k$. The highest values of $k$ correspond to nodes with a higher degree and a more central position in the network. The k-core decomposition determines a hierarchy of nested subgraphs, in which the 1-core comprises the 2-core, which in turn includes the 3-core, and so on, until the highest k-core is obtained. Higher values of $k$ imply a more relevant and central subgraph.
We have identified 2, 10, and 22 k-cores for the replying, quoting, and retweeting networks, respectively. For the replying network, the highest percentage of nodes is in the first k-core (94.16%). Meanwhile, in the quoting and retweeting networks, the largest proportions are in the first and second k-cores (83.72%/12.66% and 76.60%/14.37%, respectively). The highest k-core contains 5.84% of the users in the replying network and 0.04% in the other networks. Accordingly, the users in the highest k-core have greater relevance.
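The decomposition can be reproduced with NetworkX as in the sketch below, which reports the percentage of nodes whose core number equals each $k$ and extracts the innermost core. The toy graph is an assumption made for illustration.

```python
from collections import Counter

import networkx as nx

G = nx.karate_club_graph()  # toy graph standing in for an interaction network

# Core number of a node: the largest k such that the node belongs to the k-core.
core = nx.core_number(G)
k_max = max(core.values())

# Percentage of nodes whose core number is exactly k.
shell_sizes = Counter(core.values())
share = {
    k: 100.0 * count / G.number_of_nodes()
    for k, count in sorted(shell_sizes.items())
}

# Innermost subgraph: every remaining node has degree >= k_max inside it.
innermost = nx.k_core(G, k=k_max)

for k, percentage in share.items():
    print(f"core number {k}: {percentage:.2f}% of nodes")
print("highest k-core:", k_max, "with", innermost.number_of_nodes(), "nodes")
```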
3.1.1. Modeling the Replying Interactions
Link formation for each type of interaction is modeled at various time periods. This is because, particularly in retweeting and quoting networks, a high number of links exist from March to May 2021. The days on which tweets were downloaded for the periods analyzed are detailed in the Supplementary Material Document.
(1) Modeling from Obtained Embeddings Using Node2Vec. Table 3 depicts the best hyperparameters for each model using Node2Vec for replying interactions from March 11 until April 10, 2021. Figure 3 shows the new interactions for that period (503 new links were formed). Tables 4 and 5 display the performance metrics for each operator and model used. According to the results, the model that exhibits the highest Accuracy is the SVM model, using operator l1 with hyperparameters C: 0.01 and gamma: 0.01. More periods of analysis are included in the Supplementary Material Document. In those periods, the best model was also SVM, but using the Hadamard operator, whose results differ only slightly from those of operator l1, with hyperparameters C: 100.0 and gamma: 0.01. However, as we see later, a better Accuracy is obtained in both cases by using the similarities between nodes as explanatory variables.
(2) Modeling from Obtained Similarities between Nodes. For replying interactions from March 12 until April 12 in 2021, Table 6 displays the best obtained hyperparameters for each model using the similarity metrics between nodes indicated in 2.2.2. Table 7 depicts the performance metrics for each operator and model utilized. The results show that the best Accuracy is obtained for the RF model, with hyperparameters num.trees = 300, mtry = 5, and min.node.size = 1. This Accuracy is slightly higher than that achieved if the model is built using the obtained embedding applying Node2Vec. The study of other time periods has been incorporated in the Supplementary Material Document. At these times, a slightly better accuracy is obtained for the RF model with hyperparameters num.trees = 300, mtry = 10, and min.node.size = 1.
3.1.2. Modeling the Retweeting Interactions
(1) Modeling from Obtained Embeddings Using Node2Vec. Table 8 shows the best hyperparameters obtained for each model utilizing Node2Vec for retweeting interactions from May 21 until May 26 in 2021. Tables 9 and 10 display the performance metrics for each operator and applied model. Figure 4 displays the new interactions in the aforementioned period; 2,581 new links were formed. Additional time periods are included in the Supplementary Material Document. As in the replying interaction networks, the best model for all analyzed time intervals is SVM with the Hadamard operator. The hyperparameters of the model vary depending on the period. In particular, the hyperparameters for the period specified above were C: 10.0 and gamma: 0.01.
(2) Modeling from Obtained Similarities between Nodes. For retweeting interactions from May 21 until May 26 in 2021, Table 11 displays the best hyperparameters obtained for each model using the similarity metrics described in 2.2.2. Table 12 shows the performance metrics for each model used. More periods are described in the Supplementary Material Document. The highest Accuracy is obtained for the RF model, with hyperparameters num.trees = 300, mtry = 10, and min.node.size = 1. This Accuracy is slightly higher than the one achieved by the best model using the Node2Vec embeddings. This holds for all analyzed periods.
3.1.3. Modeling the Quoting Interactions
(1) Modeling from Obtained Embeddings Using Node2Vec. Table 13 shows the best hyperparameters obtained for each model using Node2Vec for quoting interactions from April 12 until April 22 in 2021. Tables 14 and 15 show the performance metrics for each operator and model applied. Additional periods are included in the Supplementary Material Document. Figure 5 shows the 12,651 new links between nodes for the aforementioned time period. The best Accuracy is obtained with the Hadamard operator for the SVM model, with the hyperparameters reported in Table 13. The SVM model using the Hadamard operator also exhibits the highest Accuracy for the rest of the studied time intervals.
(2) Modeling from Obtained Similarities between Nodes. For quoting interactions from April 12 until April 22 in 2021, Table 16 shows the best hyperparameters obtained for each model using the similarities between nodes. Table 17 displays the performance metrics for each operator and model utilized. The best Accuracy is achieved for the RF model with hyperparameters num.trees = 150, mtry = 5, and min.node.size = 1. As in the replying and retweeting interaction networks, the Accuracy is slightly higher than that obtained using the embeddings calculated with Node2Vec. Other periods are described in the Supplementary Material Document. For all periods, the RF model using similarities as explanatory variables also achieves a higher Accuracy.
4. Community Analysis
Tables 18–20 summarize the three metrics used to evaluate the community detection algorithms, together with the number of communities identified by each method. We chose the method that gives a good value for all the performance parameters considered and also yields a smaller number of communities.
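The three evaluation metrics are not named in this section. As a hedged illustration of how a partition can be scored, the sketch below computes modularity, the quantity that the Louvain method optimizes, for an undirected, unweighted toy graph. Treating the interaction network as undirected and unweighted is an assumption made here for brevity, and the graph and partition are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ] of a partition.

    edges: list of (u, v) pairs of an undirected, unweighted graph.
    community_of: dict mapping each node to its community label.
    L_c is the number of edges inside community c, d_c the sum of the
    degrees of its nodes, and m the total number of edges.
    """
    m = len(edges)
    intra_edges = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra_edges[cu] += 1
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Hypothetical graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_community = {node: "A" for node in range(6)}

q_split = modularity(edges, two_communities)
q_single = modularity(edges, one_community)
```

Splitting the graph into its two triangles scores higher than keeping everything in a single community, which is the kind of comparison a selection between candidate partitions relies on.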
Once the candidate community detection algorithms had been analyzed and the most appropriate one selected, the cumulative probability distribution of hashtags was examined for all interaction networks, both globally and by community. Figures 6–9 show the results. Of the 631,027 analyzed tweets, only 28,952 contained hashtags (18,439 retweeting, 10,333 quoting, and 180 replying). All hashtags were reformatted from $\#\mathrm{Word}_1\mathrm{Word}_2\ldots\mathrm{Word}_{tw}$ to $\mathrm{Word}_1\_\mathrm{Word}_2\_\ldots\_\mathrm{Word}_{tw}$, where $tw$ is the maximum number of words in each tweet. The hashtags found more than 100 times in retweeting interactions were as follows: “C_O_P26,” “C_E_Ebill,” “environment,” “C_O_P26,” “C_E_E_Bill,” “Earth_Day,” “C_O_V_I_D19,” and “Clean_Delhi.” The most frequently used hashtag in replying interactions was “Climate_Brawl,” which appeared 22 times. The hashtags “C_O_P26,” “C_E_Ebill,” “C_O_P26,” and “C_E_E_Bill” were detected more than 100 times in quoting interactions. Figure 10 shows the word cloud representation of hashtags per interaction type.
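The hashtag reformatting and the cumulative distribution can be sketched as follows. The splitting rule (an underscore before every capital letter except the first character) is an assumption inferred from the examples listed above, not the authors' documented procedure, and reading the cumulative distribution as "the fraction of distinct hashtags that occur at most k times" is likewise one possible interpretation. The input tags and counts are hypothetical.

```python
import re
from collections import Counter

def format_hashtag(tag):
    """Drop a leading '#' and insert '_' before each non-initial capital."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", tag.lstrip("#"))

def cumulative_distribution(counts):
    """Return [(k, P(frequency <= k))] over the distinct hashtag frequencies."""
    frequency_of_frequency = Counter(counts.values())
    n_hashtags = len(counts)
    running = 0
    cdf = []
    for k in sorted(frequency_of_frequency):
        running += frequency_of_frequency[k]
        cdf.append((k, running / n_hashtags))
    return cdf

# Hypothetical raw hashtags extracted from a handful of tweets.
raw_tags = ["#COP26", "#CEEbill", "#EarthDay", "#COP26", "#environment", "#COP26"]

formatted = [format_hashtag(tag) for tag in raw_tags]
counts = Counter(formatted)
cdf = cumulative_distribution(counts)
```

Running the same two functions on the hashtags of a single community instead of the whole network would give the per-community distribution.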
Climate change and its effects are a relevant topic today. This concern has been highlighted in various international initiatives such as the 2030 Agenda for Sustainable Development established by the United Nations, the climate emergency declared by the European Parliament, or the Net Zero World initiative established by the United States. Social networks give people a good opportunity to voice their opinions. Twitter is a social network that has more than 340 million users, and, for this reason, the analysis of the interactions that happen on such a site can have a high relevance. We detect that the replying, retweeting, and quoting interaction networks can be appropriately described through two models:
(i) An SVM model that utilizes the embeddings provided by the Node2Vec algorithm and the Hadamard operator.
(ii) An RF model that uses as explanatory variables certain metrics describing the similarities between nodes.
We found the most relevant hashtags used by type of interaction and also found that the cumulative probability distributions of hashtags by community are similar. Globally, the cumulative distributions of replying and retweeting interactions exhibit a different pattern. To gain a better understanding of Twitter interactions on such a relevant issue as climate change, this investigation can be continued in several ways:
(i) An inspection of the evolution of temporal cooperation in the interaction networks can be performed using conventional evolutionary game theory.
(ii) An analysis of the diffusion mechanisms as well as an examination of the dynamics of opinion formation can be performed. This will make it possible to study the extent to which an opinion can be manipulated by algorithmic procedures (bots) as well as the effects that the structure of the interaction networks could have on it.
The data used to support the findings of this study are available from the corresponding author upon request.
This research was carried out as a result of the Project: Hopper: Women, Society, Technology and Education which was granted in the internal call for research projects in 2021 at Universidad Francisco de Vitoria.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work was partially funded by Telefonica Chair at Francisco de Vitoria University. The authors thank Mari Luz Congosto Martínez for her help in training on the utilization of the T-Hoarder tool.
Supplementary Materials include (i) overview of T-Hoarder tool, (ii) description of similarity measures (local, global, and quasi-local methods), and (iii) tables related to modeling of the interaction networks.
K. Nguyen and D. Tran, Fitness-Based Generative Models for Power-Law Networks, 2012.
G. Van Rossum and F. L. Drake, Python 3 Reference Manual, CreateSpace, Scotts Valley, CA, USA, 2009.
A. Hagberg, P. Swart, and D. S. Chult, “Exploring network structure, dynamics, and function using networkx,” Tech. Rep., Los Alamos National Lab. (LANL), Los Alamos, NM, USA, 2008.
Study, Affinity Diagrams: Definition & Examples, Study.com, 2016, study.com/academy/lesson/affinitydiagramsdefinitionexamples.html.
M. E. Newman, “Assortative mixing in networks,” Physical Review Letters, vol. 89, no. 20, Article ID 208701, 2002.
S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, New Jersey, NJ, USA, 2nd edition, 1999.
G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, NY, USA, 1986.
P. Jaccard, “Étude comparative de la distribution florale dans une portion des Alpes et des Jura,” Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547–579, 1901.
T. Sørensen, “A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons,” Biologiske Skrifter/Kongelige Danske Videnskabernes Selskab, vol. 5, pp. 1–34, 1948.
P. Yu. Chebotarev and E. V. Shamis, “A matrix-forest theorem and measuring relations in small social group,” Automation and Remote Control, vol. 58, no. 9, pp. 1505–1514, 1997.
I. M. Chakravarti, R. G. Laha, and J. Roy, Handbook of Methods of Applied Statistics, John Wiley & Sons, Hoboken, NJ, USA, vol. I, pp. 392–394, 1967.
E W D Eurostat, “The European Parliament declares climate emergency,” 2019, https://www.europarl.europa.eu/news/en/press-room/20191121IPR67110/the-european-parliament-declares-climate-emergency.
Patten and M. Newhart, Understanding Research Methods: An Overview of the Essentials, Routledge, Oxfordshire, UK, tenth edition, 2017.
S. L. Gortmaker, D. Hosmer, and S. Lemeshow, “Applied logistic regression,” Contemporary Sociology, vol. 23, 2013.
X. Xiecs, “273P Machine Learning and Data Mining,” 2019, https://www.ics.uci.edu/xhx/courses/CS273P/04-linear-regression-273p.pdf.
S. Shirazi, H. Baziyad, N. Ahmadi, and A. Albadvi, “A new application of louvain algorithm for identifying disease fields using big data techniques,” Journal of Biostatistics and Epidemiology, vol. 5, pp. 183–193, 2019.
M. E. J. Newman, Networks: An Introduction, Oxford University Press, Oxford, UK, 2011.
EnergyGov, “U.S. Launches Net-Zero World Initiative to Accelerate Global Energy System Decarbonization,” https://www.energy.gov/articles/us-launches-net-zero-world-initiative-accelerate-global-energy-system-decarbonization.