#### Abstract

Computing the robustness of a network, i.e., the capacity of a network to hold its main functionality when a proportion of its nodes/edges are damaged, is useful in many real applications. Monte Carlo numerical simulation is the commonly used method to compute network robustness; however, it has a very high computational cost, especially for large networks. Here, we propose a methodology by which the robustness of large real-world social networks can be predicted using machine learning models pretrained on existing datasets. We demonstrate this approach by simulating two effective node attack strategies, the recalculated degree (RD) and initial betweenness (IB) node attack strategies, and predicting network robustness with two machine learning models, multiple linear regression (MLR) and the random forest (RF) algorithm. We use the classic network robustness metric *R* as the model response and 8 network structural indicators (NSIs) as predictor variables, trained over a large dataset of 48 real-world social networks whose maximum number of nodes is 265,216. We found that the RF model can predict network robustness with a root mean square error (RMSE) of about 0.03 and is 30% better than the MLR model. Among the results, we found that the RD strategy has more efficacy than IB for attacking real-world social networks. Furthermore, MLR indicates that the most important factors for predicting network robustness are the scale-free exponent *α* and the average node degree <*k*>, whereas the RF indicates that the degree assortativity *a*, the global closeness, and the average node degree <*k*> are the most important. This study shows that machine learning models can be a promising way to infer social network robustness.

#### 1. Introduction

The study of social networks from a complexity science perspective has attracted much interest recently [1]. In particular, the study of dynamic processes that take place in these complex networks has various applications. For example, the study of network robustness, i.e., the capacity of a network to hold its functionality when a proportion of its nodes/edges are removed, can help attack a network efficiently or, inversely, design a more robust network structure in practice [2–7]. On the other hand, the study of epidemic processes that take place in networks can be used to spread news [8–12], optimize vaccination strategies [13–15], or define better social-distancing rules [16–19].

Apart from a few simple model networks for which analytical results can be derived [20–24], most studies rely on computer simulations. For example, to study a network's robustness, node/edge removal Monte Carlo simulations are usually employed, in which nodes/edges are sequentially removed from the network. A "robustness" metric is then recorded at each step of the removal process. The most commonly used robustness metric is the relative size of the largest connected component (LCC) of the remaining network [25].

The way nodes/edges are selected to be removed is called the removal strategy or attack strategy. One can classify attack strategies into two types, initial and recalculated attack strategies. For an initial attack strategy, nodes/edges are removed according to a node/edge ranking that is computed ahead of the removal simulation. In contrast for a recalculated attack strategy, the ranking is updated after each node/edge removal [4].

For node removal attack strategies, the node ranking is usually computed using node centrality measures such as degree [26, 27], closeness [4], and betweenness [7, 30]. It was found that for social networks, the recalculated betweenness node attack strategy (RB) is, on average, the most effective node attack strategy to dismantle the network [2, 7, 28, 29]. Other effective strategies are the recalculated degree (RD) and the initial betweenness (IB) [7, 28, 30].

Because of the sequential nature of the removal process, node removal simulations are computationally costly, especially for recalculated strategies. For example, a simulation using the RD attack strategy has a time complexity of *O*(*N* × *E*), where *N* is the number of nodes and *E* is the number of edges of the network: the node removal process has *N* steps, and at each step, a degree ranking is computed in a time that scales with *E*. The computation of the whole network's betweenness, in contrast, is known to be very computationally costly, owing to the definition of node betweenness [31, 32]. The most efficient known algorithm for calculating network betweenness is the Brandes algorithm [33], which has a time complexity of *O*(*N* × *E*). In consequence, the whole node removal process using the IB and RB attack strategies has a time complexity of *O*(*N* × *E*) and *O*(*N*^{2} × *E*), respectively. Although the IB attack strategy has the same time complexity as the RD attack strategy, RB's time complexity is much higher. For illustration, in Figure 1, we present the total simulation times *t*_{IB} and *t*_{RD} for the attack strategies IB and RD, respectively, for all the social networks in our study (48 networks; see Section 2), as a function of the product *N* × *E*. In addition, we present the total simulation time *t*_{RB} for the attack strategy RB for 4 networks (inset graph) as an example. We found a good linear relationship between *t*_{IB}, *t*_{RD}, and *N* × *E* for all networks, as expected, and *t*_{RB} is about two orders of magnitude higher than *t*_{IB} and *t*_{RD} for networks of equal *N* × *E*.
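The cost gap described above is easy to observe empirically. Below is a minimal sketch (not the paper's code, which uses graph-tool) comparing the cost of one degree ranking against one full betweenness calculation, using networkx on a small synthetic Barabási–Albert graph; the library choice and graph size are our assumptions.

```python
# Compare the per-step cost of a degree ranking vs. a full betweenness
# pass. One betweenness pass is what IB needs; RB needs N of them.
import time
import networkx as nx

G = nx.barabasi_albert_graph(n=500, m=3, seed=42)  # illustrative size

t0 = time.perf_counter()
degree_rank = sorted(G.degree, key=lambda kv: kv[1], reverse=True)
t_deg = time.perf_counter() - t0

t0 = time.perf_counter()
bet = nx.betweenness_centrality(G)  # Brandes algorithm, O(N * E)
t_bet = time.perf_counter() - t0

# A degree ranking is orders of magnitude cheaper than one betweenness
# pass, which is why RB (N betweenness passes) quickly becomes intractable.
print(f"degree ranking: {t_deg:.5f}s, betweenness: {t_bet:.5f}s")
```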

The simulation time can become an issue for social networks because their size can be extremely large. In fact, to our knowledge, most studies of dynamic processes on social networks that use an RB attack strategy only consider small real-world social networks of fewer than 100,000 nodes [7, 28, 30]. For very large social networks, the RB node attack strategy can take an unrealistic amount of time; therefore, RB is not suitable for large social networks on an average workstation. One alternative is to use the betweenness-based attack strategy that requires only one betweenness calculation, namely, the initial betweenness attack strategy (IB), together with a recalculated strategy based on a less computationally costly node centrality metric. In consequence, in this work, we consider two candidate attack strategies for breaking large real-world social networks: the IB and RD attack strategies.

Besides comparative studies of network node attack strategies, other works have focused on the relationship between network robustness and network structural indicators (NSIs). Iyer et al. [4] studied network robustness as a function of the node clustering coefficient (or node transitivity). Their research on model networks with tunable clustering coefficients demonstrates that networks with higher clustering coefficients are more robust, with the strongest effect for the node degree and node betweenness attacks [4]. Nguyen and Trang [34] studied Facebook social networks and found that networks with higher modularity *Q* have lower robustness to node removal. The modularity indicator *Q*, introduced by Newman and Girvan [35], measures how well a network breaks into communities (a community or module in a network is a well-connected group of nodes that has sparser connections with nodes outside the group).
In [29], the authors empirically analyzed how the modularity of scale-free models and real-world social networks affects their robustness and the relative efficacy of different node attack strategies. The abovementioned studies analyzed the relationship between network robustness and a single NSI.

On the other hand, machine learning (ML) is a technique that has seen a huge breakthrough in the last decade, beating state-of-the-art results in many prediction applications [36]. It initially solved technical problems in computer vision and natural language processing [37–39] and then expanded into many other fields such as health care, finance, manufacturing, energy, and environment. The key characteristic of an ML model is the ability to intelligently learn nonlinear relationships between the input and output without explicitly knowing them.

In this work, given the complex relationship between network robustness and NSIs, we adopt machine learning methods to capture this complexity. Our main contribution is the application of ML models to predict real-world social network robustness with acceptable errors. We develop ML models to predict network robustness under two main attack strategies, the IB and RD attack strategies, independently. We implement three popular ML models: single-variable (simple) linear regression, multiple linear regression, and random forest. Our results demonstrate that a data-driven method such as ML can be an efficient way to study a network's complexity.

Our work comprises three steps: (1) collect a real-world network dataset and compute NSIs; (2) run Monte Carlo node attack simulations to estimate network robustness; (3) build and evaluate a model that predicts network robustness from their NSIs. The paper is organized as follows: in Section 2, we describe our dataset of 48 real-world social networks. In Section 3, we describe the network robustness Monte Carlo simulation method and three ML models for predicting the network robustness, i.e., simple and multiple linear regression (SLR and MLR, respectively) and random forest (RF) model. Section 4 presents the main results, and finally, we discuss and conclude in Section 5.

#### 2. Real-World Social Network Datasets and Robustness Estimation

Real-world social networks are downloaded from two sources: the Stanford Large Network Dataset Collection (https://snap.stanford.edu/data/) and the Network Repository social networks (https://networkrepository.com/soc.php). We select 48 social networks with a node number (N) ranging over five orders of magnitude. The smallest network is the “Twitch user-user network of gamers who stream in Portugal” having *N* = 1,914, and the largest network is the “e-mail network from an EU research institution” with *N* = 265,216. However, the network with the largest number of edges (*E*) is the “BlogCatalog social blog” with *E* = 4,186,390. The social networks used in this study are unweighted (i.e., we do not take into account edge weights) and undirected (we do not consider edge directionality).

Table 1 summarizes the 48 real-world social networks and their NSIs. Besides *N* and *E*, we also compute the following NSIs:

(i) Network density (<*k*>): the average node degree, i.e., the average number of edges per node.

(ii) Fitted scale-free exponent (*α*): we assume that all social network degree distributions follow a power law *P*(*k*) ∼ *k*^{−*α*}, where *k* is the node degree. The exponent *α* is fitted using the ordinary least squares method. From this fitting, we also extract the fitting variance of *α*, denoted by *α*^{2}.

(iii) Assortativity (*a*): the assortativity coefficient is a Pearson correlation coefficient of the degree between pairs of linked nodes [40], varying between −1 and 1. A positive value of *a* indicates a preferential connection between nodes of similar degree, while a negative value indicates that nodes of different degrees have more chance to connect.

(iv) Modularity (*Q*): the modularity indicator *Q* quantifies how well a network can be partitioned into subnetworks (modules or communities):

$$Q = \frac{1}{2E}\sum_{ij}\left(a_{ij} - \frac{k_i k_j}{2E}\right)\delta(c_i, c_j),$$

where *E* is the number of edges, *a*_{ij} is the element of the adjacency matrix A in row *i* and column *j*, *k*_{i} is the degree of *i*, *k*_{j} is the degree of *j*, *c*_{i} is the module (or community) of *i*, *c*_{j} that of *j*, the sum goes over all pairs of nodes *i* and *j*, and *δ*(*x*, *y*) is 1 if *x* = *y* and 0 otherwise [13].

(v) Global clustering coefficient (*C*): the global clustering coefficient is based on triplets of nodes. A triplet is three nodes connected by either two (open triplet) or three (closed triplet) undirected edges. The global clustering coefficient is the number of closed triplets (or 3× the number of triangles, because each triangle comprises 3 overlapping triplets, one centered at each of its three nodes) over the total number of triplets (both open and closed):

$$C = \frac{\lambda_{\mathrm{closed}}}{\lambda_{\mathrm{total}}},$$

where *λ*_{closed} is the number of closed triplets and *λ*_{total} is the total number of triplets in the network. The global clustering coefficient represents the overall probability that adjacent nodes are interconnected, thus making modules more tightly connected [41].

(vi) Average closeness (*Cl*): the average of the closeness of all network nodes, where the closeness (or closeness centrality) of a node is the reciprocal of the sum of the lengths of the shortest paths between that node and all other nodes in the graph [42, 43]:

$$Cl = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sum_{j \neq i} d(i,j)},$$

where *N* is the number of nodes and *d*(*i*, *j*) is the length of the shortest path between nodes *i* and *j*.
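Most of these NSIs are available as one-liners in standard graph libraries. As an illustrative sketch (the paper uses graph-tool; the networkx functions below are our substitution, and the Zachary karate club graph is a stand-in for one social network), several of the indicators can be computed as follows. Note that networkx's closeness is normalized by *N* − 1, a convention that differs by a constant factor from the formula above.

```python
# Compute several NSIs for one graph with networkx (illustrative only).
import networkx as nx

G = nx.karate_club_graph()  # stand-in for one social network
N, E = G.number_of_nodes(), G.number_of_edges()

k_avg = 2 * E / N                                  # average degree <k>
a = nx.degree_assortativity_coefficient(G)         # assortativity a
C = nx.transitivity(G)                             # global clustering C (3*triangles/triplets)
Cl = sum(nx.closeness_centrality(G).values()) / N  # average closeness Cl

# Modularity Q needs a community partition; greedy modularity is one choice.
communities = nx.community.greedy_modularity_communities(G)
Q = nx.community.modularity(G, communities)

print(N, E, round(k_avg, 2), round(a, 3), round(C, 3), round(Cl, 3), round(Q, 3))
```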

##### 2.1. Network Robustness Monte Carlo Simulation

For each network, we run two node removal processes using Monte Carlo simulations. Nodes are removed consecutively following the ranking of the initial betweenness (IB) and the ranking of the recalculated degree (RD). In the case of ties, e.g., nodes with an equal betweenness or degree score, we remove one of them at random. After each node removal, we compute the relative size of the largest connected component (LCC), together with the accumulated proportion of removed nodes *q*. Finally, we obtain two curves LCC(*q*) corresponding to the two node removal processes, IB and RD. The whole simulation is repeated 10 times, and the final curves LCC(*q*) are the averaged results.

In addition, we compute a single value defined as the network robustness (*R*), as performed by Bellingeri et al. [44]: the area below the normalized LCC curve during the removal process,

$$R = \frac{1}{N}\sum_{q} \mathrm{LCC}(q),$$

where the sum runs over the *N* steps of the removal process. *R* therefore lies between two theoretical extremes, *R* ≈ 0 (absolutely fragile network) and *R* ≈ 0.5 (absolutely robust network). We denote *R*_{RD} and *R*_{IB} as the network robustness against the RD and IB node attack strategies, respectively.
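The removal loop and the resulting metric *R* can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses networkx instead of graph-tool, a small synthetic graph, deterministic tie-breaking, and a single run rather than 10 averaged repetitions.

```python
# Sketch of IB and RD node attacks and the robustness metric R.
import networkx as nx

def attack_curve(G, strategy):
    """Return the LCC(q) curve for sequential node removal ('IB' or 'RD')."""
    G = G.copy()
    n0 = G.number_of_nodes()
    if strategy == "IB":  # betweenness ranking computed once, up front
        bet = nx.betweenness_centrality(G)
        order = sorted(bet, key=bet.get, reverse=True)
    curve = []
    for step in range(n0):
        if strategy == "RD":  # degree ranking recomputed at every step
            node = max(G.degree, key=lambda kv: kv[1])[0]
        else:
            node = order[step]
        G.remove_node(node)
        lcc = max((len(c) for c in nx.connected_components(G)), default=0)
        curve.append(lcc / n0)
    return curve

G = nx.barabasi_albert_graph(200, 2, seed=1)  # illustrative network
lcc_ib = attack_curve(G, "IB")
lcc_rd = attack_curve(G, "RD")
R_ib = sum(lcc_ib) / len(lcc_ib)  # R: area under the normalized LCC curve
R_rd = sum(lcc_rd) / len(lcc_rd)
print(f"R_IB={R_ib:.3f}, R_RD={R_rd:.3f}")
```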

In summary, we collect 48 real-world social networks and compute 9 NSIs for each network as inputs. In parallel, we run Monte Carlo simulations and obtain the robustness represented by two metrics, *R*_{RD} and *R*_{IB}; the higher they are, the more robust the network. These two metrics are the outputs for each network and will be predicted using the ML models.

#### 3. Machine Learning Approach

This section presents the details of SLR, MLR, and RF models.

##### 3.1. Simple Linear Regression Model (SLR)

Linear regression is the simplest model for prediction. The SLR model between the network robustness *R* and an NSI *x* is expressed by the linear equation

$$R = a_0 + a_1 x,$$

where *a*_{0} is the intercept and *a*_{1} is the slope. Ordinary least squares (OLS) is applied to estimate the coefficients by minimizing an appropriate loss function [45, 46]. Once the OLS process, also called the fitting process, is performed, we can use this equation to predict the robustness *R* of a new network for a given indicator *x*. In addition, we derive a statistical *t*-test from the OLS process with the null hypothesis H0: *a*_{1} = 0. A rejection of H0 means that there is a significant linear relationship between *R* and the NSI *x*.

We run the SLR model fit for all NSIs listed in Table 1, excluding *E* because it can be expressed in terms of two other NSIs: *E* = *N*<*k*>/2.
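One SLR fit, including the *t*-test on the slope, can be done with `scipy.stats.linregress`. The sketch below uses synthetic data (one standardized NSI and fake robustness values with an assumed slope of −0.02); the paper fits *R* against each NSI in the same fashion.

```python
# SLR fit and slope t-test on synthetic (NSI, R) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=48)  # one standardized NSI over 48 networks (fake)
R = 0.15 - 0.02 * x + rng.normal(scale=0.03, size=48)  # synthetic robustness

fit = stats.linregress(x, R)
# fit.pvalue tests H0: a1 = 0; p < 0.05 rejects "no linear relationship"
print(f"a0={fit.intercept:.3f}, a1={fit.slope:.3f}, p={fit.pvalue:.2e}")
```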

##### 3.2. Multiple Linear Regression Model

Multiple linear regression (MLR) is an extension of SLR to a multidimensional variable *x* = (*x*_{1}, *x*_{2}, …, *x*_{n}), where *x*_{1}, *x*_{2}, …, *x*_{n} are NSIs. The linear equation between the network robustness *R* and the NSIs is

$$R = a_0 + \sum_{i=1}^{n} a_i x_i,$$

where the *a*_{i} are coefficients obtained from the OLS method.
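An MLR fit of the form above is a one-liner in scikit-learn. The sketch below mirrors the paper's shape (48 networks, 8 standardized NSIs) on synthetic data; the coefficient values are invented for illustration.

```python
# MLR on synthetic standardized NSIs (48 networks x 8 predictors).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(48, 8))  # 8 standardized NSIs (fake)
true_a = np.array([0.03, -0.01, 0.02, 0.0, -0.02, 0.01, 0.0, 0.01])
R = 0.16 + X @ true_a + rng.normal(scale=0.02, size=48)

mlr = LinearRegression().fit(X, R)  # OLS under the hood
print("intercept:", round(mlr.intercept_, 3))
print("coefficients:", np.round(mlr.coef_, 3))
```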

##### 3.3. Random Forest Model

The random forest (RF) belongs to the ensemble class of ML models, meaning that it aggregates the predictions of an ensemble of base models, here decision tree regression (DTR) models. We briefly describe the DTR below.

A DTR starts with the root of the tree containing all samples (48 networks in our case). It then splits into two different nodes by selecting samples whose value of a certain variable is higher or lower than a certain threshold value. Figure 2(a) represents a basic decision tree diagram for our dataset. The root node containing 48 networks splits into two other nodes by considering whether the variable (NSI in our case) scale-free exponent *α* is higher or lower than 2.5.

**(a)**

**(b)**

The DTR selects the splitting variable and its splitting value based on information theory, specifically the concept of entropy, a measure of the uncertainty of a node. The DTR splits a node by maximizing the information gain, i.e., the difference between the entropy of the initial node and the weighted total entropy of the two resulting nodes. The DTR splits successively until a stopping condition is reached, for example, when the size of the current node is smaller than 20. A terminal node is called a leaf node. In Figure 2(a), after the first split of the root, the left child node becomes a leaf node, while the right child node splits further into two leaf nodes.

Once the final DTR is obtained, it can be used to predict the value of a new sample as follows. The new sample will be classified into one of the leaves, and its prediction value will be the average value of all the samples that are classified into the same leaf.

A decision tree can fit nonlinear datasets because it can split on the same NSI multiple times. However, a single decision tree is prone to overfitting, i.e., it is too sensitive to the training data and fails to generalize to new (testing) data. To address this problem, the RF model builds an ensemble of decision trees, each trained on a random draw of the data, usually several hundred trees, and outputs the average prediction of all trees, which often leads to strong predictions [47–49] (in this work, we implement an RF with 300 DTRs). From an RF, a "feature importance" measure can also be derived to rank the NSIs [50].
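A minimal scikit-learn sketch of such an RF with 300 trees and its feature-importance ranking follows. The data are synthetic and the NSI names merely label the fake columns; only the assortativity and degree columns carry signal in this toy example.

```python
# RF regressor with 300 trees and feature-importance ranking (toy data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
names = ["N", "<k>", "alpha", "alpha_var", "a", "Q", "C", "Cl"]
X = rng.normal(size=(48, 8))  # fake standardized NSIs
R = 0.16 - 0.03 * X[:, 4] + 0.02 * X[:, 1] + rng.normal(scale=0.01, size=48)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, R)
ranking = sorted(zip(names, rf.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
for name, imp in ranking:
    print(f"{name:>9s}: {imp:.3f}")
```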

##### 3.4. Data Preparation, Validation, and Performance Evaluation

All NSIs can be computed from the network data, and thus, our dataset does not contain missing values. We also exclude *E* because of redundancy, as mentioned above. The other 8 NSIs are normalized to avoid large differences in the indicators' ranges:

$$x'_{ij} = \frac{x_{ij} - \mu_i}{\sigma_i},$$

where *x*_{ij} is the value of the NSI *i* for observation (network) *j* and *μ*_{i} and *σ*_{i} are the mean and the standard deviation of the NSI *i*, respectively.
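This z-score normalization can be written in two lines of numpy (scikit-learn's `StandardScaler` computes the same transform); the toy matrix below stacks three invented NSI values for three networks.

```python
# Z-score normalization of an NSI matrix, columns = NSIs, rows = networks.
import numpy as np

X = np.array([[1914.0, 2.5, 0.10],   # toy rows: three NSIs per network
              [54575.0, 9.1, 0.30],
              [265216.0, 1.6, 0.05]])

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)  # per-column mean 0, std 1
print(X_norm.round(3))
```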

In the first step, we use the whole dataset to build the ML models and compare the results between models and between the two target variables. However, because of overfitting, a model's performance on new data is not always consistent with its performance in the training step, so we validate the models in a second step. We choose leave-one-out validation [51]: we train each of the above models 48 times; each time, the whole dataset excluding one observation is used to train the model, which is then used to predict the target value of the remaining (held-out) observation, and this is repeated for each of the 48 held-out observations. The overall evaluation result is the average across all 48 regressions.
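The leave-one-out loop can be sketched with scikit-learn's `LeaveOneOut` splitter; the data below are synthetic stand-ins for the 48 networks, and MLR stands in for either model.

```python
# Leave-one-out validation: train on 47 networks, predict the 48th, repeat.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(48, 8))  # fake standardized NSIs
R = 0.16 + 0.02 * X[:, 0] + rng.normal(scale=0.02, size=48)

preds = np.empty(48)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], R[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

rmse = np.sqrt(np.mean((R - preds) ** 2))  # out-of-sample error
print(f"leave-one-out RMSE: {rmse:.4f}")
```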

Note that for the SLR model, we only consider the regression coefficients, in order to analyze the dependence of the robustness metrics on each NSI. For the MLR and RF models, we analyze the prediction of the robustness metrics using four common evaluation tools for regression problems: the root mean square error (RMSE) and the coefficient of determination (*R*^{2}, also named the explained variance ratio) as analytical metrics, and the frequency distribution and the Q-Q plot of the residual errors as graphical metrics.

RMSE is the square root of the mean of the squared differences between observed and predicted data points. The RMSE has the same unit as the target feature and is generally interpreted as the model error; a lower RMSE value represents superior prediction. The formula of the RMSE is

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(R_j - R_{\mathrm{predicted},j}\right)^2},$$

where *n* is the number of observations, *R*_{j} denotes the empirical (simulated) network robustness, and *R*_{predicted, j} is the predicted robustness for observation *j*.

*R*^{2} represents the general prediction performance of regression models: it is one minus the ratio of the residual variance to the original variance. The formula of *R*^{2} is

$$R^2 = 1 - \frac{\sum_{j=1}^{n}\left(R_j - R_{\mathrm{predicted},j}\right)^2}{\sum_{j=1}^{n}\left(R_j - \bar{R}\right)^2},$$

where *n* is the number of observations, *R*_{j} is the simulated robustness, *R*_{predicted, j} denotes the predicted value for observation *j*, and $\bar{R}$ is the average of all simulated robustness values. *R*^{2} varies between 0 (the model has no prediction ability) and 1 (the model correctly predicts all values).
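Both metrics are a few lines of plain numpy (scikit-learn's `mean_squared_error` and `r2_score` give the same values); the robustness values below are toy numbers for illustration.

```python
# RMSE and R^2 computed exactly as defined above (toy values).
import numpy as np

R_sim = np.array([0.148, 0.173, 0.001, 0.375, 0.338])   # "simulated" R
R_pred = np.array([0.150, 0.160, 0.010, 0.360, 0.330])  # "predicted" R

rmse = np.sqrt(np.mean((R_sim - R_pred) ** 2))
r2 = 1 - np.sum((R_sim - R_pred) ** 2) / np.sum((R_sim - R_sim.mean()) ** 2)
print(f"RMSE={rmse:.4f}, R^2={r2:.4f}")
```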

The residual error, *ε*_{j} = *R*_{j} − *R*_{predicted, j}, is simply the difference between the empirical (simulated) network robustness and the predicted robustness. The frequency histogram of *ε*_{j} is expected to be centered close to zero. Furthermore, an important assumption of a linear regression model is that the residual errors are independent and normally distributed.

The networks are analyzed using the "graph-tool" library in *Python*. All data preparation, model building, and evaluation are written in *Python*. The hardware for the numerical simulations is a PC with an Intel i9-10850 processor and 32 GB of RAM.

#### 4. Results

##### 4.1. Network Robustness as a Function of the NSIs and SLR T-Test

The simulated robustness of each network, *R*_{IB} and *R*_{RD}, is reported in Table 2. Overall, we found that *R*_{RD} is slightly smaller than *R*_{IB} for most networks (43 out of 48), with averages of 0.148 vs. 0.173, respectively. This suggests that the RD strategy has more efficacy than IB for attacking real-world social networks. The largest and sparsest network, Email-EuAll (*N* = 265,216 and <*k*> = 1.58), has the smallest robustness, with equal *R*_{IB} and *R*_{RD} of 0.001. In contrast, the gemsec_deezer_HR network, with *N* = 54,575 and <*k*> = 9.12, has the strongest robustness, with *R*_{IB} and *R*_{RD} of 0.375 and 0.338, respectively.

In Figure 3, we plot *R*_{RD} and *R*_{IB} as functions of the 8 independent NSIs, and we found that *R*_{RD} and *R*_{IB} behave similarly in all cases. The SLR unveils some significant relationships between *R* and the NSIs (Figure 3 and Table 3). For example, in Figure 3(a), we can see that both *R*_{RD} and *R*_{IB} slightly decrease with the network size *N*. This linear dependence between the robustness *R*_{RD} and *R*_{IB} and *N* is tested using the SLR model, and we found that it is statistically significant at a confidence level of 95% (*p* value < 0.05, Table 3).

**(a)**

**(b)**

**(c)**

**(d)**

**(e)**

**(f)**

**(g)**

**(h)**

Interestingly, unlike previous findings [4, 52], *R*_{RD} and *R*_{IB} do not show a statistically significant linear dependence on the network density <*k*> (Figure 3(b) and Table 3). This contrast suggests that network robustness also depends on other NSIs and that the network density alone cannot predict the whole network's robustness.

Besides *N*, the only other NSI that shows a significant linear relationship is the modularity *Q* (Figure 3(f)) in the case of RRD.

However, in Figure 3, we still observe some nonlinear dependencies. For example, Figure 3(e) shows that network robustness decreases with the assortativity coefficient *a* when *a* > 0 (decreasing fastest when *a* is close to 0) and increases with *a* when *a* < 0.

Similarly, in Figure 3(g), we found that the relationship between *R*_{RD} and *R*_{IB} and the global clustering coefficient *C* follows an inverted U-shaped pattern. We ran a two-line statistical test [53] and found that a two-line (or broken-line) regression is significantly better than a single-line fit. The breakpoint was found at *C* = 0.115. Both *R*_{RD} and *R*_{IB} linearly increase with *C* up to the breakpoint and linearly decrease with *C* beyond it (both with a significance level of 95%). One possible explanation is that if the network is sparse, more triplets help increase the network's connectivity and thus its robustness. However, above a certain value (*C* = 0.115), more triplets may denote the presence of hubs or central nodes, which are likely targets of intentional node removal strategies such as RD and IB, consequently lowering network robustness.
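A broken-line fit of this kind can be sketched as a hinged least-squares regression (the actual two-lines test in [53] is more elaborate, and it also estimates the breakpoint; here the breakpoint and all data are synthetic assumptions).

```python
# Two-segment (broken-line) least-squares fit around a fixed breakpoint.
import numpy as np

rng = np.random.default_rng(4)
C = rng.uniform(0.0, 0.3, size=48)
bp = 0.115  # breakpoint, fixed for illustration
# synthetic robustness: rises below the breakpoint, falls above it
R = 0.1 + 0.8 * np.minimum(C, bp) - 0.5 * np.maximum(C - bp, 0)
R += rng.normal(scale=0.005, size=48)

# design matrix: intercept + slope below the hinge + slope above it
Xd = np.column_stack([np.ones(48),
                      np.minimum(C, bp),
                      np.maximum(C - bp, 0)])
coef, *_ = np.linalg.lstsq(Xd, R, rcond=None)
print(f"slope below: {coef[1]:.2f}, slope above: {coef[2]:.2f}")
```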

##### 4.2. Machine Learning Prediction of Network Robustness

The results of the previous section suggest that the social network’s robustness depends on multiple NSIs in a highly complex, multidimensional, and nonlinear manner. To improve the model prediction, in this section, we use two multiple variable ML models, MLR and RF, to predict network robustness.

The results of the multiple linear regression (MLR) are shown in Table 4. We found that both *R*_{IB} and *R*_{RD} have a positive overall linear regression coefficient with respect to *α*, *Q*, *Cl*, and <*k*> and a negative one with respect to *α*^{2}, *a*, *C*, and *N*. Moreover, the MLR results indicate that *α*, *α*^{2}, and <*k*> have the most significant coefficients. A positive linear regression coefficient for the average node degree <*k*> suggests that networks are more robust when <*k*> is higher, all other NSIs being fixed. This agrees with previous outcomes demonstrating that denser networks may be more resistant to attack [4, 52]. However, the different results between the MLR and SLR suggest a strong correlation between <*k*> and other NSIs. In addition, the MLR model predicts *R*_{IB} better than *R*_{RD}, with an *R*^{2} coefficient of 58.04% compared to 51.76%. Nevertheless, the RMSE is smaller for *R*_{RD}, with a value of 0.0657 compared to 0.0709 for *R*_{IB} (this is because the standard deviation of *R*_{IB} is higher than that of *R*_{RD}, as shown in Table 2, bottom row).

Because of the nonlinearity found in the previous section, we expect the regression to improve with the RF model. Table 5 reports the regression results of the RF model. We found that *R*^{2} increases to 92.24% and 91.88% for the *R*_{IB} and *R*_{RD} regressions, respectively. Interestingly, the RF model predicts *R*_{RD} roughly as well as *R*_{IB}, while MLR predicts *R*_{IB} better than *R*_{RD}, suggesting that *R*_{RD} may follow a stronger nonlinear relationship with the NSIs than *R*_{IB}. Additionally, the RMSE improved for both *R*_{IB} and *R*_{RD}, with values of 0.0272 and 0.0241, respectively. Interestingly, the feature importance ranking in Table 5 shows that with the RF model, the assortativity *a*, the global closeness *Cl*, and the node number *N* are the most important NSIs. This result agrees with the exploratory observations shown in Figure 3 and discussed above.

In Figure 4, we compare the network robustness *R*_{IB} and *R*_{RD} with the predictions given by MLR and RF using scatter plots. The scatter plots indicate that RF fits the data significantly better than MLR, with predicted data points closer to the diagonal line *y* = *x*. For the MLR regression, we still found a nonlinear dependency between the actual and predicted values: the MLR model was not able to capture the inherent nonlinearity in the data. We also analyzed the residual errors of the above regressions using frequency histograms and Q-Q plots and found that they follow a normal distribution relatively well (Figures 5–8).

**(a)**

**(b)**

**(a)**

**(b)**

**(a)**

**(b)**

**(a)**

**(b)**

**(a)**

**(b)**

Finally, we run the leave-one-out regression for both the MLR and RF models in order to avoid overfitting. The results are summarized in Table 6, and the scatter plots are shown in Figure 9. We found that the prediction is less accurate than the "in-sample" training above, with higher RMSEs for both the MLR and RF models. We obtained RMSEs of 0.0812 and 0.0760 for the *R*_{IB} and *R*_{RD} predictions using MLR, respectively, and RMSEs of 0.0733 and 0.0636 using RF, respectively. Even though the regression results are less accurate, because we predict a single sample that is independent of the remaining samples used for training (building the ML model), the residual errors still fit a normal distribution well, as shown in the histograms and Q-Q plots (Figures 10–13).

**(a)**

**(b)**

**(a)**

**(b)**

**(a)**

**(b)**

**(a)**

**(b)**

**(a)**

**(b)**

#### 5. Discussion and Conclusion

In this work, we have analyzed the robustness of 48 real-world social networks with node numbers ranging over five orders of magnitude, from 1,914 to 265,216. Using Monte Carlo simulations, we have run two commonly used node attack strategies, the IB and RD strategies, whose computation time is within our hardware capability. We found that their corresponding simulation times, *t*_{IB} and *t*_{RD}, scale linearly with the product of the network's node and edge numbers, i.e., *N* × *E*. We also found that the two attack strategies present similar efficacy when evaluated by the single robustness metric *R*, with RD slightly better than IB (the average *R*_{RD} is slightly smaller than the average *R*_{IB}). This suggests that for the social networks used in this study, the RD strategy is the more efficient strategy to dismantle (break down) networks, both in terms of computational cost and breakdown efficiency.

To understand how the structure of a social network determines its robustness, we investigate the relationship between the metric *R* and a set of network structural indicators (NSIs) from the literature. The simple linear regression (SLR) between *R* and NSIs shows low goodness of fitting, and it is overall not able to produce significant prediction models. The low goodness of SLR would indicate that network robustness depends on NSIs in a nonlinear manner.

To improve the fitting, we have developed two machine learning models to predict the two robustness metrics *R*_{IB} and *R*_{RD} from the combination of 8 NSIs: the multiple linear regression (MLR) and the random forest (RF) model. The latter is chosen because it handles nonlinear data well and is built on a collection of base models, decision tree regressors. We found clearly that the random forest model predicts network robustness better than the multiple linear regression model. Concretely, the RF model predicts network robustness with an RMSE of 0.0272 and 0.0241 for *R*_{IB} and *R*_{RD}, respectively. This result is encouraging for predicting real-world social network robustness, although the error is about 16% (for *R*_{IB}, the RMSE is 0.0272 compared to an average *R*_{IB} of 0.173, and for *R*_{RD}, the RMSE is 0.0241 compared to an average *R*_{RD} of 0.148). Meanwhile, when the leave-one-out evaluation is applied, the RMSE increases to 0.0733 and 0.0636 for *R*_{IB} and *R*_{RD}, respectively, roughly 40% of the average values.

Finally, MLR indicates that the most important factors to predict network robustness are the exponent *α* and the average node degree <*k*>, for both *R*_{IB} and *R*_{RD}. In particular, a higher value of *α* is correlated with higher *R*_{IB} and *R*_{RD}. Higher values of the exponent *α* denote a network with fewer hub nodes (highly connected nodes) [35]. In consequence, the RD and IB attack strategies cannot find large hub nodes whose removal may disintegrate the network faster, resulting in higher values of *R*_{RD} and *R*_{IB}. Additionally, MLR indicates that <*k*> is positively related to *R*_{IB} and *R*_{RD}. This last outcome agrees with previous results demonstrating that networks with higher edge density may be more resistant to attack [4, 52]. On the other hand, it confirms that SLR, which focuses on a single NSI, may not be able to predict the robustness of real-world social networks.

Our work demonstrates that ML models can be used to predict network robustness with acceptable accuracy. It therefore alleviates the need to run a full Monte Carlo simulation on a network when only an approximate robustness value is needed. Meanwhile, larger network datasets are expected to further improve the accuracy of the ML models. This work also contributes to the understanding of the relationship between real-world social network robustness and its structural indicators. Finally, we have shown that a data-driven approach is an appropriate way to predict the outcome of nonlinear and complex dynamic processes such as network robustness [54–60].
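As a rough sketch of the simulation that the ML models approximate, the robustness *R* under the RD attack can be computed directly on a `networkx` graph, assuming *R* is the average relative size of the largest connected component over the whole attack sequence (the classic metric used in the paper). The graph here is a small synthetic example; real runs on networks of up to 265,000 nodes are what make this costly.

```python
import networkx as nx

def robustness_RD(G):
    """Robustness R under the recalculated-degree (RD) attack:
    repeatedly remove the current highest-degree node and accumulate
    the relative size of the largest connected component (LCC)."""
    G = G.copy()
    N = G.number_of_nodes()
    total = 0.0
    for _ in range(N):
        # Recalculate degrees and remove the current highest-degree node
        v = max(G.degree, key=lambda t: t[1])[0]
        G.remove_node(v)
        if G.number_of_nodes():
            total += len(max(nx.connected_components(G), key=len)) / N
    return total / N

# Small illustrative scale-free-like graph
G = nx.barabasi_albert_graph(200, 2, seed=42)
r = robustness_RD(G)
print(f"R_RD = {r:.3f}")
```

The IB variant differs only in ranking nodes once, by betweenness on the initial graph, instead of recalculating degrees at every step.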

#### Appendix

The histograms and QQ-plots of the residual errors of all regressions are given in Figures 5–8 and Figures 10–13.

#### Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| RD | Recalculated degree node attack strategy |
| IB | Initial betweenness node attack strategy |
| RB | Recalculated betweenness node attack strategy |
| t_{IB} | Total simulation time for the attack strategy IB |
| t_{RD} | Total simulation time for the attack strategy RD |
| t_{RB} | Total simulation time for the attack strategy RB |
| SLR | Simple linear regression model |
| MLR | Multiple linear regression model |
| RF | Random forest model |
| DTR | Decision tree regression model |
| NSI | Network structural indicator |
| RMSE | Root mean squared error |
| R^{2} | Coefficient of determination (also named the explained variance ratio) |
| a_{0} | Intercept coefficient of SLR |
| a_{1} | Slope coefficient of SLR |
| OLS | Ordinary least squares method |
|  | Error between the empirical (simulated) network robustness and the predicted value of robustness |
| α | Fitted scale-free exponent |
| k | Node degree |
| <k> | Average node degree |
| a | Degree assortativity |
| Cl | Global closeness |
| C | Global clustering coefficient |
| LCC | Largest connected component |
| N | Number of nodes |
| E | Number of edges |
| Q | Modularity indicator |
| α^{2} | Fitting variance of α |
| q | Accumulated proportion of nodes removed |
| R | Network robustness |
| R_{RD} | Network robustness against the RD node attack strategy |
| R_{IB} | Network robustness against the IB node attack strategy |

#### Data Availability

All 48 real-world social networks were downloaded from the Stanford Large Network Dataset Collection (https://snap.stanford.edu/data/) and the Network Repository social networks (https://networkrepository.com/soc.php).

#### Conflicts of Interest

The authors declare no conflicts of interest.

#### Authors’ Contributions

QN conceived analyses. NKKN, HHP, TTL, and TMN performed simulations. QN, NKKN, FS, RA, MB, and DC wrote the paper.

#### Acknowledgments

This work was supported by Vietnam’s Ministry of Science and Technology (MOST) under the Vietnam-Italy Scientific and Technological Cooperation Program for the period 2021–2023 and by the Vietnam National University Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam, under grant no. B2017-42-01. This research was funded by a grant from the Italian Ministry of Foreign Affairs and International Cooperation. This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 Research and Innovation Programme (grant agreement no. 816313). The authors are grateful to Van Lang University, Vietnam, for providing the budget for this study.