#### Abstract

The identification of regions of interest (ROIs) in wireless networks holds the potential to resolve the challenging problems of resource allocation and network traffic prediction for large-scale traffic data generated by mobile applications. The rationale is that an ROI gathers single regions that share similar network characteristics, which promotes better network traffic prediction performance. Previous studies show that spatiotemporal information in network traffic data, such as user behaviors and network status, is nontrivial to ROI identification. However, the modeling of the relationships among these spatiotemporal clues is not yet fully explored. To this end, we propose a random matrix theory-based ROI identification (RRI) approach. ROIs are identified by observing the intensification or diminution of network characteristic differences, i.e., divergence, between adjacent single regions. Firstly, we model the spatiotemporal information of area network traffic data with a spike model, which can be described as a zero-mean random matrix with a deterministic perturbation matrix. Then, we put forward an average divergence capacity model for ROI identification by estimating the divergent degree of adjacent regions. Case studies on three real-world network traffic datasets demonstrate the effectiveness of our proposed RRI method. The ROI identification greatly improves the network traffic prediction performance, reducing both the root mean square error and the mean absolute error.

#### 1. Introduction

In the upcoming 2030s+, wireless network services and scenarios will become more diversified, and user needs will be more personalized than ever [1]. Meanwhile, data generated by the use of extremely heterogeneous networks, diverse communication scenarios, and large numbers of devices have undergone an exponential expansion to an unprecedented scale [2]. In particular, due to the increasingly diversified and complex networks, we have ushered in an era of big data with 77.5 exabytes of wireless network traffic data produced per month by 2022 [3]. 6G networks are expected to enable on-demand services for better user satisfaction [4].

A more accurate network traffic prediction for diverse regions of interest (ROIs) with similar network traffic characteristics can help network operators understand the diversified network status, optimize the resource allocation, improve users’ quality of experience (QoE), and reduce the capital expenditure (CAPEX) and operating expenditure (OPEX) [5–7]. Yet the pervasive and exponentially increasing multidimensional and highly correlated data impose imminent challenges on area network traffic characteristic modeling and prediction in diverse regions [8, 9].

The area network traffic data have become increasingly correlated in time and space [10, 11]. Big data modeling and analysis of the multidimensional and highly correlated wireless network data plays a pivotal role in predicting the network traffic and understanding the network characteristics of ROIs [12–14]. Data-driven network traffic understanding and prediction have attracted great attention and produced fruitful results [15–17]. For example, a data-driven framework for network behavior analysis in cellular networks for Industry is proposed in [18]. Human mobility patterns using spatiotemporal correlated urban big data are provided for vehicular social networks in [15, 19].

Great progress in network traffic prediction has been achieved by neural network-based methods. For example, Long Short-Term Memory (LSTM) [17], Gated Recurrent Units (GRU) [20], and Stacked Autoencoders (SAEs) [9] have reported better performance in predicting time series data than statistically based methods. While these methods study traffic time series for each individual location, recent studies further utilize spatial information. An attention-based neural network is proposed in [6] for traffic prediction, and a deep learning method for wireless network traffic prediction is put forward in [13] with temporal and spatial characteristics of wireless network traffic data modeled for prediction.

However, these neural network-based studies mainly focus on prediction in isolated single regions and overlook the spatiotemporal information of adjacent regions, which may lead to inaccurate prediction results.

Intuitively, with more data obtained from adjacent regions with similar network traffic characteristics, a higher prediction accuracy can be achieved. The main challenges to this hypothesis are as follows: (1) how to shape the network traffic characteristics with a comprehensive data model, (2) how to evaluate the network traffic characteristic differences of adjacent regions, and (3) how to aggregate the adjacent regions with similar network traffic characteristics as an identified ROI.

Network traffic data can be considered time series for prediction [21, 22]. AutoRegressive Integrated Moving Average (ARIMA) [23] and Support Vector Regression (SVR) [24] are the representative approaches to time series modeling. The ARIMA model tends to focus on the mean value of the past data regardless of the nonlinear variations underlying the traffic flow [25]. The limitation of SVR lies in the difficulty of determining its key parameters [25]. Notably, excessive dependence on historical data while ignoring spatial information, in particular that of adjacent regions, may lead to unsatisfactory prediction performance [26].

To this end, we propose a spike model to describe the spatiotemporal information of adjacent regions with random matrix theory (RMT) spectral verifications. By revealing the differences of data structure among multidimensional datasets with the spectral analysis, RMT is able to analyze the divergent degree of different datasets [27–29]. This paper is an extension of our previous work which utilizes RMT for anomaly detection in wireless networks [30]. In [30], we apply RMT to distinguish anomalous data from normal data by observing the eigenvalue distribution, but a deeper investigation of the spectral distribution is lacking. In this paper, we propose a data model and derive its spectral distribution for area network traffic characterization and a new capacity model for divergence degree evaluation in ROI identification. The correctly identified ROI can promote better network traffic prediction performance and higher resource allocation efficiency in the upcoming 6G networks. To summarize, the main contributions of our work are as follows:

(i) We propose a novel method of RMT-based ROI identification (RRI), to identify the ROIs by evaluating the network traffic differences of adjacent regions modeled by a spike model.

(ii) The spike model is a zero mean random matrix with a deterministic perturbation matrix utilized for modeling the network traffic characteristics. The RMT spectral analysis is employed to theoretically verify the model, showing that the empirical spectral distribution of the spike model confirms the raw eigenvalue distribution.

(iii) An average divergence capacity model is proposed to identify the ROIs by evaluating the divergent degree of adjacent single regions modeled by the spike model. We aggregate the adjacent single regions with shrinking divergence as an identified ROI.

(iv) Numerical results show that the proposed RRI approach can identify ROIs with ground truth verifications. Moreover, with the aid of RRI, the performance of prediction in ROIs can be improved with a decrease of root mean square error and mean absolute error.

The rest of the paper is organized as follows. Section 2 presents the data description and some preliminary data analysis. The background knowledge about the RMT spectral analysis and RMT-based theoretical verification for the spike modeling method is laid out in Section 3. In addition, a real-world area network traffic dataset is employed to validate the effectiveness of the proposed model. In Section 4, an average divergence capacity model for evaluating the divergent degree of adjacent regions is presented for the RRI method. Case studies of ROI identification and network traffic predictions are carried out in Section 5. Section 6 concludes the paper.

#### 2. Data Description and Preliminary Data Analysis

As the spatiotemporal correlated data accumulate to an enormous scale, the network traffic differences of diverse regions are no longer static, and thus, the network traffic prediction for isolated regions is not applicable to the fulfillment of on-demand network in the era of big data [26].

ROI identification in wireless networks can contribute to a more accurate network traffic prediction. An appropriate data model that describes the network characteristics while preserving spatiotemporal information is a good starting point. In this respect, a universal data model for network traffic characteristic difference evaluation can greatly facilitate the ROI identification and further improve the prediction performance. Table 1 summarizes the notations used in this paper.

##### 2.1. Dataset Description

In order to model the wireless area network traffic for the RMT analysis, we first present a description of the dataset. It is a real-world spatiotemporally correlated network traffic dataset computed over the Call Detail Records (CDRs) of a real LTE network of Telecom Italia in Milan, Italy. This public dataset was officially provided to the Big Data Challenge 2014 competition [31]. It was collected from 3,450 base stations (BSs), which logged the network traffic of each base station over two months, from November to December 2013. The dataset includes SMS activity, call activity, and Internet traffic activity, which can be considered key performance indicators (KPIs) of the region characteristics. To facilitate the data analysis, the Milan region is divided into grids named Milan Grids, with all the BSs mapped into individual grids, or single regions. When there are several BSs in a single region, all the traffic loads are aggregated into one traffic load [32]. Although these data were recorded nearly a decade ago, they capture the spatiotemporal characteristics of real geographic scenarios and have therefore been widely utilized for network traffic analyses in recent years [17, 30, 33, 34].
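The aggregation step described above, where the loads of several BSs in one grid are summed into a single load, can be sketched as follows. This is a minimal illustration only: the record layout `(grid_id, bs_id, time_slot, traffic)` and the sample values are hypothetical and do not reflect the actual schema of the dataset.

```python
from collections import defaultdict


def aggregate_by_grid(records):
    """Sum the traffic of all base stations mapped to the same grid and time slot.

    records: iterable of (grid_id, bs_id, time_slot, traffic) tuples
             (a hypothetical layout chosen for illustration).
    Returns a dict {grid_id: {time_slot: total_traffic}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for grid_id, _bs_id, time_slot, traffic in records:
        totals[grid_id][time_slot] += traffic
    # Convert nested defaultdicts to plain dicts for a clean result.
    return {grid: dict(slots) for grid, slots in totals.items()}


# Made-up sample: two base stations share grid 5848 in time slot 0.
records = [
    (5848, "bs_a", 0, 10.0),
    (5848, "bs_b", 0, 5.0),
    (5848, "bs_a", 1, 7.0),
    (5849, "bs_c", 0, 3.0),
]
grid_traffic = aggregate_by_grid(records)
```

Each grid then carries one traffic value per time slot, which is the form assumed by the matrix models in the later sections.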

For expository purposes, we select an area with regions (grids) as depicted in Figure 1(a). The area includes three typical social function regions, which are the Convention Center (Grid 5848), Shopping Center (Grid 5849), and Central Park (Grids 5748 and 5749). Figure 1(b) depicts the statistical results of the accumulated spatially correlated network traffic data of the single regions within hours. The network traffic volume of adjacent regions shows great similarities, but notably, the network traffic volume of the regions adjacent to the Shopping Center and Convention Center is much higher than that of the regions adjacent to Central Park. The observation is consistent with the ground truth that Central Park consumes far fewer network resources than the Convention Center and the Shopping Center [33, 34].

**(a) The selected regions (Grids)**

**(b) Statistical results of the selected regions**

##### 2.2. Preliminary Data Analysis

For a preliminary analysis of the characteristics of the three different regions, the statistical results are presented with each network traffic dataset grouped into a matrix, whose rows represent the individual traffic of specific regions and whose columns indicate the sampling time. Assume the number of KPIs is $N$ and the total sampling time is $T$. Without loss of generality, for the $i$th KPI at the sampling time $t$, we model the raw KPI volume as $x_{i,t}$. All the KPIs sampled at time $t$ can be treated as a vector $\mathbf{x}_t = (x_{1,t}, x_{2,t}, \ldots, x_{N,t})^{\mathsf{T}}$.

Figure 2 shows the network traffic data of three adjacent single regions over two days, which exhibit strongly diverse time series characteristics. Whereas the data consumption of Central Park peaks at around 10:00 and that of the Convention Center reaches its maximum at around 20:00, the Shopping Center displays a plateau of data consumption during the daytime. Since regions with different social functions show different network traffic characteristics, we can conclude that the data are also spatially correlated. If we can aggregate the adjacent regions with similar network traffic characteristics, the ROIs can be identified accordingly. Therefore, before we utilize the network traffic characteristic differences for ROI identification, it is necessary to model the spatiotemporal information of individual regions.

#### 3. Network Traffic Data Modeling

RMT has been widely applied to the analysis of highly correlated big wireless network data that contain a number of random variables [27, 33, 35]. Most RMT-related studies use it as a benchmark for anomaly detection by simply observing the eigenvalue distribution, yet they lack an investigation of the underlying mathematical model [33, 36, 37]. In [38], RMT is employed to analyze the time series data for anomaly detection, which extends the RMT applications to a non-Gaussian distribution scenario. In terms of a thorough analysis of the network traffic data differences, pioneering works in [6, 30, 33] have proposed to apply the RMT spectral analysis to anomaly detection. In this section, we extend the application of RMT to the modeling of network traffic characterization.

##### 3.1. Data Modeling

Wireless network traffic data can be decomposed into regular components and residual components [21, 33]. Here we present a more intuitive data model hypothesis of the network traffic volume, with the raw data decomposed into two parts as shown in

$$x_{i,t} = \mu + \sigma z_{i,t}, \tag{1}$$

where $\mu$ represents the deterministic network traffic pattern in one region, $z_{i,t}$ is an independent identically distributed (i.i.d.) random variable with zero mean and unit variance, and $\sigma^2$ is the variance. Thereby, the raw data can be considered a random variable with nonzero mean, and the sampling matrix of the network traffic dataset from a specific region can be considered a random matrix as formulated in

$$\mathbf{X} = \mathbf{A} + \sigma \mathbf{Z}, \tag{2}$$

where $\mathbf{X}$ is of size $N \times T$, $N$ stands for the number of KPIs, $T$ denotes the total sampling times, the matrix $\mathbf{Z}$ is a non-Hermitian random matrix with i.i.d. zero-mean Gaussian distribution entries $z_{i,t}$, and $\mathbf{A}$ is the deterministic matrix of a specific ROI with all single valued entries $\mu$.
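To make the decomposition concrete, the following sketch simulates one region under the stated hypothesis: a deterministic matrix with a single repeated value plus scaled i.i.d. Gaussian noise. The names `mu`, `sigma`, `n_kpis`, and `n_samples`, as well as the numerical values, are assumptions chosen for illustration, and `sigma` is taken to be the standard deviation of the noise.

```python
import numpy as np


def simulate_region(mu, sigma, n_kpis, n_samples, seed=0):
    """Return (X, A): a noisy region matrix and its deterministic part.

    X = A + sigma * Z, where every entry of A equals mu and Z has
    i.i.d. standard Gaussian entries (zero mean, unit variance).
    """
    rng = np.random.default_rng(seed)
    A = np.full((n_kpis, n_samples), float(mu))   # deterministic pattern
    Z = rng.standard_normal((n_kpis, n_samples))  # random fluctuations
    return A + sigma * Z, A


X, A = simulate_region(mu=2.0, sigma=0.5, n_kpis=3, n_samples=1000)
rank_of_A = int(np.linalg.matrix_rank(A))  # a constant matrix has rank 1
```

Because every entry of the deterministic part is identical, its rank is one, which is consistent with the single spike discussed in the spectral analysis that follows.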

##### 3.2. Theoretical Verification for the Data Model

Since the area traffic dataset has been constructed as (2), which is a multidimensional and highly correlated random matrix, RMT can be applied as a mathematical tool to theoretically verify the model with spectral analysis. The RMT spectral analysis can reveal the intrinsic data structure information from the perspective of eigenvalue distribution. Therefore, we focus on investigating the eigenvalue properties of the data model in this section.

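In the light of the random matrix in (2), its covariance matrix can be derived as

$$\mathbf{S} = \frac{1}{T}\mathbf{X}\mathbf{X}^{\mathsf{T}}, \tag{3}$$

where $\mathbf{X}$ is the $N \times T$ data matrix in (2) and $(\cdot)^{\mathsf{T}}$ stands for the matrix transpose. The matrix of a specific ROI can be formulated with the data model proposed in (2). Then, the covariance matrix of the raw data matrix in (2) can be denoted as

$$\mathbf{S} = \frac{1}{T}\left(\mathbf{A} + \sigma\mathbf{Z}\right)\left(\mathbf{A} + \sigma\mathbf{Z}\right)^{\mathsf{T}}, \tag{4}$$

where $\mathbf{A}$ is the deterministic matrix, $\mathbf{Z}$ is the zero-mean random matrix, and $\sigma^2$ is the variance.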

Having obtained the covariance matrix in (4), we can derive the asymptotic spectrum of the data model for mathematical verification using the empirical spectral distribution (ESD) given in Definition 1, which is an important metric to describe the eigenvalue distribution of a matrix.

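*Definition 1 (empirical spectral distribution [39]).* Consider an $N \times N$ Hermitian matrix $\mathbf{S}$; the ESD of the matrix is defined as

$$F^{\mathbf{S}}(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}_{\{\lambda_i \le x\}}, \tag{5}$$

where $\mathbb{1}_{\{\cdot\}}$ is an indicator function over a set and $\lambda_i$, $i = 1, \ldots, N$, denotes the eigenvalues of $\mathbf{S}$.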

By the definition of the ESD, the fraction of eigenvalues that are no larger than a given value constitutes a cumulative distribution function, based on which the eigenvalue distribution of the covariance matrix can be derived. As illustrated in Figure 3, the ESD of the covariance matrix shows two components, the bulk and the spike. The bulk mainly arises from the random noise or fluctuations of the stochastic part in (4), and the spike represents the unusual network traffic volumes or anomalies in the deterministic part in (4). By analogy, this kind of data model can be considered mathematically a spike model in RMT [40].
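The bulk-plus-spike picture can be reproduced numerically. The sketch below assumes a sample covariance of the form $\mathbf{X}\mathbf{X}^{\mathsf{T}}/T$ and a synthetic data matrix with a constant offset plus unit Gaussian noise; the sizes and the threshold value 3.0 are illustrative choices, not values from the paper.

```python
import numpy as np


def sample_covariance_eigs(X):
    """Eigenvalues (ascending) of S = X X^T / T for an N x T data matrix X."""
    T = X.shape[1]
    S = X @ X.T / T
    return np.linalg.eigvalsh(S)  # S is symmetric, so eigvalsh applies


def esd(eigs, x):
    """Empirical spectral distribution: fraction of eigenvalues <= x."""
    return float(np.mean(np.asarray(eigs) <= x))


rng = np.random.default_rng(1)
N, T = 50, 500
X = 1.0 + rng.standard_normal((N, T))  # constant offset plus unit noise
eigs = sample_covariance_eigs(X)

# All but one eigenvalue lie in the bulk (well below 3.0 for these sizes);
# the remaining one is the spike produced by the constant offset.
fraction_in_bulk = esd(eigs, 3.0)
largest = float(eigs[-1])
```

With these settings the ESD reaches (N - 1)/N just above the bulk and only jumps to 1 at the isolated largest eigenvalue, which is the numerical signature of a single spike.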

Generally speaking, the ESD of a random matrix is difficult to derive directly, especially after the matrix has undergone even elementary mathematical operations, so an indirect route is necessary before the derivation of the ESD. The Stieltjes transform, given in Definition 2, is an elementary but indispensable transformation in RMT.

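*Definition 2 (Stieltjes transform [39]).* Consider $F$ as a spectral distribution of a given matrix; its Stieltjes transform is defined as

$$m_F(z) = \int \frac{1}{\lambda - z}\, dF(\lambda), \tag{6}$$

where $z \in \mathbb{C}^{+} = \{z \in \mathbb{C} : \operatorname{Im}(z) > 0\}$ and $\operatorname{Im}(\cdot)$ denotes the imaginary part.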

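A correspondence exists between the spectral distribution and the Stieltjes transform, which can be described as the convergence characteristics of finite measures [40, 41]. For any distribution function $F$ with Stieltjes transform $m_F(z)$ and any continuity points $a < b$ of $F$, the inversion of the Stieltjes transform can be defined by

$$F(b) - F(a) = \lim_{y \to 0^{+}} \frac{1}{\pi}\int_{a}^{b} \operatorname{Im}\left[m_F(x + \mathrm{i}y)\right] dx, \tag{7}$$

where $\mathrm{i}$ is the imaginary unit.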
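For a finite set of eigenvalues, the Stieltjes transform reduces to an average of resolvent terms, and the inversion can be approximated by evaluating its imaginary part slightly above the real axis. The sketch below illustrates this; the function names and the smoothing offset `eps` are assumptions, and a finite `eps` yields a smoothed density rather than the exact limit.

```python
import numpy as np


def stieltjes(eigs, z):
    """Empirical Stieltjes transform m(z) = mean(1 / (lambda_i - z)), Im z > 0."""
    return np.mean(1.0 / (np.asarray(eigs, dtype=float) - z))


def density_from_stieltjes(eigs, x, eps=0.05):
    """Smoothed spectral density at x, computed as Im m(x + i*eps) / pi."""
    return float(stieltjes(eigs, x + 1j * eps).imag / np.pi)


eigs = [1.0, 2.0, 3.0]
grid = np.arange(-50.0, 50.0, 0.01)
density = np.array([density_from_stieltjes(eigs, x) for x in grid])
total_mass = float(density.sum() * 0.01)  # Riemann sum, close to 1
```

Each eigenvalue contributes a narrow bump of width `eps` around its location, so the recovered density integrates to approximately one and concentrates on the spectrum as `eps` shrinks.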

Although the Stieltjes transform is a way to deduce the ESD of a given matrix, in practical scenarios, only some simply structured random matrices admit such an explicit expression. For example, the classic Marchenko-Pastur Law (M-P Law) is a closed-form ESD of one particular type of random matrix [40]. The M-P Law offers a deeper insight into the correspondence between the ESD and its Stieltjes transform, which has become the foundation for deriving the ESD of complex matrices, as illustrated by the red line in Figure 3. The M-P Law has been commonly applied as a benchmark for anomaly detection in wireless networks [28, 33]. The asymptotic theoretical spectrum of the spike model can be obtained with Theorem 3.
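As a point of reference, the standard closed-form M-P density for a sample covariance matrix of pure noise can be written in a few lines. This sketch assumes an aspect ratio `c = N / T` in (0, 1] and noise variance `sigma2`; both parameter names are illustrative.

```python
import numpy as np


def marchenko_pastur_pdf(x, c, sigma2=1.0):
    """Marchenko-Pastur density for ratio c = N/T in (0, 1] and variance sigma2."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    lower = sigma2 * (1.0 - np.sqrt(c)) ** 2  # left edge of the bulk
    upper = sigma2 * (1.0 + np.sqrt(c)) ** 2  # right edge of the bulk
    pdf = np.zeros_like(x)
    inside = (x > lower) & (x < upper)
    xi = x[inside]
    pdf[inside] = np.sqrt((upper - xi) * (xi - lower)) / (2.0 * np.pi * sigma2 * c * xi)
    return pdf


x = np.linspace(0.0, 3.0, 30001)
pdf = marchenko_pastur_pdf(x, c=0.25)
total_mass = float(pdf.sum() * (x[1] - x[0]))  # close to 1 for c <= 1
```

The density vanishes outside the two bulk edges, so any eigenvalue observed clearly beyond the upper edge cannot be explained by noise alone, which is why the law serves as a benchmark.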

Theorem 3 (spike model [40]). *Given a matrix defined as in (4) and a non-Hermitian random matrix in (2) with i.i.d. zero-mean and unit variance Gaussian distribution entries, such that the ESD of converges to the function with and the Stieltjes transform . Denote and assume , positive and finite; then, the ESD of converges almost surely to a limit distribution with the Stieltjes transformation derivable from
*

The solution of satisfies the conditions of the Stieltjes transformation is . The theorem presents the Stieltjes transformation of given in (4) with an implicit equation. Notably, the deterministic matrix of in Theorem 3 can be generalized to any given matrix, and the rank of the matrix remains uncertain, which means the spike model can be generalized to a variety of data analysis scenarios.

In a practical scenario such as that shown in Figure 3, the bulk of the eigenvalue distribution mainly arises from the random noise or fluctuations of the stochastic component in (4), and the spike usually originates from the deterministic component in (4). With only one spike spotted in Figure 3, we can deduce that the deterministic matrix has only one nonzero eigenvalue, and its rank is 1. This is because the network traffic exhibits identical network behavior characteristics within the same ROI. Once we obtain the Stieltjes transform of the deterministic matrix from (2) and the variance of the random component, the ESD of the covariance matrix can be derived.

Firstly, let us compute the Stieltjes transform of the deterministic matrix in (2). Since the rank of is , the only nonzero eigenvalue of the covariance matrix of can be denoted as with the probability , while the other eigenvalues with the probability . Thereby, the Stieltjes transform of can be derived as

By means of the numerical operation of substituting (9) into (8), the solution of satisfies the conditions of the Stieltjes transformation constitutes . In turn, the ESD of the spike model covariance matrix can be derived by substituting the obtained into the inversion formula for the Stieltjes transform given in (7).

##### 3.3. Numerical Verification for the Data Model

In this subsection, we present validations for the data model by comparing the theoretical ESD with the practical eigenvalue distributions in three adjacent ROIs as depicted in Figure 1(a). The Stieltjes transform calculations are repeated for times with averaged results.

Figure 4 shows the validations of the RMT estimation of three adjacent regions, which are Convention Center, Shopping Center, and Central Park, respectively. The solid red line illustrates the theoretical RMT estimation of the ESD of the spike model covariance matrix, and the green histograms indicate the practical eigenvalue distributions of the raw data source. The theoretical ESD of the RMT estimation of the spike model can also be separated into two components, the bulk and the spike, which theoretically and practically converge to the proposed model.

**(a) Eigenvalues of Convention Center**

**(b) Eigenvalues of Shopping Center**

**(c) Eigenvalues of Central Park**

Figure 4(a) is the eigenvalue distribution of the Convention Center (Grid 5848), with the bulk more centralized. The theoretical RMT estimation corresponds to the empirical network traffic volume of the Convention Center with more perturbations. Figure 4(b) demonstrates the eigenvalue distribution of the Shopping Center (Grid 5849) with a more regular network status routine. Similarly, the deviations between the bulk and the spike grow larger from November to December for both the Convention Center and the Shopping Center, which suggests their ESD difference enlarges with time advancement.

Figure 4(c) demonstrates the eigenvalue distribution of the Central Park (Grids 5748 and 5749) with the most regular network status routine, as the number of people that go to the park remains almost constant. The deviations of the Central Park between the bulk and the spike almost stay static in November and December implying that the ESD difference of the Central Park hardly changes in the two months.

The verifications have proved the convergence of the spike model. The gaps between the theoretical RMT estimation and the empirical eigenvalue distribution are primarily caused by the estimation of and the limitation of the data size.

##### 3.4. Support of the ESD

A step further, we investigate the ESD separation phenomenon of the spike model. Generally speaking, the raw dataset can be influenced by various factors, which results in the separation of ESD to different components. As , the ESD deviation of the bulk and the spike can be deduced by deriving their support [42], which is denoted as and given in (11) and (12), respectively. The ESD separation of the covariance matrix is given in Lemma 4.

Lemma 4 (spike model support [42]). *Considering an eigenvalue , then equation (10) holds with probability :
and , where is the largest eigenvalue of the model if . Yet if , the corresponding support of the bulk and the spike can be derived as
where represents the approximate value and is defined as
*

From Lemma 4, we can deduce that the support intervals of the bulk given in (11) and (12) are largely dominated by the parameters of and in in (2). The spectral distribution of the matrix in (4) will give rise to a spike with a large enough . On the other hand, if is much smaller, the ESD will be a bulk. The deviation between the bulk and the spike is closely correlated to the deterministic matrix in (2), which can be utilized to evaluate the difference between different datasets.
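The dependence of the spike on the strength of the deterministic part can be checked numerically. The sketch below does not implement Lemma 4; instead it compares the largest sample eigenvalue with the upper edge of the M-P bulk for a constant offset `mu`. The sizes, the `margin` factor, and the covariance normalization are illustrative assumptions.

```python
import numpy as np


def top_eigenvalue(mu, sigma=1.0, n_kpis=50, n_samples=500, seed=0):
    """Largest eigenvalue of S = X X^T / T for X = mu + sigma * Z."""
    rng = np.random.default_rng(seed)
    X = mu + sigma * rng.standard_normal((n_kpis, n_samples))
    return float(np.linalg.eigvalsh(X @ X.T / n_samples)[-1])


def has_spike(mu, sigma=1.0, n_kpis=50, n_samples=500, margin=1.2, seed=0):
    """True if the top eigenvalue clearly exceeds the Marchenko-Pastur upper edge."""
    edge = sigma**2 * (1.0 + np.sqrt(n_kpis / n_samples)) ** 2
    return bool(top_eigenvalue(mu, sigma, n_kpis, n_samples, seed) > margin * edge)


no_offset = has_spike(mu=0.0)    # pure noise: spectrum stays in the bulk
with_offset = has_spike(mu=1.0)  # strong deterministic part: one eigenvalue detaches
```

The distance between the detached eigenvalue and the bulk grows with the offset, which is the quantity the divergence evaluation in the next section builds on.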

#### 4. Divergent Region Difference Evaluation for RRI

Since the theoretical ESD of the spike model matches the empirical eigenvalue distribution of the raw network traffic matrix, we identify ROIs by utilizing the spike model to reconstruct the network traffic data. An average divergence capacity model is proposed to mathematically quantify the divergent degree of adjacent regions for RRI.

##### 4.1. Average Divergence Capacity Model for RRI

In order to numerically quantify the divergent degree of adjacent regions with different datasets, the network traffic volume difference of adjacent regions is defined as (14) according to the spike model we proposed in Section 3. where and represent different adjacent regions. Moreover, (14) can be further expressed as where , and denote variances of adjacent regions and stands for the randomness that follows the Gaussian distribution with zero mean and unit variance entries. is defined as , which is the deterministic matrix that indicates the network characteristic difference of two adjacent regions.

We present an average divergence capacity model to numerically quantify the different divergent degrees of adjacent regions for ROI identification. Inspired by the definition of the channel capacity, we consider the difference matrix defined in (15) as a signal running through an additive white Gaussian noise channel. Thus, the average divergence capacity model can be analogically defined as where is an identity matrix and stands for the matrix transpose. The model can quantify the uncertainty of the data with a unit of bits from the perspective of information theory [43], thereby providing a numerical quantification measurement for the ROI identification problem. The evaluation model is a mapping of multidimensional raw support to evaluation results , which can be expressed as .
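A capacity-style measure of this kind can be sketched as follows. The exact expression of the paper's model is not reproduced here; the code assumes the familiar log-determinant form of a Gaussian channel capacity, namely (1/N) log2 det(I + Y Yᵀ / (T σ²)), where `Y` stands for an N x T difference matrix between two adjacent regions and `sigma2` for the noise variance. Both the functional form and the names are assumptions.

```python
import numpy as np


def average_divergence_capacity(Y, sigma2):
    """Capacity-style divergence in bits per dimension.

    Computes (1/N) * log2 det(I_N + Y Y^T / (T * sigma2)) for an N x T
    difference matrix Y. The functional form is an assumption modeled on
    the Gaussian channel capacity, not the paper's exact definition.
    """
    N, T = Y.shape
    gram = Y @ Y.T / (T * sigma2)
    # slogdet is numerically safer than log(det(...)) for larger matrices.
    _sign, logdet = np.linalg.slogdet(np.eye(N) + gram)
    return float(logdet / np.log(2.0) / N)


identical = average_divergence_capacity(np.zeros((4, 10)), sigma2=1.0)
different = average_divergence_capacity(np.ones((4, 10)), sigma2=1.0)
```

Under this assumed form, two regions with identical deterministic parts give a value of zero, and the value grows as their difference becomes larger relative to the noise level.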

##### 4.2. Parameter Estimation

Before we can employ the proposed average divergence capacity model to analyze different datasets, we need to compute the unknown parameter in (16). We apply a large dimensional approach (LDA) to calculation. The classical LDA assumes that the samples are numerous, i.e., , so it can accommodate much more diversities in the total samplings. As , the ratio of the matrix dimensions , and the distribution of the largest eigenvalue of converges almost surely to , where , and is the covariance. Whereas the distribution of the rest eigenvalues of converges almost surely to the parameter .

However, in practice, the parameter is most likely unknown, so we use the smaller eigenvalues to estimate . Thereby, the estimation of can be derived as where and are the eigenvalues of in an ascending order.
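The idea of estimating the noise parameter from the smaller eigenvalues can be illustrated with a short sketch. It assumes that the noise variance is approximated by the mean of all eigenvalues except the largest `n_spikes` ones, in the regime where samples greatly outnumber KPIs; the function name, the `n_spikes` argument, and the test values are hypothetical.

```python
import numpy as np


def estimate_noise_variance(X, n_spikes=1):
    """Estimate sigma^2 as the mean of the non-spike eigenvalues of X X^T / T."""
    T = X.shape[1]
    eigs = np.linalg.eigvalsh(X @ X.T / T)  # ascending order
    bulk = eigs[: len(eigs) - n_spikes]     # drop the largest n_spikes values
    return float(np.mean(bulk))


rng = np.random.default_rng(0)
# Synthetic region: constant offset 3.0, noise standard deviation 2.0 (variance 4.0).
X = 3.0 + 2.0 * rng.standard_normal((5, 20000))
sigma2_hat = estimate_noise_variance(X)
```

With many more samples than KPIs, the non-spike eigenvalues cluster tightly around the noise variance, so their mean is a stable estimate.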

The same results can also be obtained from (11) and (12). As , the eigenvalue set of and can be simplified to (18) and (19), respectively,

As , , and converge to , we hence obtain another result , which is the same estimation as (17).

#### 5. Experiments on RRI and Network Traffic Prediction

In this section, we conduct experiments on ROI identification using RRI and network traffic prediction with real-world datasets described in Section 2.1. Figure 5 displays the architecture of our proposed ROI identification method.

We present three comprehensive case studies of different ROIs by evaluating the average divergence capacity of adjacent single regions, derived from (16). The experiments on network traffic predictions of the three identified ROIs are given in Section 5.2. The three ROIs are Convention Center, Shopping Center, and Central Park.

##### 5.1. Case Studies of the RRI

In order to prove the effectiveness of the proposed ROI identification method, experiments on adjacent single regions are conducted.

###### 5.1.1. Identification for the ROI of Convention Center

The ROI identification starts with the Convention Center in region Grid 5848 as depicted in Figure 1(a), which has been verified with Google Maps [33, 34]. By evaluating the average divergence capacity of Grid 5848 with adjacent single regions, the ROI of the Convention Center can be identified. Figure 6(a) illustrates the average divergence capacity of region Grid 5848 with adjacent single regions.

**(a) ROI of Convention Center**

**(b) ROI of Shopping Center**

**(c) ROI of Central Park**

The black solid line indicates the average divergence capacity between the Convention Center (Grid 5848) and the adjacent single region (Grid 5847), which decreases gradually over time. It suggests that the regional boundary between Grids 5848 and 5847 is becoming blurred; in other words, the area network traffic characteristics of the two adjacent single grids are becoming more similar. The red dashed line and the blue dotted line stand for the average divergence capacities between the Convention Center and the adjacent regions of Grids 5849 and 5748, which intensify as time advances. The intensification of the regional differences (between Grid 5848 and Grids 5849 and 5748) indicates that the area network traffic characteristics of the three adjacent regions are becoming more diversified. Therefore, the ROI of the Convention Center can be extended to a bigger area, with Grids 5848 and 5847 aggregated.

Similar operations are performed on the other two regions of the Shopping Center and Central Park, and their ROIs can be identified as well.

###### 5.1.2. Identification for the ROI of Shopping Center

The ROI identification begins with the Shopping Center in the region of Grid 5849 as denoted by the orange shade in Figure 1(a), which has also been verified with Google Maps. We conduct similar computations of the average divergence capacity of Grid 5849 with adjacent single regions (Grids 5850 and 5749) to identify the ROI of the Shopping Center. The results are illustrated in Figure 6(b), from which we can observe that the average divergence capacity of Grids 5849 and 5850 declines quickly; thus, the two adjacent single regions can be aggregated into one ROI. In contrast, the average divergence capacity of Grid 5849 with Grids 5848 and 5749 increases with time, which provides evidence that the latter two Grids do not belong to the ROI of the Shopping Center.

###### 5.1.3. Identification for the ROI of Central Park

The ROI identification starts with Central Park in the regions of Grids 5748 and 5749 as indicated by the green shade in Figure 1(a), which has been verified with Google Maps as well. Similarly, we compute the average divergence capacity of Grids 5748 and 5749 with adjacent single regions (Grids 5648 and 5649) to identify the ROI of Central Park. The results are displayed in Figure 6(c). Again, two contrasting tendencies are clearly observable. The average divergence capacities of Grids 5748 and 5749 with Grids 5648 and 5649 decline substantially over time in a similar pattern, which indicates that the four adjacent single regions can be aggregated into one ROI. On the other hand, the average divergence capacities of Grids 5748 and 5749 with Grids 5848 and 5849 intensify over time, albeit at slightly different rates, which suggests that the former two Grids (5748 and 5749) do not belong to the same ROI as the latter two.

The aggregated adjacent ROIs obtained from RRI are presented in Figure 7, with the blue shade denoting the ROI of the Convention Center, the orange that of the Shopping Center, and the green that of Central Park.
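The decision rule used in the three case studies, aggregating a neighbor when its divergence capacity shrinks over time and excluding it when the divergence intensifies, can be sketched as below. Using the sign of a least-squares slope as the trend criterion is an assumption made for illustration, and the capacity values in the example are made up; they are not results from the experiments.

```python
def slope(values):
    """Least-squares slope of a series against its time index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to estimate a trend")
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def identify_roi(seed_grid, capacity_series):
    """Aggregate neighbors whose divergence capacity with the seed grid decreases.

    capacity_series: {neighbor_grid: [capacity at successive time steps]}.
    """
    roi = {seed_grid}
    for neighbor, series in capacity_series.items():
        if slope(series) < 0:  # shrinking divergence: regions grow similar
            roi.add(neighbor)
    return roi


# Made-up capacity trends for the neighbors of Grid 5848 (illustration only).
capacities = {
    5847: [0.9, 0.8, 0.6, 0.5],  # decreasing
    5849: [0.4, 0.5, 0.7, 0.8],  # increasing
    5748: [0.3, 0.4, 0.4, 0.6],  # increasing
}
convention_center_roi = identify_roi(5848, capacities)
```

A more robust variant might require the decrease to exceed a threshold or to persist over several windows, but the sign test is enough to reproduce the qualitative reasoning of the case studies.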

##### 5.2. Case Studies on Network Traffic Prediction in ROIs Identified by RRI

Accurate and timely network traffic prediction plays a pivotal role in intelligent resource allocation [17]. When the divergence between adjacent regions substantially decreases, the aggregation of adjacent regions to one dataset contributes to the improvement of the network traffic prediction performance in the identified ROI.

In order to demonstrate the strength of the aggregated ROI, we apply three neural network-based schemes to predict the network traffic volumes hour by hour, including LSTM, GRU [20], and SAEs [9]. The three prediction methods share the same parameter settings with hidden layers and as the sigmoid activation function in performance evaluation.
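Hour-by-hour prediction with recurrent or autoencoder models is commonly set up as a supervised problem over sliding windows. The sketch below shows one such construction; the `lookback` length and the helper name are assumptions, since the window length used in the experiments is not stated here.

```python
def make_windows(series, lookback):
    """Turn an hourly series into (inputs, targets) pairs for one-step prediction.

    Each input holds `lookback` consecutive values and the target is the
    value of the following hour.
    """
    inputs, targets = [], []
    for start in range(len(series) - lookback):
        inputs.append(list(series[start:start + lookback]))
        targets.append(series[start + lookback])
    return inputs, targets


hourly_traffic = [1, 2, 3, 4, 5]  # toy values for illustration
inputs, targets = make_windows(hourly_traffic, lookback=3)
```

For an aggregated ROI, windows built from each member grid can be concatenated into one training set, while the test windows come from the held-out week of the target grid.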

We apply two metrics to evaluate the prediction performance of the three schemes on aggregated and single ROIs. The first one is root mean square error (RMSE), which measures the difference between the predicted network traffic volumes and the ground truth volumes as defined in

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t - \hat{y}_t\right)^2}, \tag{20}$$

where $T$ is the total time, $y_t$ is the observed network traffic volume, and $\hat{y}_t$ is the predicted network traffic volume.

The second evaluation index is mean absolute error (MAE), which measures the average of absolute differences between the predicted volumes and the ground truth volumes as defined in
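The two metrics can be computed as in the following minimal sketch, which assumes the standard definitions of RMSE and MAE over equally long sequences of observed and predicted volumes. The function names and the sample values are illustrative and do not come from the paper's data.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted volumes."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and equally long")
    squared = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def mae(observed, predicted):
    """Mean absolute error between observed and predicted volumes."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and equally long")
    absolute = sum(abs(y - p) for y, p in zip(observed, predicted))
    return absolute / len(observed)


# Hypothetical hourly volumes used only to exercise the functions.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 18.0]
print(rmse(observed, predicted), mae(observed, predicted))
```

Because RMSE squares each error before averaging, it penalizes occasional large misses more heavily than MAE does, which is why the two metrics are usually reported together.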

###### 5.2.1. ROI of Convention Center

The identified ROI of the Convention Center is the aggregation of Grids 5847 and 5848 as denoted by the blue shade in Figure 5. The data used for the prediction of the Convention Center ROI come from the CDR dataset of the two Grids (5847 and 5848) [31]. Specifically, the training data consist of the aggregated CDR dataset of Grids 5847 and 5848, while the testing data comprise the CDR dataset of Grid 5848 from December 14th to 20th. The prediction results of the three methods on the identified ROI of the Convention Center with aggregated single regions are presented in Figure 8(a).

**Figure 8:** (a) Prediction for ROI of Convention Center. (b) Prediction for ROI of Shopping Center. (c) Prediction for ROI of Central Park.

###### 5.2.2. ROI of Shopping Center

The identified ROI of the Shopping Center is the aggregation of Grids 5849 and 5850, as indicated by the orange shade in Figure 5. The aggregated data of Grids 5849 and 5850 are used as the training set, while the data of Grid 5850 from December 14th to 20th are selected as the testing set. The prediction results of the three methods on the identified ROI of the Shopping Center with aggregated single regions are shown in Figure 8(b).

###### 5.2.3. ROI of Central Park

The identified ROI of Central Park is the aggregation of Grids 5748, 5749, 5648, and 5649, as denoted by the green shade in Figure 5. The training data comprise the aggregated CDR dataset of Grids 5748, 5749, 5648, and 5649 [31], while the data of Grids 5748 and 5749 from December 14th to 20th constitute the testing set. The prediction results of the three methods on the identified ROI of Central Park with aggregated single regions are illustrated in Figure 8(c).
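The data preparation shared by the three case studies (train on the aggregated grids, test on a single grid for December 14th to 20th) might be sketched as follows. Several points are assumptions, since the text does not specify them: "aggregated" is taken to mean an hourly sum across grids, the training window is taken to end where the testing week begins, and the year, the helper names `aggregate` and `split`, and the synthetic volumes are placeholders.

```python
from datetime import datetime, timedelta


def aggregate(series_by_grid):
    """Sum hourly volumes across grids; each series maps timestamp -> volume."""
    total = {}
    for series in series_by_grid.values():
        for ts, volume in series.items():
            total[ts] = total.get(ts, 0.0) + volume
    return dict(sorted(total.items()))


def split(series_by_grid, target_grid, test_start, test_end):
    """Train on the aggregated series before test_start; test on one grid."""
    aggregated = aggregate(series_by_grid)
    train = [(ts, v) for ts, v in aggregated.items() if ts < test_start]
    target = sorted(series_by_grid[target_grid].items())
    test = [(ts, v) for ts, v in target if test_start <= ts < test_end]
    return train, test


YEAR = 2013  # placeholder year, an assumption for this sketch
start = datetime(YEAR, 12, 1)
hours = [start + timedelta(hours=h) for h in range(24 * 20)]
# Synthetic constant volumes standing in for the CDR data of two grids.
data = {
    5847: {ts: 1.0 for ts in hours},
    5848: {ts: 2.0 for ts in hours},
}
train, test = split(data, 5848, datetime(YEAR, 12, 14), datetime(YEAR, 12, 21))
print(len(train), len(test))
```

The exclusive upper bound of December 21st keeps all 24 hours of December 20th in the testing set.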

##### 5.3. Discussion on the Prediction Performance of Identified ROIs

We evaluate the performance of the three prediction methods on the identified ROIs by means of the RMSE and MAE metrics in order to demonstrate the strengths of ROI identification with aggregated adjacent single regions.

Table 2 compares the single-region and aggregated-region prediction performance of the three schemes on the identified ROIs in terms of RMSE and MAE. A general pattern that emerges from Table 2 is that, for all three prediction schemes, both the RMSE and MAE results on the identified ROIs with aggregated regions are much smaller than those with single regions. In particular, the largest RMSE difference is found with the GRU prediction scheme on the identified Convention Center ROI, which is a decrease of percent from to . The largest MAE difference is observed with the SAE prediction scheme on the identified Convention Center ROI, which is a decline of percent from to . Based on this evidence, we conclude that the identified ROIs can substantially improve the performance of network traffic prediction.
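The percentage decreases quoted in this comparison are relative reductions of an error metric when moving from single-region to aggregated-region training. A minimal sketch of that arithmetic is given below; the function name and the input values are hypothetical and are not the entries of Table 2.

```python
def percent_decrease(single, aggregated):
    """Relative reduction (in percent) from the single-region error."""
    if single <= 0:
        raise ValueError("single-region error must be positive")
    return (single - aggregated) / single * 100.0


# Hypothetical error values chosen only to illustrate the calculation.
print(percent_decrease(20.0, 15.0))
```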

#### 6. Conclusion

In this paper, we propose a novel RRI method, which utilizes RMT to analyze the dynamic network traffic characteristics between adjacent regions for ROI identification. By means of RMT, we derive the empirical spectral distribution of the covariance matrix to verify the validity of the spike model. To evaluate the degree of divergence of the identified ROIs, we employ an average divergence capacity model that captures the differences in network characteristics with respect to time and region. From this model we conclude that the diversity of network traffic in different regions varies over time, and that an aggregated ROI can be identified when the diversity between adjacent regions diminishes. With the proposed RRI method, we are able to provide more accurate predictions of the network traffic in the identified ROIs, which will contribute to improved system performance, in particular with respect to energy efficiency and resource allocation.

#### Data Availability

Data used to support the findings of this study are available at https://doi.org/10.7910/DVN/QJWLFU.

#### Disclosure

A preprint has previously been published in [30].

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work was supported by the National Key R&D Program of China (2020YFB1806602).