#### Abstract

This paper investigates the optimal bandwidth parameter of the GPH algorithm. First, drawing on the stylized facts of financial time series, we generate long memory sequences with the ARFIMA (1, *d*, 1) process. Second, we use the Monte Carlo method to study how the bandwidth parameter of the GPH algorithm affects the existence test of long memory, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter. The results show that the accuracy of all three aspects of the long memory test reaches a relatively high level within the bandwidth parameter interval 0.5 < *a* < 0.7. For time series of different lengths, the bandwidth parameter *a* = 0.6 can be used as the optimal choice for GPH estimation. Furthermore, we report the accuracy of the GPH algorithm for the existence test, the persistence or antipersistence judgment, and the estimation of the long memory parameter *d* when *a* = 0.6.

#### 1. Introduction

Long memory is widespread in biology, medicine, geology, hydrology, climate, and the social sciences [1–3]. It means that observations depend on each other over long horizons, so that the autocorrelation function of a sequence decays slowly. In a system with long memory, some important historical events influence the future over long time spans, which contributes to the formation of long memory. For example, large price rises and falls in stock markets and extreme high and low temperatures have been shown to affect the long memory of the corresponding sequences [4, 5]. According to the relationship between long memory and approximate entropy revealed by Pincus and Kalman, the stronger the long memory of a sequence, the better its predictability [6]. In addition, Baillie found that if a time series has long memory, its internal structure is difficult to characterize with short memory models such as the ARMA model, and the simulation and prediction accuracy of those models is comparatively low [7]. Research on long memory is therefore important in both theory and practice [8, 9]. There are two types of long memory in time series. One is persistent long memory, which means that the future trend of the series tends to follow its current direction of movement; the corresponding long memory parameter satisfies 0 < *d* < 0.5. The other is antipersistent long memory, which means that future movement tends to be opposite to the present state; its long memory parameter satisfies −0.5 < *d* < 0. It is generally believed that the British hydrologist Hurst was the first to study long memory in a system. He used the Hurst index (*H*) to describe the strength of long memory in a time series [10]. The relationship between the Hurst index and the long memory parameter is *H* = *d* + 0.5 [11, 12].
When *H* → 1 (or *d* → 0.5), the persistent long memory of a time series is strong. When *H* → 0 (or *d* → −0.5), the antipersistent long memory is strong. When *H* = 0.5 (or *d* = 0), the time series behaves like a random walk, which suggests that it is unpredictable in theory.

So far, there are no fewer than ten methods for calculating long memory parameters, which can be roughly divided into three categories. The first category is estimated in the time domain and includes Aggregated Variance, Differencing the Variance [13], Higuchi [14], R/S Analysis, and Detrended Fluctuation Analysis (DFA) [15]. The second is frequency-domain estimation, such as Whittle and Averaged Periodogram Estimation [16, 17]. The third is wavelet-domain estimation, such as the Wavelet Maximum Estimator and Wavelet-Based Estimation [18, 19]. Building on these algorithms, researchers have proposed many improved estimators, such as the modified Rescaled Range [20], the exact local Whittle method, and the modified local Whittle estimation [21, 22]. However, with time-domain estimators it is difficult to judge the significance of long memory, because the statistical distribution of the estimated parameter cannot be given. With wavelet-domain estimators, the requirements on the structural features of a sequence are often too strict for the modulus to be extracted correctly, which sometimes makes the results differ from qualitative analysis. Full-parameter estimators in the frequency domain, such as Whittle estimation, require the random disturbance term to be Gaussian and involve integral operations, conditions that are difficult to meet in practice. For example, it is well known that the distribution of return series in stock markets has a sharp peak and heavy tails [23, 24]. If the time-domain and wavelet-domain estimators are regarded as nonparametric methods, the semiparametric method in the frequency domain has gradually developed as a compromise between the full-parameter and nonparametric methods.
Unlike the previous methods, the GPH estimator proposed by Geweke and Porter-Hudak [25] has clear advantages as a semiparametric estimator: it relaxes the normality requirement on the random term, and the statistical distribution of the estimator is available within a certain range. Within the framework of the GPH method, several improved algorithms for estimating the long memory of sequences were proposed [26, 27], which extend the GPH method to different long memory sequences, simplify the concept, and improve the computational speed. Robinson established the asymptotic normality of the GPH estimator and showed that it is suitable for stationary and invertible Gaussian vector sequences [28]. Hurvich et al. established the asymptotic properties of the GPH estimator and derived expressions for its asymptotic bias, variance, and mean square error, which allow the accuracy of the asymptotic theory for the mean square error to be evaluated in finite samples [29]. On this basis, Velasco generalized Robinson’s results, showing that, with adequate data tapers, the revised estimates are consistent and asymptotically normal for any *d* (including nonstationary and noninvertible processes) [30]. In addition, in studying long-range dependent linear time series, Velasco proved the consistency of the log-periodogram regression estimate of the long memory parameter and obtained its asymptotic distribution under possibly non-Gaussian observations [31]. Nevertheless, from the application point of view, the GPH method remains a basic estimation method [32, 33]. Moreover, the newer methods require programming, whereas the GPH algorithm is already available through menu-based operation in some econometrics software, such as OX and R.
Therefore, the GPH method is indispensable for estimating long memory because of its wide applicability and ease of use. However, there are three problems when using GPH to test the long memory of time series (financial data) [34–37]. First, for the bandwidth $m = N^{a}$ (*N* is the sequence length), most studies simply choose *a* = 0.5, 0.6, 0.7, or 0.8, or fix the bandwidth directly, so the choice reflects the authors' subjective judgment. Second, there is no clear understanding of how the bandwidth parameter influences the GPH results for the existence of long memory, its persistence or antipersistence, and the accuracy of the estimated long memory parameter. Third, as Jeong et al. [38] pointed out, a common problem with GPH and other estimators of long memory is that few authors have conducted simulation analyses close to actual sequences to test the accuracy of the estimated parameter *d*.

In this paper, based on the ARFIMA (1, *d*, 1) process and some typical features of financial time series, we use the Monte Carlo method to test the impact of the bandwidth parameter *a* on the existence test of long memory, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter *d*, so as to give the optimal bandwidth for the GPH algorithm.

The structure of this paper is arranged as follows: Section 2 is the introduction of the GPH method. Section 3 gives the Monte Carlo simulation method and validation rules. Section 4 is the analysis of simulation results. The conclusion is summarized in Section 5.

#### 2. The GPH Semiparametric Method of Long Memory Estimation

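Many studies [39] show that the GPH semiparametric method rests on the assumption that the data follow a fractional white noise process. Such a process $X_t$ satisfies $(1 - B)^{d} X_t = \varepsilon_t$, where $B$ is the backshift operator and $\varepsilon_t$ is a stationary process. If $f_{\varepsilon}(\omega)$ is the spectral density function of $\varepsilon_t$, then the spectral density function of $X_t$ can be expressed as

$$f_X(\omega) = \left[4\sin^{2}\left(\frac{\omega}{2}\right)\right]^{-d} f_{\varepsilon}(\omega). \tag{1}$$

Taking logarithms of equation (1) and evaluating it at discrete frequencies gives

$$\ln f_X(\omega_j) = \ln f_{\varepsilon}(0) - d\,\ln\left[4\sin^{2}\left(\frac{\omega_j}{2}\right)\right] + \ln\left[\frac{f_{\varepsilon}(\omega_j)}{f_{\varepsilon}(0)}\right], \tag{2}$$

where $\omega_j = 2\pi j/N$, $j = 1, 2, \ldots, m$, and $m = N^{a}$. $N$ is the length of the sequence $X_t$, and $\omega_j$ is called the harmonic frequency of the sample data. Geweke and Porter-Hudak proved that the last term in equation (2) is negligible or close to a constant at sufficiently small harmonic frequencies. Therefore, Ordinary Least Squares (OLS) can be applied to equation (2) to estimate the long memory parameter $d$.

Besides, when $d < 0$, Geweke and Porter-Hudak showed that the estimator $\hat{d}$ from equation (2) has the approximate distribution

$$\hat{d} \sim N\left(d,\ \frac{\pi^{2}}{6\sum_{j=1}^{m}\left(U_j - \bar{U}\right)^{2}}\right), \tag{3}$$

where $U_j = \ln\left[4\sin^{2}(\omega_j/2)\right]$ and $\bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j$. For $d > 0$, equation (3) is verified empirically, and its theoretical proof is still an open question. However, for actual sequences, it is difficult to know the true value of $d$ in advance. Once $d$ has been estimated from equation (2), the existence of long memory can be validated by judging whether the estimator $\hat{d}$ is significantly different from 0. Ignoring the last term and adding $\ln I(\omega_j)$ to both sides, equation (2) is transformed as follows:

$$\ln I(\omega_j) = \ln f_{\varepsilon}(0) - d\,\ln\left[4\sin^{2}\left(\frac{\omega_j}{2}\right)\right] + \ln\left[\frac{I(\omega_j)}{f_X(\omega_j)}\right], \tag{4}$$

where $I(\omega_j)$ is the periodogram, i.e., the squared magnitude of the discrete Fourier transform of the sequence, up to a normalizing constant. Porter-Hudak proved that $\ln\left[I(\omega_j)/f_X(\omega_j)\right]$ obeys the Gumbel distribution with mean $-0.57721$ (the negative of Euler's constant) and variance $\pi^{2}/6$. Hence, equation (4) is further simplified to

$$Y_j = \beta_0 + \beta_1 U_j + e_j, \tag{5}$$

where $Y_j = \ln I(\omega_j)$, $\beta_0 = \ln f_{\varepsilon}(0) - 0.57721$, $\beta_1 = -d$, and $e_j$ has zero mean and variance $\pi^{2}/6$ under the large sample scenario. The parameter $d$ can be estimated by equation (5). The existence of long memory in a sequence can then be tested as follows:

$$H_0: d = 0, \qquad H_1: d \neq 0, \qquad t_{\hat{d}} = \frac{\hat{d}}{\operatorname{se}(\hat{d})}, \tag{6}$$

where $\operatorname{se}(\hat{d})$ is the OLS standard error of $\hat{d}$ from equation (5). When the sample size is large, the *t* distribution approximates the normal distribution, and the statistical test of the estimator by equation (6) is approximately equal to that of equation (3). Given a significance level $\alpha$, we can test whether the long memory parameter $d$ differs from zero. In this paper, taking into account the robustness of the GPH algorithm in estimating long memory, a fixed value of $\alpha$ is used throughout. For a large sample, Agiakloglou et al. and Sowell mentioned that equation (5) can still estimate the long memory of a sequence even if it contains short-term components, as in the ARFIMA process [39, 40]. In addition, Geweke and Porter-Hudak demonstrated the relationship between the long memory parameter and the Hurst exponent by the structured method, i.e., $H = d + 0.5$.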
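The log-periodogram regression described in this section can be sketched in a few lines. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, takes the number of frequencies as the integer part of $N^{a}$, normalizes the periodogram by $2\pi N$, and uses illustrative names (`gph_estimate`, `se_asym`) that do not appear in the paper.

```python
import numpy as np

def gph_estimate(x, a=0.6):
    """Log-periodogram (GPH) estimate of d using m = floor(N**a) frequencies.

    Returns (d_hat, t_stat, se_asym), where t_stat uses the OLS standard
    error and se_asym is the asymptotic value sqrt(pi^2 / (6 * Sxx)).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    m = int(np.floor(n ** a))  # assumption: integer part of N^a
    # Periodogram at the first m harmonic frequencies w_j = 2*pi*j/N.
    fft = np.fft.fft(x - x.mean())
    j = np.arange(1, m + 1)
    w = 2.0 * np.pi * j / n
    periodogram = np.abs(fft[1:m + 1]) ** 2 / (2.0 * np.pi * n)
    y = np.log(periodogram)
    u = np.log(4.0 * np.sin(w / 2.0) ** 2)
    # OLS of y on u; the slope estimates -d.
    u_c = u - u.mean()
    sxx = np.sum(u_c ** 2)
    slope = np.sum(u_c * (y - y.mean())) / sxx
    d_hat = -slope
    # Residual variance with m - 2 degrees of freedom gives the OLS t statistic.
    resid = (y - y.mean()) - slope * u_c
    s2 = np.sum(resid ** 2) / (m - 2)
    t_stat = d_hat / np.sqrt(s2 / sxx)
    se_asym = np.sqrt(np.pi ** 2 / (6.0 * sxx))
    return d_hat, t_stat, se_asym
```

Calling `gph_estimate(x, a=0.6)` on a series `x` returns the estimate of *d* together with a *t* statistic that can be compared with a critical value chosen by the user.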

#### 3. Simulation Method and Validation Rules

##### 3.1. Simulation Method

In empirical financial research, it is generally believed that a first-order model can adequately describe the autocorrelation and fluctuation of financial time series [41]. Taking into account the typical features of financial time series, such as a sharp peak, heavy tails, asymmetric distribution, and long memory, this paper uses the ARFIMA (1, *d*, 1) model with the Skewed Student’s *t* Distribution (SKST) to generate simulated data close to actual sequences, so as to test the impact of the bandwidth parameter of the GPH algorithm on long memory estimation. The ARFIMA (1, *d*, 1) model is expressed as

$$(1 - \phi B)(1 - B)^{d} X_t = (1 + \theta B)\,\varepsilon_t, \tag{7}$$

where $B$ is the backshift operator and $\varepsilon_t$ follows the SKST distribution with skewness coefficient $\xi$ and $\nu$ degrees of freedom. We randomly select $\xi$ from (−3, 3) and set $\nu$ = 4 in this paper. $\phi$ and $\theta$ are the autoregressive (AR) coefficient and the moving average (MA) coefficient, respectively. Most first-order autoregressive and moving average coefficients in models of financial time series are found to lie between −1 and 1. Thus, $\phi$ and $\theta$ are drawn randomly from (−1, 1). Using equation (7), we generate nine types of data with long memory parameters *d* = −0.4, −0.3, −0.2, −0.1, 0, 0.1, 0.2, 0.3, and 0.4. Because the performance of the GPH algorithm depends on the sequence length, we generate 5000 sequences of each length *N* = 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 5000, 10000, and 50000 for each long memory parameter. Hence, there are 9 × 14 × 5000 = 630000 sequences in total. Figure 1 shows one simulated sequence and its probability distribution. It can be seen that the simulated data have a sharp peak, a heavy tail, and asymmetry. When applying the GPH algorithm to sequences with long memory, most of the literature suggests only a few specific bandwidth parameters [42, 43].
In order to fully understand the influence of different bandwidth parameters on the GPH estimation under different long memory parameters and sequence lengths, this paper varies the bandwidth parameter *a* over a grid with a discretization step size of 0.01, where $m = N^{a}$.
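One way to generate data of the kind described above is sketched below. This is a hedged illustration, not the authors' generator: it assumes NumPy, uses a Fernández-Steel style construction for the skewed Student's *t* innovations with a positive parameter `gamma` (how this maps to the skewness coefficient drawn from (−3, 3) is not specified in the text), does not standardize the innovations, and approximates fractional integration by a truncated moving average with a burn-in period. All function and parameter names are hypothetical.

```python
import numpy as np

def skewed_t(rng, size, gamma, nu):
    """Fernandez-Steel style skewed Student's t draws (not standardized)."""
    t = np.abs(rng.standard_t(nu, size=size))
    # A draw is positive with probability gamma^2 / (1 + gamma^2);
    # positive draws are scaled by gamma, negative draws by 1 / gamma.
    positive = rng.random(size) < gamma ** 2 / (1.0 + gamma ** 2)
    return np.where(positive, t * gamma, -t / gamma)

def simulate_arfima(n, d, phi, theta, rng, gamma=1.0, nu=4.0, burn=1000):
    """Simulate (1 - phi B)(1 - B)^d X_t = (1 + theta B) e_t by truncation."""
    total = n + burn
    e = skewed_t(rng, total, gamma, nu)
    # ARMA(1,1) step: u_t = phi * u_{t-1} + e_t + theta * e_{t-1}.
    u = np.empty(total)
    u[0] = e[0]
    for t in range(1, total):
        u[t] = phi * u[t - 1] + e[t] + theta * e[t - 1]
    # Weights of (1 - B)^(-d): psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k.
    k = np.arange(1, total)
    psi = np.concatenate(([1.0], np.cumprod((k - 1 + d) / k)))
    # Direct convolution costs O(total^2); adequate for a sketch only.
    x = np.convolve(u, psi)[:total]
    return x[burn:]
```

For example, `simulate_arfima(1000, 0.3, 0.5, -0.2, np.random.default_rng(1))` returns one series of length 1000. For the longest lengths used in the paper, an FFT-based convolution would be a more practical choice than the direct one shown here.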

Figure 1: One simulated sequence and its probability distribution, panels (a) and (b).

##### 3.2. Validation Rules

The impact of the bandwidth parameter *a* on testing long memory in sequences is analyzed from three aspects: the existence of long memory, the persistence or antipersistence of long memory, and the estimation accuracy of the long memory parameter for different sequence lengths. In principle, the first two aspects should be validated sequentially: the existence of long memory is tested first, and only if a sequence has long memory is its persistence or antipersistence judged. However, our simulation results show that analyzing the effect of the bandwidth parameter in this sequential way loses a lot of useful information. The optimal range of the parameter *a* derived from the existence test alone may be a very small interval, or even a single point. It would then be difficult to fully investigate the impact of different bandwidth parameters on the persistence or antipersistence judgment and on the estimation accuracy of the long memory parameter *d*, which is not conducive to finding the optimal bandwidth parameter *a*. To this end, we set the following rules to select the optimal bandwidth parameter *a*.

*Rule 1.* Based on Monte Carlo simulation and the GPH algorithm, the optimal bandwidth parameter sets for the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter are recorded as $A_1$, $A_2$, and $A_3$, respectively.

*Rule 2.* According to the different requirements on testing long memory of sequences by the GPH algorithm, three related subsets of optimal parameters *a* are constructed, that is, $A_1$, $A_1 \cap A_2$, and $A_1 \cap A_2 \cap A_3$. $A_1 \cap A_2$ denotes the optimal parameter set that satisfies the existence test and the persistence or antipersistence judgment of long memory simultaneously. $A_1 \cap A_2 \cap A_3$ represents the optimal parameter set that satisfies the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter. By this definition, the judging accuracy of the existence test is higher for parameters belonging to $A_1$ than for parameters outside it. The same holds for the sets $A_1 \cap A_2$ and $A_1 \cap A_2 \cap A_3$ with respect to their own criteria.

*Rule 3. *Given a time series, we assume that the probability of its long memory and no long memory is equal, i.e., 0.5. And, for time series with long memory, the probability of different long memory parameters is equal.

###### 3.2.1. The Existence Test of Long Memory

The null hypothesis is that the sequence has no long memory (*d* = 0), and the alternative is that long memory exists (*d* ≠ 0). For each bandwidth parameter *a*, the *t* statistic of the estimate of *d* is computed for every simulated sequence. For a given *a* and true *d*, we count the number of estimates that reject the null hypothesis, that is, the number of sequences whose *t* statistic is significant, and the number of estimates that do not reject it. When *d* ≠ 0, the judging accuracy of the existence test under bandwidth parameter *a* and long memory parameter *d* is the proportion of simulated sequences that reject the null hypothesis; the closer it is to 1, the more accurate the GPH algorithm is. When *d* = 0, the judging accuracy is the proportion of sequences that do not reject the null hypothesis, which measures the accuracy of the existence test on sequences with no long memory. To judge comprehensively the ability of the GPH algorithm on sequences with different long memory, we take the mean of the judging accuracy over the nonzero values of *d*. The accuracy for *d* = 0 is kept alongside this mean for comparative analysis. In practice, it is impossible to know in advance whether a sequence has long memory. Therefore, a comprehensive judging accuracy that combines the two cases is constructed to test whether long memory exists under different bandwidth parameters *a*.
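The accuracy measures for the existence test can be illustrated with a short sketch. It is a hedged example under stated assumptions: the critical value 1.96 stands in for the paper's significance level, which is not recoverable here, and the equal weighting of the two cases is one reading of Rule 3. The function names are hypothetical.

```python
import numpy as np

def existence_accuracy(t_stats, has_long_memory, t_crit=1.96):
    """Share of replications whose test decision matches the truth."""
    rejected = np.abs(np.asarray(t_stats, dtype=float)) > t_crit
    # For d != 0 a rejection is correct; for d == 0 a non-rejection is correct.
    return float(np.mean(rejected if has_long_memory else ~rejected))

def combined_existence_accuracy(acc_by_d):
    """Weight d == 0 and the average over d != 0 equally (assumed from Rule 3)."""
    nonzero = [acc for d, acc in acc_by_d.items() if d != 0]
    return 0.5 * acc_by_d[0] + 0.5 * float(np.mean(nonzero))
```

Here `acc_by_d` is a dictionary that maps each true value of *d* (including 0) to its judging accuracy for one bandwidth parameter *a*.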

###### 3.2.2. The Persistence or Antipersistence Judgment of Long Memory

As *d* > 0 and *d* < 0 denote persistent and antipersistent long memory, respectively, we construct judging accuracies for the two cases to study the impact of the GPH algorithm with different bandwidth parameters *a* on the long memory test.

For a given bandwidth parameter *a* and true parameter *d* ≠ 0, we count the number of estimates whose sign agrees with that of the true *d*. The judging accuracy of persistence or antipersistence under long memory parameter *d* and bandwidth parameter *a* is this count divided by the number of simulated sequences. Averaging it over the positive values of *d* gives the judging accuracy of persistent long memory, and averaging it over the negative values gives that of antipersistent long memory. The comprehensive judging accuracy of persistence and antipersistence combines the two.
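The sign-agreement count described in this subsection amounts to the following minimal sketch (assuming NumPy; the function name is illustrative and not taken from the paper).

```python
import numpy as np

def sign_accuracy(d_hats, d_true):
    """Share of estimates whose sign agrees with the sign of the true d."""
    d_hats = np.asarray(d_hats, dtype=float)
    # np.sign returns -1, 0, or 1, so an estimate of exactly 0 never matches.
    return float(np.mean(np.sign(d_hats) == np.sign(d_true)))
```

Averaging `sign_accuracy` over the positive true values of *d* gives a persistence accuracy, and averaging over the negative values gives an antipersistence accuracy.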

###### 3.2.3. The Estimation Accuracy of the Long Memory Parameter

According to the simulation results, if the error rate is used to verify the precision of the GPH algorithm, the error rates span several orders of magnitude, which is not conducive to finding the optimal range of the bandwidth parameter *a*. In this paper, the following rule is used instead. If the estimate $\hat{d}$ falls within the $\delta$ neighborhood of the true value $d$, i.e., $|\hat{d} - d| < \delta$, the estimate produced by the GPH algorithm is considered valid under the accuracy $\delta$. A basic principle for selecting $\delta$ is that the neighborhoods of different parameters *d* do not overlap. The estimation efficiency of the GPH algorithm under long memory parameter *d* and bandwidth parameter *a* is defined as the number of estimates falling in the $\delta$ neighborhood of *d* divided by the number of simulated sequences.

The average estimation efficiency is the mean of this quantity over the different long memory parameters. The larger it is, the higher the estimation efficiency of the GPH algorithm under the estimation accuracy $\delta$. Generally, the smaller $\delta$ is, the farther apart the neighborhoods of different parameters are and the fewer estimates fall into them, so the discriminating power of the efficiency measure may decline, which is not conducive to finding the optimal range of the bandwidth parameter *a*. The value of $\delta$ used in this paper is set according to the long memory parameters in the Monte Carlo simulation.
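The neighborhood rule above can be written as a one-line proportion. This is a minimal sketch assuming NumPy; the tolerance argument `tol` and the strict inequality are assumptions, since the paper's chosen value is not recoverable here.

```python
import numpy as np

def neighborhood_accuracy(d_hats, d_true, tol):
    """Share of estimates that fall strictly within tol of the true d."""
    d_hats = np.asarray(d_hats, dtype=float)
    return float(np.mean(np.abs(d_hats - d_true) < tol))
```

Because the simulated values of *d* are spaced 0.1 apart, any `tol` of at most 0.05 keeps the neighborhoods of different parameters from overlapping.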

#### 4. Result Analysis

##### 4.1. The Existence Judgment of Long Memory

As can be seen in Figure 2, the judging accuracy of the existence test for sequences with long memory increases gradually with the bandwidth parameter *a*, while the judging accuracy for sequences with no long memory decreases gradually. More specifically, for small values of *a*, regardless of the sequence length (among the lengths simulated in this paper), the judging accuracy is less than 0.3 if the sequence has long memory and can be as low as 0.1, whereas it can reach approximately 0.9 if the sequence has no long memory. Because it is impossible to know in advance whether a time series has long memory, the GPH algorithm with such a bandwidth parameter cannot reasonably distinguish long memory from no long memory. For large values of *a*, for sequences of different lengths, the judging accuracy of the GPH algorithm is about 0.9 if the sequence has long memory but less than 0.2 if it has no long memory. Hence, such a bandwidth parameter is not suitable for distinguishing long memory from no long memory either. In short, the GPH method should not be used to test the existence of long memory with bandwidth parameters in either of these two ranges. The judging accuracy curves for long memory and no long memory intersect at a single point. To the left of this point, the judging accuracy for long memory is low, and to the right of it, the judging accuracy for no long memory is low. Therefore, both judging accuracies reach a relatively high level at this point, which benefits the estimation of long memory by the GPH algorithm. In addition, the longer the time series, the higher the judging accuracy at this point.
In order to find the optimal range of the bandwidth parameter *a*, we plot the comprehensive judging accuracy curves in Figure 3. When the sequence length is 2000 or more, the judging accuracy exceeds 0.75 around the optimal bandwidth parameter.

##### 4.2. The Persistence or Antipersistence Judgment of Long Memory

As seen in Figure 4, for time series of different lengths, the judging accuracy curves of persistence or antipersistence of long memory present the same parabolic shape. Over an intermediate range of the bandwidth parameter *a*, the judging accuracy reaches a relatively high value. Figure 5 gives the comprehensive judging accuracy of persistence and antipersistence. Over this range, the judging accuracy increases gradually with the length of the time series, and when the length is above 2000, it exceeds 0.9.

##### 4.3. Estimation of Long Memory Parameters

In Figure 6, for time series of different lengths and long memory parameters *d*, the estimation accuracy curves exhibit a similar shape. Over an intermediate range of the bandwidth parameter *a*, the accuracy reaches a relatively high value. Figure 7 gives the average estimation accuracy of the GPH algorithm. Over this range, the average estimation accuracy under the chosen estimation accuracy increases gradually with the length of the time series, which indicates that the probability of the estimate falling into the neighborhood of the true value increases with the sequence length and that the GPH algorithm is effective.

To find an optimal parameter *a* that suits all three aspects of the long memory test, the ten bandwidth parameters with the highest judging accuracy of the GPH estimation under each sequence length are recorded as the optimal bandwidth parameter range, as shown in Table 1.

According to Table 1, we take the intersection, that is, the common part of different sets, to find the optimal bandwidth parameter ranges in several scenarios. Without considering the sequence length, [0.59, 0.62] can be selected as the optimal bandwidth range of the GPH algorithm for the existence test and the persistence or antipersistence judgment of long memory, while the optimal bandwidth parameters for the estimation accuracy of the long memory parameter *d* belong to [0.6, 0.67]. Taking the intersection of the optimal bandwidth parameters for the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter, the following conclusions can be drawn:

① If the GPH algorithm is needed only to test the existence of long memory, the desirable range of bandwidth parameters is [0.59, 0.62].

② If the existence test and the persistence or antipersistence judgment of long memory are carried out together, the desirable range is [0.59, 0.61].

③ If, in addition to the above two aspects, the estimation accuracy of the long memory parameter is also required, the desirable range is [0.6, 0.61].

Taking the sequence length into account, the range [0.58, 0.61] can be applied to the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter for sequences shorter than 1000. For operational convenience, *a* = 0.6 is recommended as the optimal bandwidth parameter for estimating long memory by the GPH algorithm.
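The intersection step used to combine the optimal ranges can be illustrated as follows. This is a minimal sketch with a hypothetical helper name; the two intervals in the usage note are the ranges quoted in the text for the joint existence and persistence judgment and for the estimation accuracy.

```python
def intersect(*intervals):
    """Intersect closed intervals given as (low, high); None if they are disjoint."""
    low = max(a for a, _ in intervals)
    high = min(b for _, b in intervals)
    return (low, high) if low <= high else None
```

For instance, `intersect((0.59, 0.61), (0.60, 0.67))` returns `(0.6, 0.61)`, which matches the range reported when all three aspects are required.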

Table 2 provides the calculation accuracy of the GPH algorithm for estimating long memory with the bandwidth parameter *a* = 0.6. The first measure is the calculation accuracy of the GPH algorithm for the existence test of long memory, which equals the comprehensive judging accuracy of the existence test. The second measure is the calculation accuracy for satisfying the existence test and the persistence or antipersistence judgment of long memory simultaneously.

The second measure is calculated as the ratio of the smaller of the two counts of correct decisions, one for the existence test and one for the persistence or antipersistence judgment, to the total number of simulations, in the same way as the first measure.

The third measure is the calculation accuracy for satisfying the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter simultaneously. Its calculation follows the same approach.

As seen from Table 2, all three measures increase with the sequence length. The first exceeds 0.7 when the sequence length is more than 700, and the second exceeds 0.7 when the sequence length is more than 1000. For the third, however, the value is only 0.4822 even when the sequence length is 5000, which is mainly caused by the poor accuracy of the GPH algorithm in estimating the long memory parameter. In Figure 7, when the sequence is short, for example 300, the judging accuracy is only 0.2 under the chosen estimation accuracy, implying that about 80% of the estimates fall outside the neighborhood of the true value *d*. This result is consistent with the conclusion in [26]. Therefore, the GPH algorithm has certain defects when estimating long memory parameters; the estimation result is effective only when the sequence length is over 10000.

#### 5. Conclusions

In this paper, we use the Monte Carlo simulation method to generate long memory sequences of different lengths with the ARFIMA (1, *d*, 1) process, so as to study the impact of the bandwidth parameter of the GPH algorithm on the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter. Within the bandwidth parameter interval 0.5 < *a* < 0.7, for time series of different lengths, the judging accuracy of the GPH algorithm for the existence test, the persistence or antipersistence judgment of long memory, and the estimation accuracy of the long memory parameter all reach a relatively high level, and *a* = 0.6 can be selected as the optimal bandwidth parameter in applications. As the length of the time series increases from 100 to 50000, the accuracy of the GPH algorithm for the existence test of long memory increases from 0.5612 to 0.8786, the calculation accuracy for the persistence or antipersistence judgment of long memory increases from 0.4697 to 0.8673, and the calculation accuracy for satisfying the existence test and the persistence or antipersistence judgment of long memory increases from 0.0623 to 0.6624. The rules used in analyzing long memory estimation by the GPH algorithm were developed step by step from the experimental results. This is a practical and novel approach, which can serve as a reference for other methods of testing long memory.

#### Data Availability

The data used in this study are available on request from the corresponding author.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work was funded by the National Natural Science Foundation of China (Grant nos. 71701024 and 82073339) and the National Social Science Fund (nos. 20&ZD128, 20CRK018, 20CTJ015, and 21BJY007).